genetic-orange · 3y ago

How to recrawl initial page for new links without purging to keep track of duplicates?

Thank you for releasing Crawlee as an open-source project. I have set up a very simple CheerioCrawler that crawls the home page of a news website for starters. I then scrape certain articles and save some information to an external database. I aim to run the crawler at a regular interval (every few hours) to check for new links/articles. I want to keep track of what I've crawled in previous runs so as not to re-visit those pages and waste resources (mine and the host's), so I set CRAWLEE_PURGE_ON_START to false to persist what's been crawled.

Current state:
- Once I run the crawler once, the home page is marked as "handled" and never visited again on subsequent runs to look for new links within it.

Desired state:
- On each new run, crawl the same home page, and enqueue only the new links found for handling/scraping.

Is there a way to make my starting home page (example.com) re-crawlable on each run without purging? I believe it's something I can add within the default handler, I'm just not sure what exactly it is.
// .env
CRAWLEE_PURGE_ON_START=false

// main.ts
import { CheerioCrawler } from "crawlee";
import { router } from "./router.js";

const startUrls = ["https://example.com"];

const crawler = new CheerioCrawler({
    requestHandler: router,
    maxRequestsPerCrawl: 5,
    maxConcurrency: 1,
    // maxRequestsPerMinute: 30,
});

const main = async () => {
    await crawler.addRequests(startUrls);
    await crawler.run();
};

main();

// router.ts
import { createCheerioRouter, EnqueueStrategy } from "crawlee";

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info("enqueueing new URLs");
    await enqueueLinks({
        strategy: EnqueueStrategy.SameHostname,
        globs: ["https://example.com/news/*"],
        label: "detail",
    });
});

router.addHandler("detail", async ({ request, $, log }) => {
    // CheerioCrawler has no `page` object; read the title from the parsed DOM instead.
    const title = $("title").text();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // Save to DB code here
});
Thank you.
10 Replies
ambitious-aqua · 3y ago
I'm trying to achieve the same thing. Couldn't find a built-in Crawlee solution for it either. I think the industry term is "real-time data extraction". For the duplication, though, I came up with a solution: I noticed that the unique request id of a queued URL never changes. I'm not sure if this could change with a future Crawlee library update; I hope not. So when I extract data from a page, I also record the request id of that page's URL. An example from my project:
const sceneDataset = await Dataset.open("scenes");

if (request.label === "scene-detail") {
    // gets the scenes data collection
    const sceneData = await sceneDataset.getData();
    // creates an array from the request IDs
    const sceneIds = sceneData.items.map((item) => item.requestId);

    if (sceneIds.includes(request.id)) {
        log.info(`Scene ${request.id} already exists`);
        return;
    } else {
        let data = {
            ...
        };
        await sceneDataset.pushData(data);
    }
    .
    .
    .
I'll focus on the real-time extraction logic today. Ideally it should automatically switch to the real-time mode after the initial "full" scrape. Though a built-in solution for these would be way nicer. Maybe there is one, idk, still exploring.
genetic-orange (OP) · 3y ago
Thanks for your reply. I don't actually have an issue with the duplication; I can rely on Crawlee's built-in deduplication, and it has worked well so far. My main issue is re-crawling the same main page for new links. Crawlee considers the main page "crawled" and "handled", so on subsequent runs it does not crawl it anymore and the run ends without having crawled any pages at all.
genetic-orange (OP) · 3y ago
Actually, I found this discussion on the GitHub repo that proved useful; maybe it'll be useful to you too. It's a bit old and I think the interface for updating the request queue has changed since then, so I ended up deleting the request for my entry/start URLs at the end of each run, using, as you said, the "unique" id of that particular request. https://github.com/apify/crawlee/discussions/1322 Let me know if you find a better way.
absent-sapphire · 3y ago
I’d recommend using a named RequestQueue. Request queues by default don't allow the same URL to be crawled twice, so just assign a unique UUID as the uniqueKey of each request to the initial page, and the crawler will only pick up new links on each run. Named storages also don't get purged.
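For illustration, a minimal sketch of that setup, assuming the router from the original post and an arbitrary queue name ("news-queue"):

import { CheerioCrawler, RequestQueue } from "crawlee";
import { v4 as uuidv4 } from "uuid";

// Named storages are never purged on start, so the queue keeps its
// deduplication state (already-handled URLs) across runs.
const requestQueue = await RequestQueue.open("news-queue");

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: router,
});

// Re-add the start page with a fresh uniqueKey so it is crawled again on this run;
// links enqueued from it are still deduplicated against previous runs by URL.
await crawler.addRequests([{ url: "https://example.com", uniqueKey: uuidv4() }]);
await crawler.run();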
genetic-orange (OP) · 3y ago
That's a great suggestion regarding the named RequestQueue, thank you. But can you clarify how I would assign a unique UUID to the initial page? I can't find any function in the documentation that gives me that ability.
MEE6 · 3y ago
@Warmanz just advanced to level 1! Thanks for your contributions! 🎉
absent-sapphire · 3y ago
Install the uuid package and use its v4() function.
genetic-orange (OP) · 3y ago
Hahaha, thank you, but I'm aware of how to generate a UUID, just not how to assign it to a request. Is it by creating a RequestOptions object with a url and a uniqueKey (the generated UUID) and passing that to .addRequests()? Is that the right way to go? This seems to work:
import { RequestOptions } from "crawlee";
import { v4 as uuidv4 } from "uuid";

// startUrls = string[] of entry urls
const startUrlsWithOptions: RequestOptions[] = startUrls.map((url: string) => {
    return {
        url,
        // A fresh uniqueKey on every run makes the start URL re-crawlable,
        // since Crawlee would otherwise dedupe it by URL.
        uniqueKey: uuidv4(),
    };
});

await crawler.addRequests(startUrlsWithOptions);
absent-sapphire · 3y ago
Yup! Apologies for not understanding your question well. To explain why your solution works: by default, Crawlee uses the request URL to compute the request's unique key. So if you enqueue two requests to https://google.com, only one will be added to the queue and handled. When you provide your own uniqueKey, it goes off of that instead. You can also use useExtendedUniqueKey, which is great when making POST requests to the same URL but with different payloads and headers (e.g. when scraping GraphQL).
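As a rough sketch of that last point (the GraphQL endpoint and queries below are made up), two POST requests to the same URL are both enqueued because the unique key is extended with the method and payload:

await crawler.addRequests([
    {
        url: "https://example.com/graphql",
        method: "POST",
        payload: JSON.stringify({ query: "{ articles { title } }" }),
        // The unique key is derived from the method, URL, and payload, not just the URL.
        useExtendedUniqueKey: true,
    },
    {
        url: "https://example.com/graphql",
        method: "POST",
        payload: JSON.stringify({ query: "{ authors { name } }" }),
        useExtendedUniqueKey: true,
    },
]);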
Lukas Krivka · 3y ago
This is something we have been discussing for years. Apify is now working on a new request queue implementation that will make it easier to manipulate requests. For now, my solution would be to simply store the crawled URLs in a named dataset or key-value store, load that on every start, and filter those URLs out when enqueueing.
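A minimal sketch of that idea, assuming a named key-value store called "crawl-state" and a record key "crawled-urls" (both names are made up):

import { KeyValueStore } from "crawlee";

// Load the URLs scraped in previous runs; named storages are never purged.
const store = await KeyValueStore.open("crawl-state");
const crawled = new Set<string>((await store.getValue<string[]>("crawled-urls")) ?? []);

router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        globs: ["https://example.com/news/*"],
        label: "detail",
        // Skip any link that was already handled in a previous run.
        transformRequestFunction: (req) => (crawled.has(req.url) ? false : req),
    });
});

router.addHandler("detail", async ({ request }) => {
    // ...scrape the article and save it to the external database...
    crawled.add(request.url);
});

await crawler.run();
// Persist the updated set for the next run.
await store.setValue("crawled-urls", [...crawled]);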
