national-gold, 2y ago

Crawlee not scraping when provided with the same link twice

Hello, I'm using Crawlee to crawl web pages. I need to crawl the same URL multiple times. I tried setting a uniqueKey, like await crawler.run([{url: config.url, uniqueKey: uuid}]); and
await enqueueLinks({
  globs: typeof config.match === "string" ? [config.match] : config.match,
  transformRequestFunction: (request) => {
    // Append the run's uuid so the key differs from the plain URL
    request.uniqueKey = `${request.url}:${uuid}`;
    return request;
  },
});
but it doesn't work. Then I tried using a named request queue:
const requestQueue = await RequestQueue.open(uuid);
await enqueueLinks({
  globs: typeof config.match === "string" ? [config.match] : config.match,
  transformRequestFunction: (request) => {
    // Append the run's uuid so the key differs from the plain URL
    request.uniqueKey = `${request.url}:${uuid}`;
    return request;
  },
  requestQueue,
});
With maxPageCrawl: 5 in the config, I got this output:
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 5 - URL: https://www.prisma.io/docs...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":32848,"requestsFinishedPerMinute":2,"requestsFailedPerMinute":0,"requestTotalDurationMillis":32848,"requestsTotal":1,"crawlerRuntimeMillis":33028}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
Found 1 files to combine...
data/6c9d4f5a-5629-49fa-b299-ca27de45391f-1.json
Wrote 1 items to data/6c9d4f5a-5629-49fa-b299-ca27de45391f-1.json
What's wrong with my code, and how can I make this scenario work?
1 Reply
Lukas Krivka, 2y ago
uniqueKey will work, so the issue must be somewhere else.
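
To illustrate why uniqueKey should be enough, here is a minimal sketch of the deduplication rule: a queue accepts at most one request per uniqueKey, and the key falls back to the URL when none is given. TinyQueue, QueuedRequest, the example.com URL and the run-2 suffix are all hypothetical names made up for this sketch. It is a simplified model, not Crawlee's actual implementation (Crawlee also normalizes the URL, which is left out here).

```typescript
// Simplified, hypothetical model of deduplication by uniqueKey.
// NOT Crawlee's implementation; it only mirrors the rule that a queue
// accepts at most one request per uniqueKey.
interface QueuedRequest {
  url: string;
  uniqueKey?: string; // assumption: falls back to the raw URL when omitted
}

class TinyQueue {
  private seen = new Set<string>();
  readonly accepted: QueuedRequest[] = [];

  // Returns true if the request was added, false if its key was already seen.
  add(request: QueuedRequest): boolean {
    const key = request.uniqueKey ?? request.url;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.accepted.push(request);
    return true;
  }
}

const url = "https://example.com/docs"; // placeholder URL
const queue = new TinyQueue();

const first = queue.add({ url }); // added: key is the URL
const repeat = queue.add({ url }); // skipped: same key as before
const rerun = queue.add({ url, uniqueKey: `${url}:run-2` }); // added: new key
const sameRun = queue.add({ url, uniqueKey: `${url}:run-2` }); // skipped: key reused

console.log({ first, repeat, rerun, sameRun });
```

One thing that might be worth checking under this model: if the same uuid value is reused against a queue that still remembers earlier requests, the generated keys would collide again, so the suffix only helps when it actually differs per run.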
