frail-apricot • 2y ago

purging request queue

Hello everyone, I'm trying to integrate Crawlee into an Express server so that I can start crawling a site when I hit a specific route. Everything works fine for the first request, but on the second request the URLs are no longer crawled. From what I understand, this is because the already-crawled URLs are stored with their IDs. How do I empty the URL table? I tried adding CRAWLEE_PURGE_ON_START = 'true' without much success. The first crawl:
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":2,"requestsFailed":0,"retryHistogram":[2],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1498,"requestsFinishedPerMinute":36,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2996,"requestsTotal":2,"crawlerRuntimeMillis":3334}
INFO PlaywrightCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}
The second crawl (on the same URL):
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":239}
INFO PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
Thanks in advance to anyone who can help me. 🙏
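
For context, a minimal sketch of the kind of setup described above. The /crawl route, port, and target URL are illustrative assumptions, not taken from the original post:

```ts
import express from 'express';
import { PlaywrightCrawler } from 'crawlee';

const app = express();

// Illustrative route: each HTTP request should kick off a fresh crawl.
// On the second request, the default request queue already knows the
// URLs, so the crawler starts and immediately shuts down having
// processed 0 requests, as in the logs above.
app.get('/crawl', async (_req, res) => {
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, log }) {
            log.info(`Crawled ${request.url}`);
        },
    });
    const stats = await crawler.run(['https://example.com']);
    res.json(stats);
});

app.listen(3000);
```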
2 Replies
frail-apricot (OP) • 2y ago
As always, you find the solution a few minutes after asking the question 😑. For those looking for the answer: https://github.com/apify/crawlee/discussions/2026
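
For readers who don't want to follow the link, a sketch of one common workaround; this is not necessarily the exact fix from the linked discussion. The request queue deduplicates requests by their uniqueKey, which defaults to the normalized URL, so giving each run a distinct uniqueKey lets the same URL be enqueued and processed again:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Crawling ${request.url}`);
    },
});

// Deduplication is keyed on uniqueKey (by default the normalized URL),
// so a per-run uniqueKey stops the queue from skipping a URL it has
// already handled in an earlier run.
await crawler.run([
    { url: 'https://example.com', uniqueKey: `run-${Date.now()}:https://example.com` },
]);
```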
xenial-black • 2y ago
Also, with named queues you have to purge them manually; Crawlee won't purge them for you. The way to do it is to call queue.drop() and then instantiate the queue again, as in the sketch below.
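
A minimal sketch of that pattern ('my-queue' is an illustrative name):

```ts
import { RequestQueue } from 'crawlee';

// Drop the named queue (this deletes it together with all of its
// stored requests), then open it again to get a fresh, empty instance.
const staleQueue = await RequestQueue.open('my-queue');
await staleQueue.drop();
const freshQueue = await RequestQueue.open('my-queue');

// Hand the fresh queue to the crawler explicitly:
// new PlaywrightCrawler({ requestQueue: freshQueue, /* ... */ })
```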
