frail-apricot • 2y ago

purging request queue

Hello everyone, I'm trying to integrate Crawlee into an Express server so that I can start crawling a site when I hit a specific route. Everything works fine for the first request, but on the second request the URLs are no longer crawled. From what I understand, this is because the already-crawled URLs are stored with their IDs. How do I empty the URL table? I tried adding CRAWLEE_PURGE_ON_START = 'true' without much success. The first crawl:
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":2,"requestsFailed":0,"retryHistogram":[2],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1498,"requestsFinishedPerMinute":36,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2996,"requestsTotal":2,"crawlerRuntimeMillis":3334}
INFO PlaywrightCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}
The second crawl (on the same URL):
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":239}
INFO PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
Thanks in advance to anyone who can help me. 🙏
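
For context, a minimal sketch of the kind of setup described above. The /crawl route, port, and target URL are illustrative assumptions, not taken from the original post:

```ts
import express from 'express';
import { PlaywrightCrawler } from 'crawlee';

const app = express();

// Illustrative route: each HTTP request should kick off a fresh crawl.
// On the second request, the default request queue already knows the
// URLs, so the crawler starts and immediately shuts down having
// processed 0 requests, as in the logs above.
app.get('/crawl', async (_req, res) => {
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, log }) {
            log.info(`Crawled ${request.url}`);
        },
    });
    const stats = await crawler.run(['https://example.com']);
    res.json(stats);
});

app.listen(3000);
```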
2 Replies
frail-apricot (OP) • 2y ago
As always, you find the solution a few minutes after asking the question 😑. For those looking for the answer: https://github.com/apify/crawlee/discussions/2026
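
For readers who don't want to follow the link, a sketch of one common workaround; this is not necessarily the exact fix from the linked discussion. The request queue deduplicates requests by their uniqueKey, which defaults to the normalized URL, so giving each run a distinct uniqueKey lets the same URL be enqueued and processed again:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Crawling ${request.url}`);
    },
});

// Deduplication is keyed on uniqueKey (by default the normalized URL),
// so a per-run uniqueKey stops the queue from skipping a URL it has
// already handled in an earlier run.
await crawler.run([
    { url: 'https://example.com', uniqueKey: `run-${Date.now()}:https://example.com` },
]);
```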
xenial-black • 2y ago
Also, with named queues you have to purge them manually; Crawlee won't purge them for you. The way to do it is to call queue.drop() and then instantiate the queue again, as in the sketch below.
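
A minimal sketch of that pattern ('my-queue' is an illustrative name):

```ts
import { RequestQueue } from 'crawlee';

// Drop the named queue (this deletes it together with all of its
// stored requests), then open it again to get a fresh, empty instance.
const staleQueue = await RequestQueue.open('my-queue');
await staleQueue.drop();
const freshQueue = await RequestQueue.open('my-queue');

// Hand the fresh queue to the crawler explicitly:
// new PlaywrightCrawler({ requestQueue: freshQueue, /* ... */ })
```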
