Apify Discord Mirror

Updated 2 months ago

Clear URL queue at end of run?

At a glance

The community member, a data reporter at CBS News, is using crawlee to archive web pages. They are experiencing an issue where each new crawl continues to process pages enqueued by the previous crawl, even though that crawl has finished. The community member has looked at the documentation for the persist_storage and purge_on_start parameters, but is unsure how to fix the issue.

In the comments, another community member suggests that the issue may be related to the max_requests_per_crawl setting, where crawlee stops making requests but doesn't clear the request queue. Another community member suggests that the issue may be related to using a Jupyter Notebook, where the queue and cache stored in memory are not cleared without session termination.

The community member confirms that they are running the crawler inside a Django app, not a Jupyter Notebook. They then share that they were able to get the crawler to crawl the correct pages by limiting the crawl depth instead of setting a limit on the number of requests. However, they are now seeing an issue where the crawler is refusing to crawl a page that it has previously crawled.

Another community member suggests that the solution is to provide a unique unique_key for each Request, and the original poster confirms that this resolved the issue.

I'm a data reporter at CBS News using crawlee to archive web pages. Currently, when I finish a crawl, the next crawl continues to crawl pages enqueued in the previous crawl.

Is there an easy fix to this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do.

Happy to provide a code sample if helpful
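One way to guarantee a clean start between runs, independent of the configuration flags above, is to drop the default request queue once a crawl finishes. A minimal sketch, assuming crawlee's default storage setup (adapt it to however your own app opens its storages):

from crawlee.storages import RequestQueue


async def clear_default_queue() -> None:
    # Open the default request queue and delete it entirely, so the next
    # crawl starts with an empty queue instead of leftover URLs.
    rq = await RequestQueue.open()
    await rq.drop()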
9 comments
If anyone comes across this post, I think I understand what's happening now - if crawlee hits the max number of requests defined in max_requests_per_crawl, it stops making requests but doesn't clear the request queue, so if you're running enqueue_links you'll end up with more pages in the queue than actually get crawled.
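For reference, that limit is set on the crawler constructor; a minimal sketch (the value 50 is just an example):

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main() -> None:
    # max_requests_per_crawl caps how many requests this run processes;
    # anything already enqueued beyond the cap stays in the request queue
    # and is picked up by the next run unless the storage is purged or dropped.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)
    ...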
Hi, are you by any chance using a Jupyter Notebook when working with crawlee?

Since the behavior you describe corresponds to purge_on_start=False: it reaches the max_requests_per_crawl limit and stops, but on the next start it continues where it left off, since the queue is not cleared.

But if you are working with a Jupyter Notebook, the queue and cache stored in memory are not cleared without session termination.
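A hedged sketch of what those two settings look like in code; how the Configuration object is wired in (crawler argument, global configuration, or CRAWLEE_* environment variables) varies between crawlee versions, so treat this as illustrative:

from crawlee.configuration import Configuration

# purge_on_start=True clears the default storages (including the request
# queue) when a crawler starts; persist_storage controls whether storages
# are written to disk at all.
config = Configuration(purge_on_start=True, persist_storage=False)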
Nope, this is running inside a Django app I'm building, not a notebook
I managed to get it to crawl the correct pages by limiting the crawl depth rather than setting a limit on the number of requests
However, I'm now seeing an issue where the crawler is refusing to crawl a page that it's previously crawled, and it's not clear why
If you want the crawler to crawl the same page again, you must pass a unique unique_key

example:

import asyncio

# Imports as in the crawlee for Python docs; exact module paths may differ
# slightly between crawlee versions.
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Same URL three times: the distinct unique_key values stop the request
    # queue from deduplicating them, so each one is crawled.
    request_1 = Request.from_url('https://httpbin.org/get', unique_key='1')
    request_2 = Request.from_url('https://httpbin.org/get', unique_key='2')
    request_3 = Request.from_url('https://httpbin.org/get', unique_key='3')

    await crawler.run([request_1, request_2, request_3])


if __name__ == '__main__':
    asyncio.run(main())
Boom that worked for me, thanks so much for the help
For posterity, if anyone comes across this thread: I had to provide a unique_key (I used a UUID because I want the pages to be crawled every time) both on the Request object AND in the user_data argument to enqueue_links
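For the explicit Request objects, that looks roughly like the sketch below; the URL is a placeholder, and the enqueue_links side is omitted because its exact signature depends on the crawlee version:

import uuid

from crawlee import Request

# A fresh unique_key on every run means the queue's deduplication never sees
# the URL as already handled, so the page is re-crawled each time.
request = Request.from_url(
    'https://example.com/archived-page',  # placeholder URL
    unique_key=str(uuid.uuid4()),
)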