recent-teal
Apify & Crawlee · 11mo ago
9 replies

Error on cleanup PlaywrightCrawler

I use PlaywrightCrawler with headless=True.
The package I use is crawlee[playwright]==0.6.1.

When running the crawler, I noticed that while it is waiting for remaining tasks to finish, it sometimes raises an error like the one in the screenshot. Is this something that can be resolved easily?

I think this error is also related to another issue I have.
In my code I have my own batching system in place, but I noticed that memory slowly increases on each batch.
After some investigation I saw that ps -fC headless_shell listed a lot of headless_shell processes marked <defunct> (zombie processes), so I assume this is related to the cleanup that fails on each crawl.
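For reference, the zombie check above can be scripted. This is just a small diagnostic sketch of my own; count_defunct is my helper name, not part of Crawlee, and it relies on the Linux procps ps command:

```python
import subprocess

def count_defunct(name: str = "headless_shell") -> int:
    """Count zombie (<defunct>) processes for the given command name via ps."""
    result = subprocess.run(
        ["ps", "-fC", name], capture_output=True, text=True
    )
    return sum(1 for line in result.stdout.splitlines() if "<defunct>" in line)
```

Running this between batches shows the count climbing instead of returning to zero.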

Below you can see my code for the batching system:
    # Open the key-value stores holding scheduled and processed batches
    scheduled_batches = await prepare_requests_from_mongo(crawler_name)
    processed_batches = await KeyValueStore.open(
        name=f'{crawler_name}-processed_batches'
    )

    # Create the crawler
    crawler = await create_playwright_crawler(crawler_name)

    # Iterate over the batches
    async for key_info in scheduled_batches.iterate_keys():
        urls: List[str] = await scheduled_batches.get_value(key_info.key)
        requests = [
            Request.from_url(
                url,
                user_data={
                    'page_tags': [PageTag.HOME.value],
                    'chosen_page_tag': PageTag.HOME.value,
                    'label': PageTag.HOME.value,
                },
            )
            for url in urls
        ]
        LOGGER.info(f'Processing batch {key_info.key}')
        await crawler.run(requests)
        # Setting a key's value to None deletes it, unscheduling the batch
        await scheduled_batches.set_value(key_info.key, None)
        await processed_batches.set_value(key_info.key, urls)