10 replies

Why is the request queue's metadata file not cleaned when purge_on_start=True?

Hello,

I find it odd that when Configuration(purge_on_start=True) is set, the file

storage/request_queues/default/__metadata__.json

is not purged along with the request files in the method

crawlee.storage_clients._file_system._request_queue_client.FileSystemRequestQueueClient.purge

What is the reason behind this? Which use case does it cover?

If you run the repro code (it crawls 2 URLs) several times in a row and check the log and the metadata JSON, you see the following:

# first run
[BeautifulSoupCrawler] INFO  Crawled 0/2 pages, 0 failed requests, desired concurrency 1.

# second run
[BeautifulSoupCrawler] INFO  Crawled 0/4 pages, 0 failed requests, desired concurrency 1.

The 0/4 in the second run is misleading, because only two requests were scheduled and the requests from the previous run had been purged.
After the second run, __metadata__.json contains handled_request_count: 4 and total_request_count: 4, and that total is what gets printed to the log.
My expectation would be 2 for both values.

{
  "id": "7xvLZTJolTixoRk1x",
  "name": null,
  "accessed_at": "2026-01-02 15:10:06.721976+00:00",
  "created_at": "2026-01-02 15:09:48.267199+00:00",
  "modified_at": "2026-01-02 15:10:06.719162+00:00",
  "had_multiple_clients": false,
  "handled_request_count": 4,
  "pending_request_count": 0,
  "total_request_count": 4
}
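
For anyone who wants to check the counters between runs without opening the file by hand, here is a minimal sketch. It uses only the standard library and assumes the default storage path quoted above; the helper name read_queue_counters is hypothetical and not part of Crawlee.

```python
import json
from pathlib import Path

# Path taken from the post above; adjust it if your storage directory differs.
METADATA_PATH = Path("storage/request_queues/default/__metadata__.json")

# Counter fields that appear in the metadata JSON shown above.
COUNTER_KEYS = (
    "handled_request_count",
    "pending_request_count",
    "total_request_count",
)


def read_queue_counters(path: Path = METADATA_PATH) -> dict[str, int]:
    """Return the request counters stored in the queue metadata file.

    Hypothetical helper, not a Crawlee API. Returns an empty dict when the
    metadata file does not exist yet.
    """
    if not path.exists():
        return {}
    metadata = json.loads(path.read_text(encoding="utf-8"))
    return {key: metadata[key] for key in COUNTER_KEYS if key in metadata}


if __name__ == "__main__":
    # Run this after each crawl to see whether the counters carried over.
    print(read_queue_counters())
```

Running it after the first and the second crawl makes the carry-over visible: if the counters were reset by the purge, both runs would report the same values.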

I've attached the repro code as a file because of the post length limit.

Thank you,
CL

crawlee_purge_repro.py (885 B)

Apify & Crawlee • 3mo ago

cdog
