conventional-black
Apify & Crawlee • 3y ago • 12 replies

Workflow for manually reprocessing requests when using @apify/storage-local for SQLite Request Queue

Use case: I'm debugging a crawler. The majority of request handlers succeed and only a few fail. I want to fix or adjust the request handler logic, get the failed URLs from the logs, open an SQLite editor, find those requests in the request queue table, and somehow mark them as unprocessed. Then I rerun the crawler with CRAWLEE_PURGE_ON_START=false so it only runs the previously problematic URLs. I iterate a few times to catch all the bugs, and then run the whole crawler with purged storage.

After a lot of debugging and investigating Crawlee & @apify/storage-local, I've managed to figure out a working workflow, but it's kinda laborious:
* set the row's orderNo to some future date in ms since the epoch
* edit the row's json and remove the handledAt property [2]
* run the crawler, which will re-add the handledAt property
* delete the row's orderNo (not sure why that is not done automatically)
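For illustration, the two manual row edits above could be scripted roughly like this. It's only a sketch: it assumes a row exposes just the `orderNo` and `json` columns mentioned above, the `RequestRow` type and both function names are hypothetical, and the one-minute default offset is an arbitrary stand-in for "some future date". It does not touch SQLite itself; reading and writing the rows is left to whatever client or editor you use.

```typescript
// Hypothetical shape of a request queue row, limited to the two columns
// named in the steps above. Real rows have more columns than this.
interface RequestRow {
    orderNo: number | null;
    json: string;
}

// Steps 1 and 2: set orderNo to a future timestamp (ms since the epoch)
// and remove handledAt from the serialized request.
// The default offset of one minute is an arbitrary assumption.
function markUnprocessed(row: RequestRow, futureMs: number = Date.now() + 60_000): RequestRow {
    const request = JSON.parse(row.json) as Record<string, unknown>;
    delete request['handledAt'];
    return { orderNo: futureMs, json: JSON.stringify(request) };
}

// Step 4: once the rerun has re-added handledAt, clear orderNo again.
// Throws if handledAt is still missing, i.e. the request was not reprocessed.
function clearOrderNo(row: RequestRow): RequestRow {
    const request = JSON.parse(row.json) as Record<string, unknown>;
    if (!('handledAt' in request)) {
        throw new Error('handledAt is missing; rerun the crawler before clearing orderNo');
    }
    return { ...row, orderNo: null };
}
```

Step 3 (running the crawler) happens between the two calls. The updated `orderNo` and `json` values would then be written back to the matching row by hand or with a small script.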

That's kinda tedious. Do you know of a better way? Or is there some out-of-the-box approach for my use case without hacking SQLite? I found this approach recommended by the one-and-only @Lukas Krivka here 🙂 https://github.com/apify/crawlee/discussions/1232#discussioncomment-1625019

[1] https://github.com/apify/apify-storage-local-js/blob/8dd40e88932097d2260f68f28412cc29ff894e0f/src/emulators/request_queue_emulator.ts#L341
[2] https://github.com/apify/crawlee/blob/52b98e3e997680e352da5763b394750b19110953/packages/core/src/storages/request_queue.ts#L164