Eric · 3w ago

AdaptivePlaywright crawler with SQL queue gets stuck

I am using multiple instances of an AdaptivePlaywright crawler on different servers, using the SQL queue. After running for some time, the crawler appears to stop working and the only logs I see regularly are these:

[crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 9; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.023; client_info = 0.0
[crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds
[crawlee.crawlers._basic._basic_crawler] INFO Crawled 0/1157424 pages, 0 failed requests, desired concurrency 10.
Restarting the pod where the crawler runs doesn't fix the issue
11 Replies
Eric (OP) · 3w ago
This seems to happen only when running more than one instance of the crawler; with a single instance everything looks fine. @Mantisus, have you seen this before?

Tasks are being scheduled, as we see the debug log "Scheduling a new task", but nothing happens afterwards (not even the logs we have in the pre_navigation_hook show up) and the task eventually times out with "Task timed out after not set seconds". We also see that current concurrency is 0. Could you help us? (We don't see any logs of the request queue queries failing whatsoever; we just assume it has to be different instances stepping on each other with locks or something similar.)

Sorry I can't provide more details or a reproduction, but it happens somewhat randomly.
Mantisus · 3w ago
Hmm, that's strange. I don't have any ideas yet as to what might be causing this.

Is there any desynchronization of statuses in the database? For example, is pending + handling greater than total in the 'request_queues' table? Are there any requests that have a value in time_blocked_until?

time_blocked_until acts as a flag that the request is already being executed by one of the clients; if that client crashes, the request stays blocked until time_blocked_until expires (5 minutes after the request was picked up). After this time has elapsed, the request is returned to the queue.

I am concerned about "Task timed out after not set seconds." Perhaps the problem is not related to SQL but to something else. I will try to reproduce this and analyze it.
Restarting the pod where the crawler runs doesn't fix the issue
When you restart, how many unprocessed requests do you see in the database?
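For reference, the checks described above can be run directly in SQL. This is only a rough sketch: the metadata column names pending_request_count, handled_request_count, and total_request_count are assumptions, while the request_queues and request_queue_records tables and the is_handled and time_blocked_until columns are the ones mentioned in this thread.

  -- Look for desynchronized counters in the queue metadata
  -- (pending + handled should not exceed total; column names assumed).
  SELECT id,
         pending_request_count,
         handled_request_count,
         total_request_count
  FROM request_queues
  WHERE pending_request_count + handled_request_count > total_request_count;

  -- Count requests currently blocked by a client
  -- (time_blocked_until still in the future).
  SELECT count(*) AS blocked_requests
  FROM request_queue_records
  WHERE time_blocked_until IS NOT NULL
    AND time_blocked_until > now();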
Eric (OP) · 3w ago
I had approximately 1 million unprocessed requests. I don't think it is an issue with the data in the DB, because when I reduce the crawler replicas from 10 to 1 it starts working perfectly. Is it possible that it is something related to the locking system used to update the request queue (not time_blocked_until, but the locks themselves)? Just guessing, because I was only able to reproduce it with multiple instances of the crawler.

Are there any requests that have a value in time_blocked_until?
Yes, there were.

For example, is pending + handling greater than total in the 'request_queues' table?
I am not sure what you mean here.

Hi @Mantisus, is there a way I can provide you with more info about this problem?
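As a general Postgres diagnostic (not something discussed in this thread), one way to see whether instances really are blocking each other on row locks is to look at pg_stat_activity for sessions waiting on a lock; a sketch:

  -- Sessions currently waiting on locks or running long queries.
  SELECT pid,
         state,
         wait_event_type,
         wait_event,
         now() - query_start AS running_for,
         left(query, 80) AS query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Lock' OR state = 'active'
  ORDER BY query_start;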
Mantisus · 3w ago
Hi @Eric. I'm not sure. I need to be able to analyze both the status of certain crawlee elements and the status of the database. Therefore, I will try to reproduce this in my own environment.
Eric (OP) · 3w ago
Okay, thanks! In case it helps, I found the pending + handling counts in the request_queues table and they are consistent with the numbers inside the request_queue_records table. However, I saw that "had_multiple_clients" is set to False, while it should be True as I have multiple instances. Is this a bug?
Mantisus · 3w ago
However, I saw "had_multiple_clients" is set to False, while it should be True as I have multiple instances. Is this a bug?
Yes, the bug is that this flag is not updated to True. However, it does not affect the operation of the queue; it is for informational purposes only.
Eric (OP) · 3w ago
I continued debugging and I am seeing that this query (the one issued in fetch_new_requests) is timing out:

SELECT * FROM request_queue_records WHERE is_handled = false AND (time_blocked_until IS NULL OR time_blocked_until < now()) ORDER BY sequence_number ASC LIMIT 10 FOR UPDATE SKIP LOCKED;

but the same query without the locking clause works perfectly:

SELECT * FROM request_queue_records WHERE is_handled = false AND (time_blocked_until IS NULL OR time_blocked_until < now()) ORDER BY sequence_number ASC LIMIT 10;

so it has to be something related to the locking mechanism we are using in Postgres. I forked the project and removed the FOR UPDATE lock in the fetch_new_requests method, and now everything works. I will continue investigating, but a Postgres EXPLAIN shows that when locking FOR UPDATE it has to do a full table scan without parallelism.

I found the underlying issue, and it is not related to the locks. The query is super slow because it does a full table scan. If you run

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM request_queue_records WHERE request_queue_id = [your id] AND is_handled = false AND (time_blocked_until IS NULL OR time_blocked_until < now()) ORDER BY sequence_number ASC LIMIT 10 FOR UPDATE SKIP LOCKED;

you'll see it is not using the idx_fetch_available index but doing a full table scan. I will open an issue now with the details, but I fixed my case by running

CREATE INDEX sorting_by_sequence_number ON public.request_queue_records USING btree (sequence_number);
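As an aside beyond what was posted above: on a live queue of this size, the same index can be built without blocking the running crawlers by adding CONCURRENTLY, and the effect can be verified by re-running the EXPLAIN. A sketch only; 'your-queue-id' is a placeholder, and the expected plan change is based on the fix reported above.

  -- Build the index without taking a write lock on the table
  -- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block).
  CREATE INDEX CONCURRENTLY IF NOT EXISTS sorting_by_sequence_number
      ON public.request_queue_records USING btree (sequence_number);

  -- Re-check the plan: it should now report an index scan ordered by
  -- sequence_number instead of a sequential scan over the whole table.
  EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
  SELECT *
  FROM request_queue_records
  WHERE request_queue_id = 'your-queue-id'  -- placeholder
    AND is_handled = false
    AND (time_blocked_until IS NULL OR time_blocked_until < now())
  ORDER BY sequence_number ASC
  LIMIT 10
  FOR UPDATE SKIP LOCKED;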
Mantisus · 3w ago
Thank you so much for your analysis and for delving deeper into this!
Eric (OP) · 3w ago
thank you for developing crawlee 🙂
Eric (OP) · 2w ago
GitHub
Fetching new requests in Postgres does full table scan · Issue #15...
I am using version 1.0.4 with SQL storage in Postgres. My queue has around 2 million rows (only approx. 10% have been handled yet) and I am using 10 crawler instances. At some point, all my instanc...
