AdaptivePlaywright crawler with SQL queue gets stuck
I am running multiple instances of an AdaptivePlaywright crawler on different servers, all sharing the SQL queue. After running for some time, the crawler appears to stop working and I only see these logs repeating:
│ [crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds │
│ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 9; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.023; client_info = 0.0 │
│ [crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds │
│ [crawlee.crawlers._basic._basic_crawler] INFO Crawled 0/1157424 pages, 0 failed requests, desired concurrency 10.
Restarting the pod where the crawler runs doesn't fix the issue.
This seems to happen only when running more than one instance of the crawler; with a single instance everything looks fine.
@Mantisus have you seen this before? Tasks are being scheduled (we see the debug log "Scheduling a new task"), but nothing happens afterwards (not even the logs in our pre_navigation_hook show up), and the task eventually times out with "Task timed out after not set seconds".
We also see current concurrency is 0. Could you help us?
(We don't see any logs of the request queue queries failing; we just assume the instances are stepping on each other with locks or something similar.) Sorry I can't provide more details or a way to reproduce it, it happens somewhat randomly.
Hmm, that's strange. I don't have any ideas yet as to what might be causing this.
Is there any desynchronization of statuses in the database? For example, is pending + handling request greater than total in the ‘request_queues’ table?
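A quick way to cross-check those counts is to recompute them directly from the records table (a sketch; the metadata column names in request_queues can vary between crawlee versions, so comparing against recomputed totals is safer):
-- Recompute handled/pending/total per queue from the records table
-- and compare with the counters stored in request_queues.
SELECT request_queue_id,
       count(*) FILTER (WHERE is_handled)     AS handled,
       count(*) FILTER (WHERE NOT is_handled) AS pending,
       count(*)                               AS total
FROM request_queue_records
GROUP BY request_queue_id;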
Are there any requests that have a value in time_blocked_until?
time_blocked_until acts as a flag that the request is already being processed by one of the clients. If that client crashes, the request stays blocked until time_blocked_until expires (5 minutes after the request was taken into work); after that, it is returned to the queue.
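If you want to see how many requests are currently leased versus stuck with an expired lease, something like this should work (a sketch using only the column names mentioned in this thread):
-- Requests currently held by a client vs. requests whose lease
-- (time_blocked_until) has already expired and should be re-fetched.
SELECT count(*) FILTER (WHERE time_blocked_until > now())  AS currently_leased,
       count(*) FILTER (WHERE time_blocked_until <= now()) AS expired_leases
FROM request_queue_records
WHERE is_handled = false
  AND time_blocked_until IS NOT NULL;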
I am concerned about "Task timed out after not set seconds." Perhaps the problem is not related to SQL, but to something else.
I will try to reproduce this and analyze it.
Restarting the pod where the crawler runs doesn't fix the issue
When you restart, how many unprocessed requests do you see in the database?
I had approximately 1 million unprocessed requests.
I don't think it is an issue with the data in the DB, because when I reduce the crawler replicas from 10 to 1 it starts working perfectly. Is it possible that it is something related to the locking system used to update the request queue (not time_blocked_until, but the locks themselves)? Just guessing, because I was only able to reproduce it with multiple instances of the crawler.
Are there any requests that have a value in time_blocked_until?
Yes there were
For example, is pending + handling request greater than total in the ‘request_queues’ table?
I am not sure what you mean here
Hi @Mantisus is there a way I can provide you with more info about this problem?
Hi @Eric. I'm not sure. I need to be able to analyze both the status of certain crawlee elements and the status of the database.
Therefore, I will try to reproduce this in my own environment.
okay, thanks!
In case it helps, I found the pending + handling counts in the request_queues table and they are consistent with the numbers inside the request_queue_records table. However, I saw "had_multiple_clients" is set to False, while it should be True as I have multiple instances. Is this a bug?
However, I saw "had_multiple_clients" is set to False, while it should be True as I have multiple instances. Is this a bug?
The bug is that this flag is not updated to True. However, this does not affect the operation of the queue; it is for informational purposes only.
I continued debugging and I am seeing that this query
SELECT *
FROM request_queue_records
WHERE is_handled = false
AND (time_blocked_until IS NULL OR time_blocked_until < now())
ORDER BY sequence_number ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
is timing out (this is the query run in fetch_new_requests), but the query without checking the locks works perfectly:
SELECT *
FROM request_queue_records
WHERE is_handled = false
AND (time_blocked_until IS NULL OR time_blocked_until < now())
ORDER BY sequence_number ASC
LIMIT 10;
So it has to be something related to the locking mechanism used in Postgres.
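One way to check whether the fetch query is actually waiting on row locks (rather than just being slow) is to look at pg_stat_activity; this is plain Postgres, nothing crawlee-specific, so treat it as a generic sketch:
-- Show sessions touching the queue table and what they are waiting on.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE query ILIKE '%request_queue_records%'
ORDER BY query_start;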
I forked the project and removed the FOR UPDATE lock in the fetch_new_requests method, and now everything works. I will continue investigating, but a Postgres EXPLAIN shows that when locking FOR UPDATE it has to do a full table scan without parallelism.
I found the underlying issue; it is not related to the locks. The query is extremely slow because it does a full table scan. If you run
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT *
FROM request_queue_records
where request_queue_id = [your id]
and is_handled = false
AND (time_blocked_until IS NULL OR time_blocked_until < now())
ORDER BY sequence_number ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
you'll see it is not using the idx_fetch_available index, but doing a full table scan instead. I will open an issue with the details, but I fixed my problem by running
CREATE INDEX sorting_by_sequence_number ON public.request_queue_records USING btree (sequence_number)
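A possible alternative (just a sketch, untested here, with a made-up index name) is a partial index that matches the fetch predicate more closely, so Postgres can filter by queue and handled state and still return rows in sequence order:
-- Partial composite index matching the WHERE/ORDER BY of the fetch query.
CREATE INDEX idx_pending_by_queue_and_sequence
  ON public.request_queue_records (request_queue_id, sequence_number)
  WHERE is_handled = false;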
Thank you so much for your analysis and for delving deeper into this!
Thank you for developing crawlee 🙂
I opened this issue: https://github.com/apify/crawlee-python/issues/1526