genetic-orange · 16mo ago

Detect when a specific request finishes for an Express-served crawler

I'm developing a long-lived crawler that is served behind Express. A user sends a request to "localhost:8347/search?q={query}", and the crawler searches Google to find sites to scrape. Currently, it only retrieves the page title from each site.

Problem: I need to determine when a specific user's request has finished processing, and I need to tell that request apart from requests made by other users. The naive solution is to check whether the RequestQueue is empty, but that isn't feasible when several users' requests are filling the same RequestQueue.

The only solution I can think of right now is to find every request with a specific datasetIndex in its request.userData property and check whether all of those requests are marked as "handled", but I don't know how to implement this yet. Are there any built-in methods in Crawlee that would solve this better?

Example scenario:
1. User 1 makes a request: localhost:8347/search?q="silmarillion"%20"1999"%20site:osta.ee
2. User 2 makes a request: localhost:8347/search?q="tasuja"%20"2017"%20site:osta.ee
3. User 1's request finishes (no more Google results to scrape).
4. searchGoogle needs to detect that User 1's request is complete and return the results to the Express route, without confusing it with User 2's request.
3 Replies
Lukas Krivka · 15mo ago
Hello, one solution is to keep a map that links each HTTP request/response pair to the Crawlee Requests created for it. At the end of the requestHandler, you look up the matching response and send it. See how it is done here: https://github.com/apify/super-scraper/blob/master/src/router.ts#L110
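For a search that fans out into several page requests, the same idea can be extended with a per-search counter. The following is only a rough sketch under assumptions, not the code from super-scraper: `SearchTracker`, `searchId`, `start`, `enqueued`, and `finished` are hypothetical names, and the class deliberately does not touch Crawlee or Express so it can be read and run on its own. It assumes each Crawlee request carries a `searchId` in `request.userData` (the question uses `datasetIndex` for the same purpose).

```typescript
// Hypothetical helper (not part of Crawlee): counts in-flight crawl requests per search.
type PendingSearch<T> = {
    remaining: number;                 // requests enqueued but not yet finished
    results: T[];                      // data collected so far for this search
    resolve: (results: T[]) => void;   // settles the promise the Express route awaits
};

class SearchTracker<T> {
    private pending = new Map<string, PendingSearch<T>>();

    /** Register a search; the returned promise resolves once all its requests have finished. */
    start(searchId: string, initialRequests = 1): Promise<T[]> {
        return new Promise<T[]>((resolve) => {
            this.pending.set(searchId, { remaining: initialRequests, results: [], resolve });
        });
    }

    /** Call when follow-up requests with the same searchId are added to the queue. */
    enqueued(searchId: string, count = 1): void {
        const entry = this.pending.get(searchId);
        if (entry) entry.remaining += count;
    }

    /** Call when one request is done: with a result on success, without one on failure. */
    finished(searchId: string, result?: T): void {
        const entry = this.pending.get(searchId);
        if (!entry) return;
        if (result !== undefined) entry.results.push(result);
        entry.remaining -= 1;
        if (entry.remaining === 0) {
            this.pending.delete(searchId);
            entry.resolve(entry.results);
        }
    }
}
```

One way to wire it up, again as an assumption about your setup: the Express route calls `start(searchId)` before adding the Google search request to the crawler, awaits the returned promise, and then sends the results with `res.json`. Inside the requestHandler, call `enqueued` for any result pages you add before calling `finished` for the current request, otherwise the counter can reach zero too early. Calling `finished` from the failedRequestHandler as well keeps a failed page from leaving the HTTP response hanging.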
