exotic-emerald•2y ago
long running scraper, 500+ pages for each crawl
Hello,
I have a Playwright crawler listening to DB changes: whenever a row gets added, I want it to scrape and enqueue ~500 links for that row's whole scraping process, but multiple rows can be added to the DB at the same time. I've tried keepAlive, but maxRequestsPerCrawl is hard to manage if we just keep adding URLs to the same crawler.
My question is: what's the best way to create a playwright crawler that will automatically handle the processing of 500 pages for each start?
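For context, a minimal sketch of the setup being described, assuming a keepAlive crawler fed by a DB trigger. The onRowAdded hook and the row shape are hypothetical placeholders, not a real API:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Hypothetical stand-in for the DB trigger described above.
declare function onRowAdded(cb: (row: { id: string; url: string }) => void): void;

const crawler = new PlaywrightCrawler({
    keepAlive: true, // keep the crawler running even when the queue drains
    async requestHandler({ request, enqueueLinks }) {
        console.log(`Scraping ${request.url}`);
        await enqueueLinks(); // follow links discovered on the page
    },
});

// With keepAlive, run() resolves only once the crawler is torn down.
void crawler.run();

// Each new DB row seeds its own crawl on the shared crawler.
onRowAdded(async (row) => {
    await crawler.addRequests([{ url: row.url, userData: { rowId: row.id } }]);
});
```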
14 Replies
The Apify approach is to do multiple runs, so running Crawlee in the Apify cloud is considered the best way so far by many people 😉
exotic-emeraldOP•2y ago
@Alexey Udovydchenko haha, good upsell, but I don't want to use that 😛 any other suggestions? cc @Pepa J
I was thinking of overriding the isFinished function
You can use it partially, e.g. by running Node.js processes on your own server with a named request queue and dataset in the cloud, or dockerize it and run instances entirely on your own host. I think the approach will be the same in any case: you need an environment for multiple runs.
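Roughly, a sketch of that setup: one process per DB row, each with its own named storages and a 500-request cap. The queue/dataset names and the start URL are illustrative:

```ts
import { PlaywrightCrawler, RequestQueue, Dataset } from 'crawlee';

// Named storages let separate processes/runs share or resume state.
const requestQueue = await RequestQueue.open('db-row-42');
const dataset = await Dataset.open('db-row-42');

const crawler = new PlaywrightCrawler({
    requestQueue,
    maxRequestsPerCrawl: 500, // per-run cap, so one run per DB row stays simple
    async requestHandler({ request, page, enqueueLinks }) {
        await dataset.pushData({ url: request.url, title: await page.title() });
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```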
exotic-emeraldOP•2y ago
@Alexey Udovydchenko is there no way to have 1 crawler handle multiple request queues?
But what is the problem you are solving? If it is processing speed, then you need a bigger server or more servers (Apify, AWS, etc.)
Otherwise, I don't see a problem with keepAlive: true; you can also use the forefront option of the request queue to prioritize some requests over others.
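For example, a minimal sketch of forefront on the request queue (the URLs are placeholders):

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Normal requests are appended to the tail of the queue.
await queue.addRequest({ url: 'https://example.com/low-priority' });

// `forefront: true` puts a request at the head, so it is crawled next.
await queue.addRequest(
    { url: 'https://example.com/urgent' },
    { forefront: true },
);
```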
exotic-emeraldOP•2y ago
@Lukas Krivka thanks for commenting. I basically have addRequests listening to a db trigger, so whenever new rows come in, we add them to the request queue, but each database row needs ~500 crawls.
exotic-emeraldOP•2y ago
So I need to control maxRequestsPerCrawl per database row, i.e. either have a crawler per row, or somehow have a requestQueue per DB id, but that means the crawler would need to manage multiple queues.
Hope that makes sense, thanks for the help.
You can just track that in an arbitrary state object using useState
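A rough sketch of that idea using crawler.useState, assuming each request carries its DB row id in userData (the 500-page budget and the rowId plumbing are illustrative):

```ts
import { PlaywrightCrawler } from 'crawlee';

const PAGES_PER_ROW = 500; // per-DB-row budget from the thread

const crawler = new PlaywrightCrawler({
    keepAlive: true,
    async requestHandler({ request, enqueueLinks, crawler }) {
        const rowId = request.userData.rowId as string;

        // Shared state object, persisted by Crawlee across the run.
        const state = await crawler.useState<Record<string, number>>({});
        state[rowId] = (state[rowId] ?? 0) + 1;

        // Only keep enqueueing while this row is under its own budget.
        if (state[rowId] < PAGES_PER_ROW) {
            await enqueueLinks({
                // Carry the row id forward so child pages count against
                // the same per-row budget.
                transformRequestFunction: (req) => {
                    req.userData = { rowId };
                    return req;
                },
            });
        }
    },
});
```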
exotic-emeraldOP•2y ago
@Lukas Krivka I'm getting so many of these:
ERROR PlaywrightCrawler: Request failed and reached maximum retries. elementHandle.textContent: Target page, context or browser has been closed
Any way to debug this? Or maybe I can ask one of you to look at my code and give me some advice? Willing to pay $
Either you are running out of memory and the page crashed, or you don't await some code and the page was already closed when you tried to get the text
exotic-emeraldOP•2y ago
I imagine it has to do with the hackiness of how I'm starting the Playwright crawler.
When I start it, I get these errors:
Error: Object with guid handle@dc8fe92256cc3997e03d3b2bf1e26da6 was not bound in the connection
elementHandle.evaluate: Target page, context or browser has been closed
Probably using Apify will solve my problems, but I'm scared it will get expensive.
do you have an example of this?
exotic-emeraldOP•2y ago
https://github.com/microsoft/playwright/issues/27997#issuecomment-1812673983 this actually solved a lot: downgrading to 1.38.0
Yeah, your code feels like it has some unhandled promises; you're probably missing some awaits
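For anyone hitting the same thing, a sketch of that failure mode and the fix (the selector and URL are placeholders):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        const heading = page.locator('h1');

        // Buggy pattern: a fire-and-forget promise. The handler returns,
        // Crawlee closes the page, and the pending call then fails with
        // "Target page, context or browser has been closed".
        // void heading.textContent().then((text) => console.log(text));

        // Fixed: await while the page is still open.
        console.log(await heading.textContent());
    },
});

await crawler.run(['https://example.com']);
```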