exotic-emerald
exotic-emerald•2y ago

long running scraper, 500+ pages for each crawl

Hello, I have a playwright crawler that is listening to db changes, whenever a page gets added I want it to scrape and enqueue 500 links for the whole scraping process, but, there can be multiple things added to the DB at the same time. I've tried keepAlive and the maxRequests thing is hard to manage if we just keep adding urls to the same crawler. My question is: what's the best way to create a playwright crawler that will automatically handle the processing of 500 pages for each start?
14 Replies
Alexey Udovydchenko
Alexey Udovydchenko•2y ago
The Apify approach is to do multiple runs, so running Crawlee in the Apify cloud is considered the best way so far by many people 😉
exotic-emerald
exotic-emeraldOP•2y ago
@Alexey Udovydchenko hahha, good upsell, but I don't want to use that 😛 Any other suggestions? cc @Pepa J I was thinking of overriding the isFinished function
Alexey Udovydchenko
Alexey Udovydchenko•2y ago
You can use it partially, e.g. by running Node.js processes on your server with a named request queue and dataset in the cloud, or by dockerizing and running instances entirely on your own host. I think the approach will be the same in any case: you need an environment for multiple runs.
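Roughly something like this, as a sketch of one isolated run per DB row with its own named queue and dataset (the names and handler body here are made up, not your code):

import { Dataset, PlaywrightCrawler, RequestQueue } from 'crawlee'

// One isolated run per DB row: a named queue and dataset keep parallel runs apart.
export const runForPlace = async (placeId: string, startUrls: string[]) => {
  const requestQueue = await RequestQueue.open(`queue-${placeId}`)
  const dataset = await Dataset.open(`results-${placeId}`)

  const crawler = new PlaywrightCrawler({
    requestQueue,
    maxRequestsPerCrawl: 500, // the limit now applies to this run only
    requestHandler: async ({ page, enqueueLinks }) => {
      await dataset.pushData({ url: page.url(), title: await page.title() })
      await enqueueLinks()
    },
  })

  await crawler.run(startUrls)
}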
exotic-emerald
exotic-emeraldOP•2y ago
@Alexey Udovydchenko is there no way to have 1 crawler handle multiple request queues?
Lukas Krivka
Lukas Krivka•2y ago
But what is the problem you are solving? If it is processing speed, then you need a bigger server or more servers (Apify, AWS, etc.). Otherwise, I don't see a problem with keepAlive: true; you can also use the forefront option of the request queue to prioritize some requests over others.
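A minimal sketch of the keepAlive approach, assuming the DB listener keeps feeding one long-lived crawler (onNewDbRow is a hypothetical hook, not an existing API):

import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  keepAlive: true, // don't shut down when the queue is momentarily empty
  requestHandler: async ({ page, enqueueLinks }) => {
    // ... scrape the page and enqueue further links ...
  },
})

const crawlerPromise = crawler.run() // resolves only after crawler.teardown() is called

// Hypothetical hook called by the DB-change listener:
export const onNewDbRow = async (urls: string[]) => {
  await crawler.addRequests(
    urls.map((url) => ({ url })),
    { forefront: true }, // prioritize these over already-queued requests
  )
}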
exotic-emerald
exotic-emeraldOP•2y ago
@Lukas Krivka thanks for commenting. I basically have addRequests hooked up to a DB trigger, so whenever new rows come in, we add them to the request queue, but each database row will need to do 500 crawls.
exotic-emerald
exotic-emeraldOP•2y ago
So I need to control maxRequestsPerCrawl per database row, i.e. either have a crawler per row, or somehow have a requestQueue per DB id, but that means the crawler would need to manage multiple queues. Hope that makes sense, thanks for the help.
Lukas Krivka
Lukas Krivka•2y ago
You can just track that in an arbitrary state object using useState, e.g.:
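Something like this sketch (the 500 limit and the userData shape are taken from what you described; the rest is made up):

import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  keepAlive: true,
  requestHandler: async ({ request, enqueueLinks, crawler }) => {
    // Shared, persisted state object: one page counter per DB row / place.
    const state = await crawler.useState<{ pagesPerPlace: Record<string, number> }>({
      pagesPerPlace: {},
    })
    const placeId = request.userData.place.id as string

    state.pagesPerPlace[placeId] = (state.pagesPerPlace[placeId] ?? 0) + 1

    // Stop expanding this place's crawl once its own 500-page budget is used up.
    if (state.pagesPerPlace[placeId] < 500) {
      await enqueueLinks({
        userData: request.userData, // carry the place id over to child requests
      })
    }
  },
})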
exotic-emerald
exotic-emeraldOP•2y ago
@Lukas Krivka I'm getting so many of these: ERROR PlaywrightCrawler: Request failed and reached maximum retries. elementHandle.textContent: Target page, context or browser has been closed. Any way to debug this? Or can I maybe ask one of you to look at my code and give me some advice? Willing to pay $.
Lukas Krivka
Lukas Krivka•2y ago
Either you are running out of memory and the page crashed, or you don't await some code and the page was already closed when you tried to get the text.
exotic-emerald
exotic-emeraldOP•2y ago
I imagine it has to do with the hackiness of how I'm starting the Playwright crawler:
export const createCrawler = async (place_id: string, pdf_crawler: BasicCrawler) => {
  const request_queue = await RequestQueue.open(place_id)
  return new PlaywrightCrawler({
Then, to start it, I do:
const pdf_crawler = await createPDFCrawler(place.id)
const web_crawler = await createWebCrawler(place.id, pdf_crawler)

console.log('Starting crawlers', place.name, place.id)
const requests = queuedRunDocument['urls'].map((url) => {
  return new CrawleeRequest({
    url: url,
    userData: {
      place: { name: place.name, id: place.id, url },
    },
  })
})

await web_crawler.addRequests(requests)
const web_promise = new Promise((resolve, reject) => {
  web_crawler
    .run()
    .then(() => {
      console.log('web crawler finished', place.id)
      resolve(true)
    })
    .catch((e) => {
      console.log('web crawler error', e)
      reject(e)
    })
})
I get these errors: Error: Object with guid handle@dc8fe92256cc3997e03d3b2bf1e26da6 was not bound in the connection elementHandle.evaluate: Target page, context or browser has been closed. Probably using Apify will solve my problems, but I'm scared it will get expensive. Do you have an example of this?
exotic-emerald
exotic-emeraldOP•2y ago
GitHub: [BUG] Playwright-Java: Getting 'browser.newContext: Target page, context or browser has been closed'
Lukas Krivka
Lukas Krivka•2y ago
Yeah, your code feels like it has some unhandled promises; you're probably missing an await somewhere.
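For illustration, this is the kind of bug that produces that error (the selector and handler body are just examples, not your code):

import { Dataset, PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page }) => {
    // Buggy: the promise isn't awaited, so the handler returns and Crawlee may
    // close the page before textContent() resolves -> "Target page, context or
    // browser has been closed".
    // page.textContent('h1').then((title) => Dataset.pushData({ title }))

    // Correct: await every Playwright call before the handler returns.
    const title = await page.textContent('h1')
    await Dataset.pushData({ title, url: page.url() })
  },
})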
