useful-bronze•2y ago
Help regarding request_queue
I am building a website scraper for my users. I want to support scraping up to x child URLs, starting from the startUrl. In some cases, I am seeing duplicate links being scraped, and in some cases the number of URLs identified runs into the thousands. I want to control how URLs are enqueued into the request_queue, to avoid unnecessary costs and scraping the same URL more than once.
Here is my enqueue function:
Also, I have set the link selector to the `a` tag. Should I not use this in the scrape request's input?
1 Reply
You will need to create a global object that tracks the number of requests enqueued per start URL. You can pass the start URL to its children via `userData`.
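
A minimal sketch of that idea in plain Python, in case it helps. It does not call any SDK: `select_links_to_enqueue`, `enqueued_by_start_url`, `MAX_CHILDREN_PER_START_URL` and the `startUrl` key are made-up names for illustration, and the exact spelling of the `userData` field depends on the SDK you use. The function only decides which links are allowed through, so you would still hand the returned requests to your own enqueue call.

```python
from collections import defaultdict

# Hypothetical cap: the "x" child URLs allowed per start URL.
MAX_CHILDREN_PER_START_URL = 3

# Global tracker: start URL -> set of child URLs already accepted for enqueuing.
# A set gives both the count (its length) and duplicate detection.
enqueued_by_start_url = defaultdict(set)


def select_links_to_enqueue(start_url, candidate_urls, limit=MAX_CHILDREN_PER_START_URL):
    """Return request dicts for candidates that are new and still fit under the limit."""
    seen = enqueued_by_start_url[start_url]
    requests = []
    for url in candidate_urls:
        if len(seen) >= limit:
            break  # budget for this start URL is used up
        if url == start_url or url in seen:
            continue  # skip the start page itself and anything already accepted
        seen.add(url)
        # Carry the start URL forward so child pages share the same counter.
        requests.append({"url": url, "userData": {"startUrl": start_url}})
    return requests


# Example: links found on the start page, then on one of its children.
first = select_links_to_enqueue(
    "https://example.com",
    ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
)
second = select_links_to_enqueue(
    "https://example.com",
    ["https://example.com/b", "https://example.com/c", "https://example.com/d"],
)
```

In a page handler you would read the start URL from the current request's `userData` (falling back to the request's own URL on the start page), pass the links you extracted through this function, and enqueue only what it returns. Note that this tracker lives in memory, so it resets if the run restarts.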