extended-salmon · 2y ago

Request queues and preserving write usage

Hello, I'm creating a supermarket data scraper. The supermarket I'm scraping has a sitemap where the URLs for every product are listed. Currently I'm loading those in like this:
import { Sitemap } from 'crawlee';

const { urls } = await Sitemap.load('https://.../entities/products/detail.xml');
And then passing them to my crawler:
await crawler.run(urls);
However, this writes all of them to the default request queue again. Writing 23,000+ items to the request queue every run costs me at least $0.50 each time. Is there any way I can write to the request queue (or another place) once, and then read from there on subsequent runs?
1 Reply
Oleg V. · 2y ago
But the list of URLs from the sitemap is dynamic, no? That's why you need to re-scrape it if you want up-to-date information from your target site. In your case you can use a named request queue:
const queueWithName = await RequestQueue.open('some-name');
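
For instance, a minimal sketch of that approach could look like this (the queue name 'products', the CheerioCrawler, and the empty-queue check are my assumptions, not something from the original code). It seeds the named queue from the sitemap only when the queue is empty, so the 23,000+ writes happen once:

import { CheerioCrawler, RequestQueue, Sitemap } from 'crawlee';

// Named queues persist between runs, unlike the default request queue.
const requestQueue = await RequestQueue.open('products');

// Seed the queue only if it is still empty, so you pay for the writes once.
const info = await requestQueue.getInfo();
if (!info || info.totalRequestCount === 0) {
    const { urls } = await Sitemap.load('https://.../entities/products/detail.xml');
    await requestQueue.addRequests(urls);
}

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        // ... extract product data from the page here ...
    },
});

// Run without a URL list so the crawler reads from the named queue.
await crawler.run();

One caveat: a request queue remembers which requests were already handled, so a second run only processes requests that are still pending. That makes this a good fit for resuming an interrupted crawl; for repeated full crawls of the same URLs, caching the list elsewhere (see below) may fit better.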
Or you can try to store all the URLs in a named key-value store, if that makes sense for your use case.
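
A rough sketch of that idea (the store name 'sitemap-cache' and the key 'PRODUCT_URLS' are made-up examples): the sitemap is scraped once, the URL list is cached, and later runs read the cached copy instead of re-fetching the sitemap. Note that the crawler still enqueues the URLs on each run, so this mainly saves the sitemap scraping step:

import { KeyValueStore, Sitemap } from 'crawlee';

// Named key-value stores also persist between runs.
const store = await KeyValueStore.open('sitemap-cache');

// Reuse the cached URL list if it exists; otherwise scrape the sitemap once.
let urls = await store.getValue<string[]>('PRODUCT_URLS');
if (!urls) {
    ({ urls } = await Sitemap.load('https://.../entities/products/detail.xml'));
    await store.setValue('PRODUCT_URLS', urls);
}

// `crawler` is your existing crawler instance, configured as before.
await crawler.run(urls);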
