sunny-green 2y ago

Not generating jsons but crawling

So far I have scraped 300k links and generated JSON files for them. My crawler still crawls through the links according to my labels and glob patterns, but it is no longer generating JSON files. I have two questions: 1. Does crawler.run() accept not only a URL but also a label, like {url: "xyz.com", label: "website"}? 2. If I can fix this and start scraping again, I don't want the 300k links to be processed again; I only want the delta. Any help would be much appreciated! 🙏 Note: I built my crawler in TypeScript.
3 Replies
Lukas Krivka 2y ago
You can abort it and run it again, as long as you make sure you don't purge the state. I would also add more logs to the code so that you know what is happening. https://crawlee.dev/api/core/interface/ConfigurationOptions#purgeOnStart
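A minimal TypeScript sketch of this advice. The `purgeOnStart` option name comes from the linked ConfigurationOptions page; how it is wired into the crawler appears only in comments and is an assumption about a typical Crawlee setup. `remainingUrls` is a hypothetical helper that only illustrates what "the delta" means. When the state is kept, the request queue is expected to do this bookkeeping itself.

```typescript
// Option named on the linked ConfigurationOptions page.
// Assumed wiring (not executed here, it needs the `crawlee` package):
//   const config = new Configuration(configOptions);
//   const crawler = new CheerioCrawler({ requestHandler }, config);
const configOptions = { purgeOnStart: false };

// Hypothetical helper: the URLs still left once the handled ones are known.
function remainingUrls(all: string[], handled: string[]): string[] {
  const done = new Set(handled);
  return all.filter((url) => !done.has(url));
}

// Example data is made up for illustration.
const delta = remainingUrls(
  ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
  ['https://example.com/a'],
);
console.log(configOptions, delta);
```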
sunny-green (OP) 2y ago
Thank you @Lukas Krivka, I was able to deal with purgeOnStart. I wanted to add an array of URLs, labelling them as mentioned above, and I passed them to my run() call. The code doesn't break, but I am also not getting any information about why it scraped some pages and not others. Could you please help me with that?
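For reference, a small sketch of how such a labelled start list could be built in TypeScript. `LabeledRequest` and `toLabeledRequests` are hypothetical names, and the example URLs are made up. The sketch assumes that `crawler.run()` accepts an array of objects with `url` and `label` fields, which is worth confirming against the Crawlee docs for the version in use.

```typescript
// Request shape assumed to match what crawler.run() accepts.
interface LabeledRequest {
  url: string;
  label: string;
}

// Hypothetical helper: attach one label to every URL, dropping blanks and duplicates.
function toLabeledRequests(urls: string[], label: string): LabeledRequest[] {
  const seen = new Set<string>();
  const requests: LabeledRequest[] = [];
  for (const raw of urls) {
    const url = raw.trim();
    if (url === '' || seen.has(url)) continue;
    seen.add(url);
    requests.push({ url, label });
  }
  return requests;
}

const startRequests = [
  ...toLabeledRequests(['https://example.com/', 'https://example.com/'], 'website'),
  ...toLabeledRequests(['https://example.com/blog'], 'blog'),
];
console.log(startRequests);
// Assumed usage (not executed here): await crawler.run(startRequests);
```

Note that the URLs here carry a scheme (`https://`), unlike the bare `xyz.com` in the question. If a label has no matching handler registered, pages with that label may end up in a default handler that saves nothing, so logging the label at the top of each handler is a cheap first check.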
Lukas Krivka 2y ago
This is really impossible to understand without more info. Maybe enqueueLinks is not capturing all the URLs you think it should? You can switch to more manual URL collection so you can log the URLs found on each page.
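One way the manual collection could look, as a sketch only. `collectLinks` and `LinkReport` are hypothetical names, the include patterns are plain regular expressions rather than the globs from the original crawler, and the hrefs are hard-coded so the snippet runs on its own. In a real request handler they would be read from the page's anchor tags, and the matched URLs would then be added to the crawler's queue.

```typescript
interface LinkReport {
  matched: string[];
  skipped: string[];
}

// Hypothetical helper: resolve raw href values against the page URL and split
// them into those matching any include pattern and everything else.
function collectLinks(hrefs: string[], pageUrl: string, include: RegExp[]): LinkReport {
  const result: LinkReport = { matched: [], skipped: [] };
  for (const href of hrefs) {
    let absolute: string;
    try {
      absolute = new URL(href, pageUrl).href;
    } catch {
      result.skipped.push(href); // not a parsable URL
      continue;
    }
    const bucket = include.some((re) => re.test(absolute)) ? result.matched : result.skipped;
    bucket.push(absolute);
  }
  return result;
}

const linkReport = collectLinks(
  ['/products/1', 'https://example.com/about', 'mailto:hi@example.com'],
  'https://example.com/list',
  [/\/products\//],
);
// Logging both buckets per page shows which links were dropped and why.
console.log(`matched ${linkReport.matched.length}:`, linkReport.matched);
console.log(`skipped ${linkReport.skipped.length}:`, linkReport.skipped);
```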
