absent-sapphire•2y ago
How do I detect when all request handlers have finished?
Hello,
I'm using JSDOMCrawler, and inside requestHandler I have some logic that scrapes the data and pushes it into an array called messages.
I want to push all these messages to AWS SQS for further processing when the crawler ends, but requestHandler runs asynchronously, so I'm not able to get all the messages at the end.
Is there any solution for this scenario?
8 Replies
You can put those messages into the dataset and, after the crawler finishes, send everything from the dataset to AWS.
absent-sapphireOP•2y ago
I don't want to use a dataset. Is there any other way to achieve this?
My logic inside requestHandler is quite simple
1- create one entry in mongo
2- upload scraped data to AWS S3
3- push message with all details to one array
But the most important thing I want to know is when everything has been crawled, so that I can proceed with further processing.
Hi @AlgoAlchemist,
If I understand it correctly, you just need to wait for everything to be scraped. In that case you can do something like this:
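The code from this reply did not survive in the thread text, so what follows is a sketch of the approach it describes, not the original snippet. It relies on the fact that a Crawlee crawler's run() returns a promise that resolves once the crawl has finished, so anything placed after `await crawler.run(...)` sees the complete array. The names runThenFlush, CrawlerLike, Message, and flush are hypothetical and only illustrate the pattern; flush stands in for whatever sends the batch to SQS.

```typescript
// Minimal shape assumed here: an object with an async run() method,
// which is what Crawlee crawlers such as JSDOMCrawler provide.
interface CrawlerLike {
  run(startUrls: string[]): Promise<unknown>;
}

// Hypothetical message shape; adjust it to your own data.
interface Message {
  url: string;
  s3Key: string;
}

// Hypothetical helper: run the crawl to completion, then flush everything at once.
async function runThenFlush(
  crawler: CrawlerLike,
  startUrls: string[],
  messages: Message[],
  flush: (batch: Message[]) => Promise<void>,
): Promise<number> {
  // Execution continues past this line only after run() has resolved,
  // i.e. after every handler the crawler awaited has returned.
  await crawler.run(startUrls);

  // Safe to read the array now: no awaited handler is still running.
  await flush(messages);
  return messages.length;
}
```

With Crawlee, you would pass the JSDOMCrawler instance as crawler and let its requestHandler push into the same messages array. This only holds if requestHandler itself awaits all of its asynchronous work before returning.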
absent-sapphireOP•2y ago
Hello @Pepa J ,
Thank you for your suggestion, but this fails when you set maxConcurrency.
@AlgoAlchemist What do you mean by "it fails"?
absent-sapphireOP•2y ago
@Pepa J When we set maxConcurrency, requestHandler keeps running in the background even though the crawler has stopped, so the array ends up with fewer messages than expected at the end.
@AlgoAlchemist Can you make a minimal reproducible example and send it here? It seems to me the problem has to be somewhere else.
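The thread does not establish the actual cause. One possibility consistent with the symptom (handlers "running in the background" and a short array) is a handler that starts asynchronous work without awaiting it, so it returns before the push happens. The sketch below reproduces that effect without Crawlee; slowUpload, leakyHandler, awaitedHandler, and runAll are all hypothetical stand-ins, and runAll only mimics a crawler that waits for each handler's returned promise.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-in for slow I/O such as a MongoDB insert or an S3 upload (hypothetical).
async function slowUpload(url: string): Promise<string> {
  await sleep(10);
  return `s3-key-for-${url}`;
}

// Buggy: starts the upload but returns before it finishes,
// so the push happens after the caller thinks the handler is done.
async function leakyHandler(url: string, messages: string[]): Promise<void> {
  void slowUpload(url).then((key) => {
    messages.push(key);
  });
}

// Fixed: the handler's promise resolves only after the push.
async function awaitedHandler(url: string, messages: string[]): Promise<void> {
  const key = await slowUpload(url);
  messages.push(key);
}

// Tiny stand-in for a crawl: runs the handler for every URL concurrently,
// waits for the handlers' promises, then snapshots the array.
async function runAll(
  urls: string[],
  handler: (url: string, messages: string[]) => Promise<void>,
): Promise<string[]> {
  const messages: string[] = [];
  await Promise.all(urls.map((url) => handler(url, messages)));
  return [...messages]; // what the code "after the crawl" would see
}
```

In this sketch, runAll with leakyHandler returns an empty snapshot, while awaitedHandler yields one entry per URL. The same effect appears with un-awaited calls inside forEach callbacks, so checking that every step in requestHandler is awaited is a reasonable first thing to verify in a minimal reproduction.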