automatic-azure•2y ago
Adding multiple requestHandlers as a workflow
I am building a crawler that scrapes social media links from websites. The workflow I want is:
Given a list of URLs and names of business websites:
- It first goes to the homepage and looks for socials there.
- It then enqueues the links whose path includes contact or about, crawls those, and then moves on to the next URL.
What I want to add: after it finishes scraping the homepage and the contact or about pages (if any), I have another class that takes page as a parameter and has a function getGoogleResults (it gets the top search results for a business name). I want to invoke a requestHandler that calls getGoogleResults(query) and adds the results to the Dataset.
Another thing I want to do is make the dataset return the results instead of saving them individually. The problem with saving them individually is that it stores duplicates. For example, if the socials are in the footer and the site also has an about page, the same socials are saved twice, which I want to avoid.
automatic-azureOP•2y ago
my current requestHandler code:
@Helper
You can always enqueue the next request from the previous one, so you always go one by one: home -> about -> contact -> google. You can pass the intermediate data in request.userData and only push to the Dataset in the last step.
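A rough sketch of that chaining, written as plain TypeScript helpers so it does not depend on any crawler API. Everything here is an assumption for illustration: the step labels, the ChainData shape stored in request.userData, and the helper names planPending, mergeSocials and advance are hypothetical. The idea is that each handler calls advance with whatever it found on its page, then either enqueues the returned request (in Crawlee that would probably be crawler.addRequests) or pushes the final record once (probably Dataset.pushData).

```typescript
// Hypothetical step labels, in the order home -> about -> contact -> google.
type Step = 'ABOUT' | 'CONTACT' | 'GOOGLE';

interface PendingStep { step: Step; url: string; }

// Assumed shape of the data carried from handler to handler in request.userData.
interface ChainData {
    name: string;           // business name, used as the search query later
    socials: string[];      // social links collected so far, without duplicates
    pending: PendingStep[]; // pages still to visit for this business
}

interface NextRequest { url: string; label: Step; userData: ChainData; }
interface FinalRecord { name: string; socials: string[]; googleResults: string[]; }
type Outcome = { next: NextRequest } | { done: FinalRecord };

// Merge newly found links into the accumulated list, skipping duplicates.
// Links are compared after trimming whitespace and trailing slashes.
function mergeSocials(existing: string[], found: string[]): string[] {
    const normalize = (u: string): string => u.trim().replace(/\/+$/, '');
    const merged = existing.slice();
    const keys = existing.map(normalize);
    for (const link of found) {
        const key = normalize(link);
        if (keys.indexOf(key) === -1) {
            keys.push(key);
            merged.push(key);
        }
    }
    return merged;
}

// Run on the homepage: decide which pages to visit next, with the search last.
// A link qualifies if its URL contains "about" or "contact" (a loose match).
function planPending(name: string, links: string[]): PendingStep[] {
    const pick = (word: string): string | undefined =>
        links.filter((l) => l.toLowerCase().indexOf(word) !== -1)[0];
    const pending: PendingStep[] = [];
    const about = pick('about');
    if (about) pending.push({ step: 'ABOUT', url: about });
    const contact = pick('contact');
    if (contact) pending.push({ step: 'CONTACT', url: contact });
    pending.push({
        step: 'GOOGLE',
        url: `https://www.google.com/search?q=${encodeURIComponent(name)}`,
    });
    return pending;
}

// Run at the end of every handler: fold in what this page yielded, then return
// either the next request to enqueue or the single record to push at the end.
function advance(data: ChainData, found: string[], googleResults: string[] = []): Outcome {
    const socials = mergeSocials(data.socials, found);
    const [head, ...rest] = data.pending;
    if (head) {
        return {
            next: { url: head.url, label: head.step, userData: { ...data, socials, pending: rest } },
        };
    }
    return { done: { name: data.name, socials, googleResults } };
}
```

Because the socials travel in userData and are merged on every step, a link found in the footer of both the homepage and the about page ends up in the final record only once, and nothing is written to the Dataset until the google step finishes.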