rare-sapphire•3y ago
Scrape different layouts
Hi,
I am just getting started with Apify web scraping.
I am trying to scrape a page with different page layouts. There are many pages of listed items with links to the item pages. First I scrape the links from the list and then I need to fetch the actual data from each item page.
How can I manage this in Apify? My current solution is to wrap it in an if/else depending on the content of the URL. However, this causes issues when I try to add new requests, as apparently the await statement can't be used anywhere except in async functions and at the top level of module bodies.
11 Replies
Hello @AK,
Generally speaking, if the logic of scraping one page is different from the others, it is implemented as another standalone actor.
But even your if-else solution should work.
Could you share the block of code where the await statement cannot be used? We may be able to figure it out.
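For reference, a minimal sketch of how the if/else approach could look in a pageFunction. It branches on a label stored with each request instead of on the URL text. The labels `LIST` and `ITEM` and the selectors `a.item-link` and `h1` are made up for illustration, and it assumes the context exposes `request`, `page` and an async `enqueueRequest`:

```javascript
// Sketch only: the LIST/ITEM labels and both selectors are hypothetical.
function pickLayout(request) {
  // Requests without a label are treated as list pages (the start URLs).
  const label = (request.userData && request.userData.label) || 'LIST';
  return label === 'ITEM' ? 'item' : 'list';
}

async function pageFunction(context) {
  const { request, page } = context;

  if (pickLayout(request) === 'list') {
    // List layout: collect the item links and queue each one with the ITEM label.
    const links = await page.$$eval('a.item-link', (anchors) => anchors.map((a) => a.href));
    for (const url of links) {
      await context.enqueueRequest({ url, userData: { label: 'ITEM' } });
    }
    return; // Nothing is scraped from a list page in this sketch.
  }

  // Item layout: extract the actual data.
  const title = await page.$eval('h1', (el) => el.textContent.trim());
  return { url: request.url, title };
}
```

Because the whole function is declared `async`, every `await` above sits in a place where it is allowed.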
rare-sapphireOP•3y ago
Does this mean it is possible to use the output from one actor as input to another actor?
This is essentially what I am trying to do right now that doesn't work:
The error I get is this:
ERROR Compilation of pageFunction failed.
await is only valid in async functions and the top level bodies of modules
sorry about the shitty formatting in that codeblock - it fucked up when I copied it
There are several ways to deal with this. The easiest to understand is probably to move the await outside of the (for)each:
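A minimal sketch of that idea: the forEach callback only collects the requests, and a single await happens afterwards in the async function body. The `enqueueMany` parameter is a hypothetical stand-in for whatever bulk-enqueue call your version provides; it is not a confirmed Apify function name.

```javascript
// Sketch only: `enqueueMany` is a hypothetical bulk-enqueue function passed in by the caller.
async function enqueueLinks(links, enqueueMany) {
  const requests = [];

  // The callback is a plain (non-async) function, so it must not contain await.
  // Here it only collects data.
  links.forEach((url) => {
    requests.push({ url });
  });

  // The single await sits directly in the async function body, which is valid.
  await enqueueMany(requests);
  return requests;
}
```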
Actually, it would generate fewer requests to the Apify API (it will use only one, with all the URLs at once).
rare-sapphireOP•3y ago
that makes sense. So the issue is the foreach. Before changing it I just appended it to a list. I'll go back to doing that. Thanks man
Is it possible to use the output of an actor as input in another actor?
Also I found another issue. context.EnqueueRequests isn't a function that exists when I try it out. Is there any other way to queue multiple requests at the same time?
Which version of apify do you use? (You can see it in package.json.)
In the latest it could be await context.addRequests(requests)
rare-sapphireOP•3y ago
I have no idea tbh. I just fired it up in my browser today using the online console
rare-sapphireOP•3y ago
but addRequests doesn't work either. Again I just get the message that no such function exists
I found the apify docs, but I still don't see anything there indicating I can add multiple requests to the queue
Oh, so are you using the puppeteer scraper actor (https://console.apify.com/actors/YJCnS9qogi9XxDgLB), or did you create a new one from the template?
If so, I am not deeply familiar with that version of the Apify SDK, but:
should also work.
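A minimal sketch of a loop-based variant (the reply below mentions a for-loop), assuming the context exposes an async `enqueueRequest(request)` method. The helper name `enqueueOneByOne` is made up for illustration:

```javascript
// Sketch only: assumes `context.enqueueRequest` exists and returns a promise.
async function enqueueOneByOne(context, links) {
  // Unlike a forEach callback, a for...of loop runs inside the async function itself,
  // so await is allowed in its body.
  for (const url of links) {
    await context.enqueueRequest({ url });
  }
}
```

Note that this still makes one enqueue call per link, so it fixes the compilation error but does not reduce the number of calls.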
rare-sapphireOP•3y ago
Hi again @Pepa J
I spent the rest of yesterday trying to figure out what to do from here.
The for-loop seems to be the same solution as my foreach solution? With the same amount of calls?
What I want to do from here is to move my solution into a local application and make API calls to Apify. But I still can't see anywhere in the documentation that I can add multiple requests in one call. It would greatly reduce the number of calls I have to make to the API, so it would be much appreciated if we could figure out whether this actually exists despite it not being obvious from the docs.
@Pepa J you don't have to answer me anymore. I have given up on Apify and will just build my own scraper in python. I realize the documentation is quite shit for python and close to non-existent, and it won't take me long to build my own. Thanks for your help though - I am sorry that Apify isn't mature enough for proper python usage and in depth documentation in that area.
@AK I am sorry to hear that.
The puppeteer-scraper that you are using is meant to be a pre-made standalone actor. It has its limitations (as mentioned in the Readme), but it is quick and easy to set up and run. For further customization it is worth creating your own actor - with tools like apify-cli (https://docs.apify.com/cli/) it should take a few minutes to create it locally and push it to the platform - or run it locally.
About the Python docs - we recently added official support for Python (a few weeks ago), so the docs will be improved in the future.