Apify & Crawlee • 17mo ago • 5 replies
efficient-indigo

Crawlee stops after about 30 items pushed to the datastore, repeats the same data on next run.

I'm writing my first Actor, using Crawlee's Playwright crawler to scrape the website https://sreality.cz.

I wrote a crawler using as much as possible from the examples in the documentation. It works like this:

1. Start on the first page of search, for example this one.
2. Skip the ad dialog, if it shows.
3. Find all links to next pages and add them to the queue with `enqueueLinks()`.
4. Find all links to individual items (apartments, houses, whatever) and add them to the queue with `enqueueLinks()`.
5. If the next page to process is an item page, scrape the data and save it with `pushData()`. Otherwise, if it's another search page, repeat from step 3.
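To make the intended flow concrete, here is a minimal, dependency-free sketch of steps 3 to 5. It is a simulation only and does not use Crawlee or Playwright: the `site` map, the URLs, the `LIST`/`DETAIL` labels, and the `crawl()` helper are all hypothetical stand-ins, and the ad-dialog step is left out. It assumes the queue drops URLs it has already seen.

```typescript
// Hypothetical model of the crawl flow described above. Not Crawlee's API.
type Label = "LIST" | "DETAIL";

interface QueuedRequest {
  url: string;
  label: Label;
}

interface FakePage {
  nextPages: string[]; // links to other search result pages
  items: string[];     // links to individual listings
}

// A tiny made-up site: two search pages that link to each other, three listings.
const site: Record<string, FakePage> = {
  "/search?page=1": { nextPages: ["/search?page=2"], items: ["/item/a", "/item/b"] },
  "/search?page=2": { nextPages: ["/search?page=1"], items: ["/item/b", "/item/c"] },
};

function crawl(startUrl: string, maxRequests: number) {
  const seen = new Set<string>([startUrl]); // assumed URL-based deduplication
  const queue: QueuedRequest[] = [{ url: startUrl, label: "LIST" }];
  const pushed: { url: string }[] = [];
  let finished = 0;

  // Stand-in for enqueueing links: skip URLs that were already added.
  const enqueue = (urls: string[], label: Label): void => {
    for (const url of urls) {
      if (seen.has(url)) continue;
      seen.add(url);
      queue.push({ url, label });
    }
  };

  while (queue.length > 0 && finished < maxRequests) {
    const request = queue.shift()!;
    if (request.label === "DETAIL") {
      // Step 5: an item page, so "scrape" it and save one record.
      pushed.push({ url: request.url });
    } else {
      // Steps 3 and 4: a search page, so enqueue next pages and items.
      const page = site[request.url];
      if (page) {
        enqueue(page.nextPages, "LIST");
        enqueue(page.items, "DETAIL");
      }
    }
    finished++;
  }

  return { enqueued: seen.size, finished, pushed: pushed.length };
}

const stats = crawl("/search?page=1", 1000);
console.log(stats); // { enqueued: 5, finished: 5, pushed: 3 }
```

In this model the three counters are expected to differ: search pages count as finished requests but push no data, so `pushed` is smaller than `finished`. Whether the same accounting applies to the real crawler's statistics is an assumption, not something this sketch confirms.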

In theory, this is all I need to scrape the entire search result list. However, what I see is that it enqueues all the links (around 185) but processes only around 30 of them before finishing. Very strange.

I tried setting `maxRequestsPerCrawl: 1000`, but it didn't help.

Maybe I'm missing something but I don't see why it would just stop after around 30 pages. Is there another config somewhere that controls this?

Even stranger, it then logs the final statistics, which say something like `"requestsFinished":119`. That number doesn't make sense at all: it is less than the number of links actually enqueued, but a lot more than the number of pages actually processed.
Sreality.cz • real estate and properties from across the Czech Republic
The largest selection of real estate in the Czech Republic: 98,210 properties. We offer apartments, houses, new builds, commercial premises, land, and other real estate for sale and for rent.
Apartments for sale in Ústí nad Labem • Sreality.cz
188 apartments currently listed. Apartments for sale in Ústí nad Labem ✓ Search parameters: apartment, Ústí nad Labem ✓ The largest selection of real estate in Czechia (98,210 listings), with map search and filtering by dozens of parameters ✓