Apify & Crawlee

This is the official developer community of Apify and Crawlee.

How to aggregate user data from duplicate URLs

How can I save the categories of a product with multiple categories? Currently, I'm passing the category in userData to the category handler. However, since a URL is only scraped once, only the first category gets saved and all others discarded. Here's a minimal example....
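
A minimal sketch of one possible approach (the 'a.product-link' selector, the CATEGORY/PRODUCT labels and the in-memory map are made up for illustration): accumulate every category a product URL is listed under in a map keyed by URL, so the single scrape of that URL can read the full list. This assumes the category pages are processed before the corresponding product pages.

```ts
import { CheerioCrawler } from 'crawlee';

const categoriesByUrl = new Map<string, Set<string>>();

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, pushData, $ }) {
        if (request.label === 'CATEGORY') {
            const category = request.userData.category as string;
            // Remember every category this product URL appears under, even
            // though the request queue will only ever enqueue the URL once.
            $('a.product-link').each((_, el) => {
                const url = new URL($(el).attr('href')!, request.loadedUrl).href;
                if (!categoriesByUrl.has(url)) categoriesByUrl.set(url, new Set());
                categoriesByUrl.get(url)!.add(category);
            });
            await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
        } else if (request.label === 'PRODUCT') {
            await pushData({
                url: request.loadedUrl,
                categories: [...(categoriesByUrl.get(request.url) ?? [])],
            });
        }
    },
});
```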

Impit part of Crawlee

Will Impit become a part of Crawlee and replace GotScraping?

Error handling

Hello, error handling behaviour is a little unclear. For some requests a 400 is sometimes treated as OK and sometimes as an error. Even weirder, I sometimes get an empty HTML document (I'm 100% sure that's not the actual response):
<html><head></head><body></body></html>
even after adding
await waitForSelector('#myID')
How am I supposed to deal with that? Thanks...
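
A minimal sketch of one way to handle this, assuming a PlaywrightCrawler and the '#myID' selector from the question (timeout and retry count are illustrative): treat an empty document as a soft block, retire the session and throw so Crawlee retries the request.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5,
    async requestHandler({ page, request, session, log }) {
        try {
            await page.waitForSelector('#myID', { timeout: 15_000 });
        } catch {
            // The body never rendered: likely a blocked or empty response.
            session?.retire();                 // don't reuse this browser session
            throw new Error(`Empty page for ${request.url}, retrying`);
        }
        // ... normal extraction here ...
    },
    failedRequestHandler({ request, log }) {
        log.error(`${request.url} still failing after all retries.`);
    },
});
```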

Retry requests with different headers and etc

Can we get this page in the JS docs as well: https://crawlee.dev/python/docs/guides/error-handling Also, I'm interested in best practices for changing headers when a 403 or 429 is encountered, so I don't repeat the same request with the same headers and only a different IP...
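
One commonly suggested pattern, as a sketch rather than official guidance (the user-agent list and retry count are placeholders): use the crawler's errorHandler, which runs before each retry, to change headers and retire the session so a blocked request is not repeated with the same fingerprint.

```ts
import { CheerioCrawler } from 'crawlee';

const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
];

const crawler = new CheerioCrawler({
    maxRequestRetries: 4,
    // Retire sessions on blocking status codes so a fresh session/proxy is used.
    sessionPoolOptions: { blockedStatusCodes: [403, 429] },
    errorHandler: async ({ request, session }) => {
        // Runs before the request is retried: swap headers and drop the session.
        request.headers = {
            ...request.headers,
            'user-agent': userAgents[request.retryCount % userAgents.length],
        };
        session?.retire();
    },
    async requestHandler({ request, $ }) {
        // ... normal extraction ...
    },
});
```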

Problem with scraping a site that requires login

I have a paid actor I am renting out to customers that is failing because of a recent anti-bot mitigation that prevents scraping pages past 10 without logging in. I have implemented Google login and store session cookies in a shared key-value store for the actor to use, and this seemed to work fine. However, Google has flagged the account's logins as bot activity and has since terminated the account, so the login fails and then the scraping fails as well. Before the account termination, the site I scrape also seemed to throttle my requests - however, this is without using a proxy, so it might be possible to circumvent; it has never been an issue before with this site. The site offers Google, Facebook, Apple or email login, and I chose Google because email login requires receiving a login code by email on every login, which I couldn't automate. I have been trying to resolve this for the past week and was successful until the Google account termination....
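
For the cookie-sharing part only, a minimal sketch (the store name 'LOGIN-STATE' and the key 'session-cookies' are made up): persist cookies from a logged-in Playwright context into a named key-value store and restore them in a pre-navigation hook on later runs.

```ts
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();
const store = await Actor.openKeyValueStore('LOGIN-STATE');

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Reuse previously saved cookies, if any.
            const cookies = await store.getValue<any[]>('session-cookies');
            if (cookies) await page.context().addCookies(cookies);
        },
    ],
    async requestHandler({ page }) {
        // ... scrape while logged in ...
        // After a successful login flow, persist the fresh cookies for other runs.
        await store.setValue('session-cookies', await page.context().cookies());
    },
});
```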

How would you log a nicely formatted full JSON response for debugging?

How would you log a nicely formatted full JSON response for debugging using the log?
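
One simple option, as a sketch (the payload is made up): pretty-print with JSON.stringify and pass the string to Crawlee's log, since the data object passed as the second argument of log.info() is rendered compactly.

```ts
import { log } from 'crawlee';

const data = { status: 'ok', items: [{ id: 1 }, { id: 2 }] }; // example payload
log.info(`Response body:\n${JSON.stringify(data, null, 2)}`);
```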

How to use Impit as the http client for Crawlee?

How can I use Impit as the HTTP client when using the HttpCrawler?
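
A sketch assuming the @crawlee/impit-client package, which provides an HTTP client implementation that crawlers accept through the httpClient option; check the current docs for the exact constructor options.

```ts
import { HttpCrawler } from 'crawlee';
import { ImpitHttpClient } from '@crawlee/impit-client';

const crawler = new HttpCrawler({
    // Options such as browser impersonation can be passed to the client here.
    httpClient: new ImpitHttpClient(),
    async requestHandler({ request, body, log }) {
        log.info(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://crawlee.dev']);
```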

Why does crawlee generate warning when both blockedStatusCodes and retryOnBlocked set?

Crawlee generates the following warning when we set both blockedStatusCodes and retryOnBlocked: 'Both blockedStatusCodes and retryOnBlocked are set. Please note that the retryOnBlocked feature might not work as expected.' When I set retryOnBlocked to true, Crawlee automatically sets the session pool's default blocked status codes to an empty array. When I call handleCloudflareChallenge inside post-navigation hooks, I get a 403 status code. Even though handleCloudflareChallenge tries to remove 403 from the session pool's blockedStatusCodes, since that list is already empty it doesn't have any effect....
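
For reference, a sketch of the two configurations the warning is about; as I understand it they are meant to be used separately, because retryOnBlocked takes over blocking detection and clears the session pool's blockedStatusCodes.

```ts
import { PlaywrightCrawler } from 'crawlee';

// Option A: let Crawlee detect blocking (challenge pages, blocked status codes).
const crawlerA = new PlaywrightCrawler({
    retryOnBlocked: true,
    async requestHandler() { /* ... */ },
});

// Option B: manage blocking yourself via explicit session pool status codes.
const crawlerB = new PlaywrightCrawler({
    sessionPoolOptions: { blockedStatusCodes: [403, 429] },
    async requestHandler() { /* ... */ },
});
```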

Best practices for long living crawler & rabbitmq

Hi guys, I'm here to ask about best practices. I have coded a simple manager that: - starts a Crawlee instance and initializes the queue object - receives RabbitMQ messages and pushes them to the queue ...
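
A minimal sketch under two assumptions (queue name, message shape and connection URL are placeholders): the crawler is kept alive with the keepAlive option so it does not finish when the queue drains, and each RabbitMQ message carries a URL that is pushed into the RequestQueue before the message is acked.

```ts
import amqp from 'amqplib';
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    keepAlive: true,               // keep polling the queue instead of finishing
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

const crawlerRun = crawler.run();  // resolves only when the crawler is stopped

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue('urls');

await channel.consume('urls', async (msg) => {
    if (!msg) return;
    const { url } = JSON.parse(msg.content.toString());
    await requestQueue.addRequest({ url });
    channel.ack(msg);              // ack only after the URL is safely queued
});

await crawlerRun;
```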

addRequestsBatched isn’t working for me on Apify, even though it works perfectly fine locally

Hi, I'm having trouble with addRequestsBatched (and also addRequests). It works perfectly fine when I run it locally, but on Apify it does not seem to do anything. Has anyone faced this issue or knows what might be causing it? I'm using the exact same versions of Node (22.19.0), crawlee (3.15.1) and apify (3.4.5)...

Key-Value-Store Issue

Does anyone know how to use a writeStream to save files into the key-value store? It seems to work, however the files do not show up when running on the Apify platform. Also, saving the files the usual way with await <kvs>.setValue('<file-key-here>', <Buffer>, <content-type>) errors, saying that JSON.stringify() returned undefined for the data...
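
For the non-stream path, a small sketch (key and file name are illustrative): setValue() takes the value and an options object as separate arguments, and a Buffer needs a contentType in those options; without it Crawlee tries to serialize the value as JSON, which may be where the JSON.stringify() error comes from. As far as I know, files written with a raw writeStream into the local storage folder are not uploaded to the platform store, which is accessed over the API.

```ts
import { readFile } from 'node:fs/promises';
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();
const buffer = await readFile('./report.pdf');   // illustrative file

// Value and options are separate arguments; contentType goes in the options object.
await store.setValue('report', buffer, { contentType: 'application/pdf' });
```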

Customize Run Options(memory and time out) in Playwright Node.js Crawler Actor

Hello everyone, I've been running into some issues when trying to customize the run options for my Apify Actor, especially increasing the allocated memory. Does anyone know if it's possible to configure or increase the memory amount directly from within PlaywrightCrawler (Crawlee) Node.js code? Or is this something that can only be set through the Apify Console / API run options? If you've had experience dealing with this, I'd really appreciate any insights or best practices you can share...
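
As far as I know, memory and timeout are properties of the run rather than of the PlaywrightCrawler code, so they are set in the Console or when starting the run via the API. A sketch of the API route with apify-client (actor ID, input and values are placeholders):

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('username/my-actor').call(
    { startUrls: [{ url: 'https://example.com' }] },   // actor input
    { memory: 4096, timeout: 3600 },                    // run options: MB and seconds
);
console.log(`Run ${run.id} finished with status ${run.status}`);
```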

Limit request queue

I have some crawlers that are consuming from RabbitMQ, but they obviously take all the messages from Rabbit and move them to the internal queue. Can I somehow cap the requestQueue so it only takes a finite number of requests?
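
One possible way to add backpressure (the threshold, queue name and connection URL are made up): cap the unacked RabbitMQ messages with prefetch, and only add to the Crawlee queue once getInfo().pendingRequestCount drops below your limit.

```ts
import amqp from 'amqplib';
import { RequestQueue } from 'crawlee';

const MAX_PENDING = 500;
const requestQueue = await RequestQueue.open();

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue('urls');
await channel.prefetch(10);        // at most 10 unacked messages at a time

await channel.consume('urls', async (msg) => {
    if (!msg) return;
    // Wait until the internal queue has room before adding more requests.
    while (((await requestQueue.getInfo())?.pendingRequestCount ?? 0) >= MAX_PENDING) {
        await new Promise((resolve) => setTimeout(resolve, 5_000));
    }
    await requestQueue.addRequest({ url: msg.content.toString() });
    channel.ack(msg);
});
```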

RabbitMQ and RequestQueue or List

Did anybody manage to directly connect the RequestQueue or RequestList to a continuous consumption process from a RabbitMQ queue?

Initializing CloudFlare cookies with Crawlee

Hi, I am currently using a Playwright scraper to initialize Cloudflare cookies and then send requests to the website programmatically. My problem is that the target website does multiple redirects to itself before the CF cookie is ready, which I haven't managed to handle in my code. You know the cookie is ready when you get a 200 status code....
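
A sketch of the waiting part only (the cookie name check and timeout are illustrative): keep waiting for responses until one from the site's own origin comes back with a 200, then read the cookies from the browser context and hand them to the programmatic client.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        // The challenge redirects the page to itself; wait until a response
        // from the same origin finally comes back with status 200.
        const origin = new URL(request.url).origin;
        await page.waitForResponse(
            (res) => res.url().startsWith(origin) && res.status() === 200,
            { timeout: 60_000 },
        );
        const cookies = await page.context().cookies();
        const clearance = cookies.find((c) => c.name === 'cf_clearance');
        log.info(`Cloudflare clearance cookie present: ${Boolean(clearance)}`);
        // ... pass `cookies` to your programmatic HTTP client here ...
    },
});
```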

Safe to parallelize Dataset writes across processes?

Context: • Crawlee v3.13.10, Node 22 • Linux (ext4), using storage-local • Multiple forked workers share one RequestQueueV2 (with request locking) • Each worker does:...