Apify & Crawlee


This is the official developer community of Apify and Crawlee.


Failed to make outgoing request to sample ZIP file

My Actor tried to fetch the ZIP file via its URL, but the request errored. I want to check whether the runner can perform outgoing requests at all. URL: https://getsamplefiles.com/download/zip/sample-1.zip...
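
One quick way to check is to make the same request from inside the Actor and log the result. A minimal sketch using plain httpx (not Apify tooling; the URL is the one from the question):

```python
# A quick connectivity probe: run this inside the Actor to see whether
# outgoing HTTP requests work at all.
import httpx

SAMPLE_URL = 'https://getsamplefiles.com/download/zip/sample-1.zip'

try:
    response = httpx.get(SAMPLE_URL, follow_redirects=True, timeout=30)
    print(f'status={response.status_code}, bytes={len(response.content)}')
except httpx.HTTPError as exc:
    print(f'outgoing request failed: {exc!r}')
```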

Strange behaviour when using rq_client()

I have been struggling with tests for a while, and finally reduced it to a simple test whose behaviour I don't understand. Is this expected (and I am missing something), or is it a bug? This test fails: async def test_failing(): storage_client = MemoryStorageClient()...

Downloading PDFs and other files

When crawling, whenever a download starts on a webpage (a PDF or similar), Crawlee errors. What would be the correct way to catch this error and do the download myself? I have a similar problem with XMLs. In other words, I am using a Playwright crawler but I want to be able to download content (and parse + enqueue links) on my own when my crawler can't. I was thinking of having another request queue for PDFs that I dequeue using another crawler (and add to this queue inferring it is a pdf from the u...
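
One possible shape for the two-queue idea described above, as a sketch under assumptions: the 'pdfs' queue name is illustrative, and whether http_response.read() is awaitable depends on the Crawlee version. The PDF-looking URLs would be pushed to this named queue by the main Playwright crawler; this second crawler just stores raw bytes instead of parsing HTML.

```python
# A sketch: a separate named request queue consumed by a plain HTTP crawler
# that saves the raw response bodies into a key-value store.
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import KeyValueStore, RequestQueue


async def main() -> None:
    # The main crawler would enqueue suspected-PDF URLs into this queue.
    pdf_queue = await RequestQueue.open(name='pdfs')
    store = await KeyValueStore.open()

    crawler = HttpCrawler(request_manager=pdf_queue)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # Save the raw body; read() may be synchronous in older versions.
        body = await context.http_response.read()
        await store.set_value(context.request.id, body, content_type='application/pdf')

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```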

I face a problem when listening for the abort event [Python SDK]

I followed the official documentation's template code to implement an Apify Actor, then built and deployed it on the Apify platform. When I run the Actor and click Abort (graceful abort), the ABORTING event handler handler_foo does not log anything; the Actor.log.info(...) inside it never appears. I have no way to confirm whether the handler is actually being triggered, and I need to perform custom actions when the user aborts the run. Questions: ...
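
For reference, a minimal sketch of how an ABORTING listener might be wired with the Apify Python SDK; the handler name mirrors the question's handler_foo and the sleep is only there to keep the run alive long enough to abort it.

```python
# A minimal sketch of registering an ABORTING listener.
import asyncio

from apify import Actor, Event


async def main() -> None:
    async with Actor:
        async def handler_foo(event_data: object) -> None:
            # Should fire when the platform emits the graceful-abort event.
            Actor.log.info('ABORTING received, running cleanup...')

        Actor.on(Event.ABORTING, handler_foo)

        # Keep the run alive long enough to click Abort in the Console.
        await asyncio.sleep(300)


if __name__ == '__main__':
    asyncio.run(main())
```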

Field required [type=missing, input_value={'id': 'e65qXc7GXZBNFA8CD...runs/e65qXc7GXZBNFA8CD'}, inpu

https://console.apify.com/view/runs/e65qXc7GXZBNFA8CD My Actor suddenly threw an error, even though I didn't make any changes to the code. What's more, it works fine when I run it myself, but there's an issue when my users run it....

AdaptivePlaywright crawler with SQL queue gets stuck

I am using multiple instances of an AdaptivePlaywright crawler on different servers, using the SQL queue. After running for some time, it looks like the crawler stops working and I only regularly see these logs:
[crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 9; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.023; client_info = 0.0
[crawlee._autoscaling.autoscaled_pool] WARN Task timed out after not set seconds
[crawlee.crawlers._basic._basic_crawler] INFO Crawled 0/1157424 pages, 0 failed requests, desired concurrency 10.
...

Crawl Speed

Let's say I have one session that makes 15 requests. I don't like that it finishes the crawl in 25 seconds. How do I stretch the crawl duration / slow down the speed for small crawls? max_requests_per_minute is useless in this situation, right? What options do I have?
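
One hedged workaround, assuming pacing inside the handler is acceptable: cap concurrency at one and sleep per request, so 15 requests take at least 15 times the chosen delay.

```python
# A workaround sketch; DELAY_SECONDS is an illustrative pacing value.
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

DELAY_SECONDS = 5

crawler = HttpCrawler(
    concurrency_settings=ConcurrencySettings(max_concurrency=1),
)

@crawler.router.default_handler
async def handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')
    await asyncio.sleep(DELAY_SECONDS)  # stretch the overall crawl duration
```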

How to maximize throughput with concurrency settings?

I am deploying Crawlee in my Kubernetes cluster. At first I tried not setting a maximum (nor a minimum) number of concurrent tasks, but Crawlee kept taking more and more memory until the pod was killed by the cluster (it took too much memory). Now I set a conservative maximum, but I see resources are very underutilised at some moments, depending on the page I am crawling. Is there a way to do this correctly that I am not seeing? Or is it possible that there is a bug when determining a max amount of co...
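
A hedged sketch of one lever here: telling Crawlee the pod's actual memory limit so the autoscaled pool scales against the container limit rather than host memory. The memory_mbytes field name assumes the Python Configuration API (the CRAWLEE_MEMORY_MBYTES environment variable should be equivalent).

```python
# A sketch: pin the memory budget to the k8s limit and keep a concurrency
# range instead of a single conservative cap; values are illustrative.
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler

config = Configuration(memory_mbytes=2048)  # match the pod's memory limit

crawler = HttpCrawler(
    configuration=config,
    concurrency_settings=ConcurrencySettings(min_concurrency=2, max_concurrency=32),
)
```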

Whole crawler dies because "failed to lookup address information: Name or service not known"

I am not able to reproduce it in a simple example (it may be a transient error), but I have gotten this error regularly and it kills the crawler completely.
```
Traceback:
  File "crawlee/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function
...
```

For 429 errors, do you consider adding a setting or changing the default behaviour to allow retries?

From what I have seen, when there is a 429 error (rate limit), Crawlee tries rotating the session, but if all of the sessions it uses return 429 (for example, all proxies have been rate-limited), the request is just marked as handled and forgotten. I would expect this to be treated as a transient error that can be fixed with time, the same as a 500 error. Do you expect to support this change in behaviour in the future?
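
If the goal is retries rather than only session rotation, one possible approach, assuming the additional_http_error_status_codes option behaves as its name suggests, is to register 429 as an ordinary error status so it goes through the normal retry path:

```python
# A hedged sketch: treat 429 as a retryable error status.
from crawlee.crawlers import HttpCrawler

crawler = HttpCrawler(
    additional_http_error_status_codes=[429],  # retry rate-limited responses
    max_request_retries=8,
)
```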

Not able to connect Kiwi MCP Server through my OpenAI agent workflow

You can see the attached screenshot: I'm trying to connect the MCP server inside my Agent workflow, but it is giving the following error:
{
  "error": {
    "message": "Error retrieving tool list from MCP server: 'http://wondrous-director--kiwi-mcp-server-task.apify.actor/mcp'. Http status code: 424 (Failed Dependency)",
    "type": "external_connector_error",
...

The (python-client) Group Scraper does not fetch the post title, only the post description.

In the given image, using the python-client I can only get the post description, but I cannot fetch the title of the post itself (shown in bold). Any clue how to solve this?...

Session Cookies

When I use the bs4 crawler, I would expect the session to store received cookies and then reuse them in subsequent requests, but SessionCookies is empty. Any idea why, and how to make it work?
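
A hedged debugging sketch, assuming the handler context exposes the bound session: log the session's cookie jar after each response to see whether anything was captured at all.

```python
# A sketch for inspecting what a session has stored so far.
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler(use_session_pool=True)

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    if context.session is not None:
        # Printing the SessionCookies object shows what was captured.
        context.log.info(f'Session cookies: {context.session.cookies}')
```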

How to set RAM when calling an Actor via API (Python client)?

Hi, I’m running the novi/fast-tiktok-api Actor from my Python app using the Apify Python Client. Here is a simplified version of my code: ...
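
For reference, the Python client's call() method accepts a memory_mbytes argument (in megabytes) that sets the run's RAM; a sketch with placeholder token and input:

```python
# A sketch of setting the run's RAM when calling an Actor via the API.
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # placeholder token

run_input = {}  # your actual Actor input goes here
run = client.actor('novi/fast-tiktok-api').call(
    run_input=run_input,
    memory_mbytes=1024,  # run with 1 GB of RAM
)
print(run['status'] if run else 'no run returned')
```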

Have a subscription but can't run anything!

Hello, I've used Apify before; in fact, I have an active subscription. But since this morning (I don't know why) I can't run any Actor and can't set the maximum charged results. I tried changing browsers and emptying the cache, but nothing works. This is the error I got: "Error: Cannot run Actor (Expected property maxItems to be of type number but received type string in object)". Can anybody help me, please?

Pay-per-Event testing

Hey, how do I test the monetization with Pay-per-Event if I run the Actor on Apify and not locally on my device?
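
For context, charging pay-per-event events from the Python SDK goes through Actor.charge; a hedged sketch where 'result-item' is an illustrative event name that must match the Actor's pricing configuration:

```python
# A sketch of emitting a pay-per-event charge from inside an Actor run.
import asyncio

from apify import Actor


async def main() -> None:
    async with Actor:
        # Charge one unit of a hypothetical 'result-item' event.
        await Actor.charge(event_name='result-item', count=1)


if __name__ == '__main__':
    asyncio.run(main())
```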

Issue with the Apollo scraper

I am scraping Apollo with this scraper: https://console.apify.com/actors/jljBwyyQakqrL1wae but the problem is it stops in the middle. It says its limit is 50k, but it stops around 30k. The whole lead list is actually 39k, so it should be fine. Any ideas how to fix this and why it is happening?...

Crawler becomes slower as time goes on

Hello guys, thanks for your great tools. I have a problem with Crawlee: it works well when I start the crawler, but when my VPN has a problem and I switch my config, the crawler won't continue and I have to restart it. Are there any timeout fields to manage the maximum time each request can take? Also, sometimes it becomes slower for no reason and never gets back to the same speed (rpm) as at the beginning. ```python...
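
On the timeout question, a hedged sketch: the request_handler_timeout keyword (assuming it behaves as named) caps how long a single request may take before it is failed and retried, so a dead connection cannot stall the crawl indefinitely.

```python
# A sketch: bound per-request time and retry failed requests; values illustrative.
from datetime import timedelta

from crawlee.crawlers import HttpCrawler

crawler = HttpCrawler(
    request_handler_timeout=timedelta(seconds=60),  # fail the request after 60 s
    max_request_retries=3,
)
```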

Apify Actors Not Working | HELP PLEASE

I need help. For a reason I don't know, every time I use an Apify Actor it doesn't run nearly as much as I want it to, and sometimes not at all. For example, if I run the Facebook Search Scraper with 100 max results, it stops after 10 and says the run was successful. Can anyone help?

Crawling one link at a time

Hi all, I'm taking my first steps with Crawlee and trying out the basic example on https://crawlee.dev/python. Is there a (clean) way to run the crawl one step at a time, with a step being a single execution of request_handler?...
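
A hedged partial answer: capping concurrency at one serializes the crawl so handlers run strictly one at a time, though this is sequential execution rather than a true pause-between-steps mechanism.

```python
# A sketch: force strictly sequential request_handler execution.
from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(max_concurrency=1),
)
```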