Apify & Crawlee


This is the official developer community of Apify and Crawlee.


Skip request in preNavigationHooks

Is it possible to skip the request for a URL in preNavigationHooks? I don't want the request handler to run at all if something occurs in preNavigationHooks. The only thing that worked for me was throwing a NonRetryableError, but I don't think that is ideal. request.skipNavigation isn't ideal either, because the request itself still occurs. At the moment I'm using NonRetryableError, but my logs are ugly. How do I suppress the logs?...
Solution:
Hmm, I like the idea with the SKIP label. I'll try that. Thanks.
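
A minimal sketch of that SKIP-label approach, assuming a PlaywrightCrawler with a router: the hook marks the request with skipNavigation so no page is loaded and relabels it to a no-op handler. The shouldSkip() predicate is a hypothetical stand-in for whatever condition you check in the hook.

```js
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

// Hypothetical predicate: replace with your own condition.
const shouldSkip = (request) => request.url.includes('/ignore-me');

const router = createPlaywrightRouter();

// No-op handler: requests relabelled to SKIP end up here and are simply dropped.
router.addHandler('SKIP', async ({ request, log }) => {
    log.debug(`Skipped ${request.url}`);
});

router.addDefaultHandler(async ({ page }) => {
    // normal scraping logic
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    preNavigationHooks: [
        async ({ request }) => {
            if (shouldSkip(request)) {
                request.skipNavigation = true; // don't load the page at all
                request.label = 'SKIP';        // route to the no-op handler above
            }
        },
    ],
});
```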

postNavigationHooks timeout

I'm using Camoufox and the handleCloudflareChallenge utility in postNavigationHooks, and the request times out after 100 seconds. Is it possible to lower the timeout limit below 100 seconds in postNavigationHooks? It seems like it doesn't respect requestHandlerTimeoutSecs or navigationTimeoutSecs.
Solution:
requestHandlerTimeoutSecs is enforced by Apify/Crawlee's overall request handler, but once you're inside a postNavigationHook, you're in user-defined logic. If handleCloudflareChallenge doesn't internally support a timeout (or ignores one), it can block longer than desired. navigationTimeoutSecs applies to page.goto() and similar calls, not necessarily to post-navigation scripts....
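
Since the hook body is plain user code, one option is to enforce your own, shorter limit by racing the challenge handling against a timer. A sketch, where solveChallenge() is a hypothetical stand-in for the challenge-handling call; the exact signature of handleCloudflareChallenge is not assumed here.

```js
import { PlaywrightCrawler } from 'crawlee';

// Hypothetical stand-in for the challenge-handling call used in the hook.
const solveChallenge = async (page) => { /* ... */ };

// Reject if the wrapped promise takes longer than `ms` milliseconds.
const withTimeout = (promise, ms, label = 'operation') =>
    Promise.race([
        promise,
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms)),
    ]);

const crawler = new PlaywrightCrawler({
    postNavigationHooks: [
        async ({ page, log }) => {
            try {
                await withTimeout(solveChallenge(page), 30_000, 'Cloudflare challenge');
            } catch (err) {
                log.warning(err.message); // give up on the challenge instead of blocking for 100 s
            }
        },
    ],
    requestHandler: async () => { /* ... */ },
});
```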

Rotate country in proxy for each request

Can we rotate the proxy country without relaunching Crawlee? I need to use a specific country for every URL, without relaunching the crawler every time.
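
One possible sketch, assuming Apify Proxy (where the country is encoded in the proxy username) and a Crawlee version whose newUrlFunction receives the current request as its second argument; verify that against your version.

```js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Per-request country selection via the Apify Proxy username (e.g. country-DE).
// Assumes newUrlFunction is passed the current request as its second argument.
const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: (sessionId, { request } = {}) => {
        const country = request?.userData?.country ?? 'US';
        return `http://groups-RESIDENTIAL,country-${country}:${process.env.APIFY_PROXY_PASSWORD}@proxy.apify.com:8000`;
    },
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, log }) => {
        log.info(`Fetched ${request.url} via a ${request.userData.country} proxy`);
    },
});

await crawler.run([
    { url: 'https://example.com/de', userData: { country: 'DE' } },
    { url: 'https://example.com/us', userData: { country: 'US' } },
]);
```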

Crawlee JS vs Crawlee Python

I've only used Crawlee JS, and I'm wondering: does Crawlee JS have the same features as Crawlee Python? Is one better than the other in some cases?
Solution:
Hi, the JS version is much older and hence more battle-tested, but we are getting close to feature parity with the upcoming v1 release of Crawlee for Python.

Managing duplicate requests using RequestQueue, but it seems off

Description: It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though my RequestQueue list has many more job IDs.
```js
import { RequestQueue } from "crawlee";
...
```
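
One common cause worth ruling out (a guess, since the code above is truncated): the queue deduplicates requests by uniqueKey, which defaults to the normalized URL, so many job IDs pointing at the same URL collapse into a single request unless each gets its own uniqueKey.

```js
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// uniqueKey defaults to the normalized URL, so these two are treated as duplicates
// and only the first one is ever processed:
await queue.addRequest({ url: 'https://example.com/job', userData: { jobId: 1 } });
await queue.addRequest({ url: 'https://example.com/job', userData: { jobId: 2 } }); // silently dropped

// Give each job its own uniqueKey if the same URL must be processed once per job ID:
await queue.addRequest({
    url: 'https://example.com/job',
    uniqueKey: 'job-2',
    userData: { jobId: 2 },
});
```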

re-enqueue request without throwing error

Is there any way to retry a request without throwing an error, while still respecting the maximum number of retries?
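
There is no built-in "soft retry" that I'm aware of; one workaround is to keep your own attempt counter in userData and re-enqueue the request under a fresh uniqueKey until your own cap is reached. A sketch (the names MAX_SOFT_RETRIES and softRetries are made up for this example):

```js
// Called from inside a request handler; not a built-in Crawlee feature.
const MAX_SOFT_RETRIES = 3;

async function requeueWithoutError({ request, crawler, log }) {
    const attempt = (request.userData.softRetries ?? 0) + 1;
    if (attempt > MAX_SOFT_RETRIES) {
        log.warning(`Giving up on ${request.url} after ${MAX_SOFT_RETRIES} soft retries`);
        return;
    }
    await crawler.addRequests([{
        url: request.url,
        label: request.label,
        // A new uniqueKey is required, otherwise the queue deduplicates the re-enqueued request.
        uniqueKey: `${request.uniqueKey}#retry-${attempt}`,
        userData: { ...request.userData, softRetries: attempt },
    }]);
}
```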

Configuring playwright + crawlee js to bypass certain sites

I have noticed that some pages which appear completely normal are sometimes hard to fetch content from. After some investigation, it might have something to do with the site being behind Cloudflare. Do you have any suggestions on how to get past this? I believe in certain cases it's simply a matter of popups and accepting some cookies. I do have the stealth plugin added, but it still does not get through.
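
None of the following is a guaranteed bypass, but here is a sketch of Crawlee options that often help against Cloudflare-style protection: realistic fingerprints, Firefox instead of Chromium, headful mode, sticky sessions, and (usually most important) good residential proxies. The cookie-banner selector is only an example.

```js
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,                  // Firefox is often less heavily fingerprinted than Chromium
        launchOptions: { headless: false }, // headful mode tends to pass more checks
    },
    browserPoolOptions: {
        useFingerprints: true,              // generate realistic browser fingerprints
    },
    useSessionPool: true,
    persistCookiesPerSession: true,         // keep clearance cookies per session
    // proxyConfiguration: ...,             // residential proxies usually matter more than stealth plugins
    requestHandler: async ({ page }) => {
        // Dismiss cookie banners before extracting content; the selector is just an example.
        await page.click('button:has-text("Accept")', { timeout: 5_000 }).catch(() => {});
    },
});
```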

Target business owners #crawlee-js

Business Owners: Automate the Impossible — Before Your Competitors Do From securing high-demand tickets to automating bulk product checkouts, online reservations, and real-time data scraping Whether you're in:...

Pure LLM approach

How would you go about this problem? Given topic x, you want to extract data y from a list of website base URLs. Is there any built-in functionality for this? If not, how do you solve this? I have attempted crawling entire sites and one-shot prompting all of the aggregated content to an LLM with a context window of 1 million tokens or more. It seems to work okay, but I'm positive there are techniques to strip tags / unrelated metadata from each URL scraped within every site....
Solution:
Yeah, Crawlee doesn’t have a built-in way to strip irrelevant stuff like headers or ads automatically. You’re not missing anything — cleanup is still a manual step. You can use libraries like readability or unfluff to extract the main content, or filter DOM sections manually (like removing .footer, .nav, etc.). For trickier cases, you can even use the LLM to clean up pages before extraction. Embedding-based filtering is also a nice option if you want to skip irrelevant pages before sending to the LLM, but it adds complexity. You're on the right track — it's just about fine-tuning the cleanup now....
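
A sketch of the manual cleanup step described above, using CheerioCrawler and a hand-picked list of boilerplate selectors (the list is an example, tune it per site) before the text is stored for the LLM.

```js
import { CheerioCrawler, Dataset } from 'crawlee';

// Example boilerplate selectors; adjust per site.
const BOILERPLATE = 'script, style, nav, footer, header, aside, .nav, .footer, .cookie-banner';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        $(BOILERPLATE).remove();
        const text = $('body').text().replace(/\s+/g, ' ').trim();
        await Dataset.pushData({ url: request.url, text });
        // Later, feed the stored text (or chunks of it) to the LLM.
    },
});
```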

Anyone here automated LinkedIn profile analytics before?

Trying to build a dashboard that fetches data like impressions, followers, views, etc. Using Playwright with saved cookies, realistic headers, delays, etc., but still running into issues:
- Getting blocked by bot detection...

X's terms of service

Hello @Kacka H., do the Crawlee & Apify services abide by X's terms of service when I use them to collect tweets for academic purposes? Thanks in advance....

Invalidate request queue after some time

Hello! I would like to know if there's a built-in feature to invalidate (purge) the request queue after some time. Thanks!...
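
I'm not aware of a built-in TTL, but a named queue can be purged manually with drop() whenever your own staleness check decides it is too old. A sketch:

```js
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open('my-queue');

// ...later, when your own check decides the queue is stale:
await queue.drop();                                      // removes the queue and all its requests
const freshQueue = await RequestQueue.open('my-queue');  // re-create an empty one under the same name
```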

Apify Question:

Does anyone know a good option (preferably free, but I don't mind paying a little if it's good) that can extract data? I'm looking for a tool that can search specific websites (ideally 5 of my choice) for job offers and then forward the results to Make.com, which will handle the rest of the workflow....

Crawlee JS / Playwright help related to clicking

When building a scraper we sometimes need to click a button or another element, but after reading the Crawlee docs I didn't find anything related to clicking. Can someone please guide me on how to do it?
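
Clicking is not a Crawlee feature as such: inside a PlaywrightCrawler request handler you get a regular Playwright page, so any Playwright click API works. A sketch (the selectors below are placeholders):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // `page` is a normal Playwright Page object.
        await page.click('button#load-more');                        // click by CSS selector
        await page.getByRole('button', { name: 'Accept' }).click();  // click via a locator
        log.info(`Clicked buttons on ${page.url()}`);
    },
});

await crawler.run(['https://example.com']);
```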

LinkedIn DM Sync to DB

LinkedIn and Sales Navigator messaging capabilities are insufficient for efficient operations in our use case. We want to implement a system that enables real-time, bidirectional syncing of LinkedIn messages to a database, allowing us to build additional features on top of them (e.g. unified inbox functionality). What is the best approach to achieve this, considering LinkedIn’s API limitations and anti-automation policies?...

Crawlee PuppeteerCrawler not starting with Chrome Profile

I need a Chrome profile to run the scraper, since I need my session cookies to access certain pages. This is my code:
```js
...
```
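
Since the snippet above is cut off, here is a sketch of pointing PuppeteerCrawler at an existing Chrome profile via launchContext; the profile path is hypothetical, and Chrome itself must be closed so the profile directory is not locked.

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        useChrome: true,                                  // launch installed Chrome, not the bundled Chromium
        userDataDir: '/home/user/.config/google-chrome',  // hypothetical profile path; use your own
        launchOptions: { headless: false },
    },
    // A persistent profile generally can't be shared by several browser instances at once.
    maxConcurrency: 1,
    requestHandler: async ({ page, request }) => {
        // The profile's session cookies are available here.
    },
});
```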

enqueueLinks with urls doesn't trigger the router handler

Hello, my "search" handler enqueues a URL (I have verified that the URL exists and is valid) to my "subprocessors" handler, but for some reason it's not being triggered:
```js
router.addHandler( "search",
...
```
Solution:
Different domains, maybe? Have you tried a different strategy? Try with All; more information here: https://crawlee.dev/js/api/core/enum/EnqueueStrategy
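
A sketch of that suggestion: when the enqueued URL lives on a different domain, the default same-hostname strategy silently filters it out, so pass EnqueueStrategy.All explicitly (the URL below is an example).

```js
import { createPlaywrightRouter, EnqueueStrategy } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('search', async ({ enqueueLinks }) => {
    await enqueueLinks({
        urls: ['https://other-domain.example.com/subprocessors'], // example URL
        label: 'subprocessors',
        strategy: EnqueueStrategy.All, // don't filter out cross-domain URLs
    });
});

router.addHandler('subprocessors', async ({ request, log }) => {
    log.info(`Handling ${request.url}`);
});
```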

Is the default request queue the same for different crawler instances?

Hello everyone, I would like to know whether the default request queue (if not specified in the crawler options) is the same for all instances. I tried to run an HttpCrawler next to a PlaywrightCrawler, and for some unknown reason the HttpCrawler picked up a request that was meant for the PlaywrightCrawler...
Solution:
Yes, I believe so. If you don't specify a queue name, they both use the same default queue. The solution is to use named queues that you drop at the end of the Actor run. See: https://github.com/apify/crawlee/discussions/2026?utm_source=chatgpt.com#discussioncomment-6656135...
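
A sketch of that named-queue setup: each crawler gets its own queue, and both queues are dropped at the end of the run.

```js
import { HttpCrawler, PlaywrightCrawler, RequestQueue } from 'crawlee';

// Separate named queues so the two crawlers don't pick up each other's requests.
const httpQueue = await RequestQueue.open('http-queue');
const browserQueue = await RequestQueue.open('browser-queue');

const httpCrawler = new HttpCrawler({
    requestQueue: httpQueue,
    requestHandler: async ({ request }) => { /* ... */ },
});

const browserCrawler = new PlaywrightCrawler({
    requestQueue: browserQueue,
    requestHandler: async ({ request }) => { /* ... */ },
});

await httpCrawler.run(['https://example.com/api']);
await browserCrawler.run(['https://example.com/app']);

// Named queues are not purged automatically, so drop them when the run is done.
await httpQueue.drop();
await browserQueue.drop();
```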

What proxy providers work best with Crawlee?

We are trying to benchmark different proxies - which ones are the best?

Max requests per second

Hello! I would like to know: is there an option like maxRequestsPerMinute / maxTasksPerMinute, but per second? If not, what would be the easiest way to implement this? Always waiting 1 s in the request handler and relying on maxConcurrency?...
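
There is no per-second option that I know of; the closest built-in is maxRequestsPerMinute, so either express the per-second rate through it or throttle manually in a hook. A sketch of both:

```js
import { CheerioCrawler, sleep } from 'crawlee';

// Option 1: express the per-second limit through the built-in per-minute option.
const crawler = new CheerioCrawler({
    maxRequestsPerMinute: 5 * 60, // roughly 5 requests per second
    requestHandler: async ({ request }) => { /* ... */ },
});

// Option 2 (coarser): throttle manually before every request.
const throttledCrawler = new CheerioCrawler({
    maxConcurrency: 1,
    preNavigationHooks: [
        async () => {
            await sleep(1000); // with concurrency 1, this caps the rate at about 1 request per second
        },
    ],
    requestHandler: async ({ request }) => { /* ... */ },
});
```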