Apify & Crawlee

This is the official developer community of Apify and Crawlee.

crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻creators-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

wise-white · 2/27/2025

Shared external queue between multiple crawlers

Hello folks! Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead enqueue links to another queue service such as Redis? I would like to run multiple crawlers on a single website, and they would need to share the same queue so they don't process duplicate links. Thanks in advance!...
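Crawlee doesn't ship a Redis-backed request queue out of the box, so a full drop-in replacement means implementing its queue interface yourself. A lighter pattern is to keep each crawler's internal queue and use Redis only as the shared deduplication layer. A minimal sketch, assuming ioredis and a reachable Redis instance (the `crawl:seen-urls` key name is made up):

```typescript
import { CheerioCrawler } from 'crawlee';
import { Redis } from 'ioredis';

const redis = new Redis(); // assumes Redis on localhost:6379

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // Collect candidate links ourselves instead of calling enqueueLinks().
        const urls = $('a[href]')
            .map((_, el) => new URL($(el).attr('href')!, request.loadedUrl ?? request.url).href)
            .get();

        for (const url of urls) {
            // SADD returns 1 only for the process that inserted the member,
            // so whichever crawler sees a URL first "wins" it.
            const isNew = await redis.sadd('crawl:seen-urls', url);
            if (isNew === 1) await crawler.addRequests([url]);
        }
    },
});

await crawler.run(['https://example.com']);
```

Because SADD is atomic, every instance can run this same handler against the same site without enqueueing duplicates.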
correct-apricot · 2/27/2025

Reclaiming failed request back to the list or queue

Hello. I am facing this issue regularly. I am using Crawlee with Cheerio. How can I resolve this? ...
absent-sapphire · 2/21/2025

Disable write to disk

By default, data is written to ./storage. Is there a way to turn this off and use memory instead?
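If memory serves, Crawlee's storage layer can be told not to persist at all: the persistStorage Configuration option (equivalently the CRAWLEE_PERSIST_STORAGE env var) keeps datasets, key-value stores, and queues in memory only. A minimal sketch:

```typescript
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, log }) {
            log.info(`Visited ${request.url}`);
        },
    },
    // Second constructor argument: a Configuration that never touches ./storage.
    new Configuration({ persistStorage: false }),
);

await crawler.run(['https://example.com']);
```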

CheerioCrawler headerGenerator help

Hello! I kept reading the docs but couldn't find clear information about this. When we use Puppeteer or Playwright, we can tweak the fingerprintGenerator in browserPool. For Cheerio we have the headerGenerator from got; how can we adjust it inside the CheerioCrawler?...
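For the HTTP-based crawlers, the second argument of a preNavigationHook is the got-scraping options object for the upcoming request, and that's where headerGeneratorOptions can be tweaked. A sketch (the option values shown are just examples of what header-generator accepts):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        (_crawlingContext, gotOptions) => {
            // Roughly the same knobs the browser crawlers expose via fingerprintGenerator.
            gotOptions.headerGeneratorOptions = {
                browsers: [{ name: 'firefox', minVersion: 115 }],
                devices: ['desktop'],
                locales: ['en-US'],
                operatingSystems: ['linux'],
            };
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Fetched ${request.url}`);
    },
});
```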
other-emerald · 2/12/2025

Is it possible to bypass proxies for specific requests?

I have a use case where I want a crawler running permanently. This crawler has a tieredProxyList set up that it will iterate over in case some of the proxies don't work. For some pages I don't want to use proxies, to reduce the amount of money I spend on them (when I scrape my own page I don't want a proxy, but I do want to use the same logic/handlers). Is it possible to specify the proxy that should be used for specific requests, or maybe even the proxy tier? Basic setup: const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [['proxyTier1'], ['proxyTier2']] });...
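Besides tieredProxyUrls, ProxyConfiguration also accepts a newUrlFunction, which newer Crawlee versions call with the session id and (where available) the request, and returning null should mean "no proxy". A sketch under that assumption; the hostname and proxy URL below are placeholders:

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: (_sessionId, options) => {
        const url = options?.request?.url;
        // Placeholder for "my own page": go direct, no proxy spend.
        if (url && new URL(url).hostname === 'my-own-site.example') return null;
        return 'http://paid-proxy.example:8000'; // placeholder paid proxy
    },
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`${request.url} via ${proxyInfo?.url ?? 'no proxy'}`);
    },
});
```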
fascinating-indigo · 2/10/2025

Issue: Playwright-chrome Docker with pnpm

Hello! I'm trying to run the actor using pnpm instead of npm. Locally, running pnpm run start:dev, pnpm run start:prod, and apify run all work as expected. apify push is also successful. ...

More meaningful error than ERR_TUNNEL_CONNECTION_FAILED

Hi there. I am using a proxy to crawl some sites and encounter an ERR_TUNNEL_CONNECTION_FAILED error. I am using Bright Data as my proxy service. If I curl my proxy endpoint, I get a meaningful error. For example...
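ERR_TUNNEL_CONNECTION_FAILED is the generic symptom; the proxy's own response usually carries the real reason. One way to surface it is to probe the same endpoint with got-scraping (which Crawlee uses under the hood) outside the crawler, much like the curl check. A sketch, with PROXY_URL standing in for the Bright Data endpoint:

```typescript
import { gotScraping } from 'got-scraping';

try {
    const { statusCode, body } = await gotScraping({
        url: 'https://example.com',
        proxyUrl: process.env.PROXY_URL, // the same endpoint you'd pass to curl
    });
    console.log(statusCode, body.slice(0, 200));
} catch (err) {
    // The error's cause chain often includes the proxy's own message
    // (auth failure, blocked zone, etc.) rather than just the tunnel code.
    console.error(err);
}
```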
wise-white · 2/5/2025

Crawlee not respecting cgroup resource limits

Crawlee doesn't seem to respect resource limits imposed by cgroups. This poses problems for containerised environments, where either Crawlee gets OOM-killed or silently slows to a crawl because it thinks it has much more resource available than it actually does. Reading and setting the maximum RAM is pretty easy: ```typescript function getMaxMemoryMB(): number | null { const cgroupPath = '/sys/fs/cgroup/memory.max';...
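Completing that idea: read the cgroup v2 limit and hand it to Crawlee explicitly via the memoryMbytes Configuration option (the CRAWLEE_MEMORY_MBYTES env var is the equivalent), so autoscaling works from the container's budget instead of the host's. A sketch:

```typescript
import { readFileSync } from 'node:fs';
import { CheerioCrawler, Configuration } from 'crawlee';

// cgroup v2 exposes the byte limit in memory.max, or the literal string 'max'.
function getMaxMemoryMB(): number | null {
    try {
        const raw = readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
        if (raw === 'max') return null; // no limit imposed
        return Math.floor(Number(raw) / 1024 / 1024);
    } catch {
        return null; // not running under cgroup v2
    }
}

const memoryMbytes = getMaxMemoryMB();
const config = new Configuration(memoryMbytes ? { memoryMbytes } : {});

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, log }) {
            log.info(request.url);
        },
    },
    config,
);
```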
plain-purple · 2/4/2025

Looking for feedback/review of my scraper

It's already working, but I'm fairly new to scraping and just want to learn the best possible practices. The script is 300-400 lines (TypeScript) total and contains a login routine + session retention, network listeners, as well as DOM querying, and it runs on a Fastify backend. DM me if you are down ♥️...
rare-sapphire · 2/2/2025

Trying out Crawlee, Etsy not working...

Hi Apify,
Thank you for this fine auto-scraping tool Crawlee! I wanted to try it out along with the tutorial, but with a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, and it failed with PlaywrightCrawler. ``` // For more information, see https://crawlee.dev/...
vicious-gold · 1/31/2025

Only the first crawler runs in function

When running the example below, only the first crawler (crawler1) runs, and the second crawler (crawler2) does not work as intended. Running either crawler individually works fine, and changing the URL to something completely different also works fine. Here is an example. ``` import { PlaywrightCrawler } from 'crawlee'; ...
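A common cause of this, worth ruling out: both crawlers in the same process share the default RequestQueue, so by the time crawler2 starts, its requests are deduplicated against everything crawler1 already handled. Giving each crawler its own named queue keeps the runs independent. A sketch:

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const makeCrawler = (queue: RequestQueue) =>
    new PlaywrightCrawler({
        requestQueue: queue, // private queue instead of the shared default
        async requestHandler({ request, log }) {
            log.info(`Handled ${request.url}`);
        },
    });

const crawler1 = makeCrawler(await RequestQueue.open('crawler-1'));
const crawler2 = makeCrawler(await RequestQueue.open('crawler-2'));

await crawler1.run(['https://example.com']);
await crawler2.run(['https://example.com']); // no longer deduped against crawler1
```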
flat-fuchsia · 1/30/2025

How to retry only failed requests after the crawler has finished?

I finished the crawl with around 1.7M requests and got around 100k failed requests. Is there a way to retry just the failed ones?
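There's no built-in "retry failures from a finished run" switch that I'm aware of, so one pattern is to record exhausted requests in a named dataset via failedRequestHandler during the run, then feed them back in afterwards (here into a second run of the same crawler; a separate script works just as well). The uniqueKey override matters: without it, the queue would silently skip URLs it already marked as handled. A sketch:

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const failed = await Dataset.open('failed-requests');

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`OK: ${request.url}`);
    },
    // Called only after all retries for a request are exhausted.
    async failedRequestHandler({ request }) {
        await failed.pushData({ url: request.url, userData: request.userData });
    },
});

await crawler.run(['https://example.com']);

// Second pass: re-enqueue just the recorded failures.
const { items } = await failed.getData();
await crawler.run(
    items.map((item) => ({
        url: item.url as string,
        userData: item.userData,
        // Fresh uniqueKey, or the queue drops these as already handled.
        uniqueKey: `${item.url}#retry`,
    })),
);
```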
conscious-sapphire · 1/30/2025

Does Crawlee support ESM?

I'm trying to integrate with Nuxt 3; when I run in production mode it doesn't work: [nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js' ...
fascinating-indigo · 1/30/2025

Max Depth option

Hello! Just wondering whether it is possible to set a max depth for the crawl? Previous posts (2023) seem to make use of 'userData' to track the depth. Thank you....
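That userData pattern still appears to be the way to do it: there's no maxDepth crawler option I know of, so the depth travels with each request and enqueueLinks stamps children with depth + 1. A minimal sketch (MAX_DEPTH is arbitrary):

```typescript
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 3; // arbitrary cut-off for the sketch

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        const depth = (request.userData.depth as number) ?? 0;
        if (depth >= MAX_DEPTH) return; // deep enough: scrape, but don't enqueue

        await enqueueLinks({
            // Stamp every discovered link with its distance from the start URL.
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, depth: depth + 1 };
                return req;
            },
        });
    },
});

await crawler.run([{ url: 'https://example.com', userData: { depth: 0 } }]);
```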
flat-fuchsia · 1/29/2025

How can I pass context to createNewSession?

I want to use the existing crawler settings (JSON/Cheerio) when creating a new session, signing the user in / up there while associating cookies and a token with the session. Currently I do this new-session creation conditionally inside a preNavigation hook (the context is passed as an arg there), but not in createNewSession...
flat-fuchsia · 1/27/2025

How do I organize 1 auth per session, IP, and user agent?

I want to create a bunch of authenticated users, each with their own consistent browser, proxy, user agent, fingerprints, schedule, browsing pattern, etc.
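One hedged way to wire that up in Crawlee: cap the session pool at the number of accounts, let cookies persist per session, and key the proxy choice off the session id so each identity keeps one IP. The account/proxy lists and the hash helper below are hypothetical scaffolding:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const ACCOUNTS = ['alice', 'bob', 'carol']; // hypothetical account pool
const PROXIES = [
    'http://proxy-1.example:8000',
    'http://proxy-2.example:8000',
    'http://proxy-3.example:8000',
];

// Tiny deterministic hash so a session id always maps to the same proxy.
function hash(s: string): number {
    let h = 0;
    for (const c of s) h = (h * 31 + c.charCodeAt(0)) | 0;
    return Math.abs(h);
}

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true, // login cookies stick to their session
    sessionPoolOptions: {
        maxPoolSize: ACCOUNTS.length, // exactly one session slot per account
    },
    proxyConfiguration: new ProxyConfiguration({
        newUrlFunction: (sessionId) => PROXIES[hash(String(sessionId)) % PROXIES.length],
    }),
    async requestHandler({ session, log }) {
        log.info(`Running as session ${session?.id}`);
        // Per-session login / fingerprint warm-up would go here.
    },
});
```

If the fingerprint generator is enabled, fingerprints should stay stable per session as well, though that's worth verifying on your Crawlee version.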
flat-fuchsia · 1/26/2025

Is there a way to get the number of enqueued links?

I have the following code for AdaptivePlaywrightCrawler and I want to log the number of enqueued links after calling enqueueLinks. ` router.addDefaultHandler(async ({ request, enqueueLinks, parseWithCheerio, querySelector, log, page }) => {
await enqueueLinks({...
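enqueueLinks resolves to a batch result, so the count can be read straight off its return value; filtering out wasAlreadyPresent gives just the newly added links. A sketch with a plain Playwright router, assuming the adaptive crawler's enqueueLinks returns the same shape:

```typescript
import { createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    const result = await enqueueLinks({ strategy: 'same-domain' });

    // Processed = accepted by the queue; wasAlreadyPresent = deduplicated.
    const fresh = result.processedRequests.filter((r) => !r.wasAlreadyPresent);
    log.info(
        `Enqueued ${fresh.length} new links (${result.processedRequests.length} processed in total).`,
    );
});
```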
national-gold · 1/21/2025

One or multiple instances of CheerioCrawler?

Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have 2 to 3 thousand pages, while others might have just a few hundred (or even fewer). The thing I have doubts about is: if I put all starting URLs in the same crawler instance, it might finish scraping one domain way before another. I thought about separating domains, creating a crawler instance for each domain, so that I can run each crawler separately and let it run its own course. Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy? TIA...
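Separating the domains is a reasonable approach, and with CheerioCrawler (plain HTTP, no browser) the per-instance overhead is small. A sketch of one crawler per domain with its own named queue, run sequentially so each domain runs its own course (the domain list is a placeholder):

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

const START_URLS = ['https://site-a.example', 'https://site-b.example'];

for (const startUrl of START_URLS) {
    // Queue names only allow letters, digits, and dashes, so mangle the hostname.
    const queueName = new URL(startUrl).hostname.replace(/\./g, '-');
    const queue = await RequestQueue.open(queueName);

    const crawler = new CheerioCrawler({
        requestQueue: queue,
        async requestHandler({ request, enqueueLinks, log }) {
            log.info(`[${queueName}] ${request.url}`);
            await enqueueLinks({ strategy: 'same-domain' });
        },
    });

    await crawler.run([startUrl]);
}
```

Running them concurrently (e.g. Promise.all) works too, but multiplies memory use and makes the autoscaled pools compete for the same system resources.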
wise-white · 1/14/2025

Handling Dynamic Links with Crawlee PlaywrightCrawler

I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.
- Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
- Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
- Memory Overload: I get the warning "Memory is critically overloaded" during crawls...
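A hedged sketch covering all three points: wait for the loader to detach before touching anything, click each candidate link and record where it lands before navigating back, and lower concurrency to ease the memory pressure. The '.loader' selector is hypothetical; substitute whatever the site actually renders:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 2, // fewer parallel browser pages eases memory pressure
    async requestHandler({ page, request, crawler, log }) {
        // 1. Wait for the (hypothetical) loader element to disappear.
        await page
            .waitForSelector('.loader', { state: 'detached', timeout: 30_000 })
            .catch(() => log.warning('Loader never detached; continuing anyway'));

        // 2. Click each href-less link and record where it lands.
        const count = await page.locator('a').count();
        for (let i = 0; i < count; i++) {
            await Promise.all([
                page.waitForNavigation({ waitUntil: 'domcontentloaded' }).catch(() => null),
                page.locator('a').nth(i).click().catch(() => null),
            ]);
            if (page.url() !== request.loadedUrl) {
                await crawler.addRequests([page.url()]);
                // 3. Go back so the next index still points at the link list.
                await page.goBack({ waitUntil: 'domcontentloaded' });
            }
        }
    },
});

await crawler.run(['https://example.com']);
```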