Apify & Crawlee

This is the official developer community of Apify and Crawlee.


correct-apricot · 1/10/2025

AdaptivePlaywrightCrawler starts crawling the whole web at some point.

I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web. Here is my code. `const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1,...
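A common cause of this is enqueueing every discovered link without a scope. Crawlee's `enqueueLinks` accepts a `strategy` option (e.g. `'same-domain'` or `'same-hostname'`) inside the request handler to keep the crawl on the starting site. As a minimal pure-JS sketch of the same idea (the helper name `sameHostname` is hypothetical, not a Crawlee API):

```javascript
// Sketch: keep only links on the seed's hostname, mimicking what
// enqueueLinks({ strategy: 'same-hostname' }) does inside Crawlee.
const sameHostname = (seedUrl, candidateUrl) => {
  try {
    return new URL(seedUrl).hostname === new URL(candidateUrl).hostname;
  } catch {
    return false; // malformed URLs are never enqueued
  }
};

const seed = 'https://example.com/start';
const links = [
  'https://example.com/about',
  'https://other.com/page',
  'not a url',
];
const inScope = links.filter((l) => sameHostname(seed, l));
```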
other-emerald · 1/7/2025

Moving from Playwright to Crawlee/Playwright for Scraping

Are there actually any resources on building a scraper with Crawlee besides the ones in the docs? Where do I set all the browser context options, for example? ```javascript const launchPlaywright = async () => {...
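When moving from a hand-rolled `launchPlaywright` helper, browser launch flags generally move into the crawler's `launchContext.launchOptions`. A sketch of the rough option shape (option names assumed from Crawlee's docs; verify against your installed version):

```javascript
// Sketch: approximate shape of PlaywrightCrawler options, with browser
// launch flags nested under launchContext.launchOptions.
const crawlerOptions = {
  launchContext: {
    launchOptions: {
      headless: true,
      args: ['--disable-blink-features=AutomationControlled'],
    },
  },
  maxRequestsPerCrawl: 100,
};
```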
quaint-moccasin · 1/4/2025

How to scrape emails from LinkedIn

I am building a LinkedIn email scraper Actor and am running into some issues; I'd appreciate help with these. Scraped data: { name: 'Join LinkedIn', title: 'Not found', email: 'Not found', location: 'Not found'...
conscious-sapphire · 1/4/2025

How to implement persistent login with crawlee-js/playwright?

I need to scrape content from multiple pages of a social network (x.com) that requires auth. Where should I implement the login mechanism so that it happens before the URLs are followed, and persists for reuse as long as it is valid?
stormy-gold · 1/3/2025

Incremental Web scraping using Crawlee

Hey everyone. :perfecto: :crawlee: Currently I am working on scraping a website where new content (pages) is added frequently (say, a blog). When I run my scraper, it scrapes all pages successfully, but when I run it again, for example tomorrow (when new pages have been added to the website), it starts scraping everything from scratch. I would be thankful for any advice, ideas, solutions, or examples of efficiently re-scraping without crawling the entire site again. ...
conscious-sapphire · 12/30/2024

Managing the queue with Redis (or similar) and having worker nodes listen on the queue

I'm trying to run Crawlee for production use and to scale it so that we have a cluster of worker nodes ready to crawl pages on request. How can I achieve this? The RequestQueue basically writes requests to files and doesn't use any queueing system. I couldn't find any docs on how I can use a Redis queue or something similar....
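Crawlee does not ship a Redis-backed queue out of the box, but a common pattern is to keep the distribution layer outside Crawlee: worker nodes pop URLs from a shared Redis list and hand each one to a local crawler (e.g. via `crawler.addRequests([url])`). A sketch of the worker loop with an in-memory stand-in for the queue; a real deployment would replace `InMemoryQueue` with LPUSH/BRPOP calls through a client such as node-redis or ioredis:

```javascript
// In-memory stand-in for a shared remote queue (same pop/push contract
// a Redis-backed implementation would expose).
class InMemoryQueue {
  constructor(items = []) { this.items = [...items]; }
  async pop() { return this.items.shift() ?? null; }
  async push(item) { this.items.push(item); }
}

// Worker node: drain the queue one item at a time until it is empty.
// `handle` would wrap the local Crawlee crawl of that URL.
const runWorker = async (queue, handle) => {
  const processed = [];
  for (let url = await queue.pop(); url !== null; url = await queue.pop()) {
    processed.push(await handle(url));
  }
  return processed;
};
```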

Anyone managed to get past Datadome?

I've been struggling hard with Datadome on sites like Wellfound.com. Has anyone been able to crack this one?...
genetic-orange · 12/22/2024

Site can detect headless mode

I have a Crawlee Playwright bot that logs into a website and performs some actions on a schedule. I made a public version here without the site or actions: https://github.com/raywalz/web-automation-starter For some reason, the website can detect headless mode despite the stealth plugin; it works fine in headed mode, though. Any ideas? I have documentation on the setup in the readme of that project. I may give up and use Xvfb with headed mode all the time, as I've seen a previous post here mention, but I want to keep it headless if I can....
complex-teal · 12/20/2024

Still confusing...

Hello, I am now trying to crawl App Store reviews of an app for the South Korea locale. I tried the JSON input as { "appId": "ai.replika.app", "country": "kr",...
stormy-gold · 12/17/2024

Does CheerioCrawler share global state among its instances?

I implemented a class for creating a CheerioCrawler, adding routers, etc., and I extended this class to create specific implementations for various websites. When I run them, the crawl finishes after reaching the max request count I set. The problem is that the limit is counted across all the instances combined, and it stops after that, instead of handling each instance separately. ``` INFO CheerioCrawler: Starting the crawler. INFO CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 50 requests and will shut down soon. Requests that are in progress will be allowed to finish....
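This behaviour is consistent with all the instances sharing the default request queue; giving each crawler its own named queue (e.g. `requestQueue: await RequestQueue.open('site-a')`) separates them. A pure-JS illustration of shared vs. per-instance counting (the class names are hypothetical, purely to show the difference):

```javascript
// One count across ALL instances: like crawlers sharing the default
// request queue, they hit a combined limit together.
class SharedCounterCrawler {
  static handled = 0;
  handle() { return ++SharedCounterCrawler.handled; }
}

// Each instance counts separately: like crawlers given their own
// named request queues.
class OwnCounterCrawler {
  handled = 0;
  handle() { return ++this.handled; }
}
```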
stormy-gold · 12/16/2024

Error: Operation failed! (You cannot publish an Actor. Please, contact support.)

When I try to publish my Actor, I get the following error.
fair-rose · 12/12/2024

Lock file already held

Hi, I seem to be running into this issue with the lock file being held. I don't need to persist state, as I'm returning it in memory: ```javascript return callback(Object.assign(new Error('Lock file is already being held'), { code: 'ELOCKED', file })); ^...
fair-rose · 12/11/2024

Multiple PlaywrightCrawler instances: is it possible?

If I call `const crawler = new PlaywrightCrawler({})` multiple times, is there any state being shared between the instances?
fair-rose · 12/9/2024

Scrape/crawl transactionally rather than in batch

Hi, I'm looking to introduce website crawling into an existing workflow that doesn't suit batch processing, i.e. I want to scrape each website, get the result, and do some further processing downstream. I do have this working with the code attached; however, I imagine there's a better way to achieve this, given I'll be processing up to 500 websites concurrently, and my concern is memory allocation. ```javascript export async function crawlWebsiteForAddresses(url: string) { const ukPostcodeRegex = /\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b/;...
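For the memory concern, one option is to bound how many per-site crawls are in flight at once rather than launching all 500 together. A sketch of a concurrency-limited runner; each `task` call would wrap something like the `crawlWebsiteForAddresses(url)` above (the helper name `mapWithConcurrency` is hypothetical):

```javascript
// Run `task` over `items` with at most `limit` calls in flight,
// preserving result order.
const mapWithConcurrency = async (items, limit, task) => {
  const results = new Array(items.length);
  let next = 0;
  const worker = async () => {
    while (next < items.length) {
      const i = next++; // claim the next index (sync between awaits)
      results[i] = await task(items[i], i);
    }
  };
  const workers = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workers }, worker));
  return results;
};
```

With `limit` tuned to what one machine's memory tolerates, results still stream downstream as each site finishes.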
extended-salmon · 12/2/2024

How to close Puppeteer browser mid-run while continuing actor execution in crawlee?

Hi everyone, I’m using PuppeteerCrawler for scraping because it unblocks websites effectively and allows JavaScript execution. However, I’m facing an issue: After accessing a website, I extract the required data from network requests (e.g., HTML) and parse it later with cheerio....
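One way to free the browser early is to split the run into two phases: the browser phase only captures raw HTML bodies, and parsing happens afterwards with no browser alive. A structural sketch under that assumption (the phase functions are hypothetical; `fetchHtml` would wrap the Puppeteer navigation and `parse` the cheerio step):

```javascript
// Phase 1 (browser running): only capture raw HTML per URL.
const capturePhase = async (urls, fetchHtml) => {
  const captured = [];
  for (const url of urls) captured.push({ url, html: await fetchHtml(url) });
  return captured; // the browser can be closed once this resolves
};

// Phase 2 (browser closed): parse the captured HTML, e.g. with cheerio.
const parsePhase = (captured, parse) =>
  captured.map(({ url, html }) => ({ url, data: parse(html) }));
```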
sensitive-blue · 11/28/2024

What is the headless shell?

I noticed that `npx playwright install chromium` installs a Chromium headless shell, and now crawls run in those processes instead of the Chromium app. I think they take less CPU, but I couldn't find any information about them in the Crawlee docs...
vicious-gold · 11/28/2024

Downloading JSON and YAML files while crawling with Playwright

Hi there. Is it possible to detect the Content-Type header of responses and download JSON or YAML files? I'm using Playwright to crawl my sites and have some JSON and YAML content I would like to capture, as well.
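Yes — Playwright exposes response headers, so you can inspect `Content-Type` inside a `page.on('response', ...)` listener and call `response.body()` for the ones you want to keep. A sketch of the detection half (the helper name `structuredType` is hypothetical):

```javascript
// Classify a Content-Type header as JSON, YAML, or neither.
// Strips parameters such as "; charset=utf-8" before comparing.
const structuredType = (contentType) => {
  const mime = (contentType ?? '').split(';')[0].trim().toLowerCase();
  if (mime === 'application/json' || mime.endsWith('+json')) return 'json';
  if (mime === 'application/yaml' || mime === 'text/yaml' ||
      mime === 'application/x-yaml') return 'yaml';
  return null; // not a type we want to download
};
```

In the listener: when `structuredType(response.headers()['content-type'])` is non-null, await `response.body()` and write it out alongside your crawl results.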
like-gold · 11/25/2024

Digital Ocean

Is there any documentation around using Digital Ocean for a Crawlee scraper? I see options for GC and AWS, but I'm looking to just set something up on a droplet.
stormy-gold · 11/23/2024

`maxRequestsPerMinute`, but per session

:perfecto: Hey! Firstly I just want to thank you for creating such an amazing product ❤️ ! Question itself:...
other-emerald · 11/19/2024

Massive Scraper

Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations (some can share one). How can I achieve this in Crawlee such that they run in parallel and can all be executed with a single command, or also in isolation? Input, example repos, etc. would be highly appreciated...
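One common structure is a registry mapping site names to crawler factories: one CLI entry point can then run every registered site in parallel, or a chosen subset in isolation. A sketch with stub crawlers standing in for configured Crawlee instances (the registry shape and `runSites` helper are hypothetical):

```javascript
// Each factory would return a configured Crawlee crawler for one site;
// stubs here stand in for the real implementations.
const registry = {
  'site-a': () => ({ run: async () => 'crawled site-a' }),
  'site-b': () => ({ run: async () => 'crawled site-b' }),
};

// Run the named sites concurrently; with no argument, run them all.
const runSites = (names = Object.keys(registry)) =>
  Promise.all(names.map((n) => registry[n]().run()));
```

Wired to `process.argv`, `node crawl.js` runs everything and `node crawl.js site-b` runs one site in isolation.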