Apify & Crawlee

This is the official developer community of Apify and Crawlee.

sensitive-blue · 11/14/2024

await a promise set in a pre navigation hook

Hi all, I have a pre-navigation hook that listens for requests and, if they return images, saves them to the cloud. ```typescript...
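One common pattern for this kind of problem, sketched here without any Crawlee imports: the hook registers every promise it starts in a small tracker, and the request handler awaits the tracker before it finishes. `PendingTasks`, `fakeUpload`, and the example URLs are hypothetical names used only for illustration; they are not part of Crawlee or of the code in the post.

```typescript
// Hypothetical helper: collects promises started elsewhere (e.g. in a hook)
// so they can be awaited later (e.g. at the end of a request handler).
class PendingTasks {
  private tasks: Promise<Error | null>[] = [];

  // Store each task in a form that never rejects, so a failed upload
  // cannot surface as an unhandled rejection before settle() is called.
  track(task: Promise<unknown>): void {
    this.tasks.push(
      task.then(
        () => null,
        (err) => (err instanceof Error ? err : new Error(String(err))),
      ),
    );
  }

  // Wait for everything tracked so far and return only the failures.
  async settle(): Promise<Error[]> {
    const current = this.tasks;
    this.tasks = [];
    const results = await Promise.all(current);
    return results.filter((r): r is Error => r !== null);
  }
}

// Stand-in for a cloud upload (assumption); fails for URLs ending in ".bad".
function fakeUpload(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (url.endsWith(".bad")) reject(new Error(`upload failed: ${url}`));
      else resolve(url);
    }, 5);
  });
}

const pending = new PendingTasks();
// "Hook" side: start the uploads without awaiting them.
pending.track(fakeUpload("https://example.com/a.png"));
pending.track(fakeUpload("https://example.com/b.bad"));
// "Handler" side: wait for all of them before finishing the request.
const settled: Promise<Error[]> = pending.settle();
```

In a real crawler the tracker would need to live somewhere both the hook and the handler can reach for the same request; where to keep it is left open here.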
harsh-harlequin · 11/12/2024

Generative Bayesian Network Docs

I'm looking at the generative-bayesian-network package, part of the fingerprint suite: https://www.npmjs.com/package/generative-bayesian-network However, I can't find any kind of documentation whatsoever on this package. It looks interesting and I want to figure out how to use it. Are there docs anywhere for this?...

Does Crawlee support SOCKS5 proxies with authentication?

Does Crawlee support SOCKS5 proxies with authentication? I am building a crawler based on Crawlee with Playwright, and it needs to use SOCKS5 proxies with authentication. But I can't find anything about that in the Crawlee documentation ....
deep-jade · 11/10/2024

ERROR: We've encountered an unexpected system error. If the issue persists, please contact support.

Hi people, I am having this problem with Docker: it runs recursively and fails. It is on the Platform. I can't find an error and every single file of the project seems to be OK. Any idea? - Pulling Docker image of build XXXXX from repository - Creating Docker container - Starting Docker container...
correct-apricot · 11/6/2024

retryOnBlocked with HttpCrawler

Hi, I'm using the HttpCrawler to scrape a static list of URLs. However, when I get a 403 response as a result of a Cloudflare challenge, the request is not retried with retryOnBlocked: true. If I remove retryOnBlocked, I see my errorHandler getting invoked and the request is retried. Am I misunderstanding retryOnBlocked?
dependent-tan · 11/6/2024

Goodbye Crawlee (migrated to Hero)

I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works. Everything that worked with Crawlee works with Hero. Why I migrated: I could no longer handle the over-engineered Crawlee API (and the bugs related to it). There were just too many APIs (different APIs!) for my simple case. Hero's API is about 5 times simpler. ...
ambitious-aqua · 11/6/2024

PlaywrightCrawler proxy issue

My crawler with PlaywrightCrawler works just fine, but I have an issue when adding a proxy! This is the code: ```ts import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";...
subsequent-cyan · 11/3/2024

Stop Crawlee When Condition Met

I am trying to scrape an ecommerce site and would like to scrape only 20 items. How can I stop the process once that many items have been scraped?
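One possible approach, as a sketch: Crawlee's `maxRequestsPerCrawl` option caps the number of requests rather than the number of saved items, so the item count can be tracked separately and a stop triggered once the target is reached. The `ItemLimit` class below is a hypothetical helper written in plain TypeScript, not a Crawlee API; the `onLimit` callback is where a stop call (for example `crawler.autoscaledPool?.abort()`) would go.

```typescript
// Hypothetical helper: counts accepted items and fires a callback
// exactly once, at the moment the limit is reached.
class ItemLimit {
  private count = 0;
  private readonly max: number;
  private readonly onLimit: () => void;

  constructor(max: number, onLimit: () => void) {
    this.max = max;
    this.onLimit = onLimit;
  }

  // Returns true if the item should be kept, false once the limit is hit.
  tryAccept(): boolean {
    if (this.count >= this.max) return false;
    this.count += 1;
    if (this.count === this.max) this.onLimit();
    return true;
  }

  get accepted(): number {
    return this.count;
  }
}

// Simulated crawl: 50 candidate items, but only 20 should be saved.
let stopCalls = 0;
const limit = new ItemLimit(20, () => {
  stopCalls += 1; // in a real crawler, stop the crawl here
});

const kept: number[] = [];
for (let i = 0; i < 50; i++) {
  if (limit.tryAccept()) kept.push(i);
}
```

Because request handlers can run concurrently, checking `tryAccept()` right before saving each item keeps the saved count from overshooting even if a few extra pages are still in flight when the stop is triggered.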
ratty-blush · 11/2/2024

Crawlee stops after about 30 items pushed to the datastore, repeats the same data on next run.

I'm writing my first Actor using Crawlee and Playwright crawler to scrape website https://sreality.cz. I wrote a crawler using as much as possible from the examples in the documentation. It works like this: 1. Start on the first page of search, for example this one....
sensitive-blue · 10/31/2024

autoscale pool trying to scale up without sufficient memory

Hi all, I'm running a Playwright crawler and am running into a bit of an issue with crawler stability. Have a look at these two log messages ...

Max redirects

I am getting this error message; how do I best deal with it? "Reclaiming failed request back to the list or queue. Redirected 10 times. Aborting." Can I increase the max number of redirects for my CheerioCrawler?...
correct-apricot · 10/29/2024

Does anyone have an example of scraping multiple different websites?

The structure I am using does not look like the best. I am basically creating several routers and then doing something like: ```ts...
like-gold · 10/24/2024

How to override `maxRequestRetries` error log

there is a function ```typescript protected async _handleFailedRequestHandler(crawlingContext: Context, error: Error): Promise<void> { // Always log the last error regardless if the user provided a failedRequestHandler const { id, url, method, uniqueKey } = crawlingContext.request;...
provincial-silver · 10/23/2024

Log in to Instagram using Facebook

Hello, I am trying to log in to Instagram using Facebook with Playwright. I am struggling with a pop-up: I keep missing the right timing for accessing the "Allow all cookies" button. https://www.loom.com/share/a50934922679402cb46ecf59b80d88f7...
extended-salmon · 10/22/2024

enqueue urls / request queue not being unique

I'm seeing a lot of the exact same URLs being run twice. Any ideas?
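A debugging sketch that may help narrow this down: Crawlee deduplicates requests by `uniqueKey`, which by default is derived from the URL, so URLs that look identical but differ in a fragment, query-parameter order, or trailing slash are worth checking. The helper below groups a list of URLs by a simplified normalized form. `normalizeUrl` and `findDuplicates` are hypothetical names, and the normalization rules are an assumption for illustration, not Crawlee's actual algorithm.

```typescript
// Simplified normalization (assumption, not Crawlee's own rules):
// drop the fragment, sort query parameters, strip a trailing slash.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = "";
  url.searchParams.sort();
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// Group raw URLs by normalized form and keep only groups with 2+ members.
function findDuplicates(urls: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const raw of urls) {
    const key = normalizeUrl(raw);
    const group = groups.get(key) ?? [];
    group.push(raw);
    groups.set(key, group);
  }
  const duplicates = new Map<string, string[]>();
  groups.forEach((group, key) => {
    if (group.length > 1) duplicates.set(key, group);
  });
  return duplicates;
}

// Example input: the first two differ only in parameter order,
// trailing slash, and fragment.
const duplicates = findDuplicates([
  "https://example.com/a?b=2&a=1",
  "https://example.com/a/?a=1&b=2#top",
  "https://example.com/b",
]);
```

Running this over the URLs logged by the crawler shows which "identical" URLs actually differ as raw strings, which is usually the first thing to rule out.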

Issue with RequestQueue2

I am having an issue with 'queues'. Here is the scenario: I am rotating sessions and getting the following error: "Error: Detected a session error, rotating session..." and after 10 retries I eventually got: ...
like-gold · 10/19/2024

How to throttle enqueuing urls to next router

```ts splitAndExecute({ callback: async (urlBatch, batchIndex) => { // log that we are enqueuing the nth batch of preview jobs from job-id etc...
dependent-tan · 10/18/2024

Error: PlaywrightCrawler:SessionPool:Session "Cookie not in this host's domain"

I am using PlaywrightCrawler with Firefox. When accessing wellfound.com, I see this error:
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
...
conscious-sapphire · 10/17/2024

SEC-CH-UA header includes 'Headless Chrome' when using @sparticuz/chromium

I've been playing around with deploying PlaywrightCrawler to AWS Lambda and it's working well. I've used @sparticuz/chromium for the chrome exe as per this doc: https://crawlee.dev/docs/deployment/aws-browsers However, upon examining the request headers it's generating, I've discovered the sec-ch-ua hint header is always as follows: "HeadlessChrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129" ...
dependent-tan · 10/17/2024

A site that shows cloudflare captcha ALWAYS

I immediately get captcha on every URL. Accessing it in a normal GUI browser typing site homepage URL: captcha. Searching this site in google, clicking on the link in google results: browser shows site address and...: captcha. (by the way, they changed it, few months ago this site was not that restrictive)...