Apify & Crawlee

This is the official developer community of Apify and Crawlee.


broad-brown · 9/9/2024

Prevent automatic reclaim of failed requests

Hi everyone! Hope you're all doing well. I have a small question about Crawlee. My use case is a little simpler than a typical crawler; I just want to scrape a single URL every few seconds. To do this, I create a RequestList with just one URL and start the crawler. Sometimes the crawler returns HTTP errors and fails. However, I don't mind, as I'm going to run the crawler again after a few seconds, and I'd prefer the errors to be ignored rather than the requests being automatically reclaimed....
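Not a definitive answer, but one way to get this behaviour is the `maxRequestRetries` crawler option; a minimal config sketch, assuming a CheerioCrawler and a placeholder URL:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // With 0 retries, a failed request is marked as failed right away
    // instead of being reclaimed back to the queue.
    maxRequestRetries: 0,
    // Fires once retries are exhausted; a near-empty handler effectively
    // swallows the error, since the outer loop re-runs the crawler anyway.
    failedRequestHandler: async ({ request, log }) => {
        log.debug(`Ignoring failure for ${request.url}`);
    },
    requestHandler: async ({ $ }) => {
        // extract data from the single page here
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```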
plain-purple · 9/9/2024

How to make sure all external requests have been awaited and intercepted?

I'm scraping pages of a website as part of a content migration. Some of those pages make a few POST requests to Algolia (3-4 requests) on the client side, and I need to intercept those requests, since I need some data that is sent in the request body. One important thing to note is that I don't know which pages make the requests and which pages don't. Because of that, I need a way to wait for all the external requests FOR EACH PAGE and only start crawling the page HTML after that. That way, if I've waited for all the requests and still didn't intercept an Algolia request, it would mean that specific page didn't make one.

I created a solution that seemed to be working at first. However, after crawling the pages a few times, I noticed that sometimes it wouldn't show the Algolia data in the dataset for a few pages, even though I could confirm in the browser that the page makes the Algolia request. So my guess is that it finishes crawling the page HTML before intercepting that Algolia request (??). Ideally, it would only start crawling the HTML AFTER all the external requests have ended. I used Puppeteer because I found the addInterceptRequestHandler in the docs, but I could use Playwright if it's easier. Can someone here help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long)...
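One likely culprit is that the response listener is only attached after navigation has finished, so Algolia calls that fire early are missed. A rough sketch of an alternative, assuming a PuppeteerCrawler, that the Algolia endpoint URL contains "algolia", and that stashing a promise on the crawling context works (pre-navigation hooks and the request handler receive the same context object):

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async (crawlingContext) => {
            const { page } = crawlingContext;
            // Start listening BEFORE page.goto() runs, so early requests aren't missed.
            // Resolves to null if the page never calls Algolia within the timeout.
            crawlingContext.algoliaResponse = page
                .waitForResponse(
                    (res) => res.url().includes('algolia') && res.request().method() === 'POST',
                    { timeout: 15_000 }, // illustrative timeout; tune to the site
                )
                .catch(() => null);
        },
    ],
    requestHandler: async (crawlingContext) => {
        const { request, pushData } = crawlingContext;
        const response = await crawlingContext.algoliaResponse;
        // The POST body the page sent to Algolia, or null if no call was made.
        const algoliaBody = response ? response.request().postData() : null;
        await pushData({ url: request.url, algoliaBody });
    },
});
```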
optimistic-gold · 9/8/2024

Chromium issue in apify/actor-node-playwright-chrome:22

Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error: ``` azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occurred: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome ╔═════════════════════════════════════════════════════════════════════════╗...
sunny-green · 9/8/2024

How to launch a Crawlee browser so I can manually pass the Cloudflare anti-bot protection

This is my code to launch the browser in headless: false mode. I manually input the URL and try to pass the captcha challenges, but the challenges keep failing. This is the code: ``` const { launchPuppeteer } = require('crawlee');...
variable-lime · 9/2/2024

Apify CLI "create new actor" error

Hi everyone, I'm trying to use the CLI to create a new JavaScript, Crawlee + Cheerio project. However, I get this error:
Error: EINVAL: invalid argument, mkdir 'C:\Crawlee-latest\Crawlee\my-new-actor\C:'
It seems that it automatically appends a "C:" to the end of my current terminal path, which I believe makes it fail (the installation of the project is incomplete, as I'm missing modules like cheerio; I only have crawlee and apify in the project's package.json file). ...
metropolitan-bronze · 8/28/2024

Save HTML file using Crawlee

Has anybody tried downloading the HTML file of a URL using Crawlee? I was wondering if Crawlee has the capability to download the HTML file of a URL. I've just started using Crawlee and I'm really loving the experience.
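Crawlee hands you the raw HTML in the request handler, so saving it is mostly a matter of writing it to a key-value store; a minimal sketch, where the key derivation is just an illustrative slug:

```javascript
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, body }) => {
        // `body` holds the raw HTML of the response.
        // Derive a store key from the URL (keys only allow a limited charset).
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
        await KeyValueStore.setValue(key, body.toString(), { contentType: 'text/html' });
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```

When running locally, the files end up under storage/key_value_stores/default as .html files.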
other-emerald · 8/28/2024

How can I override the default logs of Crawlee?

Hello, I wonder how to override the default logs of the crawler; this is how it looks. These logs come from the basic-crawler library: (https://github.com/apify/crawlee/blob/3ffcf56d744ac527ed8d883be3b1a62356a5930c/packages/basic-crawler/src/internals/basic-crawler.ts#L891) ...
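If the goal is just to silence or reduce those messages, Crawlee re-exports its logger, and the global log level applies to the crawler's internal output too; a small config sketch:

```javascript
import { log, LogLevel } from 'crawlee';

// Only warnings and errors get through; the per-request INFO lines
// emitted by BasicCrawler are suppressed.
log.setLevel(LogLevel.WARNING);

// Or silence everything:
// log.setLevel(LogLevel.OFF);
```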
fascinating-indigo · 8/28/2024

Error in crawlee import: `Module '"cheerio"' has no exported member 'Element'`

After running npm run build on version 3.11.1 of Crawlee, I get the error below: ``` node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7:29 - error TS2305: Module '"cheerio"' has no exported member 'Element'. ...
national-gold · 8/27/2024

Unable to install crawlee on node 18

On yarn add crawlee, I got this error:
cheerio@1.0.0: The engine "node" is incompatible with this module. Expected version ">=18.17". Got "18.16.0"
Then, after upgrading my Node to 18.17, I tried again and got this:...
ratty-blush · 8/26/2024

Scraping Government Websites

I'm trying to scrape a business entity search like this (https://ccfs.sos.wa.gov/#/BusinessSearch) but I cannot get any results. To be fair - I'm definitely more of a novice. But from some online forum reading, I'm not sure if this is even possible. Specifically - I want to get the pdfs from the filing history on each individual business entity site. Can I do this with the Website Content Crawler or Web Scraper or am I in over my head?...
deep-jade · 8/26/2024

Is there a way to select the name of the dataset files?

I am scraping product information from Amazon, where every product has an ID (ASIN). Does the pushData method (or something else) allow me to choose the name of the generated file in the dataset? I'd prefer to use the ASIN code instead of the auto-incremented index, which doesn't make any sense in my case.
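As far as I know, Dataset.pushData() always auto-numbers its records, so one workaround is to write each product to a key-value store, where you control the key; a sketch in which the store name and ASIN are placeholders:

```javascript
import { KeyValueStore } from 'crawlee';

// Open a named store; locally this maps to
// storage/key_value_stores/amazon-products.
const store = await KeyValueStore.open('amazon-products');

// The record is saved as B0ABC12345.json instead of an
// auto-incremented index.
await store.setValue('B0ABC12345', { title: 'Example product', price: 9.99 });
```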
rival-black · 8/26/2024

Any relatively imported file throws an error in Crawlee

```
{
    "extends": "@apify/tsconfig",
    "compilerOptions": {
        "module": "NodeNext",
        "moduleResolution": "NodeNext",...
```
exotic-emerald · 8/24/2024

Why is Crawlee running old, deleted code for the first URL?

main.ts ``` import { PlaywrightCrawler } from 'crawlee'; import { router } from './routes.js'; ...
adverse-sapphire · 8/23/2024

How to debug seemingly empty HTML in a crawled response (CheerioCrawler)

I duplicated a custom Apify actor that was working great, didn't really change anything but a few selectors, and pointed it at a new site. Unfortunately, the actor seems to exit "successfully" after the first URL (the only start URL) is handled. None of my logging shows anything in the returned HTML, and enqueueLinks of course does nothing, yet Cheerio believes the page request responded successfully. How would I approach debugging this situation? I've so far checked that $('body').html() returns an empty string and attempted using a RESIDENTIAL proxy geolocated near the website in case it was clever blocking, but no success. The URL being scraped is https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/shampoo/all?page=1&count=48...
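When narrowing something like this down, it helps to log what the crawler actually received before assuming a selector problem; a small debugging sketch (nothing here is specific to that site):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, response, body, log }) => {
        // Status, content type and a snippet of the body usually reveal
        // whether you got a block page, a redirect, or a JS-only shell.
        log.info(`url=${request.url} status=${response.statusCode} ` +
                 `type=${response.headers['content-type']}`);
        log.info(`first 500 chars: ${body.toString().slice(0, 500)}`);
    },
});
```

If the snippet shows a mostly empty document with script tags, the site likely renders client-side and a browser crawler (Playwright/Puppeteer) would be the next thing to try.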
other-emerald · 8/21/2024

Issues with dependencies when attempting to bundle

Hello, I'm attempting to bundle Crawlee with a VS Code extension using esbuild. However, I'm running into many of the same issues listed here: https://github.com/apify/crawlee/issues/2208 I attempted to follow the same setup that one user got working to bundle Crawlee here: https://github.com/apify/crawlee/issues/2208#issuecomment-1987270051 Additionally, I've had to add some things to the excludes in my esbuild config and copy specific files that Crawlee looks for into my bundle....

How should I fix the userData if I run two different crawlers in the same app?

I am building a scraping app and encountered an issue. I set up two different crawler tasks in the same app. When the first crawler task is completed, the app uses the abort method to exit the first task and then starts the second one. However, the task object obtained in the route handler still contains the task configuration of the first crawler task. Every time I run a crawler instance, I create it with new. The route handlers on the instance are also created with new, returning new instances each time, not following a singleton pattern. The userData I pass in is also the task object for the current run....
typical-coral · 8/18/2024

Running userscripts in Playwright/Puppeteer crawler

Hey all! I am making a crawler for a site that uses captchas. There's a userscript available to solve these captchas, and instead of rewriting its entire logic for a Crawlee crawler, I was wondering if I could instead inject this custom JS code into the site and then have Crawlee interact with that :)
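Injecting the script before every navigation should be possible with Puppeteer's page.evaluateOnNewDocument(), called from a pre-navigation hook; a sketch in which the file path is hypothetical, and note that userscript-manager APIs like GM_* won't exist unless you shim them:

```javascript
import { readFile } from 'node:fs/promises';
import { PuppeteerCrawler } from 'crawlee';

// Load the userscript source once; 'solver.user.js' is a placeholder path.
const userscript = await readFile('solver.user.js', 'utf8');

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs the script in the page before any of the site's own
            // scripts execute, on every navigation.
            await page.evaluateOnNewDocument(userscript);
        },
    ],
    requestHandler: async ({ page }) => {
        // interact with the page once the userscript has done its work
    },
});
```

With Playwright the equivalent hook would call page.addInitScript() instead.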
unwilling-turquoise · 8/14/2024

Cheerio not persisting cookies

Cheerio is not able to persist cookies that are set in the session. I have persistCookiesPerSession: true, and I also verified that the cookie is being saved in the session in the requestHandler. But when I print out the request headers, the cookie header is not present. The session in preNavigationHooks also does not contain the cookies. ```ts const crawler = new CheerioCrawler({ minConcurrency: 1, maxConcurrency: 10,...
unwilling-turquoise · 8/14/2024

Cheerio Fingerprint

Is there a way to use fingerprints with the Cheerio crawler? I need it to send Firefox headers; it's currently sending Chromium ones. ``` Host: localhost:8000 Connection: keep-alive...
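CheerioCrawler's headers come from got-scraping's header generator, which can be steered through headerGeneratorOptions; one way in, to the best of my knowledge, is the pre-navigation hook, whose second argument holds the got-scraping options. A sketch, with the version floor as an illustrative value:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            // Ask the header generator for Firefox-shaped headers.
            gotOptions.headerGeneratorOptions = {
                browsers: [{ name: 'firefox', minVersion: 115 }], // illustrative floor
                operatingSystems: ['windows'],
            };
        },
    ],
    requestHandler: async ({ $ }) => {
        // ...
    },
});
```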