Apify & Crawlee

This is the official developer community of Apify and Crawlee.

continuing-cyan (10/17/2024)

bot detection (captcha) changed; Playwright + Crawlee + Firefox + rotating proxies no longer helps

I have a program using Playwright + Crawlee + Firefox + rotating proxies to scrape jobs from wellfound.com. In May 2024 (and earlier) it worked quite well for many months, despite the captcha protection on the site. Today I get HTTP 403 and a captcha (from ct.captcha-delivery.com). My code has not changed! Proxies: iproyal.com "residential-proxies", session time 1 min ("sticky session"). What I did: in the same session I accessed URL1 and then URL2. URL1 has no captcha; URL2 contains the info I need and is/was protected with a captcha. In the past the trick with "URL1 and then URL2 in the same session" worked well. Today I get a captcha when accessing URL2....
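A minimal sketch of the sticky-session setup described in the post, using Crawlee's session pool; the proxy URL and credentials below are placeholders, and the maxUsageCount value is an assumption meant to roughly match the one-minute sticky window:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder sticky-session proxy endpoint (hypothetical credentials).
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@geo.iproyal.com:12321'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    // Reuse cookies within a session so URL1 and URL2 share state.
    persistCookiesPerSession: true,
    // Retire each session after a couple of requests, approximating
    // the short sticky window of the proxy.
    sessionPoolOptions: { sessionOptions: { maxUsageCount: 2 } },
    requestHandler: async ({ page, request, log }) => {
        log.info(`Loaded ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://wellfound.com/']);
```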
extended-salmon (10/15/2024)

Chromium version error in path

Hey Playwright creators! 👋 I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on: The Error:...
inland-turquoise (10/14/2024)

Scrape JSON and HTML responses in different handlers

I don't know how to scrape a website that returns both JSON and HTML responses. My scraper needs to: 1. Send a request and parse a JSON response which contains a list of URLs that I will enqueue. 2. Scrape those URLs as HTML using cheerio or whatever is required to do so....
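One way to structure this is a single CheerioCrawler with a router, allowing JSON via additionalMimeTypes. A sketch, where the LIST/DETAIL labels, the start URL, and the `urls` field of the JSON payload are all assumptions:

```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Handler for the JSON listing endpoint (label LIST).
router.addHandler('LIST', async ({ json, crawler }) => {
    // Hypothetical payload shape: { urls: string[] }.
    const { urls } = json as { urls: string[] };
    await crawler.addRequests(urls.map((url) => ({ url, label: 'DETAIL' })));
});

// Handler for the HTML detail pages (label DETAIL), parsed with cheerio.
router.addHandler('DETAIL', async ({ $, request, pushData }) => {
    await pushData({ url: request.url, title: $('title').text() });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // Accept JSON responses in addition to HTML.
    additionalMimeTypes: ['application/json'],
});

await crawler.run([{ url: 'https://example.com/api/list', label: 'LIST' }]);
```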
extended-salmon (10/11/2024)

Playwright with Firefox: New Windows vs Tabs and Chromium-specific Features

Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with. 1. New windows instead of tabs: I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens a new window for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer new tabs to open instead of separate windows. ...

crawler.run only scrapes the first URL

Hi, my problem is that crawler.run(['https://keepa.com/#!product/4-B07GS6ZB7T', 'https://keepa.com/#!product/4-B0BZSWWK48']) only scrapes the first URL. I think this is because Crawlee thinks they are the same URL. If I replace the "#" with a "?" it works. Is there any way to make it work with URLs like this?
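If fragment normalization is indeed the cause, one fix is the keepUrlFragment request option, so the "#!product/..." part stays in the deduplication key. A sketch:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, log }) => {
        log.info(`${request.url}: ${await page.title()}`);
    },
});

// By default the URL fragment (#...) is stripped when Crawlee computes
// the deduplication key, so both URLs collapse into one request.
// keepUrlFragment preserves it; an explicit uniqueKey per request
// would work as well.
const urls = [
    'https://keepa.com/#!product/4-B07GS6ZB7T',
    'https://keepa.com/#!product/4-B0BZSWWK48',
];

await crawler.run(urls.map((url) => ({ url, keepUrlFragment: true })));
```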
extended-salmon (10/10/2024)

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
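A failedRequestHandler goes in the crawler options alongside the router; it is called once a request has exhausted its retries, after which the crawler moves on to the next request. A minimal sketch, with the router import path assumed:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js'; // router built as in the blog post (assumed path)

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestRetries: 2,
    // Invoked after retries are exhausted; the run then continues with
    // the next request instead of stalling.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`${request.url} failed: ${(error as Error).message}`);
    },
});

await crawler.run(['https://example.com']);
```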
rare-sapphire (10/10/2024)

WebRTC IP leak?

Hi, so for the last couple of days I have been on a quest to evade detection for a project, which has proved quite challenging. As I researched the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need special setup or anything?
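One blunt workaround, independent of the fingerprint-suite fix, is to disable WebRTC at the browser level; for Firefox that is a user pref passed through the launch context. A sketch:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            // Disable WebRTC entirely so no STUN request can reveal the
            // real IP, regardless of fingerprint injection.
            firefoxUserPrefs: {
                'media.peerconnection.enabled': false,
            },
        },
    },
    requestHandler: async ({ page, log }) => {
        log.info(await page.title());
    },
});
```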
rare-sapphire (10/8/2024)

Crawlee Playwright is detected as bot

Checking on this page, Crawlee Playwright is detected as a bot due to CDP: https://www.browserscan.net/bot-detection. This is a known issue, also discussed on:...
extended-salmon (10/8/2024)

How can I wait to process further logic until all requests from a batch are processed?

Hi, I have this code: ```typescript async processBatch(batch) {...
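Since the snippet is truncated, this is only a guess at the shape: crawler.run() itself resolves only after every request in the queue has been handled, so awaiting it is usually enough. A sketch, where processBatch is the hypothetical method from the post:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

// Hypothetical processBatch: run() resolves once the whole queue
// (this batch plus anything enqueued from it) has been processed.
async function processBatch(batch: string[]) {
    await crawler.run(batch);
    // ...logic here runs strictly after the batch is done
}

await processBatch(['https://example.com/a', 'https://example.com/b']);
```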
optimistic-gold (10/7/2024)

Puppeteer browser page stuck on redirections

When I use Puppeteer and the fingerprint injector with the generator, some redirects make the Puppeteer page (Firefox/Chromium) get stuck. After these redirections the page stops logging my interceptors (they just write the URL) and stops responding to resizing. If I create a new page manually in this browser and follow the link with redirections, it's fine. Without the injector and generator everything works fine too...
adverse-sapphire (10/7/2024)

Saving scraped data from dynamic URLs using Crawlee in an Express Server?

Hello all. I've been trying to build an app that triggers a scraping job when the API is hit. The initial endpoint hits a Crawlee router which has 2 handlers: one for the url-list scraping and the other for scraping the detail from each detail page. (The url-list handler enqueues the next url-list page to the url-list handler too, btw.) ...
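A stripped-down sketch of that shape, with the route path, handler bodies, and single-job-at-a-time assumption all hypothetical:

```typescript
import express from 'express';
import { PlaywrightCrawler, Dataset } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    const url = String(req.query.url ?? '');
    if (!url) return res.status(400).json({ error: 'url query param required' });

    // One crawler per request keeps runs isolated; for concurrent jobs
    // you would want uniquely named queues/datasets instead.
    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ request, page, pushData }) => {
            await pushData({ url: request.url, title: await page.title() });
        },
    });
    await crawler.run([url]);

    const { items } = await Dataset.getData();
    res.json(items);
});

app.listen(3000);
```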
itchy-amethyst (10/4/2024)

All requests from the queue have been processed, the crawler will shut down.

I'm working on a news web crawler and setting purgeOnStart=false so that I don't scrape duplicated news. However, sometimes I get the message "All requests from the queue have been processed, the crawler will shut down." and the crawler doesn't run. Any suggestion to fix this issue?
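One likely cause: with purgeOnStart=false the request queue persists across runs, so start URLs that were already handled are skipped and the crawler exits immediately. A common workaround (a sketch; the URL is a placeholder) is to salt the uniqueKey of pages that should be revisited on every run, while article URLs keep their default keys and stay deduplicated:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Visiting ${request.url}`);
    },
});

// Salt the listing page's uniqueKey with today's date so it is
// re-enqueued daily even though the queue is never purged.
const today = new Date().toISOString().slice(0, 10);
await crawler.run([{
    url: 'https://example-news-site.com/latest',
    uniqueKey: `https://example-news-site.com/latest#${today}`,
}]);
```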
genetic-orange (10/2/2024)

Crawlee not working with Cloudflare

It keeps returning 403 even with a rotating proxy pool. Source code: ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';...
vicious-gold (10/1/2024)

How to define a custom delimiter on the Dataset.exportToCSV method?

The default delimiter is "," but I want to use "|" instead. In the "DatasetExportToOptions" that I can use with the "Dataset.exportToCSV" method there is no way to define the delimiter; the options cover other things. Is there another solution for this?
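Since the export options don't expose a delimiter, one workaround is to read the items and serialize the CSV yourself. A sketch (no escaping of "|" inside values; the output filename is arbitrary):

```typescript
import { writeFile } from 'node:fs/promises';
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Derive the header row from the first item and join everything with "|".
const headers = Object.keys(items[0] ?? {});
const lines = [
    headers.join('|'),
    ...items.map((item) => headers.map((h) => String(item[h] ?? '')).join('|')),
];

await writeFile('export.csv', lines.join('\n'));
```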
sunny-green (9/26/2024)

Express better than Node with Crawlee? Or is there really no big difference?

Is Express better than plain Node with Crawlee, or is there really no big difference? Any shortcomings with Express over Node with Crawlee or the Apify SDK?...
plain-purple (9/23/2024)

Save a webpage to a PDF file using Actor.setValue()

Hi, I'm new to PuppeteerCrawler. I'm trying to create a simple script to save a webpage as a PDF. For this purpose, I created a new Actor from the Crawlee - Puppeteer - TypeScript template in Apify. This is my main.ts code: ```typescript import { Actor } from 'apify'; import { PuppeteerCrawler, Request } from 'crawlee'; ...
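A working shape for this (a sketch; the key name and PDF format are arbitrary) is to let page.pdf() produce the buffer and pass an explicit contentType to Actor.setValue so the platform stores it as a PDF rather than JSON:

```typescript
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page }) => {
        // Note: page.pdf() only works in headless Chromium.
        const pdf = await page.pdf({ format: 'A4' });
        await Actor.setValue('page.pdf', pdf, { contentType: 'application/pdf' });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```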
national-gold (9/22/2024)

Any suggestions for improving the speed of the crawling run?

Hello there! Besides reducing the scope of what is being crawled, for example the number of pages, what can we do to accelerate the run? Any suggestions are welcome; I'm simply curious....
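Two common levers, sketched below with arbitrary numbers: raise concurrency and block heavy resources (images, fonts, media) before navigation:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Let the autoscaled pool go higher if CPU/memory allow it.
    maxConcurrency: 20,
    preNavigationHooks: [
        async ({ page }) => {
            // Abort image/font/media requests; pages load much faster
            // when you only need the DOM.
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                return ['image', 'font', 'media'].includes(type)
                    ? route.abort()
                    : route.continue();
            });
        },
    ],
    requestHandler: async ({ page, pushData }) => {
        await pushData({ title: await page.title() });
    },
});
```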
extended-salmon (9/9/2024)

Prevent automatic reclaim of failed requests

Hi everyone! Hope you're all doing well. I have a small question about Crawlee. My use case is a little simpler than a crawler; I just want to scrape a single URL every few seconds. To do this, I create a RequestList with just one URL and start the crawler. Sometimes the crawler returns HTTP errors and fails. However, I don't mind, as I'm going to run the crawler again after a few seconds, and I'd prefer the errors to be ignored rather than automatically reclaimed....
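Setting maxRequestRetries: 0 should do exactly that: a failed request is not reclaimed but handed straight to failedRequestHandler. A sketch:

```typescript
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    // No retries: a failure is final and is not put back into the queue.
    maxRequestRetries: 0,
    requestHandler: async ({ request, sendRequest }) => {
        const { body } = await sendRequest({ url: request.url });
        // ...process body
    },
    failedRequestHandler: async ({ request, log }) => {
        log.warning(`Ignoring failure for ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```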
deep-jade (9/9/2024)

How to make sure all external requests have been awaited and intercepted?

I'm scraping pages of a website as part of a content migration. Some of those pages make a few POST requests to algolia (3-4 requests) on the client side, and I need to intercept those requests, since I need some data that is sent in the request body. One thing that is important to note is that I don't know which pages make the requests and which pages don't. Because of that, I need a way to await all the external requests FOR EACH PAGE and only start crawling the page HTML after that. That way, if I have awaited all the requests and still didn't intercept the algolia request, it means that specific page didn't make a request to algolia.

I created a solution that seemed to be working at first. However, after crawling the pages a few times, I noticed that sometimes it wouldn't show the algolia data in the dataset for a few pages, even though I could confirm in the browser that those pages make the algolia request. So my guess is that it finishes crawling the page HTML before intercepting that algolia request (??). Ideally, it would only start crawling the HTML AFTER all the external requests have ended.

I used Puppeteer because I found addInterceptRequestHandler in the docs, but I could use Playwright if it's easier. Can someone here help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long)...
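Without seeing the gist, one robust pattern (a sketch; the "algolia" URL match and the 10-second timeout are assumptions) is to attach the listener in a preNavigationHook, so requests fired while the page loads are not missed, and then bound the wait in the handler so pages that never call algolia don't block forever:

```typescript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Start collecting algolia POST bodies before navigation begins.
            request.userData.algoliaBodies = [];
            page.on('request', (req) => {
                if (req.url().includes('algolia')) {
                    request.userData.algoliaBodies.push(req.postData());
                }
            });
        },
    ],
    requestHandler: async ({ page, request, pushData }) => {
        // Bounded wait: pages that never call algolia fall through after
        // the timeout instead of blocking the handler.
        await page
            .waitForResponse((res) => res.url().includes('algolia'), { timeout: 10_000 })
            .catch(() => null);
        await pushData({ url: request.url, algolia: request.userData.algoliaBodies });
    },
});
```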