Apify & Crawlee

This is the official developer community of Apify and Crawlee.


crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻creators-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

extended-salmon · 11/15/2024

Actor keeps showing memory issues

I keep getting this error message: "The Actor hit an OOM (out of memory) condition. You can resurrect it with more memory to continue where you left off." It keeps resurrecting from the failed status and then running into the same issue; however, out of 32 GB of memory it only uses 1-2 GB before the error appears...
wise-white · 11/15/2024

How to use CrawleeLogFormatter?

I want to adapt Crawlee's log format. From my research, it seems I need to use the CrawleeLogFormatter API. However, I couldn't find any usage examples for this API. Could you explain how to use it?
metropolitan-bronze · 11/12/2024

Scraping Capabilities with navigation

Hello Apify Support Team, I hope this message finds you well. I would like to inquire if it's possible to use Apify to scrape data from the following page: https://fr.iherb.com/specials. Specifically, my objective is to: Scrape all listed products on the specials page....
fair-rose · 11/8/2024

user_data is not working as expected

For some reason, when I push data to user_data (str type in this case) and then read it back in another handler, I get different values. In this case the error is on tab_number. When I push tab_number to user_data the values seem fine (values ranged from 1 to 100), but when I read tab_number in tab_handler I get a different value. For example, for values from 1 to 19 I get tab_number 1 instead of the correct tab_number: tab_number pushed to user_data: "19", tab_number requested from user_data: "1". I cannot find the error. Here is the code: ...
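A minimal sketch of the pattern described above, assuming recent crawlee Python import paths (crawlee.crawlers; older releases use e.g. crawlee.playwright_crawler) and hypothetical URLs. One thing worth checking: if several tabs resolve to the same URL, the request queue deduplicates them by unique_key and only the first request's user_data is kept, which can produce exactly this kind of mismatch.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # Enqueue one request per tab, carrying the tab number along as a string.
        for tab_number in range(1, 101):
            await context.add_requests([
                Request.from_url(
                    f'https://example.com/item?tab={tab_number}',  # hypothetical URL
                    label='tab',
                    user_data={'tab_number': str(tab_number)},
                )
            ])

    @crawler.router.handler('tab')
    async def tab_handler(context: PlaywrightCrawlingContext) -> None:
        # The value should come back exactly as it was stored on this request; if it
        # does not, check whether several tabs share one URL (unique_key deduplication).
        tab_number = context.request.user_data['tab_number']
        context.log.info(f'tab_number = {tab_number}')

    await crawler.run(['https://example.com/item'])  # hypothetical start URL


if __name__ == '__main__':
    asyncio.run(main())
```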
foreign-sapphire · 11/5/2024

Debugging and troubleshooting Crawlee

When there is an error in the response handler, it does not show an error; it fails unnoticed and Crawlee retries 9 more times. To illustrate, take the following syntax error: element = await context.page.query_selector('div[id="main"') ...
subsequent-cyan · 11/1/2024

How to retry when hit with 429

When using crawlee-js it works fine, but when using Python a 429 response is not retried. Is there anything I am missing? I am using BeautifulSoupCrawler. Please help....
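Not an authoritative answer, but one generic way to sketch a manual retry, assuming the handler still receives the 429 response and that context.http_response exposes the status code: raise inside the request handler so the crawler re-enqueues the request, bounded by max_request_retries. The start URL is a placeholder.

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # max_request_retries bounds how many times a failed request is re-attempted.
    crawler = BeautifulSoupCrawler(max_request_retries=5)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        if context.http_response.status_code == 429:
            # Raising marks the request as failed, so the crawler retries it later.
            raise RuntimeError('Rate limited (429), will retry')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```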
correct-apricot · 10/30/2024

Trying to get rid of 429 errors

Hello, do you have any nice tips on how to get rid of 429 errors? I am not exactly sure how parallelism works here, but I am afraid that even if I put the process to sleep, the other parallel requests still count as requests and can lead to 429s. Is there any nice tip/best practice for how I can defend against this? 😄
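One hedged way to throttle instead of sleeping inside handlers is ConcurrencySettings: max_tasks_per_minute caps the overall request rate across all parallel tasks, so a sleep in a single handler is not needed. The numbers and start URL below are placeholders to tune.

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        concurrency_settings=ConcurrencySettings(
            max_concurrency=5,        # at most 5 requests in flight at once
            max_tasks_per_minute=60,  # global rate cap, applied across all parallel tasks
        ),
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```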
foreign-sapphire · 10/30/2024

Simple POST-example

Flaw in the tutorial on basic POST functionality: https://crawlee.dev/python/docs/examples/fill-and-submit-web-form. It makes an actual POST request, but the data is not reaching the server; tried on various endpoints. ...
foreign-sapphire · 10/30/2024

Adding session-cookies

After following the tutorial on scraping with Crawlee, I cannot figure out how to add specific cookies (key-value pairs) to the request, e.g. sid=1234. There is something like a session and a session pool, but how do I reach these objects? And max_pool_size of the session pool defaults to 1000; should one then iterate through the sessions in the session pool to set the session ID in session.cookies (a dict)?...
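A minimal sketch of one way to attach a specific cookie without touching the session pool at all: set a Cookie header on the request itself. The URL is a placeholder, and sid=1234 mirrors the value in the question.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})

    # The cookie travels with this request only; no need to iterate the session pool.
    await crawler.run([
        Request.from_url('https://example.com', headers={'Cookie': 'sid=1234'}),  # placeholder URL
    ])


if __name__ == '__main__':
    asyncio.run(main())
```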
correct-apricot · 10/27/2024

How to set concurrency/CPUs/memory correctly

Hello, I would like to use PlaywrightCrawler for scraping, but it is not clear from the documentation how to correctly set up concurrency, memory, CPUs, etc. Can someone help me out? What is the best practice for setting up this crawler so that scraping runs in parallel? Thanks in advance!
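Not the official recipe, but a sketch of the knobs that usually matter for parallel PlaywrightCrawler runs: ConcurrencySettings controls how many pages run at once (memory and CPU usage scale roughly with max_concurrency, since each slot is a browser page), and max_requests_per_crawl acts as a safety cap. All numbers and the start URL are placeholders to tune.

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=2,      # keep at least 2 pages busy
            max_concurrency=10,     # never open more than 10 pages at once
            desired_concurrency=5,  # starting point; autoscaling adjusts within the bounds
        ),
        max_requests_per_crawl=1000,  # safety cap for a single run
        headless=True,
    )

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url, 'title': await context.page.title()})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```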
fair-rose · 10/24/2024

How to store data in the same dict from different URLs?

I have a list of results where I enqueue the link for each item. For each item I need to crawl internal pages (tabs), extract the data in tables, and add the data to the same dict. I can extract the data from all the pages with the router and enqueue_links, but I am not able to gather all the data in the same dict for each item. What is the best way to do this?
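One possible pattern (a sketch, not the only answer): instead of enqueueing all tabs independently, chain them, carrying the partially built record along in user_data and calling push_data only from the last tab. The tab paths, extraction logic and start URL below are hypothetical.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

TABS = ['/overview', '/specs', '/reviews']  # hypothetical tab paths for one item


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.handler('item')
    async def item_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Start the chain at the first tab, with an empty record travelling along.
        await context.add_requests([
            Request.from_url(
                context.request.url + TABS[0],
                label='tab',
                user_data={'item_url': context.request.url, 'record': {}, 'tab_index': 0},
            )
        ])

    @crawler.router.handler('tab')
    async def tab_handler(context: BeautifulSoupCrawlingContext) -> None:
        data = context.request.user_data
        record = dict(data['record'])
        # Merge whatever this tab contributes (placeholder extraction).
        record[TABS[data['tab_index']]] = context.soup.get_text(strip=True)[:100]

        next_index = data['tab_index'] + 1
        if next_index < len(TABS):
            # Chain to the next tab, passing the growing record along.
            await context.add_requests([
                Request.from_url(
                    data['item_url'] + TABS[next_index],
                    label='tab',
                    user_data={'item_url': data['item_url'], 'record': record, 'tab_index': next_index},
                )
            ])
        else:
            # Last tab reached: the record for this item is complete.
            await context.push_data({'item_url': data['item_url'], **record})

    await crawler.run([Request.from_url('https://example.com/item/1', label='item')])  # hypothetical


if __name__ == '__main__':
    asyncio.run(main())
```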
deep-jade · 10/21/2024

Website Content Crawler Issue

Dev team, hope you're doing well! I'm running the "Website Content Crawler" actor by Apify and I've run into an issue with the automation I'm building. Here's what's happening: when a lead enters my database (Airtable), I want to use the Website Content Crawler to scrape the website and provide LLM-ready data. A manual trigger within Airtable starts the run on the website. There's a timeout limitation of 120 seconds whether you run the scrape synchronously or not....
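One hedged way around a 120-second synchronous limit is to start the run asynchronously through the API client and fetch the dataset later (or let a webhook fire when the run finishes): start() returns immediately instead of blocking until the run completes. The token and start URL below are placeholders.

```python
from apify_client import ApifyClient

client = ApifyClient(token='<APIFY_TOKEN>')  # placeholder token

# start() kicks off the run and returns right away, unlike call(), which blocks
# until the run finishes and is where the synchronous timeout bites.
run = client.actor('apify/website-content-crawler').start(
    run_input={'startUrls': [{'url': 'https://example.com'}]},  # placeholder URL
)
print('Run started:', run['id'], 'dataset:', run['defaultDatasetId'])

# Later (e.g. from a webhook-triggered automation step), read the results:
# items = client.dataset(run['defaultDatasetId']).list_items().items
```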
environmental-rose · 10/13/2024

How can I use the Playwright Crawler and BeautifulSoup Crawler in the same Actor?

This is so that Playwright can fill in and submit a website search page that uses dynamic JavaScript. When the results are shown, I want to use the BeautifulSoup crawler to open each product page and parse the information. If I use Playwright to open each product page, it takes a very long time. I cannot seem to run both crawlers at the same time.
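The two crawlers do not have to run at the same time. A sketch of running them one after the other: Playwright handles the dynamic search form and collects product URLs, then BeautifulSoupCrawler fetches those pages over plain HTTP. The selectors, search term and start URL are placeholders; depending on the crawlee version you may also want to give each crawler its own named request queue.

```python
import asyncio

from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)


async def main() -> None:
    product_urls: list[str] = []

    # Stage 1: Playwright fills in and submits the dynamic search form.
    browser_crawler = PlaywrightCrawler()

    @browser_crawler.router.default_handler
    async def search_handler(context: PlaywrightCrawlingContext) -> None:
        await context.page.fill('input[name="q"]', 'laptop')   # placeholder selectors/term
        await context.page.click('button[type="submit"]')
        await context.page.wait_for_selector('a.product')
        for link in await context.page.query_selector_all('a.product'):
            href = await link.get_attribute('href')
            if href:
                product_urls.append(href)

    await browser_crawler.run(['https://example.com/search'])  # placeholder start URL

    # Stage 2: plain HTTP + BeautifulSoup for the (static) product pages.
    soup_crawler = BeautifulSoupCrawler()

    @soup_crawler.router.default_handler
    async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})

    await soup_crawler.run(product_urls)


if __name__ == '__main__':
    asyncio.run(main())
```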
ambitious-aqua · 10/7/2024

How to use proxies with Playwright? And, what are the best proxy service providers?

How to use proxies with Playwright? And what are the best proxy service providers? Note that I'm new to web scraping and I'm using Crawlee Python.
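No recommendation on providers here, but wiring proxies into PlaywrightCrawler looks roughly like this (a sketch; the proxy URLs are placeholders for whatever provider you choose, and Apify Proxy is one option if you run on the platform):

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://user:password@proxy-1.example.com:8000',  # placeholder proxies
            'http://user:password@proxy-2.example.com:8000',
        ]
    )

    # The crawler rotates over the configured proxies for its browser traffic.
    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url, 'title': await context.page.title()})

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```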
foreign-sapphire · 10/2/2024

Parallel Scraping

Does anyone know when parallel scraping and request locking are coming to the Python version?
inland-turquoise · 10/1/2024

Pydantic Exception

I'm building a request queue of URLs and most run fine, but I receive the following exception and I'm not sure how to proceed: pydantic_core._pydantic_core.ValidationError: 1 validation error for Request user_data.__crawlee.state Input should be 0, 1, 2, 3, 4, 5, 6 or 7 [type=enum, input_value='RequestState.REQUEST_HANDLER', input_type=str]...
deep-jade · 10/1/2024

How to send post request (I'm doing reverse engineering)

I'm conducting reverse engineering and have discovered a link that retrieves all the data I need using the POST method. I've copied the request as cURL to analyze the parameters required for making the correct request. I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework. How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared?...
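A sketch of one way to issue the prepared POST through Crawlee instead of httpx, assuming a recent crawlee Python version in which Request.from_url accepts method, payload and headers; the URL, body and header values below are placeholders standing in for whatever the cURL analysis produced.

```python
import asyncio
import json

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # The response body would be parsed here; just log the status for the sketch.
        context.log.info(f'POST {context.request.url} -> {context.http_response.status_code}')

    # Rebuild the request discovered while reverse engineering (placeholder values).
    await crawler.run([
        Request.from_url(
            'https://example.com/api/search',
            method='POST',
            payload=json.dumps({'query': 'shoes', 'page': 1}).encode(),
            headers={'Content-Type': 'application/json'},
        ),
    ])


if __name__ == '__main__':
    asyncio.run(main())
```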
constant-blue · 9/29/2024

Need Some Help On How To Send Datasets Automatically To Actors

Hey guys, I've been trying to figure out how to integrate the following actors: 1. compass/crawler-google-places 2. vdrmota/contact-info-scraper 3. lukaskrivka/dedup-datasets. What I'm looking to do: 1. The Google Maps Scraper runs and the results from its dataset go to the Contact Scraper to enrich the data (DONE, since they integrate using the Apify integration). 2. After point 1 I have two datasets (one from the Google Maps Scraper and the other from the Contact Scraper). I want to reference these datasets in the third actor (Dedup-Datasets), so it will merge/match the data into a clean output....
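A rough sketch of handing both datasets to the third actor via the Python API client. The dataset IDs and token are placeholders, and the input field names ('datasetIds', 'fields', 'output') are assumptions; check lukaskrivka/dedup-datasets' input schema on its Apify Store page before relying on them.

```python
from apify_client import ApifyClient

client = ApifyClient(token='<APIFY_TOKEN>')  # placeholder token

# Dataset IDs produced by the Google Maps run and the Contact Info run
# (placeholders; typically taken from each run's default dataset).
maps_dataset_id = '<MAPS_DATASET_ID>'
contacts_dataset_id = '<CONTACTS_DATASET_ID>'

# Call the dedup/merge actor and hand it both datasets; field names below are
# assumptions to verify against the actor's input schema.
run = client.actor('lukaskrivka/dedup-datasets').call(
    run_input={
        'datasetIds': [maps_dataset_id, contacts_dataset_id],
        'fields': ['placeId'],
        'output': 'unique-items',
    }
)

# Read the merged output from the run's default dataset.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)
```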
foreign-sapphire · 9/25/2024

How to revisit a URL that has already been scraped?

Hi, I'm making a simple app that gets updated information from a website. It lives inside a FastAPI app and uses AsyncIOScheduler to run the script every day. The issue is that since the crawler has already visited the main page, it will not revisit the page on the next call. I've done a lot of research but couldn't find a solution; other scrapers have something like a force= parameter to force the scrape. ...
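A sketch of one workaround: requests are deduplicated by unique_key, so giving the URL a fresh unique_key on every scheduled run makes the crawler treat it as new. The URL is a placeholder; in a long-lived FastAPI process the scheduler job would call something like this each day.

```python
import asyncio
from datetime import datetime, timezone

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def scrape_once() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})

    # A fresh unique_key per run prevents the request queue from deduplicating
    # the URL against yesterday's visit.
    run_stamp = datetime.now(timezone.utc).isoformat()
    await crawler.run([
        Request.from_url('https://example.com', unique_key=f'https://example.com#{run_stamp}'),
    ])


if __name__ == '__main__':
    asyncio.run(scrape_once())
```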
multiple-amethyst · 9/25/2024

Limit extraction for free plan users

I have built an Instagram profile scraper in Python, but I want to limit the scraping results to 25 for free plan users (not paid plan users). Can anybody help me out?...