Apify & Crawlee

This is the official developer community of Apify and Crawlee.


mute-gold · 12/30/2024

Google Maps Extractor: retrieve a location with a place ID

Hello, I need to retrieve a location and I have its place ID. I'm using Google Maps Extractor (the fastest one), but it does not work even when I pass the ID like this: place_id:ChIJUVJFh84jyUwRKS6pYdw9O5w. It works with Google Maps Scraper, but that one is too slow.
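A minimal sketch of passing a place ID to the extractor through the Apify API client for Python; the Actor ID and the input field names here are assumptions based on common Apify Store conventions, so check the Actor's input schema before relying on them:

```python
from apify_client import ApifyClient

client = ApifyClient('<YOUR_APIFY_TOKEN>')

# Hypothetical input: the extractor may expect the place ID under a
# different field, so verify against the Actor's input schema.
run = client.actor('compass/google-maps-extractor').call(
    run_input={
        'searchStringsArray': ['place_id:ChIJUVJFh84jyUwRKS6pYdw9O5w'],
        'maxCrawledPlacesPerSearch': 1,
    }
)

# The scraped place ends up in the run's default dataset.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)
```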
fair-rose · 12/26/2024

What is the fix for this?

Error: Operation failed! (You currently don’t have the necessary permissions to publish an Actor. This is expected behavior. Please contact support for assistance in resolving the issue.)
correct-apricot · 12/23/2024

Splitting the handlers into multiple files and testing

Hello! Ideally I would like one file per website I'm scraping (so some files will contain more than one handler). I'm trying to work out the best pattern for that. Following the docs, I have router = Router[BeautifulSoupCrawlingContext]() as a global variable in my routes.py, but then I either need to pass that router around as a singleton into the different handler files, or import the handler files into routes.py and register the handlers there. The latter sounds better, but then I have something like webpage_handler.py with my handler_one(context) and handler_two(context), and I register them in routes.py. That works, but it doesn't look too pretty: ``` @router.handler("my_label") async def handler(context: BeautifulSoupCrawlingContext) -> None: handler_one(context)...
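One pattern worth sketching: keep the shared Router in its own module and let each site module import it, registering its handlers as a side effect of being imported. The module and label names below are hypothetical, and the import paths assume a recent crawlee-python release:

```python
# routes.py - the single shared router instance.
from crawlee.crawlers import BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


# toystore_handlers.py - hypothetical per-site module; importing it
# registers its handlers on the shared router.
from routes import router

@router.handler('TOYSTORE_DETAIL')
async def toystore_detail(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Toy store detail: {context.request.url}')


# main.py - import the site modules only for their registration side effect.
import toystore_handlers  # noqa: F401
from crawlee.crawlers import BeautifulSoupCrawler
from routes import router

crawler = BeautifulSoupCrawler(request_handler=router)
```

Registering via the decorator inside each module avoids passing the router around explicitly; the only coupling left is the import in main.py.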
fascinating-indigo · 12/22/2024

infinite scrolling - best practice

I'm crawling a products page on an online toy store. There is an 'all products' page that loads 20 products at a time; it's an infinite-scrolling scenario, so no pagination buttons. I followed the tutorial from the blog: I await the networkidle state and use the infinite_scroll method to keep the page scrolling. After each scroll, 20 more products load and the process repeats. The issue I'm facing is that there are thousands of products to retrieve, so eventually a timeout error is thrown. To me, this approach seems rather inefficient. Not only is the timeout an issue (which I'm sure can be adjusted), but since it might take an hour to scroll through all products, any random event (a popup, or some anti-scraping challenge) might appear and end the run before the enqueued product-detail links can even be processed. I get that using API requests, where possible, is preferable, but I haven't identified those yet. I do see that after each scroll the URL is appended with ?page=2; scroll down again and it becomes ?page=3, etc. ...
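Given that the site exposes a ?page=N query parameter, one alternative worth sketching is to skip the browser and infinite scroll entirely and fetch the paginated listing URLs directly with an HTTP-based crawler. The URL pattern, page count, and CSS selector below are assumptions, not taken from the actual site:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Hypothetical selector; collect the 20 product links on each page.
        # (These could be enqueued via context.enqueue_links instead.)
        for link in context.soup.select('a.product-card'):
            context.log.info(f"Product link: {link.get('href')}")

    # Each ?page=N URL returns the next batch of 20 products, so the
    # pages can be fetched independently instead of scrolled through.
    await crawler.run(
        [f'https://toystore.example/products?page={n}' for n in range(1, 101)]
    )


if __name__ == '__main__':
    asyncio.run(main())
```

This removes the single long-lived page session, so one popup or timeout no longer kills the whole run.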
helpful-purple · 12/21/2024

Who is familiar with web crawling?

Hello, everyone! Who is familiar with web crawling? I have a paid project but I can't do it myself. Who is confident they can take it on? Budget: $500 ...

Actor Unexpectedly Stopped After 3600s Despite Longer Timeout Setting

Issue description: Our Actor unexpectedly terminated after running for 3600 seconds (1 hour), despite: • Configured timeout: 360,000 seconds...
foreign-sapphire · 12/12/2024

Concurrency Settings vs Autoscaling Pool

I am really curious about the gap between what I configure and what I actually see. I am deploying on a beefy EC2 instance with the following settings: concurrency_settings = ConcurrencySettings(...
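For reference, a minimal sketch of how ConcurrencySettings is typically passed to a crawlee-python crawler (the values are placeholders). The autoscaled pool still scales within these bounds based on measured CPU and memory load, which is why the observed concurrency can sit well below max_concurrency:

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler

concurrency_settings = ConcurrencySettings(
    min_concurrency=10,        # the pool never drops below this
    desired_concurrency=50,    # starting point for autoscaling
    max_concurrency=200,       # hard upper bound
    max_tasks_per_minute=600,  # overall rate limit
)

crawler = BeautifulSoupCrawler(concurrency_settings=concurrency_settings)
```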
foreign-sapphire · 12/12/2024

Pass args to handler

Hey, I have a crawler which scrapes a lot of different websites, each with multiple URLs. Each website has an associated ID that I need in the dataset. ...
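A common pattern for this is to attach the website ID to each request's user_data and read it back inside the handler; a sketch with placeholder names:

```python
from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Read the website ID back from the request's user_data.
    site_id = context.request.user_data['site_id']
    await context.push_data({'site_id': site_id, 'url': context.request.url})


# Attach the ID when the requests are created (IDs and URLs are placeholders).
requests = [
    Request.from_url('https://site-a.example', user_data={'site_id': 'a'}),
    Request.from_url('https://site-b.example', user_data={'site_id': 'b'}),
]
```

Each request then carries its own ID into the dataset rows it produces.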
correct-apricot · 12/5/2024

Playwright increase timeout

While using Playwright with proxies, the page sometimes takes longer to load. How can I increase the load timeout?
Page.goto: Timeout 30000ms exceeded
...
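One way to raise the navigation timeout in crawlee-python's PlaywrightCrawler (a sketch; the hook decorator and context type assume a recent crawlee release) is to set Playwright's default navigation timeout in a pre-navigation hook:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightPreNavCrawlingContext

crawler = PlaywrightCrawler()


@crawler.pre_navigation_hook
async def raise_timeout(context: PlaywrightPreNavCrawlingContext) -> None:
    # Lift Playwright's default 30 s navigation timeout to 120 s so
    # slow proxied loads no longer abort page.goto().
    context.page.set_default_navigation_timeout(120_000)  # milliseconds
```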
flat-fuchsia · 12/4/2024

Clear URL queue at end of run?

I'm a data reporter at CBS News using Crawlee to archive web pages. Currently, when I finish a crawl, the next crawl continues crawling pages enqueued by the previous one. Is there an easy fix for this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do. Happy to provide a code sample if helpful. ...
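Two options worth sketching, assuming crawlee-python's default local storage: purge the default storages on startup via configuration, or drop the request queue explicitly once a crawl finishes. The attribute and method names follow crawlee-python's configuration and storage APIs; verify them against your installed version:

```python
from crawlee.configuration import Configuration
from crawlee.storages import RequestQueue

# Option 1: start every run with fresh default storages (same effect
# as setting the CRAWLEE_PURGE_ON_START environment variable).
Configuration.get_global_configuration().purge_on_start = True


# Option 2: delete the queue, including pending requests, after a run.
async def cleanup() -> None:
    request_queue = await RequestQueue.open()
    await request_queue.drop()
```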
ratty-blush · 12/2/2024

How to set the RAM used by the crawler

I've scoured the docs and used ChatGPT/Perplexity, but I cannot for the life of me work out how to set the RAM available to the crawler. I want to give it 20 GB; I have a 32 GB system.
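In crawlee-python the memory ceiling is a configuration value rather than a crawler argument; a sketch, assuming the memory_mbytes field and its CRAWLEE_MEMORY_MBYTES environment variable behave as in current Crawlee releases:

```python
from crawlee.configuration import Configuration

# Let the autoscaled pool use up to 20 GB of RAM (value is in megabytes).
# Equivalent: export CRAWLEE_MEMORY_MBYTES=20480 before the run.
Configuration.get_global_configuration().memory_mbytes = 20 * 1024
```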
afraid-scarlet · 11/27/2024

ERROR You are being rate limited automatically due to high usage or misuse

Hi, I'm getting this error in the logs: ERROR You are being rate limited automatically due to high usage or misuse, such as monitoring activities. Please contact the developer (Tweet Scraper V2 (Pay Per Result) - X / Twitter Scraper). At https://sublime.app we use the scraper to create a card with a tweet's data every time a user adds a tweet to their Sublime library. ...
evident-indigo · 11/26/2024

Apify proxies

Hello, I would like to use some proxies with Crawlee, and I'm curious how strong Apify's free proxy is. Is that what is used in the backend by default right now? Will I feel the difference when I switch to the standard plan, and how can I implement it in Python? Thanks!
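For the implementation part, a sketch of pointing crawlee-python at Apify Proxy through its documented HTTP endpoint; the 'auto' username selects a datacenter proxy, and the password placeholder comes from your Apify console:

```python
from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://auto:<APIFY_PROXY_PASSWORD>@proxy.apify.com:8000'],
)

crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)
```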
continuing-cyan · 11/25/2024

Fingerprint generator error

I'm trying to use the fingerprint generator, but I'm running into problems.
correct-apricot · 11/22/2024

Proxy configuration: am I doing it wrong?

```python
async def main() -> None:
    """The crawler entry point."""
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[...
```
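For comparison, a completed version of that entry point as a sketch; the proxy URLs are placeholders and the import paths assume a recent crawlee-python release:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    """The crawler entry point."""
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://user:pass@proxy-1.example:8000',  # placeholder
            'http://user:pass@proxy-2.example:8000',  # placeholder
        ],
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log which proxy served the request to confirm rotation works.
        context.log.info(f'{context.request.url} via {context.proxy_info}')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```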
foreign-sapphire · 11/21/2024

actor

"I want to use an Instagram scraper, but it gives this message:
'The actor you are looking for could not be found.'
What's the problem?"...
eastern-cyan · 11/20/2024

How to maintain the same session when using enqueue_links()?

I get the following error: The session has been lost.
stormy-gold · 11/19/2024

How to set launchContext?

Just like the code below in crawlee-js:
```js
launchContext: {
    // Native Puppeteer options
    launchOptions: {
        args: [
            '--disable-features=TrackingProtection3pcd',
            '--disable-web-security',
            '--no-sandbox',
        ],
...
```
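In crawlee-python the closest equivalent is passed straight to the crawler; a sketch, assuming PlaywrightCrawler's browser_launch_options parameter, which forwards the options to Playwright's browser launch:

```python
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    browser_launch_options={
        'args': [
            '--disable-features=TrackingProtection3pcd',
            '--disable-web-security',
            '--no-sandbox',
        ],
    },
)
```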
correct-apricot · 11/15/2024

Actor keeps showing memory issues

I keep getting this error message: "The Actor hit an OOM (out of memory) condition. You can resurrect it with more memory to continue where you left off." It keeps resurrecting from failed status and then running into the same issue; however, out of 32 GB of memory it only uses 1-2 GB before the error appears. ...