Apify & Crawlee

This is the official developer community of Apify and Crawlee.

constant-blue 8/20/2024

Crunchbase

Hello, I have a question about https://apify.com/curious_coder/crunchbase-scraper. Do I need a Crunchbase account to get, say, funding results? If I have an account with you, will I be able to scrape all search results for, say, https://www.crunchbase.com/discover/funding_rounds/911b14126f22caf2fb5adaf7f66bee67, or will I only get the top xx visible results? Thanks.
magic-amber 8/15/2024

How to send POST requests

Hello. How can I use normal request params such as headers, cookies, and JSON in the enqueue_links method?
sensitive-blue 8/14/2024

How to save network requests made by the webpage I am scraping?

Hello, what I'm trying to scrape is not the actual content on the page, but rather its network requests. I don't need anything too fancy: I basically want to dump everything one would see in the Network tab of the browser's Inspect tool. I tried searching through the docs, but "request" gives back a lot of unrelated results, since that word is pretty central to how Crawlee works :). If there is another tool that would be more appropriate for this, please let me know. I still need to be able to deal with JS-heavy pages, and I still need to be able to follow links. It's just that the end product I need is requests, not page elements...
molecular-blue 8/12/2024

Resume unfinished queue

Hi, I'm wondering how the persisted queue can be used to resume a crawl that has stopped (e.g. the Playwright crawler example at https://crawlee.dev/python/docs/examples/playwright-crawler ends with 1 outstanding link in the queue). I've looked at the RequestQueue class, and I can see there is one item in the request_queue JSON dataset that has a place in the queue, but it's not obvious to me how I could resume this queue. Any pointers? Thanks!
jolly-crimson 8/12/2024

Python - Selenium - Chrome driver

Hi everyone, I have been using Python/Selenium for years now, but I am new to Apify. I have two questions. Question 1, regarding the Live View: I understood that the Apify Console allows you to monitor your Actor's execution in real time through the Live View. When I run an Actor in headful mode, the Live View tab is deactivated. I also can't access the container web server to monitor the execution of my Actor. Is it possible to use the Live View and the container web server with Python, Selenium, and Chrome driver?...
like-gold 8/7/2024

Cookies and other inputs

Hello everyone, I am new to Crawlee. I have been using the Apify API version, and now I want to apply the same logic with the Python version. My pain point is the input values from the API version: I couldn't find out where I should set them in the Python version...
magic-amber 8/6/2024

Why do I get the same URL for different sessions? What am I doing wrong?

Hey guys. Why do I get the same URL for different sessions? What am I doing wrong? import json import random...
eastern-cyan 8/2/2024

How to handle TimeoutErrors

Hey, we've just started using Crawlee 0.1.2 for some basic web scraping tasks, and I'm unsure from the docs how to handle cases where the HTTP request times out. Below is a simple script that scrapes a webpage and outputs the number of anchor tags on the page. This particular site blocks the request (in that it never responds), and the script hangs indefinitely. ...
crude-lavender 7/29/2024

Issue with Extracting <table> Data Using Apify

Hi, I'm using Apify to create a chatbot with the following code: ``` apify = ApifyWrapper() ...
sensitive-blue 7/29/2024

How to pass proxies using Selenium

I am trying to deploy my Python Actor on Apify, but I need to use proxies with Selenium so that I don't get blocked. Is there a sample Actor I can check?...
vicious-gold 7/21/2024

Robots.txt

Hey, do you have any idea how to respect robots.txt? Do we have to code that ourselves?
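
One possible approach, sketched with Python's standard library rather than any Crawlee-specific feature: `urllib.robotparser` can parse a site's robots.txt and report whether a given user agent may fetch a URL. The user agent name `my-crawler`, the helper names, and the sample rules below are assumptions for illustration; in a real crawl the robots.txt text would first be downloaded from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/public/page",
    "https://example.com/private/data",
]
# Keep only the URLs the rules permit before handing them to a crawler.
allowed = [u for u in candidates if is_allowed(parser, "my-crawler", u)]
```

A check like this could run before URLs are added to the queue, so disallowed pages are never requested.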
rival-black 7/19/2024

How do you add a custom link 🔗

Hello 😊 In most cases I use enqueue_links to add links via a selector. On rare occasions, I need to add a custom link with a label when crawling (adding a calculated ?page=x to the URL). How can I do that?...
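
For the "calculated ?page=x" part of this question, a minimal sketch using only the standard library is shown below. The helper name `with_page` and the example URLs are hypothetical, and how the resulting URL (and its label) is handed to the crawler depends on the Crawlee version in use, so that step is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page(url: str, page: int) -> str:
    """Return url with its 'page' query parameter set to the given number.

    Existing query parameters are kept. Note that repeated keys are
    collapsed to their last value because a dict is used here.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)
    return urlunparse(parts._replace(query=urlencode(query)))


# Build the next few listing pages from a hypothetical start URL.
next_pages = [with_page("https://example.com/list?sort=asc", n) for n in range(2, 5)]
```

Rebuilding the query with `urllib.parse` instead of string concatenation avoids producing URLs like `...?sort=asc?page=2` when the original already has a query string.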
like-gold 7/17/2024

Can I transfer the data scraped to Azure database directly inside Apify?

I'd like to import the data scraped to my Azure database. Is there any way to do it?
correct-apricot 7/15/2024

Yarn start not generating api-typedoc-generated.json

Hi everyone. I cloned the crawlee-python repository onto my machine and followed the steps in contributing.md, but I'm not sure how to fix an error I hit when trying to run the documentation using Docusaurus. When I type yarn start into the console, I get this error. My Node version is 20.15.1 and my Docusaurus version is 3.4.0...
fascinating-indigo 7/8/2024

Getting empty results via Python client

Hi everyone! I have an issue. I want to use the Instagram User Scraper. When I ran it on the Apify website, it worked OK. However, when I run it from a Python IDE, I receive empty data, although I don't see any errors in the table of runs. Am I doing something wrong? Please help me.
wise-white 6/30/2024

Getting callbacks after extraction is completed

Hello, are there any callbacks we can get from Apify when the web extraction is done?...
genetic-orange 6/24/2024

Crawler terminates when URL is invalid

I'm trying to crawl a website which contains an invalid link. I add the links dynamically through context.enqueue_links. The detected link looks like this:
<a href="http://DoorLock-WA2 – DATENBLATT" target="_blank" rel="noreferrer noopener">DATENBLATT KXC-WA2-IP1, KXC-WA2-IP2</a>
I get the following error:
httpx.InvalidURL: Invalid IDNA hostname: 'DoorLock-WA2 – DATENBLATT'
...
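
One way to sidestep this kind of failure, sketched here as an assumption and not as Crawlee's own behavior, is to validate hostnames before URLs reach the HTTP client. The helper `has_valid_host` is hypothetical and deliberately conservative: it accepts only ASCII hostnames made of letter/digit/hyphen labels, so it would also reject internationalized names that are not already punycode-encoded, as well as IPv6 literals.

```python
import re
from urllib.parse import urlsplit

# One DNS label: 1-63 letters, digits, or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def has_valid_host(url: str) -> bool:
    """Return True if url has a plain ASCII hostname made of valid DNS labels."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host or len(host) > 253:
        return False
    labels = host.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) for label in labels)


links = [
    "https://crawlee.dev/python/docs",
    "http://DoorLock-WA2 – DATENBLATT",  # the malformed href from the post above
]
# Drop links whose hostname could never resolve before enqueuing them.
safe_links = [u for u in links if has_valid_host(u)]
```

Filtering like this means the links would have to be extracted and enqueued manually instead of through a selector-based helper, which is a trade-off to weigh against catching the error after the fact.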
noble-gold 6/24/2024

Unable to set up Apify Actor using python-scrapy template on Windows

Hi Team, I'm new to Apify. I'm trying to set up an Apify Actor using the python-scrapy template, but I'm getting the following error. ``` C:\Users\Guest User\Desktop>apify create my-actor -t python-scrapy Info: Making sure that Apify CLI is up to date......
correct-apricot 6/19/2024

Question regarding Subscription

Hello, I'd like to inquire about your LinkedIn profile scraper API (https://console.apify.com/actors/PEgClm7RgRD7YO94b/). The pricing page mentions $25/month plus usage fees, but I'm curious whether that's a recurring charge regardless of usage, or whether I'll only be billed $25 once I've exhausted the previous month's allotment. Thanks...
genetic-orange 6/14/2024

How do I scrape a simple HTTP URL with onclick JavaScript buttons?

I'm trying to run a headless JavaScript onclick and pull the current info from an HTTP website. Any idea why it's not working? I've tried Scrapy, Selenium, etc...