Apify Discord Mirror

We're using the proxy feature and our usage is somewhat difficult to predict. We'd like either to be notified via Slack when our account balance goes below a given threshold, or to set up automatic account balance top-ups.

Are either of these possible?
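If there is no built-in option for this, a small scheduled script (or a scheduled Actor) can poll your account usage and ping a Slack incoming webhook when the remaining budget gets low. A rough sketch; the /v2/users/me/limits endpoint and the monthlyUsageUsd / maxMonthlyUsageUsd field names are assumptions to verify against the Apify API reference:
Python
import os

import httpx

APIFY_TOKEN = os.environ['APIFY_TOKEN']               # your Apify API token
SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']   # Slack incoming-webhook URL
THRESHOLD_USD = 10.0                                  # alert below this remaining budget


def check_balance() -> None:
    # Assumed endpoint and response fields -- check the Apify API reference.
    resp = httpx.get(
        'https://api.apify.com/v2/users/me/limits',
        headers={'Authorization': f'Bearer {APIFY_TOKEN}'},
    )
    resp.raise_for_status()
    data = resp.json()['data']
    remaining = data['limits']['maxMonthlyUsageUsd'] - data['current']['monthlyUsageUsd']
    if remaining < THRESHOLD_USD:
        httpx.post(SLACK_WEBHOOK_URL, json={'text': f'Apify budget low: ${remaining:.2f} left'})


if __name__ == '__main__':
    check_balance()
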
I'm trying to integrate with Nuxt 3; when I run in production mode it doesn't work:

[nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js'

I'm importing it as an ES module:
import { Dataset, PuppeteerCrawler } from 'crawlee'

I checked node_modules/puppeteer/lib and only the esm folder is there.

Why does PuppeteerCrawler still try to load the CJS build? Any idea?
m
mjh
·

Max Depth option

Hello! Just wondering whether it is possible to set a max depth for the crawl?
Previous posts (from 2023) seem to make use of 'userData' to track the depth.
Thank you.
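The userData approach is still the way I'd do it: carry the depth on each request and stop enqueueing once the limit is reached. A minimal Python sketch of the idea (import paths follow recent Crawlee for Python releases; in the JS version the same thing is done with request.userData and the userData option of enqueueLinks), with MAX_DEPTH and the link selection purely illustrative:
Python
from urllib.parse import urljoin

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

MAX_DEPTH = 3  # illustrative limit

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    depth = context.request.user_data.get('depth', 0)
    if depth >= MAX_DEPTH:
        return  # past the limit: extract data if needed, but enqueue nothing further
    new_requests = [
        Request.from_url(
            urljoin(context.request.url, a['href']),
            user_data={'depth': depth + 1},
        )
        for a in context.soup.select('a[href]')
    ]
    await context.add_requests(new_requests)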
1 comment
В
Can I just use the free plan and top up some credits for pay-as-you-go? Because my usage is only a one-time thing.
2 comments
a
s
I want to reuse my existing crawler setup (JSON/Cheerio) when creating a new session: sign the user in or up there, and associate the cookies and token with that session.

Currently I do this new-session creation conditionally inside the preNavigation hook (the context is passed as an argument there), not in createNewSession.
2 comments
V
R
When using a transparent icon for the Actor (WebP or PNG images), an unexpected black border appears (in Google Chrome at 80% zoom).
1 comment
R
Hi there. I am coming from ScraperAPI-style solutions and I am having issues with them. I just want to try Apify.
I am trying to build my first Actor, without any success so far.
The test Actor sample offers a full example. Sounds great, but I get an error whenever I try a URL other than the default one (https://www.apify.com). For example, with https://fr.indeed.com I get an error. Any idea?
1 comment
В
I run Crawlee in a Docker container; that container is used in a Jenkins task.
When starting the crawler I receive the following error:
Plain Text
    Browser logs:
      Chromium sandboxing failed!
      ================================
      To avoid the sandboxing issue, do either of the following:
        - (preferred): Configure your environment to support sandboxing
        - (alternative): Launch Chromium without sandbox using 'chromiumSandbox: false' option
      ================================

The full error log can be found in the attachment.
This error only occurs after upgrading crawlee[playwright] to 0.5.2.

What are the advantages/disadvantages of launching Chromium without sandbox? How could I configure my environment to support sandboxing?
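On the trade-off: the sandbox is an extra isolation layer between rendered pages and the rest of the container, so dropping it is mostly acceptable when you crawl trusted sites or rely on the container itself for isolation. Keeping it generally means the container must allow Chromium's unprivileged user namespaces (for example via the official Playwright/Apify base images, a suitable seccomp profile, or --cap-add=SYS_ADMIN, which has its own security cost). If you do disable it, a sketch for Crawlee for Python, assuming browser_launch_options is forwarded to Playwright's launch():
Python
from crawlee.crawlers import PlaywrightCrawler  # on 0.5.x: from crawlee.playwright_crawler import ...

# 'chromium_sandbox' is a standard Playwright launch option; passing
# {'args': ['--no-sandbox']} is an equivalent lower-level alternative.
crawler = PlaywrightCrawler(
    browser_launch_options={'chromium_sandbox': False},
)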
4 comments
M
R
I want to create a bunch of authenticated users, each with its own consistent browser, proxy, user agent, fingerprint, schedule, browsing pattern, etc.
13 comments
V
C
A
R
I'm getting this system-overloaded message while just trying to scrape two URLs. The check has kept looping for almost 10 minutes now. I set the CPU to 4 and the memory to 4 GB but I'm still getting this message. I know Cloud Run doesn't like threads and background tasks; is that the real issue? Not sure. Wondering if anyone has run these on Cloud Run.
Plain Text
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
1 comment
M
When I use apify run, it says Python can't be detected. It's installed, it's in the PATH variable, and it works from cmd and PowerShell like a charm. Also, I updated Node and npm to the latest versions and reinstalled apify-cli.
b
billsauce
·

error

Hi, why do I always get this error: raise ApifyApiError(response, attempt)
apify_client._errors.ApifyApiError: You must rent a paid Actor in order to run it. I have Apify Pro.
I want to test the Apify proxy and how it works, so I can integrate it with my Python code.
Running a very simple check, I found it's not working with HTTPS URLs. Here's a snippet:
Plain Text
import asyncio

import dotenv
import httpx
from apify import Actor


async def main():
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration(
            password=dotenv.get_key('.env', 'APIFY_PROXY_PASSWORD'),
        )
        proxy_url = await proxy_configuration.new_url()
        # Route all traffic (HTTP and HTTPS) through the Apify proxy URL.
        async with httpx.AsyncClient(proxy=proxy_url) as client:
            for _ in range(3):
                response = await client.get('https://httpbin.org/ip')
                if response.status_code == 200:
                    print(response.json())
                else:
                    print(response.text)


if __name__ == '__main__':
    asyncio.run(main())

It gives me a proxy error:
Plain Text
          raise mapped_exc(message) from exc
      httpx.ReadTimeout
[apify] INFO  Exiting Actor ({"exit_code": 91})

If I just change the protocol, using http://httpbin.org/ip instead, it works.
The Apify proxy should support HTTPS, as stated on the site. Thanks in advance.
3 comments
R
I have the following code for AdaptivePlaywrightCrawler and I want to log the number of enqueued links after calling enqueueLinks.

router.addDefaultHandler(async ({ request, enqueueLinks, parseWithCheerio, querySelector, log, page }) => {
    await enqueueLinks({
        strategy: 'same-domain',
        globs: globs,
        transformRequestFunction: (request) => {
            return request;
        },
    });
});
1 comment
R
What is wrong with my transformation?
Everything under physicianInfo is not being displayed on joboverview.
I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path, not the full URL.

To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com and it finds a URL like www.example.com/team, I want that URL to be queued.
When crawling www.my-team.com, it would match every URL on that website because "team" is part of the domain, and that is not the behaviour I want.

I thought of using a pre_navigation_hook and checking again there with the following code, but I don't think it's possible to cancel a request that is already queued?
Plain Text
    parsed_url = urlparse(context.request.url)
    path_name = parsed_url.path

    results = _get_regex_matches(path_name)

    if not results:
        context.log.info(
            f'No match found for URL: {context.request.url} in path: '
            f'{path_name}'
        )
        # TODO: CANCEL REQUEST


In the docs I found something like await request_list.mark_request_as_handled(request), but I don't think I have access to a request_list or anything similar in the PlaywrightPreNavCrawlingContext.

It would be great if someone can point me in the right direction!
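Since the pre-navigation hook only runs after the request has already been queued, it may be simpler to do the path check at enqueue time: collect the links yourself, filter on urlparse(...).path, and only add the survivors. A rough sketch of a default handler doing that, with the keyword regex and the a[href] selection as placeholders for your own logic:
Python
import re
from urllib.parse import urlparse

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

KEYWORDS = re.compile(r'(team|about)', re.IGNORECASE)  # placeholder keywords

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # el.href yields absolute URLs, so the path can be checked in isolation.
    hrefs = await context.page.eval_on_selector_all(
        'a[href]', 'els => els.map(el => el.href)'
    )
    matching = [
        url for url in hrefs
        if KEYWORDS.search(urlparse(url).path)  # match on the path only, never the domain
    ]
    await context.add_requests(matching)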
2 comments
M
Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have 2 to 3 thousand pages, while others might have just a few hundred (or even less).
The thing I have doubts about is: if I put all starting URLs in the same crawler instance, it might finish scraping a domain way before another one. I thought about separating domains, creating a crawler instance for each domain, just so that I can run each crawler separately and let them run their own course.
Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy?
TIA
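One crawler per domain is a reasonable pattern and shouldn't need noticeably more resources overall, as long as each instance gets its own request queue so they stay independent, and you don't run too many browser-based crawlers at once. A sketch of that setup, assuming your Crawlee version exposes the request_manager parameter (older releases call it request_provider) and the crawlee.crawlers import layout:
Python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import RequestQueue

# Hypothetical domains; each entry gets its own crawler and its own named queue.
DOMAINS = {
    'shop-a': ['https://a.example.com'],
    'shop-b': ['https://b.example.com'],
}


async def crawl_domain(name: str, start_urls: list[str]) -> None:
    queue = await RequestQueue.open(name=name)  # separate queue per domain
    crawler = BeautifulSoupCrawler(request_manager=queue)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'[{name}] {context.request.url}')
        # ... extract data / enqueue further links for this domain here ...

    await crawler.run(start_urls)


async def main() -> None:
    # Run them concurrently; switch to a sequential loop if resources are tight.
    await asyncio.gather(*(crawl_domain(name, urls) for name, urls in DOMAINS.items()))


if __name__ == '__main__':
    asyncio.run(main())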
2 comments
O
V
Root-relative: prefixed with '/', i.e. href=/ASDF brings you to example.com/ASDF

Base-relative: no prefix, i.e. href=ASDF from example.com/test/ brings you to example.com/test/ASDF

If someone could point me to where in the library this logic occurs, I would be forever grateful
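Both cases are standard URL resolution against the page's base URL (the same rules a WHATWG URL parser follows), so while digging through the library you can sanity-check expectations with Python's urllib.parse.urljoin:
Python
from urllib.parse import urljoin

# Root-relative: a leading '/' resolves against the origin only.
print(urljoin('https://example.com/test/', '/ASDF'))  # https://example.com/ASDF

# Base-relative: no prefix resolves against the current directory.
print(urljoin('https://example.com/test/', 'ASDF'))   # https://example.com/test/ASDF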
1 comment
M
D
DuxSec
·
E
Solved

Double log output

In main.py logging works as expected; however, in routes.py every log line is printed twice for some reason.
I did not set up any custom logging, I just use
Actor.log.info("STARTING A NEW CRAWL JOB")

example:
Plain Text
[apify] INFO  Checking item 17
[apify] INFO  Checking item 17 ({"message": "Checking item 17"})
[apify] INFO  Processing new item with index: 17
[apify] INFO  Processing new item with index: 17 ({"message": "Processing new item with index: 17"})


If I add this in my main.py (https://docs.apify.com/sdk/python/docs/concepts/logging)
Plain Text
import logging

from apify import Actor
from apify.log import ActorLogFormatter


async def main() -> None:
    async with Actor:
        ##### SETUP LOGGING #####
        handler = logging.StreamHandler()
        handler.setFormatter(ActorLogFormatter())

        apify_logger = logging.getLogger('apify')
        apify_logger.setLevel(logging.DEBUG)
        apify_logger.addHandler(handler)

it prints everything from main.py 2x, and everything from routes.py 3x.

Plain Text
[apify] INFO  STARTING A NEW CRAWL JOB
[apify] INFO  STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
[apify] INFO  STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
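A likely cause is that the 'apify' logger ends up with more than one handler: the SDK configures logging when the Actor starts, and the snippet above attaches another StreamHandler, so every extra handler prints the same record again (which would also explain routes.py printing one more copy than main.py). A sketch that keeps exactly one handler, assuming duplicated handlers really are the culprit:
Python
import logging

from apify.log import ActorLogFormatter

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.handlers.clear()    # drop any handlers attached earlier
apify_logger.propagate = False   # keep records from bubbling up to the root logger too

handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())
apify_logger.addHandler(handler)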
11 comments
E
D
A scraper that I am developing scrapes an SPA with infinite scrolling. This works fine, but after 300 seconds I get a WARN, which spawns another Playwright instance.
This probably happens because I only handle one request (I do not add anything to the RequestQueue), inside which I just loop until a finished condition is met.

Plain Text
[crawlee.storages._request_queue] WARN  The request queue seems to be stuck for 300.0s, resetting internal state. ({"queue_head_ids_pending": 0, "in_progress": ["tEyKIytjmqjtRvA"]})


What is a clean way to stop this from happening?
3 comments
E
D
A
If I use multiple files, what is the best way to pass data (user input, which contains 'max_results' or something) to my routes.py?

example snippet main.py
Plain Text
        max_results = 5 # example

        crawler = PlaywrightCrawler(
            headless=False, 
            request_handler=router,
        )
        await crawler.run([start_url])


snippet routes.py
Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    max_results = ???
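One option that keeps routes.py free of globals is to attach the input to the start request's user_data and read it back in the handler. A sketch, assuming the user_data parameter of Request.from_url and the crawlee.crawlers import layout of recent releases:
Python
# main.py
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler

from routes import router


async def main() -> None:
    max_results = 5  # example user input, e.g. taken from Actor.get_input()
    crawler = PlaywrightCrawler(headless=False, request_handler=router)
    # Attach the input to the start request so handlers can read it back.
    start_request = Request.from_url(
        'https://example.com',  # placeholder start_url
        user_data={'max_results': max_results},
    )
    await crawler.run([start_request])


# routes.py
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    max_results = context.request.user_data.get('max_results', 5)
    context.log.info(f'max_results = {max_results}')

Note that requests enqueued from inside a handler don't inherit user_data automatically, so copy the value into their user_data as well, or keep truly global settings in a small config module imported by both files.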
3 comments
D
M
В
Does Crawlee for Python allow multiple crawlers to be run using one router?
Plain Text
router = Router[BeautifulSoupCrawlingContext]()

Just asking because a colleague wondered whether it would be possible: curl requests are a lot faster than Playwright, so if we could use curl for half the requests and only load the browser for the portion where it's needed, it could significantly speed up some processes.
1 comment
T
Hi, I've seen mentions of a "pay per event" pricing model (https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event and https://apify.com/mhamas/pay-per-event-example), but I can't find how to use it for one of my Actors; I only see the rental and pay-per-result options.
How can we use this pay-per-event pricing model?
8 comments
M
A
S
D
J
I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.

  • Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
  • Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
  • Memory Overload: I get the warning "Memory is critically overloaded" during crawls.
I've attached images of my code (it was too long, so I couldn't paste it).

How can I handle these issues more efficiently, especially for dynamic and JavaScript-heavy sites?
I would appreciate any help
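For the first two points, one workable pattern is to wait until the loader is gone and the links are actually attached, then click each one, record the URL it lands on, and go back; the memory warning usually just means the run needs more memory or lower concurrency. A rough sketch with placeholder selectors (.loader, .result-link) standing in for yours, and assuming the list re-renders in the same order after going back:
Python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(max_requests_per_crawl=50)


@crawler.router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    page = context.page

    # Wait for the client-side loader to disappear and the links to render.
    await page.wait_for_selector('.loader', state='hidden')
    links = await page.locator('.result-link').all()

    discovered: list[str] = []
    for link in links:
        async with page.expect_navigation():
            await link.click()
        discovered.append(page.url)  # URL the click navigated to
        await page.go_back()
        await page.wait_for_selector('.result-link')  # wait for the list to come back

    await context.add_requests(discovered)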
2 comments
A
b
Hello, I would like to ask whether any Apify tool can, for example, find a similar image (https://i.postimg.cc/KzRHFKQc/55.jpg) and extract the product name from the resulting links into a CSV. Could we use Google Lens? I want to use this to automatically name antique products.

Thanks for all the information and help! 👋
1 comment
A