Apify Discord Mirror

I finished the crawl with around 1.7M requests, and about 100k of them failed. Is there a way to retry just the failed requests?
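One workable pattern (a sketch, not an official retry feature; the dataset name and crawler class are illustrative): record the failures from failedRequestHandler into a named dataset, then feed just those URLs into a follow-up run.
TypeScript
import { PuppeteerCrawler, Dataset } from 'crawlee';

// Illustrative: collect every request that exhausted its retries.
const failedStore = await Dataset.open('failed-requests');

const crawler = new PuppeteerCrawler({
    // ...your existing options...
    async requestHandler({ page }) { /* ... */ },
    failedRequestHandler: async ({ request }, error) => {
        await failedStore.pushData({ url: request.url, error: error.message });
    },
});

// A later "retry" run can read that dataset and enqueue only those URLs:
// const { items } = await failedStore.getData();
// await crawler.addRequests(items.map((item) => item.url));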
I'm trying to integrate with Nuxt 3; when I run in production mode it doesn't work.

[nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js'

I'm importing it as an ES module:
import { Dataset, PuppeteerCrawler } from 'crawlee'

I checked node_modules/puppeteer/lib and only the esm folder is there.

Why does PuppeteerCrawler still try to load the CJS build? Any idea?

Max Depth option

Hello! Just wondering whether it is possible to set a max depth for the crawl?
Previous posts (2023) seem to make use of 'userData' to track the depth.
Thank you.
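That userData approach still works. A minimal sketch of capping depth with it (MAX_DEPTH and the CheerioCrawler choice are illustrative, not from the post):
TypeScript
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 3; // illustrative limit

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        const depth = request.userData.depth ?? 0;
        if (depth >= MAX_DEPTH) return; // stop descending past the limit

        await enqueueLinks({
            // carry the incremented depth over to every enqueued link
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, depth: depth + 1 };
                return req;
            },
        });
    },
});

await crawler.run(['https://crawlee.dev']);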
3 comments
I want to use my existing crawler setup (JSON / Cheerio) when creating a new session: sign the user in (or up) there, and associate the cookies and token with that session.

Currently I do this new-session creation conditionally inside a preNavigation hook (the context is passed as an argument there), rather than in createNewSession.
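For reference, a hedged sketch of handling it in a preNavigation hook on a CheerioCrawler; the login endpoint, payload and Bearer-token handling are made up for illustration:
TypeScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    preNavigationHooks: [
        async ({ session, log }, gotOptions) => {
            if (!session) return;
            // Sign in once per session (hypothetical endpoint and payload).
            if (!session.userData.token) {
                const res = await fetch('https://example.com/api/login', {
                    method: 'POST',
                    headers: { 'content-type': 'application/json' },
                    body: JSON.stringify({ user: 'demo', pass: 'demo' }),
                });
                session.userData.token = (await res.json()).token;
                log.info(`Signed in session ${session.id}`);
            }
            // Attach the session's token to the outgoing request.
            gotOptions.headers = {
                ...gotOptions.headers,
                authorization: `Bearer ${session.userData.token}`,
            };
        },
    ],
    async requestHandler({ request, $ }) { /* ... */ },
});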
2 comments
I run Crawlee in a Docker container; that container is used in a Jenkins task.
When starting the crawler, I receive the following error:
Plain Text
    Browser logs:
      Chromium sandboxing failed!
      ================================
      To avoid the sandboxing issue, do either of the following:
        - (preferred): Configure your environment to support sandboxing
        - (alternative): Launch Chromium without sandbox using 'chromiumSandbox: false' option
      ================================

The full error log can be found in the attachment.
This error only occurs after upgrading crawlee[playwright] to 0.5.2.

What are the advantages/disadvantages of launching Chromium without sandbox? How could I configure my environment to support sandboxing?
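Broadly, the sandbox isolates the browser's renderer from the rest of the container, so launching without it mainly matters if you crawl untrusted or malicious pages; supporting it inside Docker usually means running the container with a seccomp profile that allows the required syscalls (or, less securely, extra privileges such as --cap-add=SYS_ADMIN). If you choose the no-sandbox route, a hedged Python sketch (assuming browser_launch_options is forwarded to Playwright's launch()):
Python
from crawlee.crawlers import PlaywrightCrawler  # 0.5.x import path; may differ in other versions

crawler = PlaywrightCrawler(
    headless=True,
    # Assumption: these options are passed through to Playwright's chromium launch();
    # chromium_sandbox=False trades renderer isolation for working in a restricted container.
    browser_launch_options={'chromium_sandbox': False},
)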
4 comments
I want to create a bunch of authenticated users, each with their own consistent browser, proxy, user agent, fingerprint, schedule, browsing pattern, etc.
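A rough sketch of how those pieces map onto Crawlee's session pool, fingerprints and proxies (the pool size, proxy URLs and usage limits are illustrative; scheduling and browsing patterns are not covered here):
TypeScript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'], // illustrative
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,              // sessions get a sticky proxy each
    useSessionPool: true,
    persistCookiesPerSession: true,  // cookies stay with the "user"
    sessionPoolOptions: {
        maxPoolSize: 5,              // number of identities
        sessionOptions: { maxUsageCount: 1000 },
    },
    browserPoolOptions: {
        useFingerprints: true,       // consistent generated fingerprint per browser
    },
    async requestHandler({ session, page }) {
        // session.userData can hold the credentials / auth state for this identity
    },
});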
13 comments
Getting this system-overloaded message just trying to scrape two URLs. This check has been looping for almost 10 minutes now. I set the CPU to 4 and memory to 4 GB but am still getting this message. I know Cloud Run doesn't like threads and background tasks; is that the real issue? Not sure; wondering if anyone has run these on Cloud Run.
Plain Text
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
1 comment
I have the following code for AdaptivePlaywrightCrawler and I want to log the number of enqueued links after calling enqueueLinks.

router.addDefaultHandler(async ({ request, enqueueLinks, parseWithCheerio, querySelector, log, page }) => {
    await enqueueLinks({
        strategy: 'same-domain',
        globs: globs,
        transformRequestFunction: (request) => {
            return request;
        },
    });
});
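enqueueLinks resolves to a result object with processedRequests and unprocessedRequests arrays (assuming the adaptive context returns the usual BatchAddRequestsResult), so the count can be logged from the return value:
TypeScript
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    const result = await enqueueLinks({
        strategy: 'same-domain',
        globs: globs,
    });
    // each processed request also carries a wasAlreadyPresent flag if you need new-only counts
    log.info(`Enqueued ${result.processedRequests.length} links, `
        + `${result.unprocessedRequests.length} not processed.`);
});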
1 comment
I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path, not the full URL.

To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com and it finds a URL like www.example.com/team, I want that URL to be queued.
When crawling www.my-team.com, it would match every URL on that website, because "team" is part of the main URL; that is not the behaviour I want.

I thought of using a pre_navigation_hook and checking again there with the following code, but I don't think it's possible to cancel a request that has already been queued?
Plain Text
    parsed_url = urlparse(context.request.url)
    path_name = parsed_url.path

    results = _get_regex_matches(path_name)

    if not results:
        context.log.info(
            f'No match found for URL: {context.request.url} in path: '
            f'{path_name}'
        )
        # TODO: CANCEL REQUEST


In the docs I found something like await request_list.mark_request_as_handled(request), but I don't think I have access to a request_list or anything similar in the PlaywrightPreNavCrawlingContext.

It would be great if someone could point me in the right direction!
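One way to avoid enqueuing such links in the first place (rather than cancelling them later) is a transform function on enqueue_links. This is a sketch that assumes a recent Crawlee for Python version exposing transform_request_function and the RequestOptions / RequestTransformAction types; the keywords are just the example ones from above:
Python
from urllib.parse import urlparse

from crawlee import RequestOptions, RequestTransformAction  # assumes a recent crawlee version

KEYWORDS = ('team', 'about')  # example keywords


def transform_request(request_options: RequestOptions) -> RequestOptions | RequestTransformAction:
    # Match against the URL *path* only, so www.my-team.com itself does not qualify.
    path = urlparse(request_options['url']).path.lower()
    if any(keyword in path for keyword in KEYWORDS):
        return request_options
    return 'skip'

# inside the request handler:
# await context.enqueue_links(transform_request_function=transform_request)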
2 comments
Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have 2 to 3 thousand pages, while others might have just a few hundred (or even fewer).
The thing I have doubts about is: if I put all the starting URLs in the same crawler instance, it might finish scraping one domain way before another. I thought about separating the domains, creating a crawler instance for each one, so that I can run each crawler separately and let it run its own course.
Is there any downside to this, e.g., will it need significantly more resources? Is there a better strategy?
TIA
2 comments
Root-relative - prefixed with '/', i.e. href=/ASDF brings you to example.com/ASDF

Base-relative - no prefix, i.e. href=ASDF from example.com/test/ brings you to example.com/test/ASDF

If someone could point me to where in the library this logic occurs, I would be forever grateful
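For reference, both behaviours follow standard WHATWG URL resolution, which is what new URL(href, baseUrl) gives you and, as far as I can tell, what the link-resolution code relies on when resolving hrefs against the page URL:
TypeScript
// Root-relative: the leading '/' resets the path to the origin root.
new URL('/ASDF', 'https://example.com/test/').href;
// -> 'https://example.com/ASDF'

// Base-relative: resolved against the "directory" of the base URL.
new URL('ASDF', 'https://example.com/test/').href;
// -> 'https://example.com/test/ASDF'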
1 comment
A scraper that I am developing scrapes an SPA with infinite scrolling. This works fine, but after 300 seconds I get a WARN, which spawns another Playwright instance.
This probably happens because I only handle one request (I do not add anything to the RequestQueue), inside which I just loop in a while until a finished condition is met.

Plain Text
[crawlee.storages._request_queue] WARN  The request queue seems to be stuck for 300.0s, resetting internal state. ({"queue_head_ids_pending": 0, "in_progress": ["tEyKIytjmqjtRvA"]})


What is a clean way to stop this from happening?
3 comments
Does Crawlee for Python allow multiple crawlers to be run using one router?
Plain Text
router = Router[BeautifulSoupCrawlingContext]()

Just asking, as a colleague asked me whether it would be possible: curl requests are a lot faster than Playwright, so if we could use curl for half the requests and only load the browser for the portion where it's needed, it could significantly speed up some processes.
1 comment
I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.

  • Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
  • Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
  • Memory Overload: I get the warning "Memory is critically overloaded" during crawls.
I've attached images of my code (it was too long so I couldn't paste it)

How can I handle these issues more efficiently, especially for dynamic and JavaScript-heavy sites?
I would appreciate any help
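A rough sketch of one approach for the first two points, assuming the clicks navigate in the same tab (the '.card' selector, the networkidle waits and the go-back step are assumptions about the site, not taken from the attached code):
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, addRequests }) {
        await page.waitForLoadState('networkidle'); // let the JS loader finish rendering the links
        const cards = page.locator('.card');        // hypothetical selector for the clickable elements
        const startUrl = page.url();

        for (let i = 0; i < await cards.count(); i++) {
            // Click and wait for the in-page navigation it triggers.
            await Promise.all([
                page.waitForURL((url) => url.href !== startUrl),
                cards.nth(i).click(),
            ]);
            await addRequests([page.url()]);                 // capture the discovered URL
            await page.goBack({ waitUntil: 'networkidle' }); // back to the listing for the next click
        }
    },
});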
2 comments
Hi,
I need help finding an actor, or configuring the Website Content Crawler, to extract all the URLs from a site but not their content. I want to filter the URLs by keywords to find the one I'm looking for, but I don't need the content of the pages.

Thanks for your help

When I build the Actor and run it, I get the following error:
2025-01-10T18:47:43.475Z Traceback (most recent call last):
2025-01-10T18:47:43.476Z File "<frozen runpy>", line 198, in _run_module_as_main
2025-01-10T18:47:43.477Z File "<frozen runpy>", line 88, in _run_code
2025-01-10T18:47:43.478Z File "/usr/src/app/src/__main__.py", line 3, in <module>
2025-01-10T18:47:43.479Z from .main import main
2025-01-10T18:47:43.479Z File "/usr/src/app/src/main.py", line 9, in <module>
2025-01-10T18:47:43.480Z from apify import Actor
2025-01-10T18:47:43.481Z File "/usr/local/lib/python3.12/site-packages/apify/__init__.py", line 7, in <module>
2025-01-10T18:47:43.482Z from apify._actor import Actor
2025-01-10T18:47:43.483Z File "/usr/local/lib/python3.12/site-packages/apify/_actor.py", line 16, in <module>
2025-01-10T18:47:43.483Z from crawlee import service_container
2025-01-10T18:47:43.484Z ImportError: cannot import name 'service_container' from 'crawlee' (/usr/local/lib/python3.12/site-packages/crawlee/__init__.py)

I did not change anything in my Dockerfile:
FROM apify/actor-python:3.12
COPY requirements.txt ./
...


In requirements.txt I install the following module:
apify ~= 2.0.0

Anyone else facing the same issue?
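For reference, this looks like a dependency mismatch: apify 2.0.x still imports crawlee.service_container (as the traceback shows), while newer crawlee releases apparently no longer provide it, so an unpinned crawlee gets pulled in and the import fails. One hedged workaround in requirements.txt (the exact version bound is an assumption; upgrading apify to a release compatible with current crawlee is the other option):
Plain Text
apify ~= 2.0.0
# assumption: pin crawlee to a version that still provides service_container
crawlee < 0.5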
3 comments
I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web.
Here is my code.

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        console.log(request.url, request.uniqueKey);
        await enqueueLinks();
    },
});
crawler.run(['https://crawlee.dev']);
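If the goal is to keep the crawl on the starting site, the scope can also be narrowed at enqueue time; a small sketch (the strategy and glob values are illustrative):
TypeScript
await enqueueLinks({
    strategy: 'same-hostname',              // only follow links on the same host
    globs: ['https://crawlee.dev/docs/**'], // optionally narrow further by pattern
});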
3 comments
Hi everyone! I am creating a crawler using Crawlee for Python. I noticed the Parsel crawler makes requests at a much higher frequency than the BeautifulSoup crawler. Is there a way to make the Parsel crawler slower, so we can better avoid getting blocked? Thanks!
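A sketch of throttling via concurrency settings (the numbers are illustrative; import paths can differ between versions):
Python
from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler  # in some versions: crawlee.parsel_crawler

crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,         # fewer requests in parallel
        max_tasks_per_minute=60,   # hard cap on the overall request rate
    ),
)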
5 comments
Are there actually any resources on building a scraper with Crawlee other than the ones in the docs?
Where do I set all the browser context options, for example?

Plain Text
import * as playwright from "playwright";

const launchPlaywright = async () => {
  const browser = await playwright["chromium"].launch({
    headless: true,
    args: ["--disable-blink-features=AutomationControlled"],
  });

  const context = await browser.newContext({
    viewport: { width: 1280, height: 720 },
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    geolocation: { longitude: 7.8421, latitude: 47.9978 },
    permissions: ["geolocation"],
    locale: "en-US",
    storageState: "playwright/auth/user.json",
  });
  return await context.newPage();
};
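In Crawlee, the equivalent usually goes through PlaywrightCrawler's launchContext rather than a hand-rolled launch function; since Crawlee launches a persistent context by default, most of Playwright's context options can be passed alongside the launch options. A hedged sketch (storageState in particular may need a different mechanism, e.g. a session pool or a user data dir):
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        userAgent:
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        launchOptions: {
            headless: true,
            args: ['--disable-blink-features=AutomationControlled'],
            // These are Playwright context options; the assumption here is that Crawlee
            // forwards them because it launches a persistent context under the hood.
            viewport: { width: 1280, height: 720 },
            geolocation: { longitude: 7.8421, latitude: 47.9978 },
            permissions: ['geolocation'],
            locale: 'en-US',
        },
    },
    async requestHandler({ page, request }) {
        // page is already created with the options above
    },
});

If the type definitions reject the context options there, browserPoolOptions.prePageCreateHooks is another place to set per-page context options.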
2 comments
Hello, I'm seeing (https://playwright.dev/python/docs/library#incompatible-with-selectoreventloop-of-asyncio-on-windows) that there is an incompatibility between Playwright and Windows' SelectorEventLoop -- which Crawlee seems to require? Can you confirm whether it is possible to use a PlaywrightCrawlingContext in a Windows environment? I'm running into an asyncio NotImplementedError when trying to run the crawler, which suggests to me that there might be an issue. Thanks for the help.
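For what it's worth, Playwright drives the browser over a subprocess pipe, and the Windows SelectorEventLoop cannot create subprocesses, which is exactly the NotImplementedError described. A common workaround (assuming nothing else in your stack forces the selector loop) is to set the Proactor policy before starting the crawler:
Python
import asyncio
import sys

if sys.platform == 'win32':
    # SelectorEventLoop on Windows does not support subprocesses,
    # which Playwright needs to launch the browser.
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# ...then create and run the PlaywrightCrawler as usual.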
1 comment