Apify Discord Mirror

During my Apify scraping runs with Crawlee / Puppeteer (32 GB RAM per run), my jobs stop with "There was an uncaught exception during the run of the Actor and it was not handled.", along with the logs you can see in the screenshot at the end.
This mostly happens on runs longer than 30 minutes; shorter runs are less likely to hit the error.
I've tried increasing the 'protocolTimeout' setting, but the error still happens, just after a longer wait.
I've also tried different concurrency settings, including leaving them at the default, but I consistently see this error.

Plain Text
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                "--no-sandbox", // Mitigates the "sandboxed" process issue in Docker containers,
                "--ignore-certificate-errors",
                "--disable-dev-shm-usage",
                "--disable-infobars",
                "--disable-extensions",
                "--disable-setuid-sandbox",
                "--ignore-certificate-errors",
                "--disable-gpu", // Mitigates the "crashing GPU process" issue in Docker containers
            ],
        },
    },
    maxRequestRetries: 1,
    navigationTimeoutSecs: 60,
    autoscaledPoolOptions: { minConcurrency: 30 },
    maxSessionRotations: 5,
    preNavigationHooks: [
        async ({ blockRequests }, goToOptions) => {
            if (goToOptions) goToOptions.waitUntil = "domcontentloaded"; // Set waitUntil here
            await blockRequests({
                urlPatterns: [
...
                ],
            });
        },
    ],
    proxyConfiguration,
    requestHandler: router,
});
await crawler.run(startUrls);
await Actor.exit();
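
For reference, in Puppeteer the protocolTimeout value is a launch option, so with PuppeteerCrawler it would go inside launchContext.launchOptions. A minimal sketch under that assumption (the 10-minute value is only an illustration, not a recommendation, and raising it only delays the failure if the page is genuinely hanging):

Plain Text
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            // protocolTimeout is in milliseconds and is passed straight to puppeteer.launch().
            protocolTimeout: 600_000,
        },
    },
    navigationTimeoutSecs: 60,
    requestHandler: async ({ page, request }) => {
        // ... page handling ...
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();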
1 comment
Hi! Error with Lodash in Crawlee

Please help. I ran the actor and got this error. I tried changing to different versions of Crawlee, but the error still persists.

Plain Text
node:internal/modules/cjs/loader:1140
  const err = new Error(message);
        ^

Error: Cannot find module './_baseGet'
Require stack:
- C:\wedat\dat-spain\apps\actor\node_modules\lodash\get.js
- C:\wedat\dat-spain\apps\actor\node_modules\@sapphire\shapeshift\dist\cjs\index.cjs
- C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\memory-storage.js
- C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\index.js
- C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\core\configuration.js
4 comments
Hi!

I'm new to Crawlee and super excited to migrate my scraping architecture to it, but I can't figure out how to achieve this.

My use case:
I'm scraping 100 websites multiple times a day. I'd like to save the working configurations (cookies, headers, proxy) for each site.

From what I understand, Sessions are made for this.
However, I'd like to keep the working Sessions in my database, so that working sessions persist even if the script shuts down...

Also, saving the working configurations in a database would be useful when scaling Crawlee to multiple server instances.

My ideal scenario would be to save all the configurations for each site (including the type of crawler used (cheerio, got, playwright), CSS selectors, proxy needs, headers, cookies, ...).

Thanks a lot for your help!
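
For what it's worth, here is a minimal sketch of the per-site config idea. It uses a named KeyValueStore as a stand-in for the external database, and Session.setCookies / getCookies for cookie handling; the 'site-configs' store name and the config object shape are just assumptions for illustration, not an official pattern:

Plain Text
import { CheerioCrawler, KeyValueStore } from 'crawlee';

// Named key-value store used here as a stand-in for an external database.
const siteConfigs = await KeyValueStore.open('site-configs');

const crawler = new CheerioCrawler({
    useSessionPool: true,
    preNavigationHooks: [
        async ({ request, session }) => {
            // Load the last known working config for this host, if any.
            const host = new URL(request.url).hostname;
            const config = await siteConfigs.getValue(host);
            if (config?.headers) request.headers = { ...request.headers, ...config.headers };
            if (config?.cookies && session) session.setCookies(config.cookies, request.url);
        },
    ],
    requestHandler: async ({ request, session }) => {
        // ... scraping logic ...
        // On success, persist what worked for this host.
        const host = new URL(request.url).hostname;
        await siteConfigs.setValue(host, {
            headers: request.headers,
            cookies: session?.getCookies(request.url),
        });
    },
});

The same read/write points could talk to a real database instead, which would also cover the multi-instance scaling case.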
3 comments
Using PPE Actors we developed ourselves causes us to appear as paying users on the analytics dashboard, whereas using our own PPR and rented Actors does not. This can be confusing for developers, and since there is no actual profit/cost change, it may look as if the Actor has issues with charging.

Additionally, having more detailed indicators for PPE Actors in the analytics dashboard would be very beneficial. For example, it would be great to see how much is charged for each event per execution of each Actor.

Hi, we are trying to upgrade to a paid solution and we can't get the payment through. We checked the billing details and contacted the card company, and there were no issues on their end. They said that there was no payment attempt from Apify. Can you please assist with this issue?
14 comments
I am running a Twitter scraper Actor v2 on Apify, and I see that my run succeeded and says 100 results,
but when I go to the details page, it is just an array of 100 items of {'demo': true}.
How can I get proper details?
1 comment
❗ Guys, was something recently released or changed at Apify related to Actor resources, etc.? I have an Actor that has been running fine for a while, but in the past few days migrations have become frequent, causing issues for some of my paid Actor users. ⚠️
1 comment
A parameter name containing a dot (.) with the stringList editor doesn't work in the web console.

Example INPUT_SCHEMA.JSON
Plain Text
{
    "title": "Test",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "search.location": {"title": "Locations #1", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]},   ### <-- Problem
        "search_location": {"title": "Locations #2", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]}
    }
}

check Actor-ID: acfF0psV9y4e9Z4hq
I can't click the +Add button. When I edit it using the Bulk button, the resulting JSON is weird: it automatically becomes an object structure, which is a nice effect. Not sure if this is really a bug or a new feature?
2 comments
I want an Apify Actor that takes a location name as input and returns the LinkedIn geolocation ID as output. Is there any such Actor available in the Apify Store, or on any platform in general?
2 comments
input_schema.json
Plain Text
{
    "title": "Base64 Image Processor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "files": {
            "type": "array",
            "description": "Array of file objects to process",
            "items": {
                "type": "object",
                "properties": {
                    "file": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "type": {"type": "string"},
                            "size": {"type": "integer"},
                            "content": {"type": "string"},
                            "description": {"type": "string"}
                        },
                        "required": ["name", "type", "size", "content", "description"]
                    }
                },
                "required": ["file"]
            }
        }
    },
    "required": ["files"]
}


Running or starting the Actor shows this error:
2025-03-16T08:19:28.275Z ACTOR: ERROR: Input schema is not valid (Field schema.properties.files.enum is required)

I need help with this.
1 comment
I created an API with Express that runs Crawlee when an endpoint is called.

It is weird that it works completely fine on the first request I make to the API, but fails on the next ones.

I get the error: Request queue with id: [id] does not exist.

I think I'm making some JavaScript mistake, to be honest; I don't have much experience with it.

Here is the way I'm doing the API:
Plain Text
import { crawler } from './main.js'  // Import the exported crawler from main file
import express from "express";

const app = express();
app.use(express.json());

const BASE_URL = "https.....";

app.post("/scrape", async (req, res) => {
    if (!req.body || !req.body.usernames) {
        return res.status(400).json({ error: "Invalid input" });
    }

    const { usernames } = req.body;
    const urls = usernames.map(username => `${BASE_URL}${username}`);

    try {
        await crawler.run(urls);
        const dataset = await crawler.getData();


        return res.status(200).json({ data: dataset });
    } catch (error) {
        console.error("Scraping error:", error);
        return res.status(500).json({ error: "Scraping failed" });
    }
});


const PORT = parseInt(process.env.PORT) || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));

Here is how my crawler looks:

Plain Text
import { CheerioCrawler, Configuration, Dataset, ProxyConfiguration, log } from 'crawlee';

const proxies = [...] // my proxy list

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: proxies,
});


export const crawler = new CheerioCrawler({
    proxyConfiguration,

    requestHandler: async ({ request, json, proxyInfo }) => {
        log.info(JSON.stringify(proxyInfo, null, 2));

        // Scraping logic

        await Dataset.pushData({
            // pushing data
        });
    },
}, new Configuration({      // the Configuration must be the second constructor argument,
    persistStorage: false,  // outside the crawler options object
}));
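
In case it is useful: one common cause of a "Request queue with id ... does not exist" error in a setup like this is reusing a single crawler instance (whose default request queue has already been torn down or purged after the first run) across HTTP requests, especially with persistStorage: false. A minimal sketch of a workaround under that assumption is to build a fresh crawler per /scrape call; the proxy URL below is a placeholder:

Plain Text
import { CheerioCrawler, Configuration, Dataset, ProxyConfiguration, log } from 'crawlee';

const proxies = ['http://user:pass@proxy.example.com:8000']; // placeholder proxy list

// Factory: a fresh crawler (and therefore a fresh default request queue) per API call.
export function createCrawler() {
    const proxyConfiguration = new ProxyConfiguration({ proxyUrls: proxies });

    return new CheerioCrawler({
        proxyConfiguration,
        requestHandler: async ({ request, json, proxyInfo }) => {
            log.info(JSON.stringify(proxyInfo, null, 2));
            // ... scraping logic ...
            await Dataset.pushData({ url: request.url });
        },
    }, new Configuration({ persistStorage: false }));
}

// In the Express handler:
//   const crawler = createCrawler();
//   await crawler.run(urls);
//   const dataset = await crawler.getData();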
1 comment
I am trying to scrape a particular website, but it seems to have Cloudflare or some other advanced firewall preventing any bot or automated script.

Please guide me to a strategy that will work against such advanced protections.
1 comment
royrusso
Solved

Only-once storage

Hello all,

I’m looking to understand how crawlee uses storage a little better and have a question regarding that:

Crawlee truncates the storage of all indexed pages every time I run. Is there a way to not have it do that? Almost like using it as an append-only log for new items found.

Worst case scenario, I can keep an in-memory record of all pages and simply not write a page to disk when I've already seen it. Curious what the best practices are here.
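
For context, the truncation is most likely Crawlee purging its default storages on startup. A sketch of two ways around it, assuming the purge is indeed the cause: disable the purge, or write to a named dataset, since named storages are not purged and so behave like an append-only log across runs.

Plain Text
import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

// Option 1: keep default storages between runs by disabling the purge
// (the CRAWLEE_PURGE_ON_START environment variable controls the same behavior).
Configuration.getGlobalConfig().set('purgeOnStart', false);

// Option 2: push results to a *named* dataset, which is not purged on start.
const pages = await Dataset.open('indexed-pages');

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        await pages.pushData({ url: request.url, title: $('title').text() });
    },
});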
1 comment
Hey everyone! 👋

I'm running into some trouble getting VNC to connect to my Docker container. Using apify/actor-node-playwright-chrome and running it as-is, but no luck in headful mode. The chrome_test.js and main.js run perfectly, but VNC and Remote Debugging are not working.

I'm on Windows 11, using VS Code, WSL2, and Docker Desktop. I first tried pulling the image from the Docker repo, and then built the image on an Ubuntu distro via WSL2 and Docker Desktop with WSL integration enabled.

Here’s what I’ve tried so far:

Modified chrome_test.js to add a delay when headless: false
Exposed the necessary ports
Removed -nolisten tcp from both VNC servers
Still can’t connect via VNC (RealVNC) or Chrome Remote Debugging
Is the image missing something like a VNC server?
Does xvfb / xvfb-run serve as the VNC server? It is usually used together with a VNC server like x11vnc.
I also exposed the Chrome Remote Debugging port, but couldn't establish a connection.

Not sure what I’m missing. Trying to set up Docker properly before diving into actor development. Anyone run into this before? Would appreciate any tips! 🙏
2 comments
Hey there! I am creating an intelligent crawler using crawlee. Was previously using crawl4ai but switched since crawlee seems much better at anti-blocking.

The main issue I am facing is that I want to filter the URLs to crawl for a given page using LLMs. Is there a clean way to do this? So far I have implemented a transformer for enqueue_links which saves the links to a dict, and then I process those dicts at a later point in time using another crawler object. Any other suggestions for solving this problem? I don't want to make the LLM call in the transform function, since that would be one LLM call per URL found, which is quite expensive.

Also, when I run this on my EC2 instance with 8 GB of RAM, it constantly runs into memory overload and just gets stuck, i.e. it doesn't even continue scraping pages. Any idea how I can resolve this? This is my code currently.

I have a project that uses the PlaywrightCrawler from Crawlee.
If I create the Camoufox template it runs perfectly, but when I take the same commands from the template's package.json and basically follow the same example in my project, I get the following error:
Plain Text
2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed.
{"terminal":true}
2025-03-13T11:58:38.513Z [Crawler] [ERROR ❌] BrowserLaunchError: Failed to launch browser. Please check the following:
- Check whether the provided executable path "/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox" is correct.
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

Of course neither of those two suggestions helps: the Camoufox binary is already there, and playwright install --with-deps has already been run because the project was previously running Firefox.

The entire error log is attached.
3 comments
While running my code in the Apify IDE I'm getting this error.

The build was successful.
3 comments
Hi! This is my url:
https://api.apify.com/v2/acts/crypto-scraper~dexscreener-tokens-scraper/run-sync-get-dataset-items?token=<my-token>
Body:
Plain Text
{
    "chainName": "solana",
    "filterArgs": [
        "?rankBy=trendingScoreH24&order=desc",
        "?rankBy=marketCap&order=desc&limit=10&minMarketCap=1"
    ],
    "fromPage": 1,
    "toPage": 1
}
I want to limit the fetched data to 100 items or fewer.

I changed my URL to: https://api.apify.com/v2/acts/crypto-scraper~dexscreener-tokens-scraper/run-sync-get-dataset-items?token=<my-token>&limit=100

But it still returns more than 100 items.

Has anyone experienced this? What am I doing wrong? Thanks in advance!
2 comments
I'm trying to make a simple crawler; how do I properly control redirects? Some bad proxies sometimes redirect to an auth page, and in that case I want to mark the request as failed if the redirect target URL contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
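
One possible approach (a sketch, not necessarily the only way): after the response is loaded, request.loadedUrl holds the final URL after redirects, so the handler can throw a NonRetryableError to mark the request failed without further retries when it landed on a login page:

Plain Text
import { CheerioCrawler, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // loadedUrl is the URL that was actually loaded, i.e. after redirects.
        if (request.loadedUrl?.includes('/auth/login')) {
            // Fail the request immediately; NonRetryableError skips further retries.
            throw new NonRetryableError(`Redirected to login: ${request.loadedUrl}`);
        }
        // ... normal scraping ...
    },
});

For browser crawlers, an earlier abort could also be done in a preNavigationHook, but the check above is the simplest variant.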
5 comments
I have this in my .actor/pay_per_event.json and I'm calling it in my main.py, but I get this warning in my terminal: 2025-03-08T14:09:14.994Z [apify] WARN Ignored attempt to charge for an event - the Actor does not use the pay-per-event pricing
If I use await Actor.charge('actor-start-gb '), will it be using PPE correctly? Please let me know; thank you in advance.


{ "actor-start": { "eventTitle": "Price for Actor start", "eventDescription": "Flat fee for starting an Actor run.", "eventPriceUsd": 0.1 }, "task-completed": { "eventTitle": "Price for completing the task", "eventDescription": "Flat fee for completing the task.", "eventPriceUsd": 0.4 } }

main.py
Plain Text
async def main():
    """Runs the AI Travel Planner workflow."""
    async with Actor:
        await Actor.charge('actor-start')
        actor_input = await Actor.get_input() or {}
        Actor.log.info(f"Received input: {actor_input}")
        travel_query = TravelState(**actor_input)
        # Execute workflow
        final_state = travel_workflow.invoke(travel_query)
        Actor.log.info(f"Workflow completed. Final state: {final_state}")
        await Actor.charge('task-completed')
        # Save the final report
        await save_report(final_state)
7 comments
I'm having an issue when I'm using apify run:

it runs python -m src,

whereas to run my project I need to run python3.10 -m src.

Is there any way I can fix that?

When using python -m src, it uses version 3.13, which is my default version, and throws an error, so for this project I have used python3.10.

It would be great if you could share any fixes for this.
3 comments
Adding requests with crawler.run(["https://website.com/1234"]); works locally, while in the Apify cloud it breaks with the following error: Reclaiming failed request back to the list or queue. TypeError: Invalid URL

It appears that while running in the cloud, the URL is split into individual characters and each character creates a request in the queue, as can be seen in the screenshot.

The bug happens no matter whether the URL is hardcoded in the code or added dynamically via input.

I'm using crawlee 3.13.0.

Complete error stack:
```
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
2025-03-11T19:21:27.987Z at new URL (node:internal/url:806:29)
2025-03-11T19:21:27.988Z at getCookieContext (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:75:20)
2025-03-11T19:21:27.989Z at CookieJar.getCookies (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:452:23)
2025-03-11T19:21:27.989Z at CookieJar.callSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:168:16)
2025-03-11T19:21:27.990Z at CookieJar.getCookiesSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:575:21)
2025-03-11T19:21:27.991Z at Session.getCookies (/home/myuser/node_modules/@crawlee/core/session_pool/session.js:264:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._applyCookies (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:344:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:329:20)
2025-03-11T19:21:27.993Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:260:13)
2025-03-11T19:21:27.994Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9) {"id":"PznVw0jlt50G6EL","url":"D","retryCount":1}
```
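
The {"url":"D"} at the end of the stack suggests a string is being iterated character by character somewhere, which typically happens when a bare string ends up where an array of URLs is expected (for example, a startUrls input field arriving as a string). A defensive normalization sketch; the input field name is an assumption:

Plain Text
import { Actor } from 'apify';

const input = (await Actor.getInput()) ?? {};
const rawStartUrls = input.startUrls ?? ['https://website.com/1234'];

// Make sure run() always receives an array of URL strings (or request objects),
// never a bare string that could be spread into single-character "URLs".
const startUrls = (Array.isArray(rawStartUrls) ? rawStartUrls : [rawStartUrls])
    .map((item) => (typeof item === 'string' ? item : item.url));

// await crawler.run(startUrls);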
3 comments
I'm trying to run my actor & it's giving this log:
Plain Text
2025-03-09T00:13:41.538Z ACTOR: Pulling Docker image of build 20IgkKFk3QAzeFbk9 from repository.
2025-03-09T00:13:42.170Z ACTOR: Creating Docker container.
2025-03-09T00:13:42.237Z ACTOR: Starting Docker container.
2025-03-09T00:13:44.148Z Downloading model definition files...
2025-03-09T00:13:44.419Z Error downloading fingerprint-network.zip: [Errno 13] Permission denied: '/usr/local/lib/python3.13/site-packages/browserforge/fingerprints/data/fingerprint-network.zip'
2025-03-09T00:13:44.430Z Downloading model definition files...
2025-03-09T00:13:44.452Z Error downloading input-network.zip: [Errno 13] Permission denied: '/usr/local/lib/python3.13/site-packages/browserforge/headers/data/input-network.zip'
...
2025-03-09T00:13:44.580Z   File "/usr/local/lib/python3.13/site-packages/browserforge/bayesian_network.py", line 288, in extract_json
2025-03-09T00:13:44.582Z     with zipfile.ZipFile(path, 'r') as zf:
2025-03-09T00:13:44.583Z          ~~~~~~~~~~~~~~~^^^^^^^^^^^
2025-03-09T00:13:44.586Z   File "/usr/local/lib/python3.13/zipfile/__init__.py", line 1367, in __init__
2025-03-09T00:13:44.588Z     self.fp = io.open(file, filemode)
2025-03-09T00:13:44.590Z               ~~~~~~~^^^^^^^^^^^^^^^^
2025-03-09T00:13:44.592Z FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.13/site-packages/browserforge/headers/data/input-network.zip'
11 comments
Getting an error for a basic crawler when passing in my starting arguments.

It says the input must contain "url", which it already does.

Plain Text
2025-03-07T21:22:12.478Z ACTOR: Pulling Docker image of build aJ5w2MnrBdaZRxGeA from repository.
2025-03-07T21:22:13.611Z ACTOR: Creating Docker container.
2025-03-07T21:22:13.835Z ACTOR: Starting Docker container.
2025-03-07T21:22:14.208Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2025-03-07T21:22:14.210Z Executing main command
2025-03-07T21:22:15.368Z INFO  System info {"apifyVersion":"3.3.2","apifyClientVersion":"2.12.0","crawleeVersion":"3.13.0","osType":"Linux","nodeVersion":"v20.18.3"}
2025-03-07T21:22:15.498Z INFO  Starting the crawl process {"startUrls":[{"url":"https://salesblaster.ai"}],"maxRequestsPerCrawl":100,"datasetName":"default"}
2025-03-07T21:22:15.905Z ERROR Error running scraper: {"error":"Request options are not valid, provide either a URL or an object with 'url' property (but without 'id' property), or an object with 'requestsFromUrl' property. Input: {\n  url: { url: 'https://salesblaster.ai' },\n  userData: {\n    datasetName: 'default',\n    initialUrl: { url: 'https://salesblaster.ai' }\n  }\n}"}
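
Judging from the logged request options, a whole start-URL object ({ url: '...' }) is being placed into the url field instead of the plain string. A sketch of unwrapping it when the requests are built; startUrls, datasetName and crawler are placeholders taken from the log, not known code:

Plain Text
const requests = startUrls.map((startUrl) => {
    // startUrls items from the input look like { url: 'https://salesblaster.ai' }.
    const url = typeof startUrl === 'string' ? startUrl : startUrl.url;
    return {
        url, // must be a plain string here, not { url: ... }
        userData: { datasetName, initialUrl: url },
    };
});

// await crawler.run(requests);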
1 comment
Hey everyone,

I have built an Instagram Scraper using Selenium and Chrome that works perfectly until I deploy it as an actor here on Apify.

It signs in fine, but when it gets to the Search button it fails every time, no matter what I do or try.

I have iterated through:

1)
search_icon = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "svg[aria-label='Search']"))
)
search_icon.click()

-----

2)
search_icon = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, "//span[contains(., 'Search')]"))
)
search_icon.click()

-----

3)
search_icon = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, "//svg[@aria-label='Search']"))
)
search_icon.click()

----

4)
try:
    search_button = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((
            By.XPATH,
            "//a[.//svg[@aria-label='Search'] and .//span[normalize-space()='Search']]"
        ))
    )
    # Scroll the element into view just in case
    driver.execute_script("arguments[0].scrollIntoView(true);", search_button)
    search_button.click()
except TimeoutException:
    print("Search button not clickable.")

----

5)
search_button = WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((
        By.XPATH,
        "//a[.//svg[@aria-label='Search'] and .//span[normalize-space()='Search']]"
    ))
)
driver.execute_script("arguments[0].scrollIntoView(true);", search_button)
search_button.click()


And I have tried all of these with residential proxies, data center proxies, and different timeout lengths. NOTHING works, and there is nothing I can find in the documentation to help with this issue.

Does anyone have any insight into this?

I'd understand if this was failing to even sign in, but it is failing at the Search button. Is the page rendered differently for Apify than it is when you run this from your own computer, maybe?
2 comments