rival-black
rival-black•3y ago

Need help with Crawlee

I am getting the following error when crawling
23 Replies
rival-black
rival-blackOP•3y ago
@Helper It was working before, but when I added a new link it stopped working
rival-black
rival-blackOP•3y ago
what could possibly be wrong? I also tried the other method, passing an array of URLs to crawler.run() directly, but got the same error
adverse-sapphire
adverse-sapphire•3y ago
And which URL worked for you?
like-gold
like-gold•3y ago
I think some addresses don't allow themselves to be crawled. Try different URLs; if it works for one, it can work for others too
adverse-sapphire
adverse-sapphire•3y ago
I guess the target URL might be a JSON API endpoint. Try adding application/json to https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#additionalMimeTypes. (This won't fix your error, but you could then read the data from the response object.) If it is JSON, I would suggest using https://crawlee.dev/api/http-crawler instead
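If it really is JSON, a minimal HttpCrawler sketch could look like this (the endpoint URL is a hypothetical placeholder):

import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body, log }) {
        // body holds the raw response; parse it yourself if the endpoint returns JSON.
        const data = JSON.parse(body.toString());
        log.info(`Data from ${request.url}`, data);
    },
});

await crawler.run(['https://example.com/api/items']); // hypothetical endpoint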
rival-black
rival-blackOP•3y ago
I tried two URLs, crawlee.dev and github.com. Plus now I'm facing a new problem: I want to crawl through search engines like Google and Bing and crawl all of the links that appear in the search results. When feeding https://google.com/search?q=restaurants and setting maxRequestsPerCrawl to any number, it sends only one request
adverse-sapphire
adverse-sapphire•3y ago
Seems like the docs are a bit outdated; with CheerioCrawler you can read JSON data from the context object ({ json }) without passing the JSON MIME type. As for maxRequestsPerCrawl, that option is not what you think it is. It sends one request because the URL itself is the unique key. maxRequestsPerCrawl is a safeguard that stops the crawler if it finds more URLs than the option allows.
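For illustration, a CheerioCrawler sketch that reads { json } (additionalMimeTypes kept as a precaution, even if it may not be strictly needed):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Kept as a precaution; per the note above it may not be strictly required.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, json, log }) {
        // json is populated when the response content type is JSON.
        log.info(`JSON from ${request.url}`, { json });
    },
});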
rival-black
rival-blackOP•3y ago
I know that await enqueueLinks() is what lets it crawl more than one request, right? Setting that, I would expect to get 20 links from the Google search results, but why does it stop? @yellott
MEE6
MEE6•3y ago
@0xBitShoT just advanced to level 2! Thanks for your contributions! 🎉
adverse-sapphire
adverse-sapphire•3y ago
Sorry, you didn't specify you were using enqueueLinks; honestly, I have no idea. I've never parsed Google myself since there is an Apify scraper for that. Most likely it detects Cheerio immediately; try using browser-based crawlers if you want to implement it yourself
rival-black
rival-blackOP•3y ago
I am trying a browser-based crawler
adverse-sapphire
adverse-sapphire•3y ago
It's probably something to do with https://crawlee.dev/docs/examples/crawl-relative-links, but with the default EnqueueStrategy it should have crawled at least the Google links. If you want to scrape the Google search result URLs (and not crawl them), you need to collect them from the page using selectors.
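Roughly like this (the selector is a hypothetical placeholder; Google's markup changes often):

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, log }) {
        // Hypothetical selector; adjust it to the actual search result markup.
        const urls = $('div#search a')
            .map((_, el) => $(el).attr('href'))
            .get();
        // Store the collected URLs instead of enqueueing them.
        await Dataset.pushData({ query: request.url, urls });
        log.info(`Collected ${urls.length} result URLs`);
    },
});

await crawler.run(['https://www.google.com/search?q=restaurants']);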
MEE6
MEE6•3y ago
@yellott just advanced to level 4! Thanks for your contributions! 🎉
adverse-sapphire
adverse-sapphire•3y ago
I see. You need to start with 'https://www.google.com/search?q=restaurants', since Google redirects to that page from 'https://google.com/search?q=restaurants'. Or use the SameDomain strategy to enqueue all links to the Google domain, but I don't think that is what you want to achieve. A naive implementation of a crawler that walks through the search result pages and also enqueues URLs from them might look like this:
import { CheerioCrawler, createCheerioRouter, EnqueueStrategy } from 'crawlee';

const startUrls = ['https://www.google.com/search?q=restaurants'];
const searchPageNavUrlSelector = 'div[role="navigation"] table a';
const searchResultsUrlSelector = 'div[id="search"] div[data-sokoban-container] a[data-ved]';

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, request }) => {
    log.info(`Search page`, { url: request.loadedUrl });

    // Follow the pagination links, staying on the Google domain.
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        selector: searchPageNavUrlSelector,
    });

    // Enqueue the actual search result URLs, wherever they point.
    await enqueueLinks({
        strategy: EnqueueStrategy.All,
        selector: searchResultsUrlSelector,
        label: 'SEARCH_RESULT_URL',
    });
});

router.addHandler('SEARCH_RESULT_URL', async ({ request, log }) => {
    log.info(`Search result url:`, { url: request.loadedUrl });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // This is still only a safeguard in this implementation.
    maxRequestsPerCrawl: 30,
});

await crawler.run(startUrls);
rival-black
rival-blackOP•3y ago
Ah, now I received this error: CheerioCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 429 status code. {"id":"lbvAGmHKVGPGH6n","url":"https://google.com/search?q=restaurants","retryCount":2} I think Google is showing a captcha
adverse-sapphire
adverse-sapphire•3y ago
You need to use SERP proxies
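For example, with Apify's Google SERP proxy group; a sketch, assuming the usual Apify proxy URL format (replace <APIFY_PROXY_PASSWORD> with your own password):

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// <APIFY_PROXY_PASSWORD> is a placeholder for your own Apify proxy password.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://groups-GOOGLE_SERP:<APIFY_PROXY_PASSWORD>@proxy.apify.com:8000'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, log }) {
        log.info(`Fetched ${request.loadedUrl}`);
    },
});

// Note the www. prefix and plain HTTP: the SERP proxy requires both (see the docs quoted below).
await crawler.run(['http://www.google.com/search?q=restaurants']);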
rival-black
rival-blackOP•3y ago
ok, let me research that
adverse-sapphire
adverse-sapphire•3y ago
Btw, the Google URL is required to start with www. when using SERP proxies
rival-black
rival-blackOP•3y ago
why is it like that?
adverse-sapphire
adverse-sapphire•3y ago
From the docs https://docs.apify.com/platform/proxy/google-serp-proxy
Requests made through the proxy are automatically routed through a proxy server from the selected country and pure HTML code of the search result page is returned.

Important: Only HTTP requests are allowed, and the Google hostname needs to start with the www. prefix.

For code examples on how to connect to Google SERP proxies, see the examples page.
Alexey Udovydchenko
Alexey Udovydchenko•3y ago
To crawl the same URL again, the recommended approach is to add the request as { url, uniqueKey: [GENERATE_RANDOM_KEY_OR_USE_COUNTER] }, since adding an anchor like #COUNTER is actually in-page navigation (for a browser it means the same page is opened and the content is scrolled to the #anchor). In regards to Google search: save a snapshot if you are opening page(s) with a browser-based crawler, or save the body under Cheerio, then check the actual content available to the scraper at run time. If you are not getting links, it means the bot is blocked in one way or another.
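A sketch of the uniqueKey approach (the key values here are arbitrary):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processed ${request.url} (${request.uniqueKey})`);
    },
});

// The queue deduplicates on uniqueKey (the URL by default), so give each
// repeat request its own key to crawl the same URL more than once.
await crawler.run([
    { url: 'https://www.google.com/search?q=restaurants', uniqueKey: 'restaurants-1' },
    { url: 'https://www.google.com/search?q=restaurants', uniqueKey: 'restaurants-2' },
]);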
Lukas Krivka
Lukas Krivka•3y ago
btw: for debugging, just store the HTML in the KV store to see what was loaded; then you can see whether it was HTML, JSON, or text
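Something like this inside the handler (the key name is arbitrary):

import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, body, contentType, log }) {
        // Persist the raw response so you can inspect what was actually loaded.
        await KeyValueStore.setValue('LAST_RESPONSE', body, { contentType: contentType.type });
        log.info(`Stored response from ${request.loadedUrl}`);
    },
});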
