vicious-gold•2y ago
What would be the best way to use Crawlee to optimize for speed?
I'm running an instance of Playwright or Cheerio depending on conditions.
My use-case is crawling a website and finding its pricing page. Most target websites are SaaS marketing websites, and about half of them use React.
My current setup is a simple Express server on Google Cloud Run, as per the tutorial on the Crawlee website.
I'm aiming for high performance - I want to get as close as possible to a < 5s response time. I'm looking at > 10s right now, and > 20s in worst cases per website.
Factors that are slowing me down:
1. There's a warm-up period for Google Cloud Run in case the instance is not up. I guess this can be fixed by moving to a dedicated server, but it's not a factor during periods of intense use, so this is low on my agenda
2. It takes a while for Playwright or Cheerio to get started - 2-3 seconds in the best case. Is there a way to keep it "warm" to improve these numbers?
3. I think there's an issue with starting multiple instances of Playwright. The way I built it now, it takes one root URL and crawls up to 5 pages until it finds something that looks like the pricing page. I would like to batch several websites into one request, but that breaks the crawl logic because I set maxRequestsPerCrawl: 5, and if I give it multiple websites to crawl, it maxes out the maxRequests limit on the first one. So the question here is two-fold: a. Is there any way I can stop the Playwright instance once I find the specific page I'm looking for? b. Can I run multiple Playwright instances in parallel? If so, how many?
Also, perhaps my whole thinking is wrong here? What else can I do to improve performance?
6 Replies
There will of course be a lot of ways we can try to optimize the code, but I think an easy and scalable way to do it would be to use something like scrapy-splash (a JS renderer) to render the webpage and then use cheerio on top of that.
How does it work?
Well, instead of spinning up multiple Playwright browser instances, you can create one or more instances of a scrapy-splash server (which takes in a URL, renders it, and returns the HTML).
Now, from your Crawlee code you can call the scrapy-splash server (with a URL), and once you have the rendered HTML, you can use cheerio to parse it. The scrapy-splash servers are running all the time, so this may significantly reduce your cold-start issues. You can even load balance across more than one scrapy-splash instance to scale up.
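Roughly, the flow could look like the sketch below - this assumes a Splash instance listening on localhost:8050 and its render.html endpoint; the wait time and the pricing/plans filter are just illustrative, not part of the setup described above.

import * as cheerio from "cheerio";

// Ask the (assumed) local Splash server to render the page and return the
// final HTML after JavaScript has executed.
const targetUrl = "https://example.com/";
const splashUrl = `http://localhost:8050/render.html?url=${encodeURIComponent(targetUrl)}&wait=1`;

const response = await fetch(splashUrl);
const html = await response.text();

// Parse the rendered HTML with cheerio instead of keeping a browser open.
const $ = cheerio.load(html);
const links = $("a[href]")
  .map((_, el) => $(el).attr("href"))
  .get();
console.log(links.filter((href) => /pricing|plans/.test(href)));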
vicious-goldOP•2y ago
Cool idea, but how can I set this up to work with Cheerio? I need actual crawling capabilities here to find the page that I'm looking for.
You can use Crawlee's HttpCrawler to send the target URL to the Splash server and get the rendered HTML back as the response, which you can then parse using cheerio.
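A minimal, untested sketch of that idea - it assumes the same hypothetical Splash endpoint on localhost:8050 and stashes the real target URL in userData so the handler knows which site the rendered HTML belongs to:

import { HttpCrawler } from "crawlee";
import * as cheerio from "cheerio";

// The crawler fetches the Splash endpoint; Splash fetches and renders the real site.
const splashUrlFor = (target) =>
  `http://localhost:8050/render.html?url=${encodeURIComponent(target)}&wait=1`;

const crawler = new HttpCrawler({
  async requestHandler({ request, body }) {
    // body is the JS-rendered HTML returned by Splash, not the raw page source.
    const $ = cheerio.load(body.toString());
    console.log(request.userData.target, $("title").text());
  },
});

const targets = ["https://example.com/"];
await crawler.run(
  targets.map((target) => ({
    url: splashUrlFor(target),
    userData: { target },
  }))
);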
You can keep the crawler instance alive and just add stuff to the queue; that might reduce some of the startup time
https://discord.com/channels/801163717915574323/1170295320161308722/1170295320161308722
I think for browsers, 5 seconds is quite optimistic for a random URL. For HTTP it is totally doable.
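For reference, a rough sketch of that keep-alive approach - it assumes Crawlee's keepAlive crawler option (so run() keeps going after the queue empties); the route name and handler body here are placeholders:

import express from "express";
import { HttpCrawler } from "crawlee";

const app = express();

// Created once at server startup. With keepAlive the crawler stays running
// after the queue empties, so run() is intentionally not awaited here.
const crawler = new HttpCrawler({
  keepAlive: true,
  async requestHandler({ request }) {
    // ... the existing pricing-page logic would go here ...
    console.log("crawled", request.url);
  },
});
crawler.run().catch(console.error);

// Each incoming request only enqueues work on the already-warm crawler.
app.get("/crawl", async (req, res) => {
  const { url } = req.query;
  await crawler.addRequests([url]);
  res.send("queued");
});

app.listen(3000);

Getting the scraped items back into the HTTP response then needs its own plumbing (for example polling the dataset or keying results by request), which this sketch leaves out.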
vicious-goldOP•2y ago
I tried scrapy-splash but had issues with it. Instead, I'm now using the HTTP crawler and switching to Playwright when I need to render JS.
I also ditched cloud functions in favor of a dedicated server.
Now, my server is a very simple Express server with two GET routes: one starts an HTTP crawler and the other a Playwright crawler. If I send my requests one by one, I get very quick responses. But the problem is that I do 5-10 parallel requests, and for some reason this is really challenging for the scraper. What's the best way to structure my scraper to fix this issue?
app.get("/pricing-html", async (req, res) => {
try {
const url = req.query;
if (!url) {
return res.status(400).send("url query parameter is required");
}
const crawler = new HttpCrawler(
{
maxRequestsPerCrawl: 5,
maxRequestRetries: 0,
navigationTimeoutSecs: 5,
async requestHandler({ request, body, pushData }) {
const html = body.toString();
if (
request.url.includes("pricing")
request.url.includes("plans")
request.url.includes("product") ||
request.url.includes("features")
) {
const framework = detectFramework(html);
if (framework === "Unknown") {
const title = extractTitle(html);
const content = extractContent(html);
await pushData({
url: request.loadedUrl,
title,
content,
});
}
} else {
const links = extractLinks(html).filter((link) => [
/^.pricing.$/,
/^.plans.$/,
/^.product.$/,
/^.features.$/,
].some((regexp) => regexp.test(link)));
const absoluteUrls = links .map((link) => new URL(link, request.loadedUrl).href) .filter((link) => { const linkHostName = new URL(link).hostname; const requestHostName = new URL(request.loadedUrl).hostname; return linkHostName === requestHostName; }); await crawler.addRequests(absoluteUrls); } }, }, new Configuration({ persistStorage: false, }) ); await crawler.run([url]); const result = await crawler.getData(); const { items } = result; return res.send(items); } catch (e) { console.error(e); return res.status(500).send(e.message); } }); This ^ is a route example, this one for the http crawler and I have pretty much similar for playwright - I do "new crawler" in both
1. 10 parallel requests shouldn't be a problem for HTTP since HTML parsing is relatively light on the CPU, but it will be slower with Playwright, so you would need a faster machine that can put each browser on a separate CPU core.
2. I recommend setting desiredConcurrency, or maybe even minConcurrency, to a higher number for max speed if you do more requests inside one crawler (see the sketch after this list).
3. As I shared above, you don't need to spawn a separate crawler; you can keep them running. But to be honest, it might not have a real performance impact
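For point 2, a small sketch of those concurrency knobs - minConcurrency, maxConcurrency, and autoscaledPoolOptions.desiredConcurrency are standard crawler options, but the numbers below are arbitrary and would need tuning against your machine:

import { HttpCrawler } from "crawlee";

const crawler = new HttpCrawler({
  // Start near the target concurrency instead of letting the autoscaler
  // ramp up slowly from 1.
  minConcurrency: 5,
  maxConcurrency: 10,
  autoscaledPoolOptions: {
    desiredConcurrency: 8,
  },
  async requestHandler({ request, body }) {
    console.log(request.url, body.length);
  },
});

await crawler.run(["https://example.com/"]);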