What would be the best way to use Crawlee to optimize for speed?
My use-case is crawling a website and finding its pricing page. Most target websites are SaaS marketing sites, and about half of them use React.
My current setup is a simple Express server on Google Cloud Run, as per the tutorial on the Crawlee website.
I'm aiming for high performance: ideally a response time under 5s. Right now I'm seeing over 10s per website, and over 20s in the worst cases.
Factors that are slowing me down:
1. There's a warm-up period for Google Cloud Run when no instance is running. I guess this can be fixed by moving to a dedicated server, but it's not a factor during periods of intense use, so it's low on my agenda.
2. It takes a while for Playwright or Cheerio to start: 2-3 seconds in the best case. Is there a way to keep them "warm" to improve these numbers?
3. I think there's an issue with starting multiple instances of Playwright. As built now, the crawler takes one root URL and crawls up to 5 pages until it finds something that looks like the pricing page. I'd like to batch several websites into one request, but that breaks the crawl logic: I set maxRequestsPerCrawl: 5, so the first website maxes out the limit before the others get a turn. The question here is two-fold:
   a. Is there a way to stop the Playwright instance once I find the specific page I'm looking for?
   b. Can I run multiple Playwright instances in parallel? If so, how many?
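For reference, a trimmed-down sketch of my current crawler. looksLikePricing is a simplified stand-in for my real detection heuristic, and crawlee is loaded lazily inside the function so the browser only spins up when a crawl actually runs (requires npm install crawlee playwright):

```javascript
// Simplified heuristic for spotting a pricing page from its URL and visible text.
function looksLikePricing(url, bodyText) {
  if (/\/(pricing|plans|subscribe)\b/i.test(url)) return true;
  return /per (month|user|seat)|free trial/i.test(bodyText);
}

// Crawls up to 5 pages from rootUrl, returns the pricing page URL if found.
async function findPricingPage(rootUrl) {
  const { PlaywrightCrawler } = await import('crawlee'); // loaded lazily
  let found = null;
  const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 5, // <-- the limit that breaks when I batch multiple sites
    async requestHandler({ request, page, enqueueLinks }) {
      const text = await page.innerText('body');
      if (looksLikePricing(request.url, text)) {
        found = request.url;
        return; // but the crawler keeps processing the remaining enqueued pages
      }
      await enqueueLinks({ strategy: 'same-domain' });
    },
  });
  await crawler.run([rootUrl]);
  return found;
}

module.exports = { looksLikePricing, findPricingPage };
```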
Also, perhaps my whole thinking is wrong here? What else can I do to improve performance?
