optimistic-gold•17mo ago
Scraper testing
OK, I had a look at the documentation and there is one point that isn't completely clear to me: testing. Is there anywhere I can find the recommended approach to testing scrapers?
In other scraping systems I built, I used to save pages in my project and run tests by pointing the scrapers at those saved pages.
All within the context of a unit testing library, of course, not just manually running the scraper on a saved page.
8 Replies
hey @Pedro, have you looked into these articles: https://docs.apify.com/platform/actors/development/deployment/automated-tests
https://apify.com/pocesar/actor-testing
optimistic-goldOP•17mo ago
But these are specifically for testing Apify Actors, right? I'm looking to test a Crawlee crawler independently of Apify. I wouldn't like to couple my crawlers so tightly to Apify yet, and definitely not just for the sake of testing.
I'm looking for an article where Jest is used to test the crawler, or some simple setup along those lines.
I find it interesting that there are barely any resources online covering unit testing with Crawlee. As fancy as all the Crawlee features are, I'm starting to think that lacking something so basic is a big no-go for me. I'll keep searching, but this is not a good sign.
Also, I understand that Crawlee intends to be a standalone library, with easy integration into Apify, but one that can be used independently at full capacity, right? The fact that both testing resources above live within Apify, for Actors, raises a concern that Crawlee's purpose might just be an entry point to upsell a paid Apify service.
Don't get me wrong, I appreciate the easy integration with Apify; in fact, I intend to use it because it might give me an easy and fast way to deploy my crawlers. But I don't want to build a set of crawlers where I'm tied to Apify for something as basic as testing.
Thank you so much for the feedback.
Crawlee is a standalone library in itself; the resources I shared were for Actors, my mistake for sharing the Apify ones 🙂
And you are totally right about the testing section; I will pass it to the team and we will add it very soon. Just FYI, we are also working on a full-fledged guide for Crawlee as well, but that will take time.
We are constantly looking to improve Crawlee; feedback like this always helps.
Generally, we do:
- unit tests - test the parsers against HTML/JSON snapshots of the website (these won't catch it when the website changes)
- end-to-end tests - run the crawler and check the dataset, or even start a whole new process.
optimistic-goldOP•17mo ago
Exactly, that's what I want to do: unit tests against HTML snapshots of the website. How are you doing that with a PlaywrightCrawler and with a CheerioCrawler?
I tried passing a local file to the PlaywrightCrawler as an argument to crawler.run(), but the validator complains that the scheme should be either http or https.
I passed the local file as a "file://..." URL.
Hey, when you refer to unit tests against HTML snapshots, I understand it as testing the parsing logic that uses the cheerio API. Currently, the best approach seems to be to extract the parsing logic from the request handler, then load the HTML file with cheerio manually and test your parsers on that, without running the crawler.
optimistic-goldOP•16mo ago
I understand, though I'm not sure how tedious that gets. The goal here is to nail down the parsing logic, but also to be able to quickly save a snapshot of the site when something changes and have a convenient way to adapt the logic.
I found a solution that doesn't require modifying the crawler at all: spinning up a simple HTTP server during my tests so the crawler can target those snapshots.
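One way to build such a snapshot server with only Node's stdlib (the function name, directory layout, and port handling are illustrative; in a Jest setup you would start it in `beforeAll()` and close it in `afterAll()`):

```typescript
import http from 'node:http';
import { readFile } from 'node:fs/promises';
import path from 'node:path';

// Serve HTML snapshot files from `dir` over plain HTTP, so a crawler that
// only accepts http/https URLs can be pointed at local fixtures.
// Listening on port 0 lets the OS pick a free port.
export function serveSnapshots(dir: string, port = 0): Promise<http.Server> {
  const server = http.createServer(async (req, res) => {
    try {
      // Map the request path onto a file inside the snapshot directory.
      const rel = path.normalize(req.url ?? '/').replace(/^[/\\]+/, '');
      const file = path.join(dir, rel || 'index.html');
      const body = await readFile(file);
      res.writeHead(200, { 'content-type': 'text/html' });
      res.end(body);
    } catch {
      res.writeHead(404).end('not found');
    }
  });
  return new Promise((resolve) => server.listen(port, () => resolve(server)));
}
```

The test would then read the bound port from `server.address()` and call `crawler.run(['http://localhost:<port>/snapshot.html'])`, asserting on the dataset afterwards.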
nice solution!