Apify Discord Mirror

Updated 2 months ago

Goodbye Crawlee (migrated to Hero)

At a glance

A community member migrated their scraping code from Crawlee to Hero because they found Crawlee's API over-engineered for their simple use case and Hero's API simpler. Retries and HTML processing are handled in a separate program, which talks to the scraper through a beanstalkd message queue. Another community member evaluated Hero but found it less capable than Crawlee and still blocked by Cloudflare detection most of the time; the Crawlee team is reportedly working on a solution for Cloudflare detection.

I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works. Everything that worked with Crawlee also works with Hero.

Why I migrated: I could no longer cope with the over-engineered Crawlee API (and the bugs related to it).
It was just too many APIs (different APIs!) for my simple case.
Hero's API is about 5 times simpler.

In both cases (Crawlee and Hero) I use only the scraping library: no additional (cloud) services, no Docker containers.

I am not manipulating the DOM, not doing retries, not doing anything complex in TypeScript. I just access the URL (in some cases URL1 first and then URL2, to pretend I'm a normal user), grab the rendered HTML, and that's it. All the HTML manipulation (extracting the data from the HTML) is done in a completely different program (written in a different programming language, not in TypeScript).
Retry logic: again, this is implemented in that different program.
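The "visit URL1, then URL2, grab the rendered HTML" flow described above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: `PageSession`, `openSession` and `fetchRenderedHtml` are hypothetical names, and the adapter that would wrap a real browser library such as Hero is deliberately left out.

```typescript
// Hypothetical minimal interface (an assumption, not part of Hero or Crawlee).
// A real adapter would wrap whichever browser library is in use.
interface PageSession {
  goto(url: string): Promise<void>;
  renderedHtml(): Promise<string>;
  close(): Promise<void>;
}

// Optionally visit a warm-up URL first, then the target URL, and return the
// rendered HTML of the target. No DOM manipulation and no retries happen here:
// errors simply propagate to the caller.
async function fetchRenderedHtml(
  openSession: () => Promise<PageSession>,
  targetUrl: string,
  warmUpUrl?: string,
): Promise<string> {
  const session = await openSession();
  try {
    if (warmUpUrl !== undefined) {
      await session.goto(warmUpUrl);
    }
    await session.goto(targetUrl);
    return await session.renderedHtml();
  } finally {
    // Always release the browser session, even when navigation fails.
    await session.close();
  }
}
```

Keeping the scraper this thin is what makes it easy to swap one library for another: only the adapter behind `PageSession` would need to change.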

I use beanstalkd (see https://github.com/beanstalkd/beanstalkd/) as the message queue between that "different program" and the scraper. So I just replaced the Crawlee-based scraper with a Hero-based scraper without touching other parts of the system. The use of beanstalkd has already been discussed in this forum: use search to find those discussions.
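To illustrate why the swap did not touch the rest of the system, here is a minimal sketch of a queue-driven scraper worker. It makes several assumptions: `InMemoryTube` is an in-memory stand-in for a beanstalkd tube (no real beanstalkd client is used), the `Job` and `Result` shapes are invented, and sending results back through a second tube is a guess about how the two programs communicate.

```typescript
// Hypothetical message shapes (assumptions, not taken from the post).
interface Job { id: number; url: string; }
type Result =
  | { jobId: number; url: string; ok: true; html: string }
  | { jobId: number; url: string; ok: false; error: string };

// In-memory stand-in for a beanstalkd tube; a real setup would use a
// beanstalkd client library instead.
class InMemoryTube<T> {
  private items: T[] = [];
  put(item: T): void { this.items.push(item); }
  reserve(): T | undefined { return this.items.shift(); }
}

// Take jobs until the tube is empty, fetch each URL with the injected
// function, and report success or failure. The worker never retries:
// that decision is left to the program on the other side of the queue.
async function drainJobs(
  jobs: InMemoryTube<Job>,
  results: InMemoryTube<Result>,
  fetchHtml: (url: string) => Promise<string>,
): Promise<number> {
  let processed = 0;
  for (let job = jobs.reserve(); job !== undefined; job = jobs.reserve()) {
    try {
      const html = await fetchHtml(job.url);
      results.put({ jobId: job.id, url: job.url, ok: true, html });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.put({ jobId: job.id, url: job.url, ok: false, error });
    }
    processed += 1;
  }
  return processed;
}
```

Because the worker only depends on `fetchHtml`, replacing one scraping library with another means replacing that single function while the queue contract stays the same.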

Goodbye Crawlee.
3 comments
Hey @new_in_town, thanks to this post I spent a day evaluating Hero. I kind of like it, but I think it's an apples-to-oranges comparison. I have used every Crawlee API and I love them. The only reason I am looking elsewhere is Cloudflare detection. I would argue Hero is not simpler if you take into account the different clients, plugins, and moving pieces. It took me hours to access an iframe, and finally I got stuck on a client-rendered page: Hero just doesn't recognize elements that are rendered there. Also, after a day of evaluation I concluded that Cloudflare still blocks it 95% of the time. The Crawlee team is working on a new solution targeting that issue. I hope it gets out soon.
@Jeno could you link to where the team is working on Cloudflare?