stormy-gold•3y ago
Custom storage provider for RequestQueue?
It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database rather than Crawlee's native KVS and DataSet. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would let me query and manage the request pool more easily…
15 Replies
Not sure about examples, but Crawlee already has a generic storage API that can be implemented. We already support two implementations, the Apify API and the local filesystem, so you can add a third one.
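Roughly, the wiring looks like this; a minimal sketch, assuming the `storageClient` configuration option and a hypothetical `ArangoStorageClient` class that implements the storage client contract from `@crawlee/types`:

```ts
import { Configuration, CheerioCrawler } from 'crawlee';
// Hypothetical class: the contract to satisfy is the StorageClient interface
// from @crawlee/types (datasets, keyValueStores, requestQueues, plus the
// per-resource clients such as DatasetClient / RequestQueueClient).
import { ArangoStorageClient } from './arango-storage-client.js';

// Assumption: the `storageClient` option accepts any object implementing that
// interface, the same way @crawlee/memory-storage is plugged in.
const config = new Configuration({
    storageClient: new ArangoStorageClient({ url: 'http://localhost:8529' }),
});

// Crawler constructors take an optional Configuration as the second argument.
const crawler = new CheerioCrawler({
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
}, config);

await crawler.run(['https://example.com']);
```

The per-resource clients are where the actual database calls would live; the `Configuration` object above only tells Crawlee which client to use.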
I wanted to use some database myself but did not find any good matches. Basically it's either something external in another cloud (like Firebase), or it does not make a lot of sense to use it. Embedded DBs are technically possible, but because of "migrations" they are nearly useless in my opinion (an actor might be shut down, moved to another server instance, and then restarted at any point of its runtime).
stormy-goldOP•3y ago
We're using ArangoDB: it's a "multi-model" database that has native support for MongoDB-style document storage and Neo4j-style graph queries in the same data store. It has proven very useful for complex analysis of large sites, with queries like "find high-traffic pages that are fewer than 5 clicks from that page, but only if the links are in the main body of an article, not the footer".
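For a rough sense of what those queries look like through arangojs, here is an illustrative sketch; the 'sites' graph, 'pages' collection, and the 'region'/'traffic' attributes are invented names rather than part of any real schema:

```ts
import { Database, aql } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'crawl' });

// Illustrative only: traverse up to 5 hops out from a starting page, keep only
// links that were found in the article body, and return the high-traffic
// destinations. Graph, collection, and attribute names are made up.
const cursor = await db.query(aql`
    FOR page, link IN 1..5 OUTBOUND ${'pages/start-page'} GRAPH 'sites'
        FILTER link.region == 'article-body'
        FILTER page.traffic > 10000
        RETURN DISTINCT page.url
`);
console.log(await cursor.all());
```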
other-emerald•3y ago
@Lukas Krivka Can you provide some references on Crawlee's generic storage API?
@LeMoussel Example here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts
It uses `this.client`, which is any class that implements DatasetClient, e.g. here: https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/dataset.ts#L34
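As a sketch of what an Arango-backed resource client might look like (the method names mirror the memory-storage DatasetClient linked above, but treat the exact signatures as approximate):

```ts
import { Database } from 'arangojs';

// Rough sketch of a Dataset resource client backed by an ArangoDB collection.
// The real contract to implement is the DatasetClient interface from
// @crawlee/types; only two methods are shown and the shapes are approximate.
export class ArangoDatasetClient {
    constructor(private db: Database, private name: string) {}

    async pushItems(items: Record<string, unknown> | Record<string, unknown>[]): Promise<void> {
        const docs = Array.isArray(items) ? items : [items];
        await this.db.collection(this.name).saveAll(docs);
    }

    async listItems(options: { offset?: number; limit?: number } = {}) {
        const { offset = 0, limit = 100 } = options;
        const cursor = await this.db.query({
            query: 'FOR doc IN @@col LIMIT @offset, @limit RETURN doc',
            bindVars: { '@col': this.name, offset, limit },
        });
        return { items: await cursor.all(), offset, limit };
    }
}
```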
I will tell the team to provide more examples.
@eaton I think a more direct approach is to use https://github.com/arangodb/arangojs directly,
and make sure you handle your data along with the handled requests; that should be enough. As already mentioned, you would need your own hosted solution.
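Roughly, that could mean writing straight to Arango from the request handler and recording when each request was handled; a sketch with made-up collection names:

```ts
import { PlaywrightCrawler } from 'crawlee';
import { Database } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'crawl' });

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Write the page document straight to Arango instead of a Crawlee Dataset.
        // 'pages' is a made-up collection name; the _key is derived from the
        // request's uniqueKey so a retried request overwrites its own document.
        await db.collection('pages').save(
            {
                _key: request.uniqueKey.replace(/[^a-zA-Z0-9_\-:.]/g, '_'),
                url: request.url,
                title: await page.title(),
                handledAt: new Date().toISOString(),
            },
            { overwriteMode: 'replace' },
        );
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```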
stormy-goldOP•3y ago
@Alexey Udovydchenko Yeah, we're already using arangojs to map site data to a custom domain model! But we're finding that we have to do more and more housekeeping to ensure that Crawlee's request queue and other data stay in sync; unifying them seems like it would be a big win, but I was concerned we'd be biting off a huge chunk of work. From the code that @Lukas Krivka posted, it looks like it's at least in the realm of 'reasonable to consider'.
other-emerald•3y ago
@eaton If you open-source the code for this, let me know.
stormy-goldOP•3y ago
@LeMoussel It's quite rough at the moment, but the project we've been working on is already on GitHub: https://github.com/autogram-is/spidergram There's a lot of "ugh, we need to improve that" in there; in particular, we have a clunky wrapper around PlaywrightCrawler that we're going to be replacing with a custom BrowserCrawler implementation, but it does the work.
Most of what we do is less "scraping" and more "building a map of several interlinked web sites and using graph queries to tease out structural patterns", which is why we end up going in a few slightly different directions.
https://github.com/autogram-is/spidergram/blob/main/OVERVIEW.md explains a bit more about the domain model it maintains
Oh, so you are not expecting to host your solution in the Apify cloud (https://github.com/autogram-is/spidergram/blob/main/package.json), right?
stormy-goldOP•3y ago
At least not for the time being. We've been doing all our work locally to bootstrap the project and may eventually build out Apify actors for it, but at the moment we're just slinging around about 4-5 GB of crawled data locally, heh
Well, as I see it, actors are designed to be isolated, so it might be worth designing for that from the very beginning.
I still plan to check the code, thanks for open sourcing it