stormy-gold (3y ago)

Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database rather than in Crawlee's native KeyValueStore and Dataset. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool more easily…
15 Replies
Lukas Krivka (3y ago)
Not sure about examples, but Crawlee already has a generic storage API that can be implemented. We already support two implementations, the Apify API and the local filesystem, so you can add a third one.
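For context, a custom storage client appears to be wired in through Crawlee's Configuration. This is only a sketch of my reading of the docs: the storageClient option name and the @crawlee/types import are assumptions, and ArangoStorageClient is a hypothetical class, not anything shipped by Crawlee.

```ts
import { CheerioCrawler, Configuration } from "crawlee";
import type { StorageClient } from "@crawlee/types";

// Hypothetical: an ArangoDB-backed implementation of Crawlee's StorageClient
// interface (factories for dataset, key-value store and request queue clients).
declare const arangoStorageClient: StorageClient;

// Crawlers accept a Configuration as their second constructor argument;
// everything they read or write then goes through the custom client.
const config = new Configuration({ storageClient: arangoStorageClient });

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, pushData }) {
            await pushData({ url: request.url }); // stored via the custom client
        },
    },
    config,
);
```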
Alexey Udovydchenko
I wanted to use some database but did not find any good matches. Basically it's either something external in another cloud (like Firebase), otherwise it doesn't make a lot of sense to use one. Embedded DBs are technically possible, but because of "migration" they are imho nearly useless (an actor might be shut down, moved to another server instance, and then restarted at any point of the runtime).
stormy-gold (OP, 3y ago)
We're using ArangoDB, a "multi-model" database with native support for MongoDB-style document storage and Neo4j-style graph queries in the same data store. It has proven very useful for complex analysis of large sites, with queries like "find high-traffic pages that are fewer than 5 clicks from that page, but only if the links are in the main body of an article, not the footer".
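For a sense of what that kind of query looks like in practice, here is a rough arangojs sketch. The pages and links collections, the region and traffic attributes, and the "pages/home" start vertex are all made up for illustration.

```ts
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529", databaseName: "crawl" });

// "High-traffic pages fewer than 5 clicks from the home page, reached only
// through links that appear in the article body", as an AQL graph traversal.
const cursor = await db.query(aql`
    FOR page, link, path IN 1..4 OUTBOUND "pages/home" links
        FILTER path.edges[*].region ALL == "body"
        FILTER page.traffic > 1000
        RETURN DISTINCT { url: page.url, traffic: page.traffic }
`);
console.log(await cursor.all());
```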
other-emerald (3y ago)
@Lukas Krivka Can you provide some references on Crawlee's generic storage API?
Lukas Krivka (3y ago)
@LeMoussel Example here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts It uses this.client, which is any class that implements DatasetClient, e.g. here: https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/dataset.ts#L34 I will tell the team to provide more examples.
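To give a flavour of what the linked resource client might look like when backed by Arango, here is a rough, partial sketch. The pushItems and listItems names mirror the memory-storage client linked above; the real DatasetClient interface in @crawlee/types has more methods (get, update, delete, downloadItems) and a richer paginated return shape, so treat this as illustrative only. The collection naming scheme is made up.

```ts
import { Database } from "arangojs";

// Partial, illustrative sketch of a dataset resource client that stores
// items in an ArangoDB collection (one collection per dataset id).
class ArangoDatasetClient {
    constructor(
        private readonly db: Database,
        readonly id: string,
    ) {}

    private get collection() {
        return this.db.collection(`dataset_${this.id}`);
    }

    // Mirrors memory-storage's pushItems(): append one or more items.
    async pushItems(items: Record<string, unknown> | Record<string, unknown>[]) {
        const docs = Array.isArray(items) ? items : [items];
        await this.collection.saveAll(docs);
    }

    // Mirrors memory-storage's listItems(): paginated read-back of stored items.
    async listItems({ offset = 0, limit = 1000 } = {}) {
        const cursor = await this.db.query({
            query: "FOR doc IN @@col LIMIT @offset, @limit RETURN doc",
            bindVars: { "@col": this.collection.name, offset, limit },
        });
        const items = await cursor.all();
        return { items, offset, limit, count: items.length };
    }
}
```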
Alexey Udovydchenko
@eaton I think a more direct approach is https://github.com/arangodb/arangojs with exactly this:
const db = new Database({
url: "http://YOURDOMAIN_OR_IP:8529",
databaseName: "pancakes",
auth: { username: "root", password: "hunter2" },
});
and make sure you are handling your data along with the handled requests; that should be enough. As already mentioned, you must have your own hosted solution.
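Even without a full storage client, keeping scraped data next to the handled requests can look roughly like this sketch. The pages collection, the URL-based key scheme, and the crawler options are illustrative assumptions, not anything from the thread.

```ts
import { PlaywrightCrawler } from "crawlee";
import { Database } from "arangojs";

const db = new Database({
    url: "http://YOURDOMAIN_OR_IP:8529",
    databaseName: "pancakes",
    auth: { username: "root", password: "hunter2" },
});
const pages = db.collection("pages"); // hypothetical collection for crawl results

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Upsert the page record keyed by its URL so the stored data stays in
        // step with the handled request (note: very long URLs would exceed
        // Arango's document key length limit).
        await pages.save(
            {
                _key: encodeURIComponent(request.url),
                url: request.url,
                title: await page.title(),
            },
            { overwriteMode: "update" },
        );
        await enqueueLinks();
    },
});

await crawler.run(["https://example.com"]);
```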
stormy-gold (OP, 3y ago)
@Alexey Udovydchenko Yeah, we're already using arangojs to map site data to a custom domain model! But we're finding that we have to do more and more housekeeping to ensure that Crawlee's request queue and other data stay in sync; unifying them seems like it would be a big win, but I was concerned we'd be biting off a huge chunk of work. From the code that @Lukas Krivka posted, it looks like it's at least in the realm of 'reasonable to consider'.
other-emerald (3y ago)
@eaton If you open-source the code for this, let me know.
stormy-gold (OP, 3y ago)
@LeMoussel It's quite rough at the moment, but the project we've been working on is already on GitHub: https://github.com/autogram-is/spidergram There's a lot of "ugh, we need to improve that" in there; in particular, we have a clunky wrapper around PlaywrightCrawler that we're going to replace with a custom BrowserCrawler implementation, but it does the work. Most of what we do is less "scraping" and more "building a map of several interlinked web sites and using graph queries to tease out structural patterns", which is why we end up going in a few slightly different directions. https://github.com/autogram-is/spidergram/blob/main/OVERVIEW.md explains a bit more about the domain model it maintains.
Alexey Udovydchenko
Oh, so you are not expecting to host your solution in the Apify cloud (https://github.com/autogram-is/spidergram/blob/main/package.json), right?
stormy-gold (OP, 3y ago)
At least not for the time being. We've been doing all our work locally to bootstrap the project and may eventually build out Apify actors for it, but at the moment we're just slinging around about 4-5 GB of crawled data locally, heh.
Alexey Udovydchenko
Well, as I see it, actors are designed to be isolated, so it might be worth considering doing it that way from the very beginning.
Lukas Krivka (3y ago)
I still plan to check the code, thanks for open-sourcing it.
