rare-sapphire
rare-sapphire•3y ago

Approach to storing scraped data in a database (Postgres)

(Apologies for the crosslink: https://github.com/apify/crawlee/discussions/1577) Hi, I recently discovered Crawlee and I'm trying to figure out how I can store the scraped data in a database instead of in the local directory storage. Is there any plugin for that? How should I proceed to implement one? Must I code my own class that implements the StorageClient interface? If so, how do I inject it later so it gets used? Thanks!
13 Replies
HonzaS
HonzaS•3y ago
You need to implement your own logic: instead of Dataset.push(), just call an insert to your DB.
rare-sapphire
rare-sapphireOP•3y ago
Isn't it good practice, or doesn't it have any benefit, to implement StorageClient?
national-gold
national-gold•3y ago
If you want your crawler to be practical and performant, I wouldn't recommend pushing into a Dataset, then into your PostgreSQL database. At that point, the Dataset would just be an unnecessary middle man. The only way that'd be beneficial is if you'd like to validate the data with some custom scripts before actually pushing it into the production DB. Otherwise, just push directly into your DB.
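The validate-before-pushing idea mentioned above could look something like this; the item shape and validation rules are purely illustrative:

```typescript
// Hypothetical shape of a scraped record.
interface ScrapedItem { url: string; title: string; }

// Type-guard validation: only well-formed items pass.
function isValidItem(item: Partial<ScrapedItem>): item is ScrapedItem {
    return typeof item.url === 'string'
        && item.url.startsWith('http')
        && typeof item.title === 'string'
        && item.title.length > 0;
}

const scraped: Partial<ScrapedItem>[] = [
    { url: 'https://example.com', title: 'Example' },
    { url: 'not-a-url', title: '' }, // would be rejected
];

// Only validated items would then be inserted into the production DB.
const valid = scraped.filter(isValidItem);
```

Running the filter on the sample above keeps only the first item, so nothing malformed reaches the production table.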
rare-sapphire
rare-sapphireOP•3y ago
Thanks Matt, I meant implementing a custom StorageClient, so that when you write Dataset.push() the data is actually stored in Postgres instead of in the local filesystem.
Alexey Udovydchenko
Alexey Udovydchenko•3y ago
It's not a common case, so it's not covered by the SDK. IMHO, just use an external package like https://github.com/supabase/supabase
old-apricot
old-apricot•3y ago
Yeah, I'm actually using a graph database to store crawl results, and it performs very well — the only hitch has been making sure that my logic for what constitutes a "unique item" etc. meshes with Crawlee's.
national-gold
national-gold•3y ago
At that point, I'd recommend just using Sequelize to connect to your remote database and push data into it. Sequelize is (in my opinion) the best ORM.
absent-sapphire
absent-sapphire•14mo ago
Hi all, I'm looking to push straight to Postgres. Wondering if anyone would be willing to share their implementation? @acanimal, sorry to ping, did you implement this?
rare-sapphire
rare-sapphire•10mo ago
Sorry to necro an older thread, but I'm looking at pushing data into Postgres as well. Is the suggestion to skip Dataset.push entirely and just save directly into the DB? I haven't seen any examples of using Postgres (or any database, for that matter).
like-gold
like-gold•10mo ago
This is something I am wanting to do as well
Strijdhagen
Strijdhagen•9mo ago
I use Supabase as my Postgres platform and simply await an insert into my table within the request handler.
stormy-gold
stormy-gold•9mo ago
I recently implemented a custom storage client to store request queues in Postgres, as the storage costs for request queues on Apify were too high. It reduced my costs from 500 USD per month to 25 (the 25 is for the managed Postgres service). The same approach can also be extended to store datasets, but I only did it for request queues; for datasets and key-value stores, the custom client still uses the Apify storage.
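One possible table layout for backing a request queue with Postgres, along the lines described above; every column name here is a guess, since the poster didn't share their schema:

```typescript
// Hypothetical DDL for a Postgres-backed request queue. A unique id
// deduplicates requests, the serialized Request lives in JSONB, and
// `handled_at` stays NULL until the request has been processed.
const CREATE_QUEUE_TABLE = `
CREATE TABLE IF NOT EXISTS request_queue (
    id         TEXT PRIMARY KEY,   -- unique key of the request
    url        TEXT NOT NULL,
    json       JSONB NOT NULL,     -- full serialized Request object
    handled_at TIMESTAMPTZ         -- NULL while the request is pending
);`;
```

Fetching the next pending request is then a `SELECT ... WHERE handled_at IS NULL LIMIT 1` away, which is what makes the queue semantics cheap to reproduce on plain Postgres.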
