quickest-silver•13mo ago
save HTML file using crawlee
Has anybody tried downloading the HTML file of a URL using Crawlee? I was wondering whether Crawlee can download the HTML file of a URL, since I've just started using Crawlee and am really loving the experience.
7 Replies
You can download the HTML content of a webpage using Crawlee.
It depends on which crawler you are using:
- Cheerio: https://cheerio.js.org/docs/api/classes/Cheerio#html
- Playwright: https://playwright.dev/docs/api/class-page#page-content
- Puppeteer: https://pptr.dev/api/puppeteer.page.content
quickest-silverOP•13mo ago
Thanks for this awesome answer! Was wondering if Crawlee has examples on how to save it to a file?
You can use the KeyValueStore: https://crawlee.dev/api/core/class/KeyValueStore. E.g., with Cheerio:
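The snippet that originally followed is not preserved here, so this is only a rough sketch of the idea, not the original answer. It assumes the handler receives Cheerio's `$` and that the store exposes `setValue(key, value, { contentType })`, as Crawlee's `KeyValueStore` does. The names `HtmlContext`, `HtmlStore`, `urlToKey` and `saveHtml` are made up for illustration, and the key sanitising is based on my understanding that store keys only allow a limited character set and length.

```typescript
// Minimal shapes of the pieces used below, so the sketch stands on its own.
// In a real project these come from Crawlee (the Cheerio crawling context
// and KeyValueStore); the interface names here are hypothetical.
interface HtmlContext {
    request: { url: string };
    $: { html(): string }; // Cheerio root; $.html() serialises the whole document
}

interface HtmlStore {
    setValue(key: string, value: string, options?: { contentType?: string }): Promise<void>;
}

// Hypothetical helper: replace every character outside a conservative
// allowed set with "_" and cap the length (assumed limits for store keys).
function urlToKey(url: string): string {
    return url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_').slice(0, 256);
}

// Save the page's HTML under a key derived from its URL and return that key.
async function saveHtml(ctx: HtmlContext, store: HtmlStore): Promise<string> {
    const key = urlToKey(ctx.request.url);
    await store.setValue(key, ctx.$.html(), { contentType: 'text/html' });
    return key;
}

// Wiring it into a crawler (needs `npm install crawlee`), roughly:
//
//   import { CheerioCrawler, KeyValueStore } from 'crawlee';
//
//   const store = await KeyValueStore.open();
//   const crawler = new CheerioCrawler({
//       requestHandler: async (ctx) => { await saveHtml(ctx, store); },
//   });
//   await crawler.run(['https://crawlee.dev']);
```

With the default local storage, the saved records should end up as files under `./storage/key_value_stores/default/`. For Playwright or Puppeteer, the same idea should work with `await page.content()` in place of `$.html()`.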
harsh-harlequin•13mo ago
Everything you save to a Crawlee store is saved to disk and can be accessed through Crawlee or otherwise. Something my company does is save page.content() to a cloud storage bucket during the request handler, which has worked quite well for us since it offloads the data quickly.
fascinating-indigo•12mo ago
1. I want to retrieve the HTML of a specific element, say the "item-list" div in the code below (the markup itself, not the data inside each element). How can I do this?
2. I do not want the resulting HTML to be saved to a file on disk; I want to return it to my API for further processing. How can I do this?