wise-white (15mo ago)

Save HTML file using Crawlee

Has anybody tried downloading the HTML of a URL using Crawlee? I was wondering whether Crawlee is capable of downloading a page's HTML file. I've just started using Crawlee and am really loving the experience.
7 Replies
Exp (15mo ago)
You can download the HTML content of webpages using Crawlee.
Marco (15mo ago)
Page.content() method | Puppeteer: The full HTML contents of the page, including the DOCTYPE.
Class: abstract Cheerio | cheerio: The cheerio class is the central class of the library. It wraps a set of ...
Page | Playwright: extends EventEmitter
wise-white (OP, 15mo ago)
Thanks for this awesome answer! I was wondering if Crawlee has examples of how to save it to a file?
Marco (15mo ago)
You can use the KeyValueStore: https://crawlee.dev/api/core/class/KeyValueStore. E.g., with Cheerio, inside the request handler:
const store = await KeyValueStore.open(); // KeyValueStore is imported from 'crawlee'
await store.setValue('my-html', $.html('html'), { contentType: 'text/html' });
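The snippet above always writes to the same 'my-html' key, so in a crawl of more than one page each page would overwrite the previous one. Below is a minimal sketch of one way around that, assuming a CheerioCrawler and the default key-value store. The names `keyForUrl` and `main` are hypothetical, and the key rules in the comment (at most 256 characters, a restricted character set) are stated from memory of the Crawlee docs, so treat them as an assumption to verify.

```javascript
// Hypothetical helper: turn a URL into a string that should be acceptable as a
// KeyValueStore key. Assumption: keys may only contain a-zA-Z0-9!-_.'() and
// must be at most 256 characters long.
function keyForUrl(url) {
    return url
        .replace(/^https?:\/\//, '')            // drop the scheme
        .replace(/[^a-zA-Z0-9!\-_.'()]/g, '_')  // replace anything outside the allowed set
        .slice(0, 256);                         // enforce the length limit
}

// Crawl the given URLs and store each page's HTML under its own key.
async function main(startUrls) {
    // Loaded lazily so keyForUrl can be used without Crawlee installed.
    const { CheerioCrawler, KeyValueStore } = await import('crawlee');

    const store = await KeyValueStore.open(); // default store
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            // $.html() with no arguments serializes the whole parsed document.
            await store.setValue(keyForUrl(request.url), $.html(), { contentType: 'text/html' });
        },
    });
    await crawler.run(startUrls);
}
```

Calling `main(['https://crawlee.dev'])` should leave one record per crawled page in the default store. Two different URLs can still map to the same key after sanitizing (for example, `a/b` and `a_b`), so a hash of the URL may be a safer key if that matters.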
fair-rose (15mo ago)
Everything you save to a Crawlee store is saved to disk and can be accessed through Crawlee or by other means. My company saves page.content() to a cloud storage bucket during the request handler, which has worked quite well for us since it offloads the data quickly.
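For the browser-based crawlers, the same idea can be sketched with page.content(). The example below writes each page to a local directory rather than a bucket, since the thread does not say which cloud provider is used; an upload call would replace `writeFile`. The names `fileNameForUrl`, `saveHtml`, and `main`, and the `./html` output directory, are hypothetical.

```javascript
// Hypothetical helper: derive a file name from the URL's host and path.
// The query string is ignored, so URLs differing only in query would collide.
function fileNameForUrl(url) {
    const { hostname, pathname } = new URL(url);
    const slug = (hostname + pathname)
        .replace(/[^a-zA-Z0-9.-]+/g, '_') // collapse unsafe characters into "_"
        .replace(/_+$/, '');              // trim trailing underscores
    return `${slug}.html`;
}

// Write the HTML to <dir>/<file name> and return the resulting path.
async function saveHtml(dir, url, html) {
    const { mkdir, writeFile } = await import('node:fs/promises');
    const { join } = await import('node:path');
    await mkdir(dir, { recursive: true });
    const filePath = join(dir, fileNameForUrl(url));
    await writeFile(filePath, html, 'utf8');
    return filePath;
}

// Crawl the given URLs in a browser and save the rendered HTML of each page.
async function main(startUrls, outDir = './html') {
    // Loaded lazily; requires crawlee and playwright to be installed.
    const { PlaywrightCrawler } = await import('crawlee');
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page }) {
            // page.content() returns the full HTML of the page, including the DOCTYPE.
            const html = await page.content();
            await saveHtml(outDir, request.loadedUrl ?? request.url, html);
        },
    });
    await crawler.run(startUrls);
}
```

With this sketch, `main(['https://crawlee.dev'])` would be expected to produce `./html/crawlee.dev.html`.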
genetic-orange (14mo ago)
1. I want to retrieve the HTML of a specific element. For example, I want the HTML of the "item-list" div in the code below (the markup itself, not the data inside each element). How can I do this?
<body>
  ...
  ...
  <div class="item-list">
    <div class="item">
      <div class="product-label"></div>
      <div class="product-image"></div>
      <div class="product-cost"></div>
    </div>
    <div class="item">
      <div class="product-label"></div>
      <div class="product-image"></div>
      <div class="product-cost"></div>
    </div>
    ...
    ...
  </div> <!-- end item-list -->
  ...
  ...
  <div class="testimonials">
    ...
    ...
    ...
  </div>
</body>
2. I do not want the resulting HTML to be saved to a file or to disk; I want to return it to my API for further processing. How can I do this?
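Neither question got an answer in the thread, so the following is only a sketch of one possible approach, assuming a CheerioCrawler. For the first question, `$.html(selector)` (the same call used earlier with 'html') should return the outer HTML of the matched element. For the second, results can be collected in an array and returned once the crawl finishes. The names `outerHtmlOf` and `fetchFragments` are hypothetical, and passing `persistStorage: false` via `Configuration` to keep Crawlee's own storages off disk is an assumption to check against the docs.

```javascript
// Return the outer HTML of all elements matching `selector`, or null if none match.
// `$` is the Cheerio instance that CheerioCrawler passes to the request handler.
function outerHtmlOf($, selector) {
    return $(selector).length > 0 ? $.html(selector) : null;
}

// Crawl the given URLs and resolve with the fragments instead of storing them.
async function fetchFragments(urls, selector = '.item-list') {
    // Loaded lazily so outerHtmlOf can be used without Crawlee installed.
    const { CheerioCrawler, Configuration } = await import('crawlee');

    const results = [];
    const crawler = new CheerioCrawler(
        {
            async requestHandler({ request, $ }) {
                results.push({
                    url: request.loadedUrl ?? request.url,
                    html: outerHtmlOf($, selector),
                });
            },
        },
        // Assumption: this keeps the request queue and stores in memory only.
        new Configuration({ persistStorage: false }),
    );

    await crawler.run(urls);
    return results; // hand this back to the calling API code
}
```

An API handler could then do `const fragments = await fetchFragments(['https://example.com/products'])` and process `fragments[0].html` directly. One caveat: repeated calls in the same process may share the default request queue, in which case a URL that was already handled could be skipped on a later call.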
