quickest-silver (13mo ago)

Save HTML file using Crawlee

Has anybody tried downloading the HTML of a URL using Crawlee? I was wondering whether Crawlee is capable of downloading a page's HTML file, since I've just started using Crawlee and am really loving the experience.
7 Replies
Exp (13mo ago)
You can download the HTML content of a webpage using Crawlee.
Marco (13mo ago)
Page.content() method | Puppeteer: The full HTML contents of the page, including the DOCTYPE.
Class: Cheerio (abstract) | cheerio: The cheerio class is the central class of the library. It wraps a set of ...
Page | Playwright
quickest-silver (OP, 13mo ago)
Thanks for this awesome answer! I was wondering whether Crawlee has examples of how to save it to a file?
Marco (13mo ago)
You can use the KeyValueStore: https://crawlee.dev/api/core/class/KeyValueStore. E.g., with Cheerio, where store is a KeyValueStore you have already opened:
await store.setValue('my-html', $.html('html'), { contentType: 'text/html' });
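A fuller sketch of the KeyValueStore approach, for anyone who wants something to paste and adapt. It assumes `crawlee` is installed and uses `https://example.com` as a placeholder start URL. `keyForUrl` is a hypothetical helper (not part of Crawlee) that turns a URL into a conservative store key, and `$.html()` is used here to serialize the whole parsed document rather than only the `html` element.

```javascript
// Sketch: save each crawled page's HTML into Crawlee's default key-value store.
// Assumes `crawlee` is installed; the start URL below is only a placeholder.

// Store keys are restricted in length and character set, so derive a
// conservative key from the URL. Hypothetical helper, not a Crawlee API.
function keyForUrl(url) {
    const { hostname, pathname } = new URL(url);
    return `${hostname}${pathname}`
        .replace(/[^a-zA-Z0-9]+/g, '-') // collapse anything that is not alphanumeric
        .replace(/^-+|-+$/g, '')        // trim leading and trailing dashes
        .slice(0, 200);
}

async function main() {
    // Loaded lazily so the helper above can be used without Crawlee installed.
    const { CheerioCrawler, KeyValueStore } = await import('crawlee');

    const store = await KeyValueStore.open(); // opens the default store

    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            // $.html() with no argument serializes the whole parsed document.
            const key = keyForUrl(request.loadedUrl ?? request.url);
            await store.setValue(key, $.html(), { contentType: 'text/html' });
        },
    });

    await crawler.run(['https://example.com']);
}

// Call main() to start the crawl.
```

Passing a selector, as in `$.html('html')` above, serializes only the matching elements instead of the whole document. Note that the query string is ignored by `keyForUrl`, so two URLs that differ only in their query would overwrite each other in this sketch.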
harsh-harlequin (13mo ago)
Everything you save to a Crawlee store is saved to disk and can be accessed through Crawlee or by other means. My company saves page.content() to a cloud storage bucket during the request handler, which has worked quite well for us because it offloads the data quickly.
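A rough sketch of that pattern, under stated assumptions: `crawlee` and `playwright` are installed, and `https://example.com` is a placeholder. The thread does not say which bucket provider is used, so `saveHtml` is a hypothetical stand-in that writes to a local directory; you would swap its body for your storage client's upload call. `objectNameFor` is likewise an illustrative helper.

```javascript
// Sketch: capture rendered HTML with page.content() inside the request handler
// and hand it straight to an uploader. Assumes `crawlee` and `playwright` are installed.

// Build a dated object name such as "2024-01-31/example.com/pricing.html".
// Hypothetical naming scheme, chosen only for illustration.
function objectNameFor(url, date = new Date()) {
    const { hostname, pathname } = new URL(url);
    const day = date.toISOString().slice(0, 10);           // YYYY-MM-DD (UTC)
    const path = pathname.replace(/\/+$/, '') || '/index'; // "/" becomes "/index"
    const file = path.endsWith('.html') ? path : `${path}.html`;
    return `${day}/${hostname}${file}`;
}

// Stand-in for a real bucket upload: writes the HTML under a local directory.
async function saveHtml(name, html, rootDir = './html-snapshots') {
    const { mkdir, writeFile } = await import('node:fs/promises');
    const { dirname, join } = await import('node:path');
    const target = join(rootDir, name);
    await mkdir(dirname(target), { recursive: true });
    await writeFile(target, html, 'utf8');
    return target;
}

async function main() {
    const { PlaywrightCrawler } = await import('crawlee');

    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page }) {
            // page.content() returns the full HTML of the page, including the DOCTYPE.
            const html = await page.content();
            await saveHtml(objectNameFor(request.loadedUrl ?? request.url), html);
        },
    });

    await crawler.run(['https://example.com']);
}

// Call main() to start the crawl.
```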
fascinating-indigo (12mo ago)
1. I want to retrieve the HTML for one specific element. Say I want the HTML of the "item-list" div in the code below (the markup itself, not the data inside each element). How do I do this?
<body>
...
...
<div class="item-list">
<div class="item">
<div class="product-label"></div>
<div class="product-image"></div>
<div class="product-cost"></div>
</div>
<div class="item">
<div class="product-label"></div>
<div class="product-image"></div>
<div class="product-cost"></div>
</div>
...
...
</div> <!-- end item-list -->
...
...
<div class="testimonials">
...
...
...
</div>
</body>
2. I do not want the resulting HTML to be saved to a file or to disk; I want to return it to my API for further processing. How do I do this?
