quickest-silver•13mo ago

save HTML file using crawlee

Has anybody tried downloading the HTML file of a URL using Crawlee? I was wondering whether Crawlee is capable of downloading a URL's HTML, since I've just been using Crawlee and am really loving the experience.

7 Replies

Hall•13mo ago

Post created!

This post has been synced with the Apify community site and will be indexed by search engines

Exp•13mo ago

You can download the HTML content of a webpage using Crawlee.

Marco•13mo ago

It depends on which crawler you are using:
- Cheerio: https://cheerio.js.org/docs/api/classes/Cheerio#html
- Playwright: https://playwright.dev/docs/api/class-page#page-content
- Puppeteer: https://pptr.dev/api/puppeteer.page.content

quickest-silverOP•13mo ago

Thanks for this awesome answer! Was wondering if Crawlee has examples on how to save it to a file?

Marco•13mo ago

You can use the KeyValueStore: https://crawlee.dev/api/core/class/KeyValueStore. E.g., with Cheerio:

await store.setValue('my-html', $.html('html'), { contentType: 'text/html' });
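
For context, a fuller sketch of the line above might look like the following. It assumes a CheerioCrawler and the default key-value store opened with KeyValueStore.open(). The urlToKey helper is hypothetical (not part of Crawlee) and only exists because store keys accept a limited character set; the start URL is a placeholder.

```javascript
// Hypothetical helper (not part of Crawlee): derive a store key from a URL.
// It drops the scheme, replaces every character other than letters, digits,
// '-', '_' and '.' with '_', and caps the length at 200 characters.
function urlToKey(url) {
    return url
        .replace(/^https?:\/\//, '')
        .replace(/[^a-zA-Z0-9\-_.]/g, '_')
        .slice(0, 200);
}

async function main() {
    // Loaded lazily so the helper above can be used without Crawlee installed.
    const { CheerioCrawler, KeyValueStore } = await import('crawlee');
    const store = await KeyValueStore.open(); // default key-value store

    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            // $.html() with no selector serializes the whole parsed document.
            await store.setValue(urlToKey(request.url), $.html(), { contentType: 'text/html' });
        },
    });

    await crawler.run(['https://example.com']); // placeholder URL
}
```

main() is left uncalled here; invoke it from a project that has crawlee installed. With the default local storage setup, the saved records should appear under ./storage/key_value_stores/default/.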

harsh-harlequin•13mo ago

Everything you save to a Crawlee store is saved to disk and can be accessed through Crawlee or otherwise. My company saves page.content() to a cloud storage bucket during the request handler, which has worked quite well for us since it offloads the data quickly.
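
If a plain file in a directory of your choosing is wanted instead of a store record, Node's built-in fs module can be called from the request handler as well. This is only a sketch: saveHtml is a hypothetical helper name, and the html argument is assumed to be the string returned by $.html() or await page.content().

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical helper: write an HTML string to <dir>/<name>.html
// (creating the directory if needed) and return the resulting file path.
async function saveHtml(dir, name, html) {
    await fs.mkdir(dir, { recursive: true });
    const filePath = path.join(dir, `${name}.html`);
    await fs.writeFile(filePath, html, 'utf8');
    return filePath;
}
```

Inside a handler this could be called as await saveHtml('./html-dump', 'my-html', $.html()), where the directory and file name are placeholders.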

fascinating-indigo•12mo ago

1. I want to retrieve the HTML for a specific table, e.g. the "item-list" div in the code below (the markup itself, not the data inside each element). How can I do this?

<body>
    ...
    ...
    <div class="item-list">
        <div class="item">
            <div class="product-label"></div>
            <div class="product-image"></div>
            <div class="product-cost"></div>
        </div>
        <div class="item">
            <div class="product-label"></div>
            <div class="product-image"></div>
            <div class="product-cost"></div>
        </div>
        ...
        ...
    </div> <!-- end item-list -->
    ...
    ...
    <div class="testimonials">
        ...
        ...
        ...
    </div>
</body>

2. I do not want the resulting HTML to be saved to a file on disk; I want to return it to my API for further processing. How can I do this?
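
One possible approach to both points, sketched under the assumption that a CheerioCrawler is in use: select the element by class, serialize it with $.html(), keep the string in a local variable, and return it once the crawl finishes. The function name scrapeItemList is hypothetical, and the selector comes from the sample markup above.

```javascript
// Hypothetical function: crawl one URL and return the outer HTML of the
// first .item-list element, or null if the page has no such element.
async function scrapeItemList(url) {
    const { CheerioCrawler } = await import('crawlee');
    let itemListHtml = null; // held in memory; this code never writes it to a store

    const crawler = new CheerioCrawler({
        maxRequestsPerCrawl: 1,
        async requestHandler({ $ }) {
            const list = $('.item-list').first();
            // $.html(selection) serializes the element itself plus its children
            // (outer HTML); list.html() would return only the inner HTML.
            if (list.length > 0) itemListHtml = $.html(list);
        },
    });

    await crawler.run([url]);
    return itemListHtml;
}
```

An API route could then await scrapeItemList(someUrl) and pass the string on. Note that Crawlee may still persist its own request queue under ./storage, and in a long-running process repeated calls for the same URL may be skipped as already handled, so a fresh queue per call may be needed.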
