Apify & Crawlee • 4y ago
33 replies
opposite-copper

Scraping auth-protected pages with CheerioCrawler, should I use Session?

I am trying to scrape some pages that only show certain information when the user is logged in (this is a personal project, and I understand the risks).
At first, I tried adding a request to the queue that performs the login with a POST request, saving the resulting cookies into the route handler's session with `session.setCookiesFromResponse`, and then adding the starting point for my scraping.
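
To make the flow concrete, here is a minimal sketch of the handler I have in mind. The URLs, the `LOGIN` and `PRIVATE` labels, and the form field names are placeholders I made up; `sendRequest`, `session.setCookiesFromResponse` and `crawler.addRequests` are the Crawlee context members as I understand them, so treat this as an assumption rather than a verified solution.

```javascript
// Hypothetical URLs, for illustration only.
const LOGIN_URL = 'https://example.com/login';
const START_URL = 'https://example.com/account';

// Builds a request handler. `credentials` is sent as form fields;
// the field names depend on the target site and are assumptions here.
function createRequestHandler(credentials) {
    return async function requestHandler({ request, session, sendRequest, crawler }) {
        if (request.label === 'LOGIN') {
            // Perform the login POST with the crawler's own HTTP client.
            const response = await sendRequest({
                url: LOGIN_URL,
                method: 'POST',
                form: credentials,
            });
            // Store the login cookies on the session that handled this request.
            session.setCookiesFromResponse(response);
            // Enqueue the first protected page only after the login finished.
            await crawler.addRequests([{ url: START_URL, label: 'PRIVATE' }]);
            return;
        }
        // Protected pages would be parsed here.
        console.log(`Fetched ${request.url}`);
    };
}
```

The idea would be to pass `createRequestHandler({ ... })` as the `requestHandler` of a `CheerioCrawler` and start the run with a single request labelled `LOGIN`. Note that the crawler would first fetch the login page itself with a GET before the handler sends the POST.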

However, for some reason the session is always empty (since the session was destroyed) and the next handler always gets a new session, even though I set the following configuration on my crawler:
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 1,
    },


I've seen that `session.isBlocked()` and `session.isExpired()` are always true, even before I set the cookies from the login response.

Am I understanding sessions wrong? Are they supposed to be only available when running Apify actors?
If so, what kind of flow should I use to include the authentication headers in all my requests?
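
As a fallback that does not rely on sessions at all, something like the following is what I imagine: build one `Cookie` header from the login response's `Set-Cookie` values and attach it to every request. The helper names and cookie values are made up for illustration. The hook uses the `(crawlingContext, gotOptions)` shape of CheerioCrawler's `preNavigationHooks`, and I am assuming that setting `gotOptions.headers` there is enough.

```javascript
// Turn the Set-Cookie header values of a login response into a single
// Cookie request header value ("name=value; name2=value2").
// Attributes such as Path, Expires or HttpOnly are deliberately ignored.
function cookieHeaderFromSetCookie(setCookieValues) {
    return setCookieValues
        .map((value) => value.split(';')[0].trim()) // keep only "name=value"
        .filter((pair) => pair.includes('='))
        .join('; ');
}

// Build a hook with the (crawlingContext, gotOptions) signature;
// it adds the cookie header to the options of every outgoing request.
function makeAuthHook(cookieHeader) {
    return async (_crawlingContext, gotOptions) => {
        gotOptions.headers = { ...gotOptions.headers, cookie: cookieHeader };
    };
}

// Example with made-up cookie values:
const header = cookieHeaderFromSetCookie([
    'sessionid=abc123; Path=/; HttpOnly',
    'csrftoken=xyz789; Path=/; Secure',
]);
console.log(header); // sessionid=abc123; csrftoken=xyz789
```

The hook would then go into the crawler options as `preNavigationHooks: [makeAuthHook(header)]`. This ignores cookie expiry and path scoping, which is probably acceptable for a small local project but not in general.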

Thank you in advance 🙂

PS: I only want to run this scraper in my local environment.

PS2: Basically, what I want to do is something similar to the Apify Store scraping example (https://crawlee.dev/docs/introduction/scraping), but using CheerioCrawler. Now imagine that the Apify actor pages are auth-protected, so you need login cookies. How would you do it then?