worthy-azure
Apify & Crawlee, 4y ago
6 replies

How to handle sequential steps (like a login flow or a wizard) in a headless browser?

Context
We need to log in to establish a session, then visit a 'content page' to scrape the data we want.

Goal
We're trying to understand the correct way to set up Crawlee for this scenario. Do we do it serially with page.goto as is done in the forms example[1]? Should we set up handlers for each page type (loginHandler and contentPageHandler) and just add the pages to the RequestQueue? Or do we do something else entirely?

Questions
- How do we ensure that the login step occurs before the 'content page' is visited and scraped?
- Is there a suggested method for persisting session data so it can be used in a serverless environment where Crawlee and the browser are ephemeral?
- If we visit a URL but our session has expired and we need to log back in, is there a recommended method for ensuring the content page in question remains on the queue?
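
To make the three questions concrete, here is a rough, library-agnostic model of the behavior we're after. None of these names are Crawlee APIs: `SequentialFlow`, `SessionState`, the `LOGIN`/`CONTENT` labels, the example.com URLs, and the one-second session lifetime are all made up for illustration, and the "login" is simulated rather than driven through a browser.

```typescript
// Hypothetical model only; this is NOT Crawlee's API.
type Label = 'LOGIN' | 'CONTENT';

interface Req {
  url: string;
  label: Label;
}

// Stand-in for session data we would persist between ephemeral runs.
interface SessionState {
  cookies: string[];
  expiresAt: number; // epoch milliseconds
}

const SESSION_TTL_MS = 1000; // assumed lifetime, for illustration

class SequentialFlow {
  private queue: Req[] = [];
  private session: SessionState | null;
  private loginUrl: string;
  private now: () => number;
  readonly visited: string[] = [];
  readonly scraped: string[] = [];

  constructor(session: SessionState | null, loginUrl: string, now: () => number) {
    this.session = session; // restored from storage, or null on a cold start
    this.loginUrl = loginUrl;
    this.now = now;
  }

  enqueue(req: Req): void {
    this.queue.push(req);
  }

  private sessionValid(): boolean {
    return this.session !== null && this.session.expiresAt > this.now();
  }

  run(): void {
    while (this.queue.length > 0) {
      const req = this.queue.shift()!;
      if (req.label === 'CONTENT' && !this.sessionValid()) {
        // No usable session: put the content request back, behind a login step.
        this.queue.unshift({ url: this.loginUrl, label: 'LOGIN' }, req);
        continue;
      }
      this.visited.push(`${req.label}:${req.url}`);
      if (req.label === 'LOGIN') {
        // Simulated login: a real flow would submit the form and read cookies.
        this.session = { cookies: ['sid=example'], expiresAt: this.now() + SESSION_TTL_MS };
      } else {
        this.scraped.push(req.url);
      }
    }
  }

  // Serialized session that a later run could pass back into the constructor.
  exportSession(): string {
    return JSON.stringify(this.session);
  }
}

// Cold start: no session yet, so login must be visited before the content page.
const flow = new SequentialFlow(null, 'https://example.com/login', () => 0);
flow.enqueue({ url: 'https://example.com/content/1', label: 'CONTENT' });
flow.run();
console.log(flow.visited);
```

In this model the content page is never dropped: it is visited only after a login entry appears in `visited`, and an expired or missing session pushes it back onto the queue behind a fresh login. What we don't know is how to express the same guarantees with Crawlee's request queue and session handling.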

Thanks in advance -- I looked through the documentation but couldn't find any fully fleshed-out authentication examples that cover session persistence and request-order guarantees.

[1] https://crawlee.dev/docs/examples/forms