wise-white · 3y ago

How to handle sequential steps (like a login flow or a wizard) in a headless browser?

Context: We need to log in to establish a session, then visit a 'content page' to scrape the data we want.

Goal: We're trying to understand the correct way to set up Crawlee for this scenario. Do we do it serially with page.goto, as is done in the forms example [1]? Should we set up handlers for each page type (loginHandler and contentPageHandler) and just add the pages to the RequestQueue? Or do we do something else entirely?

Questions:
- How do we ensure that the login step occurs before the 'content page' is visited and scraped?
- Is there a suggested method for persisting session data so it can be reused in a serverless environment, where Crawlee and the browser are ephemeral?
- If we visit a URL but our session has expired and we need to log back in, is there a recommended method for ensuring the content page in question remains on the queue?

Thanks in advance -- I looked in the documentation but couldn't find any fully fleshed-out authentication examples with session persistence and request-order guarantees.

[1] https://crawlee.dev/docs/examples/forms
6 Replies
Alexey Udovydchenko
Normally it's a router with labels, or the same concept done with handlePageFunction; the idea is to use separate requests instead of .goto.
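For illustration, here is a minimal sketch of that router-with-labels pattern; the URLs, labels, and handler bodies are placeholders made up for this sketch, not something from the thread:

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Requests labeled LOGIN are handled here.
router.addHandler('LOGIN', async ({ page, log }) => {
    log.info(`Handling login at ${page.url()}`);
    // ...fill in and submit the login form here...
});

// Requests labeled CONTENT are handled here.
router.addHandler('CONTENT', async ({ page, log }) => {
    log.info(`Scraping ${page.url()}`);
    // ...extract the data here...
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

// Separate labeled requests instead of chained page.goto() calls.
await crawler.run([
    { url: 'https://example.com/login', label: 'LOGIN' },
    { url: 'https://example.com/content', label: 'CONTENT' },
]);
```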
wise-white (OP) · 3y ago
Thanks for the response @Alexey Udovydchenko. Are you saying I should:
- Add all of the pages to the RequestQueue with a label.
- Register handlers for each label in the router with the corresponding behavior (login vs. scrape).

I'm still not sure how to manage the sequencing, though:
- How do I ensure that the login action occurs before the contentPage action? In "Crawling the detail pages" [0], the example just ignores requests with the DETAIL label. How would I ensure that the page gets retried?
- If I throw an exception, the docs say Crawlee will "try to re-crawl the request later" [1]. Is the link appended to the end of the RequestQueue?

[0] https://crawlee.dev/docs/introduction/crawling#crawling-the-detail-pages
[1] https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#requestHandler
fascinating-indigo · 3y ago
I am quite uncertain how to handle this as well.
apparent-cyan · 3y ago
To manage the sequencing, you can set the maximum number of tasks running in parallel to 1; see maxConcurrency (https://crawlee.dev/api/core/interface/AutoscaledPoolOptions#maxConcurrency). You should:
- Set maxConcurrency to 1.
- Add the login page to the RequestQueue with a label.
- Add the content page to the RequestQueue with a label.

That way the login-page action will occur before the content-page action. A minimal sketch of this setup follows.
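A minimal sketch, assuming a PlaywrightCrawler and placeholder URLs. With maxConcurrency: 1, requests run one at a time in the order they were enqueued, though a failed request that gets retried may end up running later than its original position:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // One request at a time, so the queue is processed sequentially.
    maxConcurrency: 1,
    requestHandler: async ({ request, log }) => {
        if (request.label === 'LOGIN') {
            // ...perform the login steps...
        } else if (request.label === 'CONTENT') {
            // ...scrape the content page...
        }
        log.info(`Finished ${request.label}: ${request.url}`);
    },
});

// Requests are processed in the order they were added.
await crawler.run([
    { url: 'https://example.com/login', label: 'LOGIN' },
    { url: 'https://example.com/content', label: 'CONTENT' },
]);
```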
Alexey Udovydchenko
The expected approach is to add the login URL only to the start URLs, then handle the login with the router or the handle-page function. Once login is done and you have verified access in code (i.e. the dashboard is reached, there are no login errors, etc.), you can call context.crawler.requestQueue.addRequest to continue with the next steps. Concurrency can therefore be anything; it's an advantage to be able to process multiple pages at the same time. Just create the correct data flow: add the content URLs after you have passed login, not at the start. A sketch of this flow is below.
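A hedged sketch of that flow; the form selectors, the /dashboard check, and the content URL are assumptions made up for illustration:

```ts
import { createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('LOGIN', async ({ page, crawler, log }) => {
    // Hypothetical login form selectors.
    await page.fill('#username', 'user');
    await page.fill('#password', 'pass');
    await Promise.all([
        page.waitForNavigation(),
        page.click('button[type="submit"]'),
    ]);

    // Verify access by code, e.g. check that the dashboard was reached.
    if (!page.url().includes('/dashboard')) {
        throw new Error('Login failed; Crawlee will retry this request');
    }

    log.info('Login verified, enqueueing content pages');
    // Content URLs are added only after login succeeded, not at the start,
    // so concurrency can stay above 1 for the content pages themselves.
    await crawler.requestQueue?.addRequest({
        url: 'https://example.com/content/1',
        label: 'CONTENT',
    });
});
```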
Lukas Krivka · 3y ago
1. The easy thing is to just use a single handler and page.goto, but that is not ideal, since you give up the main benefit of Crawlee, which is managing the requests.
2. Customizing SessionPool is quite hard right now, so I would recommend doing your own management. That means storing the cookies at the end of each page handler (where you would otherwise call page.goto), then enqueueing the next request (the next step) and passing the cookies along as userData. You can then set them up for the new request in preNavigationHooks. A sketch of this idea follows.
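A minimal sketch of that idea, assuming Playwright; the cookie objects returned by page.context().cookies() can be passed straight back into addCookies(), and the next-step URL is a placeholder:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Restore the cookies that the previous step passed along.
            const { cookies } = request.userData;
            if (cookies) {
                await page.context().addCookies(cookies);
            }
        },
    ],
    requestHandler: async ({ page, crawler }) => {
        // ...do this step's work (login, navigation, scraping)...

        // At the end of the handler, capture the current session cookies...
        const cookies = await page.context().cookies();

        // ...and hand them to the next step through userData. In real code,
        // guard this so the chain of steps terminates.
        await crawler.addRequests([{
            url: 'https://example.com/next-step', // placeholder
            userData: { cookies },
        }]);
    },
});
```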
