correct-apricot•3y ago
How to handle sequential steps (like a login flow or a wizard) in headless browser?
Context
We need to log in to establish a session, then visit a 'content page' to scrape the data we want.
Goal
We're trying to understand the correct way to set up Crawlee for this scenario. Do we do it serially with page.goto, as is done in the forms example[1]? Should we set up handlers for each page type (loginHandler and contentPageHandler) and just add the pages to the RequestQueue? Or do we do something else entirely?
Questions
- How do we ensure that the login step occurs before the 'content page' is visited and scraped?
- Is there a suggested method for persisting session data so it can be used in a serverless environment where crawlee and the browser are ephemeral?
- If we visit a URL but our session has expired and we need to log back in, is there a recommended method for ensuring the content page in question remains on the queue?
Thanks in advance -- I looked in the documentation but couldn't find any fully fleshed out authentication examples with session persistence and request order guarantees in the docs.
[1] https://crawlee.dev/docs/examples/forms
Alexey Udovydchenko•3y ago
Normally it's a router with labels, or the same concept done via handlePageFunction. The idea is to use separate requests instead of .goto.
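A minimal sketch of that pattern, assuming PlaywrightCrawler; the selectors, environment variables, and URLs are placeholders, not taken from this thread:

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Requests labeled LOGIN land here instead of being reached via page.goto.
router.addHandler('LOGIN', async ({ page }) => {
    await page.fill('#username', process.env.SITE_USER!);
    await page.fill('#password', process.env.SITE_PASS!);
    await page.click('button[type="submit"]');
    // ...verify the login worked, then enqueue the content pages (see below).
});

// Requests labeled CONTENT are scraped here.
router.addHandler('CONTENT', async ({ page, log }) => {
    log.info(`Scraping ${page.url()}`);
    // ...extract and push data here.
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://example.com/login', label: 'LOGIN' }]);
```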
correct-apricotOP•3y ago
Thanks for the response @Alexey Udovydchenko.
Are you saying I should:
- Add all of the pages to the RequestQueue with a label.
- Register handlers for each label in the router with the corresponding behavior (login vs. scrape)
I'm still not sure how to manage the sequencing, though:
- How do I ensure that the login action occurs before the contentPage action? In Crawling the detail pages[0], the example just ignores requests with the DETAIL label. How would I ensure that the page gets retried?
- If I throw an exception, the docs say Crawlee will "try to re-crawl the request later."[1] Is the link appended to the end of the RequestQueue?
[0] https://crawlee.dev/docs/introduction/crawling#crawling-the-detail-pages
[1] https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#requestHandler
ratty-blush•3y ago
I am quite uncertain how to handle this as well.
sensitive-blue•3y ago
To manage the sequencing, you can set the maximum number of tasks running in parallel to 1. See maxConcurrency (https://crawlee.dev/api/core/interface/AutoscaledPoolOptions#maxConcurrency).
You should:
- Set maxConcurrency to 1
- Add the login page to the RequestQueue with a label.
- Add the content page to the RequestQueue with a label.
Thus the login page action will occur before the content page action.
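A minimal sketch of that strictly serial setup, reusing the labeled router from the sketch above; the URLs are placeholders:

```ts
const serialCrawler = new PlaywrightCrawler({
    requestHandler: router,
    // Only one request is processed at a time, so the queue is handled in order.
    maxConcurrency: 1,
});

// The login request is enqueued first, so it runs before the content page.
await serialCrawler.run([
    { url: 'https://example.com/login', label: 'LOGIN' },
    { url: 'https://example.com/data', label: 'CONTENT' },
]);
```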
Expected approach is to add the login URL only to the start URLs. Then handle the login in the router (or handlePageFunction). When the login is done and you have verified access by code (i.e. the dashboard is reached, there are no login errors, etc.), you can call context.crawler.requestQueue.addRequest to continue with the next steps. Concurrency can therefore be anything; it's an advantage to be able to process multiple pages at the same time. Just create the correct data flow: add the content URLs after you have passed the login, not at the start.
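A sketch of that flow: only the login URL is a start URL, and content requests are added once access is verified. The dashboard selector and data URLs are made up for illustration, and crawler.addRequests is used as a convenience wrapper over the request queue:

```ts
router.addHandler('LOGIN', async ({ page, crawler, log }) => {
    await page.fill('#username', process.env.SITE_USER!);
    await page.fill('#password', process.env.SITE_PASS!);
    await page.click('button[type="submit"]');

    // Verify access by code before enqueueing anything else, e.g. by
    // waiting for an element that only exists once you are logged in.
    await page.waitForSelector('#dashboard');
    log.info('Login verified, enqueueing content pages');

    // Safe to add the content URLs now; concurrency can stay above 1.
    await crawler.addRequests([
        { url: 'https://example.com/data/1', label: 'CONTENT' },
        { url: 'https://example.com/data/2', label: 'CONTENT' },
    ]);
});
```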
1. The easy thing is to just use a single handler and page.goto, but that is not ideal, since you give up the main reason for Crawlee, which is managing the requests.
2. Customizing the SessionPool is quite hard right now, so I would recommend doing your own management. That is: store the cookies at the end of each page handler (at the point where you would otherwise do page.goto), then enqueue the next request (the next step) and pass the cookies in userData. You can then set them up for the new request in preNavigationHooks.
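A sketch of that manual cookie hand-off with PlaywrightCrawler; the next-step URL and the userData key are arbitrary:

```ts
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, crawler }) => {
        // ...handle the current step...

        // At the point where you would otherwise call page.goto, grab the
        // session cookies and pass them along in userData instead.
        const cookies = await page.context().cookies();
        await crawler.addRequests([
            { url: 'https://example.com/next-step', userData: { cookies } },
        ]);
    },
    preNavigationHooks: [
        async ({ page, request }) => {
            // Restore the cookies before the new page navigates.
            const { cookies } = request.userData;
            if (cookies) await page.context().addCookies(cookies);
        },
    ],
});
```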