wise-white•3y ago
How to handle sequential steps (like a login flow or a wizard) in headless browser?
Context
We need to log in to establish a session, then visit a 'content page' to scrape the data we want.
Goal
We're trying to understand the correct way to set up Crawlee for this scenario. Do we do it serially with `page.goto`, as is done in the forms example[1]? Should we set up handlers for each page type (`loginHandler` and `contentPageHandler`) and just add the pages to the `RequestQueue`? Or do we do something else entirely?
Questions
- How do we ensure that the login step occurs before the 'content page' is visited and scraped?
- Is there a suggested method for persisting session data so it can be used in a serverless environment where crawlee and the browser are ephemeral?
- If we visit a URL but our session has expired and we need to log back in, is there a recommended method for ensuring the content page in question remains on the queue?
Thanks in advance -- I looked in the documentation but couldn't find any fully fleshed-out authentication examples with session persistence and request-order guarantees.
[1] https://crawlee.dev/docs/examples/forms
Alexey Udovydchenko•3y ago
Normally it's a router with labels, or the same concept done via handlePageFunction; the idea is to use separate requests instead of `.goto`.
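For illustration (not from the thread), a minimal sketch of that pattern with `PlaywrightCrawler`; the URLs and form selectors are hypothetical:
```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs for the request labelled LOGIN.
router.addHandler('LOGIN', async ({ page, crawler }) => {
    await page.fill('#username', 'user'); // hypothetical selectors
    await page.fill('#password', 'pass');
    await page.click('#submit');
    // Enqueue the content page as a separate request instead of page.goto().
    await crawler.addRequests([{ url: 'https://example.com/data', label: 'CONTENT' }]);
});

// Runs for requests labelled CONTENT.
router.addHandler('CONTENT', async ({ page, pushData }) => {
    await pushData({ title: await page.title() }); // scrape whatever you need
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://example.com/login', label: 'LOGIN' }]);
```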
wise-white OP•3y ago
Thanks for the response @Alexey Udovydchenko.
Are you saying I should:
- Add all of the pages to the `RequestQueue` with a label.
- Register handlers for each label in the router with the corresponding behavior (login vs. scrape).
I'm still not sure how to manage the sequencing, though:
- How do I ensure that the `login` action occurs before the `contentPage` action? In Crawling the detail pages[0], the example just ignores requests with the `DETAIL` label. How would I ensure that the page gets retried?
- If I throw an exception, the docs say Crawlee will "try to re-crawl the request later."[1] Is the link appended to the end of the `RequestQueue`?
[0] https://crawlee.dev/docs/introduction/crawling#crawling-the-detail-pages
[1] https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#requestHandler
fascinating-indigo•3y ago
I am quite uncertain how to handle this as well.
apparent-cyan•3y ago
To manage the sequencing, you can set the maximum number of tasks running in parallel to 1. See `maxConcurrency` (https://crawlee.dev/api/core/interface/AutoscaledPoolOptions#maxConcurrency).
You should:
- Set `maxConcurrency` to 1.
- Add the `login` page to the `RequestQueue` with a label.
- Add the `content` page to the `RequestQueue` with a label.
Thus the `login` page action will occur before the `content` page action.
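A minimal sketch of that suggestion, with hypothetical URLs and a single handler that switches on the label:
```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Process one request at a time, so the queue order
    // (login first, content second) is respected.
    maxConcurrency: 1,
    requestHandler: async ({ request, page }) => {
        if (request.label === 'LOGIN') {
            // ... perform the login here ...
        } else if (request.label === 'CONTENT') {
            // ... scrape the content page here ...
        }
    },
});

// Enqueued in order: login before content.
await crawler.run([
    { url: 'https://example.com/login', label: 'LOGIN' }, // hypothetical URLs
    { url: 'https://example.com/content', label: 'CONTENT' },
]);
```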
Expected approach is to add the login URL only to the start URLs. Then handle login in the router or the handle page function. When login is done and you have verified access in code (i.e. the dashboard is reached, there are no login errors, etc.), you can call `context.crawler.requestQueue.addRequest` to continue with the next steps. Concurrency can therefore be anything; it's an advantage to be able to process multiple pages at the same time. Just create the correct data flow: add the content URLs after you have passed login, not at the start.
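Sketched out, under the same assumptions as above (the dashboard selector and content URL are placeholders):
```ts
router.addHandler('LOGIN', async ({ page, crawler }) => {
    // ... submit the login form ...
    // Verify access in code before enqueueing anything else,
    // e.g. wait for a dashboard element to appear.
    await page.waitForSelector('#dashboard');
    // Login has passed, so the content URLs can be added now.
    await crawler.requestQueue?.addRequest({
        url: 'https://example.com/content',
        label: 'CONTENT',
    });
});
```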
1. The easy thing is to just use a single handler and `page.goto`, but that is not ideal, since you give up the main reason for Crawlee, which is managing the requests.
2. Customizing SessionPool is quite hard right now, so I would recommend doing your own management. That means storing the cookies at the end of each page handler (at the point where you would otherwise call `page.goto`), then enqueueing the next request (the next step) and passing the cookies along as `userData`. You can then set them up for the new request in `preNavigationHooks`.
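A sketch of that cookie handoff with Playwright (the content URL is a placeholder):
```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('LOGIN', async ({ page, crawler }) => {
    // ... perform the login ...
    // Grab the session cookies from the browser context and
    // pass them to the next step via userData.
    const cookies = await page.context().cookies();
    await crawler.addRequests([{
        url: 'https://example.com/content',
        label: 'CONTENT',
        userData: { cookies },
    }]);
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    preNavigationHooks: [
        async ({ page, request }) => {
            // Restore cookies saved by the previous step, if any.
            const { cookies } = request.userData;
            if (cookies) await page.context().addCookies(cookies);
        },
    ],
});
```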