How to access browser instance in Playwright Crawler?

At a glance

The community member is trying to port their scrapers from Selenium/Python to Crawlee, mainly due to the anti-bot protections built into Crawlee. They are having trouble translating their functions 1-to-1 from Selenium to Crawlee, as a lot of it depends on the Selenium driver or Playwright's browser instance. For example, they need to click on an element to get a link because there's a redirect in between, and they need to wait for it before grabbing the link, but they can't use enqueueLinksByClickingElements because they need it in the same request for their dataset to be complete.

The community member asks if it's possible to achieve this functionality with Crawlee, or if there are any workarounds they can use. Another community member suggests using context.browserController.browser to access the full browser API that Playwright provides, which seems to solve the issue. The community members test this approach and confirm that it works.

AltairSama2

I have been trying to port our scrapers from Selenium/Python to Crawlee, mainly because of the anti-bot protections already built into it. The issue I am facing is that I am having a hard time translating our functions 1-to-1 from Selenium to Crawlee, because a lot of them depend on the Selenium driver or, in Playwright's case, the browser instance. For example:

I need to click on an element to get the link because there's a redirect in between, and I need to wait for it before grabbing the link. I can't use enqueueLinksByClickingElements because I need the link in the same request for my dataset to be complete.

There are other such issues I am having trouble with. I know we have Page exposed, but that's just a single tab in a browser's context, and I need more control than that for my use case.

Is this something that's possible with Crawlee? Or are there any workarounds that I can use to get the same functionality?

14 comments

Pepa J

Hey,
Are you talking about context.page.browser()?

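import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async (context) => {
        // Playwright Browser instance that Crawlee launched for this page.
        // With Playwright, context.page.context().browser() should give the same kind of object.
        const browser = context.browserController.browser;
    },
});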
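
For the click-and-wait case from the question, a minimal sketch is shown here, assuming the click navigates the same tab rather than opening a new one. `resolveLinkByClicking` and its `isTarget` option are hypothetical names, not part of Crawlee or Playwright; the helper relies only on `page.url()`, `page.click()` and `page.waitForURL()`. If the site passes through an intermediate page, `isTarget` would need to match the final URL instead of "anything different from the start".

```javascript
// Hypothetical helper, not part of Crawlee or Playwright.
// `page` only needs url(), click() and waitForURL(), which Playwright's Page provides.
async function resolveLinkByClicking(page, selector, options = {}) {
    const startUrl = page.url();
    const {
        timeout = 15000,
        // Default: any URL different from the starting one counts as the target.
        isTarget = (url) => url.toString() !== startUrl,
    } = options;

    // Start waiting before the click so a fast navigation is not missed.
    await Promise.all([
        page.waitForURL(isTarget, { timeout }),
        page.click(selector),
    ]);
    return page.url();
}

// Possible use inside a PlaywrightCrawler requestHandler (selector is a placeholder):
// requestHandler: async ({ page, request, pushData }) => {
//     const link = await resolveLinkByClicking(page, 'a.redirecting-link');
//     await pushData({ url: request.url, link });
// },
```

Because the link is resolved inside the same handler, it can be stored alongside the rest of the record for that request instead of being enqueued as a separate request.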

AltairSama2

Will this give me the full browser API that Playwright has? If so, then this is exactly what I need.

AltairSama2

I went through the docs on Apify

AltairSama2

but couldn't figure out how to get those inside the crawler

AltairSama2

If this is the same, then I can simply refer to Playwright's docs for opening new tabs etc. and have the crawler handle the anti-bot stuff.
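
As a rough illustration of the "opening new tabs" idea, a minimal sketch might look like the following. `withExtraTab` is a hypothetical helper name; it assumes `page` is the Playwright Page from the request handler and uses `page.context().newPage()`, so the extra tab shares cookies and storage with the current page. The tab is opened by hand, so the helper closes it itself.

```javascript
// Hypothetical helper, not part of Crawlee or Playwright.
// Opens a second tab in the same browser context, runs `task` in it,
// and closes the tab even if `task` throws.
async function withExtraTab(page, url, task) {
    const tab = await page.context().newPage();
    try {
        await tab.goto(url);
        return await task(tab);
    } finally {
        await tab.close();
    }
}

// Possible use inside a PlaywrightCrawler requestHandler (URL is a placeholder):
// requestHandler: async ({ page }) => {
//     const title = await withExtraTab(page, 'https://example.com/details', (tab) => tab.title());
// },
```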

AltairSama2

Not the most optimized thing, but I'll refactor once I am more familiar with Crawlee.

Pepa J

Please test it and let us know if it solves your problem.