Apify Discord Mirror

Updated last year

PlaywrightCrawler exception: page.content: Target page, context or browser has been closed

At a glance

The community members are experiencing an issue where the code content = await page.content(); throws an exception "page.content: Target page, context or browser has been closed" in about 15-20% of requests. They suspect this is a well-known issue and wonder if they should check or wait for something before calling page.content().

The comments suggest this issue is not specific to Crawlee, but rather a situation where the page or browser has been closed or crashed. The community members have a try/catch block, but the request is already useless at that point. They are advised to try "headful" mode to see why the browser is crashing, which could be due to memory overload.

The community members found that increasing the retireBrowserAfterPageCount setting from 3 to 100 resolved the issue. They had assumed Crawlee/Playwright would track these dependencies and retire a browser instance only once all of its requests had finished, but this was not the case. After further experimentation, they found that increasing the closeInactiveBrowserAfterSecs setting from 30 to 200 also eliminated the errors, even with retireBrowserAfterPageCount as low as 2.

The community members also discussed the need to have a unique fingerprint per request, which is linked as a separate discussion below.

Useful resources
This exception happens in about 15-20% of all requests... quite often!

This line in code:
Plain Text
content = await page.content();


Throws this exception
Plain Text
page.content: Target page, context or browser has been closed
   at (<somewhere-in-my-code>.js:170:54)
   at PlaywrightCrawler.requestHandler (<somewhere-in-my-code>.js:596:15)
   at async wrap (.../node_modules/@apify/timeout/index.js:52:21)


Is it something well known?

Should I check (or wait for) something before calling page.content()?
I have already checked that response.status() is less than 400 (it is actually 200; I see it in the logs).
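One cheap guard, sketched below: Playwright's page.isClosed() reports whether the tab handle is already gone, and a try/catch turns the failure into a retryable null. The safeContent helper name is mine, not from the thread, and note this only narrows the race window rather than closing it:

```javascript
// Hypothetical helper: return the page HTML, or null if the page is
// already closed or closes mid-call. The page can still close between
// the isClosed() check and the content() call, so keep the try/catch.
async function safeContent(page) {
  if (page.isClosed()) return null;     // tab already torn down
  try {
    return await page.content();        // may still throw on a crash
  } catch (err) {
    if (String(err.message).includes('closed')) return null;
    throw err;                          // unrelated errors propagate
  }
}

module.exports = { safeContent };
```

A null result can then be treated the same way as a caught exception: log it and let the crawler retry the request.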
7 comments
This is not really connected to Crawlee; it is just a situation where either the page was closed or it crashed. You can try/catch that, but at that point the request is useless anyway.

I would try headful mode and see why it crashed. Sometimes it could be memory overload.
I have that "catch" block, and I see that content is empty when this exception is thrown... so the catch block only prevents my program from crashing; the request is already useless, you are right.

Regarding "headful" mode... well, only some requests crash! From my point of view, "headful" mode will not be helpful.

I would focus on logs.

page was closed or it crashed.
As far as I understand it, this "page" is something in another process, i.e. in the browser. In the JS/Node/Playwright world we only have some... handle/connection to the browser+page. Is that correct?

I am sure such thing as headless Firefox (i use only Firefox at the moment) can write log files. How to enable/setup headless Firefox logging? Where to look?
Yes, page is a handle to a tab in a browser. So if the underlying page (tab) or browser closes/crashes, Playwright cannot do anything about it and will throw this error.

I think there are 2 possible cases:
  1. You accidentally close the page somewhere, or you are missing some await and the handler is already done when you call your code.
  2. The browser/tab just crashed, which can also happen on your laptop if it runs out of memory. Most crashes are because of that, and it is fine to let the request just retry.
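Case 1 can be reproduced without a browser at all. The sketch below (all names hypothetical, not Crawlee API) fakes a page whose content() call takes one event-loop turn, like a real round-trip to the browser; dropping the await lets the handler return, and the page gets closed while the call is still in flight:

```javascript
// Minimal simulation of case 1: a fake page handle whose content()
// call takes one event-loop turn before it checks the closed flag.
function makeFakePage() {
  const page = { closed: false };
  page.close = () => { page.closed = true; };
  page.content = async () => {
    await new Promise((resolve) => setImmediate(resolve)); // simulated round-trip
    if (page.closed) {
      throw new Error('page.content: Target page, context or browser has been closed');
    }
    return '<html></html>';
  };
  return page;
}

async function buggyHandler(page, results) {
  // BUG: no `await` -- the handler returns while content() is still pending.
  page.content()
    .then((html) => results.push(html))
    .catch((err) => results.push(err.message));
}

async function main() {
  const results = [];
  const page = makeFakePage();
  await buggyHandler(page, results);
  page.close(); // the crawler tears the page down right after the handler returns
  await new Promise((resolve) => setImmediate(resolve)); // let the pending call settle
  console.log(results[0]); // the "has been closed" error message, not the HTML
}
main();
```

Adding the missing await inside the handler makes the content() call finish before close() runs, and the error disappears.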
Well, it seems this error: page.content: Target page, context or browser has been closed is related to this setting:

Plain Text
    browserPoolOptions: {
        retireBrowserAfterPageCount: 3,
        ...
    }


I changed retireBrowserAfterPageCount to 100, and the error disappeared.
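For context, browserPoolOptions is passed to the PlaywrightCrawler constructor. A minimal configuration sketch with the higher retirement threshold; the handler body and start URL are placeholders, not from the thread:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Retire a browser only after it has served many pages, instead of 3.
        retireBrowserAfterPageCount: 100,
    },
    async requestHandler({ page, request }) {
        const content = await page.content(); // the call that was throwing
        // ... process content ...
    },
});

await crawler.run(['https://example.com']);
```

The trade-off discussed later in the thread: a high retireBrowserAfterPageCount means the same browser fingerprint is reused for many requests.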

I should test a bit more to be 100% sure...

Anyway... I thought Crawlee/Playwright knew about these dependencies: "this browser instance is still processing request(s), so let's wait and retire it later, when all requests have completely finished." It turns out that's not true.
It should be true; a browser should be retired only after its pages were processed. You would need to share your code.
Well, I found a solution for this!

With these settings I saw many "Target page, context or browser has been closed" errors:
Plain Text
    browserPoolOptions: {
        operationTimeoutSecs: 40,
        retireBrowserAfterPageCount: 3,
        maxOpenPagesPerBrowser: 3,
        closeInactiveBrowserAfterSecs: 30,
    }


Then I changed a few settings... and no more errors:
Plain Text
    browserPoolOptions: {
        operationTimeoutSecs: 30,
        retireBrowserAfterPageCount: 2,
        maxOpenPagesPerBrowser: 10,
        closeInactiveBrowserAfterSecs: 200,
    }


Probably the low value of closeInactiveBrowserAfterSecs (30) caused these errors, but I am not sure.

The very low value of retireBrowserAfterPageCount is needed to change the browser fingerprint often: my goal is to have a
unique fingerprint per request. With retireBrowserAfterPageCount=2 I get a unique fingerprint every two requests, which isn't perfect, but it's not bad.



By the way, we discussed this "New fingerprint per new page in browser-pool" topic in the past (and there is still no good solution for use with PlaywrightCrawler as far as I understand... but that should be discussed separately):
https://discord.com/channels/801163717915574323/1060467542616965150/1062991696813625405
Hmm, interesting, I have never changed the default value of closeInactiveBrowserAfterSecs.

The new fingerprint per page (we call it a session, which is a fingerprint + proxy IP) is just an experimental hack: https://crawlee.dev/api/browser-pool/class/LaunchContext#experimentalContainers