Apify & Crawlee · 3y ago
6 replies
awake-maroon

Addressing playwright memory limitations in crawlee

Hello,

I am currently using Crawlee on a medium-sized project and I am generally happy with it. I am targeting e-commerce websites and I am interested in how various products are presented on each site, so I opted for a browser automation solution in order to "see" the page.

I am using Playwright as the browser automation tool. Recently I noticed that some of my scraping instances fail with the following error:
While handling this request, the container instance was found to be using too much memory and was terminated.


I did some digging around the web and I found the following:
https://stackoverflow.com/questions/72954376/python-playwright-memory-overlad

It seems that the Playwright context simply grows over time. This is a known issue, but Playwright itself will not address it, because it is primarily a web testing tool, not a scraping tool.

The suggested workaround is to save the context's state to disk and restart the context every once in a while. I was wondering whether Crawlee has any out-of-the-box functionality that applies this solution. If not, has anyone else encountered this problem, and how did you fix it?
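For reference, the kind of out-of-the-box behaviour I am hoping for would look something like the sketch below. I believe Crawlee's browser pool exposes a `retireBrowserAfterPageCount` option for exactly this purpose, but I have not verified it myself, so please correct me if the actual API differs:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Retire each browser after it has handled 50 pages, so that any
        // memory accumulated by its contexts is released when a fresh
        // browser is launched. (Option name assumed — not verified.)
        retireBrowserAfterPageCount: 50,
    },
    async requestHandler({ page, request, log }) {
        log.info(`Scraping ${request.url}`);
        const title = await page.title();
        // ... extract product presentation data here ...
    },
});

await crawler.run(['https://example.com']);
```

If something like this exists, it would sidestep the manual save-state-and-restart dance from the Stack Overflow answer entirely.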