Addressing playwright memory limitations in crawlee
I am currently using crawlee on a medium-sized project and I am generally happy with it. I am targeting e-commerce websites and I am interested in how various products are presented on the website, so I opted for a browser automation solution to be able to "see" the page.
I am using playwright as the browser automation tool. Recently I noticed that some of my scraping instances fail with the following error:
While handling this request, the container instance was found to be using too much memory and was terminated.

I did some digging around the web and I found the following:
https://stackoverflow.com/questions/72954376/python-playwright-memory-overlad
It seems that the playwright context simply grows over time. This is a known issue, but playwright itself will not address it, since it is primarily a web testing tool, not a scraping tool.
The suggested solution is to save the state of the context to disk and restart the context every once in a while. I was wondering whether crawlee has any out-of-the-box functionality that applies this solution. If not, has anyone else run into this problem, and how did you fix it?
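For reference, here is a minimal sketch of the recycling approach I mean, using plain Playwright for Python (outside of crawlee). The names `STATE_FILE`, `PAGES_PER_CONTEXT`, and `should_recycle` are my own; the actual threshold would need tuning against the container's memory limit:

```python
STATE_FILE = "state.json"  # cookies/localStorage are persisted here between contexts
PAGES_PER_CONTEXT = 50     # assumed recycle threshold; tune to your memory budget


def should_recycle(pages_handled: int, limit: int = PAGES_PER_CONTEXT) -> bool:
    """True once the current context has served `limit` pages."""
    return pages_handled > 0 and pages_handled % limit == 0


def crawl(urls):
    # Import deferred: running this requires Playwright and its browser binaries.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        for pages_handled, url in enumerate(urls, start=1):
            page = context.new_page()
            page.goto(url)
            # ... extract product data here ...
            page.close()
            if should_recycle(pages_handled):
                # Persist session state, drop the bloated context, start fresh.
                context.storage_state(path=STATE_FILE)
                context.close()
                context = browser.new_context(storage_state=STATE_FILE)
        browser.close()
```

Doing this by hand works, but it is exactly the kind of lifecycle management I would expect the crawler framework to own, hence the question.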
<--- Last few GCs --->
[17744:00000270608DE2C0] 16122001 ms: Scavenge 2023.5 (2082.0) ->
2017.3...
