ruaultadrien
Apify & Crawlee•2w ago•

Creating WARC files with crawlee

I just read this amazing blog post about creating WARC files with Crawlee: https://crawlee.dev/python/docs/guides/creating-web-archive

I was considering using the wayback proxy to build my archive collection.

Still, I am wondering: if I need to set up a datacenter proxy to get around rate limits, is it possible to use it along with the wayback setup?

In other words, can I use the wayback proxy for archiving on top of other proxies? Does Crawlee allow such a setup?

Thanks 😘
Solution
From our internal support:
Yes, that is a very good question. I never got far enough to try that out, so I only have a theoretical guess based on the documentation of the recording server. I worked with it last year, so the fine details are lost from my memory now.

It should be possible to set up a fixed proxy for the recording server. I think the server looks at some environment variables (HTTP_PROXY, HTTPS_PROXY?) and then uses the same proxy for all requests. I am not sure whether dynamic per-request proxies are easily possible. This would have to be experimented with, and I cannot tell without trying it. It is probably not possible out of the box, and you would need to write a subclass of the default recording server to handle dynamic proxies.
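
The fixed-proxy idea could be tried with a small launcher like the one below. This is only a sketch under the same untested guess as above: it assumes the recording server really honours HTTP_PROXY / HTTPS_PROXY, which has not been verified here. The helper names `build_recorder_env` and `start_recorder` and the proxy URL are made up for illustration, and the command to run is whatever the guide tells you to use to start the recording server.

```python
import os
import subprocess


def build_recorder_env(upstream_proxy_url, base_env=None):
    """Return a copy of the environment with the proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption (unverified): the recording server reads these variables.
    # Both spellings are set because some HTTP libraries read the lowercase names.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env[name] = upstream_proxy_url
    return env


def start_recorder(command, upstream_proxy_url):
    """Start the recording server process with one fixed upstream proxy."""
    # `command` is the argument list from the guide; it is not spelled out here.
    return subprocess.Popen(command, env=build_recorder_env(upstream_proxy_url))
```

Crawlee would still point only at the local wayback proxy, exactly as in the guide; the datacenter proxy would be known only to the recording server process. Whether traffic actually leaves through the upstream proxy needs to be checked by experiment, for example by archiving a page that echoes the caller's IP address.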