Apify Discord Mirror

Updated 2 years ago

Best practice for rendering JavaScript, then doing a deep clone or structuredClone of the window object?

At a glance

The community member is looking for high-level advice on the best approach to crawl a website, save the JavaScript resources, and log the window object. The comments suggest using either the Puppeteer or Playwright framework, and provide some specific tips such as extracting script tags from the HTML, catching responses with page.on('response') to get the JavaScript files, and using a library to serialize the window object. However, there is no explicitly marked answer in the comments.

ggmmmer

Hello, I am looking for general, high-level advice on the best approach to crawl a site, save the *.js resources, and log the window object. Does anyone have an idea? I'm a little unsure whether I should lean more on the Playwright API or whether there is a built-in utility or helper function for downloading resources (and analyzing the window object at a depth of 3 or 4) from the site. Thanks in advance for any help.

2 comments

Alexey Udovydchenko

The SDK does not provide any special support for that. You need to choose either the Puppeteer or the Playwright framework and see which works better for your case. Usually, when you are parsing values from the browser in actor code, you already know what they are. If not, e.g. when you need to find an object key, you can reuse https://lodash.com/docs/#findKey, which is available as an SDK dependency.
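As a rough illustration of the key lookup: lodash's findKey only inspects the top-level properties of an object, so searching a few levels down needs a small recursive wrapper. The sketch below is dependency-free plain JavaScript; `findKeyPath` and its `maxDepth` parameter are hypothetical names, not part of lodash or the Apify SDK.

```javascript
// Hypothetical helper: depth-limited search for a property name in a nested object.
// Returns the path to the first match as an array of keys, or null if nothing is found.
function findKeyPath(root, targetKey, maxDepth = 4) {
  // Guards against circular references such as window.window.
  const seen = new WeakSet();

  function visit(value, depth, path) {
    if (value === null || typeof value !== 'object') return null;
    if (depth > maxDepth || seen.has(value)) return null;
    seen.add(value);

    let keys;
    try {
      keys = Object.keys(value); // own enumerable keys only
    } catch {
      return null; // some host objects refuse to be inspected
    }

    // Check the current level first, then descend.
    for (const key of keys) {
      if (key === targetKey) return [...path, key];
    }
    for (const key of keys) {
      let child;
      try {
        child = value[key];
      } catch {
        continue; // getters on browser objects may throw
      }
      const found = visit(child, depth + 1, [...path, key]);
      if (found) return found;
    }
    return null;
  }

  return visit(root, 1, []);
}
```

Because every object is visited at most once, the path returned is the first one found in depth-first order, not necessarily the shortest one.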

Lukas Krivka

Hello,

You can extract all the <script> tags from the HTML; these contain the JS that is loaded with the HTML.
You can catch the responses that contain JS with page.on('response', ...).
There is probably some library for serializing the window object. Generally, it will just need to replace all the references and non-serializable stuff.
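The three tips above can be sketched roughly as follows. This is a minimal illustration, not a tested crawler: it assumes `page` is a Playwright (or Puppeteer) Page object, and `looksLikeJavaScript`, `collectScripts`, `snapshot`, and the placeholder strings such as '[max depth]' are hypothetical names chosen for this example. Instead of a serialization library, it uses a hand-written depth-limited copy.

```javascript
// Decide whether a response looks like JavaScript, by content type or file extension.
function looksLikeJavaScript(url, contentType = '') {
  let pathname = url;
  try {
    pathname = new URL(url).pathname; // drops the query string and fragment
  } catch {
    // Not an absolute URL: fall back to the raw string.
  }
  return /javascript|ecmascript/i.test(contentType) || /\.m?js$/i.test(pathname);
}

// Call before page.goto() so that no response is missed.
function collectScripts(page) {
  const scripts = new Map(); // url -> source text
  page.on('response', async (response) => {
    try {
      const contentType = response.headers()['content-type'] || '';
      if (looksLikeJavaScript(response.url(), contentType)) {
        scripts.set(response.url(), await response.text());
      }
    } catch {
      // Bodies of redirects or aborted requests may not be available.
    }
  });
  return scripts;
}

// Depth-limited, JSON-safe copy of an object. Only own enumerable keys are copied;
// functions, repeated references, and overly deep values become placeholder strings.
function snapshot(root, maxDepth) {
  const seen = new WeakSet();

  function walk(value, depth) {
    const type = typeof value;
    if (value === null || type === 'string' || type === 'boolean') return value;
    if (type === 'number') return Number.isFinite(value) ? value : String(value);
    if (type === 'undefined') return '[undefined]';
    if (type === 'function') return '[function]';
    if (type === 'symbol' || type === 'bigint') return String(value);
    if (seen.has(value)) return '[repeated reference]'; // circular or shared object
    if (depth >= maxDepth) return '[max depth]';
    seen.add(value);

    let keys;
    try {
      keys = Object.keys(value);
    } catch {
      return '[inaccessible]';
    }
    const out = Array.isArray(value) ? [] : {};
    for (const key of keys) {
      try {
        out[key] = walk(value[key], depth + 1);
      } catch {
        out[key] = '[inaccessible]'; // some browser getters throw
      }
    }
    return out;
  }

  return walk(root, 0);
}

// Possible usage inside a crawler (url is a placeholder):
//   const scripts = collectScripts(page);
//   await page.goto(url);
//   const inline = await page.$$eval('script:not([src])', (els) => els.map((el) => el.textContent));
//   const dump = await page.evaluate(`(${snapshot.toString()})(window, 3)`);
```

The snapshot function is passed to the browser as source text because it has to run inside the page, where window lives. Its result is plain JSON, so it can be returned from page.evaluate without hitting the circular references that the real window object contains.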
