Crawler for SPAs (Single Page Application)
Hi all!
My goal is to scrape a website composed of SPAs (Single-Page Applications), and it looks like the existing browser crawlers (i.e. PlaywrightCrawler and PuppeteerCrawler) are not a good fit, since each request is processed in a new page, which wastes resources.
What I need is to open one browser page and execute multiple XHR / fetch requests against the site's unofficial API, until I get blocked and need to open a new browser page to continue until all requests have been processed.
Note that I need a browser to pass fingerprint checks and to use the website's internal library to digitally sign each request to their unofficial API.
I'm thinking of solving this by writing a SinglePageBrowserCrawler that extends BasicCrawler and works similarly to BrowserCrawler, but manages browser pages differently. Is this a good idea? Is there a better way to do this?
Thanks in advance for your feedback!
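For what it's worth, here is a minimal sketch of the page-rotation logic I have in mind, written independently of Crawlee and Playwright so the idea stands on its own. The names `SignedApiPage` and `crawlWithRotatingPage` are made up, and treating HTTP 403 as the "blocked" signal is an assumption; in a real SinglePageBrowserCrawler the page would come from `browser.newPage()` and `callApi` would wrap `page.evaluate(() => fetch(...))` so the site's signing library runs in-page.

```typescript
// Hypothetical stand-in for a Playwright page on which the site's
// request-signing library is already loaded. A real implementation would
// issue requests via page.evaluate(fetch) instead.
interface SignedApiPage {
  callApi(url: string): Promise<{ status: number; body?: string }>;
  close(): Promise<void>;
}

// Processes every request on a single shared page, rotating to a fresh
// page only when the site starts blocking (assumed here to mean HTTP 403).
async function crawlWithRotatingPage(
  urls: string[],
  openPage: () => Promise<SignedApiPage>,
  maxRetriesPerUrl = 3,
): Promise<{ results: Map<string, string>; pagesOpened: number }> {
  const results = new Map<string, string>();
  let pagesOpened = 0;
  let page = await openPage();
  pagesOpened++;

  for (const url of urls) {
    let attempts = 0;
    for (;;) {
      const res = await page.callApi(url);
      if (res.status === 403) {
        // Blocked: discard the burned page, open a fresh one, retry the URL.
        if (++attempts > maxRetriesPerUrl) throw new Error(`Giving up on ${url}`);
        await page.close();
        page = await openPage();
        pagesOpened++;
        continue;
      }
      results.set(url, res.body ?? "");
      break;
    }
  }
  await page.close();
  return { results, pagesOpened };
}
```

The same loop could presumably live inside a BasicCrawler `requestHandler`, with the shared page held on the crawler instance and recycled whenever a block is detected.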
PlaywrightCrawler
Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and Webkit browsers with Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since Playwright uses headless brow...
MDN Web Docs
XMLHttpRequest (XHR) is a JavaScript API to create HTTP requests. Its methods provide the ability to send network requests between the browser and a server.

MDN Web Docs
The global fetch() method starts the process of fetching a resource from the network, returning a promise that is fulfilled once the response is available.

BasicCrawler
Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves.
If we want a c...
