Crawler for SPAs (Single Page Application)
Hi all!
My goal is to scrape a website composed of SPAs (Single-Page Applications), and it looks like the existing browser crawlers (i.e. PlaywrightCrawler and PuppeteerCrawler) are not a good fit, since each request is processed in a new page, which wastes resources.
What I need is to open one browser page and execute multiple XHR / fetch requests against the site's unofficial API, until I get blocked and need to open a new browser page to continue until all requests have been processed.
Note that I need a browser to pass fingerprint checks and to use the website's internal library to digitally sign each request to their unofficial API.
I'm thinking of solving this by writing a SinglePageBrowserCrawler that extends BasicCrawler and works similarly to BrowserCrawler, but manages browser pages differently. Is this a good idea? Is there a better way to do this?
Thanks in advance for your feedback!
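For what it's worth, here is a minimal sketch of the page-rotation logic I have in mind, written independently of Crawlee and Playwright so the idea stands on its own. The names `SignedApiPage` and `crawlWithRotatingPage` are made up, and treating HTTP 403 as the "blocked" signal is an assumption; in a real SinglePageBrowserCrawler the page would come from `browser.newPage()` and `callApi` would wrap `page.evaluate(() => fetch(...))` so the site's signing library runs in-page.

```typescript
// Hypothetical stand-in for a Playwright page on which the site's
// request-signing library is already loaded. A real implementation would
// issue requests via page.evaluate(fetch) instead.
interface SignedApiPage {
  callApi(url: string): Promise<{ status: number; body?: string }>;
  close(): Promise<void>;
}

// Processes every request on a single shared page, rotating to a fresh
// page only when the site starts blocking (assumed here to mean HTTP 403).
async function crawlWithRotatingPage(
  urls: string[],
  openPage: () => Promise<SignedApiPage>,
  maxRetriesPerUrl = 3,
): Promise<{ results: Map<string, string>; pagesOpened: number }> {
  const results = new Map<string, string>();
  let pagesOpened = 0;
  let page = await openPage();
  pagesOpened++;

  for (const url of urls) {
    let attempts = 0;
    for (;;) {
      const res = await page.callApi(url);
      if (res.status === 403) {
        // Blocked: discard the burned page, open a fresh one, retry the URL.
        if (++attempts > maxRetriesPerUrl) throw new Error(`Giving up on ${url}`);
        await page.close();
        page = await openPage();
        pagesOpened++;
        continue;
      }
      results.set(url, res.body ?? "");
      break;
    }
  }
  await page.close();
  return { results, pagesOpened };
}
```

The same loop could presumably live inside a BasicCrawler `requestHandler`, with the shared page held on the crawler instance and recycled whenever a block is detected.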
PlaywrightCrawler
Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and Webkit browsers with Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since Playwright uses headless brow...
MDN Web Docs
XMLHttpRequest (XHR) is a JavaScript API to create HTTP requests. Its methods provide the ability to send network requests between the browser and a server.

MDN Web Docs
The global fetch() method starts the process of fetching a resource from the network, returning a promise that is fulfilled once the response is available.

BasicCrawler
Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves.
If we want a c...
