ambitious-aqua•15mo ago

Extract data from a JSON variable

I can't figure out how to extract data from a JavaScript variable on a page I'm crawling. The page contains this script:
<script>
window.addEventListener("load", function() {
    var jsondata = [{"id":"JOB_POSTING-3-865832","jobreqid":"107457WD"}];
    var jobsTable = new JobsTable({
        element: document.getElementById("wdresults"),
        data: jsondata
    });
});
</script>
My crawler currently looks like this:
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}...`);

        // Wait for the page to fully load
        await page.waitForLoadState('networkidle');

        // Extract the jsondata variable
        const jsondata = await page.evaluate(() => {
            // Check if the variable is defined and return it
            if (typeof jsondata !== 'undefined') {
                return jsondata;
            }
            return [];
        });

        // Log the extracted data for debugging
        log.info(`Extracted data: ${JSON.stringify(jsondata)}`);

        // Save the jsondata to a dataset
        await Dataset.pushData({ jsondata });
    },
});
What am I doing wrong?
4 Replies
Marco•15mo ago
Are you trying to access the variable jsondata declared in the script tag? As far as I know, page.evaluate can only see variables that live in the page's global scope. In this case jsondata is declared with var inside the callback passed to addEventListener (inside the curly braces { }), so it is local to that function and not visible from outside. That is why typeof jsondata is 'undefined' in your evaluate call and you get the empty array back. The data is passed to a JobsTable attached to the element with ID wdresults, though, so maybe you could scrape the rendered data from that element instead.
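The scoping point can be checked outside the browser. This is only a minimal sketch: runOnLoad is a hypothetical stand-in for window.addEventListener("load", ...) that calls the callback immediately, and isVisibleOutside is a name made up for the illustration.

```typescript
// Hypothetical stand-in for window.addEventListener("load", callback):
// it simply invokes the callback right away.
function runOnLoad(callback: () => void): void {
    callback();
}

runOnLoad(function () {
    // `var` inside a function is local to that function, not a global.
    var jsondata = [{ id: "JOB_POSTING-3-865832", jobreqid: "107457WD" }];
    void jsondata; // mark the variable as used
});

// Same kind of check the crawler performs inside page.evaluate().
const isVisibleOutside: boolean =
    typeof (globalThis as any).jsondata !== "undefined";

console.log(isVisibleOutside); // prints false
```

Because the check fails the same way in the page, the evaluate callback falls through to return [].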
HonzaS•15mo ago
Can't you just parse it from the page?
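One way to read this suggestion is to treat the page source as plain text and pull the array literal out with a regular expression. The sketch below assumes the HTML is already available as a string (in a Playwright handler it could come from page.content()); extractJsonData and sampleHtml are hypothetical names, and the approach only works while the literal is strict JSON.

```typescript
// Hypothetical helper: pulls the array assigned to `var jsondata` out of raw HTML.
function extractJsonData(html: string): unknown[] {
    // Capture the array literal between "var jsondata =" and the closing ";".
    // [\s\S] matches any character, including newlines, for multi-line arrays.
    const match = html.match(/var\s+jsondata\s*=\s*(\[[\s\S]*?\])\s*;/);
    const literal = match?.[1];
    if (literal === undefined) {
        return [];
    }
    try {
        const parsed: unknown = JSON.parse(literal);
        return Array.isArray(parsed) ? parsed : [];
    } catch {
        // The literal was not strict JSON (e.g. single quotes or trailing commas).
        return [];
    }
}

// Sample input copied from the script in the question.
const sampleHtml = `<script>
window.addEventListener("load", function() {
var jsondata = [{"id":"JOB_POSTING-3-865832","jobreqid":"107457WD"}];
var jobsTable = new JobsTable({
element: document.getElementById("wdresults"),
data: jsondata
});
});
</script>`;

console.log(extractJsonData(sampleHtml));
```

The helper returns an empty array rather than throwing when nothing matches, which mirrors the fallback in the original evaluate callback.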
ambitious-aquaOP•15mo ago
I ended up doing it like this:
// Wait for the page to fully load
await page.waitForLoadState('networkidle');
log.info('Page load state: networkidle');

// Extract JSON data from the script tag
const jsondata = await page.evaluate(() => {
    const scriptNodes = Array.from(document.querySelectorAll('script'));
    // Stays empty if no script contains the variable
    let jsonDataArray: any[] = [];

    scriptNodes.forEach(script => {
        const scriptContent = script.innerHTML;
        // Capture the array literal assigned to `var jsondata`
        const pattern = /var jsondata = (\[.*?\])\s*;/s;
        const match = scriptContent.match(pattern);

        if (match) {
            // The literal is valid JSON, so JSON.parse can read it
            jsonDataArray = JSON.parse(match[1]);
        }
    });

    return jsonDataArray;
});

Not sure if that is the right way, but it works 🙂
Marco•15mo ago
There is no "right way" to do scraping. Here you are essentially parsing the script's source text, which is a bit uncommon but works, so it shouldn't be a problem. 🙂
