ambitious-aqua
ambitious-aqua•3y ago

dataset.getData(offset, limit) throws error

Hi everyone, I'm looping over a dataset and retrieving items in batch using dataset.getData(offset, limit) but my process seems to crash randomly with the following error: items.push(await existingStoreById.datasetEntries.get(entryNumber).get()); TypeError: Cannot read properties of undefined (reading 'get') at DatasetClient.listItems (/home/crawleruser/node_modules/@crawlee/memory-storage/resource-clients/dataset.js:140:79) at async Dataset.getData (/home/crawleruser/node_modules/@crawlee/core/storages/dataset.js:220:20) Does anyone know what might be causing this? I'm using Crawlee 3.3.3
11 Replies
continuing-cyan
continuing-cyan•3y ago
Could you please provide a (semi-)full reproduction (at least the looping part)? It's hard to tell what exactly is going on. Any chance the offset is too high and no items are returned, or something like that?
ambitious-aqua
ambitious-aquaOP•3y ago
Thank you for your help, Andrey. Here is some code below. Basically, I have an async loop running in parallel with my scraper that stores items in a database every 10 s as they are collected:
async processAndStore() {
  let offset = 0;
  let limit = 100;
  let itemCount = 0;
  while (
    !this.crawler_finished ||
    offset < (await this.dataset.getInfo()).itemCount
  ) {
    await sleep(10000);
    itemCount = (await this.dataset.getInfo()).itemCount;
    if (itemCount > offset) {
      const { items } = await this.dataset.getData({
        offset: offset,
        limit: limit
      });
      // store items
      offset = Math.min(itemCount, offset + limit);
    }
  }
}
It's possible that I have an offset issue, although I believe it shouldn't happen with the code above. But in any case, would it make sense for the Crawlee error to be more explicit here?
continuing-cyan
continuing-cyan•3y ago
So it fails on dataset.getData()? This is rather weird, because getData will just return an empty array if, e.g., the offset is too high. So, just once again: you basically run the crawler, wait for it to finish, then load all the items in batches. It works fine for some time, and then it just crashes?
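The paging semantics described above can be sketched with a mock dataset (a stand-in for illustration, not the real crawlee API): an offset past the end yields an empty items array rather than throwing, so a drain loop can simply stop on an empty page.

```javascript
// Mock reproducing the documented Dataset.getData() paging behavior:
// an offset beyond itemCount yields { items: [] }, not an error.
function makeMockDataset(total) {
  const data = Array.from({ length: total }, (_, i) => ({ id: i }));
  return {
    async getInfo() {
      return { itemCount: data.length };
    },
    async getData({ offset, limit }) {
      return { items: data.slice(offset, offset + limit) };
    },
  };
}

// Drain the dataset in batches, stopping when a page comes back empty.
async function drainDataset(dataset, limit) {
  const all = [];
  let offset = 0;
  for (;;) {
    const { items } = await dataset.getData({ offset, limit });
    if (items.length === 0) break; // past the end: empty page, no throw
    all.push(...items);
    offset += items.length;
  }
  return all;
}
```

Given these semantics, a too-high offset alone should never produce the TypeError from the stack trace, which is what makes the crash surprising.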
ambitious-aqua
ambitious-aquaOP•3y ago
Yep, it fails on dataset.getData(). It does look like a weird issue because it doesn't seem to happen in my local env, but it crashes randomly when I run my crawler in Kubernetes.

Essentially, I have two async processes running in parallel:
1) the crawler, a CheerioCrawler or PuppeteerCrawler
2) the storage loop, which collects the scraped items from the dataset every 10 s and stores them in a database

I use this to avoid having a massive insert query at the end of my crawling process that would overload the database. The simplified code looks like this:
const storageLoop = processAndStore(); // the async function pasted above
await crawler.run();
crawler_finished = true;
await storageLoop;
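For reference, the concurrent pattern in this snippet can be exercised end to end with simulated stand-ins (the crawler, dataset, and timings below are illustrative assumptions, not the real crawlee objects): the storage loop runs alongside a producer, polls on a timer, and a finished flag plus a final drain shuts it down cleanly.

```javascript
// Sketch of the crawler + storage-loop pattern from the thread,
// with a fake crawler and an in-memory dataset standing in.
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

function makeInMemoryDataset() {
  const data = [];
  return {
    push: (item) => data.push(item),
    async getInfo() {
      return { itemCount: data.length };
    },
    async getData({ offset, limit }) {
      return { items: data.slice(offset, offset + limit) };
    },
  };
}

async function runPipeline(totalItems, limit = 10) {
  const dataset = makeInMemoryDataset();
  const stored = [];
  let crawlerFinished = false;

  // Poll the dataset and drain any new items in batches.
  async function processAndStore() {
    let offset = 0;
    while (!crawlerFinished || offset < (await dataset.getInfo()).itemCount) {
      await sleep(5); // shortened poll interval for the sketch
      const { items } = await dataset.getData({ offset, limit });
      stored.push(...items); // a real pipeline would insert into a DB here
      offset += items.length;
    }
  }

  // Stand-in for crawler.run(): pushes items over time.
  async function fakeCrawler() {
    for (let i = 0; i < totalItems; i++) {
      dataset.push({ id: i });
      await sleep(1);
    }
  }

  const storageLoop = processAndStore(); // start, but don't await yet
  await fakeCrawler();
  crawlerFinished = true; // flip the flag, then let the loop drain the rest
  await storageLoop;
  return stored;
}
```

Because the while condition re-reads itemCount after the flag flips, the loop cannot exit before everything pushed by the producer has been drained.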
continuing-cyan
continuing-cyan•3y ago
Sorry for disappearing, I was sick for a couple of days. I've sent it to the team for deeper investigation.
ambitious-aqua
ambitious-aquaOP•3y ago
No worries at all, thank you for looking into it Andrey and hope you are feeling better!
continuing-cyan
continuing-cyan•3y ago
Thanks! @vladdy this is the full conversation, so if you find anything, you can post it here directly 🙂
stormy-gold
stormy-gold•3y ago
@fab8203 can you put together a minimal repro sample? Something like an interval for the storage fetcher and a for loop adding items in? I'm trying to think of what could cause this, but nothing jumps out right away.
ambitious-aqua
ambitious-aquaOP•3y ago
Hi @vladdy, thank you for helping troubleshoot the issue. The whole project is quite complex, but I'll try to add as much as I can. One key element is that it never happens in my local env (Windows) but does happen systematically, although at random times, when running in Docker on GKE:
import { Dataset, CheerioCrawler } from "crawlee";

const dataset = await Dataset.open();
const itemsToStore = [];
let crawlerFinished = false; // must be let, not const: it is reassigned below
const crawler = new CheerioCrawler({});

const storageLoop = processAndStore();
await crawler.run();
crawlerFinished = true;
await storageLoop;

async function processAndStore() {
  let offset = 0;
  const maxItemPicked = 100;
  while (!crawlerFinished || offset < (await dataset.getInfo()).itemCount) {
    await sleep(10000);
    const itemCount = (await dataset.getInfo()).itemCount;
    const itemPicked = Math.min(maxItemPicked, itemCount - offset);
    if (itemCount > offset) {
      const { items } = await dataset.getData({
        offset: offset,
        limit: itemPicked,
      });

      for (let item of items) {
        item = process(item);
        itemsToStore.push(item);
      }

      if (itemsToStore.length > 0) {
        await prisma.app_items.createMany({
          data: itemsToStore,
        });
      }

      offset = Math.min(itemCount, offset + itemPicked);
    }
  }
}

function process(item) {
  // do some processing
}
fascinating-indigo
fascinating-indigo•15mo ago
Hey @fab8203, did you guys ever figure this out last year? Running into the same issue when running my crawler at scale in Kubernetes/Docker.
Pepa J
Pepa J•15mo ago
Hi @Joshua Perk, can you post a sample of your code related to the exception? It seems to me there are two different issues discussed in this topic.
