ambitious-aqua•3y ago
dataset.getData(offset, limit) throws error
Hi everyone,
I'm looping over a dataset and retrieving items in batches using dataset.getData(offset, limit), but my process seems to crash randomly with the following error:
items.push(await existingStoreById.datasetEntries.get(entryNumber).get());
TypeError: Cannot read properties of undefined (reading 'get')
    at DatasetClient.listItems (/home/crawleruser/node_modules/@crawlee/memory-storage/resource-clients/dataset.js:140:79)
    at async Dataset.getData (/home/crawleruser/node_modules/@crawlee/core/storages/dataset.js:220:20)
Does anyone know what might be causing this?
I'm using Crawlee 3.3.3
11 Replies
continuing-cyan•3y ago
Could you please provide a (semi-)full reproduction (at least the looping part)? It's hard to tell what exactly is going on... Any chance your offset is too high and no items are returned, or something like that?
ambitious-aquaOP•3y ago
Thank you for your help, Andrey; here is some code below. Basically, I have an async loop running in parallel with my scraper that stores items in a database every 10 s as they are collected.
It's possible that I have an offset issue, although I believe it shouldn't happen with the code above. But in any case, would it make sense for the Crawlee error to be more explicit here?
continuing-cyan•3y ago
So it fails on dataset.getData()? This is rather weird, because getData just returns an empty array if, for example, the offset is too high. So just once again: you basically run the crawler, wait for it to finish, then load all the items in batches. It works fine for some time, and then it just crashes?
ambitious-aquaOP•3y ago
Yep, it fails on dataset.getData(). It does look like a weird issue, because it doesn't seem to happen in my local env, but it crashes randomly when I run my crawler in Kubernetes.
Essentially, I have two async processes running in parallel
1) the crawler, CheerioCrawler or PuppeteerCrawler
2) the storage loop, which collects the scraped items from the dataset every 10 s and stores them in a database
I use this to avoid a massive insert query at the end of my crawling process that would overload the database.
The simplified code looks like this:
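(The original snippet wasn't preserved in this thread. Below is a minimal sketch of the pattern described above, assuming Crawlee's Dataset.open() and dataset.getData({ offset, limit }) API; the saveToDatabase helper, batch size, and interval handling are hypothetical stand-ins, not the poster's actual code.)

import { Dataset } from 'crawlee';

// Hypothetical database helper; in reality this would be a bulk insert.
async function saveToDatabase(items: Record<string, unknown>[]): Promise<void> {
    // ...
}

const dataset = await Dataset.open();
const BATCH_SIZE = 500; // hypothetical batch size
let offset = 0;

// Storage loop running in parallel with the crawler: every 10 s, fetch any
// newly pushed items starting at the current offset and persist them.
const storageLoop = setInterval(async () => {
    const { items, count } = await dataset.getData({ offset, limit: BATCH_SIZE });
    if (count === 0) return; // nothing new since the last tick
    await saveToDatabase(items);
    offset += count;
}, 10_000);

// Elsewhere: await crawler.run(), then clearInterval(storageLoop) and run one
// final getData() pass to flush any remaining items.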
continuing-cyan•3y ago
Sorry for disappearing, I was sick for a couple of days. Sent it to the team for deeper investigation.
ambitious-aquaOP•3y ago
No worries at all, thank you for looking into it Andrey and hope you are feeling better!
continuing-cyan•3y ago
Thanks!
@vladdy this is the full conversation, so if you find anything you can post it there directly 🙂
stormy-gold•3y ago
@fab8203 can you put together a minimal repro sample at all? Something like an interval for the storage fetcher and a for loop adding items in?
I'm trying to think what could cause this but nothing jumps out right away
ambitious-aquaOP•3y ago
Hi @vladdy, thank you for helping troubleshoot the issue. The whole project is quite complex, but I'll try to add as much as I can. One key element is that it never happens in my local env (Windows), but it does happen consistently, though at random times, when running in Docker on GKE.
fascinating-indigo•15mo ago
Hey @fab8203, did you guys ever figure this out last year? Running into the same issue when running my crawler at scale in Kubernetes/Docker.
Hi @Joshua Perk, can you post a sample of your code related to the exception? It seems to me there are two different issues being discussed in this topic.