flat-fuchsia•2y ago
CheerioCrawler mixed data when using $
Hi Team! 👋 I'm pretty new to Crawlee and I'm experimenting with the CheerioCrawler: crawl a website and store visited URLs with their title in a database.
However, I noticed that, at random, $('title').text() returns the wrong data (probably from another Cheerio instance?).
I believe I'm doing something wrong since this seems to be a basic use case, so I'm sorry if this question has already been asked.
Would output:
1st run:
2nd run:
15 Replies
Hi @simonbrunel ,
Your code looks good to me. I guess there has to be something happening on the website. Unfortunately I cannot help you more without having the real website example.
flat-fuchsiaOP•2y ago
Hi @Pepa J! I think it happens with different websites, but the one I used in my example was https://www.intk.com/. Note that this issue does not seem to happen when maxConcurrency = 1, so I suspect a shared "context" or something like that between concurrent requests.
Thank you @simonbrunel, I am able to reproduce it now.
flat-fuchsiaOP•2y ago
Glad to hear, at least it's not specific to my project setup. Have you been able to reproduce it on another website?
@simonbrunel I just tried to run it on my own small static website but was not able to reproduce it there. Do you have any more websites like this?
flat-fuchsiaOP•2y ago
I don't have any right now. Though, shouldn't the one I shared be enough to trace the issue? I'm not sure how the website itself could cause such a problem.
We are discussing it internally right now, with the one specific website.
flat-fuchsiaOP•2y ago
Thank you!
flat-fuchsiaOP•2y ago
@Pepa J Good morning! Any update on that issue?
Hi @simonbrunel, I am sorry, this currently isn't a priority and we have not been able to reproduce it on any other website so far.
Crawlee is an open-source project, so if you need this fixed, you may try to debug the problem and open a PR with fixes if you find anything.
@simonbrunel it might be anti-bot protection based on IP. Try saving the HTML to the KV store and checking the title in the actual HTML.
flat-fuchsiaOP•2y ago
@Pepa J I understand, will see if I can debug it and submit a fix
@Alexey Udovydchenko Will try, though I'm not completely sure I understand what you are suggesting 😄
@simonbrunel
await keyValueStore.setValue('my-key', context.body, { contentType: 'text/html' });
I already checked this, but the wrong page is already in response.body, even with HttpCrawler. I wouldn't rule out the possibility that this is just the website's behavior, since we were not able to reproduce it anywhere else.
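For anyone who wants to try the suggestion above (checking the title in the actual HTML), here is a minimal sketch in plain JavaScript. The helper names extractRawTitle and compareTitles are hypothetical and are not part of Crawlee. The regex lookup is deliberately naive: it does not decode HTML entities and only looks at the first <title> element, so treat it as a debugging aid rather than a parser.

```javascript
// Hypothetical helper (not a Crawlee API): read the first <title> directly
// from a raw HTML string or Buffer, without going through Cheerio.
function extractRawTitle(html) {
  const match = /<title(?:\s[^>]*)?>([\s\S]*?)<\/title>/i.exec(String(html));
  // Collapse whitespace so line breaks inside the tag do not cause false mismatches.
  return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}

// Hypothetical helper: compare the title found in the raw body with the
// title returned by the parser (for example $('title').text()).
function compareTitles(rawHtml, parsedTitle) {
  const rawTitle = extractRawTitle(rawHtml);
  const parsed = String(parsedTitle).replace(/\s+/g, ' ').trim();
  return { rawTitle, parsedTitle: parsed, matches: rawTitle === parsed };
}
```

Inside a request handler this could be called as compareTitles(body, $('title').text()), assuming the handler context exposes body and $. If matches is true while the title still belongs to a different page than the requested URL, the wrong page was already in the response body, which would point at the website (or the HTTP layer) rather than at Cheerio.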