flat-fuchsia
flat-fuchsia•2y ago

CheerioCrawler mixed data when using $

Hi Team! 👋 I'm pretty new to Crawlee and I'm experimenting with the CheerioCrawler: I crawl a website and store the visited URLs with their titles in a database. However, I noticed that, at random, $('title').text() returns the wrong data (probably from another Cheerio instance?). I assume I'm doing something wrong, since this seems like a basic use case, so I'm sorry if this question has already been asked.
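import { CheerioCrawler } from "crawlee";

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, enqueueLinks, request }) => {
        const { url } = request;
        // Read the <title> of the page that was just fetched.
        const name = $("title").text();
        console.log(name, "-->", url);
        // Queue the links found on this page.
        await enqueueLinks();
    },
});

await crawler.run(["https://www.example.com"]);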
Would output: 1st run:
Digital Marketing Expert --> https://www.example.com/job-positions/digital-marketing-expert
Job positions --> https://www.example.com/job-positions
Job positions --> https://www.example.com/internships/internship-software-development
2nd run:
Digital Marketing Expert --> https://www.example.com/job-positions/digital-marketing-expert
Job positions --> https://www.example.com/job-positions
Internship: Software Development --> https://www.example.com/internships/internship-software-development
15 Replies
Pepa J
Pepa J•2y ago
Hi @simonbrunel, your code looks good to me. I guess something must be happening on the website itself. Unfortunately, I can't help further without the real website.
flat-fuchsia
flat-fuchsiaOP•2y ago
Hi @Pepa J ! I think it happens with different websites, but the one I used in my example was https://www.intk.com/. Note that this issue does not seem to happen when maxConcurrency = 1, so I would think of a shared "context" or something like that between concurrent requests.
Pepa J
Pepa J•2y ago
Thank you @simonbrunel, I am able to reproduce it now.
flat-fuchsia
flat-fuchsiaOP•2y ago
Glad to hear, at least it's not specific to my project setup. Have you been able to reproduce it on another website?
Pepa J
Pepa J•2y ago
@simonbrunel I just tried running it on my own small static website but was not able to reproduce it there. Do you have any other websites where this happens?
flat-fuchsia
flat-fuchsiaOP•2y ago
I don't have any right now. Though, shouldn't the one I shared be enough to trace the issue? I'm not sure how the website itself could cause such a problem.
Pepa J
Pepa J•2y ago
We are discussing it internally right now, with the one specific website.
flat-fuchsia
flat-fuchsiaOP•2y ago
Thank you!
flat-fuchsia
flat-fuchsiaOP•2y ago
@Pepa J Good morning! Any update on that issue?
Pepa J
Pepa J•2y ago
Hi @simonbrunel, I am sorry, this currently doesn't have priority, and we have not been able to reproduce it on any other website so far. Crawlee is an open-source project, so if you need this fixed, you are welcome to debug the problem and open a PR with a fix if you find anything.
Alexey Udovydchenko
Alexey Udovydchenko•2y ago
@simonbrunel it might be anti-bot protection based on IP. Try saving the HTML to the key-value store and checking the title in the actual HTML.
flat-fuchsia
flat-fuchsiaOP•2y ago
@Pepa J I understand, I will see if I can debug it and submit a fix. @Alexey Udovydchenko I will try, though I'm not completely sure I understand what you are suggesting 😄
Alexey Udovydchenko
Alexey Udovydchenko•2y ago
@simonbrunel await keyValueStore.setValue('my-key', context.body, { contentType: 'text/html' });
Pepa J
Pepa J•2y ago
I already checked this, but the wrong page is already present in response.body, even with HttpCrawler. I wouldn't rule out the possibility that this is just the website's behavior, since we were not able to reproduce it anywhere else.
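
For anyone who wants to repeat the check discussed in this thread, here is a minimal sketch of it. It stores each response body under a key derived from the request URL (rather than the single 'my-key' used above, which concurrent requests would overwrite) and compares the title found in the raw HTML with the title Cheerio returned. The helper names keyForUrl, titleFromHtml and snapshotPage, and the HtmlStore interface, are hypothetical and not part of Crawlee; the interface only mirrors the setValue(key, value, { contentType }) call shown earlier.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical minimal shape of a store; it mirrors the
// setValue(key, value, { contentType }) call used earlier in the thread.
interface HtmlStore {
    setValue(key: string, value: string, options?: { contentType?: string }): Promise<void>;
}

// Derive a short, stable key from the URL so that concurrent requests
// do not overwrite each other's snapshot. A hex digest contains only
// [0-9a-f], which is a conservative choice for store keys.
function keyForUrl(url: string): string {
    const digest = createHash('sha256').update(url).digest('hex');
    return `page-${digest.slice(0, 32)}`;
}

// Crude title extraction straight from the raw HTML, independent of Cheerio.
// This is only a sanity check, not a general HTML parser.
function titleFromHtml(html: string): string | null {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    return match ? match[1].trim() : null;
}

// Save the raw body and report whether its title agrees with the parsed one.
async function snapshotPage(
    store: HtmlStore,
    url: string,
    body: { toString(): string },
    parsedTitle: string,
): Promise<{ key: string; rawTitle: string | null; matches: boolean }> {
    const html = body.toString();
    const key = keyForUrl(url);
    await store.setValue(key, html, { contentType: 'text/html' });
    const rawTitle = titleFromHtml(html);
    return { key, rawTitle, matches: rawTitle === parsedTitle.trim() };
}
```

Inside the requestHandler this could be called as snapshotPage(store, request.url, body, $("title").text()), assuming store was opened with KeyValueStore.open() and body is taken from the handler context. If matches is true while the title still does not belong to the URL, the wrong page was already in the response body, which would point at the website rather than at Cheerio. Comparing a run at the default concurrency with a run at maxConcurrency = 1 would show whether the mismatch depends on concurrent requests.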