HonzaS, 2y ago

got-scraping vs cheerioCrawler or sendRequest

I have a weird case with a URL like this: https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html
With got-scraping it returns data, but with cheerioCrawler, or when using sendRequest in BasicCrawler, I get just a small HTML page with this text: 'Pravděpodobně používáte rozšíření či blokátory, jež mohou ovlivňovat načtení této stránky. Pro správnou funkčnost prosíme, deaktivujte všechna tato rozšíření a zkuste stránku načíst znovu.' (In English: 'You are probably using extensions or blockers that may affect the loading of this page. For proper functionality, please deactivate all such extensions and try loading the page again.') Can anybody tell me what the difference is? I thought that cheerioCrawler and sendRequest both use got-scraping internally. Thanks
4 Replies
Saurav Jain, 2y ago
Our team will reply soon.
xenial-black, 2y ago
Hi @HonzaS, it seems that this is caused by the isStream: true got option that Crawlee uses, but I am not entirely sure why. I will ask internally for more info.
Pepa J, 2y ago
Hi @HonzaS, can you please confirm that this is what is causing your issue? If so, as a workaround I believe you should be able to set this value in preNavigationHooks.
HonzaS (OP), 17mo ago
@Pepa J I tried got-scraping with isStream: true, but it gives the same result as isStream: false. However, I found out that it is the accept header that makes the site return the consent screen instead of data. So for cheerioCrawler the solution is to clear that header (set it to an empty string) in preNavigationHooks like this:
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        crawlingContext.request.headers['accept'] = '';
    },
]
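
For anyone who wants to reuse the workaround, here is a minimal sketch of the same hook as a named function. The name clearAcceptHeader is made up for illustration, and the extra handling of differently cased header keys (such as "Accept") is an assumption on my side, not something confirmed in this thread. The hook arguments (crawlingContext, gotOptions) are the same as in the snippet above.

```javascript
// Hypothetical helper: clears the accept header on the request before navigation.
// Same idea as the inline hook above, just packaged so it can be reused.
async function clearAcceptHeader(crawlingContext, gotOptions) {
    const { request } = crawlingContext;

    // Make sure there is a headers object to write to.
    request.headers = request.headers ?? {};

    // Assumption: header names may have been set with different casing
    // (e.g. "Accept"), so remove every variant before setting the value.
    for (const name of Object.keys(request.headers)) {
        if (name.toLowerCase() === 'accept') {
            delete request.headers[name];
        }
    }

    // An empty value is what was reported to work for firmy.cz in this thread.
    request.headers.accept = '';
}

// Wiring it into a crawler (assumes the `crawlee` package is installed):
//
// const { CheerioCrawler } = require('crawlee');
// const crawler = new CheerioCrawler({
//     preNavigationHooks: [clearAcceptHeader],
//     requestHandler: async ({ $, request }) => {
//         console.log(request.url, $('title').text());
//     },
// });
// crawler.run(['https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html']);
```

Whether an empty accept value is the right fix for other sites is untested; it is only the value reported to work for this one.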
