Apify Discord Mirror

Updated 5 months ago

got-scraping vs cheerioCrawler or sendRequest

At a glance

A community member gets different responses from the same URL depending on the scraping method. With got-scraping the website returns data, but with cheerioCrawler or sendRequest in BasicCrawler it returns a short page saying the user may be running extensions or blockers that affect page loading. The first suggestion was that the isStream: true option used in crawlee causes this, but the original poster found that isStream made no difference. The trigger was the accept header, and the fix is to clear it in preNavigationHooks.

I have a weird case with a URL like this: https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html
With got-scraping it returns data, but with cheerioCrawler or with sendRequest in BasicCrawler I get just a small HTML page with this text: 'Pravděpodobně používáte rozšíření či blokátory, jež mohou ovlivňovat načtení této stránky. Pro správnou funkčnost prosíme, deaktivujte všechna tato rozšíření a zkuste stránku načíst znovu.' (In English: 'You are probably using extensions or blockers that may affect the loading of this page. For correct functionality, please deactivate all such extensions and try loading the page again.')
Can anybody clever tell me what the difference is? I thought that cheerioCrawler and sendRequest both use got-scraping inside.
Thanks
4 comments
Hi,
it seems that this is caused by the isStream: true got option that is being used in crawlee, but I am not entirely sure why that is so. I will ask internally for more info.
Hi, can you please confirm this is what is causing your issue? If so, as a workaround I believe you should be able to set this value in a preNavigationHook.
I have tried got-scraping with isStream: true, but it gives the same result as isStream: false.
But I have found out that it is the accept header that makes the site return the consent screen instead of data.
So for cheerio the solution is to clear that header in preNavigationHooks like this:
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // Blank the accept header so the site returns data instead of the blocker notice.
        crawlingContext.request.headers['accept'] = '';
    },
],
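For anyone who wants to reuse the workaround, here is a minimal sketch of the same idea as a small hook factory. The name `blankHeaders` is hypothetical and not part of Crawlee. It mirrors the snippet above by setting the header to an empty string rather than removing the key, and the commented usage assumes the `crawlee` package is installed.

```javascript
// Hypothetical helper (not a Crawlee API): returns a preNavigationHook
// that sets each listed request header to an empty string.
function blankHeaders(names) {
    return async (crawlingContext) => {
        const { request } = crawlingContext;
        // Make sure there is a headers object to write into.
        request.headers = request.headers ?? {};
        for (const name of names) {
            request.headers[name.toLowerCase()] = '';
        }
    };
}

// Usage sketch inside crawler options (assumes the `crawlee` package):
// const crawler = new CheerioCrawler({
//     preNavigationHooks: [blankHeaders(['accept'])],
//     requestHandler: async ({ $ }) => { /* parse the page */ },
// });
```

Whether other sites need additional headers cleared is not covered in this thread, so the list passed to the helper is something to find out per site.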