Apify Discord Mirror

Updated 2 years ago

Request works in Postman but doesn't work with CheerioCrawler; request object headers are empty

At a glance

The community member is trying to scrape data from a public IP with CheerioCrawler but gets no data back, although the same request succeeds in Postman through the same proxy IP. The headers field of the request object is empty, and they want to know at what point Cheerio sets the headers and generates fingerprints, and how to inspect them.

In the comments, another community member suggests looking at response.headers instead of the request headers. The original poster replies that they care about what is being sent in the request, because the API they are scraping is sensitive to certain headers. Using Playwright in headful mode, they found that the first request fails but a refresh of the same page works.

Other community members suggest adding the necessary headers when enqueueing the link and testing with different proxy groups. Ultimately, the original poster reports that the issue was related to cookies rather than the proxy, and that it has been solved.

Dear all, I am trying to scrape data from a public IP. For some reason CheerioCrawler is not getting the data back, but in Postman I can easily get it. The proxy IP is whitelisted, because I am using the same IP for Postman and for Cheerio.

Postman does add some default headers, but when I look at my request object the headers are empty. Does anyone know at which point Cheerio sets the headers and generates fingerprints, and how I can see them?

Request {
  id: 'OBTRQI5zvA4aIJ9',
  url: 'https://someapi.com',
  loadedUrl: 'https://someapi.com',
  uniqueKey: '22586062-3f0d-40be-b499-f1a00261b5d3',
  method: 'GET',
  payload: undefined,
  noRetry: false,
  retryCount: 0,
  errorMessages: [],
  headers: {},
  userData: [Getter/Setter],
  handledAt: undefined
}


Any help would be highly appreciated. Thanks
9 comments
Request is our structure which stores what URL to call, what HTTP method, and with what headers/payload to call it!

You probably want response.headers, where response comes from the context of the requestHandler function
Thanks for your response. Actually, I am more interested in what is being sent in the request headers. I debugged it further and found that scraping the API doesn't work on the first try, but it does work when I refresh the browser opened by Crawlee. To check what was going on, I used Playwright in headful mode: the first load showed an error, but when I refreshed the same page I got the response back. The API I am trying to scrape is very sensitive to some headers, as you can see in the picture. I think some headers are not set properly in the first request, and on refresh the browser adds its default headers and then it works.
Attachment
image.png
Oh, those headers
You can add them yourself!
When you enqueue the link, you can enqueue via an object with url and headers, and pass in any header you need on the initial request.
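The suggestion above can be sketched roughly as follows. This is a minimal TypeScript sketch, not code from the thread: the header names and values, the RequestEntry type, and the withHeaders helper are all hypothetical, and the sketch only builds the plain { url, headers } objects without importing Crawlee.

```typescript
// Minimal shape of a request entry: a URL plus the headers to send with it.
// (Hypothetical local type; Crawlee accepts plain objects with these fields.)
interface RequestEntry {
    url: string;
    headers: Record<string, string>;
}

// Hypothetical header values; replace them with whatever the target API expects,
// for example the headers Postman or the browser sends.
const apiHeaders: Record<string, string> = {
    accept: 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    referer: 'https://someapi.com/',
};

// Attach a copy of the headers to every URL so entries do not share one object.
function withHeaders(urls: string[], headers: Record<string, string>): RequestEntry[] {
    return urls.map((url) => ({ url, headers: { ...headers } }));
}

const requests = withHeaders(['https://someapi.com'], apiHeaders);
console.log(JSON.stringify(requests, null, 2));

// With Crawlee installed, the entries could then be handed to the crawler, e.g.:
//   await crawler.run(requests);          // as start requests
//   await crawler.addRequests(requests);  // when enqueueing later
```

Headers set this way end up in the headers field of the request object, which is why that field is empty when nothing was passed in explicitly.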
Still doesn't work. With the same proxy it works in a simple browser, but when I use it with Crawlee it doesn't work.
Do you have any idea?
Did you test it with different proxy groups?
It wasn't related to the proxy but rather to cookies. It's solved now.
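The thread does not say how the cookie problem was fixed. One possibility, assuming the API expects a cookie that the browser only has after the first page load (which would fit the "works after refresh" behavior), is to send the cookies explicitly in the request headers. The sketch below is TypeScript with hypothetical cookie names and values and a hypothetical toCookieHeader helper; it is not the poster's actual fix.

```typescript
// Join cookie name/value pairs into a single Cookie header value.
function toCookieHeader(cookies: Record<string, string>): string {
    return Object.keys(cookies)
        .map((name) => `${name}=${cookies[name]}`)
        .join('; ');
}

// Hypothetical cookies; the real names and values would have to be copied
// from the browser's developer tools after a successful page load.
const cookieHeader = toCookieHeader({ session_id: 'abc123', consent: 'yes' });

// Request entry carrying the cookies alongside the URL.
const requestWithCookies = {
    url: 'https://someapi.com',
    headers: { cookie: cookieHeader },
};

console.log(requestWithCookies.headers.cookie);
```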