Problem with scraping a site that requires login
I have a paid actor that I rent out to customers. It is failing because of a recent anti-bot mitigation that prevents scraping past page 10 without logging in. I implemented Google login and store the session cookies in a shared key-value store for the actor to use, and this seemed to work fine. However, Google has flagged the account and its logins as a bot and has since terminated the account, so login fails and scraping fails with it. Before the Google account was terminated, the site I scrape also seemed to throttle my requests. That was without a proxy, so it might be possible to circumvent, but throttling has never been an issue with this site before.
The site offers Google, Facebook, Apple, or email login. I chose Google because email login requires a code sent to the email address every time a login is performed, which I couldn't automate.
I have been trying to resolve this for the past week and was successful until Google terminated the account.
I am using Crawlee with Playwright and run only 1 concurrent browser context so that I don't overwhelm the site or batch requests against it.
Do you have experience dealing with such anti-bot measures reliably?
9 Replies
Best practice is to reuse session IDs and not log in every time, since that's what real users do
and use multiple accounts to distribute the load
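A minimal sketch of the multiple-accounts idea, assuming each account's state is tracked in a small record. The `AccountSession` shape and the `pickAccount` and `markUsed` helpers are hypothetical names for illustration, not part of Crawlee or Playwright. The idea is to always pick the usable account that has been idle the longest:

```typescript
// Hypothetical record for one login account and its stored session.
interface AccountSession {
  accountId: string;   // assumed identifier, e.g. the login email
  lastUsedAt: number;  // Unix ms timestamp of the last run that used it
  blocked: boolean;    // set to true once the site or login provider bans it
}

// Pick the usable account that has been idle the longest, so the load is
// spread evenly and no single account produces all of the traffic.
function pickAccount(sessions: AccountSession[]): AccountSession | undefined {
  const usable = sessions.filter((s) => !s.blocked);
  if (usable.length === 0) return undefined;
  return usable.reduce((oldest, s) => (s.lastUsedAt < oldest.lastUsedAt ? s : oldest));
}

// Return a new list in which the chosen account is marked as just used.
function markUsed(sessions: AccountSession[], accountId: string, now: number): AccountSession[] {
  return sessions.map((s) => (s.accountId === accountId ? { ...s, lastUsedAt: now } : s));
}
```

How many accounts are needed still depends on the site and the load, so the pool size is left open here.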
I log in when developing the actor locally. In the cloud, the actor reads the session and login cookies from a shared named key-value store that I upload the cookies to manually, and that does seem to work. I am now trying to implement email login instead.
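Since the cookies are uploaded manually, they can go stale between uploads. Below is a rough sketch of a check to run before loading them into the browser context. The `StoredCookie` type is a local stand-in that mirrors the cookie fields Playwright uses (`expires` is Unix time in seconds, `-1` for session cookies). The helper names and the list of required cookie names are assumptions, not something the site documents:

```typescript
// Local type mirroring the cookie fields Playwright works with.
// `expires` is Unix time in seconds; -1 marks a session cookie.
interface StoredCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number;
}

// Drop cookies that have already expired; keep session cookies (expires === -1).
function dropExpired(cookies: StoredCookie[], nowSeconds: number): StoredCookie[] {
  return cookies.filter((c) => c.expires === -1 || c.expires > nowSeconds);
}

// A stored session is only worth loading if every cookie the login depends on
// is still alive. The required names are an assumption: inspect the cookies
// saved after a manual login to find the real ones.
function hasValidSession(
  cookies: StoredCookie[],
  requiredNames: string[],
  nowSeconds: number,
): boolean {
  const alive = new Set(dropExpired(cookies, nowSeconds).map((c) => c.name));
  return requiredNames.every((n) => alive.has(n));
}
```

In the actor, the array would come from the named key-value store, and the output of `dropExpired` could be passed to Playwright's `context.addCookies(...)`. If `hasValidSession` returns false, the run can fail early with a clear message instead of scraping up to page 10 and then hitting the login redirect.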
How many?
Well, that depends on the site and the load. Which site are you trying this on?
Then I can help you better
I am trying to scrape trustpilot.com
No matter what I do, even when I browse the site from my mobile phone on a mobile network, I get redirected to the login page. This confirms that I am not being detected as a bot. Rather, Trustpilot redirects any request for page 11 and beyond to the login page to prevent scraping, regardless of whether it sees a user or a bot. So I am trying to reliably implement session storage with login cookies.
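Because the redirect happens for everyone past page 10, it helps to detect it explicitly, so that an expired session is reported as such and a login page is not parsed as an empty results page. A small sketch follows. The `"/login"` path fragment and the helper name `isLoginRedirect` are assumptions, and the real redirect target should be checked in a browser first:

```typescript
// Decide whether a navigation ended on a login page instead of the page requested.
// The default "/login" hint is an assumption, not the site's confirmed login path.
function isLoginRedirect(
  requestedUrl: string,
  finalUrl: string,
  loginPathHint = "/login",
): boolean {
  const requested = new URL(requestedUrl);
  const landed = new URL(finalUrl);
  const landedOnLogin = landed.pathname.includes(loginPathHint);
  const askedForLogin = requested.pathname.includes(loginPathHint);
  // Only a redirect counts: asking for the login page on purpose is fine.
  return landedOnLogin && !askedForLogin;
}
```

In a Crawlee Playwright request handler, `request.url` and `page.url()` could be passed in. When the function returns true, the stored cookies are most likely missing or expired and need to be refreshed before retrying.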
Hello
Hey I was able to find a solution
oh, did you find a solution?
Yes
I see