graceful-blue, 3y ago

Parallel Login Scraping

Hello, I want to build a scraper that logs in to a site and then scrapes data at scale. I want to run multiple instances, and each instance should appear to come from a unique device/location by using proxies. Can you help me put together a high-level overview of how I should approach this problem?
13 Replies
graceful-blue (OP), 3y ago
The site is dynamic and requires solving px-captchas. Logging in also requires some paramTokens, which look like JWTs. I just need the cookies that are set after logging in to the site. Can this process be broken down into a one-request solution, so that I don't have to emulate human interactions to reach that endpoint? @Helper
like-gold, 3y ago
Hi there, there are a few different ways you can tackle this, but it comes down to the site's constraints. The first thing to figure out is what needs to remain the same for each of your sessions: proxy, cookies, fingerprint, etc. If you keep track of the cookies and proxies, you may be able to load them back into the session and do a "one-request solution" after the initial login. But it depends on the site.
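As an illustration of keeping track of what must stay the same per session, here is a minimal sketch that bundles one identity's proxy, user agent, and cookies into a record that can be saved and reloaded. The `SessionState` class, its field names, and the proxy URL format are all hypothetical; the only external assumption is that the dictionary returned by `request_kwargs` matches the `headers`, `cookies`, and `proxies` keyword arguments of the `requests` library.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class SessionState:
    """Everything that should stay constant for one logged-in identity (hypothetical)."""
    name: str
    proxy: str        # placeholder format, e.g. "http://user:pass@host:port"
    user_agent: str
    cookies: dict = field(default_factory=dict)

    def save(self, directory: Path) -> Path:
        """Write this session to <directory>/<name>.json and return the path."""
        directory.mkdir(parents=True, exist_ok=True)
        path = directory / f"{self.name}.json"
        path.write_text(json.dumps(asdict(self)))
        return path

    @classmethod
    def load(cls, path: Path) -> "SessionState":
        """Rebuild a session from a file written by save()."""
        return cls(**json.loads(path.read_text()))

    def request_kwargs(self) -> dict:
        """Keyword arguments shaped for requests.get / requests.post."""
        return {
            "headers": {"User-Agent": self.user_agent},
            "cookies": dict(self.cookies),
            "proxies": {"http": self.proxy, "https": self.proxy},
        }
```

After the initial login, the cookies captured from the browser or HTTP client would be stored in `cookies`, so that later requests reuse the same proxy and user agent the login was made with. Whether that is enough depends on what else the site ties to the session.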
graceful-blue (OP), 3y ago
Yes, I tried that. It works for some iterations, then gets blocked, and I have to do it again.
like-gold, 3y ago
Typically, though, the answer is yes: this is very doable and is how most sites work. If you keep track of the right session data, you should be able to use the API in the same way the website does.
What indicates a block in this case?
graceful-blue (OP), 3y ago
Probably the auth token gets changed?
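One way to test the "token changed" guess, assuming the token really is a JWT and carries the standard `exp` claim (neither is confirmed in this thread), is to decode its payload and compare the expiry time with the moment the block appears. The helper names below are hypothetical, and the sketch only reads the payload without verifying the signature.

```python
import base64
import json
import time
from typing import Optional


def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds left before the 'exp' claim, or None if there is no 'exp'."""
    exp = jwt_payload(token).get("exp")
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

If the blocks line up with the moment `seconds_until_expiry` reaches zero, expiry is a likely cause. If they do not, the block probably comes from something else.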
graceful-blue (OP), 3y ago
OK, right now the main problem is this: I want to initialize multiple scrapers at the same time so that the work gets done fast. I would go for the one-request solution after that.
like-gold, 3y ago
Well, it sounds like you need to figure out what is causing the block first.
graceful-blue (OP), 3y ago
At least 100-500 instances.
Yeah, meanwhile I will look into that.
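For running many sessions at once, a bounded worker pool is one common shape. The sketch below uses the standard library's `ThreadPoolExecutor`; `run_sessions` and the `scrape_one` callable are hypothetical names, and `scrape_one` stands in for whatever per-session work is eventually written. A failure in one session is recorded rather than stopping the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_sessions(sessions, scrape_one, max_workers=50):
    """Call scrape_one(session) for every session, at most max_workers at a time.

    Returns (results, errors), both keyed by the session's index in `sessions`.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, s): i for i, s in enumerate(sessions)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as exc:  # one blocked session should not stop the rest
                errors[i] = exc
    return results, errors
```

The `max_workers` value of 50 is only a placeholder. With hundreds of sessions, the practical limit is more likely to come from the proxies and the site than from the thread pool.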
like-gold, 3y ago
Try analyzing the site with Chrome DevTools.
graceful-blue (OP), 3y ago
I have tried, and it seems like this site is tough.
like-gold, 3y ago
Are there any network requests that may refresh the auth token / cookies?
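If such a refresh request does exist, the control flow around it can be sketched as "fetch, and if the response looks blocked, refresh once and retry". Everything here is hypothetical: `fetch`, `refresh`, and `is_blocked` are caller-supplied callables, and what counts as a block (a status code, a redirect, a captcha page) has to be determined for the site in question.

```python
def fetch_with_refresh(fetch, refresh, is_blocked, max_refreshes=1):
    """Call fetch(); while the response looks blocked, refresh and retry.

    fetch:      callable returning a response object
    refresh:    callable that renews the auth token / cookies
    is_blocked: callable(response) -> bool
    Gives up after max_refreshes attempts and returns the last response.
    """
    response = fetch()
    for _ in range(max_refreshes):
        if not is_blocked(response):
            break
        refresh()
        response = fetch()
    return response
```

Capping the number of refreshes keeps a session that is genuinely blocked from looping forever.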
graceful-blue (OP), 3y ago
probably
