graceful-blue•3y ago
Parallel Login Scraping
Hello, I want to build a scalable scraper that scrapes data from a site after logging in. I want to run multiple instances, with each instance appearing to come from a unique device/location via proxies. Can you help me visualize a high-level overview of how I should approach this problem?
13 Replies
graceful-blueOP•3y ago
The site is dynamic and requires solving px-captchas, and logging in requires some paramTokens, which seem to be JWTs. I just need to get the cookies after logging in. Any suggestions on whether this process can be broken down into a one-request solution, so that I don't have to emulate human interactions to reach that endpoint?
@Helper
like-gold•3y ago
Hi there, there are a few different ways you can tackle this, but it comes down to the site’s constraints.
The first thing to figure out is what needs to stay the same for each of your sessions: proxy, cookies, fingerprint, etc.
If you keep track of cookies and proxies, you may be able to load them into the session and do a “one-request solution” after the initial login. But it depends on the site.
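A minimal sketch of the "keep track of cookies / proxies per session" idea, using only the Python standard library. The `SessionProfile` class, its field names, and the proxy URL and cookie values in the usage example are hypothetical; which values actually need to stay constant depends on the site.

```python
import urllib.request
from dataclasses import dataclass, field
from http.cookiejar import Cookie, CookieJar


@dataclass
class SessionProfile:
    """Bundles the values that should stay constant for one logged-in identity."""
    name: str
    proxy_url: str      # assumption: one dedicated proxy per identity
    user_agent: str     # assumption: one fixed User-Agent per identity
    cookies: CookieJar = field(default_factory=CookieJar)

    def add_cookie(self, name: str, value: str, domain: str) -> None:
        """Store a cookie captured after the initial login."""
        self.cookies.set_cookie(Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=domain, domain_specified=True,
            domain_initial_dot=domain.startswith("."),
            path="/", path_specified=True,
            secure=True, expires=None, discard=True,
            comment=None, comment_url=None, rest={},
        ))

    def build_opener(self) -> urllib.request.OpenerDirector:
        """Return an opener that always uses this profile's proxy, cookies and UA."""
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler(
                {"http": self.proxy_url, "https": self.proxy_url}
            ),
            urllib.request.HTTPCookieProcessor(self.cookies),
        )
        opener.addheaders = [("User-Agent", self.user_agent)]
        return opener


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    profile = SessionProfile("worker-1", "http://127.0.0.1:8000", "ExampleAgent/1.0")
    profile.add_cookie("session_id", "abc123", "example.com")
    opener = profile.build_opener()
    print([c.name for c in profile.cookies])
```

Keeping one profile per identity means follow-up requests after login go out through the same proxy with the same cookies and headers, which is usually the first thing to hold constant when testing whether a one-request approach survives.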
graceful-blueOP•3y ago
Yes, I tried that. It works for some iterations and then gets blocked,
and I have to do it again.
like-gold•3y ago
Typically, though, the answer is yes: this is very doable, and it's how most sites work. If you keep track of the right session data, you should be able to use the API the same way the website does.
What indicates a block in this case?
graceful-blueOP•3y ago
Probably the auth token gets changed?
graceful-blueOP•3y ago
OK, right now the main problem is this: I want to initialize multiple scrapers at the same time so that the work gets done fast. I would go for the one-request solution after that.
like-gold•3y ago
Well, it sounds like you need to figure out what is causing the block first.
graceful-blueOP•3y ago
Like 100-500 instances at least.
Yeah, meanwhile I will look into that.
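One possible way to run hundreds of scraper instances at once is to cap concurrency with a semaphore, so that only a fixed number are in flight at any moment. This is only a sketch: `run_all`, `fake_scrape`, and the limit of 50 are hypothetical, and `fake_scrape` stands in for the real per-profile login and scrape.

```python
import asyncio


async def run_all(profiles, scrape_one, max_concurrency=50):
    """Run scrape_one(profile) for every profile, at most max_concurrency at a time.

    Returns a list of (profile, result, error) tuples in input order, so one
    blocked or failed session does not cancel the others.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(profile):
        async with sem:
            try:
                return profile, await scrape_one(profile), None
            except Exception as exc:  # e.g. a block or an expired token
                return profile, None, exc

    return await asyncio.gather(*(guarded(p) for p in profiles))


async def fake_scrape(profile):
    """Placeholder for the real work: log in (or reuse cookies) and fetch data."""
    await asyncio.sleep(0.01)
    return f"data-for-{profile}"


if __name__ == "__main__":
    profiles = [f"profile-{i}" for i in range(200)]
    results = asyncio.run(run_all(profiles, fake_scrape, max_concurrency=50))
    ok = sum(1 for _, _, err in results if err is None)
    print(f"{ok} of {len(results)} sessions succeeded")
```

Returning the error alongside each profile makes it easier to see which sessions get blocked and after how many requests, which helps with finding the cause of the block.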
like-gold•3y ago
Try analyzing the site in Chrome DevTools.
graceful-blueOP•3y ago
I have tried, and it seems like this site is tough.
like-gold•3y ago
Are there any network requests that may refresh the auth token / cookies?
graceful-blueOP•3y ago
probably