absent-sapphire•15mo ago
enqueueLinks not respecting strategy
Hello folks, I'm running into the issue described here https://github.com/apify/crawlee/issues/2525 using a basic CheerioCrawler. I specify the same-domain strategy, but the crawler is following links all over the internet.
Does anyone have a workaround that I can use to prevent it from going to outside domains?
7 Replies
So your problem is specifically redirects? Then you will need to discard the page in requestHandler after it has loaded, because enqueueLinks can't know what a URL will redirect to until it is actually loaded.
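A rough sketch of that workaround, in case it helps. `stayedOnSameHostname` is a hypothetical helper, not part of Crawlee, and it compares exact hostnames, which is stricter than the same-domain strategy (it treats `www.example.com` and `example.com` as different sites). It assumes you pass it the requested URL and the URL that was actually loaded.

```typescript
// Hypothetical helper (not part of Crawlee): returns true only when the URL
// that was actually loaded has the same hostname as the URL that was requested.
function stayedOnSameHostname(requestedUrl: string, loadedUrl?: string): boolean {
    // Without a loaded URL nothing is known about a redirect,
    // so fall back to the requested URL itself.
    const finalUrl = loadedUrl ?? requestedUrl;
    try {
        return new URL(requestedUrl).hostname === new URL(finalUrl).hostname;
    } catch {
        // Unparseable URLs are treated as off-site so the page gets skipped.
        return false;
    }
}
```

Inside requestHandler you could then return early, before calling enqueueLinks, whenever `stayedOnSameHostname(request.url, request.loadedUrl)` is false, so a page that was redirected off-site is thrown away instead of being scraped for more links.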
absent-sapphireOP•15mo ago
This does not only occur on redirects. While the crawler is scraping a page, if it finds an external link, it does not apply the same-domain logic.
It’s almost like enqueueLinks just switches the strategy to All from SameDomain.
I can provide some sample logs and code in the next few hours once I get to my desk.
I was mistaken. I'm sorry. It was me expanding short links which caused issues.
Hi @taintedgamer4k, regarding this issue, is everything working as expected for you now?
absent-sapphireOP•15mo ago
@Pepa J yes, it is. It was a mistake on my part in how I was handling redirects earlier in the code.
foreign-sapphire•15mo ago
Short links make everything worse. ;D Congrats on figuring it out!
absent-sapphireOP•14mo ago
Yes they do! Thanks @eaton!