absent-sapphire · 15mo ago

enqueueLinks not respecting strategy

Hello folks, I'm running into the issue described at https://github.com/apify/crawlee/issues/2525 with a basic CheerioCrawler. I specify the same-domain strategy, but the crawler is running all over the internet. Does anyone have a workaround I can use to keep it from going to outside domains?
Lukas Krivka · 15mo ago
So your problem is specifically redirects? Then you will need to throw away the page after it is loaded in requestHandler, because enqueueLinks doesn't know what the URL might redirect to once loaded.
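For anyone landing here later, a minimal sketch of that "throw away the page" check, under a few assumptions: `staysOnSite` is a hypothetical helper (not part of Crawlee), the site URL is whatever you started the crawl from, and the host comparison is a naive approximation rather than the exact logic behind the same-domain strategy.

```typescript
// Hypothetical helper, not part of Crawlee: checks whether the URL a page
// was finally loaded from still belongs to the site you want to stay on.
function staysOnSite(siteUrl: string, loadedUrl: string | undefined): boolean {
    // Without a loaded URL there is nothing to compare, so do not reject the page.
    if (!loadedUrl) return true;

    const siteHost = new URL(siteUrl).hostname;
    const loadedHost = new URL(loadedUrl).hostname;

    // Naive approximation: identical hosts, or one host is a subdomain of the other.
    // Sibling subdomains (blog.example.com vs shop.example.com) are NOT matched.
    return (
        siteHost === loadedHost ||
        loadedHost.endsWith(`.${siteHost}`) ||
        siteHost.endsWith(`.${loadedHost}`)
    );
}

// Intended use inside a requestHandler (left as a comment so this block runs on its own):
//   if (!staysOnSite('https://example.com/', request.loadedUrl ?? request.url)) return;
//   await enqueueLinks({ strategy: 'same-domain' });
```

The idea is to return early, before calling enqueueLinks, whenever the page ended up somewhere other than the site you care about. That applies to redirects and to expanded short links alike.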
absent-sapphire (OP) · 15mo ago
This does not always occur on redirects. While the crawler is scraping a page, if it finds an external link it does not apply the same-domain logic. It's almost like enqueueLinks just switches the strategy from SameDomain to All. I can provide some sample logs and code in the next few hours once I get to my desk.

I was mistaken, sorry. It was my own expansion of short links that caused the issues.
Pepa J · 15mo ago
Hi @taintedgamer4k, is everything working as expected for you now?
absent-sapphire (OP) · 15mo ago
@Pepa J yes, it is. It was a mistake on my part in how I was handling redirects earlier in the code.
foreign-sapphire · 15mo ago
Short links make everything worse. ;D Congrats on figuring it out!
absent-sapphire (OP) · 14mo ago
Yes, they do! Thanks @eaton!
