adverse-sapphire•3y ago
Exclude query parameter URLs from crawl jobs
Hello,
I'm currently researching ways to exclude URLs that contain query parameters, for example: https://domain[.]com/path?query1=test&query2=test2
I've tried hooking into the enqueueLinks options like:
However, it seems like it still matches, because these options don't exclude anything; they only match the URLs that are allowed, based on RegEx.
I'm using PlaywrightCrawler via crawlee, but I think this is something I could do across all crawler engines. Please let me know how I might achieve this, or point me to more research. Thanks Team!
2 Replies
Regexes like this are matching ones. To skip URLs instead, use the transformRequestFunction option of enqueueLinks.
https://crawlee.dev/api/core/interface/EnqueueLinksOptions#transformRequestFunction
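A minimal sketch of how that could look, assuming the goal is to skip every link whose URL carries a query string. The helper name `skipUrlsWithQuery` and the `RequestLike` interface are hypothetical and only for illustration; Crawlee's real request options object has more fields than `url`. It relies on transformRequestFunction dropping a request when the function returns `false`.

```typescript
// Minimal stand-in for the request options object passed to
// transformRequestFunction (assumption: only `url` is needed here).
interface RequestLike {
    url: string;
}

// Returns false for URLs that have a non-empty query string, which tells
// enqueueLinks to skip them; otherwise returns the request unchanged.
function skipUrlsWithQuery<T extends RequestLike>(request: T): T | false {
    const { search } = new URL(request.url);
    return search.length > 0 ? false : request;
}

// Hypothetical usage inside a crawler's requestHandler:
//
// await enqueueLinks({
//     transformRequestFunction: skipUrlsWithQuery,
// });
```

Since the check only looks at the URL string, the same function should be usable with any crawler class that exposes enqueueLinks.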
adverse-sapphireOP•3y ago
Thanks Lukas! I'll try this out 🙂
Thanks again @Lukas Krivka, I've tested this and it's working as I needed 🥳