Exclude query parameter URLs from crawl jobs

Hello,

I'm currently researching ways to exclude URLs that contain query parameters, for example: https://domain[.]com/path?query1=test&query2=test2

I've tried hooking into the enqueueLinks options like:

await enqueueLinks({ regexps: [ new RegExp('^'+[websiteURL]+'[^?]+') ]});

However, URLs with query parameters still seem to get matched. I suspect this is because the regexps option doesn't exclude anything; it only allows URLs that match the given pattern.

I'm using PlaywrightCrawler via Crawlee, but I think this is something I should be able to do across all crawler engines. Please let me know how I might achieve this, or point me to further reading. Thanks, team!
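
For reference, here is a minimal sketch of the patterns in question, not a confirmed fix. The pattern above has no trailing `$`, so it can match just the part of the URL before the `?`, which would explain why such URLs still pass. The sketch assumes a hypothetical base URL (`https://example.com`) in place of websiteURL, and the helper names (`escapeRegExp`, `noQueryPattern`, `hasQueryPattern`, `hasQuery`) are made up for illustration. The commented-out calls assume the `exclude` and `transformRequestFunction` options of enqueueLinks in Crawlee v3.

```javascript
// Hypothetical base URL standing in for websiteURL from the post.
const websiteURL = 'https://example.com';

// Escape regex metacharacters so the dots in the host are matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Allow-list pattern: the trailing `$` forces the whole URL to be free of `?`.
const noQueryPattern = new RegExp('^' + escapeRegExp(websiteURL) + '[^?]*$');

// Exclusion pattern: any URL that contains a `?`.
const hasQueryPattern = /\?/;

// Alternative check using the standard URL API
// (`search` is an empty string when there is no query string).
const hasQuery = (url) => new URL(url).search !== '';

// Inside a Crawlee requestHandler, any one of these could be tried:
//   await enqueueLinks({ regexps: [noQueryPattern] });
//   await enqueueLinks({ exclude: [hasQueryPattern] });
//   await enqueueLinks({
//     transformRequestFunction: (req) => (hasQuery(req.url) ? false : req),
//   });
```

The `exclude` variant is probably the closest to the original intent, since it filters links out rather than listing what is allowed.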