adverse-sapphire•3y ago

Exclude query parameter URLs from crawl jobs

Hello, I'm currently researching ways to exclude URLs that contain query parameters, for example: https://domain[.]com/path?query1=test&query2=test2. I've tried hooking into the enqueueLinks options like this:
await enqueueLinks({ regexps: [ new RegExp('^'+[websiteURL]+'[^?]+') ]});
However, it still seems to match, because the regexps option doesn't exclude anything; it only matches the URLs that are allowed. I'm using PlaywrightCrawler via Crawlee, but I think this is something I could do across all crawler engines. Please let me know how I might achieve this, or point me to further reading. Thanks, team!
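One likely reason the pattern above still lets query URLs through: it is anchored at the start with `^` but not at the end, so it can match just the part of the URL before the `?`. The sketch below shows the difference using plain `RegExp.test`. The `websiteURL` value is a made-up placeholder, and the `anchored` variant is only an illustration, not something confirmed in this thread.

```typescript
// Hypothetical site root, standing in for the websiteURL variable in the question.
const websiteURL = 'https://domain.com';

// Same shape as the pattern in the question: no end anchor.
const unanchored = new RegExp('^' + websiteURL + '[^?]+');
// Illustrative variant: `$` forces the whole URL to be free of '?'.
const anchored = new RegExp('^' + websiteURL + '[^?]+$');

const withQuery = 'https://domain.com/path?query1=test&query2=test2';
const withoutQuery = 'https://domain.com/path';

// The unanchored pattern matches the prefix "https://domain.com/path",
// so a URL with a query string still counts as a match.
const unanchoredMatchesQuery = unanchored.test(withQuery);
// The anchored pattern rejects the query URL but accepts the plain one.
const anchoredMatchesQuery = anchored.test(withQuery);
const anchoredMatchesPlain = anchored.test(withoutQuery);

console.log({ unanchoredMatchesQuery, anchoredMatchesQuery, anchoredMatchesPlain });
```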
2 Replies
Lukas Krivka•3y ago
Regexes passed like this are matching (include) patterns. To skip URLs instead, use the transformRequestFunction option of enqueueLinks: https://crawlee.dev/api/core/interface/EnqueueLinksOptions#transformRequestFunction
transformRequestFunction: (request) => {
    if (request.url.match(mySkipRegex)) {
        return null;
    }
    return request;
}
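For the query-parameter case from the question, the skip check could look something like the sketch below. It is kept free of Crawlee imports so it runs on its own. `RequestLike` and `skipUrlsWithQuery` are hypothetical names introduced here, and only the `url` field of the request is assumed.

```typescript
// Minimal stand-in for the object passed to transformRequestFunction;
// only `url` is assumed here (hypothetical type, not Crawlee's own).
interface RequestLike {
    url: string;
}

// Hypothetical helper: returns null for URLs that carry a non-empty query
// string (which, per the reply above, skips the request) and otherwise
// passes the request through unchanged.
function skipUrlsWithQuery<T extends RequestLike>(request: T): T | null {
    // Drop any fragment first, so a '?' inside "#..." is not treated as a query.
    const withoutFragment = request.url.split('#')[0];
    const queryStart = withoutFragment.indexOf('?');
    // A query is present when '?' exists and something follows it.
    const hasQuery = queryStart !== -1 && queryStart < withoutFragment.length - 1;
    return hasQuery ? null : request;
}

// Quick check against the example URL from the question.
console.log(skipUrlsWithQuery({ url: 'https://domain.com/path?query1=test&query2=test2' }));
console.log(skipUrlsWithQuery({ url: 'https://domain.com/path' }));
```

It would then be passed as `await enqueueLinks({ transformRequestFunction: skipUrlsWithQuery });`, assuming the call site from the question.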
adverse-sapphireOP•3y ago
Thanks Lukas! I'll try this out 🙂
Thanks again @Lukas Krivka, I've tested this and it's working as I needed 🥳
