I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path, not the full URL.
To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com, if it finds a URL such as www.example.com/team, I want that URL to be queued.
When crawling www.my-team.com, it would match every URL on that website because "team" is part of the hostname. That is not the behaviour I want.
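To make the intended behaviour concrete, here is a minimal sketch of the check I have in mind. The names KEYWORD_PATTERN and path_matches are only illustrative and not part of my real handler, and the sketch assumes every URL is absolute (it includes a scheme such as https://), since urlparse would otherwise treat the hostname as part of the path.

```python
import re
from urllib.parse import urlparse

# Illustrative keyword pattern based on the example above ("team" or "about").
KEYWORD_PATTERN = re.compile(r'team|about', re.IGNORECASE)


def path_matches(url: str) -> bool:
    """Return True if a keyword occurs in the URL path, ignoring the hostname."""
    # Assumes an absolute URL; without a scheme, urlparse puts the host into .path.
    path = urlparse(url).path
    return KEYWORD_PATTERN.search(path) is not None
```

With this check, https://www.example.com/team should be accepted, while https://www.my-team.com/products should be rejected because the keyword only appears in the hostname.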
I thought of using a pre_navigation_hook and checking the path again there with the following code, but I don't think it's possible to cancel a request that is already queued?
parsed_url = urlparse(context.request.url)
path_name = parsed_url.path
results = _get_regex_matches(path_name)
if not results:
    context.log.info(
        f'No match found for URL: {context.request.url} in path: '
        f'{path_name}'
    )
    # TODO: CANCEL REQUEST
In the docs I found something like this:
await request_list.mark_request_as_handled(request)
but I don't think I have access to a request_list or anything similar in the
PlaywrightPreNavCrawlingContext.
It would be great if someone can point me in the right direction!