Apify Discord Mirror

Enqueue_links only on match in url path? Cancel request in pre_navigation_hook?

At a glance
The community member has set up a handler that only enqueues links matching certain keywords. The issue is that the code checks the full URL, while they want it to check only the URL path. They provided examples to illustrate the problem and suggested using a pre_navigation_hook to re-check the URL path and potentially cancel the request, but they are unsure whether a request that is already queued can be cancelled. In the comments, another community member suggests two solutions:

1. Using a selector to enqueue only links that contain "changelog" or "quick-start" in the URL.

2. Manually checking the links and adding only those that contain "changelog" or "quick-start" to the next requests.

The commenter also mentions that a PR is in the works that will allow this behavior to be customized easily.
I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path and not the full URL.

To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com and it finds a URL like www.example.com/team, I want that URL to be enqueued.
When crawling www.my-team.com, it would match every URL on that website because "team" is part of the domain, and that is not the behaviour I want.
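
To illustrate, a quick standalone check with urlparse (not part of my crawler) shows the difference between matching the full URL and matching only the path:
Plain Text
from urllib.parse import urlparse

for url in ('https://www.example.com/team', 'https://www.my-team.com/pricing'):
    # Compare a naive full-URL match against a path-only match.
    print(url, 'team' in url, 'team' in urlparse(url).path)

# https://www.example.com/team True True
# https://www.my-team.com/pricing True False  <- only the full-URL check matches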

I thought of using a pre_navigation_hook and checking again there with the following code, but I don't think it's possible to cancel a request that is already queued?
Plain Text
from urllib.parse import urlparse

@crawler.pre_navigation_hook
async def navigation_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    parsed_url = urlparse(context.request.url)
    path_name = parsed_url.path

    # _get_regex_matches is my own helper that matches my keywords
    # against the given string.
    results = _get_regex_matches(path_name)

    if not results:
        context.log.info(
            f'No match found for URL: {context.request.url} in path: '
            f'{path_name}'
        )
        # TODO: CANCEL REQUEST


In the docs I found something like await request_list.mark_request_as_handled(request), but I don't think I have access to a request_list or anything similar in the PlaywrightPreNavCrawlingContext.

It would be great if someone could point me in the right direction!
Marked as solution
Hey @ROYOSTI

A PR is now in the works that will allow you to easily customize this behavior - https://github.com/apify/crawlee-python/pull/923

Prior to its release, there are several ways to solve it.

  1. You can try setting up a selector that selects only the links you need
Plain Text
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Enqueue only links whose href contains "changelog" or "quick-start".
        await context.enqueue_links(selector='a[href*="changelog"], a[href*="quick-start"]')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
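One caveat: an attribute selector like a[href*="changelog"] matches the substring anywhere in the href value, so on sites that render absolute URLs it can run into the same domain-vs-path pitfall from the question; with relative hrefs it effectively filters on the path.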


  2. You do not necessarily need to use enqueue_links
Plain Text
import asyncio

from yarl import URL

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        next_requests = []
        for link in context.parsed_content.select('a'):
            href = link.get('href')
            # Guard against anchors without an href before matching keywords.
            if href and ('changelog' in href or 'quick-start' in href):
                # Resolve relative links against the current page URL.
                url = URL(context.request.url).join(URL(href))
                next_requests.append(str(url))
        await context.add_requests(next_requests)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
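The URL(...).join(...) call resolves relative hrefs against the current page URL, much like enqueue_links does internally, so both relative and absolute links end up queued as absolute URLs.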
I noticed you're using Playwright. You can use page.route so you don't have to make a real network request for the URLs you want to skip.
Plain Text
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext, PlaywrightPreNavCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # The pre-navigation hook below fulfills filtered-out requests with a
        # '||skip||' sentinel body, which we detect here to skip processing.
        page_content = await context.page.content()
        if '||skip||' in page_content:
            context.log.info(f'Skip {context.request.url} ...')
            return

        await context.enqueue_links()

    @crawler.pre_navigation_hook
    async def navigation_hook(context: PlaywrightPreNavCrawlingContext) -> None:
        if context.request.url == 'https://crawlee.dev/':
            return
        if 'changelog' not in context.request.url and 'quick-start' not in context.request.url:
            # Intercept the navigation and fulfill it locally with the
            # sentinel body, so no real network request is made.
            await context.page.route(
                context.request.url,
                lambda route, _: route.fulfill(
                    status=200,
                    body=b'||skip||',
                ),
            )

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
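Note that this hook still checks the full URL, so it has the same domain-vs-path caveat as the original question. If you need path-only matching, the same check can run on the parsed path instead; a sketch of just the hook, assuming the crawler setup above stays the same:
Plain Text
from urllib.parse import urlparse

@crawler.pre_navigation_hook
async def navigation_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    if context.request.url == 'https://crawlee.dev/':
        return
    # Match keywords against the URL path only, not the whole URL.
    path = urlparse(context.request.url).path
    if 'changelog' not in path and 'quick-start' not in path:
        # Fulfill with the sentinel body instead of hitting the network.
        await context.page.route(
            context.request.url,
            lambda route, _: route.fulfill(status=200, body=b'||skip||'),
        )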