ROYOSTI

I run Crawlee in a Docker container, and that container is used in a Jenkins task.
When starting the crawler, I receive the following error:

    Browser logs:
      Chromium sandboxing failed!
      ================================
      To avoid the sandboxing issue, do either of the following:
        - (preferred): Configure your environment to support sandboxing
        - (alternative): Launch Chromium without sandbox using 'chromiumSandbox: false' option
      ================================

The full error log can be found in the attachment.
This error has only occurred since upgrading crawlee[playwright] to 0.5.2.

What are the advantages/disadvantages of launching Chromium without sandbox? How could I configure my environment to support sandboxing?

I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path, not the full URL.

To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com, if it finds a URL such as www.example.com/team, I want that URL to be enqueued.
When crawling www.my-team.com, however, every URL on that site would match because "team" is part of the domain name. That is not the behaviour I want.
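To make the desired behaviour concrete, here is a minimal sketch of path-only matching using just the standard library. The names KEYWORDS, path_matches and filter_by_path are hypothetical and not part of Crawlee; the sketch only assumes that the candidate URLs are available as plain strings before they are enqueued.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list, taken from the example above.
KEYWORDS = ("team", "about")
PATH_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def path_matches(url: str) -> bool:
    """Return True when a keyword occurs in the URL path only.

    The host name and the query string are ignored.
    """
    # Without a scheme, urlparse treats the host as part of the path,
    # so prefix scheme-less URLs with "//" to mark the host explicitly.
    parsed = urlparse(url if "://" in url else f"//{url}")
    return bool(PATH_PATTERN.search(parsed.path))


def filter_by_path(urls: list[str]) -> list[str]:
    """Keep only the URLs whose path contains one of the keywords."""
    return [url for url in urls if path_matches(url)]
```

With this check, www.example.com/team is kept, while www.my-team.com/contact is dropped because "team" appears only in the host name. Filtering the list before enqueuing would avoid having to cancel a request afterwards.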

I thought of using a pre_navigation_hook and checking the URL again there with the following code, but I don't think it's possible to cancel a request that is already queued?

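    # Requires: from urllib.parse import urlparse
    # Extract only the path component so the host name is not matched.
    parsed_url = urlparse(context.request.url)
    path_name = parsed_url.path

    # _get_regex_matches is my own helper that searches for the keywords.
    results = _get_regex_matches(path_name)

    if not results:
        context.log.info(
            f'No match found for URL: {context.request.url} in path: '
            f'{path_name}'
        )
        # TODO: CANCEL REQUEST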

In the docs I found something like await request_list.mark_request_as_handled(request), but I don't think I have access to a request_list or anything similar in the PlaywrightPreNavCrawlingContext.

It would be great if someone can point me in the right direction!

Apify Discord Mirror

Chromium sandboxing failed

Enqueue_links only on match in url path? Cancel request in pre_navigation_hook?