Apify Discord Mirror

DuxSec
Solved

Double log output

In main.py logging works as expected; in routes.py, however, every message is printed twice for some reason.
I did not set up any custom logging, I just use
Actor.log.info("STARTING A NEW CRAWL JOB")

example:
Plain Text
[apify] INFO  Checking item 17
[apify] INFO  Checking item 17 ({"message": "Checking item 17"})
[apify] INFO  Processing new item with index: 17
[apify] INFO  Processing new item with index: 17 ({"message": "Processing new item with index: 17"})


If I add the logging setup from the SDK docs (https://docs.apify.com/sdk/python/docs/concepts/logging) to my main.py:
Plain Text
import logging

from apify import Actor
from apify.log import ActorLogFormatter

async def main() -> None:
    async with Actor:
        ##### SETUP LOGGING #####
        handler = logging.StreamHandler()
        handler.setFormatter(ActorLogFormatter())

        apify_logger = logging.getLogger('apify')
        apify_logger.setLevel(logging.DEBUG)
        apify_logger.addHandler(handler)

then everything from main.py is printed twice, and everything from routes.py three times.

Plain Text
[apify] INFO  STARTING A NEW CRAWL JOB
[apify] INFO  STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
[apify] INFO  STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
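A likely cause (an assumption, not confirmed in the thread) is handler duplication: the Actor already attaches a handler to the 'apify' logger, so the extra StreamHandler above emits a second copy of each record, and records that also propagate to a configured root logger add a third. A minimal sketch of one way to avoid this, assuming that is what happens:

Plain Text
import logging

from apify.log import ActorLogFormatter

handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.handlers.clear()   # drop any handler attached earlier, e.g. by the SDK
apify_logger.propagate = False  # keep records from also reaching the root logger
apify_logger.addHandler(handler)

With exactly one handler on the logger and propagation switched off, each record can only be emitted once.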
11 comments
A scraper that I am developing scrapes a SPA with infinite scrolling. This works fine, but after 300 seconds I get a WARN, which spawns another Playwright instance.
This probably happens because I only handle one request (I do not add anything to the RequestQueue), inside which I just loop until a finished condition is met.

Plain Text
[crawlee.storages._request_queue] WARN  The request queue seems to be stuck for 300.0s, resetting internal state. ({"queue_head_ids_pending": 0, "in_progress": ["tEyKIytjmqjtRvA"]})


What is a clean way to stop this from happening?
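One way to avoid the warning (a sketch under assumptions, not an answer from the thread) is to stop running a single handler for minutes: scroll and scrape a bounded chunk per handler call, keep progress in user_data, and re-enqueue the same URL under a fresh unique_key, so no single request stays in progress for 300 s. The names page_no and finished are illustrative, and the imports assume a recent crawlee release (older ones exposed these under crawlee.playwright_crawler):

Plain Text
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()

@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    page_no = context.request.user_data.get('page_no', 0)

    # ... scroll once and scrape the newly loaded items here ...
    finished = False  # illustrative stop condition

    if not finished:
        await context.add_requests([
            Request.from_url(
                context.request.url,
                # A fresh unique_key keeps the queue from deduplicating the URL.
                unique_key=f'scroll-{page_no + 1}',
                user_data={'page_no': page_no + 1},
            )
        ])

Each handler call then finishes quickly, so the queue keeps seeing progress instead of one request that appears stuck.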
3 comments
If I use multiple files, what is the best way to pass data (user input, which contains 'max_results' or similar) to my routes.py?

Example snippet from main.py:
Plain Text
        max_results = 5 # example

        crawler = PlaywrightCrawler(
            headless=False, 
            request_handler=router,
        )
        await crawler.run([start_url])


Snippet from routes.py:
Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    max_results = ???
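One common pattern (a sketch, not the only option) is to attach the input to the start request as user_data in main.py and read it back from context.request.user_data in the handler:

Plain Text
# main.py
from crawlee import Request

max_results = 5  # example, e.g. taken from the Actor input

await crawler.run([
    Request.from_url(start_url, user_data={'max_results': max_results})
])

# routes.py
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    max_results = context.request.user_data.get('max_results', 10)

This keeps routes.py free of globals, since every handler receives the value through the request it is processing.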
3 comments