[nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js'
Browser logs:
Chromium sandboxing failed!
================================
To avoid the sandboxing issue, do either of the following:
- (preferred): Configure your environment to support sandboxing
- (alternative): Launch Chromium without sandbox using 'chromiumSandbox: false' option
================================
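If the environment cannot be fixed to support sandboxing, the 'chromiumSandbox: false' route is just a browser launch option. A minimal sketch of the equivalent knob in Playwright for Python (this thread mixes Node and Python tooling, so treat the exact wiring as an assumption; in Puppeteer the usual workaround is passing --no-sandbox in the launch args):

import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        # Disabling the sandbox works around locked-down containers;
        # prefer configuring the environment so sandboxing can stay on.
        browser = await p.chromium.launch(chromium_sandbox=False)
        page = await browser.new_page()
        await page.goto('https://example.com')
        print(await page.title())
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())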
crawlee[playwright] to 0.5.2
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
[crawlee.storages._request_queue] DEBUG There are still ids in the queue head that are pending processing ({"queue_head_ids_pending": 1})
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded
import asyncio, httpx
from apify import Actor
import dotenv


async def main():
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration(
            password=dotenv.get_key('.env', 'APIFY_PROXY_PASSWORD'),
        )
        proxy_url = await proxy_configuration.new_url()
        proxies = {
            'http': proxy_url,
            'https': proxy_url,
        }
        async with httpx.AsyncClient(proxy=proxy_url) as client:
            for _ in range(3):
                response = await client.get('https://httpbin.org/ip')
                if response.status_code == 200:
                    print(response.json())
                elif response:
                    print(response.text)


if __name__ == '__main__':
    asyncio.run(main())
    raise mapped_exc(message) from exc
httpx.ReadTimeout
[apify] INFO Exiting Actor ({"exit_code": 91})
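The traceback ends in httpx.ReadTimeout, i.e. httpx gave up waiting for the response, which is common when going through a proxy with httpx's default 5-second timeout. A minimal sketch of a more forgiving client with a simple retry on timeout; the proxy URL below is a placeholder for the one obtained from Actor.create_proxy_configuration():

import asyncio

import httpx

# Placeholder; in the Actor this would be `await proxy_configuration.new_url()`.
PROXY_URL = 'http://username:password@proxy.example.com:8000'


async def fetch_ip(retries: int = 3) -> None:
    # Allow slow proxied responses more time than the default 5 s.
    timeout = httpx.Timeout(30.0, connect=10.0)
    async with httpx.AsyncClient(proxy=PROXY_URL, timeout=timeout) as client:
        for attempt in range(1, retries + 1):
            try:
                response = await client.get('https://httpbin.org/ip')
                response.raise_for_status()
                print(response.json())
                return
            except httpx.ReadTimeout:
                print(f'Read timeout on attempt {attempt}/{retries}, retrying...')


if __name__ == '__main__':
    asyncio.run(fetch_ip())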
router.addDefaultHandler(async ({ request, enqueueLinks, parseWithCheerio, querySelector, log, page }) => {
await enqueueLinks({
strategy: 'same-domain',
globs: globs,
transformRequestFunction: (request) => {
return request;
},
});
});
parsed_url = urlparse(context.request.url)
path_name = parsed_url.path
results = _get_regex_matches(path_name)
if not results:
    context.log.info(
        f'No match found for URL: {context.request.url} in path: '
        f'{path_name}'
    )
    # TODO: CANCEL REQUEST
await request_list.mark_request_as_handled(request)
but I don't think I have access to a request_list or anything similar in the PlaywrightPreNavCrawlingContext.
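One way around it that needs no access to the queue or a request list is to do the path check in the request handler and return early when nothing matches: the request is still marked as handled by the crawler, but no scraping work is done for it. A minimal sketch, assuming a crawlee-python PlaywrightCrawler router (import paths vary between crawlee versions) and a hypothetical _get_regex_matches helper:

import re
from urllib.parse import urlparse

from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()


# Hypothetical helper: adjust the pattern to whatever the real matcher expects.
def _get_regex_matches(path_name: str) -> list[str]:
    return re.findall(r'/product/\d+', path_name)


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    path_name = urlparse(context.request.url).path
    if not _get_regex_matches(path_name):
        context.log.info(f'No match for {context.request.url}, skipping')
        return  # Early return effectively "cancels" further processing of this request.
    # ... normal scraping logic for matching URLs ...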
In main.py, logging works as expected; however, in routes.py logging is printed twice for some reason.
Actor.log.info("STARTING A NEW CRAWL JOB")
[apify] INFO Checking item 17
[apify] INFO Checking item 17 ({"message": "Checking item 17"})
[apify] INFO Processing new item with index: 17
[apify] INFO Processing new item with index: 17 ({"message": "Processing new item with index: 17"})
main.py
(https://docs.apify.com/sdk/python/docs/concepts/logging)

async def main() -> None:
    async with Actor:
        ##### SETUP LOGGING #####
        handler = logging.StreamHandler()
        handler.setFormatter(ActorLogFormatter())
        apify_logger = logging.getLogger('apify')
        apify_logger.setLevel(logging.DEBUG)
        apify_logger.addHandler(handler)
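Duplicate log lines like the ones above usually mean the 'apify' logger ends up with more than one handler: one is already attached (by the platform or an earlier import), and this setup block adds another, so every record is emitted once per handler; running the same setup again from routes.py adds a third. A minimal sketch of a guard using only the standard logging module, assuming the ActorLogFormatter import from the SDK docs; call it exactly once, from main.py:

import logging

from apify.log import ActorLogFormatter


def setup_apify_logging() -> None:
    apify_logger = logging.getLogger('apify')
    apify_logger.setLevel(logging.DEBUG)
    # Attach our handler only if none is configured yet; otherwise each
    # record is printed once per handler, which looks like duplicate logs.
    if not apify_logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(ActorLogFormatter())
        apify_logger.addHandler(handler)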
Everything from main.py is now logged 2x, and everything from routes.py 3x.
[apify] INFO STARTING A NEW CRAWL JOB
[apify] INFO STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
[apify] INFO STARTING A NEW CRAWL JOB ({"message": "STARTING A NEW CRAWL JOB"})
WARN, which spawns another playwright instance, before the finished condition is met.
[crawlee.storages._request_queue] WARN The request queue seems to be stuck for 300.0s, resetting internal state. ({"queue_head_ids_pending": 0, "in_progress": ["tEyKIytjmqjtRvA"]})
max_results = 5  # example

crawler = PlaywrightCrawler(
    headless=False,
    request_handler=router,
)
await crawler.run([start_url])
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    max_results = ???
router = Router[BeautifulSoupCrawlingContext]()
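Regarding the max_results = ??? question above: the handler only receives the crawling context, so the value has to be brought into scope some other way. One dependency-free option is to build the router inside a factory function so the handler closes over the value; if the goal is simply to stop after N requests, the crawler's max_requests_per_crawl option may already be enough. A minimal sketch of the closure approach, reusing the Router/PlaywrightCrawlingContext names from the snippets above (import paths vary between crawlee versions):

from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router


def build_router(max_results: int) -> Router[PlaywrightCrawlingContext]:
    router = Router[PlaywrightCrawlingContext]()

    @router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # max_results is captured from the enclosing scope.
        context.log.info(f'{context.request.url} (max_results={max_results})')
        # ... scraping logic that respects max_results ...

    return router


# Hypothetical wiring in main.py:
# crawler = PlaywrightCrawler(request_handler=build_router(max_results=5))
# await crawler.run([start_url])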