Cena Ashoori (2d ago)

Crawler becomes slower as time goes on

Hello guys, thanks for your great tools. I have a problem with Crawlee: it works well when I start the crawler, but when my VPN has a problem and I switch my config, the crawler won't continue and I have to restart it. Are there any timeout fields to manage the maximum time each request can take? Also, sometimes it becomes slower for no reason and never gets back to the same speed (RPM) it had at the beginning.
from datetime import timedelta

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler
from crawlee.storages import RequestQueue

rq = await RequestQueue.open(name="urls/mwm")

concurrency_settings = ConcurrencySettings(
    desired_concurrency=1,
    min_concurrency=1,
    max_concurrency=7,
)
crawler = PlaywrightCrawler(
    max_request_retries=50,
    # browser_type="firefox",
    browser_type="chromium",
    user_data_dir="./session/mwm",
    headless=True,
    request_handler=router,
    request_manager=rq,
    concurrency_settings=concurrency_settings,
    browser_launch_options={
        "args": [
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-web-security",
            "--disable-extensions",
        ]
    },
    request_handler_timeout=timedelta(seconds=90),
)
4 Replies
Mantisus (2d ago)
Hey @Cena Ashoori
request_handler_timeout limits the time that will be spent on the request and its processing. However, switching the VPN you are using while Playwright is running can cause problems directly with Playwright.
Exp (2d ago)
Hi, refer to this code:
crawler = PlaywrightCrawler(
    max_request_retries=5,
    browser_type="chromium",
    headless=True,
    request_handler=router,
    request_manager=rq,
    concurrency_settings=ConcurrencySettings(
        desired_concurrency=7,
        min_concurrency=1,
        max_concurrency=7,
    ),
    request_handler_timeout=timedelta(seconds=90),
    navigation_timeout_secs=60,
    request_timeout_secs=120,
    browser_launch_options={
        "args": [
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-web-security",
            "--disable-extensions",
        ]
    },
)
Cena Ashoori (OP, 2d ago)
These fields no longer exist: navigation_timeout_secs=60, request_timeout_secs=120
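Parameter names have changed across Crawlee releases, so it helps to check what your installed version actually accepts before guessing. A minimal sketch using the standard-library inspect module; `make_crawler` below is a hypothetical stand-in, not the real constructor:

```python
import inspect

def make_crawler(*, max_request_retries=3, request_handler_timeout=None):
    """Hypothetical stand-in for a crawler constructor."""

print(sorted(inspect.signature(make_crawler).parameters))
# ['max_request_retries', 'request_handler_timeout']
```

Running the same call on your installed class, e.g. `inspect.signature(PlaywrightCrawler.__init__)`, shows whether a timeout keyword (or a renamed successor) exists in your version.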
Exp (2d ago)
If so, replace navigation_timeout_secs with the following:
goto_timeout_secs=60,
