Eric · 3w ago

Whole crawler dies because "failed to lookup address information: Name or service not known"

I am not able to reproduce it in a simple example (it may be a transient error), but I have gotten this error regularly and it kills the crawler completely.
Traceback:
  File "crawlee/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function
    if not (await self._is_allowed_based_on_robots_txt_file(request.url)):
  File "crawlee/crawlers/_basic/_basic_crawler.py", line 1566, in _is_allowed_based_on_robots_txt_file
    robots_txt_file = await self._get_robots_txt_file_for_url(url)
  File "crawlee/crawlers/_basic/_basic_crawler.py", line 1589, in _get_robots_txt_file_for_url
    robots_txt_file = await self._find_txt_file_for_url(url)
  File "crawlee/crawlers/_basic/_basic_crawler.py", line 1599, in _find_txt_file_for_url
    return await RobotsTxtFile.find(url, self._http_client)
  File "crawlee/_utils/robots.py", line 48, in find
    return await cls.load(str(robots_url), http_client, proxy_info)
  File "crawlee/_utils/robots.py", line 59, in load
    response = await http_client.send_request(url, proxy_info=proxy_info)
  File "crawlee/http_clients/_impit.py", line 167, in send_request
    response = await client.request(
impit.ConnectError: Failed to connect to the server.
Reason: hyper_util::client::legacy::Error(
    Connect,
    ConnectError(
        "dns error",
        Custom {
            kind: Uncategorized,
            error: "failed to lookup address information: Name or service not known",
        },
    ),
)
exited with code 1
This is my crawler:
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    playwright_crawler_specific_kwargs={
        "browser_type": "firefox",
        "headless": True,
    },
    max_session_rotations=10,
    retry_on_blocked=True,
    max_request_retries=5,
    keep_alive=True,
    respect_robots_txt_file=True,
)
I am on version 1.0.4 and I was crawling crawlee.dev (though it doesn't fail on a specific page).
5 Replies
Eric (OP) · 3w ago
I think it is related to the new release because I had not seen this error before upgrading to 1.0.4 (from 1.0.3)
Exp · 3w ago
This error specifically shows up while Crawlee tries to download the robots.txt file. You can try setting
respect_robots_txt_file=False
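For example, applied to the config from above (a minimal sketch; the import path is the one used by recent crawlee versions and may differ on yours):

# Workaround sketch: same setup as above, but with robots.txt handling turned
# off so a failed robots.txt lookup cannot crash the run.
from crawlee.crawlers import AdaptivePlaywrightCrawler  # import path assumed

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    playwright_crawler_specific_kwargs={
        "browser_type": "firefox",
        "headless": True,
    },
    max_session_rotations=10,
    retry_on_blocked=True,
    max_request_retries=5,
    keep_alive=True,
    respect_robots_txt_file=False,  # disable until the robots.txt fix ships
)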
Eric (OP) · 3w ago
Thanks, I tried it and you are right, the error doesn't appear. I would like to respect robots.txt, though...
Mantisus · 3w ago
Thank you for bringing this to our attention. This is a bug, and we will aim to fix it in the next release.
Vlada Dusek · 3w ago
GitHub: fix: Improve error handling for RobotsTxtFile.load by Mantisus ·...
Description: This PR adds error handling for RobotsTxtFile.load. This prevents crawler failures related to network errors, DNS errors for non-existent domains (e.g., https://placeholder.com/), or u...
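The idea of the fix, roughly, is to catch network-level errors while fetching robots.txt and fall back to an allow-all policy instead of letting the exception propagate and kill the crawler. Below is a standalone sketch of that approach, not the actual PR code; httpx and urllib.robotparser appear here purely for illustration, while the real change lives inside crawlee's RobotsTxtFile.load and uses crawlee's own HTTP client and parser.

# Standalone illustration of the approach: treat robots.txt fetch failures
# (DNS errors, connection failures, timeouts) as "allow everything" rather
# than raising and aborting the whole crawl.
import urllib.robotparser

import httpx


async def fetch_robots_or_allow_all(robots_url: str) -> urllib.robotparser.RobotFileParser:
    parser = urllib.robotparser.RobotFileParser(robots_url)
    try:
        async with httpx.AsyncClient(timeout=10) as client:
            response = await client.get(robots_url)
        if response.status_code >= 400:
            # Missing or inaccessible robots.txt: nothing is disallowed.
            parser.allow_all = True
        else:
            parser.parse(response.text.splitlines())
    except httpx.HTTPError:
        # Network/DNS failures must not kill the crawl either.
        parser.allow_all = True
    return parser

The fallback behavior in the except branch is the relevant part: the crawler keeps running and simply treats the site as unrestricted when robots.txt cannot be fetched.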
