Apify Discord Mirror

Updated 4 months ago

robots.txt Compatibility

At a glance

A community member's Apify actor pulls data from a website even though the robots.txt setting is "TRUE". When tested on their own server, the actor complies with robots.txt rules. They ask whether Apify automatically follows robots.txt rules, whether it can be set manually, and note that they haven't found any documentation on this.

In the comments, another community member explains that Apify does not enforce robots.txt rules by default: Apify aims to provide flexibility for web scraping and automation, and some use cases may require bypassing these rules (within legal and ethical bounds). So even if the robots.txt setting is "TRUE", it is not enforced automatically unless explicitly handled in the code.

They suggest enforcing robots.txt rules manually by adding logic to the actor, for example using a Node.js library such as robots-txt-guard to parse robots.txt and respect its restrictions before pulling data from a website. The basic approach: parse the robots.txt file, check whether the actor is allowed to scrape specific endpoints, and proceed based on the result of that check.

Hi guys 👋, my Apify actor can pull data from the website even though the robots.txt setting is "TRUE". When I test it on my own server, it complies with robots.txt rules. Doesn't Apify automatically follow robots.txt rules? Can't we set it manually? I haven't found any documentation on this.
1 comment
Apify does not automatically enforce robots.txt rules by default. This is because Apify focuses on providing flexibility for web scraping and automation, and some use cases may require bypassing these rules (within the bounds of legality and ethics). Therefore, even if the robots.txt setting is "TRUE," it might not be enforced automatically unless explicitly handled in your code.

You can manually enforce robots.txt rules by adding logic to your actor. For example, you can use libraries like robots-txt-guard in Node.js to parse and respect robots.txt restrictions before pulling data from a website.

Here's a basic approach (see the sketch after this list):

  • Parse the robots.txt file from the target site.
  • Check whether your actor is allowed to scrape specific endpoints.
  • Proceed based on the result of the check.
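A minimal sketch of that flow in Node.js is below. It uses the robots-parser npm package in place of the robots-txt-guard library mentioned above (both parse robots.txt; robots-parser exposes a simple isAllowed() check), and it assumes Node 18+ for the global fetch and an ES module file for top-level await. The user-agent string and target URL are placeholders, not anything Apify-specific.

```js
// robots-check.mjs — fetch a site's robots.txt and check a URL before scraping.
import robotsParser from 'robots-parser'; // npm install robots-parser

const USER_AGENT = 'my-apify-actor'; // placeholder: use your actor's real user-agent

async function isAllowedByRobots(targetUrl) {
    // Resolve the robots.txt URL from the target page's origin.
    const robotsUrl = new URL('/robots.txt', targetUrl).href;
    const response = await fetch(robotsUrl);
    // A missing robots.txt is conventionally treated as "everything allowed".
    if (!response.ok) return true;
    const robots = robotsParser(robotsUrl, await response.text());
    // isAllowed() can return undefined for out-of-scope URLs; treat that as allowed.
    return robots.isAllowed(targetUrl, USER_AGENT) !== false;
}

// Usage: skip any URL that robots.txt disallows for this user-agent.
const targetUrl = 'https://example.com/some/page'; // placeholder target
if (await isAllowedByRobots(targetUrl)) {
    // ...proceed with the request for targetUrl...
    console.log(`Allowed: ${targetUrl}`);
} else {
    console.log(`Skipping ${targetUrl}: disallowed by robots.txt`);
}
```

In a crawler, you would run this check once per host (caching the parsed robots.txt) rather than re-fetching it for every URL.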