General way to scrape blogs, articles and content?

Hi all!

Is there a general way to scrape blogs of various types?

I want to create a program that:

  • takes in a list of top-level blog URLs, e.g. ["http://paulgraham.com/index.html", "https://www.vitalik.ca", "https://medium.com/@FEhrsam", "https://openai.com/blog/"]
  • extracts the list of post/article URLs from each blog
  • for each post, extracts common fields, e.g.:
    {title, author, date, contentBody, photoURLs=[]}
  • where photoURLs holds the URLs of any photos within the post body (but not icons, ads, etc.)
  • ignores irrelevant pages (e.g. "Contact" or login-required pages) and keeps only articles and blog posts
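
To make the goal concrete, here is a rough Python sketch of the pipeline I have in mind, using only the standard library. It is an illustration under strong assumptions, not a working general solution: it assumes each post exposes Open Graph style meta tags (og:title, author, article:published_time) and wraps its body in an <article> element, which many blogs do not. The names Article, ArticleParser, extract_article, discover_post_links and SKIP_SEGMENTS are all made up for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical list of path segments that mark non-article pages.
SKIP_SEGMENTS = {"contact", "login", "signup", "about"}


@dataclass
class Article:
    title: str = ""
    author: str = ""
    date: str = ""
    contentBody: str = ""
    photoURLs: list = field(default_factory=list)


class ArticleParser(HTMLParser):
    """Collects meta tags, all links, and text/images inside <article>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.article = Article()
        self.links = []
        self._in_title = False
        self._article_depth = 0
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("property") or a.get("name") or ""
            content = a.get("content") or ""
            if key == "og:title":
                self.article.title = content  # preferred over <title>
            elif key == "author":
                self.article.author = content
            elif key == "article:published_time":
                self.article.date = content
        elif tag == "article":
            self._article_depth += 1
        elif tag == "img" and self._article_depth and a.get("src"):
            # Only images inside the article body, so site icons are skipped.
            self.article.photoURLs.append(urljoin(self.base_url, a["src"]))
        elif tag == "a" and a.get("href"):
            self.links.append(urljoin(self.base_url, a["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "article" and self._article_depth:
            self._article_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title and not self.article.title:
            self.article.title = text  # fallback when og:title is absent
        elif self._article_depth:
            self._text.append(text)


def extract_article(html, url):
    """Parse one post's HTML into an Article record."""
    parser = ArticleParser(url)
    parser.feed(html)
    parser.close()
    parser.article.contentBody = " ".join(parser._text)
    return parser.article


def discover_post_links(index_html, index_url):
    """Return same-host links from an index page, minus obvious non-articles."""
    parser = ArticleParser(index_url)
    parser.feed(index_html)
    parser.close()
    host = urlparse(index_url).netloc
    seen, result = set(), []
    for link in parser.links:
        parts = urlparse(link)
        segments = set(parts.path.strip("/").lower().split("/"))
        if parts.netloc != host or segments & SKIP_SEGMENTS:
            continue
        if link == index_url or link in seen:
            continue
        seen.add(link)
        result.append(link)
    return result
```

Fetching the pages (and handling JavaScript-rendered sites such as Medium) is left out on purpose; the part I am unsure how to generalize is the extraction logic, since every blog structures its HTML differently.
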
Does Crawlee, Apify, Scrapy, or any other (free or paid) program do this?

Thank you!