General way to scrape blogs, articles and content?

Hi all!

Is there a general way to scrape blogs of various types?

I want to create a program that:

* takes in a list of top-level blog URLs, e.g.:
["http://paulgraham.com/index.html", "https://www.vitalik.ca", "https://medium.com/@FEhrsam", "https://openai.com/blog/"]

* extracts the list of article/post URLs for each blog

* for each article, extracts common fields, e.g.:
{title, author, date, contentBody, photoURLs=[]}
* where photoURLs holds only the photos contained within the article body (not icons, ads, etc.)

* ignores irrelevant pages (e.g. "Contact" or pages that require login) and keeps just articles and blog posts
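
To make the requirements above concrete, here is a rough sketch of the per-page step using only the Python standard library. It rests on assumptions that will not hold for every blog: that the post body sits inside an `<article>` element, and that author and date are exposed through `author` / `article:published_time` meta tags. The names `ArticleParser`, `extract_article` and `same_site_links` are made up for illustration, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ArticleParser(HTMLParser):
    """Collects meta tags, links, and the text/images found inside <article>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta = {}
        self.title_parts = []
        self.body_parts = []
        self.photo_urls = []
        self.links = []
        self._in_title = False
        self._article_depth = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("property") or a.get("name")
            if key and a.get("content"):
                self.meta[key.lower()] = a["content"]
        elif tag == "article":
            self._article_depth += 1
        elif tag == "img" and self._article_depth and a.get("src"):
            # Only images inside <article> count as photos (assumption).
            self.photo_urls.append(urljoin(self.base_url, a["src"]))
        elif tag == "a" and a.get("href"):
            self.links.append(urljoin(self.base_url, a["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "article" and self._article_depth:
            self._article_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._article_depth:
            self.body_parts.append(data)


def extract_article(html, url):
    """Return the common fields, or None if the page has no <article> text."""
    p = ArticleParser(url)
    p.feed(html)
    p.close()
    body = " ".join(" ".join(p.body_parts).split())
    if not body:
        return None  # heuristic: treat as an irrelevant (non-article) page
    return {
        "title": p.meta.get("og:title") or "".join(p.title_parts).strip(),
        "author": p.meta.get("author") or p.meta.get("article:author"),
        "date": p.meta.get("article:published_time"),
        "contentBody": body,
        "photoURLs": p.photo_urls,
    }


def same_site_links(html, url):
    """Candidate article URLs: absolute links that stay on the same host."""
    p = ArticleParser(url)
    p.feed(html)
    p.close()
    host = urlparse(url).netloc
    return sorted({link for link in p.links if urlparse(link).netloc == host})
```

Sites without an `<article>` element or without these meta tags would need their own rules, which is why I am asking whether a general tool already covers this.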

Does Crawlee, Apify, Scrapy, or any other (free or paid) program do this?

Thank you!

Apify & Crawlee • 4y ago
