General way to scrape blogs, articles and content?
Hi all!
Is there a general way to scrape blogs of various types?
I want to create a program that:
- takes a list of top-level blog URLs, e.g. ["http://paulgraham.com/index.html", "https://www.vitalik.ca", "https://medium.com/@FEhrsam", "https://openai.com/blog/"]
- extracts the list of article/post URLs for each blog
- for each article, extracts common fields, e.g. {title, author, date, contentBody, photoURLs=[]}, where photoURLs holds the URLs of any photos inside the article body (but not icons, ads, etc.)
- ignores irrelevant pages (e.g. "Contact" or pages that require login) and keeps only articles and blog posts
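To make the target concrete, here is a rough sketch of the shape I have in mind, using only the Python standard library. It is not a general solution. Every name in it (Article, SKIP_WORDS, candidate_article_urls, parse_article) is made up for illustration, the fields are snake_case versions of the ones listed above, and it assumes pages mark up the post with <article>, <meta name="author"> and <time datetime="...">, which many sites will not do. It works on HTML strings that have already been downloaded, so fetching is left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical skip list; substring matching is crude and needs tuning per site.
SKIP_WORDS = ("contact", "login", "signin", "signup")


@dataclass
class Article:
    title: str = ""
    author: str = ""
    date: str = ""
    content_body: str = ""
    photo_urls: list = field(default_factory=list)


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def candidate_article_urls(index_url, index_html):
    """Return same-host links from an index page, minus obvious non-articles."""
    collector = LinkCollector()
    collector.feed(index_html)
    host = urlparse(index_url).netloc
    urls = []
    for href in collector.hrefs:
        url = urljoin(index_url, href).split("#")[0]  # resolve, drop fragment
        parsed = urlparse(url)
        if parsed.netloc != host or url == index_url:
            continue
        if any(word in parsed.path.lower() for word in SKIP_WORDS):
            continue
        if url not in urls:
            urls.append(url)
    return urls


class ArticleParser(HTMLParser):
    """Pulls title, author, date, <p> text and <img> URLs from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.article = Article()
        self.paragraphs = []
        self._article_depth = 0  # > 0 while inside an <article> element
        self._current = None     # "title" or "p" while inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._article_depth += 1
        elif tag == "title":
            self._current = "title"
        elif tag == "meta" and attrs.get("name") == "author":
            self.article.author = attrs.get("content") or ""
        elif tag == "time" and not self.article.date:
            self.article.date = attrs.get("datetime") or ""
        elif self._article_depth and tag == "p":
            self._current = "p"
            self.paragraphs.append("")
        elif self._article_depth and tag == "img" and attrs.get("src"):
            # Only images inside <article> count as body photos.
            self.article.photo_urls.append(urljoin(self.page_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag in ("title", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.article.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def parse_article(page_url, page_html):
    """Parse one downloaded page into an Article record."""
    parser = ArticleParser(page_url)
    parser.feed(page_html)
    article = parser.article
    article.title = article.title.strip()
    article.content_body = "\n\n".join(
        p.strip() for p in parser.paragraphs if p.strip()
    )
    return article
```

The part I do not know how to generalize is exactly what this sketch hard-codes: which links on an index page are posts, and which part of a page is the article body.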
Thank you!