rival-black
rival-black•2y ago

Strange Sitemaps. What can I do to use them?

Hi! I discovered Crawlee the day before yesterday and I like it a lot, but now I want to get it up and running. The first thing I ran into is that, across my set of sites, some have no sitemap at all, while others split theirs into sections, and it's painfully hard to handle them uniformly because they have nothing in common. I'm really hoping for your advice! What can I do when the sitemaps can't quite be trusted, but I still want to achieve my goal? I'd really like to use the sites' sitemaps so that I don't have to write extra custom logic for each site in the set, but I can't think of a solution. Please tell me what you think!
5 Replies
conscious-sapphire
conscious-sapphire•2y ago
Instead of relying on Sitemaps, you can use a web crawler to explore and discover the pages on your sites. Web crawlers navigate through links on web pages to discover new content. While this method may take longer and may not cover all pages, it can still provide a good starting point.
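To illustrate the idea, here is a minimal, self-contained sketch of link discovery: extracting same-origin links from a page's HTML so they can be queued for crawling. In Crawlee this is what `enqueueLinks()` does for you (with far more options); the function name and the regex-based extraction here are just illustrative.

```javascript
// Sketch: discover same-origin links in a page's HTML.
// A real crawler (e.g. Crawlee's enqueueLinks) uses a proper HTML parser;
// the regex here only serves to show the idea.
function discoverLinks(html, baseUrl) {
  const origin = new URL(baseUrl).origin;
  const urls = new Set();
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    const url = new URL(match[1], baseUrl); // resolve relative links
    if (url.origin === origin) urls.add(url.href); // keep same-origin only
  }
  return [...urls];
}

const html = '<a href="/about">About</a> <a href="https://other.example/x">Ext</a>';
console.log(discoverLinks(html, 'https://example.com/'));
// -> [ 'https://example.com/about' ]
```

Each discovered URL gets fetched in turn and its links discovered the same way, until the frontier is empty.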
rival-black
rival-blackOP•2y ago
You mean Crawlee's own mechanisms? Yes, I think that would help a lot if my only goal were to build a site map myself, but I wanted to rely on ready-made sitemaps as much as possible, since they already summarize the content; without them I'd have to add per-site logic in the code for every site in the pool. It's odd, but maybe someone knows a solution to this problem? Without paid AI-service integrations.
conscious-sapphire
conscious-sapphire•2y ago
There are open-source web scraping libraries available in various programming languages that can help you extract information from web pages. You can write custom scripts to scrape the necessary information from each site, including URLs and relevant metadata, and generate your own Sitemaps based on that data. Popular libraries include BeautifulSoup (Python), Scrapy (Python), and Puppeteer (JavaScript).
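Following that suggestion, here is a hedged sketch of the "generate your own sitemap" step: once a crawler has collected a list of URLs, emit standard sitemap XML so every site in the pool looks the same downstream. The function name is illustrative, not from any library.

```javascript
// Sketch: turn a list of crawled URLs into sitemap-protocol XML
// (urlset/url/loc elements, per https://www.sitemaps.org/protocol.html).
function buildSitemap(urls) {
  const entries = urls
    .map((u) => `  <url><loc>${u}</loc></url>`)
    .join('\n');
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    `${entries}\n</urlset>`;
}

console.log(buildSitemap(['https://example.com/', 'https://example.com/about']));
```

The output can be written to disk and consumed by the same sitemap-driven logic you already use for sites that publish one.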
Lukas Krivka
Lukas Krivka•2y ago
Crawlee does support extracting sitemaps directly https://crawlee.dev/api/3.7/utils/class/Sitemap
Sitemap | API | Crawlee
Loads one or more sitemaps from given URLs, following references in sitemap index files, and exposes the contained URLs. Example usage:
```javascript
// Load a sitemap
const sitemap = await Sitemap.load(['https://example.com/sitemap.xml', 'https://example.com/sitemap_2.xml.gz']);
// Enqueue all the contained URLs (including those from sub-...
```
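For sites whose sitemaps are split into sections, the key point is that a sitemap index file is just XML whose `<loc>` entries point at the sub-sitemaps. Crawlee's `Sitemap.load` follows those references for you; this self-contained sketch only shows the shape of the data it works with (the function name and sample XML are illustrative).

```javascript
// Sketch: pull <loc> entries out of a sitemap (or sitemap index) document.
// Sitemap.load in Crawlee does this recursively, fetching each sub-sitemap.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

const sitemapIndex = `
<sitemapindex>
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>`;

console.log(extractLocs(sitemapIndex));
// -> [ 'https://example.com/sitemap-posts.xml',
//      'https://example.com/sitemap-pages.xml' ]
```

So sectioned sitemaps are not really "strange": load the index, and every section's URLs come out in one flat list.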