quickest-silver · 3y ago

downloadListOfUrls options

How do I crawl XML nested inside XML (a sitemap of sitemaps)? From there I have to collect the links. Any suggestions for doing this with CheerioCrawler would help me move on. #crawlee
3 Replies
Lukas Krivka · 3y ago
XML is normally parsed by CheerioCrawler; you can work with it just like with HTML.
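
A minimal sketch of what Lukas describes, assuming a sitemap.org-style feed (the 'loc' selector follows that schema, and the additionalMimeTypes option may be needed so CheerioCrawler accepts XML content types; check the docs for your Crawlee version):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // allow XML responses to be parsed (assumption; HTML types are the default)
    additionalMimeTypes: ['application/xml', 'text/xml'],
    async requestHandler({ $, enqueueLinks }) {
        // cheerio works on the XML document just like on HTML
        const links = $('loc')
            .map((_, el) => $(el).text())
            .get();
        // nested sitemaps come back as .xml links; everything else is a page
        await enqueueLinks({ urls: links, strategy: 'same-domain' });
    },
});

await crawler.run(['https://example.com/sitemap.xml']);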
quickest-silver (OP) · 3y ago
Is there any doc available? Thanks.
ambitious-aqua · 3y ago
I just did a quick workaround for this (only two levels at the moment): get the list of URLs, loop through them, and when we know one is XML, run downloadListOfUrls on it. Then enqueue all of those links into the same requestQueue.
// imports assume Crawlee v3-style exports
import { downloadListOfUrls, enqueueLinks, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();
// the top-level sitemap we start from
const url = 'https://example.com/sitemap.xml';
const parsedUrl = new URL(url);

const urls = await downloadListOfUrls({ url });

for (const sitemapUrl of urls) {
    // check whether the link ends in .xml, and if so download all of its links;
    // will only work for two levels, might be worth making a recursive function
    if (sitemapUrl.search(/\.xml/gi) !== -1) {
        const nestedUrls = await downloadListOfUrls({ url: sitemapUrl });
        await enqueueLinks({
            urls: nestedUrls,
            requestQueue,
            baseUrl: parsedUrl.origin,
            strategy: 'same-domain',
        });
    }
}

// still need to check/filter the top-level urls in case the whole sitemap was all nested sitemaps
await enqueueLinks({
    urls,
    requestQueue,
    baseUrl: parsedUrl.origin,
    strategy: 'same-domain',
});
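
As the comment above suggests, a recursive version handles arbitrary nesting depth. A sketch of that idea (my own extension, not from the thread; the enqueueSitemap helper name is hypothetical):

import { downloadListOfUrls, enqueueLinks, RequestQueue } from 'crawlee';

async function enqueueSitemap(sitemapUrl, requestQueue) {
    const urls = await downloadListOfUrls({ url: sitemapUrl });
    // split nested sitemaps from ordinary page links
    const nestedSitemaps = urls.filter((u) => /\.xml/i.test(u));
    const pageUrls = urls.filter((u) => !/\.xml/i.test(u));

    await enqueueLinks({
        urls: pageUrls,
        requestQueue,
        baseUrl: new URL(sitemapUrl).origin,
        strategy: 'same-domain',
    });

    // recurse into each nested sitemap
    for (const nested of nestedSitemaps) {
        await enqueueSitemap(nested, requestQueue);
    }
}

const requestQueue = await RequestQueue.open();
await enqueueSitemap('https://example.com/sitemap.xml', requestQueue);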
