other-emerald 2y ago

Given a URL, how can I build a tree object of its children with Crawlee?

Hi! I'm trying to build an object that contains the hierarchy of a given URL, kind of like a sitemap. Is this possible with Crawlee? I could not find a way to access the links from enqueueLinks().
3 Replies
wise-white 17mo ago
There are a few ways to infer the structure between pages when building a sitemap. One is to use the URL structure (e.g. /cheese/brie is a child of /cheese); you can construct this from the list of URLs. Another is to use the first page on which a link is found. I haven't tried it, but I wonder whether you can pass the parent page URL using the userData parameter: https://crawlee.dev/api/core/class/Request#userData
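To illustrate the URL-structure approach, here is a minimal sketch in plain JavaScript that nests a flat list of URLs by path segment. It does not use Crawlee at all; the function name buildTreeFromUrls, the node shape and the example.com URLs are made up for illustration, and it assumes all URLs belong to the same site.

```javascript
// Build a nested tree from a flat list of URLs using their path segments.
// Hypothetical helper: the name and the node shape are illustrative only.
function buildTreeFromUrls(urls) {
    // Object.create(null) avoids clashes with keys such as "constructor".
    const root = { segment: '', path: '/', children: Object.create(null) };
    for (const raw of urls) {
        const { pathname } = new URL(raw);
        const segments = pathname.split('/').filter(Boolean);
        let node = root;
        let path = '';
        for (const segment of segments) {
            path += `/${segment}`;
            // Create intermediate nodes on demand, so /cheese/brie
            // implies a /cheese node even if /cheese was never crawled.
            if (!node.children[segment]) {
                node.children[segment] = { segment, path, children: Object.create(null) };
            }
            node = node.children[segment];
        }
    }
    return root;
}

const tree = buildTreeFromUrls([
    'https://example.com/cheese',
    'https://example.com/cheese/brie',
    'https://example.com/cheese/cheddar',
    'https://example.com/wine',
]);
console.log(JSON.stringify(tree, null, 2));
```

This only reflects how the URLs are named, not how the pages actually link to each other, so it works best on sites with a clean path hierarchy.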
wise-white 17mo ago
I've also used cheerio to construct a site structure by analysing the menu structure of a specific site, which is a lot more reliable but also a lot more effort.
lemurio 16mo ago
@vic.aace you can get links from enqueueLinks() like this:
const enqueuedRequests = await enqueueLinks();
enqueuedRequests.processedRequests.forEach((request) => {
    console.log(request.uniqueKey);
});
and as @zag mentioned, you can save the parent url in the userData of each request
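A rough sketch of the userData idea, assuming a CheerioCrawler: each enqueued link carries the URL of the page it was found on, and a small helper turns those records into a tree. The key name parentUrl, the helpers crawl and buildTreeFromParents, the request limit and the example.com URL are all made up for illustration. I haven't run this against a real site.

```javascript
// Maps each crawled URL to the page on which its link was first found
// (null for the start page).
const parentOf = new Map();

async function crawl(startUrl) {
    // Loaded lazily so the tree helper below can be used without Crawlee.
    const { CheerioCrawler } = await import('crawlee');
    const crawler = new CheerioCrawler({
        maxRequestsPerCrawl: 50, // arbitrary safety limit for this sketch
        async requestHandler({ request, enqueueLinks }) {
            // "parentUrl" is a hypothetical key name, set a few lines below.
            parentOf.set(request.url, request.userData.parentUrl ?? null);
            // Every link enqueued from this page carries this page's URL.
            await enqueueLinks({ userData: { parentUrl: request.url } });
        },
    });
    await crawler.run([startUrl]);
}

// Turn the child -> parent map into a nested { url, children } tree.
function buildTreeFromParents(parents, rootUrl) {
    const nodes = new Map();
    const getNode = (url) => {
        if (!nodes.has(url)) nodes.set(url, { url, children: [] });
        return nodes.get(url);
    };
    for (const [url, parentUrl] of parents) {
        const node = getNode(url);
        if (parentUrl !== null && url !== rootUrl) {
            getNode(parentUrl).children.push(node);
        }
    }
    return getNode(rootUrl);
}

// Usage (assumes Crawlee is installed):
// crawl('https://example.com').then(() => {
//     const tree = buildTreeFromParents(parentOf, 'https://example.com');
//     console.log(JSON.stringify(tree, null, 2));
// });
```

Since the request queue skips URLs it has already seen, each page should end up under the first page that linked to it, which matches the "first page a link is found in" idea from above.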
