other-emerald•2y ago
Given a URL, how can I build a tree object of its children with Crawlee?
Hi! I'm trying to build an object that contains the hierarchy of a given URL, kind of like a sitemap. Is this possible with Crawlee? I could not find a way to access the links from enqueueLinks().
3 Replies
wise-white•17mo ago
There are a few ways to infer the structure between pages in order to build a sitemap.
One way is to use the URL structure (e.g. /cheese/brie is a child of /cheese). You can construct this from the list of URLs.
Another is to use the first page on which a link is found. I haven't tried it, but I'm wondering if you can pass the parent page URL using the userData parameter:
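A minimal sketch of the URL-structure idea, assuming you already have a flat list of crawled URLs. The `PathNode` shape and the `buildPathTree` helper are made up for illustration and are not part of Crawlee; the example.com URLs are placeholders.

```typescript
// Hypothetical node shape for the tree (not a Crawlee type).
interface PathNode {
    segment: string;
    children: Map<string, PathNode>;
}

// Builds a tree from URL paths: each path segment becomes one level,
// so /cheese/brie ends up as a child of /cheese.
function buildPathTree(urls: string[]): PathNode {
    const root: PathNode = { segment: '/', children: new Map() };
    for (const raw of urls) {
        // Split the pathname into non-empty segments.
        const segments = new URL(raw).pathname.split('/').filter(Boolean);
        let node = root;
        for (const segment of segments) {
            let child = node.children.get(segment);
            if (!child) {
                child = { segment, children: new Map() };
                node.children.set(segment, child);
            }
            node = child;
        }
    }
    return root;
}

const pathTree = buildPathTree([
    'https://example.com/cheese',
    'https://example.com/cheese/brie',
    'https://example.com/wine',
]);
```

This ignores the host and the query string, so it only makes sense for URLs from a single site.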
https://crawlee.dev/api/core/class/Request#userData
wise-white•17mo ago
I've also used cheerio to construct a site structure by analysing the menu structures of a specific site, which is a lot more reliable but takes a lot more effort.
@vic.aace you can get links from enqueueLinks() like this:
and as @zag mentioned, you can save the parent URL in the userData of each request
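A rough sketch of how the parent-in-userData idea could be turned into a tree once the crawl has finished. It assumes each page was stored as a record with its own URL and the parent URL read from request.userData (for example, by calling enqueueLinks with something like userData: { parent: request.url } and saving one record per handled request). The `PageRecord` and `LinkNode` shapes and the `buildLinkTree` helper are hypothetical names, not Crawlee APIs.

```typescript
// Hypothetical record saved for each crawled page; `parent` would come from
// request.userData and is null for the start URL.
interface PageRecord {
    url: string;
    parent: string | null;
}

// Hypothetical tree node.
interface LinkNode {
    url: string;
    children: LinkNode[];
}

// Links every page to the page it was first discovered on.
// Pages with no known parent are returned as roots.
function buildLinkTree(records: PageRecord[]): LinkNode[] {
    // Keep only the first record per URL, in case a page was stored twice.
    const unique = new Map<string, PageRecord>();
    for (const record of records) {
        if (!unique.has(record.url)) unique.set(record.url, record);
    }

    // Create one node per URL.
    const nodes = new Map<string, LinkNode>();
    for (const url of unique.keys()) {
        nodes.set(url, { url, children: [] });
    }

    // Attach each node to its parent, or treat it as a root.
    const roots: LinkNode[] = [];
    for (const { url, parent } of unique.values()) {
        const node = nodes.get(url)!;
        const parentNode = parent !== null ? nodes.get(parent) : undefined;
        if (parentNode && parentNode !== node) {
            parentNode.children.push(node);
        } else {
            roots.push(node);
        }
    }
    return roots;
}

const linkTree = buildLinkTree([
    { url: 'https://example.com/', parent: null },
    { url: 'https://example.com/cheese', parent: 'https://example.com/' },
    { url: 'https://example.com/cheese/brie', parent: 'https://example.com/cheese' },
]);
```

Since a request queue normally handles each unique URL once, every page gets exactly one parent (the page that enqueued it first), which is what keeps the result a tree rather than a graph.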