fair-rose•2y ago
Given a URL, how can I build a tree object of its children with Crawlee?
Hi! I'm trying to build an object that contains the hierarchy of a given URL, kind of like a sitemap. Is this possible with Crawlee? I could not find a way to access the links from enqueueLinks().
3 Replies
sunny-green•2y ago
There are a few ways to infer the structure between pages in order to build a sitemap.
One way is to use the URL structure (e.g. /cheese/brie is a child of /cheese). You can construct this from the list of URLs.
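A minimal sketch of the URL-structure approach, assuming you already have the crawled URLs as a plain array of strings. `PathNode` and `buildPathTree` are hypothetical names made up for illustration, not part of Crawlee:

```typescript
// Hypothetical node shape for illustration; not a Crawlee type.
interface PathNode {
  path: string;          // e.g. "/cheese/brie"
  children: PathNode[];
}

// Builds a tree purely from URL paths: "/cheese/brie" becomes a child of "/cheese".
// Query strings and fragments are ignored, and all URLs are assumed to share one host.
function buildPathTree(urls: string[]): PathNode {
  const root: PathNode = { path: "/", children: [] };
  for (const url of urls) {
    const segments = new URL(url).pathname
      .split("/")
      .filter((segment) => segment.length > 0);
    let node = root;
    let path = "";
    for (const segment of segments) {
      path += "/" + segment;
      const currentPath = path;
      // Reuse the existing child for this path, or create it on first sight.
      let child = node.children.find((c) => c.path === currentPath);
      if (!child) {
        child = { path: currentPath, children: [] };
        node.children.push(child);
      }
      node = child;
    }
  }
  return root;
}

const pathTree = buildPathTree([
  "https://example.com/cheese",
  "https://example.com/cheese/brie",
  "https://example.com/cheese/cheddar",
  "https://example.com/wine/merlot",
]);
console.log(JSON.stringify(pathTree, null, 2));
```

Note that intermediate nodes such as /wine are created even if that page itself was never crawled, so this reflects the URL layout rather than how the pages actually link to each other.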
Another is to use the first page on which a link is found. I haven't tried it, but I'm wondering if you can pass the parent page URL using the userData parameter:
https://crawlee.dev/api/core/class/Request#userData
sunny-green•2y ago
I’ve also used cheerio to construct a site structure by analysing the menu structures of a specific site, which is a lot more reliable but also a lot more effort.
@vic.aace you can get links from enqueueLinks() like this:
and as @zag mentioned, you can save the parent URL in the userData of each request
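Once each page carries its parent URL, turning that into a tree is plain data wrangling. The sketch below assumes you have collected one record per crawled page with its URL and the URL of the page it was first found on (for example by reading something like `request.userData.parentUrl` in the request handler, where `parentUrl` is a key name chosen here for illustration). `PageRecord`, `PageNode` and `buildLinkTree` are hypothetical names, and the code does not call Crawlee at all:

```typescript
// Hypothetical record shape: one entry per crawled page.
interface PageRecord {
  url: string;
  parentUrl: string | null; // null for the start URL
}

// Hypothetical tree node; not a Crawlee type.
interface PageNode {
  url: string;
  children: PageNode[];
}

// Links every page to the page it was first found on.
// Assumes each URL appears once in `records` (the request queue
// normally deduplicates requests, but that is an assumption here).
// Pages whose parent is null or unknown are returned as roots.
function buildLinkTree(records: PageRecord[]): PageNode[] {
  const nodes = new Map<string, PageNode>();
  for (const { url } of records) {
    nodes.set(url, { url, children: [] });
  }

  const roots: PageNode[] = [];
  for (const { url, parentUrl } of records) {
    const node = nodes.get(url)!;
    const parent = parentUrl === null ? undefined : nodes.get(parentUrl);
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  }
  return roots;
}

const records: PageRecord[] = [
  { url: "https://example.com/", parentUrl: null },
  { url: "https://example.com/cheese", parentUrl: "https://example.com/" },
  { url: "https://example.com/cheese/brie", parentUrl: "https://example.com/cheese" },
];
const linkRoots = buildLinkTree(records);
console.log(JSON.stringify(linkRoots, null, 2));
```

Unlike a tree derived from URL paths, this one follows the actual links, so a page can end up under a parent with an unrelated path.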