graceful-beige 2y ago

Excluding URLs from enqueueLinks

I've scraped a website, enqueueing all links that match a glob pattern, e.g.:
globs: ['http?(s)://crawlee.dev/*/*'],
Now I want to scrape the same website again, but without visiting links I've already scraped. My flow is: I provide a root URL, and a root route handler enqueues links using the pattern. How can I run the same crawl again while excluding all the links scraped in the previous run? P.S. This website adds new content weekly.
1 Reply
Lukas Krivka 2y ago
Store the scraped URLs in a named dataset or key-value store, and on start load them into a Set that you use for exclusion. Then use the transform function of enqueueLinks to skip the URLs you already have: https://crawlee.dev/api/core/interface/RequestTransform
RequestTransform | API | Crawlee
Takes a RequestOptions object and changes its attributes in a desired way. This user function is used by enqueueLinks to modify requests before enqueuing them.
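
A minimal sketch of the exclusion step, assuming the previously scraped URLs have already been loaded into a Set. The names `RequestLike`, `makeSkipSeenTransform`, `previouslyScraped` and `skipSeen` are made up for illustration, and the URL in the Set is placeholder data. As far as I know, returning a falsy value from the transform makes enqueueLinks skip that request, but check the RequestTransform docs linked above for your Crawlee version.

```typescript
// Minimal local stand-in for Crawlee's RequestOptions; only `url` is needed here.
// (Hypothetical type, declared so the sketch runs without Crawlee installed.)
interface RequestLike {
    url: string;
}

// Build a transform that drops any request whose URL is in the `seen` Set.
// Returning `false` signals that the request should be skipped.
function makeSkipSeenTransform(seen: Set<string>) {
    return <T extends RequestLike>(request: T): T | false =>
        seen.has(request.url) ? false : request;
}

// Placeholder data: in a real run this Set would be filled from the
// named dataset or key-value store written by the previous run.
const previouslyScraped = new Set<string>([
    'https://crawlee.dev/docs/quick-start',
]);

const skipSeen = makeSkipSeenTransform(previouslyScraped);

// Inside the root route handler (sketch only, not executed here):
// await enqueueLinks({
//     globs: ['http?(s)://crawlee.dev/*/*'],
//     transformRequestFunction: skipSeen,
// });
```

The comparison is an exact string match, so the stored URLs need to be in the same form as the ones enqueueLinks produces (trailing slashes, query strings and so on). Remember to write each newly scraped URL back to the same named storage so the next run can exclude it too.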
