graceful-beige•2y ago

Excluding urls from enqueue links

I've scraped a website, enqueuing all the links found with a globs pattern, e.g.:

 globs: ['http?(s)://crawlee.dev/*/*'],

Now I want to scrape the same website again, but without revisiting links I've already scraped. My flow is that I provide a root URL, and a root route handler enqueues links using the pattern. How can I do the same thing again while excluding all the links scraped in the previous run? PS: this website adds new content weekly.

1 Reply

Lukas Krivka•2y ago

Store the scraped URLs in a named dataset or KV store, and on start load them into a Set that you use for exclusion. Then you can use the transform function to skip the ones you already have: https://crawlee.dev/api/core/interface/RequestTransform

RequestTransform | API | Crawlee

Takes an Apify RequestOptions object and changes its attributes in a desired way. This user function is used by enqueueLinks to modify requests before enqueuing them.
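
To make the suggestion above concrete, here is a minimal sketch of the exclusion logic in TypeScript. It does not import Crawlee, so it runs on its own: `RequestOptionsLike` is an assumed stand-in for the part of Crawlee's `RequestOptions` that matters here, and `normalizeUrl`, `makeSkipVisitedTransform`, and the example URLs are hypothetical names chosen for illustration. The Set is hard-coded; in a real crawler it would be filled from the named dataset or KV store written by the previous run.

```typescript
// Assumed stand-in for the fields of Crawlee's RequestOptions used below.
interface RequestOptionsLike {
    url: string;
    [key: string]: unknown;
}

// Hypothetical helper: drop the #fragment so the same page is not
// treated as two different URLs.
function normalizeUrl(url: string): string {
    return url.split('#')[0];
}

// Hypothetical helper: builds a transform that returns `false` for URLs
// already in `visited` (a falsy return tells enqueueLinks to skip the
// request) and returns the request unchanged otherwise.
function makeSkipVisitedTransform(visited: Set<string>) {
    return (request: RequestOptionsLike): RequestOptionsLike | false =>
        visited.has(normalizeUrl(request.url)) ? false : request;
}

// URLs from the previous run. Hard-coded here; in practice load them
// from a named dataset or KV store when the crawler starts.
const previouslyVisited = new Set<string>(
    ['https://crawlee.dev/docs/quick-start'].map(normalizeUrl),
);

const skipVisited = makeSkipVisitedTransform(previouslyVisited);

// Simulate the links that enqueueLinks would find on a page.
const candidates: RequestOptionsLike[] = [
    { url: 'https://crawlee.dev/docs/quick-start' },
    { url: 'https://crawlee.dev/docs/quick-start#installation' },
    { url: 'https://crawlee.dev/docs/introduction' },
];

// Only links that survive the transform would be enqueued.
const toEnqueue: string[] = candidates
    .filter((candidate) => skipVisited(candidate) !== false)
    .map((candidate) => candidate.url);

console.log(toEnqueue);
```

In a Crawlee route handler, a function like `skipVisited` would be passed as the `transformRequestFunction` option of `enqueueLinks`, alongside the existing `globs`. At the end of each run, the newly scraped URLs need to be written back to the same named storage so the next run can exclude them too.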
