graceful-beige•2y ago
Excluding URLs from enqueue links
I've scraped a website, enqueuing all links that match a globs pattern,
e.g.
Now I want to scrape the same website, but I don't want to scrape links that I've already visited.
My flow is that I provide a root URL and have a root route handler that enqueues links using the pattern.
How could I do the same thing again while excluding all the links I scraped in the previous run?
P.S. This website adds new content weekly.
1 Reply
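Store the scraped URLs in a named dataset or key-value store (named storages persist between runs), and on start load them into a Set that you use for exclusion. Then use the request transform of enqueueLinks (the transformRequestFunction option) to skip the URLs you already have: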
https://crawlee.dev/api/core/interface/RequestTransform
RequestTransform | API | Crawlee
Takes an Apify RequestOptions object and changes its attributes in a desired way. This user function is used by enqueueLinks to modify requests before enqueuing them.
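A rough sketch of the exclusion step, in case it helps. It only shows the transform itself, so it runs without Crawlee installed. The RequestLike interface, the makeSkipSeenTransform helper and the example.com URLs are made up for illustration. It relies on the documented behaviour that a transform returning a falsy value causes the request to be skipped.

```typescript
// Minimal stand-in for the request options the transform receives.
// Crawlee's real RequestOptions has more fields; only `url` is needed here.
interface RequestLike {
    url: string;
}

// Hypothetical helper: builds a transform that drops any request whose URL
// is already in `seen` and passes every other request through unchanged.
function makeSkipSeenTransform(seen: Set<string>) {
    return <T extends RequestLike>(request: T): T | false => {
        return seen.has(request.url) ? false : request;
    };
}

// Illustrative URLs standing in for the previous run's results. In practice
// you would load these on start from your named dataset or key-value store.
const previouslyScraped = new Set<string>([
    'https://example.com/posts/1',
    'https://example.com/posts/2',
]);

const skipSeen = makeSkipSeenTransform(previouslyScraped);

// Inside the root route handler this would be wired up roughly as:
//   await enqueueLinks({ globs: [/* your pattern */], transformRequestFunction: skipSeen });
```

The Set lookup is an exact string match, so store the URLs in the same form in which enqueueLinks produces them (trailing slashes, query strings and so on), otherwise already-scraped pages may slip through.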