foreign-sapphire · 3y ago

Crawlee seems to be getting a cached version of an XML file

I'm starting my crawler with the first request being https://site.com/sitemap.xml. Then I read all the URLs in the sitemap and check the modified date (the website does update the modified date in the sitemap), and only crawl the pages that were modified. The problem is that the crawler in production does that once every hour, and it always gets the same version of the sitemap.xml. If I run it after a while on my PC, it finds modified URLs, crawls the pages and gets the updates. I'm enqueuing the XML with await crawler.run([{ url: "sitemap.xml", label: "SITEMAP" }]); Is there a way to add headers and prevent caching here?
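For reference, a minimal sketch of the setup described above, assuming a CheerioCrawler and a router with a SITEMAP handler (the crawler class, the lastRun cutoff, and the no-cache headers are illustrative assumptions, not the OP's actual code):

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('SITEMAP', async ({ $, crawler, log }) => {
    // Illustrative cutoff: the time of the previous hourly run.
    const lastRun = new Date(Date.now() - 60 * 60 * 1000);
    const modified: string[] = [];
    $('url').each((_, el) => {
        const loc = $(el).find('loc').text();
        const lastmod = $(el).find('lastmod').text();
        if (loc && lastmod && new Date(lastmod) > lastRun) modified.push(loc);
    });
    log.info(`Enqueuing ${modified.length} modified URLs`);
    await crawler.addRequests(modified);
});

router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Crawling updated page ${request.url}`);
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // Make sure XML content types are accepted for the sitemap response.
    additionalMimeTypes: ['application/xml', 'text/xml'],
});

await crawler.run([{
    url: 'https://site.com/sitemap.xml',
    label: 'SITEMAP',
    // Request headers are passed along by the HTTP crawlers; browser crawlers
    // need a pre-navigation hook instead (see the replies below).
    headers: { 'Cache-Control': 'no-cache', Pragma: 'no-cache' },
}]);
```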
8 Replies
flat-fuchsia · 3y ago
Try using a residential proxy, maybe.
foreign-sapphire (OP) · 3y ago
Isn't there a way to set headers for the crawlee request?
optimistic-gold · 3y ago
You can do that in pre-navigation hooks, or add skipNavigation: true to the request object when enqueuing and manually send the request in the route handler via sendRequest from the context object provided in the handler arguments. Not sure what the production env is in your case, is that the Apify platform? The easiest way to verify whether it is a cache issue is by adding a query string to the request - this way the cache will be invalidated in most cases.
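A rough sketch of the skipNavigation + sendRequest suggestion, assuming a PlaywrightCrawler (the regex-based sitemap parsing and the header values are illustrative):

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('SITEMAP', async ({ sendRequest, crawler, log }) => {
    // The sitemap is never opened in the browser (skipNavigation: true below),
    // so we fetch it ourselves and control the headers directly.
    const { body } = await sendRequest({
        headers: { 'Cache-Control': 'no-cache', Pragma: 'no-cache' },
    });
    const urls = [...body.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
    log.info(`Found ${urls.length} URLs in the sitemap`);
    await crawler.addRequests(urls);
});

router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Processing ${request.url}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run([{
    url: 'https://site.com/sitemap.xml',
    label: 'SITEMAP',
    skipNavigation: true, // do not open the sitemap in the browser at all
}]);
```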
foreign-sapphire (OP) · 3y ago
Sorry, I didn't say that. But that's what I did to validate it's a cache issue. I added sitemap.xml?random=RANDOM and it worked. However, not all websites I crawl support adding random query strings. Some give me an error if I try to add a query string the site is not expecting.
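For sites that do tolerate extra query parameters, the cache-buster can be appended when enqueuing (the parameter name here is arbitrary):

```ts
// A changing query string also gives the request a fresh uniqueKey,
// so a persisted request queue will not treat it as already handled.
const sitemapUrl = new URL('https://site.com/sitemap.xml');
sitemapUrl.searchParams.set('nocache', Date.now().toString());
await crawler.run([{ url: sitemapUrl.toString(), label: 'SITEMAP' }]);
```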
optimistic-gold · 3y ago
Have you already tried setting cache control headers?
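For a browser-based crawler, one way to send cache-control headers is a pre-navigation hook. A sketch assuming PlaywrightCrawler (note that a request Cache-Control header mainly tells intermediate caches/CDNs to revalidate):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Applied to every request made by this page, including the navigation itself.
            await page.setExtraHTTPHeaders({
                'Cache-Control': 'no-cache',
                Pragma: 'no-cache',
            });
        },
    ],
    requestHandler: async ({ request, log }) => {
        log.info(`Loaded ${request.url}`);
    },
});
```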
optimistic-gold · 3y ago
Found interesting info in the docs (https://playwright.dev/docs/api/class-page#page-route): "Enabling routing disables http cache."
Not sure if that works. You may try this as well: page.route('**', route => route.continue());
optimistic-gold · 3y ago
But use your glob pattern instead of the wildcard if that approach works, of course.
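Combining the two replies above into a sketch (again assuming PlaywrightCrawler; the '**' glob is the placeholder wildcard from the quoted snippet):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Per the Playwright docs, enabling routing disables the HTTP cache
            // for matched requests. Narrow '**' to your own glob if this works.
            await page.route('**', (route) => route.continue());
        },
    ],
    requestHandler: async ({ request, log }) => {
        log.info(`Loaded ${request.url}`);
    },
});
```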
Alexey Udovydchenko
In the Apify cloud, every run gets non-cached results from the start: the Actor instance is created on each run and destroyed on finish, so there is no "cache". If you are getting cached output on your own server, ensure the Actor is executed by the Apify CLI as "apify run -p".
