notable-maroon (2y ago)

Running multiple PlaywrightCrawlers has them using each other's context/data, causing data leaks.

Hi folks. I have a PlaywrightCrawler that takes a base URL and then uses a requestQueue to find URLs on, and scan, an entire website. It lives in a function called parseSite that I call from a Redis queue managed by bullmq. The Redis job carries data about which project to save the page details under (like projectId), which I pass as arguments to the parseSite function. This works fine when I have concurrency set to 1, but when I allow multiple jobs to be picked up at the same time, PlaywrightCrawler starts to use the wrong projectId for some of the pages. Code-wise that shouldn't be possible, since projectId is an argument to parseSite and there is no way to access other projectIds in the context of that function, so it sounds like PlaywrightCrawler is mixing things up there. Is that a known issue, and what can I do to prevent it? (It's now leaking data to other teams.)
1 Reply
notable-maroon (OP, 2y ago)
Closing this out. It was the requestQueue and requestList: I had overlooked that they weren't unique to the current function call, so concurrent jobs ended up sharing them.
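
For anyone hitting the same thing, here is a minimal sketch of why a shared queue mixes up projectIds. It does not use Crawlee at all: `openQueue`, the `parseSite` body, and the `job-1`/`job-2` names are hypothetical stand-ins that only model a storage layer where opening the same name returns the same queue. In Crawlee the analogous fix would presumably be to open a uniquely named queue per job (for example with `RequestQueue.open(name)`) and pass it to the crawler via its `requestQueue` option, but that wiring is an assumption and not the original poster's code.

```typescript
type SavedPage = { url: string; projectId: string };

// Stand-in for named storage: the same name always yields the same queue.
// This is NOT Crawlee's API, it only models the sharing behaviour.
const queues = new Map<string, string[]>();

function openQueue(name: string = "default"): string[] {
  let queue = queues.get(name);
  if (!queue) {
    queue = [];
    queues.set(name, queue);
  }
  return queue;
}

// Hypothetical parseSite: enqueue the start URL, yield once (as a real crawl
// would while waiting on I/O), then save every queued URL under this job's
// projectId.
async function parseSite(
  baseUrl: string,
  projectId: string,
  queueName?: string,
): Promise<SavedPage[]> {
  const queue = openQueue(queueName);
  queue.push(baseUrl);
  await Promise.resolve(); // lets the other job run and enqueue its own URL

  const saved: SavedPage[] = [];
  let url: string | undefined;
  while ((url = queue.shift()) !== undefined) {
    saved.push({ url, projectId });
  }
  return saved;
}

async function demo() {
  // Both jobs fall back to the "default" queue, so they share it.
  const shared = await Promise.all([
    parseSite("https://a.example", "project-a"),
    parseSite("https://b.example", "project-b"),
  ]);

  // Each job opens a queue under its own (hypothetical) job name.
  const isolated = await Promise.all([
    parseSite("https://a.example", "project-a", "job-1"),
    parseSite("https://b.example", "project-b", "job-2"),
  ]);

  return { shared, isolated };
}
```

In the shared case the first job drains both URLs and saves `https://b.example` under `project-a`, while the second job finds an empty queue. With one queue per job, each URL stays with its own projectId. A real setup would likely also want to drop the per-job queue once the crawl finishes so storage does not pile up.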
