notable-maroon•2y ago
Running multiple PlaywrightCrawlers has them using each other's context/data, causing data leaks.
Hi folks.
I have a `PlaywrightCrawler` that takes a base URL and then uses a `requestQueue` to find URLs on, and scan, an entire website. It lives in a function called `parseSite` that I call from a Redis queue managed by BullMQ. The Redis job carries data about which project to save the page details under (like `projectId`), which I pass as arguments to the `parseSite` function.
This works fine when I have concurrency set to 1, but when I allow multiple jobs to be picked up at the same time, `PlaywrightCrawler` starts to use the wrong `projectId` for some of the pages. Code-wise that shouldn't be possible, since that's an argument to `parseSite` and there is no way to access other `projectId`s in the context of that function, so it sounds like the `PlaywrightCrawler` is mixing things up there. Is that a known issue, and what can I do to prevent it? (It's now leaking data to other teams.)
notable-maroonOP•2y ago
Closing this out: it was the `requestQueue` and `requestList`. They weren't unique to the current `parseSite` call, so concurrent jobs ended up sharing them. I had overlooked that.
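For anyone landing here with the same symptom, here is a minimal sketch of one way to keep the storages unique per job. The thread doesn't show the original code, so this is an assumption about what the fix could look like, not the actual change. `storageNameFor` is a hypothetical helper, and the idea of using the BullMQ job id as the distinguishing part is also an assumption. It only builds a name; characters other than lowercase letters, digits and hyphens are replaced to stay on the safe side with storage naming rules.

```typescript
// Hypothetical helper (not part of Crawlee or BullMQ): derive a storage name
// that is unique per job, so two concurrent parseSite() calls never end up
// opening the same request queue or request list.
function storageNameFor(
  kind: "queue" | "list",
  projectId: string,
  jobId: string,
): string {
  // Lowercase, collapse every run of other characters into a single hyphen,
  // then trim hyphens from both ends.
  const safe = (s: string): string =>
    s
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "");

  // Drop empty parts so the result never contains a doubled hyphen.
  return [kind, safe(projectId), safe(jobId)].filter(Boolean).join("-");
}

// Example: the same project scanned by two jobs gets two different names.
console.log(storageNameFor("queue", "Team A/42", "job:7")); // "queue-team-a-42-job-7"
console.log(storageNameFor("queue", "Team A/42", "job:8")); // "queue-team-a-42-job-8"
```

Inside `parseSite`, a name like this could be passed to `RequestQueue.open(name)` (and `RequestList.open(name, sources)` if a list is used), with the resulting instances handed to the crawler through its `requestQueue` / `requestList` options instead of relying on the default, shared ones. Named queues persist, so calling `drop()` on the queue once the job finishes is probably worth doing.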