unwilling-turquoise
unwilling-turquoise2y ago

Is there a way to have custom variables accessible inside the crawler function?

Is there a way to pass in some variables and have them accessible inside the crawler? The main use case is to keep some internal variables that we can check or modify and use for conditional logic. For example, say we have a duplicate_count variable (I know this functionality already exists, it's just an example): I'd increment it whenever the data is already in my DB and stop the crawler once the count exceeds some threshold. Is there a way I could implement this?
9 Replies
sensitive-blue
sensitive-blue2y ago
This was essentially answered in my thread https://discord.com/channels/801163717915574323/1178416767991824415. You can pass data in "userData". Example of passing data:
crawler.run([
    { url: 'someUrl', userData: { thing: 'value' } },
]);
And you can access/modify it from within a crawl with the "request" property:
request.userData
unwilling-turquoise
unwilling-turquoiseOP2y ago
Thanks, this works. The only issue is that you have to pass it explicitly in enqueueLinks to ensure it propagates to further requests.
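A minimal sketch of that propagation, assuming a Crawlee-style handler context (`request`, `enqueueLinks`) and the `userData` option of `enqueueLinks`. The handler name `defaultHandler`, the `detail` label, and the `depth` field are illustrative and not from the thread.

```javascript
// Sketch: forward request.userData to every request enqueued from this page.
// `defaultHandler`, the 'detail' label and the `depth` field are illustrative names.
async function defaultHandler({ request, enqueueLinks }) {
    // Copy instead of mutating, so each child request gets its own object.
    const userData = {
        ...request.userData,
        depth: (request.userData.depth ?? 0) + 1,
    };

    await enqueueLinks({
        label: 'detail',
        userData, // without this, the custom fields are not carried over
    });
}
```

A handler like this could be registered with `router.addDefaultHandler(defaultHandler)`. The values then travel with each request, which suits sequential flows where state follows the link chain.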
sensitive-blue
sensitive-blue2y ago
You could use "preNavigationHooks" to set it automatically as well, although at that point you might as well just import the variables where you need them. Unless I'm misunderstanding what you want.
unwilling-turquoise
unwilling-turquoiseOP2y ago
I don't think preNavigationHooks will work. Importing them is a good option, but it's not great DX-wise: I prefer explicit variables defined right in the file instead of having to import them unless necessary, which makes for less cognitive load. But thanks for the info! userData works perfectly.
Lukas Krivka
Lukas Krivka2y ago
There are generally 2 ways to manage state:
1. For sequential flow, it is request.userData
2. For non-sequential flow, you can have a global state object with useState: https://crawlee.dev/api/core/function/useState
unwilling-turquoise
unwilling-turquoiseOP2y ago
Hey, thanks for the info. Do we define it outside of the router/crawler like this, and then use the state variable?
import { createPlaywrightRouter, useState } from 'crawlee';

export const router = createPlaywrightRouter();

const state = await useState('test', { val: 12 });

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log, pushData }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await pushData({
        url: request.loadedUrl,
        title,
    });
});
I also saw another snippet in GitHub issues using it like crawler.useState. Can you clarify a bit on this? And what's the difference between passing a name in the useState function vs. passing it in the config parameter? In the docs, both options use it to define a custom key-value store.
Lukas Krivka
Lukas Krivka2y ago
Both imports are equivalent. The name in useState applies just to that function call, while the config would be global.
unwilling-turquoise
unwilling-turquoiseOP2y ago
Gotcha, thanks. So call it outside of the crawler like const state = await useState() and then use it inside the crawler like a simple object? E.g. state.property = val
Lukas Krivka
Lukas Krivka2y ago
Yep. The reason to use this instead of just a naked object is that it is persisted to the KV Store in case you need to restart.
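A rough sketch of the original duplicate-count use case, assuming a Crawlee-style handler context that exposes `crawler.useState` (mentioned earlier in the thread). The threshold, the `isInDb` lookup, and the use of `crawler.autoscaledPool.abort()` to stop the run are assumptions made for illustration, not something confirmed in the thread.

```javascript
// Hypothetical threshold and a stand-in for a real database lookup.
const DUPLICATE_THRESHOLD = 3;
const knownUrls = new Set(['https://example.com/a']);
const isInDb = async (url) => knownUrls.has(url);

// Handler sketch; it only relies on `request`, `crawler` and `log` from the context.
async function detailHandler({ request, crawler, log }) {
    // The default value is used only when no state has been stored yet.
    const state = await crawler.useState({ duplicateCount: 0 });

    if (await isInDb(request.url)) {
        state.duplicateCount += 1; // mutate it like a plain object
        log.info(`Duplicate #${state.duplicateCount}: ${request.url}`);
    }

    if (state.duplicateCount >= DUPLICATE_THRESHOLD) {
        log.info('Duplicate threshold reached, stopping the crawler.');
        // Assumption: aborting the autoscaled pool is one way to end the run early.
        await crawler.autoscaledPool?.abort();
    }
}
```

A handler like this could be registered with `router.addHandler('detail', detailHandler)`. Because the state object is shared rather than attached to a request, nothing needs to be passed through enqueueLinks.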
