Apify Discord Mirror

Updated 4 months ago

How can I pass data extracted in the first part of the scraper to items that will be extracted later?

At a glance
The community member is extracting prices of products and wants to handle a situation where they sometimes lose the fee information due to getting blocked on some products. They are using the "Router" class from the Crawlee library and want to pass the extracted product information, including the fees, to the next request. Another community member suggests using context.add_requests() instead of context.enqueue_links() to pass the product information as user data. The community members discuss how to handle the data upload depending on whether the request was successful or not, with one community member providing a suggested approach using the failed_request_handler method. The community members confirm that the suggested approach works for them.
Hi. I'm extracting prices of products. In the process, I have the main page where I can extract all the information I need except for the fees. If I go through every product individually, I can get the price and fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation: if I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass this data as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I enqueue a URL extracted from the first page as shown below, I cannot pass along the data extracted earlier:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')

I want something like this:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item)  # data: dict

But I cannot do the above. How can I do it?

So my final data would look like this.

If I handle the data correctly, I want something like this:
product_item = {'product_id': 1234, 'price': '$50', 'fees': '$3'}

If I get blocked, I have something like this:
product_item = {'product_id': 1234, 'price': '$50', 'fees': ''}
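
For context, the two-stage router setup being described looks roughly like this: a minimal sketch based on the Crawlee refactoring guide linked above, with illustrative handler names and import paths per recent Crawlee versions; any crawler type works the same way.

Plain Text
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()

@router.default_handler
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    # First stage: extract everything available on the listing page
    # (product_id, price) and enqueue the individual product pages.
    ...

@router.handler('PRODUCT_WITH_FEES')
async def product_handler(context: PlaywrightCrawlingContext) -> None:
    # Second stage: extract the fees on the product page and push
    # the completed item to the dataset.
    ...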
4 comments
Hi @frankman

You can use this approach:

Plain Text
from crawlee import Request

await context.add_requests([
    Request.from_url(
        url='product_url',
        label='PRODUCT_WITH_FEES',
        user_data={'product_item': product_item},
    )
])


enqueue_links also supports a user_data argument, but it seems to me that add_requests is better for your case.
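
To make the round trip concrete, here is a minimal sketch (assuming the router setup from the question; the URL and field values are illustrative): the listing handler attaches the partially built item to the new request via user_data, and the product handler reads it back from context.request.user_data.

Plain Text
from crawlee import Request

@router.default_handler
async def listing_handler(context) -> None:
    # Item built from the listing page; fees are not available yet.
    product_item = {'product_id': 1234, 'price': '$50'}
    await context.add_requests([
        Request.from_url(
            url='https://example.com/product/1234',  # illustrative URL
            label='PRODUCT_WITH_FEES',
            user_data={'product_item': product_item},
        )
    ])

@router.handler('PRODUCT_WITH_FEES')
async def product_handler(context) -> None:
    # Recover the item that was attached to this request.
    product_item = dict(context.request.user_data['product_item'])
    product_item['fees'] = '$3'  # extracted from the product page
    await context.push_data(product_item)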
Thank you Mantisus, that works for me. Now I know how I can pass data between requests. And how can I handle the data upload depending on whether the request failed or succeeded?


If I handle the data correctly, I want something like this:
product_item = {'product_id': 1234, 'price': '$50', 'fees': '$3'}

If I get blocked, I have something like this:
product_item = {'product_id': 1234, 'price': '$50', 'fees': ''}

In my final handler, for the label PRODUCT_WITH_FEES, I'm using Apify.push(product_item) (the same as crawlee.push()).

Do I have to do it the following way?

Plain Text
try:
    ...
    await context.add_requests([
        Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={'product_item': product_item},
        )
    ])
except Exception as e:
    Apify.push(product_item)  # product_item without fees

??
I can't be certain, as I don't know exactly what behavior you're observing, but it's more likely to be something like this:

Plain Text
@crawler.failed_request_handler
async def blocked_item_handle(context, error) -> None:
    if context.request.label == 'PRODUCT_WITH_FEES':
        # Push the partially extracted item stored on the failed request.
        await context.push_data(context.request.user_data['product_item'])

https://crawlee.dev/python/api/class/BasicCrawler#failed_request_handler

Or use try ... except in the route handler for PRODUCT_WITH_FEES, as sketched below.
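
A sketch of that try ... except variant inside the PRODUCT_WITH_FEES handler (extract_fees is a hypothetical helper standing in for your fee-extraction logic):

Plain Text
@router.handler('PRODUCT_WITH_FEES')
async def product_handler(context) -> None:
    product_item = dict(context.request.user_data['product_item'])
    try:
        # Hypothetical extraction step that raises when the page is blocked.
        product_item['fees'] = await extract_fees(context)
    except Exception:
        product_item['fees'] = ''  # blocked: keep the item, with empty fees
    await context.push_data(product_item)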
Thank you, that works fine!