Strange behaviour when using rq_client()
I have been struggling with tests for a while, and finally reduced it to a simple test whose behaviour I don't understand. Is this expected (and I am missing something), or is it a bug?
This test fails
from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import MemoryStorageClient


async def test_failing():
    storage_client = MemoryStorageClient()
    request_queue_client = await storage_client.create_rq_client()
    req = Request.from_url("https://crawlee.dev")
    await request_queue_client.add_batch_of_requests([req])

    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(desired_concurrency=1, max_concurrency=1),
        max_crawl_depth=2,
        storage_client=storage_client,
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        pass

    stats = await crawler.run()
    assert stats.requests_finished > 0
but this one passes
async def test_success():
    storage_client = MemoryStorageClient()
    req = Request.from_url("https://crawlee.dev")

    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(desired_concurrency=1, max_concurrency=1),
        max_crawl_depth=2,
        storage_client=storage_client,
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        pass

    await crawler.add_requests([req])
    stats = await crawler.run()
    assert stats.requests_finished > 0
The only difference is that in the first test I add the request through a request queue client, while in the second I add it through the crawler. If I instead add it like this
rq = await RequestQueue.open()
await rq.add_request(req)
it also fails. Thanks in advance
5 Replies
Such use is not intended. To create a queue and interact with it, use RequestQueue. The following code should work without any errors.
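A minimal sketch along those lines, assuming the queue is opened via RequestQueue.open() and handed to the crawler through its request_manager parameter (the test name is illustrative):

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import RequestQueue


async def test_with_request_queue():
    storage_client = MemoryStorageClient()

    # Open the queue through the high-level storage API instead of the raw client.
    request_queue = await RequestQueue.open(storage_client=storage_client)
    await request_queue.add_request(Request.from_url("https://crawlee.dev"))

    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(desired_concurrency=1, max_concurrency=1),
        max_crawl_depth=2,
        request_manager=request_queue,  # hand the same queue to the crawler
        storage_client=storage_client,
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        pass

    stats = await crawler.run()
    assert stats.requests_finished > 0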
Oh okay, thank you very much! The difference, and why the first version does not work, is not intuitive to me, but I changed to what you suggest and it works 🙂
@Mantisus is it possible that this deletes the request queue?
request_queue = await RequestQueue.open(storage_client=sql_client)
dataset = await Dataset.open(storage_client=sql_client)
My db keeps getting purged since I changed to this.
@Eric You should use a named queue so that it is not purged.
Default storages and storages created with an alias are purged during startup if Configuration.purge_on_start=True (the default behavior).
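A minimal sketch of that, assuming the same sql_client from your snippet (the storage names are illustrative):

# Named storages are persistent and are not removed by purge_on_start.
request_queue = await RequestQueue.open(name="my-queue", storage_client=sql_client)
dataset = await Dataset.open(name="my-dataset", storage_client=sql_client)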
https://crawlee.dev/python/docs/guides/storages#named-and-unnamed-storages
Oh okay! Thanks! I had not read this page, sorry.