Python crawlers running in parallel | Apify & Crawlee | Page 1

hardy fog Aug 28, 2023, 9:42 PM

Hi, I have a custom Python + requests Actor that works great. It's pretty simple: it works through a list of starting URLs and pulls out one piece of information per URL.

My question is: if (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can process 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.

I saw that if I were using Crawlee (and therefore JS), I could use autoscaling (https://docs.apify.com/platform/actors/running/usage-and-resources). But is there a way to build a single Python-based Actor that uses more threads/CPU cores if needed?

trail crescent Aug 29, 2023, 11:35 AM

Hi,

We don't have similar functionality in the Python SDK yet (it's planned for the upcoming months).

But for now, you could write a simple utility using asyncio.Queue to get what you need: https://docs.python.org/3/library/asyncio-queue.html#examples
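
For illustration, here is a minimal sketch of that kind of utility, not an official Apify pattern. It assumes Python 3.9+ (for `asyncio.to_thread`) and that the existing per-URL logic is a blocking function. `fetch_one`, `worker`, and `run_parallel` are hypothetical names, and `fetch_one` is only a placeholder for the real requests-based code.

```python
import asyncio
from typing import Any, Callable, Dict, Iterable


def fetch_one(url: str) -> str:
    """Hypothetical stand-in for the existing blocking, requests-based logic."""
    # In a real Actor this might be something like:
    #   return requests.get(url, timeout=30).text
    return f"result for {url}"


async def worker(queue: asyncio.Queue, results: Dict[str, Any],
                 fetch: Callable[[str], Any]) -> None:
    """Pull URLs off the queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            # Run the blocking call in a thread so other workers keep going.
            results[url] = await asyncio.to_thread(fetch, url)
        except Exception as exc:
            # Store the error instead of crashing the worker.
            results[url] = exc
        finally:
            queue.task_done()


async def run_parallel(urls: Iterable[str],
                       fetch: Callable[[str], Any] = fetch_one,
                       concurrency: int = 4) -> Dict[str, Any]:
    """Process all URLs with `concurrency` workers sharing one queue."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: Dict[str, Any] = {}  # keyed by URL, so duplicate URLs collapse
    workers = [asyncio.create_task(worker(queue, results, fetch))
               for _ in range(concurrency)]

    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()   # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    start_urls = [f"https://example.com/page/{i}" for i in range(8)]
    print(asyncio.run(run_parallel(start_urls, concurrency=4)))
```

Threads tend to help here because the work is mostly waiting on the network. Whether 4 workers actually gets close to 4x throughput depends on the target site and the memory assigned to the run.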

hardy fog Aug 29, 2023, 2:12 PM

Thank you! I will try out the queue system.
