#Python crawlers running in parallel

1 messages · Page 1 of 1 (latest)

hardy fog
#

Hi, I have a custom Python + requests Actor that works great. It's pretty simple, it works against a list of starting URLs and pulls out a piece of information per URL.

My question is: If (for example) one run of 1,000 input URLs takes an hour to complete, i would like to parallel-ize it 4 ways so that I can run 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.

I saw that if I was using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources . But is there a way to build a single Python based Actor that uses more threads/CPU cores if needed?

Learn about your Actors' memory and processing power requirements, their relationship with Docker resources, minimum requirements for different use cases and its impact on the cost.

vital daggerBOT
#

@hardy fog just advanced to level 1! Thanks for your contributions! 🎉

trail crescent
#

Hi,

We don't have a similar functionality in the Python SDK yet (it's planned in the upcoming months).

But for now, you could just write some simple utility, using asyncio.Queue, to get what you need: https://docs.python.org/3/library/asyncio-queue.html#examples

hardy fog
#

Thank you ! I will try out the queue system.