#Multiprocessing destroys saved data.

76 messages · Page 1 of 1 (latest)

fickle sorrel
#

Hello, I would need some advice on how I could use multiprocessing in my use case.

Basically, my program grabs a list of keywords from an Excel file. Then I want to start 50 tasks in parallel (multiprocessing) to process each of these keywords.
Then each process saves its result to a CSV file independently, so once a process finishes, it saves its result.

This caused issues when I was using multiprocessing, because sometimes the processes would try to write to the file simultaneously, which corrupts the data.
Notice in the picture how, after 800 successful insertions, it just broke for no apparent reason.
Thank you for any help

regal grail
#

How much are you reading/processing vs writing? Might make sense to write results to a queue and then dump that queue into a file (alternatively you could have another thread popping elements off the queue and writing them during processing)
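A minimal sketch of the single-writer idea above, assuming the per-keyword work can be wrapped in one function. `process_keyword`, `append_results`, and `run` are hypothetical names, and the body of `process_keyword` is only a placeholder for the real work. Here `Pool.imap_unordered` plays the role of the queue: workers hand results back to the parent process, and only the parent touches the CSV file.

```python
import csv
import os
from multiprocessing import Pool


def process_keyword(keyword):
    # Placeholder for the real (slow) work on one keyword.
    return [keyword, len(keyword)]


def append_results(results, path):
    """Write each row as soon as it arrives; only this function touches the file."""
    count = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in results:
            writer.writerow(row)
            f.flush()             # push the row out of Python's buffer
            os.fsync(f.fileno())  # ask the OS to commit it to disk
            count += 1
    return count


def run(keywords, path, processes=4):
    # imap_unordered yields each result as soon as any worker finishes it,
    # so a crash loses at most the rows that were still in flight.
    with Pool(processes) as pool:
        return append_results(pool.imap_unordered(process_keyword, keywords), path)
```

On Windows, `run(...)` should be called from under `if __name__ == "__main__":`, because worker processes re-import the main module when they start.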

sturdy pond
regal grail
#

The point of multithreading that program is to not have different files

#

The goal was to speed up an existing computation while keeping the entire result (and not looking at chunks) kek

#

Saving to intermediate files and then combining them works but adds lots of I/O, imo keeping the results in memory is fine (or streaming them to disk via a queue + writer thread as they come in) depending on the result size

fickle sorrel
fickle sorrel
#

I would lose progress and money, because my operations are not free

fickle sorrel
#

I am thinking of moving on from multiprocessing to something like asyncio with a Semaphore

regal grail
sturdy pond
sturdy pond
regal grail
sturdy pond
#

am i reading that right

regal grail
sturdy pond
regal grail
#

IO is expensive -> Don't do more IO than you need

sturdy pond
#

in this case he probably wants to do a lot of IO

regal grail
#

Doing more IO for the same result doesn't sound like the smartest thing to do

sturdy pond
#

"I don't want to lose my progress on a 3-5 hour task so i don't want to keep it in memory"
Having a single thread write the output file is not gonna decrease the amount of data you're sending, but is going to make it easier to make a mistake.

so you COULD have a single thread write to a file for no real gain at all, or you could just make it simple and just write to multiple files?
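A minimal sketch of the multiple-files option, assuming each worker process appends to its own part file and the parts are combined once all workers are done. `part_path`, `append_row`, and `merge_parts` are hypothetical names, and the `part-<id>.csv` naming scheme is an assumption.

```python
import csv
import glob
import os


def part_path(out_dir, worker_id=None):
    # One file per worker process; the PID keeps the names from colliding.
    if worker_id is None:
        worker_id = os.getpid()
    return os.path.join(out_dir, f"part-{worker_id}.csv")


def append_row(path, row):
    # Open, append one row, close: nothing stays buffered if the worker dies.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)


def merge_parts(out_dir, merged_path):
    """Concatenate every part file into one CSV after all workers are done."""
    total = 0
    with open(merged_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for part in sorted(glob.glob(os.path.join(out_dir, "part-*.csv"))):
            with open(part, newline="", encoding="utf-8") as f:
                for row in csv.reader(f):
                    writer.writerow(row)
                    total += 1
    return total
```

Inside a worker, `append_row(part_path(out_dir), row)` is enough, since no two processes ever share a file. The merged file name must not match the `part-*.csv` pattern if it lives in the same directory.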

#

idk one seems like the more logical choice and the other is premature optimization

#

just my 2 cents

regal grail
#

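It decreases the amount of data written by exactly half: each result is written once, instead of once to an intermediate file and again when the files are combined. No data is lost either since you're continuously writing to disk (also disk failure is more likely but whatever)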

sturdy pond
#

if you say so ¯_(ツ)_/¯

fickle sorrel
fickle sorrel
#

I used multiprocessing because initially the program was very slow (on one process). It would have taken a full week or more. Now, with 50 processes, it runs in 3-5 hours.

#

Thing is multiprocessing takes a lot of RAM and CPU.

#

I have to do IO because the task is long and I don't want to risk losing progress for any reason. I pay money to process this data, so it would be unfortunate to lose everything because of an error or something I misconfigured. My program is still not reliable enough for me to blindly trust that it will finish the task with no issues.
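Since losing paid progress is the concern here, one option is to make the run resumable: before starting, read back what is already on disk and skip those keywords. A minimal sketch, assuming the keyword is stored in the first column of the output CSV (the thread does not confirm the file layout); `load_done` and `pending` are hypothetical names.

```python
import csv
import os


def load_done(path):
    """Return the keywords already present in the first column of the CSV."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0] for row in csv.reader(f) if row}


def pending(keywords, path):
    # Keep the original order, drop anything a previous run already saved.
    done = load_done(path)
    return [k for k in keywords if k not in done]
```

After a crash, restarting with `pending(all_keywords, "results.csv")` would only pay for the keywords that were never saved.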

#

For optimizing it, I thought about using asyncio.Semaphore

#

since it's mostly API calls. The only data manipulation I do is playing with strings and patterns.

#

So I could do async for all the tasks in one process.
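A minimal sketch of that idea, assuming the API calls can be made awaitable (for example through an async HTTP client). `call_api`, `limited`, and `run` are hypothetical names, and `call_api` only sleeps as a stand-in for a real request. The semaphore caps how many calls are in flight, and a single coroutine writes the rows, so writes cannot interleave.

```python
import asyncio
import csv


async def call_api(keyword):
    # Placeholder for one awaitable API request.
    await asyncio.sleep(0.01)
    return [keyword, keyword.upper()]


async def limited(sem, keyword):
    async with sem:  # at most `limit` calls are in flight at once
        return await call_api(keyword)


async def run(keywords, path, limit=50):
    sem = asyncio.Semaphore(limit)
    tasks = [asyncio.create_task(limited(sem, k)) for k in keywords]
    saved = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Only this coroutine writes, so rows cannot interleave.
        for finished in asyncio.as_completed(tasks):
            writer.writerow(await finished)
            f.flush()
            saved += 1
    return saved
```

It would be started with `asyncio.run(run(keywords, "results.csv", limit=50))`. Everything runs in one process, so memory use should stay far below that of 50 separate interpreters, though this only helps if the time really is spent waiting on the API.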

#

well idk, I am kind of a newbie at concurrent programming.

fickle sorrel
#

idk if it's the best solution or not. But my code definitely needs optimization.

#

as I plan to get better rate limits soon.

#

right now 50 processes take all my 16 GB of RAM and 100% of my CPU (I am on Windows 10)

#

I am looking to get up to 100 concurrent tasks soon and I will need some help designing my system correctly.

#

Thank you for any help

regal grail
#

Can you post the data source here? Tempted to optimize it kek

fickle sorrel
regal grail
#

Yeah basically your original file + the processing steps needed

fickle sorrel
fickle sorrel
primal mothBOT
#

@fickle sorrel

File Attachments Not Allowed

For safety reasons we do not allow files with certain file extensions.

mhmm0879 Said

An example of a source file to process

Code Formatting

You can share your code using triple backticks like this:
```
YOUR CODE
```

Large Portions of Code

For longer scripts use Hastebin or GitHub Gists and share the link here

Ignored these files due to them having disallowed file extensions
  • Golden-Retriever_all-keywords_us_2024-03-30.xlsx
fickle sorrel
#

can't upload an Excel file?

#

I shared the excel

primal mothBOT
#

@fickle sorrel

Ignored these files due to them having disallowed file extensions
  • message.txt
fickle sorrel
#

this is the "brain" script that manages the content and then sends it to workers who will process it (workers being processes)

regal grail
#

thanks, will take a look

fickle sorrel
fickle sorrel
#

@regal grail I don't want to bother you, but any update?

regal grail
#

Got quite busy with work, back now

#

Not sure what that NAG3 library is, can't seem to find anything on it

fickle sorrel
fickle sorrel
regal grail
#

Which you haven't shared yet? kek

#

Also the Excel file got deleted

fickle sorrel
fickle sorrel
regal grail
#

No point then lol

#

If I can't even run your script there ain't no way that I'll be able to do actual performance work

#

I can do some guesstimations but the #1 rule for performance is measure before you optimize
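A minimal sketch of what measuring could look like before any redesign, assuming the script can be split into named stages such as the API call, the string processing, and the CSV write. `timed`, `timings`, and `report` are hypothetical helper names, not part of the original script.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)


@contextmanager
def timed(stage):
    # Accumulate wall-clock seconds per named stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def report():
    # Slowest stage first.
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
```

Wrapping each stage in `with timed("api"):`, `with timed("parse"):`, and so on for a few dozen keywords, then printing `report()`, shows where the time goes. If nearly all of it is spent waiting on the API, extra processes mostly add RAM and CPU overhead rather than speed.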

fickle sorrel
fickle sorrel
#

can I just share it to you privately?

regal grail
#

That's fine as well lol

fickle sorrel
#

now, the code is ugly and not cleaned up

regal grail
#

Doesn't matter much to me kek

fickle sorrel