#Load and process a file with multiple threads using C++ ifstream, but some lines go missing


sonic tide
#

I have one file that contains many lines; every line is the "info" for one video, and I need to process it. But the file is too large, so loading and processing it in a single thread is a bad idea, and I'm trying to use maybe 6 threads instead. Since my operations are read-only (get_line and process_line one by one), my first idea was to have each thread load a different chunk of the file, starting at a different offset. Code:[image]. The code compiles and runs. However, my map size is not right. My file should contain 13,000,000 lines, but the map size is about 12,990,000. I am very sure there is no duplicate data. So why? I eventually adopted one reader + two workers (which process the lines) instead, but I still want to know why some lines are missed. Where is the mistake?

sonic tide
#

I have read many posts on Stack Overflow, but none of them answered this for me 😂

misty skiff
#
  1. wtf is that sysconf? use std::thread::hardware_concurrency();
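
A minimal sketch of what that could look like. `pick_thread_count` and its `fallback` parameter are made-up names for illustration; the only library call is `std::thread::hardware_concurrency()`, which is allowed to return 0 when the value cannot be determined, hence the fallback:

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: choose how many worker threads to start.
// hardware_concurrency() is only a hint and may return 0.
unsigned pick_thread_count(unsigned fallback = 1) {
    assert(fallback > 0);
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? fallback : n;
}
```

Usage would be something like `const unsigned max_threads_number = pick_thread_count(6);`.
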
feral surgeBOT
# digital wind !sc

@sonic tide

Monke
Please Do Not Send Screenshots or Photos!

They're hard to read and prevent copying and pasting.

round furnace
#
  1. what's video2tagids_map_
  2. since you're chunking the file by raw bytes, it's possible that whatever you're looking for gets split between different threads. Have you considered that?
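
To make point 2 concrete, here is a rough sketch of one way to snap raw byte offsets to line starts so that every line belongs to exactly one chunk. It is not the code from the screenshot. The names `next_line_start`, `line_aligned_chunks` and `count_lines` are hypothetical, and it assumes `'\n'` line endings and a file that does not change while it is being read:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Offset of the first line start at or after `pos`.
std::size_t next_line_start(std::ifstream& in, std::size_t pos, std::size_t file_size) {
    if (pos == 0) return 0;
    if (pos >= file_size) return file_size;
    in.clear();
    // Start one byte early: if that byte is '\n', `pos` already is a line start.
    in.seekg(static_cast<std::streamoff>(pos - 1));
    std::string skipped;
    std::getline(in, skipped);  // consumes up to and including the next '\n'
    if (in.eof()) return file_size;  // the last line had no trailing '\n'
    return static_cast<std::size_t>(static_cast<std::streamoff>(in.tellg()));
}

// Split the file into `num_chunks` half-open byte ranges [begin, end)
// whose boundaries all fall on line starts.
std::vector<std::pair<std::size_t, std::size_t>>
line_aligned_chunks(const std::string& path, std::size_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return chunks;
    const std::size_t file_size =
        static_cast<std::size_t>(static_cast<std::streamoff>(in.tellg()));
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= num_chunks; ++i) {
        // The last chunk always ends at file_size, so no remainder is dropped.
        const std::size_t raw_end =
            (i == num_chunks) ? file_size : file_size / num_chunks * i;
        const std::size_t end = next_line_start(in, raw_end, file_size);
        chunks.emplace_back(begin, end);
        begin = end;
    }
    return chunks;
}

// What one worker would do: visit every line that starts inside [begin, end).
std::size_t count_lines(const std::string& path, std::size_t begin, std::size_t end) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(begin));
    std::size_t pos = begin, lines = 0;
    std::string line;
    while (pos < end && std::getline(in, line)) {
        ++lines;
        pos += line.size() + 1;  // +1 for the '\n' that getline consumed
    }
    return lines;
}
```

Each worker opens its own `std::ifstream`, so nothing is shared while reading. Summing `count_lines` over all chunks should give the line count of the whole file; if it doesn't, the boundaries are where to look.
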
sonic tide
digital wind
#

@feral surge is our bot

sonic tide
#

@round furnace It stores every line's info: one line -> one record in that map. On question 2: suppose I have 6 threads and divide the file into six blocks. When each thread then calls getline within its block, I would guess only the line at each junction/split can have a problem (right?). I fixed this, so it shouldn't be a problem, and even if it were, it should only affect those 6 lines, right?

#

@digital wind 😅oops

digital wind
#

Like, this multithreaded code, but set max_threads_number = 1

round furnace
#

otherwise some threads may be slower and not yet finished when you retrieve the size
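
Roughly what I mean, as a sketch: join every worker before reading any result. `process_all` and the atomic counter are stand-ins I made up for the real per-line work:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical driver: start the workers, wait for ALL of them, then read the result.
std::size_t process_all(std::size_t num_threads, std::size_t items_per_thread) {
    assert(num_threads > 0);
    std::atomic<std::size_t> processed{0};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&processed, items_per_thread] {
            for (std::size_t i = 0; i < items_per_thread; ++i) {
                processed.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // Reading `processed` before this loop finishes could observe a partial count.
    for (auto& w : workers) w.join();
    return processed.load();
}
```
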

digital wind
#

I would guess he's simply missing a few lines at the end due to a rounding error

round furnace
#

also you should be making sure that the map supports concurrent writes (just in case)
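
`std::map` and `std::unordered_map` are not safe for concurrent writes on their own, so something has to serialize the inserts. A minimal sketch with a mutex follows; the key/value types and the names `LockedVideoMap` and `fill_concurrently` are guesses for illustration, since the real type of `video2tagids_map_` isn't shown:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical wrapper: every access to the map goes through the mutex.
class LockedVideoMap {
public:
    void insert(const std::string& video_id, std::vector<int> tag_ids) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[video_id] = std::move(tag_ids);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return map_.size();
    }
private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, std::vector<int>> map_;
};

// Several threads insert distinct keys; returns the size after all have joined.
std::size_t fill_concurrently(std::size_t num_threads, std::size_t per_thread) {
    assert(num_threads > 0);
    LockedVideoMap map;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&map, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                map.insert("video_" + std::to_string(t) + "_" + std::to_string(i),
                           {static_cast<int>(i)});
            }
        });
    }
    for (auto& w : workers) w.join();
    return map.size();
}
```

An alternative that avoids lock contention is to give each thread its own local map and merge them after joining.
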

sonic tide
digital wind
#

*Rounding error in this line:

size_t chunkSize = fileSize / max_threads_number;
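
To spell out the arithmetic: integer division truncates, so with 100 bytes and 6 threads `chunkSize` is 16, and 6 * 16 covers only 96 bytes. The last 4 bytes are read by nobody unless the final chunk is extended to the end of the file. A sketch of ranges that do cover everything (`byte_ranges` is a hypothetical helper, and whether the original code already does this can't be told from the one line above):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open byte ranges [begin, end); the last one absorbs the remainder.
std::vector<std::pair<std::size_t, std::size_t>>
byte_ranges(std::size_t file_size, std::size_t max_threads_number) {
    assert(max_threads_number > 0);
    const std::size_t chunk_size = file_size / max_threads_number;  // truncates
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t i = 0; i < max_threads_number; ++i) {
        const std::size_t begin = i * chunk_size;
        const std::size_t end =
            (i + 1 == max_threads_number) ? file_size : begin + chunk_size;
        ranges.emplace_back(begin, end);
    }
    return ranges;
}
```
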
round furnace
digital wind
round furnace
#

i thought that was the case too :p

#

but it seems to be ok

sonic tide
digital wind
#

👍

feral surgeBOT
#

This question is being automatically marked as stale.
If your question has been answered, type !solved.
If your question is not answered feel free to bump the post or re-ask.
Take a look at !howto ask for tips on improving your question.

sonic tide
#

!solved