Increasing speed of Jacobian using threading/optimisation/cpu optimisation | Together C & C++ | Page 1


scarlet talon Oct 12, 2025, 12:25 AM

#

📎 message.txt


scarlet talon Oct 12, 2025, 12:28 AM

#

anyone able to help with the optimisation of this algorithm?

uncut lantern Oct 12, 2025, 12:54 AM

#

scarlet talon

What have you identified as the bottlenecks?

scarlet talon Oct 12, 2025, 1:02 AM

#

uncut lantern What have you identified as the bottlenecks?

it seems like maybe memory bandwidth

#

#

im 8 core 16 threads

#

but after 4 threads it starts getting slower again

uncut lantern Oct 12, 2025, 1:18 AM

#

scarlet talon it seems like maybe memory bandwidth

How much memory per second are you processing rn?

scarlet talon Oct 12, 2025, 1:27 AM

#

uncut lantern How much memory per second are you processing rn?

ummmm

#

well with 401, 100 iterations

scarlet talon Oct 12, 2025, 1:34 AM

#

uncut lantern How much memory per second are you processing rn?

just trying to cachegrind to see
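
A back-of-the-envelope estimate can answer the "memory per second" question before any profiler output is available. The sketch below assumes (the thread does not confirm this) that `-n 401` means a 401 x 401 x 401 grid of `double`s and that each Jacobi iteration reads one full array and writes another. `estimate_bandwidth_gbs` is a hypothetical helper name, not something from the attached code.

```c
/* Hypothetical helper: estimated main-memory traffic in GB/s.
 * Assumes an n*n*n grid of doubles with one array read and one
 * array written per iteration. This is a lower bound: it ignores
 * the source-term array and any neighbour reads that miss cache. */
double estimate_bandwidth_gbs(long n, long iterations, double seconds)
{
    double points = (double)n * (double)n * (double)n;
    double bytes_per_iteration = 2.0 * points * (double)sizeof(double);
    return bytes_per_iteration * (double)iterations / seconds / 1e9;
}
```

Calling `estimate_bandwidth_gbs(401, 100, t)` with your own measured wall-clock time `t` in seconds gives a figure to compare against what the machine's RAM can deliver. If the two are close, adding more threads would not be expected to help much.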

uncut lantern Oct 12, 2025, 1:40 AM

#

scarlet talon it seems like maybe memory bandwidth

Have you run it through `perf` yet? Something like:

```bash
sudo perf stat -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses -r 100 ./<my_program>
```
as a starting point?

#

The `-r 100` sets the number of repetitions, change it to whatever you want

scarlet talon Oct 12, 2025, 1:45 AM

#

#

it seems like its hyper stacking cores tho

uncut lantern Oct 12, 2025, 1:47 AM

#

uncut lantern Have you ran it through `perf` yet? Smth like: ```bash sudo perf stat -e cpu-clo...

Since your program takes input parameters I think you need to separate it from the perf options with `--`:

```bash
sudo perf stat -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses -r 5 -- ./<my_program> -arg1 val1 -arg2 val2
```
instead, but I'm not 100% sure

scarlet talon Oct 12, 2025, 1:52 AM

#

#

like it just doesnt look right

uncut lantern Oct 12, 2025, 1:55 AM

#

what do you want to point out in that screenshot?

scarlet talon Oct 12, 2025, 2:07 AM

#

uncut lantern what do you want to point out in that screenshot?

illo explain in a sec

#

sudo perf stat -r 5 -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses
-- taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4 >/dev/null

Performance counter stats for 'taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4' (5 runs):

     13,024.55 msec cpu-clock                        #    3.555 CPUs utilized               ( +-  0.99% )
           627      context-switches                 #   48.140 /sec                        ( +-  0.72% )

108,882,570,212 instructions # 1.62 insn per cycle
# 0.02 stalled cycles per insn ( +- 0.01% ) (71.45%)
67,320,409,800 cycles # 5.169 GHz ( +- 0.97% ) (71.48%)
1,765,185,276 stalled-cycles-frontend # 2.62% frontend cycles idle ( +- 0.41% ) (71.42%)
2,194,149,478 branches # 168.463 M/sec ( +- 0.07% ) (71.39%)
61,576,729 branch-misses # 2.81% of all branches ( +- 0.15% ) (71.40%)
132,625,365 cache-misses ( +- 7.35% ) (71.42%)
3,824,394,869 L1-dcache-misses ( +- 0.06% ) (71.45%)

        3.6636 +- 0.0280 seconds time elapsed  ( +-  0.77% )

#

thats what i seem to get

uncut lantern Oct 12, 2025, 2:10 AM

#

Can you put that in a code block so Discord doesn't think of some of those characters as formatting characters?

#

I.e., put your text like so:

```
YOUR TEXT
```

scarlet talon Oct 12, 2025, 2:10 AM

#

```
bu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses \
  -- taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4 >/dev/null

 Performance counter stats for 'taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4' (5 runs):

         13,024.55 msec cpu-clock                        #    3.555 CPUs utilized               ( +-  0.99% )
               627      context-switches                 #   48.140 /sec                        ( +-  0.72% )
   108,882,570,212      instructions                     #    1.62  insn per cycle            
                                                  #    0.02  stalled cycles per insn     ( +-  0.01% )  (71.45%)
    67,320,409,800      cycles                           #    5.169 GHz                         ( +-  0.97% )  (71.48%)
     1,765,185,276      stalled-cycles-frontend          #    2.62% frontend cycles idle        ( +-  0.41% )  (71.42%)
     2,194,149,478      branches                         #  168.463 M/sec                       ( +-  0.07% )  (71.39%)
        61,576,729      branch-misses                    #    2.81% of all branches             ( +-  0.15% )  (71.40%)
       132,625,365      cache-misses                                                            ( +-  7.35% )  (71.42%)
     3,824,394,869      L1-dcache-misses                                                        ( +-  0.06% )  (71.45%)

            3.6636 +- 0.0280 seconds time elapsed  ( +-  0.77% )
```

uncut lantern Oct 12, 2025, 2:11 AM

#

Whoa

#

That's a lot of cache and branch misses

scarlet talon Oct 12, 2025, 2:14 AM

#

yeahhh... idrk what to do tho to fix that

uncut lantern Oct 12, 2025, 2:17 AM

#

scarlet talon yeahhh... idrk what to do tho to fix that

Have you tried to profile it, e.g. with `perf record` & `perf report`?

E.g.

```bash
sudo perf record -g sh -c './my_program -arg1 val1' && sudo perf report --children
```

I'm not entirely sure about the `-g` and `--children` flags (i.e. I'm not 100% sure what they really do), but at least for me they made it more usable.

scarlet talon Oct 12, 2025, 2:21 AM

#

uncut lantern Have you tried to profile it, e.g. with `perf record` & `perf report`? E.g. ```...

um no i haven't, i haven't ever used perf before tbh until you just brought it up

uncut lantern Oct 12, 2025, 2:22 AM

#

scarlet talon um no i haven't, i haven't ever used perf before tbh until you just brought it up

It's a pretty nice command-line tool: https://perfwiki.github.io/main/tutorial/#sampling-with-perf-record

#

"Unfortunately", they changed the entire tutorial not too long ago, so I don't really know how good the tutorial is.
I say "unfortunately" in quotes, because it's probably hard to be worse than the old one.

#

Yeah, actually skimming through the tutorial that looks pretty darn good compared to the old one

scarlet talon Oct 12, 2025, 2:31 AM

#

```
obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e   cpu-clock,context-switches,cycles,instructions,branches,branch-misses,cache-references,cache-misses   -- ./poisson -n 401 -i 100 -t 4 >/dev/null

 Performance counter stats for './poisson -n 401 -i 100 -t 4' (5 runs):

         12,610.83 msec cpu-clock                        #    3.578 CPUs utilized               ( +-  0.69% )
               607      context-switches                 #   48.133 /sec                        ( +-  0.30% )
    65,493,053,605      cycles                           #    5.193 GHz                         ( +-  0.73% )  (83.36%)
   118,354,761,553      instructions                     #    1.81  insn per cycle              ( +-  0.01% )  (83.33%)
     6,953,116,132      branches                         #  551.361 M/sec                       ( +-  0.01% )  (83.36%)
        62,076,328      branch-misses                    #    0.89% of all branches             ( +-  0.10% )  (83.30%)
    12,979,328,039      cache-references                 #    1.029 G/sec                       ( +-  0.40% )  (83.32%)
       124,772,557      cache-misses                     #    0.96% of all cache refs           ( +-  3.65% )  (83.34%)

           3.52416 +- 0.00607 seconds time elapsed  ( +-  0.17% )
```

#

thats better right?

uncut lantern Oct 12, 2025, 3:20 AM

#

scarlet talon ```py bu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e cpu-clock,con...

I think if you also run it with `cache-references`, then you can see how much those `cache-misses` actually amount to in comparison to the total accesses

scarlet talon Oct 12, 2025, 3:27 AM

#

uncut lantern I think if you run it also with the `cache-references`, then you can also see ho...

yeahh like this last picture

#

its less than 1% of all cache refs

uncut lantern Oct 12, 2025, 3:49 AM

#

There appear to be periods where one or more of the threads are idle and waiting on another, otherwise the CPUs utilized number would be closer to 4.
The instructions per cycle look pretty solid though.

uncut lantern Oct 12, 2025, 3:50 AM

#

scarlet talon ```py obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e cpu-clock,...

How does this behave if you pump up the number of threads to 8, or even more?

#

But I compared it with 2 of my versions, and relative to the instructions, that appears to be quite a lot of cache misses

scarlet talon Oct 12, 2025, 3:52 AM

#

```
obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e   cpu-clock,context-switches,cycles,instructions,branches,branch-misses,cache-references,cache-misses   -- ./poisson -n 401 -i 100 -t 8 >/dev/null
[sudo] password for obu24:           

 Performance counter stats for './poisson -n 401 -i 100 -t 8' (5 runs):

         30,551.87 msec cpu-clock                        #    6.702 CPUs utilized               ( +-  0.84% )
             3,606      context-switches                 #  118.029 /sec                        ( +- 16.62% )
   158,632,417,242      cycles                           #    5.192 GHz                         ( +-  0.82% )  (83.33%)
   109,082,810,969      instructions                     #    0.69  insn per cycle              ( +-  0.02% )  (83.31%)
     2,230,604,441      branches                         #   73.010 M/sec                       ( +-  0.04% )  (83.34%)
        67,778,397      branch-misses                    #    3.04% of all branches             ( +-  0.35% )  (83.35%)
    13,309,302,452      cache-references                 #  435.630 M/sec                       ( +-  0.33% )  (83.32%)
       206,691,279      cache-misses                     #    1.55% of all cache refs           ( +-  3.99% )  (83.34%)

            4.5585 +- 0.0248 seconds time elapsed  ( +-  0.54% )
```

#

yeahhh only 6.7 when i run 8

uncut lantern Oct 12, 2025, 3:53 AM

#

The crazier part is the instructions per cycle
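
For reference, the two derived figures discussed here can be recomputed from the raw counters in the perf output. This is only a small sketch of that arithmetic; the function names are made up for illustration.

```c
#include <stdio.h>

/* Instructions retired per cycle, the figure perf prints as "insn per cycle". */
double insn_per_cycle(double instructions, double cycles)
{
    return instructions / cycles;
}

/* "CPUs utilized" is total CPU time (cpu-clock) divided by wall-clock time. */
double cpus_utilized(double cpu_clock_msec, double elapsed_sec)
{
    return cpu_clock_msec / (elapsed_sec * 1000.0);
}

/* Prints both derived figures for one perf stat run. */
void report(const char *label, double instructions, double cycles,
            double cpu_clock_msec, double elapsed_sec)
{
    printf("%s: %.2f insn/cycle, %.3f CPUs utilized\n", label,
           insn_per_cycle(instructions, cycles),
           cpus_utilized(cpu_clock_msec, elapsed_sec));
}
```

Feeding in the posted counters, e.g. `report("-t 4", 118354761553.0, 65493053605.0, 12610.83, 3.52416)` and `report("-t 8", 109082810969.0, 158632417242.0, 30551.87, 4.5585)`, shows that the instruction count barely changes between the two runs while the cycle count grows by roughly 2.4x. That would be consistent with the threads stalling rather than doing extra work, though the counters alone do not prove where the stalls come from.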

scarlet talon Oct 12, 2025, 3:53 AM

#

yeahhh i just saw that

uncut lantern Oct 12, 2025, 3:56 AM

#

I saw a "Shared" section (I tried to understand the code for a bit, but honestly... a bit too complex for 5 AM, especially since I've never worked with omp). Are there any shared variables that may cause some memory contention?
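
The attached code is not shown in this thread, so the following is only an illustration of the kind of contention being asked about, not a diagnosis. If every thread adds into one shared accumulator (for example a residual) inside the hot loop, an OpenMP `reduction` clause is one common way to avoid that. `squared_difference` is a hypothetical function name.

```c
#include <stddef.h>

/* Sum of squared differences between two iterates.
 * Each thread accumulates into its own private copy of `sum`, and
 * OpenMP combines the copies once at the end, so no thread writes
 * to a shared accumulator inside the loop. Without -fopenmp the
 * pragma is ignored and the loop simply runs serially. */
double squared_difference(const double *next, const double *curr, size_t count)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < (long)count; ++i) {
        double d = next[i] - curr[i];
        sum += d * d;
    }
    return sum;
}
```

Compared with protecting a shared variable with `critical` or `atomic`, this keeps the per-iteration work free of synchronisation.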

uncut lantern Oct 12, 2025, 4:05 AM

#

scarlet talon yeahhh only 6.7 when i run 8

But that also underlines the "apparently some threads are waiting on others" part. My first guess as to why would be that the work isn't distributed evenly between the threads, so in the end there are multiple idle threads just watching one other thread finish its work until they can all finally exit, but that last thread takes an eternity to finish while none of the other threads can help.
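
To make the work-distribution point concrete, here is a minimal sketch of one Jacobi sweep in which the iterations are split evenly across threads. It assumes a flattened n x n x n grid of `double`s and a standard 7-point stencil; the names `idx`, `jacobi_sweep`, `source` and `delta` are hypothetical and may not match the attached code.

```c
#include <stddef.h>

/* Index into a flattened n*n*n grid, with i varying fastest. */
static size_t idx(size_t n, size_t i, size_t j, size_t k)
{
    return (k * n + j) * n + i;
}

/* One Jacobi sweep over the interior points: reads `curr`, writes `next`.
 * Boundary values in `next` are left untouched.
 * collapse(2) merges the k and j loops into one iteration space, and
 * schedule(static) hands each thread an equal-sized contiguous share. */
void jacobi_sweep(double *next, const double *curr, const double *source,
                  size_t n, double delta)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (long k = 1; k < (long)n - 1; ++k) {
        for (long j = 1; j < (long)n - 1; ++j) {
            for (long i = 1; i < (long)n - 1; ++i) {
                next[idx(n, i, j, k)] =
                    (curr[idx(n, i - 1, j, k)] + curr[idx(n, i + 1, j, k)] +
                     curr[idx(n, i, j - 1, k)] + curr[idx(n, i, j + 1, k)] +
                     curr[idx(n, i, j, k - 1)] + curr[idx(n, i, j, k + 1)] -
                     delta * delta * source[idx(n, i, j, k)]) / 6.0;
            }
        }
    }
}
```

With only the outer loop parallelised, 399 interior planes do not divide evenly among 8 threads; collapsing two loops gives a much finer split. Whether this matters for the program in question is a guess, since an even split does not help if the run is limited by memory bandwidth.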
