#Increasing speed of Jacobian using threading/optimisation/cpu optimisation

scarlet talon
#

anyone able to help with the optimisation of this algorithm?

uncut lantern
scarlet talon
#

I'm on 8 cores / 16 threads

#

but beyond 4 threads it starts getting slower again

uncut lantern
scarlet talon
#

well, with n = 401 and 100 iterations
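
For a sense of scale, here is a rough back-of-the-envelope script for the memory footprint at this problem size. It assumes a 3D n x n x n grid of 8-byte doubles and two full-size arrays (current and next iterate). Neither assumption is confirmed anywhere in this thread, and the function names are made up, so treat the numbers as illustrative only.

```python
# Rough memory-footprint estimate for a Jacobi-style solver.
# ASSUMPTIONS (not confirmed in this thread): the grid is 3D (n*n*n),
# values are 8-byte doubles, and two full-size arrays are kept.

def grid_bytes(n, dims=3, bytes_per_value=8):
    """Bytes needed for one grid with n points along each of `dims` axes."""
    return (n ** dims) * bytes_per_value

def footprint_mib(n, arrays=2, dims=3):
    """Total footprint in MiB for `arrays` grids of size n**dims."""
    return arrays * grid_bytes(n, dims) / (1024 ** 2)

n = 401  # from `-n 401`
print(f"one grid : {grid_bytes(n):,} bytes")
print(f"two grids: {footprint_mib(n):.1f} MiB")
print(f"2D variant, two grids: {footprint_mib(n, dims=2):.2f} MiB")
```

Under the 3D assumption the working set is far larger than any CPU cache, so every sweep has to stream from main memory. If the grid is actually 2D, the footprint is only a few MiB and that reasoning no longer applies.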

scarlet talon
uncut lantern
# scarlet talon it seems like maybe memory bandwidth

Have you run it through perf yet? Something like:

```
sudo perf stat -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses -r 100 ./<my_program>
```
as a starting point?
#

The -r 100 defines the repetitions, change that as you want

scarlet talon
#

it seems like it's stacking threads onto hyperthreaded cores though

uncut lantern
scarlet talon
#

like it just doesnt look right

uncut lantern
#

what do you want to point out in that screenshot?

scarlet talon
#

sudo perf stat -r 5 -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses
-- taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4 >/dev/null

Performance counter stats for 'taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4' (5 runs):

     13,024.55 msec cpu-clock                        #    3.555 CPUs utilized               ( +-  0.99% )
           627      context-switches                 #   48.140 /sec                        ( +-  0.72% )

108,882,570,212 instructions # 1.62 insn per cycle
# 0.02 stalled cycles per insn ( +- 0.01% ) (71.45%)
67,320,409,800 cycles # 5.169 GHz ( +- 0.97% ) (71.48%)
1,765,185,276 stalled-cycles-frontend # 2.62% frontend cycles idle ( +- 0.41% ) (71.42%)
2,194,149,478 branches # 168.463 M/sec ( +- 0.07% ) (71.39%)
61,576,729 branch-misses # 2.81% of all branches ( +- 0.15% ) (71.40%)
132,625,365 cache-misses ( +- 7.35% ) (71.42%)
3,824,394,869 L1-dcache-misses ( +- 0.06% ) (71.45%)

        3.6636 +- 0.0280 seconds time elapsed  ( +-  0.77% )
#

thats what i seem to get

uncut lantern
#

Can you put that in a code block so Discord doesn't think of some of those characters as formatting characters?

#

I.e., put your text like so:

```
YOUR TEXT
```

scarlet talon
#
```
obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e cpu-clock,context-switches,instructions,cycles,stalled-cycles-frontend,branches,branch-misses,cache-misses,L1-dcache-misses \
  -- taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4 >/dev/null

 Performance counter stats for 'taskset -c 0,2,4,6 ./poisson -n 401 -i 100 -t 4' (5 runs):

         13,024.55 msec cpu-clock                        #    3.555 CPUs utilized               ( +-  0.99% )
               627      context-switches                 #   48.140 /sec                        ( +-  0.72% )
   108,882,570,212      instructions                     #    1.62  insn per cycle            
                                                  #    0.02  stalled cycles per insn     ( +-  0.01% )  (71.45%)
    67,320,409,800      cycles                           #    5.169 GHz                         ( +-  0.97% )  (71.48%)
     1,765,185,276      stalled-cycles-frontend          #    2.62% frontend cycles idle        ( +-  0.41% )  (71.42%)
     2,194,149,478      branches                         #  168.463 M/sec                       ( +-  0.07% )  (71.39%)
        61,576,729      branch-misses                    #    2.81% of all branches             ( +-  0.15% )  (71.40%)
       132,625,365      cache-misses                                                            ( +-  7.35% )  (71.42%)
     3,824,394,869      L1-dcache-misses                                                        ( +-  0.06% )  (71.45%)

            3.6636 +- 0.0280 seconds time elapsed  ( +-  0.77% )
```

uncut lantern
#

Whoa

#

That's a lot of cache and branch misses

scarlet talon
#

yeahhh... idrk what to do tho to fix that

uncut lantern
# scarlet talon yeahhh... idrk what to do tho to fix that

Have you tried to profile it, e.g. with perf record & perf report?

E.g.

```
sudo perf record -g sh -c './my_program -arg1 val1' && sudo perf report --children
```

I'm not entirely sure about the -g and --children flag (i.e. I'm not 100% sure what they really do), but at least for me they made it more usable.

scarlet talon
uncut lantern
#

"Unfortunately", they changed the entire tutorial not too long ago, so I don't really know how good the tutorial is.
I say "unfortunately" in quotes, because it's probably hard to be worse than the old one.

#

Yeah, actually skimming through the tutorial that looks pretty darn good compared to the old one

scarlet talon
#
```
obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e   cpu-clock,context-switches,cycles,instructions,branches,branch-misses,cache-references,cache-misses   -- ./poisson -n 401 -i 100 -t 4 >/dev/null

 Performance counter stats for './poisson -n 401 -i 100 -t 4' (5 runs):

         12,610.83 msec cpu-clock                        #    3.578 CPUs utilized               ( +-  0.69% )
               607      context-switches                 #   48.133 /sec                        ( +-  0.30% )
    65,493,053,605      cycles                           #    5.193 GHz                         ( +-  0.73% )  (83.36%)
   118,354,761,553      instructions                     #    1.81  insn per cycle              ( +-  0.01% )  (83.33%)
     6,953,116,132      branches                         #  551.361 M/sec                       ( +-  0.01% )  (83.36%)
        62,076,328      branch-misses                    #    0.89% of all branches             ( +-  0.10% )  (83.30%)
    12,979,328,039      cache-references                 #    1.029 G/sec                       ( +-  0.40% )  (83.32%)
       124,772,557      cache-misses                     #    0.96% of all cache refs           ( +-  3.65% )  (83.34%)

           3.52416 +- 0.00607 seconds time elapsed  ( +-  0.17% )
```

#

thats better right?

uncut lantern
scarlet talon
#

its less than 1% of all cache refs

uncut lantern
#

There appears to be some sort of time where one or more of the threads are idle and waiting on another, otherwise the CPUs utilized number would be closer to 4.
The instructions per cycle look pretty solid though.
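
The "closer to 4" point can be made concrete with a small sketch. It recomputes perf's "CPUs utilized" figure from the cpu-clock and elapsed time of the `-t 4` run above and turns it into an average busy fraction. The function names are made up for illustration.

```python
# How busy were the threads, according to perf stat?
# Inputs are copied from the `-t 4` run above; the function names are
# hypothetical helpers for this sketch.

def cpus_utilized(cpu_clock_msec, elapsed_sec):
    """perf's "CPUs utilized": CPU time summed over threads / wall time."""
    return (cpu_clock_msec / 1000.0) / elapsed_sec

def parallel_efficiency(utilized, threads):
    """Average fraction of the requested threads that were on a CPU."""
    return utilized / threads

used = cpus_utilized(12610.83, 3.52416)
eff = parallel_efficiency(used, 4)
print(f"CPUs utilized: {used:.3f}")
print(f"on CPU: {eff:.1%}, off CPU: {1 - eff:.1%}")
```

Threads that spin at a barrier may still count as on-CPU here, so the off-CPU share is at best a lower bound on how much time is spent waiting.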

uncut lantern
#

But I compared it with 2 of my versions, and relative to the instructions, that appears to be quite a lot of cache misses

scarlet talon
#
```
obu24@obu24-desktop:~/Desktop/Group20$ sudo perf stat -r 5 -e   cpu-clock,context-switches,cycles,instructions,branches,branch-misses,cache-references,cache-misses   -- ./poisson -n 401 -i 100 -t 8 >/dev/null
[sudo] password for obu24:           

 Performance counter stats for './poisson -n 401 -i 100 -t 8' (5 runs):

         30,551.87 msec cpu-clock                        #    6.702 CPUs utilized               ( +-  0.84% )
             3,606      context-switches                 #  118.029 /sec                        ( +- 16.62% )
   158,632,417,242      cycles                           #    5.192 GHz                         ( +-  0.82% )  (83.33%)
   109,082,810,969      instructions                     #    0.69  insn per cycle              ( +-  0.02% )  (83.31%)
     2,230,604,441      branches                         #   73.010 M/sec                       ( +-  0.04% )  (83.34%)
        67,778,397      branch-misses                    #    3.04% of all branches             ( +-  0.35% )  (83.35%)
    13,309,302,452      cache-references                 #  435.630 M/sec                       ( +-  0.33% )  (83.32%)
       206,691,279      cache-misses                     #    1.55% of all cache refs           ( +-  3.99% )  (83.34%)

            4.5585 +- 0.0248 seconds time elapsed  ( +-  0.54% )
```

#

yeahhh only 6.7 when i run 8

uncut lantern
#

The crazier part is the instructions per cycle: it dropped from 1.81 with 4 threads to 0.69 with 8.
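
As a quick cross-check of those figures, the sketch below recomputes instructions per cycle from the raw counters of the two runs above (`-t 4` and `-t 8`, both without `taskset`) and compares the total cycle counts. The helper name is made up for illustration.

```python
# Recompute instructions per cycle (IPC) from the perf stat counters above.
# `ipc` is a hypothetical helper; the counter values are copied verbatim.

def ipc(instructions, cycles):
    """Average instructions retired per CPU cycle."""
    return instructions / cycles

runs = {
    4: {"instructions": 118_354_761_553, "cycles": 65_493_053_605},
    8: {"instructions": 109_082_810_969, "cycles": 158_632_417_242},
}

for threads, c in runs.items():
    print(f"-t {threads}: IPC = {ipc(c['instructions'], c['cycles']):.2f}")

# Roughly the same instruction count, but far more cycles to retire it.
cycle_ratio = runs[8]["cycles"] / runs[4]["cycles"]
print(f"cycles, 8 threads vs 4 threads: {cycle_ratio:.2f}x")
```

The instruction count barely changes between the two runs, so the extra cycles are cycles in which the cores retire very little. The counters alone do not say whether that is memory stalls, contention, or something else.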

scarlet talon
#

yeahhh i just saw that

uncut lantern
#

I saw a "Shared" section (I tried to understand the code for a bit, but honestly... a bit too complex for 5 AM, especially since I've never worked with omp). Are there any shared variables that may cause some memory contention?
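
One common form of that kind of contention is false sharing, where each thread writes only its own slot but the slots sit on the same cache line. The sketch below just does the layout arithmetic. It assumes 64-byte cache lines, a cache-line-aligned array, and a hypothetical shared array with one 8-byte value per thread; none of this is taken from the actual code, which is not shown in this thread.

```python
# Which per-thread slots land on the same cache line?
# ASSUMPTIONS: 64-byte cache lines (typical on x86), an array whose start
# is cache-line aligned, and a hypothetical layout with one slot per thread.
from collections import defaultdict

CACHE_LINE = 64  # bytes, assumed

def lines_shared(threads, stride_bytes):
    """Map cache-line index -> ids of threads whose slot starts on that line."""
    lines = defaultdict(list)
    for tid in range(threads):
        lines[(tid * stride_bytes) // CACHE_LINE].append(tid)
    return dict(lines)

packed = lines_shared(8, stride_bytes=8)    # 8-byte slots packed back to back
padded = lines_shared(8, stride_bytes=64)   # each slot padded to a full line
print("packed:", packed)
print("padded:", padded)
```

In the packed layout all eight slots share one line, so every write by one thread invalidates that line for the other seven. Padding each slot to its own line avoids that at the cost of some memory.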

uncut lantern
# scarlet talon yeahhh only 6.7 when i run 8

But that also underlines the "apparently some threads are waiting on others" part. My first guess as to why would be that the work isn't evenly distributed between the threads. In the end there are multiple idle threads just watching one lone thread finish its work: they can't exit until it is done, none of them can help, and that last thread takes an eternity.
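
To illustrate how an uneven split leaves threads idle, here is a small sketch comparing two ways of dividing loop iterations between threads. The 399 iterations (401 grid planes minus two boundary planes) and both split strategies are assumptions made up for the example; the solver's real scheme is not shown in this thread.

```python
# Two ways to split `items` loop iterations across `threads` workers.
# ASSUMPTIONS: 399 iterations and both strategies are hypothetical; they
# only illustrate how the slowest thread sets the overall finish time.

def even_split(items, threads):
    """Spread the remainder one item at a time over the first threads."""
    base, extra = divmod(items, threads)
    return [base + (1 if t < extra else 0) for t in range(threads)]

def naive_split(items, threads):
    """Give every thread items // threads and dump the rest on the last."""
    base = items // threads
    return [base] * (threads - 1) + [items - base * (threads - 1)]

def imbalance(sizes):
    """Largest share divided by the average share; 1.0 is perfectly even."""
    return max(sizes) / (sum(sizes) / len(sizes))

for threads in (4, 8):
    for split in (even_split, naive_split):
        sizes = split(399, threads)
        print(threads, split.__name__, sizes, f"{imbalance(sizes):.3f}")
```

With the naive split the last thread carries noticeably more work than the rest, and the gap grows with the thread count. If every iteration costs the same, the other threads sit idle for that difference at the end of each sweep.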