#Wrapping an operation in a function makes division slower, but multiplication faster???

132 messages · Page 1 of 1 (latest)

faint bridge
#

So I wrote a c program that runs two for loops, recording and printing the time before and after each for loop.

In the first for loop, i just do output = 5/2,
and in the second for loop, i do output = division_method(5, 2)
division_method just looks like this:
division_method(num1, num2)
{
return (num1 / num2)
}
I increase the loop cycles until there is a visible comparison in seconds between the two loops, and obviously the division_method() loop is slower. Normal division takes 65.39 seconds, while the function call takes 91.78 seconds.

I decide to replace the division sign with a multiplication sign just for funsies, so now both loops are doing multiplication.

Normal multiplication - 71.55 seconds (Slower than normal division, faster than function division)
Multiplication returned by a function call - 64.42 seconds. (Fastest overall)

This has me confused for a few reasons. For one, I thought that division was supposed to be a lot slower than multiplication (although I'm very not well researched and have no solid ground for that claim except the Quake evil pointer hack video). Two, why is multiplication wrapped in a function faster?

At first I thought it had to do with variables being close together on the stack, but I tried messing around with variable ordering and I don't know what it is.

Specifically the operation I'm doing is either:
5 / 2,
division_method(5, 2){ return num1 / num2; }, <- i know this isn't correct syntax but just for simplicity sake ima write it like that here
5 * 2,
or division_method(5, 2){ return num1 * num2; } <- im lazy so i didnt change the function name

serene monolithBOT
#

When your question is answered use !solved to mark the question as resolved.

Remember to ask specific questions, provide necessary details, and reduce your question to its simplest form. For tips on how to ask a good question use !howto ask.

alpine panther
#

Is the printing time taken into account?

faint bridge
#

i record the time immediately before and after each forloop, and print the difference

#

im using <time.h>

#

loop looks like this

alpine panther
#

The problem isnt the way you time

faint bridge
#

dont mind the weird highlighting

alpine panther
#

Have you tried seeing the output assembly?

faint bridge
#

ima be real i got no clue what that means

alpine panther
#

Assembly are the little instructions the CPU reads, 1:1 machine code

#

Put your code here and select your compiler of choice

#

Probably gcc x86_64

faint bridge
#

gcc on a unix server my university provides

alpine panther
faint bridge
#

sounds about right

#

i put my code in the thing

#

and hit link to binary under output

#

not really sure what im lookin for here

alpine panther
#

In that thing, the useful information is in the assembly

#

Just the size of the functions in assembly can be helpful

#

In that site, when you hover on a line in C it highlights the assembly
Like thing() translates to
jsr thing

faint bridge
#

so output = 5 * 2 highlights this

#

and output = division_method(5, 2) hightlights this

alpine panther
# faint bridge

This is a constant operation, it just puts the output (10) in

#

@faint bridge can you send your whole thing?

faint bridge
#

like copy paste the code?

alpine panther
#

Just a screenshot serves

#

From the assembly + the code

faint bridge
#

its uh

#

a bit long

#

ill skip to the loops

alpine panther
#

Actually just put what matters

#

The loop included

faint bridge
#

heres the next section and at the bottom is return 0

alpine panther
#

Wait

faint bridge
#

what'd ya find chief

alpine panther
#

This may be all CPU weirdness afterall

#

CPUs do some optimizations when running that may affect this code

faint bridge
#

interesting

#

so weird to me that a function call that allocates 2 ints on the stack and does an operation is faster than just doing the operation

alpine panther
#

Not on the stack really, it's using a register

faint bridge
#

oh?

#

hm

alpine panther
#

Registers are the fastest storage in the entirety of your computer

#

Anyway i need to sleep rn

faint bridge
#

fair

wooden yacht
#

@faint bridge Okay, as was mentioned already this has likely to do with your compiler optimizing certain statements and unrolling certain loops.
Let's look at that in action:

#

;asm -O3

#define REPETITIONS 1e7

int div(int a, int b) {
    return a / b;
}

int main() {
    for (int i = 0; i < REPETITIONS; i++) {
        int x = 5 / 2;
    }

    for (int i = 0; i < REPETITIONS; i++) {
        int x = div(5, 2);
    }
}
analog portalBOT
#
Assembly Output
div:
  mov eax, edi
  cdq
  idiv esi
  ret
main:
  xor eax, eax
  ret

wooden yacht
#

As you can see it completely unrolled the loop, since it realized the loop isn't actually affecting the program state and while it does produce actual assembly code for the div function, it never actually calls the div function because your compiler is smart enough to realize div doesn't have any side effects, so it can easily deduce the second loop is also not doing anything

#

Could you share your benchmarking code as a code block? (Please with all the required #includes)

faint bridge
#

oooo

#

thats interesting

#

wait its like

#
code something something
#

okay perfect brb

#
#include <time.h>
#include <stdio.h>
#define NUMBER_OF_DIVISION_METHODS = 3

int division_method_1(int, int);
unsigned long power(unsigned long, unsigned long);

int main(int argc, char *argv[])
{
    unsigned long repetitions = 1;
    unsigned long max_repetitions;
    int output;
    unsigned long i;
    unsigned long size;
    time_t start;
    time_t stop;
    double deltaTime;
    printf("\n< This program compares the time it takes to divide numbers using different methods >\n\n");
    printf("Size of int: %d bytes\n", (int) sizeof(int));
    size = (unsigned long) (sizeof(unsigned long) * 4) - 1;
    printf("Size of unsigned long: %lu\n", size);
    max_repetitions = power(2, size);
    printf("Max repititions: %lu\n", max_repetitions);
    while (repetitions < max_repetitions)
    {
        start = clock();
        for (i = 0; i < repetitions; i++)
        {
            output = 5 * 2;
        }
        stop = clock();
        deltaTime = ((double)(stop - start) / CLOCKS_PER_SEC);
        printf("Operation 1 took %f seconds for %lu repetitions\n", (float) deltaTime, repetitions);
        start = clock();
        for (i = 0; i < repetitions; i++)
        {
            output = division_method_1(5, 2);
        }
        stop = clock();
        deltaTime = ((double)(stop - start) / CLOCKS_PER_SEC);
        printf("Operation 2 took %f seconds for %lu repetitions\n", deltaTime, repetitions);
        repetitions *= 2;
    }
    return 0;
}

int division_method_1(int num1, int num2)
{
    return (num1 * num2);
}

/* power: raise base to n-th power; n >= 0 */
unsigned long power(unsigned long  base, unsigned long  n)
{
    int i, p;
    p = 1;
    for (i = 1; i <= n; ++i)
    {
        p = p * base;
    }
    return p;
}
#

the #define at the top isnt used cus i got sidetracked, and its also the complete wrong number dont mind it

#

and i was gonna try comparing different methods of division - just explaining my horrendous function name

#

also keep in mind you're probably gonna have to stop this program on your own, cus it keeps doubling the amount of repetitions until it reaches the size of an unsigned long

#

might be half the size actually im not sure

#

but i made sure it doesnt overflow

faint bridge
#

also also also dont mind that the code is multiplying despite the naming scheme being division because the interesting behavior with the function call being faster only happened with the multiplication operation

wooden yacht
#

Btw, if you want to raise 2 to the power of something (let's call that exponent n) then you can just do:

1 << n;
#

I'm having a hard time benchmarking it cause I'm on Windows and am only running it in the browser right now, but the entire benchmarking setup seems very naive.

First of all a program is usually a bit slower at startup until the CPU is "warm", also you can't use -O0 (the default) to benchmark code. -O0 means to enable no compiler optimizations. That's not how you will deliver your final product. In both -O2 and -O3 the loops are completely unrolled, meaning there is no assembly generated, meaning it's just instant and even -O1 and -Og introduce optimizations that drastically lower the execution time.

On top of that I'm a bit suspicious of that clock function, it looks a lot like the time.time function in Python, which is okay for rough estimates but not for precise benchmarking, especially when it comes to operations this short (each division/divison call will only take a few nanoseconds [amortized]). Now I have no proof that this function isn't precise, but let me show you how I once implemented benchmarking for a uni project, maybe you can take a thing or two away from it (although it's a macro mess 😅 ).

#

No clue which of these 4 includes you'll actually need:

#include <stdlib.h>
#include <dirent.h>
#include <time.h>
#include <unistd.h>
#
#define true 1
#define false 0
#define ITERATIONS 32
#define LOOP for(int i = 0; i < ITERATIONS; ++i)

#define PRINT_TIME(action, time, tab) \
    printf("\t"); \
    if(tab) \
    printf("\t"); \
    printf("%s: %.9fs\n", action, time / ITERATIONS);


#define LOOP_BIG(init, method, post_method, action, tab) {\
    struct timespec start;                  \
    struct timespec end;                    \
    double average = 0;                     \
    LOOP {                                  \
    init;                                   \
    clock_gettime(CLOCK_MONOTONIC, &start); \
    method;                                 \
    clock_gettime(CLOCK_MONOTONIC, &end);   \
    average += end.tv_sec - start.tv_sec + 1e-9 * (end.tv_nsec - start.tv_nsec); \
    post_method;                            \
    sleep(1);                               \
    }                                       \
    PRINT_TIME(action, average, tab)        \
    }

static inline double curtime(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}
#

It could then be used like e.g. this:

LOOP_BIG(unsigned int x,
         x += i / 2,
         printf("x = %d\n", x),  // do something to make it harder for the compiler to just omit the function call
         "literal division",
         false);  // the false is just to specify whether a tab should be printed (I put it for formatting reasons)
#

Although you wouldn't need the semicolon in the very last line

#

but it doesn't hurt

#

@faint bridge But the main thing I want you to take away from this is the usage of clock_gettime(CLOCK_MONOTONIC) and how you extract information out of it, rather than the simple time function (where I'm a bit sketchy about, albeit without any evidence)

#

And that it's useful to have some unmeasured runtime before you actually start your measurement (I actually didn't do this either in my implementation) so that the CPU gets some time to warm up

#

But yeah, if you want to reliably test with such short operations (like a single division which take a few nanoseconds at most), then you'll be best of implementing these functions you want to test directly in assembly, so that you can be sure of the code that's actually getting tested.
Obviously for larger operations you want to work with the compiler's optimizations, but with these toy projects it can really screw you up, as most of the time will be spent with memory read/writes, which I assume -O0 produces a lot of.

analog portalBOT
#
Assembly Output Pt. 1
.LC0:
  .string "\n< This program compares the time it takes to divide numbers using different methods >\n"
.LC1:
  .string "Size of int: %d bytes\n"
.LC2:
  .string "Size of unsigned long: %lu\n"
.LC3:
  .string "Max repititions: %lu\n"
.LC5:
  .string "Operation 1 took %f seconds for %lu repetitions\n"
.LC6:
  .string "Operation 2 took %f seconds for %lu repetitions\n"
main:
  push rbp
  mov rbp, rsp
  sub rsp, 80
  mov DWORD PTR [rbp-68], edi
  mov QWORD PTR [rbp-80], rsi
  mov QWORD PTR [rbp-8], 1
  mov edi, OFFSET FLAT:.LC0
  call puts
  mov esi, 4
  mov edi, OFFSET FLAT:.LC1
  mov eax, 0
  call printf
  mov QWORD PTR [rbp-24], 31
  mov rax, QWORD PTR [rbp-24]
  mov rsi, rax
  mov edi, OFFSET FLAT:.LC2
  mov eax, 0
  call printf
  mov rax, QWORD PTR [rbp-24]
  mov rsi, rax
  mov edi, 2
  call power
  mov QWORD PTR [rbp-32], rax
  mov rax, QWORD PTR [rbp-32]
  mov rsi, rax
  mov edi, OFFSET FLAT:.LC3
  mov eax, 0
  call printf
  jmp .L2
.L7:
  call clock

Assembly Output Pt. 2
  mov QWORD PTR [rbp-40], rax
  mov QWORD PTR [rbp-16], 0
  jmp .L3
.L4:
  mov DWORD PTR [rbp-60], 10
  add QWORD PTR [rbp-16], 1
.L3:
  mov rax, QWORD PTR [rbp-16]
  cmp rax, QWORD PTR [rbp-8]
  jb .L4
  call clock
  mov QWORD PTR [rbp-48], rax
  mov rax, QWORD PTR [rbp-48]
  sub rax, QWORD PTR [rbp-40]
  pxor xmm0, xmm0
  cvtsi2sd xmm0, rax
  movsd xmm1, QWORD PTR .LC4[rip]
  divsd xmm0, xmm1
  movsd QWORD PTR [rbp-56], xmm0
  pxor xmm0, xmm0
  cvtsd2ss xmm0, QWORD PTR [rbp-56]
  pxor xmm2, xmm2
  cvtss2sd xmm2, xmm0
  movq rax, xmm2
  mov rdx, QWORD PTR [rbp-8]
  mov rsi, rdx
  movq xmm0, rax
  mov edi, OFFSET FLAT:.LC5
  mov eax, 1
  call printf
  call clock
  mov QWORD PTR [rbp-40], rax
  mov QWORD PTR [rbp-16], 0
  jmp .L5
.L6:
  mov esi, 2
  mov edi, 5
  call division_method_1
  mov DWORD PTR [rbp-60], eax
  add QWORD PTR [rbp-16], 1
.L5:
  mov rax, QWORD PTR [rbp-16]
  cmp rax, QWORD PTR [rbp-8]
  jb .L6
  call clock
  mov QWORD PTR [rbp-48], rax
  mov rax, QWORD PTR [rbp-48]

Assembly Output Pt. 3
  sub rax, QWORD PTR [rbp-40]
  pxor xmm0, xmm0
  cvtsi2sd xmm0, rax
  movsd xmm1, QWORD PTR .LC4[rip]
  divsd xmm0, xmm1
  movsd QWORD PTR [rbp-56], xmm0
  mov rdx, QWORD PTR [rbp-8]
  mov rax, QWORD PTR [rbp-56]
  mov rsi, rdx
  movq xmm0, rax
  mov edi, OFFSET FLAT:.LC6
  mov eax, 1
  call printf
  sal QWORD PTR [rbp-8]
.L2:
  mov rax, QWORD PTR [rbp-8]
  cmp rax, QWORD PTR [rbp-32]
  jb .L7
  mov eax, 0
  leave
  ret
division_method_1:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov eax, DWORD PTR [rbp-4]
  imul eax, DWORD PTR [rbp-8]
  pop rbp
  ret
power:
  push rbp
  mov rbp, rsp
  mov QWORD PTR [rbp-24], rdi
  mov QWORD PTR [rbp-32], rsi
  mov DWORD PTR [rbp-8], 1
  mov DWORD PTR [rbp-4], 1
  jmp .L12
.L13:
  mov eax, DWORD PTR [rbp-8]
  cdqe
  mov edx, eax
  mov rax, QWORD PTR [rbp-24]
  imul eax, edx
  mov DWORD PTR [rbp-8], eax
  add DWORD PTR [rbp-4], 1
.L12:
  mov eax, DWORD PTR [rbp-4]
  cdqe
  cmp QWORD PTR [rbp-32], rax

Assembly Output Pt. 4
  jnb .L13
  mov eax, DWORD PTR [rbp-8]
  cdqe
  pop rbp
  ret
.LC4:
  .long 0
  .long 1093567616

wooden yacht
#

Yeah. Look at that division_method aseembly as an example (the normal division is a bit more messy to inspect):

division_method_1:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov eax, DWORD PTR [rbp-4]
  imul eax, DWORD PTR [rbp-8]
  pop rbp
  ret

Whereas with -O1 this already looks like:

#

;asm -O1

int division_method_1(int num1, int num2)
{
    return (num1 * num2);
}
analog portalBOT
#
Assembly Output
division_method_1:
  mov eax, edi
  imul eax, esi
  ret

wooden yacht
#

So much shorter, completely removing the memory accesses.

#

@faint bridge Oh, other important information.
Your / 2 (divide by 2) calls that are hardcoded are actually all getting translated to >> 1, even with -O0.
On the same hand all * 2 calls are automatically optimized to << 1.

You can see this first hand in action:

#

;asm -O0

int division_by_2(int x) {
    return x / 2;
}

int multiply_by_2(int x) {
    return x * 2;
}
analog portalBOT
#
Assembly Output
division_by_2:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov eax, DWORD PTR [rbp-4]
  mov edx, eax
  shr edx, 31
  add eax, edx
  sar eax
  pop rbp
  ret
multiply_by_2:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov eax, DWORD PTR [rbp-4]
  add eax, eax
  pop rbp
  ret

wooden yacht
#

*Okay, the multiply actually converts to a single add operation, but the assembly for divide really uses right shifts (shr and sar). Now the reason it looks a bit confusing and why there's a shift by 31 and another shift by 1 is because the compiler has to think about what'd happen if someone passed a negative number.
If we specifically declare our input to be unsigned, then the compiler should be able to generate much more efficient assembly (even on optimization level -O0, but especially important for higher optimization levels):

#

;asm -O0

int division_by_2(unsigned int x) {
    return x / 2;
}

int multiply_by_2(unsigned int x) {
    return x * 2;
}
analog portalBOT
#
Assembly Output
division_by_2:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov eax, DWORD PTR [rbp-4]
  shr eax
  pop rbp
  ret
multiply_by_2:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov eax, DWORD PTR [rbp-4]
  add eax, eax
  pop rbp
  ret

wooden yacht
#

As you can see the

shr edx, 31
add eax, edx
sar eax
```got replaced by just:
```x86asm
shr eax
```which is not only more efficient but also a, and be it marginally, smaller executable size
faint bridge
#

@wooden yacht sounds like there's a lot I'm not accounting for ;-;

#

I couldn't figure out a good way to get time in milliseconds with time.h in c so I just pulled something random off a forum

#

I probably won't ever have to optimize division

#

But I kinda hope I do some day ngl

#

Seems really interesting to learn

serene lily
faint bridge
#

I'm using vi on unix though

alpine panther
faint bridge
#

Ah

#

Yeah nah

#

My professor makes us use vi

wooden yacht
alpine panther
#

Atleast vim

faint bridge
#

I'm kinda used to it ngl

#

Pretty 🔥

wooden yacht
faint bridge
#

It's so fun 🤪🤪🤪

alpine panther
faint bridge
#

mhm mhm

#

ill check it out

alpine panther
faint bridge
#

what if its everything i need

#

forbidden love is still love

#

(i have not tried neovim and i probably would like it)

alpine panther
#

but you can put syntax highlighting, a language server, see the errors in your editor, have autocomplete

#

fancy colors

faint bridge
#

Time to use vi till I'm 50

#

Or make my own editor that's even worse than vi

alpine panther
faint bridge
alpine panther
#

nvim is a thing

faint bridge
#

That isn't the same as vimim

#

I want the exact title "Vi Improved, Improved"

alpine panther
#

💀