#Confused about hash table insertion time (closed hashing using linear probing)

18 messages · Page 1 of 1 (latest)

fast dragonBOT
#

When your question is answered use !solved to mark the question as resolved.

Remember to ask specific questions, provide necessary details, and reduce your question to its simplest form. For tips on how to ask a good question use !howto ask.

stuck heart
#

Confused about hash table insertion time (closed hashing using linear probing)

cobalt dove
#

Disclaimer: So far I've only looked at your testing program:

First of all try to change your get_time function to this:

static inline double get_time(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

The reason for that is on the manpage of gettimeofday:

The time returned by gettimeofday() is affected by discontinuous
jumps in the system time (e.g., if the system administrator
manually changes the system time). If you need a monotonically
increasing clock, see clock_gettime(2).

Secondly you should definetly use double instead of float for percent and why do you have the / 100? Just make it like so:

for (double percent = 0.002; percent < 1.0; percent += 0.002) {
    const int n = TABLE_CAPACITY * percent;
    ...
```This makes much more sense since mathematically 100% is the same as 1.
cobalt dove
#

errrr?

  1. Why do you even have a struct called bucket? You're not using buckets. That's just confusing to read
  2. You really shouldn't use % tab->capacity to calculate the index. Rather be smart and in the constructor set tab->capacity to be the next bigger power of 2, then have an additional variable mask that you'll use instead of tab->capacity but you'll use bit-AND with it (much faster).
    Here's how the constructor would change:
tab->capacity = nextPow2(capacity);
tab->mask = tab->capacity - 1;
```in theory you can even entirely remove `tab->capacity`, but keeping it doesn't hurt.
And here's how you would calculate the index:
```c
int hash = hash_function(key) & mask;  // I kept your variable name for consistency
// but hopefully this shows why the name 'hash' doesn't make sense since it's 
// not a hash but an index into your map
// Your 'hash_function' is just the identify function, so it just returns 'key' again
nocturne shore
#

Y'all aren't addressing the question.

cobalt dove
#

And on a second look I don't quiet get this line:

T *= 1e9;             /* Convert to microseconds. */
```shouldn't that convert that to nanoseconds?
Like, in your `get_time` function you even do:
```c
tv.tv_usec * 1e-6;
```which means you must've obviously known the conversions, no?
cobalt dove
nocturne shore
#

Do you consider the difference statistically significant? How so?

nocturne shore
cobalt dove
#

Da fuck mate? Calm down.
I decided to talk about potential root causes for the issue, like him using gettimeofday instead of clock_gettime and him doing

T *= 1e9;             /* Convert to microseconds. */
```which should be 
```c
T *= 1e6;
```and other things
cobalt dove
#

Yeah, it works but have you read the quote from the manpage I copied?

#

clock_gettime with a CLOCK_MONTONIC is simply more accurate (and it also gives you nanosecond precision instead of microsecond precision) than gettimeofday

#

Btw, don't use time (ms) if by ms you mean microseconds since ms stands for milliseoconds basically everywhere. The short version for microseconds is us

nocturne shore
#

I think this result could be correct.

#

I think it would be a good exercise to try to show why it is or isn't correct.

cobalt dove
#

Okay, then your professor made a mistake because he's clearly not printing out microseconds per operation, but he's printing out nanoseconds per operation (and he named it milliseconds in the output 🐒 ).

But yeah, considering we measured this with an insufficient clock (as I said a few times now: gettimeofday only provides micro-second precision, not nano-second precision, so if he tries to convert that to nanoseconds it's always going to be flawed, even if the clock was perfect which it isn't since it can be affected by system time jumps) this looks like a to be expected result, especially since the variance in the nanosecond range is so incredibly sensitive to how your CPU felt during that timespan.

I also think that even a table size of 1'000'000 like you have it will still mostly fit into your CPU caches. If you truly want to have some memory reads you'd need to make the table even bigger.

(You can also measure the program with perf to see how many cache/branch misses you have)

fast dragonBOT
#

Thank you and let us know if you have any more questions!

This thread is now set to auto-hide after an hour of inactivity