Part of C in 100s

C in 100 Seconds: Hash Table Concept

Celest KimCelest Kim

Video: C in 100 Seconds: Hash Table — Hash Function, Buckets, Collisions | Episode 48 by Taught by Celeste AI - AI Coding Coach

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

C Hash Table: Hash Function, Buckets, Collisions

A hash function maps keys to indices. The table is an array of buckets. Collisions happen when two keys hash to the same bucket. Chaining or open addressing resolves them.

Hash tables are the bread-and-butter data structure for "look up by key." Average O(1) operations — far faster than linear search. The key insight: turn the key into an array index via a hash function.

The basic shape

#include <stdio.h>
#include <string.h>

#define SIZE 10

int hash(char *key) {
  int sum = 0;
  while (*key) {
    sum += *key;
    key++;
  }
  return sum % SIZE;
}

char *table[SIZE] = {NULL};

void insert(char *key) {
  int idx = hash(key);
  if (table[idx] != NULL) {
    printf("Collision: %s and %s both hash to %d\n", table[idx], key, idx);
  }
  table[idx] = key;
}

char *lookup(char *key) {
  int idx = hash(key);
  return table[idx];
}

int main() {
  insert("Alice");
  insert("Bob");
  insert("Carol");

  printf("Bob -> %s\n", lookup("Bob"));
}

The hash function

A hash function turns a key (string, int, struct) into an index in [0, SIZE). Goals:

  • Deterministic. Same key always produces same hash.
  • Uniform. Distributes keys evenly across buckets.
  • Fast. Computed in O(length).

A simple hash for strings:

int hash(char *key) {
  int sum = 0;
  while (*key) {
    sum += *key;
    key++;
  }
  return sum % SIZE;
}

Sum of ASCII values, mod table size. Bad distribution — anagrams hash to the same bucket. "abc" and "bca" and "cab" all collide.

Better: djb2 (a classic):

unsigned long hash(const char *str) {
  unsigned long hash = 5381;
  int c;
  while ((c = *str++)) {
    hash = ((hash << 5) + hash) + c;   // hash * 33 + c
  }
  return hash % SIZE;
}

hash * 33 + c mixes bits well. Anagrams produce different hashes. Production-quality.

Even better: SipHash, FNV-1a, xxHash. Use a real hash for non-trivial code.

Collisions

Two keys hash to the same index = collision. Inevitable when the table is smaller than the keyspace (which it always is).

Birthday paradox: even a half-full table has many collisions. Don't panic — handle them.

Two strategies: chaining and open addressing.

Chaining

Each bucket holds a linked list of items that hash there:

typedef struct Entry {
  char *key;
  char *value;
  struct Entry *next;
} Entry;

Entry *table[SIZE] = {NULL};

void insert(char *key, char *value) {
  int idx = hash(key);
  Entry *e = malloc(sizeof(Entry));
  e->key = key;
  e->value = value;
  e->next = table[idx];   // prepend to chain
  table[idx] = e;
}

char *lookup(char *key) {
  int idx = hash(key);
  for (Entry *e = table[idx]; e != NULL; e = e->next) {
    if (strcmp(e->key, key) == 0) return e->value;
  }
  return NULL;
}

Insert: prepend to the chain at idx. Lookup: walk the chain, compare keys with strcmp, return on match.

Worst case: all keys collide → O(n). Average: O(1) if hash is uniform.

Episode 49 walks chaining in detail.

Open addressing

When a bucket is taken, try the next slot. Variations:

  • Linear probing: try idx + 1, idx + 2, ...
  • Quadratic probing: try idx + 1, idx + 4, idx + 9, ...
  • Double hashing: use a second hash function for the step.
char *table[SIZE] = {NULL};

void insert(char *key) {
  int idx = hash(key);
  while (table[idx] != NULL) {
    idx = (idx + 1) % SIZE;   // linear probe
  }
  table[idx] = key;
}

Lookup walks the same probe sequence:

char *lookup(char *key) {
  int idx = hash(key);
  while (table[idx] != NULL) {
    if (strcmp(table[idx], key) == 0) return table[idx];
    idx = (idx + 1) % SIZE;
  }
  return NULL;
}

Open addressing is cache-friendly (no pointer-chasing). Chaining handles high load factors better. Most modern hash tables use open addressing with sophisticated probing.

Load factor

load_factor = number_of_entries / table_size
  • 0.5 — comfortable, low collision rate.
  • 0.75 — typical "resize threshold" in dynamic hash tables.
  • 0.9+ — many collisions; lookups slow.

Dynamic hash tables grow when load factor exceeds a threshold:

if (count >= 0.75 * table_size) {
  resize(table_size * 2);
}

Resizing rehashes every entry into the new table.

Why mod size

return hash_value % SIZE;

After computing a (potentially huge) hash, mod by table size to get an in-range index.

If SIZE is a power of 2, hash & (SIZE - 1) is faster than hash % SIZE. Many production tables use powers of 2.

But: the low bits of the hash become the index. A bad hash function with poor low-bit distribution will cluster. Use a hash designed for power-of-2 modding (xxHash, FNV-1a).

What can be a key?

Anything you can hash:

  • Strings — concatenate bytes.
  • Integers — usually key % SIZE, but multiply by a prime first to mix.
  • Structs — hash each field, combine.
  • Pointers(uintptr_t)ptr / sizeof(void*) — pointer values mod SIZE.

For structured keys, hash each component and combine:

unsigned long hash_pair(int x, int y) {
  unsigned long h = 17;
  h = h * 31 + x;
  h = h * 31 + y;
  return h % SIZE;
}

17 and 31 are common primes. Different primes for different multiply rounds reduce clustering.

Hash tables in real programs

  • Symbol tables in compilers — function names → addresses.
  • Cache — content addressable lookups.
  • Set membershipis_in_set reduces to lookup.
  • Frequency counterscount[word]++ over a corpus.
  • Database indexes — hash indexes for equality lookups.

The standard library doesn't include hash tables for C — you write your own or use a library (uthash, klib's khash, etc.).

Common mistakes

Bad hash function. Sum-of-chars collides on anagrams. djb2 or FNV is the floor for real use.

Comparing keys with == instead of strcmp. Pointer equality, not string equality.

Storing pointers to local strings. insert("alice") stores the literal address; insert(stack_buffer) stores a soon-stale address.

Forgetting to handle collisions. Naive insert overwrites whatever was there — data loss.

Not resizing. A fixed-size table with growing data has O(n) lookups eventually.

Modulo with negative. % in C with negative dividend can be negative. Useunsigned` for hash values.

What's next

Episode 49: chaining — insert, lookup, delete. The chained-list strategy in detail, with full code.

Recap

Hash table: array of buckets, indexed by hashing the key. Hash function must be deterministic, uniform, fast. Collisions inevitable; resolve via chaining (linked list per bucket) or open addressing (probe nearby slots). Load factor (entries/size) drives resize decisions. Use real hash functions (djb2, FNV) — not character-sum. Many production languages have hash tables in their stdlib; C doesn't — roll your own or use a library.

Next episode: chaining.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.