ELF: symbol lookup via DT_HASH

referenced:https://flapenguin.me/elf-dt-hash

libc.so contains tons of dynamic symbols exported to the outer world (
nm -D /lib/libc.so.6 | wc -l gives 3023 symbols on my machine).
And quite a few of them will be imported in any relatively big program.

Obviously, doing linear search would be a bad idea.
Some sort of a hash table can be used to
optimize the searching through thousands of strings

There're two options.
The first is to create a hash table from a plain symbol list
during the binary loading.
This would introduce a lot of work and allocations during dynamic linking,
so it's not such a good solution

The second option is to create a hash table
during linking
and save it inside the binary in some serialized format.
And this is exactly what happens in the ELF.


Every ELF binary has a hash table
filled with symbol names baked into the binary itself.
The hashing function is "hard-coded" in the standard
so every compiler, static linker, and dynamic linker can use the same one.
Otherwise, nothing would work.
Here's what the function looks like:



#include <stdint.h>

uint32_t elf_hash(const uint8_t* name) {<!-- -->
    uint32_t h = 0, g;
    for (; *name; name + + ) {<!-- -->
        h = (h << 4) + *name;
        if (g = h & amp; 0xf0000000) {<!-- -->
            h ^= g >> 24;
        }
        h &= ~g;
    }
    return h;
}


elf_hash("") // 0x00000000
elf_hash("printf") // 0x077905a6
elf_hash("exit") // 0x0006cf04
elf_hash("syscall") // 0x0b09985c

Once the string, symbol, and hash tables are located via section headers (or program headers, or _DYNAMIC)
they can be used to find a symbol by its name.
The hash table looks like this
struct elf_hash_table {<!-- -->
    uint32_t nbucket;
    uint32_t nchain;
    uint32_t bucket[nbucket];
    uint32_t chain[nchain];
};
chain array contains chains of symbol indexes
within the same bucket.
A chain starts at bucket[hash % nbucket] index

You should walk through the chain by interpreting chain[ix] value
as the index of the next symbol
and the next chain element.

Finally you’ll bump into the STN_UNDEF symbol
(dummy “undefined” symbol),
which is always the last symbol of any chain.

The ELF obligates the Symbol Table to hold the STN_UNDEF symbol at 0 index.
So effectively a chain breaks when current index is 0.

Since the chain array values are indexes for not only the chain array itself,
but also for the symbol table, the chain array must be the same size as the symbol table.
This makes nchain equal to the length of the symbol table,

nbucket = 4 (because I decided that there will be four buckets)
nchain = 16 (16 symbols, including the SHT_UNDEF at index 0)


ix bucket[ix] name of first symbol in chain
------------------------------------------
 0 2 freelocal
 1 8 setrlimi
 2 1 isnan
 3 3 hcreate_

Two asterisks ** and parens ()
indicate the start of a chain

       SYMBOL TABLE | HASH TABLE
                       |
    name = | hash = hash %
ix symtab[ix].st_name | elf_hash(name) nbucket chain[ix]
-- ------------------ | --------------- ------- -------- -
 0 <STN_UNDEF> |
 1 isnan | 0x0070a47e 2 ** /------(5)
 2 freelocal | 0x0bc334fc 0 ** | /----(4)
 3 hcreate_ | 0x0a8b8c4f 3 ** | | (6)-----\
 4 getopt_long_onl | 0x0f256dbc 0 | \--> 12 -----|-\
 5 endrpcen | 0x04b96f7e 2 \-----> 7 ---\ | |
 6 pthread_mutex_lock | 0x0de6a18b 3 0 <--|-/ |
 7 isinf | 0x0070a046 2 /--- 9 <--/ |
 8 setrlimi | 0x0cb929a9 1 ** | (11)-----\ |
 9 getspen | 0x0dcba6de 2 \-> 10 ---\ | |
10 umoun | 0x007c46be 2 /---- 13 <--/ | |
11 strsigna | 0x0b99fbe1 1 | 0 <----/ |
12 listxatt | 0x00abef84 0 | /-- 15 <------/
13 getttyen | 0x0cbbb96e 2 \-|-> 14 ---\
14 uselib | 0x07c9c2f2 2 | 0 <--/
15 cfsetispeed | 0x0b63b274 0 \--> 0


Now that we have this table, let's try to find some symbols by hand:

freelocal:
hash = elf_hash("freelocal") = 0x0bc334fc
chain starts at bucket[0x0bc334fc % 4] = bucket[0] = 2

symbols[2] (= "freelocal") == "freelocal"? yep => found at index 2
****************************************
getspen:
hash = elf_hash("getspen") = 0x0dcba6de
chain starts at bucket[0x0dcba6de % 4] = bucket[2] = 1

symbols[2] (= "isnan") == "getspen"? nope => chain continues at 5
symbols[4] (= "endrpcen") == "getspen"? nope => chain continues at 7
symbols[8] (= "setrlimi") == "getspen"? nope => chain continues at 9
symbols[9] (= "getspen") == "getspen"? yep => found at index 9
*************************************************** ************
foobar:
hash = elf_hash("foobar") = 0x06d65882
chain starts at bucket[0x06d65882 % 4] = bucket[2] = 1

symbols[ 1] (= "isnan") == "foobar"? nope => chain continues at 5
symbols[ 5] (= "endrpcen") == "foobar"? nope => chain continues at 7
symbols[ 7] (= "isinf") == "foobar"? nope => chain continues at 9
symbols[ 9] (= "getspen") == "foobar"? nope => chain continues at 10
symbols[10] (= "umoun") == "foobar"? nope => chain continues at 13
symbols[13] (= "getttyen") == "foobar"? nope => chain continues at 14
symbols[14] (= "uselib") == "foobar"? nope => chain continues at 0
symbols[ 0] is STN_UNDEF => not found


And now that we can do a lookup manually, let's actually code it.
It should be self-explanatory:
/* Different architecture have different symbol structure size. */
typedef Elf64_Sym Elf_Sym;
/* typedef Elf32_Sym Elf_Sym; */

const Elf_Sym* elf_lookup(
    const char* strtab, /* string table */
    const Elf_Sym* symtab, /* symbol table */
    const uint32_t* hashtab, /* hash table */
    const char* symname /* name to look up */
) {<!-- -->
    const uint32_t hash = elf_hash(symname);

    const uint32_t nbucket = hashtab[0];
    const uint32_t nchain = hashtab[1];
    const uint32_t* bucket = & amp;hashtab[2];
    const uint32_t* chain = & amp;bucket[nbucket];

    for (uint32_t i = bucket[hash % nbucket]; i; i = chain[i]) {<!-- -->
        if (strcmp(symname, strtab + symtab[i].st_name) == 0) {<!-- -->
            return &symtab[i];
        }
    }

    return NULL;
}

while DT_HASH is very good at finding existing symbols,
it performs badly with the missing ones.
As you could see in the example with searching for "foobar",
you need to walk through a random chain
and compare strings only to bump into the STN_UNDEF.
This becomes even worse in real life
because symbols are searched in multiple shared libraries,
so you'll have to walk multiple random chains.