lhash() -- SSLeay 0.6.6 -- February 1997

NAME

lhash -- hash table routines

SYNOPSIS

#include "lhash.h"

LHASH *lh_new(hash, cmp)
unsigned long (*hash)();
int (*cmp)();

void lh_free(lh)
LHASH *lh;

char *lh_insert(lh, data)
LHASH *lh;
char *data;

char *lh_delete(lh, data)
LHASH *lh;
char *data;

char *lh_retrieve(lh, data)
LHASH *lh;
char *data;

void lh_doall(lh, func)
LHASH *lh;
void (*func)(char *a);

void lh_doall_arg(lh, func, arg)
LHASH *lh;
void(*func)(char *a,char *arg);
char *arg;

unsigned long lh_strhash(c)
char *c;

void lh_stats(lh, out)
LHASH *lh;
FILE *out;

void lh_node_stats(lh, out)
LHASH *lh;
FILE *out;

void lh_node_usage_stats(lh, out)
LHASH *lh;
FILE *out;

DESCRIPTION

This library is based on a 1991 article from the Communications of the ACM.

What makes this hash table different is that as the table fills, the hash table is increased (or decreased) in size via realloc(). When a 'resize' is done, instead of all hashes being redistributed over twice as many 'buckets', one bucket is split. So when an 'expand' is done, there is only a minimal cost to redistribute some values. Subsequent inserts will cause more single 'bucket' redistributions but there will never be a sudden large cost due to redistributing all the 'buckets'.
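The one-bucket-at-a-time split can be sketched roughly as follows. This is an illustration of the general linear hashing idea, not the library's code: every name starting with toy_ is hypothetical, and a fixed-size bucket array stands in for the realloc() step.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BUCKETS 64      /* fixed capacity; the real table grows via realloc() */

struct toy_node {
    unsigned long hash;         /* cached hash value of the item */
    struct toy_node *next;      /* next item in the same bucket */
};

struct toy_table {
    struct toy_node *b[TOY_MAX_BUCKETS];
    unsigned long pmax;         /* bucket count at the start of this round */
    unsigned long p;            /* next bucket to split, 0 <= p < pmax */
    unsigned long num_nodes;    /* buckets currently in use (pmax + p) */
};

void toy_init(struct toy_table *t, unsigned long initial)
{
    size_t i;

    assert(initial > 0 && initial <= TOY_MAX_BUCKETS);
    for (i = 0; i < TOY_MAX_BUCKETS; i++)
        t->b[i] = NULL;
    t->pmax = initial;
    t->p = 0;
    t->num_nodes = initial;
}

/* Map a hash to a bucket; buckets below p have already been split. */
unsigned long toy_bucket_for(const struct toy_table *t, unsigned long hash)
{
    unsigned long n = hash % t->pmax;

    if (n < t->p)
        n = hash % (2 * t->pmax);
    return n;
}

void toy_insert_node(struct toy_table *t, struct toy_node *nd)
{
    unsigned long n = toy_bucket_for(t, nd->hash);

    nd->next = t->b[n];
    t->b[n] = nd;
}

/* Split exactly one bucket (number p); returns 0 if the toy table is full. */
int toy_expand(struct toy_table *t)
{
    unsigned long old = t->p, new_idx = t->p + t->pmax;
    struct toy_node **src, *nd;

    if (new_idx >= TOY_MAX_BUCKETS)
        return 0;
    src = &t->b[old];
    while ((nd = *src) != NULL) {
        if (nd->hash % (2 * t->pmax) != old) {
            *src = nd->next;            /* unlink from the old bucket */
            nd->next = t->b[new_idx];   /* push onto the new bucket */
            t->b[new_idx] = nd;
        } else {
            src = &nd->next;            /* item stays where it is */
        }
    }
    t->num_nodes++;
    if (++t->p == t->pmax) {            /* every bucket split: start a new round */
        t->pmax *= 2;
        t->p = 0;
    }
    return 1;
}
```

Each call to toy_expand() only walks the items of one bucket, which is why the cost of growing stays small and evenly spread.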

The state for a particular hash table is kept in the LHASH structure. The LHASH structure also records statistics about most aspects of accessing the hash table. This is mostly a legacy of my having written this library to implement what looked like a nice algorithm rather than for a particular software product.

Internal stuff you probably don't want to know about. The decision to increase or decrease the hash table size is made depending on the 'load' of the hash table. The load is the number of items in the hash table divided by the size of the hash table. The rules are as follows. If (hash->up_load < load) => expand. If (hash->down_load > load) => contract. The 'up_load' has a default value of 2 and 'down_load' has a default value of 1; 'up_load' has to be the larger of the two, otherwise the table would want to expand and contract at the same load. These numbers can be modified by the application by assigning to the 'up_load' and 'down_load' fields directly. The 'load' is kept in a form which is multiplied by 256, so hash->up_load=8*256; sets the expand threshold to a load of 8.
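The scaled-by-256 comparison might look something like the sketch below. The toy_ names are hypothetical, and the exact rounding used inside the library is an assumption here; only the two comparisons and the factor of 256 come from the text above.

```c
#include <assert.h>

#define TOY_LOAD_MULT 256UL     /* loads are stored multiplied by 256 */

/* Items per bucket, scaled by 256 so fractional loads fit in an integer. */
unsigned long toy_scaled_load(unsigned long num_items, unsigned long num_buckets)
{
    assert(num_buckets > 0);
    return num_items * TOY_LOAD_MULT / num_buckets;
}

/* Expand when up_load < load (both values scaled by 256). */
int toy_should_expand(unsigned long up_load,
                      unsigned long num_items, unsigned long num_buckets)
{
    return up_load < toy_scaled_load(num_items, num_buckets);
}

/* Contract when down_load > load (both values scaled by 256). */
int toy_should_contract(unsigned long down_load,
                        unsigned long num_items, unsigned long num_buckets)
{
    return down_load > toy_scaled_load(num_items, num_buckets);
}
```

With up_load set to 8*256, a table of 2 buckets would be told to expand once it holds 17 items, since 16 items give a load of exactly 8, which is not greater than the threshold.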

If you are interested in performance, the field to watch is num_comp_calls. The hash library keeps track of the 'hash' value for each item, so when a lookup is done the 'hashes' are compared first; only if they match is a full compare done, and hash->num_comp_calls is incremented. If num_comp_calls is not equal to num_delete plus num_retrieve, it means that your hash function is generating hashes that are the same for different values. It is probably worth changing your hash function if this is the case, because even if your hash table has 10 items in a 'bucket', it can be searched with 10 'unsigned long' compares and 10 linked list traverses. This will be much less expensive than 10 calls to your compare function.
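A bucket search that compares cached hashes before calling the full compare could be sketched like this. The structures and the toy_find() function are hypothetical stand-ins, not the library's internals, and strcmp() plays the role of the application's compare function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_entry {
    const char *data;           /* the stored item (a string in this sketch) */
    unsigned long hash;         /* hash value cached at insert time */
    struct toy_entry *next;
};

struct toy_stats {
    unsigned long num_hash_comps;   /* cheap 'unsigned long' compares */
    unsigned long num_comp_calls;   /* calls to the full compare function */
};

/* Walk one bucket; only call the full compare when the hashes match. */
const char *toy_find(const struct toy_entry *chain, const char *key,
                     unsigned long key_hash, struct toy_stats *st)
{
    assert(key != NULL && st != NULL);
    for (; chain != NULL; chain = chain->next) {
        st->num_hash_comps++;
        if (chain->hash != key_hash)
            continue;               /* different hash: skip the expensive compare */
        st->num_comp_calls++;
        if (strcmp(chain->data, key) == 0)
            return chain->data;
    }
    return NULL;
}
```

With a well-behaved hash function a successful lookup costs one full compare; when two different items share a hash, the full compare count climbs above the number of lookups, which is the symptom described above.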

lh_new is used to create a new LHASH structure. It is passed function pointers that are used to store and retrieve values passed into the hash table. The 'hash' function is a hashing function that will return a hashed value of its passed structure. 'cmp' is passed 2 parameters; it returns 0 if they are equal, otherwise non-zero. If there are any problems (usually malloc failures), NULL is returned, otherwise a new LHASH structure is returned. The hash value is normally truncated to a power of 2, so make sure that your hash function returns well mixed low order bits.
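A pair of callbacks for a keyed structure might look like the following sketch. The struct and all toy_ names are hypothetical, and toy_strhash() is a simple multiply-by-33 string hash used only as a stand-in so the example compiles without the library; in real use lh_strhash would normally take its place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_user {
    const char *name;           /* key field */
    int uid;                    /* payload, ignored by hash and compare */
};

/* Stand-in string hash (multiply by 33 and add), not the library's. */
unsigned long toy_strhash(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* 'hash' callback: hash only the key field of the structure. */
unsigned long toy_user_hash(char *data)
{
    const struct toy_user *u = (const struct toy_user *)data;

    assert(u != NULL && u->name != NULL);
    return toy_strhash(u->name);
}

/* 'cmp' callback: 0 when the keys are equal, non-zero otherwise. */
int toy_user_cmp(char *a, char *b)
{
    const struct toy_user *ua = (const struct toy_user *)a;
    const struct toy_user *ub = (const struct toy_user *)b;

    return strcmp(ua->name, ub->name);
}
```

The two functions would then be handed to the library as lh_new(toy_user_hash, toy_user_cmp). Both must agree on what the key is: two structures that compare equal have to hash to the same value.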

lh_free free()s an LHASH structure. If there is malloced data in the hash table, it will not be freed. Consider using the lh_doall function to deallocate any remaining entries in the hash table.

lh_insert inserts the data pointed to by data into the lh hash table. If an equal entry is already in the hash table, it is replaced and the old value is returned. A NULL is returned if the new entry does not clash with an entry already in the table (the normal case) or on a malloc() failure (perhaps I should change this....). The 'char *data' is exactly what is passed to the hash and comparison functions specified in lh_new().

NOTE: the entry into the hash table is made by assigning the address of the data, i.e. the value of data, into the hash table. This means that if you write fifty different entries into a buffer and add the buffer to the hash table fifty times, you get fifty pointers to the buffer, which contains whatever the last thing you put in there is.

Moral of the story: don't re-use strings you are passing as arguments to lh_insert.
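The pitfall can be reproduced without the library at all. In this sketch a two-element array of pointers stands in for the hash table, and toy_copy_string() and toy_store_two() are hypothetical helpers; the point is only the difference between storing the buffer's address and storing a private copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy so that each stored entry owns its own storage. */
char *toy_copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/*
 * Write "one" and then "two" into buf, storing a pointer after each write.
 * With copy == 0 the address of buf itself is stored (what an insert of buf
 * would record); with copy != 0 a duplicate is stored instead.
 * buf must have room for at least 4 bytes. Returns 0, or -1 on malloc failure.
 */
int toy_store_two(char *buf, char *slots[2], int copy)
{
    static const char *words[2] = { "one", "two" };
    int i;

    assert(buf != NULL && slots != NULL);
    for (i = 0; i < 2; i++) {
        strcpy(buf, words[i]);
        slots[i] = copy ? toy_copy_string(buf) : buf;
        if (slots[i] == NULL) {
            while (i-- > 0)         /* only reachable in copy mode */
                free(slots[i]);
            return -1;
        }
    }
    return 0;
}
```

Without copying, both slots point at buf and both read "two". With copying, the slots hold "one" and "two", and the caller is responsible for freeing them later.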

lh_delete deletes an entry from the hash table. The value being deleted is returned. NULL is returned if there is no such value in the hash table.

lh_retrieve looks for 'data' in the hash table; if present, it is returned, else NULL is returned. The way this routine would normally be used is that a dummy structure would have its key fields populated and then ret=lh_retrieve(hash,&dummy); would be called. 'ret' would now be a pointer to a fully populated structure.

lh_doall will, for every entry in the hash table, call function 'func' with the data item as its parameter. This function can be quite useful when used as follows.

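	/* Free one entry; STUFF_free() is the application's own destructor. */
	void cleanup(STUFF *a)
		{ STUFF_free(a); }

	lh_doall(hash,cleanup);	/* release every stored entry */
	lh_free(hash);		/* then release the table itself */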

This can be used to free all the entries; lh_free() then cleans up the 'buckets' that point to nothing. Be careful when doing this. If you delete entries from the hash table in the callback function, the table may decrease in size, moving the item that you are currently on down lower in the hash table. This could cause some entries to be skipped. The best solution to this problem is to set lh->down_load=0 before you start. This will stop the hash table ever being decreased in size.

lh_doall_arg is the same as lh_doall except that the function called will be passed 'arg' as the second argument.

lh_strhash is a demo string hashing function. Since the LHASH routines would normally be passed structures, this routine would not normally be passed to lh_new(), rather it would be used in the function passed to lh_new().

The next three routines print out various statistics about the state of the passed hash table. These numbers are all kept in the LHASH structure.

lh_stats prints out statistics on the size of the hash table, how many entries are in it, and the number and result of calls to the routines in this library.

lh_node_stats prints the number of entries for each 'bucket' in the hash table.

lh_node_usage_stats prints out a short summary of the state of the hash table. It prints what I call the 'load' and the 'actual load'. The load is the average number of data items per 'bucket' in the hash table. The 'actual load' is the average number of items per 'bucket', but only for buckets which contain entries. So the 'actual load' is the average number of searches that will be needed to find an item in the hash table, while the 'load' is the average number that will be done to record a miss.
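The two averages can be computed from the per-bucket item counts as in the sketch below. The function names are hypothetical and the real routine's output format is not reproduced; this only shows the arithmetic behind the two figures.

```c
#include <assert.h>
#include <stddef.h>

/* 'load': average number of items per bucket, counting empty buckets. */
double toy_load(const unsigned long *counts, size_t num_buckets)
{
    unsigned long total = 0;
    size_t i;

    assert(counts != NULL && num_buckets > 0);
    for (i = 0; i < num_buckets; i++)
        total += counts[i];
    return (double)total / (double)num_buckets;
}

/* 'actual load': average items per bucket over non-empty buckets only. */
double toy_actual_load(const unsigned long *counts, size_t num_buckets)
{
    unsigned long total = 0, used = 0;
    size_t i;

    assert(counts != NULL && num_buckets > 0);
    for (i = 0; i < num_buckets; i++) {
        if (counts[i] != 0) {
            total += counts[i];
            used++;
        }
    }
    return used != 0 ? (double)total / (double)used : 0.0;
}
```

For bucket counts of 0, 2, 0 and 4, the load is 6/4 = 1.5 while the actual load is 6/2 = 3. A large gap between the two suggests that items are piling up in a few buckets.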