A hash algorithm transforms the input data to a string of characters of fixed length. The length of the input data can be variable but the result of applying the algorithm is always of fixed length.
In addition, this algorithm has a peculiarity that makes it very useful: it works only in one direction. It is practically impossible to obtain the input data from the output of the algorithm.
The hash algorithm is used when we have an unlimited set of input values and want to get a limited set of result values. Normally, the input values will be strings of variable length characters, which we will convert into strings of fixed length. However, these types of functions support all types of input data. In addition, the results of the function can be delimited to a defined set of characters: integers and alphanumeric.
In the following image you can see the behaviour of a hash function:
A good hash function will have the following characteristics:
- Generate unique values for each input: The probability of getting the same result for two different inputs is very low, ideally zero. That is, for two different possible inputs: e1 and e2, the probability of h(e1) = h(e2) has to be practically zero.
Although the probability of obtaining the same result for two different inputs is very low, such a situation can happen. In this case, a collision will have occurred. A good hash function has very few collisions.
- The results are repeatable: Every time we apply the hash function to the same input, we will always get the same result.
- The function is not invertible: In other words, we cannot get the input value from the output.
Uses of hashing
These types of functions are computable, that is, they can be implemented by an algorithm, which makes them very useful for certain applications. Most programming languages incorporate function libraries that allow us to implement different hash functions in our programs.
Among the most common uses of hash functions, we can find the storage of user passwords in a database. Instead of storing passwords in text, we store the result of hashing passwords. In this way, we increase the security of the database, and if it were compromised, the passwords of the users would be safe, since they could not be obtained from the results of the hash function that is what we save in the database.
They are also widely used to check if any file has been modified when we download it from the internet. This way, we ensure the integrity of the files that we send and download over the Internet, acting the hash chain as a kind of digital signature.
And of course, another utility of these functions is the organization of the data, providing an alternative for working with index tables.
Using hashing in the organization of data into files
The idea is that by applying the hash function to each value of the field that we use as an index, we would obtain the address of a block where we would store the physical address in memory where we would find the record associated with the index. In the event of a collision, we store the pair key-address in the same block. And when we were going to read a piece of data, we would have to go through the block. In any case, collisions are very unlikely and the blocks will always be very small and quick to navigate.
To understand the logic behind this way of organizing data, I will give a practical example, using the already famous file of suppliers of SuperGades, the famous supermarket of Antonio that we have been using as an example.
We want to access as quickly as possible the data of Antonio’s suppliers, and for this, we want to organize them into groups that make our lives easier and reduce the search time. Then, we have a lot of drawers, each with a letter of the alphabet. We place each supplier’s data in the drawer that has the letter by which its name begins. In practice, we have divided our data into several blocks and when we want to recover the data from a provider, we will go to look for them only in the drawer that has the letter by which its name begins.
Well, now imagine that you have a way to organize the supplier data in which you would have a drawer for each supplier. This is what, in a digital environment, the hash function allows you. Although in the few situations that we have collisions, we could find two supplier data in a drawer, perhaps up to three in an extreme situation. But in any case, the searches would be very fast, since the number of elements of the set will always be very small.
Applying this idea to the organization of our data, what we will do is create a structure known as hash tables. These are organized into blocks where the pair: key-address, the value of the key and a pointer pointing to the registry address in memory are saved to retrieve all the registry data associated with the key.
Here I leave you an example of what could be a hash table in the case of the supplier data of the SuperGades supermarket, taking the name of the supplier as a key.
Applying the hash function to the content of the key (in this case, the name of the provider), the address of a block is obtained. In this block, the key-address pair is stored. In the example above it can be seen that for the values of the key: “Fresh Bakery” and “Smith fruits” a collision occurs, and both key-address pairs are saved in the same block.
To access the data associated with a key, we would follow the following steps:
- We applied the hash function to the value of the search key, so we would get the address of the block where to search.
- In this block, we look for the address to the registry data associated with the key. If we had collisions we would have to perform a sequential search, but as I have already indicated before the number of elements in the block will be very small, since collisions are very unlikely.
- With the address recovered, we position ourselves where it indicates us and read the data.
If we want to enter new data, we proceed as follows:
- We save the data in the file and we are left with the address that points to them.
- We apply the hash function to the content of the key that we are using, and with it, we obtain the address of a block.
- In that block, we will save the value of the key and the address (that of the first point) that points to where the associated data in the file begins.
This post is part of the collection “Data Access and Storage Systems”. You can see the index of this collection here