Indexed access

After the random access files I explained in the previous post, indexed access appears to correct the main drawbacks of those. Random access files meant a considerable reduction in the search time of data compared with the sequential access files. However, you had to know exactly the registration number of the data you are looking for, which is far to be efficient, since normally we will want to search by values in the different fields.

Let’s see it with our character Antonio, who changed from sequential access files to random access files. It turns out that when searching for a supplier by the registration number, he is wrong many times as he hardly remembers the position of this supplier in the file. He would like to be able to search for the supplier’s data, by the name of the supplier or by the name of the contact, but doing by the registration number of the suppliers usually ended making a mistake.

To solve this issue, we started working with indexes. An index is nothing more than a field that is used as a key to speed up the search for records. In random access files, the index was the record number and now we want the index to be a field in the record.

Basically, we build an additional structure that, for the values of the field chosen as an index, collects all the starting addresses of the records that have that value. We will have to create a structure for each index that we choose, in the case at hand, two structures: one for the name field of the provider and another for the contact field of the supplier.

Good thing is that these structures are loaded into the RAM and therefore the access is much faster than with the sequential files loaded in the secondary memory (hard disks).

When a field is used as an index, that field is said to be indexed. Similarly, we talk about indexed access when we work with these index structures to access data more quickly.

Indexed access example

After explaining sequential access and random access files, it is the turn of the indexed access. Remember that, although random access meant an improvement in the speed of data access, we were forced to know the registration number of the record we were looking for. It is through this parameter that we access the memory area where the data we are looking for is stored. This way of working, even if it is fast, is also quite cumbersome, in practice what is needed is to be able to search by the values of a parameter, not by the registration number.

To understand it better, let’s see an example of our old acquaintance Antonio and his supermarket Gades.

Step 1: The initial file

We are going to slightly modify the supplier file that we had been using, to better explain the index structures. To do this, I represent the file as a table where its row is equivalent to a record on the file, that is conformed by a collection of lines. I think this representation will be better understood.

The supplier file would be:

Supplier data
Supplier data

We are not going to use the registration number (#id), but I keep it so that the structure is understood and it is appreciated that we start from the file ordered by registration number.

Step 2: Index by Supplier field

In this case, we are talking about an index without duplicates, we assume that there cannot be two suppliers with the same name, that is, there cannot be two records that have the same value for the supplier field.

The structure created registers all the values of the supplier field (keys) with its corresponding record address. That way, using the values stored in one field of the record, we access the whole record.

key-address structure
Key-address structure

The relationship between this index structure and the records in our archive would be:

Indexes and records
Indexes and records

Each address associated with a key in the structure created points to the memory address of the record whose value in the indexed field is equal to the key.

Step 3: Using the index structure for searches

With the index structure created, Antonio can now search for a supplier by name. As this structure is in the system memory, the query will be fast, and we will get the address in secondary memory where to go to look for the record with the selected supplier name.

At the moment, we are going through the key structure sequentially, but we will see later that there are more efficient ways to do it.

Step 4: Indexing by the contact field

The main difference between the contact field and supplier field, which we used as an index in step 2, is that contact admits duplicates. In practice, this means that for one key we can have several records that have the key in the indexed field, and then we will need the address of each of those records.

In the example that we have been handling, the contacts of suppliers “Green way sugar” and “Asparagus Gin” have the same name, it just so happens that they are both called Elizabeth Roberts, what a coincidence!

Index that admits  duplicates
Index that admits duplicates

It can be seen how the key “Elizabeth Roberts” has two addresses, each one pointing to a record of those that contain this key in the contact field (indexed). In this case, we are using an index that admits duplicates.

Indexed sequential access

In the previous example of indexed access, we did not sort the data, which forced us to have the registration address for all possible values of the key. If the user entered a value that we did not have registered in the index structure, it would be impossible to access the record. We can solve this with indexed sequential access.

If we sort both the data of the file and those of the key structure, by the values of the indexed field, when we do not have one of those values in the index table, we can use the index of the nearest previous value. The address associated with that index, will direct us to a record in the file, close to the one we are looking for and start a sequential search from that record instead of from the beginning of the file.

When in our index structure we have a record for each value of the indexed field, we will have a dense index. Conversely, if there are values in the indexed field that have no record in the key structure, we would have a sparse index.

By having the data sorted, we can work with scarce indexes. In practice, it will allow us to locate ourselves in a registry close to the one we are looking for, from which, we can reach the record we are looking for with a short sequential search.

Indexed sequential access example

To understand indexed sequential access, it is convenient to see it in practice. For this, we can continue with the following exercise, with our old acquaintance Antonio and his supermarket Gades.

Step 1: The initial file

We continue with Antonio and his supplier file. Now, sorting both, the key table and the data, we got the following data:

Initial data
Initial data

Remember that the supplier file has a thousand records and that we are using the supplier field as a key.

At first glance, it may seem that we do not gain much by defining the structure of indexes/keys in parallel, after all, we would have the same number of records with only one less field. The reality is that the supplier file would have many more fields that I do not write in this exercise as a matter of practicality, to keep it simple and didactic. However, imagine that the supplier file also has the email, the address, the date of business started, etc. In this case, the difference in sizes would already begin to be significant.

Step 2: Dense index access

Remember that there is a dense index when for each value of the indexed field, there is an index record. In our example, we would have a dense index when we have a key for each value in the supplier field. On the contrary, it is enough that there is a supplier that does not have a key in the index table, to have a sparse index.

The index table is loaded into the system memory, the RAM, and therefore the access to it will be much faster than the access to the supplier file in the secondary memory, on the hard disk.

Using a dense index, we go through the index table until we found the key sought, and using the address for that key, we go to look for the corresponding record in the supplier file. We reduce the time searching in secondary memory, going directly to the desired register.

Dense index
Dense index

Step 3: Enter and delete data

Up to this point, we have seen how creating an index structure can significantly improve the efficiency in accessing data, but there are two other tasks that must also be performed: the registration and deletion of data.

Each time we enter a new supplier in our file, we will have to enter its name and address in the index table, making sure that records remain sorted, that is, we must insert the data in the appropriate position so all the table remain sorted.

Removing data we face a similar situation, we have to act both in the data file and in the index structure.

Step 4: Sparse index access by blocks

Working with sparse indices, we have values from the indexed field that are not registered in the index table. In this case, we have sequential access by blocks, we position in the registry closest to the one sought, and from this, we will sequentially go through the file until we find the data we are looking for.

Let’s see how it works with an example in the supplier file. To do this, I have added some records between “Smith fruits” and “Southern vegetables”

Sparse index
Sparse index

If we want to access the data of the supplier “Sos rice”, using the table of indexes, we would place ourselves on the nearest record: “Smith fruits” in the table of suppliers, and from there, we would go through the records sequentially until we find the one we are looking for: “Sos rice”.

With this example, I think it’s easy to understand why it is named as access by blocks. By not having a key for all records, each key goes on to address a group of records, a block that begins with the record addressed by the key and includes all consecutive records that do not have a key until it reaches a record that does have a key in the index table.

Step 5: Sparse or dense index?

Normally we would use a dense index if there are few updates and data insertions. In this situation, we shouldn’t be updating the index table too often.

We would use sparse indexes when queries are rare and we don’t have a great need for data recovery speed. We would let the supplier file grow, monitoring the size that the blocks are taking, and we would update the index table periodically, as large blocks appeared that would have to be segmented.

NOTE:

This post is part of the collection “Data Access and Storage Systems”. You can see the index of this collection here