Data are crucial for organizations, which cannot afford to lose them to a possible storage system failure. For this reason, in addition to data redundancy systems, companies implement backup systems, which make copies of the data on additional media. That way, in the event of a serious failure of the business systems, the backup data would be available to recover business activity without significant loss of information and without damaging the business.
The most common medium used by these systems is magnetic tape, which provides sequential access to data. Tapes are therefore less versatile than hard drives, which provide random access, but their storage cost is lower.
Storage architectures can be classified into two types according to the kind of access required: direct access to the hard drive, or access at the file level. Let's look at both types.
Hard drive access: Also known as block access, this type of access occurs when the client requires direct access to the disk. The file system of the client computer manages the access to the disks. There are three possible situations:
1.- Internal disk: The client accesses its internal disk directly through the internal buses. A personal computer would likely use ATA or SATA, while a server would typically use SCSI.
2.- DAS (Direct attached storage): The client accesses an enclosure of hard drives directly connected to the computer.
3.- SAN (Storage Area Network): The client also accesses an enclosure of hard drives, but this time it is not directly connected to the computer; it is reached over a network.
File access: The client works at the file level, sending requests to a NAS (Network attached storage) server that manages all access to the disks.
In the following figure you can see the four storage architectures discussed:
A RAID (Redundant array of independent disks), or RAID system, is a set of redundant and independent disks. Redundant, because they store duplicated information to ensure fault tolerance and improve availability. And independent, because there is no dependence between them, which allows us to replace any disk in the set with a new one that will work seamlessly with the disks the array already had.
This type of disk grouping aims to improve on the performance achievable with a single disk. Depending on the type of combination, we can improve security, storage capacity or data availability. The RAID controller ensures that, to the server, this combination of disks appears as a single one, under the same drive letter.
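One way redundancy can be achieved without fully duplicating every disk is XOR parity, the idea used by RAID 5. This is a minimal sketch of that idea in Python (the block values are made up for illustration; a real controller works at the hardware level):

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together (the parity idea behind RAID 5)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Two data blocks and the parity block computed from them.
data_a = b"\x01\x02\x03\x04"
data_b = b"\x10\x20\x30\x40"
parity = xor_blocks([data_a, data_b])

# If the disk holding data_b fails, its contents can be rebuilt
# from the surviving data block and the parity block.
recovered = xor_blocks([data_a, parity])
assert recovered == data_b
```

Because XOR is its own inverse, any single lost block can be reconstructed from the remaining blocks plus the parity, which is why one disk's worth of parity is enough to tolerate one disk failure.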
A hash algorithm transforms input data into a string of characters of fixed length. The input data can be of variable length, but the result of applying the algorithm always has a fixed length.
In addition, this algorithm has a peculiarity that makes it very useful: it works only in one direction. It is practically impossible to obtain the input data from the output of the algorithm.
A hash algorithm is used when we have an unlimited set of input values and want to map them to a limited set of result values. Normally, the input values will be character strings of variable length, which we convert into strings of fixed length. However, these functions support all types of input data. In addition, the results of the function can be restricted to a defined set of characters, such as integers and alphanumeric characters.
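The fixed-length property is easy to see with a standard hash function from Python's `hashlib` module (SHA-256 here is just one example of such a function):

```python
import hashlib

# Inputs of very different lengths all map to a 64-character hex digest.
for text in ["a", "hello", "a much longer input than the other two strings"]:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    print(len(digest), digest[:16], "...")
```

Every line printed shows a length of 64, regardless of the input length, and there is no practical way to go from the digest back to the original text.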
In the following image you can see the behaviour of a hash function:
Now that we know how binary search trees work, let's study how to work with them. The operations that can be performed on a binary search tree are: searching for, inserting and deleting a node.
Searching for a node
When we want to retrieve data from our binary search tree, we take advantage of the fact that these structures are ordered. We start by checking the root node, and if it is the one we are looking for, we are done. If it is not, we move to its left or right child, depending on whether the data we are looking for is less than or greater than that contained in the parent node. And so we continue until we find the data or finish traversing the tree without finding it.
This process is recursive, since when we move to a child node we can consider it the root node of a new tree. The search process could be expressed as:
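A sketch of that recursive search in Python (nodes are represented here as plain dicts with "value", "left" and "right" keys; this representation is illustrative, not from the original post):

```python
def search(node, value):
    """Recursive search in a binary search tree."""
    if node is None:
        return None                          # empty subtree: value not found
    if value == node["value"]:
        return node                          # found the node
    if value < node["value"]:
        return search(node["left"], value)   # smaller values live on the left
    return search(node["right"], value)      # greater values live on the right
```

Each recursive call discards one whole subtree, which is what makes the search fast on a balanced tree.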
A binary search tree is a sorted data structure where each record can be linked to a maximum of two other records. We are going to pay special attention to binary search trees, as they are very popular and widely used in databases. As I anticipated in the previous post, binary trees are of order 2, that is, each node can have a maximum of two children. Additionally, to be a search tree, all nodes have to meet the following conditions:
If the node has a left child, the child's value has to be smaller than the value of its parent.
If the node has a right child, the child's value has to be greater than the value of its parent.
This requires that the data we are saving be sortable, so we can determine whether one value is bigger or smaller than another. Given the conditions above, duplicates are not allowed in a binary search tree: the data have to be unique.
Let’s look at it with a simple example where the data are integers. The binary search tree for our example is:
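A tree like the example can be built by repeated insertion, which follows the same ordering rules: go left for smaller values, right for greater ones. This is a sketch with made-up integer values (the same dict-based node representation as before, chosen for illustration):

```python
def insert(node, value):
    """Insert a value into a BST, returning the (possibly new) subtree root."""
    if node is None:
        return {"value": value, "left": None, "right": None}
    if value < node["value"]:
        node["left"] = insert(node["left"], value)
    elif value > node["value"]:
        node["right"] = insert(node["right"], value)
    # equal values fall through untouched: duplicates are not allowed
    return node

# Build a small tree of integers; the first value inserted becomes the root.
root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
```

After these insertions, 8 is the root, 3 and 10 are its left and right children, and 1 and 6 hang under 3, satisfying both ordering conditions at every node.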
In the indexed files seen in previous posts, the table of indexes is accessed sequentially, with the drawbacks that I already mentioned this type of access has. However, there is a much more efficient indexing technique than sequential indexing: tree indexing, which significantly reduces search time. In this post, I will explain this technique.
Using sequential access, when looking for data we have to go through the index file sequentially until we find the entry we are searching for, obtaining the memory address of the beginning of the record in the data file. As index tables are much smaller than the files with all the data, indexes give us a considerable improvement, even if the search through the table is sequential.
By changing the index table from sequential access to an organization of the indexes in a tree, we will be able to access the indexes much more quickly. These types of structures are especially suited to speeding up searches; let's see how.
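The speed difference can be illustrated with binary search over a sorted index, the array analogue of a search tree: instead of scanning every entry, each comparison halves the remaining candidates. The index contents here are hypothetical, and Python's `bisect` module stands in for the tree lookup:

```python
import bisect

# A hypothetical index: (key, byte offset in the data file) pairs, sorted by key.
index = [("ana", 0), ("carlos", 128), ("lucia", 256), ("pedro", 384)]
keys = [k for k, _ in index]

def lookup(key):
    """Locate a key with binary search instead of scanning the index table."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]   # offset of the record in the data file
    return None              # key not present in the index
```

A sequential scan of an index with a million entries inspects up to a million keys; a balanced tree (or binary search, as here) needs around twenty comparisons.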
After the random access files I explained in the previous post, indexed access appeared to correct their main drawbacks. Random access files meant a considerable reduction in data search time compared with sequential access files. However, you had to know the exact record number of the data you were looking for, which is far from efficient, since normally we will want to search by the values of the different fields.
Let’s see it with our character Antonio, who changed from sequential access files to random access files. It turns out that when searching for a supplier by record number, he often gets it wrong, as he hardly remembers the position of each supplier in the file. He would like to be able to search for a supplier’s data by the supplier’s name or by the contact’s name, because searching by record number usually ends in a mistake.
Random access files allow us to go directly to the desired record, without having to read all the previous ones first. In this way they solve the main limitation of sequential access files, which was precisely their method of accessing the data: the need to go through the entire file from the beginning until reaching the point that interested us. This limitation is very time consuming and grows as the size of the file increases. Working with sequential access files with a high number of records, reading times could become too long, to the point of not being workable and making a migration to random access files worthwhile.
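The jump to a record works because records have a fixed length, so the position of record N is simply N times that length. A minimal sketch in Python, using an in-memory buffer and a made-up record size to stand in for a real file:

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record length, in bytes

def read_record(f, record_number):
    """Jump straight to record N without reading the records before it."""
    f.seek(record_number * RECORD_SIZE)
    return f.read(RECORD_SIZE)

# Simulate a random access file holding three fixed-length records.
f = io.BytesIO(b"record00record01record02")
third = read_record(f, 2)   # → b"record02"
```

The `seek` is what distinguishes random from sequential access: the cost of reaching any record is constant, no matter how large the file grows.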
Sequential access files were the first digital storage solution. These files present a certain structure to store the data, but they are still far from being databases.
Let’s place ourselves at the beginning of the second half of the twentieth century, when our protagonist Antonio begins to digitize the data of his business, the SuperGades store. The option available then was simply to transfer the data from a paper file to a digital file, using a very simple word processor with very few features, in which the data would be written without any format. This file is also called an archive, taking the name of its analogue antecedent: a physical filing cabinet with drawers where we kept the folders with the data.