Nov 22 2021
The Efficiency of Data Storage
As our world becomes increasingly digital, math becomes more and more important (not that it wasn’t always important). Even in ancient times, math was a critical technology improving our ability to predict the seasons, design buildings and roads, and have a functioning economy. In recent decades our world has been becoming increasingly virtual and digital, run by mathematical algorithms, simulations, and digital representations. We are increasingly building our world using methods that are driven by computers, and the clear trend in technology is toward a greater meshing of the virtual with the physical. One possible future destination of this trend is programmable matter, in which the physical world literally becomes a manifestation of a digital creation.
What this means is that the impact of even tiny incremental improvements in the efficiency of the underlying technology, computers, has increasingly powerful reverberations throughout our economy and our world. The nerds have truly inherited the Earth. This is why it is interesting science news that computer scientists at MIT have developed a tweak that may improve the efficiency with this computers store and retrieve data. William Kuszmaul and his team have demonstrated a way to improve what is known as linear probing hash tables. The underlying concept is interesting, at least for those curious about how the increasingly ubiquitous computer technology works.
Hash tables were developed in 1954 as a way for computers to store and locate data. When given a piece of data to store, the computer will calculate the “hash function of x, h(x)”. This will generate an essentially random number from 1 to 10,000. The computer then goes to that location in the sequential data array and stores the data there. If that location is already occupied by data then it probes forward until it finds an open slot and it puts the data there. When searching for the data to retrieve it does the same thing – goes to the assigned location and if the data is not there it probes forward until it finds it. If it encounters an open position first it concludes the data has been deleted.
When deleting data it does not entirely remove it and create a newly opened position, because that will cause confusion for all future searchers for data. If while probing forward the computer encounters an open slot (created from deleting information) then it may falsely conclude the data it was searching for does not exist. Therefore computers will put a placeholder into a data position with deleted information, called a tombstone, to let the computer know that there was data here that has been deleted, so keep probing for the information you are looking for.
This all seems a bit clumsy, but it works, and has been the mainstay of how data is stored and retrieved from computers for more than half a century. There is, however, a major drawback of this process. The more slots get filled with data, the more the computer must probe forward to search for the information it is looking for. Long blocks of completely filled data positions is known as “clusters” and they can dramatically slow down data retrieval. This slowing is not linear, it’s “quadratic”, which means it can become impractical. One way to avoid this is to not use significant portions of the data positions, but that introduces its own inefficiency. Now you need more hardware for data storage in order to avoid significant clustering.
The new proposal by the MIT scientists they are calling “graveyard hashing”. First, they demonstrate that placing tombstones in the position of deleted data can have an anti-clustering effect. Based upon this they propose a process of deliberately placing tombstones in the data array (graveyard hashing) in such a way as to prevent clustering and maintain a high degree of efficiency in data storage and retrieval. In fact they argue that clustering can be entirely eliminated, given certain mathematical parameters.
I know this is all terribly wonky and narrow, although I do find the basic concept of how data storage works interesting. The bigger point that caught my attention is that this was a science news item, and how that reflects on where our society is in terms of technology. Math nerds are now billionaires. Nuanced mathematical or statistical tweaks could represent trillions of dollars to corporations and governments over time.
Further, warfare may increasingly be taking place in the cyber battlefield. The iconic image of a modern warrior may be shifting from a physically intimating solider with a large and powerful gun to a coding jockey sitting in front of a keyboard and monitor.