Part 1 talked about Data; Part 3 will talk about Data Analytics and Data Science. This part is about Data Storage.
For the longest time, relational database management systems (RDBMS) were (and still are in many cases) the preferred way to store and manage data. In an RDBMS, data is stored in a very structured manner, with well-defined data fields and with specified value types, in a row and column layout. Further, multiple sets of data values, each maintained in what is called a table, can be connected with each other (the relational aspect) to minimize duplication of data and allow insights to be derived from multiple data sets (tables) joined together.
Storing, reading and manipulating data in an RDBMS is very straightforward, using the user-friendly Structured Query Language (SQL). Data manipulation operations run very fast even on huge volumes of data, partly because of the original design and partly because of the efficient database engines developed and refined over the years by the likes of Oracle and Microsoft.
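To make the relational idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, names and amounts are invented for illustration; the point is that customer details live in one table, orders in another, and a SQL join connects them without duplicating data.

```python
import sqlite3

# In-memory database; table contents are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables: customers, and the orders that reference them.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")
cur.execute("INSERT INTO orders VALUES (1, 1, 25.0), (2, 1, 40.0), (3, 2, 15.0)")

# The relational aspect: join the two tables to derive an insight
# (total order amount per customer) across both data sets.
rows = cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 65.0), ('Grace', 15.0)]
```

Each customer's name is stored exactly once; the join reconstructs the combined view on demand, which is the duplication-minimizing design the relational model is built around.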
With the emergence of Big Data, if you think about the three main V-characteristics (Volume, Velocity and Variety), one stands out as a problem for the otherwise almost-perfect RDBMS. That V is Variety.
In an RDBMS, one of the constraints is that the structure of the data needs to be predefined and well-defined. If you design for a set of data that will be text and numbers of a specific size and in a specific order, that's exactly how the data must be received or the system will fail. If somewhere along the way the order or size of the data values changes, or if new data values are introduced, a significant amount of redesign is required to the database and the applications that read from or write to it.
With smart design, some of these limitations can be overcome, but not to the extent of the variation in data formats encountered in Big Data. Imagine the wide range of data sources in the connected world today. You have structured transactions, which are well-defined and well-suited to an RDBMS. But you also have social media feeds – the text, images, audio and video that users post – video and images streaming from cameras, sound and speech from voice control devices, and much more disconnected and diverse data.
All this data has to be stored in a manner that can handle the unpredictability and yet can be retrieved and processed and analyzed for insights as required.
For this purpose, a data storage paradigm known as NoSQL has become popular and is slowly taking the place of RDBMS systems in a growing number of solutions. Although the design has been around for many years, it saw little adoption because, until recently, most applications dealt with highly structured data.
NoSQL databases (MongoDB and AWS DynamoDB are probably the best known) score better than an RDBMS when you need to store huge amounts of unstructured data: data with no constraints on type and size, that changes frequently, and that has no defined relationships to maintain between data sets. In two words – Big Data.
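The document-style model that databases like MongoDB use can be sketched with plain Python dictionaries (this is an illustration of the idea, not an actual MongoDB client; the field names and records are invented). Each record carries only the fields it has, and new fields can appear at any time without a schema change.

```python
import json

# A "collection" of documents with no shared schema: a text post, an
# image post and a video post each carry different fields.
posts = [
    {"user": "ada", "text": "Hello world"},
    {"user": "grace", "image_url": "http://example.com/cat.jpg", "likes": 12},
    {"user": "ada", "video_url": "http://example.com/demo.mp4",
     "duration_sec": 42, "tags": ["demo", "launch"]},
]

# Queries inspect whatever fields happen to be present.
with_media = [p for p in posts if "image_url" in p or "video_url" in p]
print(len(with_media))  # 2

# Each document serializes independently, e.g. as JSON -- document
# databases store records in a similar self-describing form
# (MongoDB uses BSON, a binary variant of JSON).
print(json.dumps(posts[1]))
```

Contrast this with the RDBMS example earlier: here, adding a `likes` or `tags` field to one record requires no redesign and breaks nothing else.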
Data Warehouses and Data Lakes
Based on size, frequency and nature of access, we have broadly two categories of data stores: transaction databases and data warehouses. Transaction databases are generally lean and mean, because they serve end-user applications where a quick response time is important. Data warehouses are large lumbering beasts that store all the transaction data from time immemorial for reporting and analysis.
Just as we have relational and NoSQL transaction databases, we have data warehouses and data lakes. Both are used for storing large amounts of data, but a data lake is a vast pool of raw data whose purpose is not yet defined, while a data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
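The lake/warehouse distinction can be illustrated with a small sketch (the event records and field names are invented). The raw events are kept as-is with no purpose decided in advance – that is the lake. The processed aggregate, filtered and shaped for one specific report, is what would land in a warehouse.

```python
from collections import defaultdict

# Data lake: raw, heterogeneous events stored as received,
# with no predefined purpose.
raw_events = [
    {"type": "click", "user": "u1", "ts": 1000},
    {"type": "purchase", "user": "u1", "ts": 1005, "amount": 30.0},
    {"type": "click", "user": "u2", "ts": 1010},
    {"type": "purchase", "user": "u2", "ts": 1020, "amount": 12.5},
    {"type": "purchase", "user": "u1", "ts": 1030, "amount": 7.5},
]

# Data warehouse: the same data filtered and processed for one defined
# purpose -- here, revenue per user for a sales report.
revenue = defaultdict(float)
for e in raw_events:
    if e["type"] == "purchase":
        revenue[e["user"]] += e["amount"]

print(dict(revenue))  # {'u1': 37.5, 'u2': 12.5}
```

Note that the clicks are still in the lake, untouched; a future analysis nobody has planned yet could still use them, which is exactly why the raw data is kept.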
Go on to read Part 3 about Data Analytics and Data Science.