Five Tips for Preparing Your Enterprise Storage Solution for Big Data
by Michael Fiorenza on October 25, 2016
Big data is exploding. According to IBM, the human race is producing 2.5 quintillion bytes (2.5 million terabytes) of data every day. Businesses all over the world, both large and small, are beginning to tap into that glut of information. Whether it's a healthcare company extracting data from the medical histories of millions of patients in order to help clinicians diagnose and treat illnesses, or a security firm analyzing images from surveillance cameras or drones, more and more organizations are routinely handling big datasets as a normal part of their operations.
In order for all that data to be of any use, it must be stored in a form that allows it to be retrieved as needed and analyzed in a timely manner. In fact, that's pretty much the definition of big data. TechTarget describes it as "any voluminous amount of structured, semistructured and unstructured data that has the potential to be mined for information." But implementing a storage system that can adequately store, retrieve, and present huge amounts of data for analysis is no easy task.
Issues a Big Data Storage Solution Must Address
Major challenges in designing a storage solution for big data include cost, scalability, and performance. The sheer amount of information, along with the need to regularly increase the available storage because of continual growth in the amount of data to be retained, puts pressure on a company's data center to make extended use of commodity storage such as hard disk drive (HDD) arrays. On the other hand, the IOPS (input/output operations per second) performance required to do analytics on huge datasets is often beyond the capabilities of HDD technology.
Moreover, the normal practice of many organizations is to never discard data once it has been stored, since unforeseen uses for that information may arise in the future. That means the enterprise's storage infrastructure must be almost infinitely scalable. And that scaling must be accomplished without disrupting normal operations of the data center.
Another significant issue in creating an enterprise storage solution for big data is the fact that the design must handle different formats. Much of the data, from sources such as streaming video or surveillance images, may take the form of large unstructured datasets. On the other hand, sensor monitoring or telemetry applications may generate trillions of small files. The storage system design that best accommodates one set of use cases may be entirely inadequate for a different set. Yet, according to George Crump, president of IT analyst firm Storage Switzerland, most organizations will eventually need to handle both types of format.
In addition to the configuration of the incoming data, a storage system must also take into account the uses to which the data will be put. Some of it, such as information in large databases, must reside in high-performance storage that allows it to be retrieved and perhaps analyzed in real time. Other data is more archival in nature. Seldom used, it can be safely tucked away in low-performance, low-cost storage. In order to be cost effective, a storage system, along with the software that manages it, must be able to discriminate between these different availability requirements.
Tips For Designing Your Company's Big Data Storage Solution
If your company is moving toward fielding applications that make use of big data, you'll need to pay special attention to the issues outlined above. Here are some tips for how you can structure your enterprise storage solution to handle the requirements big data will impose on it.
1. Determine the storage approach your applications require
The two main storage types commonly used for big data are scale-out NAS (network attached storage) and object storage. According to Paul Speciale, vice president of products for Amplidata, "NAS systems have become optimized for fast file serving, and can typically perform effectively where files are small (Kilobytes) to moderate (gigabytes) in size."
Object storage, on the other hand, is ideally suited to unstructured datasets of practically unlimited size. With its flat address space, and metadata that's stored with the data, the object storage approach allows quick retrieval of large datasets for search, analytics, and data mining.
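To make the contrast concrete, here is a minimal sketch, purely illustrative and not any vendor's actual API, of the two ideas that distinguish object storage: a flat address space keyed by object IDs, and metadata stored alongside each object so data can be located without a directory hierarchy.

```python
import hashlib

class ObjectStore:
    """Illustrative object store: a flat namespace in which each object
    carries its own metadata (no directory hierarchy)."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data: bytes, metadata: dict) -> str:
        # A content-derived ID keeps the namespace flat; there are no paths.
        oid = hashlib.sha256(data).hexdigest()
        self._objects[oid] = (data, metadata)
        return oid

    def get(self, oid: str):
        return self._objects[oid]

    def find(self, **criteria):
        # Because metadata lives with the data, search and analytics can
        # filter objects without consulting a separate index.
        return [oid for oid, (_, md) in self._objects.items()
                if all(md.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
oid = store.put(b"frame-0001 pixels...", {"source": "camera-7", "type": "image"})
print(store.find(source="camera-7"))
```

Real object stores (Amazon S3, for example) follow the same model at scale: a flat key space plus user-defined metadata per object.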
2. Use an appropriate mix of storage technologies
Doing analytics on large datasets requires that the compute engine be able to retrieve and examine desired subsets of that data practically instantaneously. Otherwise, the analysis process will be unbearably slow. For that reason, data that must be accessed in near real time is now usually committed to flash memory-based solid state drives (SSDs). The problem with SSDs, however, is that they remain far more costly than HDD-based storage units.
The most cost-effective storage solutions employ a mixture of SSD storage for data requiring the quickest access times, and HDD storage for data that doesn't require that level of performance.
3. Use a tiered storage architecture
Software-managed tiering is the method normally employed to effectively manage a mixture of slow-but-cheap and fast-but-expensive storage units. Tier 1 (or sometimes Tier 0) represents data that must have the lowest possible access times. It is normally implemented as SSDs. Tier 2 is reserved for data that doesn't require that level of IOPS performance. The tiering software ensures that data is appropriately moved between tiers, on the fly and transparently to the user, based on the requirements of the application using it.
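The promotion logic behind tiering can be sketched roughly as follows. This is illustrative only: the threshold and time window are invented, and real tiering software operates at the block or file level rather than on Python dictionaries.

```python
import time

class TieredStore:
    """Sketch of software-managed tiering: tier1 stands in for fast,
    expensive SSD storage; tier2 for slow, cheap HDD storage."""

    def __init__(self, hot_threshold=3, window=60.0):
        self.tier1, self.tier2 = {}, {}
        self.hot_threshold = hot_threshold  # accesses within `window`
        self.window = window                # seconds
        self._accesses = {}                 # key -> list of timestamps

    def write(self, key, value):
        # New data lands on the cheap tier until it proves itself hot.
        self.tier2[key] = value

    def read(self, key):
        now = time.monotonic()
        hits = [t for t in self._accesses.get(key, []) if now - t < self.window]
        hits.append(now)
        self._accesses[key] = hits
        if key in self.tier1:
            return self.tier1[key]
        value = self.tier2[key]
        if len(hits) >= self.hot_threshold:
            # Promote hot data to the fast tier, transparently to the caller.
            self.tier1[key] = self.tier2.pop(key)
        return value
```

A production system would also demote cooled-off data back to Tier 2; the promotion path shown here is the core idea.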
According to Storage Switzerland's George Crump, "Most IT staff simply cannot manage a dozen storage systems from six different vendors. IT professionals need to drive their storage hardware requirements to one to three storage systems that cover Tier 1 and Tier 2 applications."
4. Consider using the open source Apache Hadoop or a similar distributed storage framework
Tyler Keenan, a technology writer at Upwork, says that "the Hadoop Distributed File System (HDFS) allows you to store truly massive files - tables with billions of entries, across dozens (or in some cases thousands) of inexpensive servers." HDFS is the storage element of Hadoop. The system also includes a processing engine, called MapReduce, that implements Hadoop's algorithm for analyzing the data it stores. MapReduce allows parallel (rather than serial) computations on the potentially huge mass of data being managed by HDFS, thus speeding up the process immensely.
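The map/reduce pattern is easy to see in miniature. The sketch below is an illustrative word count in Python, not Hadoop's actual Java API: the map step runs in parallel over independent chunks of input, and the reduce step merges the partial results, just as Hadoop does over data stored in HDFS blocks.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    # Map step: each worker counts words in its own chunk, independently.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    # Reduce step: merge two partial counts into one.
    a.update(b)
    return a

if __name__ == "__main__":
    chunks = [["big data big"], ["data storage"], ["big storage storage"]]
    with Pool(3) as pool:                      # map runs in parallel
        partials = pool.map(map_chunk, chunks)
    total = reduce(reduce_counts, partials, Counter())
    print(total["big"], total["storage"])      # prints: 3 3
```

Because each chunk is processed independently, adding workers (or, in Hadoop's case, cluster nodes) scales the map phase almost linearly.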
5. Store data as close as possible to the applications that use it
Moving data around takes time. And when analytics is being performed on entire datasets, getting that huge amount of data to the compute engines used to analyze it can introduce unacceptable delays into the process. That's why it's best to keep data and the applications that use it as close to one another as possible.
Sometimes this may mean implementation of a distributed processing system that allows preprocessing of portions of the data near the point at which it is generated. For example, sensor data might be aggregated and evaluated by local processors before being committed to the storage system.
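A hypothetical example of such edge preprocessing (the function name and threshold are invented for illustration): local code condenses a batch of raw sensor readings into a compact summary, so only the summary plus any anomalous samples travel to central storage.

```python
from statistics import mean

def preprocess(readings, threshold=100.0):
    """Hypothetical edge-side aggregation: reduce a batch of raw sensor
    samples to a few summary fields plus any out-of-range readings."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "anomalies": [r for r in readings if r > threshold],
    }

# A minute of raw samples collapses to a handful of fields to transmit.
print(preprocess([98.5, 99.1, 101.7, 97.3]))
```

The bandwidth saved scales with the sampling rate: shipping four fields instead of thousands of samples per sensor per hour keeps both the network and the central storage system out of the critical path.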
Another application of this principle is seen in the hybrid on-site/cloud storage solutions many organizations are now using. In this architecture, both on-premises and web-hosted storage are employed based on the character of the data and the location of the using applications. As Greg Schulz, an analyst at StorageIO Group, says, "Put the data close to where the applications using it are located; if those applications are in the cloud, then put the data in the cloud and vice versa if local."
This is just a brief overview of what it takes to prepare your enterprise storage solution for big data. If you'd like to know more, please watch our Talon FAST™ video.