In part one, we explored how Object-Based Storage (OBS) cost-effectively delivers data at scale. This article will detail the key features associated with OBS, covering extreme scalability, advanced data availability and durability, and simplified data management.
OBS platforms operate on a flat address space, so massive scalability is achieved without the overhead associated with file system hierarchies, data look-ups, or block reassembly.
With traditional file storage architectures, indexes enable scaling beyond a single folder, but as the number of files increases, the file hierarchy and its associated overhead become cumbersome, limiting performance and scalability.
Instead of indexes, OBS uses metadata to aggregate objects into buckets (or other logical associations). This delivers more efficient capacity scaling, allowing virtually unlimited amounts of data to be stored.
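To make the flat address space concrete, here is a minimal in-memory sketch. The class and method names (`FlatObjectStore`, `put`, `get`, `find`) and the sample keys and metadata are hypothetical, chosen for illustration only, and do not correspond to any particular product's API. It shows objects addressed by a bucket and key in a single lookup, with metadata rather than a directory tree used to group them.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy in-memory object store: one flat key space, no directory tree."""

    def __init__(self):
        self._objects = {}  # (bucket, key) -> StoredObject

    def put(self, bucket, key, data, **metadata):
        self._objects[(bucket, key)] = StoredObject(data, dict(metadata))

    def get(self, bucket, key):
        # A single dictionary lookup; "/" in a key is just another character.
        return self._objects[(bucket, key)].data

    def find(self, bucket, **criteria):
        """Return the keys in a bucket whose metadata matches all criteria."""
        return sorted(
            key
            for (b, key), obj in self._objects.items()
            if b == bucket
            and all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


store = FlatObjectStore()
store.put("scans", "2023/img-001.dcm", b"...", modality="MRI", patient="p17")
store.put("scans", "2023/img-002.dcm", b"...", modality="CT", patient="p17")
store.put("logs", "app.log", b"started", source="web")

mri_keys = store.find("scans", modality="MRI")
```

A production system would index metadata rather than scan every object, but the addressing model is the same: no path has to be walked to reach an object.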
Advanced Data Availability
In traditional storage architectures, Redundant Array of Independent Disks (RAID) is a common approach to ensuring data is available and accurate when it is read. Striping data with parity across multiple drives protects against the failure of one or two of them. However, once a failure occurs, performance drops dramatically during the rebuild operation, and the likelihood of other group members failing increases as well.
RAID rebuilds can take hours, or even days, and may require immediate replacement of the failed drive. If an unrecoverable read error occurs during a rebuild, data will be permanently lost, possibly placing business data and productivity at risk.
With OBS, data availability is achieved through advanced erasure coding, a technique in which data is combined with parity information, divided into chunks, and distributed across the local storage pool.
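The idea of rebuilding an object from a subset of its chunks can be sketched with the simplest possible erasure code: a single XOR parity chunk. This is a deliberate simplification and an assumption of the example; production OBS systems typically use stronger codes that tolerate several lost chunks, and the function names below (`encode`, `decode`) are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def decode(chunks, k, length):
    """Rebuild the original data when at most one chunk is missing (None)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single-parity code tolerates only one lost chunk")
    chunks = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        # XOR of all surviving chunks equals the missing chunk.
        chunks[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(chunks[:k])[:length]


payload = b"object storage keeps data readable"
stored = encode(payload, k=4)   # 5 chunks, ideally placed on 5 different drives
stored[2] = None                # simulate a failed drive
recovered = decode(stored, k=4, length=len(payload))
```

Here any four of the five chunks are enough to read the object, so a single drive failure does not interrupt access.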
Erasure coding best practices require that no single drive hold more than one chunk of an object, and that a single node never hold more chunks than an object can afford to lose. This approach ensures data availability even if multiple components fail, since only a subset of the chunks is needed to rehydrate the data.
There is no rebuild time or degraded performance, and failed storage components do not need to be replaced at the time of the read error; they can be replaced when it is convenient. Rather than focus on hardware redundancy, OBS focuses on data redundancy.
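The two placement rules above lend themselves to a simple check. The following is a sketch under assumptions: the function name `placement_violations`, the `(node, drive)` tuple layout, and the sample node and drive names are all hypothetical, and `max_lost_chunks` stands for the number of chunks an object can afford to lose under whatever code is in use.

```python
from collections import Counter


def placement_violations(placement, max_lost_chunks):
    """Check one object's chunk placement against the two rules.

    placement: list of (node, drive) tuples, one entry per chunk.
    max_lost_chunks: how many chunks the object can lose and still be read.
    Returns a list of human-readable problems (empty if the layout is fine).
    """
    problems = []
    per_drive = Counter(placement)
    per_node = Counter(node for node, _ in placement)
    for (node, drive), n in sorted(per_drive.items()):
        if n > 1:  # rule 1: at most one chunk per drive
            problems.append(f"drive {node}/{drive} holds {n} chunks")
    for node, n in sorted(per_node.items()):
        if n > max_lost_chunks:  # rule 2: losing a node must be survivable
            problems.append(
                f"node {node} holds {n} chunks, more than {max_lost_chunks}"
            )
    return problems


good = [("n1", "d1"), ("n1", "d2"), ("n2", "d1"),
        ("n2", "d2"), ("n3", "d1"), ("n3", "d2")]
bad = [("n1", "d1"), ("n1", "d1"), ("n1", "d2"), ("n2", "d1")]

good_report = placement_violations(good, max_lost_chunks=2)
bad_report = placement_violations(bad, max_lost_chunks=2)
```

With the rules satisfied, the failure of any one drive or any one node leaves enough chunks to read the object.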
An OBS system can also achieve data availability by geographically spreading data across three locations. Unlike the triple mirroring data replication model, however, the total data is not replicated to each location.
Rather, only one-third of the object data is stored in each location. This approach not only reduces network traffic but also lowers cost: maintaining this level of data availability incurs only about 67 per cent overhead, whereas triple mirroring requires replicating, storing, and managing 100 per cent of the data at each of three locations.
The geo-spread model provides very high data accessibility and resiliency at a substantially lower cost in equipment and management than traditional triple mirror data replication.
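The overhead comparison can be worked through with a little arithmetic. The chunk counts used here (15 chunks stored, any 9 sufficient to rebuild the object) are an assumption chosen because they reproduce the roughly 67 per cent figure and spread evenly over three sites; they are not taken from any product specification, and the function names are hypothetical.

```python
def overhead_pct(total_chunks, needed_chunks):
    """Extra raw capacity, as a percentage of the original data size."""
    return round(100 * (total_chunks - needed_chunks) / needed_chunks, 1)


def survives_site_loss(total_chunks, needed_chunks, sites=3):
    """True if the chunks left after losing one site can rebuild the object."""
    worst_site = -(-total_chunks // sites)  # ceiling: the fullest site
    return total_chunks - worst_site >= needed_chunks


geo_spread = overhead_pct(15, 9)     # 15 chunks stored, any 9 rebuild the object
triple_mirror = overhead_pct(3, 1)   # 3 full copies, any 1 is enough
site_loss_ok = survives_site_loss(15, 9)

print(f"geo-spread overhead:    {geo_spread}%")
print(f"triple-mirror overhead: {triple_mirror}%")
print(f"geo-spread survives the loss of one site: {site_loss_ok}")
```

Under these assumed parameters, each site holds five chunks, and losing an entire site still leaves ten chunks, one more than the nine needed. Triple mirroring stores two extra full copies, which is 200 per cent overhead on the same measure.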
Advanced Data Durability
Data durability refers to long-term data protection. A media failure such as bit rot, where a portion of the drive surface becomes unreadable and corrupts data, makes it impossible to retrieve data in its original, unaltered form. Protecting chunks as they lie dormant on disk is therefore of paramount importance in enterprise storage. Simply protecting against a complete hard drive failure (as with RAID) does not protect against the gradual failure of bits stored on magnetic media.
When combined with appropriate data scrubbing technology, OBS guards against bit failures: if a given chunk becomes corrupted, a replacement chunk can be constructed from the parity information stored in the remaining chunks that constitute the object. It isn't necessary to rebuild or replace an entire drive, just the affected data. The combination of erasure coding with data scrubbing technology achieves extreme durability.
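A scrubbing pass can be sketched as follows. This is an illustration under assumptions, not a description of any vendor's implementation: it uses SHA-256 checksums recorded at write time and the same toy single-parity XOR code as a stand-in for a real erasure code, and the function name `scrub` is hypothetical.

```python
import hashlib
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def checksum(chunk):
    return hashlib.sha256(chunk).hexdigest()


def scrub(chunks, checksums):
    """Verify every chunk against its stored checksum and repair in place.

    chunks: equal-size data chunks plus one XOR parity chunk (toy code).
    Returns the indexes of the chunks that were repaired.
    """
    bad = [i for i, c in enumerate(chunks) if checksum(c) != checksums[i]]
    if len(bad) > 1:
        raise ValueError("toy single-parity code can repair only one chunk")
    for i in bad:
        others = [c for j, c in enumerate(chunks) if j != i]
        chunks[i] = reduce(xor_bytes, others)  # rebuild from the other chunks
    return bad


data_chunks = [b"AAAA", b"BBBB", b"CCCC"]
chunks = data_chunks + [reduce(xor_bytes, data_chunks)]
checksums = [checksum(c) for c in chunks]  # recorded when the object is written

chunks[1] = b"BxBB"                        # simulate bit rot in one chunk
repaired = scrub(chunks, checksums)
```

Only the damaged chunk is rewritten; the other chunks, and the drive that held the bad one, are left alone.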
Simplified Data Management
Unlike the hierarchical file storage used in NAS environments, OBS has a flat architecture built around a namespace that collects all the objects held within the object store, even objects that reside on disparate storage system hardware and in different locations.
The namespace provides an effective and cost-efficient way to manage multiple racks of storage within one entity, thus enabling a simplified, single management solution for all data. Although geo-spreading distributes data across multiple storage systems in various locations, the actual operation is performed only once and is invisible to the end user.
A single namespace makes it easier to manage one system spanning multiple locations than to manage multiple sites individually.
Erik Ottem is director of product marketing, datacenter systems, at Western Digital