Distributed SQL

A distributed SQL database is a relational database that replicates data across multiple servers. Distributed SQL databases are strongly consistent, and most support consistency across racks, data centers, and wide area networks, including cloud availability zones and cloud geographic zones. Distributed SQL databases typically use the Paxos or Raft algorithms to achieve consensus across multiple nodes. Distributed SQL databases are sometimes referred to as NewSQL, but NewSQL is a more inclusive term that also covers databases that are not distributed databases.

History

Google's Spanner popularized the modern distributed SQL database concept. Google described the database and its architecture in a 2012 whitepaper called "Spanner: Google's Globally-Distributed Database". The paper described Spanner as having evolved from a Bigtable-like key-value store into a temporal multi-version database in which data is stored in "schematized semi-relational tables."[1]

Spanner uses atomic clocks with the Paxos algorithm to accomplish consensus over state distributed between servers. In 2010, an earlier implementation, ClustrixDB, moved from a hardware appliance to a Paxos-based software database[2]; it was later acquired by MariaDB,[3] renamed Xpand, and added to a SaaS cloud offering called SkySQL.[4] In 2017, two Google engineers left the company to create CockroachDB, which achieves similar results using the Raft algorithm without atomic clocks or custom hardware.[5] Following this, other entries have emerged in the market, such as MariaDB's SkySQL and YugabyteDB. Aside from differences in implementation and performance claims, these offerings can run on multiple public clouds and, in some cases, on private clouds or infrastructure.

Spanner is primarily used for transactional and time-series use cases. However, Google furthered this research with a follow-on paper about Google F1, which it describes as a hybrid transactional/analytical processing database built on Spanner.[1]

Architecture

Distributed SQL databases have the following general characteristics:

  • Synchronous replication
  • Strong transactional consistency across at least availability zones (i.e. ACID compliance).
  • Relational database front-end structure, meaning data is represented as tables with rows and columns, similar to any other RDBMS
  • Automatically sharded data storage
  • Underlying key-value storage[6][1] (see the illustrative sketch following this list)
  • Native SQL implementation
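
The combination of automatically sharded storage and an underlying key-value store is commonly realized by encoding each relational row as one or more key-value pairs whose key combines a table identifier with the primary key, so that contiguous primary-key ranges can be split into shards and replicated independently. The following sketch (in Go) illustrates the idea only; the key layout, names, and encoding are hypothetical and do not correspond to any particular product's storage format.

    // Illustrative sketch: encoding a relational row as a key-value pair,
    // as the storage layer of a distributed SQL database conceptually does.
    // The key layout shown here is hypothetical.
    package main

    import "fmt"

    // rowKey builds a sortable key from a table name and a primary key so that
    // rows of the same table are stored contiguously and can be range-sharded.
    func rowKey(table string, primaryKey int) string {
        return fmt.Sprintf("/%s/%010d", table, primaryKey)
    }

    func main() {
        // SQL view: INSERT INTO users (id, name) VALUES (42, 'Ada');
        key := rowKey("users", 42)
        value := `{"name": "Ada"}` // column data serialized as the value

        fmt.Println(key, "=>", value)
        // A key range such as /users/0000000000 through /users/0000000999
        // forms one shard, which the consensus layer replicates across nodes.
    }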

Following the CAP theorem, distributed SQL databases are "CP", or consistent and partition-tolerant. Algorithmically they sacrifice availability in that a failure of a primary node can make the database unavailable for writes. However, availability is achieved in practice through greater software and hardware reliability, the election of new primaries, and heuristic recovery methods.[7]
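
This trade-off can be made concrete with the majority quorums used by Paxos- and Raft-style replication: a write commits only after more than half of a group's replicas accept it, so a group of n replicas tolerates the loss of at most (n − 1)/2 of them (rounded down) while remaining writable once a new primary is elected. The sketch below only illustrates this arithmetic and is not drawn from any of the systems discussed here.

    // Minimal sketch of the majority-quorum arithmetic used by
    // Paxos/Raft-style replication; illustrative only.
    package main

    import "fmt"

    func main() {
        for _, replicas := range []int{3, 5, 7} {
            quorum := replicas/2 + 1       // votes needed to commit a write
            tolerated := replicas - quorum // node failures the group survives
            fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n",
                replicas, quorum, tolerated)
        }
    }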

All distributed SQL implementations require some kind of temporal synchronization to guarantee consistency. Spanner uses custom atomic-clock hardware and can therefore synchronize writes with temporal guarantees; most other implementations do not use such hardware and instead require servers to compare clock offsets and potentially retry reads.[8]
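
One way such an implementation can handle clock offsets is to attach a timestamp to every write and restart a read whenever it encounters a value whose timestamp falls inside the reader's clock-uncertainty window, since the reader cannot tell whether that write happened before or after the read began. The sketch below is a simplified, hypothetical illustration of that check; the constant and function names are assumptions and are not taken from any of the databases mentioned here.

    // Hypothetical sketch of an uncertainty-window check: if a value was
    // written at a timestamp the reading node cannot confidently order
    // against its own clock, the read is retried at a later timestamp.
    package main

    import (
        "fmt"
        "time"
    )

    // maxClockOffset is an assumed bound on clock skew between nodes.
    const maxClockOffset = 250 * time.Millisecond

    // mustRetry reports whether a value written at writeTS falls inside the
    // reader's uncertainty window (readTS, readTS+maxClockOffset] and
    // therefore cannot safely be ordered relative to the read.
    func mustRetry(readTS, writeTS time.Time) bool {
        return writeTS.After(readTS) && !writeTS.After(readTS.Add(maxClockOffset))
    }

    func main() {
        readTS := time.Now()
        writeTS := readTS.Add(100 * time.Millisecond) // write seen "slightly in the future"

        if mustRetry(readTS, writeTS) {
            // Restart the read at a timestamp past the uncertain write.
            readTS = writeTS
        }
        fmt.Println("read proceeds at", readTS)
    }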

Compared to NewSQL

CockroachDB and others have at times referred to themselves as NewSQL databases, but the overall architecture of NewSQL databases such as Citus and Vitess, some of the examples given by Matthew Aslett,[9] who coined the term, is fundamentally different. In essence, distributed SQL databases are built from the ground up on key-value stores, whereas NewSQL databases are generally replication and sharding technologies built on existing client-server relational databases like PostgreSQL.[10] Others define distributed SQL databases as a more specific subset of NewSQL databases.[11]

References
