Breaking Down UUIDs

The following is a teardown on UUIDs history, their different versions, when they should be used, and how they are constructed. The author also shares the probability of collision, for a specific UUID usage scenario.

https://duo.com/labs/tech-notes/breaking-down-uuids

Universally Unique Identifiers, or UUIDS, are 128 bit numbers, composed of 16 octets and represented as 32 base-16 characters, that can be used to identify information across a computer system. This specification was originally created by Microsoft and standardized by both the IETF and ITU.

UUIDs are generally used for identifying information that needs to be unique within a system or network thereof. Their uniqueness and low probability in being repeated makes them useful for being associative keys in databases and identifiers for physical hardware within an organization. One of the benefits to UUIDs is that they don’t need to be issued by a central authority, but can be generated independently and then used across a given system without suspicion that a duplicate. or colliding, UUID has been generated elsewhere. Apple, Microsoft, Samsung, and others use UUIDs, either defined by the IETF spec or a proprietary variant, to identify and track hardware both internally and sold to consumers.

There are 5 different versions of UUIDs, excluding the Nil UUID version, which is a special case UUID where all its bytes are set to 0, and most contain some variants that allow for special cases specific to vendors like Microsoft. Version 1 and 2 use time-based sources (a 60 bit timestamp sourced from the system clock) for its randomness. Versions 1 and 2 are effectively the same, except in the latter version the least significant bits of the of the clock sequence are replaced with an identifier specific to the system. Most implementations, because of this and other reasons, omit version 2. Version 1 is the most commonly used of the UUID versions.

Versions 3 and 5 are generated by hashing a name or namespace identifier and using the resultant hash, MD5 or SHA-1 respectively, as the source of uniqueness instead of the time-based sources like in versions 1 and 2. They are meant for generating UUIDs from names that are drawn from, and unique within, some name space. The concept of name and name space should be broadly construed, and not limited to textual names. For example, some name spaces are the domain name system, URLs, ISO Object IDs (OIDs), X.500 Distinguished Names (DNs), and reserved words in a programming language.

Version 4 uses random or pseudo-random sources rather than time or namespace-derived sources for its uniqueness. Versions 3, 4, and 5 use their respective sources to generate 60 bits of unique output that is used in lieu of the timestamp bits used in versions 1 and 2.

01. UUID Generation

To show how the UUID is derived, we’ll go through how a version 1 UUID is created. For versions 3 through 5, the clock sequence sources mentioned below should be replaced with their respective sources. The V1 UUID string is derived into an ordered sequence of 6 fields from that give the ID a great (but not non-zero) chance of being completely unique. UUID records are derived from the following sources, in big-endian fashion, as follows:

TimeLow : 4 Bytes (8 hex chars) from the integer value of the low 32 bits of current UTC timestamp
TimeMid : 2 Bytes (4 hex chars) from the integer value of the middle 16 bits of current UTC time
TimeHighAndVersion : 2 Bytes (4 hex chars) contain the 4 bit UUID version (most significant bits) and the integer value of the high remaining 12 bits of current UTC time (timestamp is comprised of 60 bits)
ClockSequenceHiAndRes && ClockSequenceLow : 2 Bytes (4 hex chars) where the 1 through 3 (significant) bits contain the “variant” of the UUID version being used, and the remaining bits contain the clock sequence. The clock sequence is used to help avoid collisions if there a multiple UUID generators within the system or if a system clock for a generator was set backwards or doesn’t advance fast enough. For additional information around changing Node IDs and other collision considerations, see section 4.1.5 of the IETF RFC
Node : 6 bytes (12 hex chars) that represent the 48-bit “node id”, which is usually the MAC address of the host hardware that generated it.

This yields a string that would look like this for example:

123e4567-e89b-12d3-a456-426655440000

02. What makes UUIDs Versions unique?

To further break down what gives us faith in why a given UUID is most likely unique, let’s look at the sources of UUID data for the different UUID versions a bit more:

Versions 1 & 2

For these versions we have 74 bits of time data, 60 bits from the timestamp and 14 from the clock sequence. Along with that we have the 48-bit Node ID, which could be the MAC or, in some cases where we may not want to expose it or the Node does not have a MAC, a 48 random or pseudo-random bits. Ideally (for the UUID version, not for us) however if we have the MAC and, in combination with the timestamp and clock sequence, we are given an ID that correlates to a single point in space (the node MAC) and time (timestamp and clock sequence). If the ideal case holds true, a node is capable of generating 274 (18 Sextillion), but if no MAC is given we have an additional 48 bits of uniqueness, yielding 2122 (5.3 undecillion) possible UUIDs for a node.

Versions 3 & 5

If you want to have unique identifiers for “name-able” information and data within a namespace of your system stored in a UUID format and make sure that duplicate resource names do not occur across your system, this is the version to use. According to the spec, Version 5 is preferred, since it uses SHA-1.

Version 4

The most “unique” of the versions. As with the other versions, 4 bits are used to indicate the version, and 2 or 3 bits depending on the variant are used to indicate the variant of the UUID. this leaves either 2122 (5.3 × 1036) or, for v4 variant 2, half as many possible unique IDs there is one less random bit. Still the chances of collision are extremely small.

03. Chances of Collision

The chance of a collision occurring where two identical UUIDS are generated at the same time on the same node is incredibly small, and the probability of collision can be calculated using the Birthday Problem. For example, if we have 68,719,476,736 UUIDs with 74 random bits , the probability of a duplicate would be 0.1175030974154045 but If we have 122 random bits the probability would be 0.0000000000000004. If a user was generating all UUIDs for a system using a single node, they may want to consider using UUID version 4 rather than 1, since the chance of collision is much greater.

#reads #nick steele #uuid