Smart Screen Exclusive

C4 IDs: Agreement Without Communication

The unambiguous identification of digital assets has long challenged media and entertainment companies. And as new devices, platforms and business models create the need for new formats and configurations of content, the scale of the challenge is growing.

Moreover, as workflows grow more distributed, collaborative and complex, the need for a common, consistent system that can unambiguously identify files across networks and within varied asset-management systems is becoming critical.

That was the impetus behind the Cinema Content Creation Cloud (C4), developed by the Entertainment Technology Center at USC (ETC@USC) and its major studio partners. Its backers claim C4 provides an unambiguous, universally unique ID for any file or block of data anywhere in the world, regardless of origin, ownership, location or storage system.

ETC has now published a white paper describing C4 and its benefits, from which this article is adapted (the full white paper can be downloaded here). Smart Content News will be publishing a series of articles in coming weeks discussing C4’s attributes, adoption and use-cases. This is the first article in the series.

The most common way of identifying data is by filenames and folders. But even with standardized file-naming conventions, differences in local storage can quickly introduce ambiguity by placing the same file under different hierarchical folder paths. File and folder names can also reveal information the user may not want revealed: the type of data contained in the file, the devices on which it’s stored and their locations, and the file’s origin.

On the Internet, the URL system backed by a central registry managed by ICANN solves some of the ambiguity problem, but it’s useless for files that are not on the Internet or when bandwidth isn’t available to access them.

Earlier efforts at developing a universal ID system, such as the Universally Unique Identifier (UUID) standard and the Unique Material Identifier (UMID) standard defined by SMPTE, solve some of the problems inherent to the files-and-folders method and require less central coordination than URLs. But neither produces infallibly unique IDs, and what they do produce is not easily used by humans.

In analyzing the problem, the ETC team developed a list of 10 properties that an ideal identification system would have:

  1. It would be safe to use in filenames, URLs, and database records.
  2. It would be immune to local property changes like filename, path or date.
  3. It would be format agnostic and able to represent any kind of data.
  4. It would be unique throughout the world for a given piece of data.
  5. It would not depend on any external information.
  6. It would be easily recognized by humans or machines.
  7. It would be simple to implement in software and easy to use.
  8. It would not leak sensitive information.
  9. It would be unchangeable for a given asset.
  10. The same file would always have the same ID to everyone.

According to the ETC white paper, the C4 system has all 10 qualities.

C4 IDs are produced by using the common, openly specified Secure Hash Algorithm SHA-512 to generate a 512-bit hash of the data to be identified, then encoding the hash value in a simple Base58 character set and prefixing it with “c4”. The result is a string of human- and machine-readable characters that is always exactly 90 characters long and always begins with “c4”.
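
The steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the function name `c4_style_id` is hypothetical, and the alphabet ordering and the left-padding with the alphabet's zero digit are assumptions, since the description above specifies neither.

```python
import hashlib

# Assumption: the widely used Base58 ordering, which omits "0", "O", "I" and "l".
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
PREFIX = "c4"
ID_LENGTH = 90  # two prefix characters plus 88 Base58 digits


def c4_style_id(data: bytes) -> str:
    """Return a C4-style ID for a block of bytes (illustrative sketch)."""
    # Step 1: hash the raw bytes with SHA-512 (64 bytes, 512 bits).
    digest = hashlib.sha512(data).digest()

    # Step 2: treat the digest as one large integer and convert it to base 58.
    value = int.from_bytes(digest, "big")
    digits = []
    while value > 0:
        value, remainder = divmod(value, 58)
        digits.append(BASE58_ALPHABET[remainder])
    encoded = "".join(reversed(digits))

    # Step 3 (assumption): left-pad with the zero digit so the length is fixed.
    encoded = encoded.rjust(ID_LENGTH - len(PREFIX), BASE58_ALPHABET[0])

    # Step 4: add the "c4" prefix.
    return PREFIX + encoded


if __name__ == "__main__":
    print(c4_style_id(b"example data"))
```

The fixed length follows from the arithmetic: 58^88 is larger than 2^512 while 58^87 is not, so 88 Base58 digits are enough to hold any SHA-512 value, and the two-character prefix brings the total to 90.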

Critically, once generated, C4 IDs can be used in conjunction with any asset management system, regardless of the storage method or UI, without compromising their uniqueness or introducing ambiguity. The C4 ID for a particular block of data will always be the same, regardless of the local file system, anywhere in the world.

That ‘agreement without communication’ is a key differentiator between C4 and other identification systems. It enables interoperability between human beings, organizations, databases, software applications, and networks, and it is essential to the globally distributed workflows of media production.

Among its other virtues:

  • C4 IDs are safe to use as filenames and in URLs because they use a URL-safe alphabet that includes only capital and lower-case letters and numbers; no symbols or special characters are required.
  • C4 IDs are generated exclusively from the bytes of the file itself, without reference to any other file attributes. As a result, they can be used to identify any data, whether a file or an arbitrary block of data; they’re immune to local property changes; and they require no reference to any external information or registry to ensure uniqueness and consistency.
  • C4 IDs contain 64 bytes of “uniqueness,” allowing for an immense address space and ensuring that any given ID will remain universally unique for the foreseeable future. For context, the universe is estimated to contain 2^266 atoms; C4 can uniquely identify 2^256 of them.
  • C4 IDs are easy to identify by humans and machines, even out of context, because they always begin with “c4” and do not contain the zero “0”, the upper-case letters “O” or “I”, or the lower-case letter “l”, characters that can easily be misidentified in certain fonts.
  • C4 IDs are secure and do not “leak” any sensitive information because they contain no metadata about the file they identify or the computer on which the ID was generated. Each ID appears completely random and unique.
  • C4 IDs are unchangeable and inseparable from the data they represent because they are generated from the data itself rather than assigned to it. It is impossible, even intentionally, to hide a file from its C4 ID, regardless of how it is renamed or moved.
  • C4 IDs will always be the same for the same file and will always be different for different files. Thus, two organizations that each have a file with the same C4 ID can be absolutely confident they have the same file without needing to compare the files directly.
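
Several of these properties can be checked with a short, self-contained Python sketch. The helper names `c4_style_id_of_file` and `demo`, the example filenames, and the Base58 alphabet ordering and padding are all assumptions made for illustration; they are not taken from the white paper.

```python
import hashlib
import os
import tempfile

# Assumption: the common Base58 ordering that omits "0", "O", "I" and "l".
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def c4_style_id_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes (and nothing else) into a C4-style ID."""
    hasher = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large media files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    value = int.from_bytes(hasher.digest(), "big")
    digits = ""
    while value:
        value, remainder = divmod(value, 58)
        digits = ALPHABET[remainder] + digits
    # Assumption: pad to 88 digits with the alphabet's zero digit.
    return "c4" + digits.rjust(88, ALPHABET[0])


def demo():
    """Return the IDs of a file before a move, after it, and after an edit."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "shot_010_v001.bin")
        with open(original, "wb") as handle:
            handle.write(b"example frame data")
        id_before = c4_style_id_of_file(original)

        # Rename and move the file: the bytes are unchanged, so is the ID.
        moved = os.path.join(workdir, "archive", "renamed.dat")
        os.makedirs(os.path.dirname(moved))
        os.replace(original, moved)
        id_after_move = c4_style_id_of_file(moved)

        # Append a single byte: the content differs, so the ID differs.
        with open(moved, "ab") as handle:
            handle.write(b"!")
        id_after_edit = c4_style_id_of_file(moved)
    return id_before, id_after_move, id_after_edit


if __name__ == "__main__":
    before, after_move, after_edit = demo()
    print("same after rename/move:", before == after_move)
    print("different after edit:  ", before != after_edit)
```

In the same spirit, two organizations could each run such a function locally and exchange only the resulting 90-character strings to check whether they hold the same file.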

Best of all, C4 is open-source and free to use as is or to incorporate into other software, including closed-source software.

Future articles in this series will focus on how C4 can be used internally by organizations for asset-management tasks such as de-duplication and archiving of files, network optimization, and version control. They will also focus on how C4 can be used to uniquely define relationships within and between assets to improve flexibility and data recovery, and how it can enable interoperability among organizations without requiring changes to local workflows.

For more information, contact Joshua Kolden at Studio Pyxis.