Data Integrity: How to Authenticate Your Electronic Records
2017-12-20: Reprocessed the NSRL data files into a more usable format. See: NSRL HASHES
How do you prove that electronically-stored records are authentic? It is a problem faced daily, in both the public and private sectors. For example: Banks must have verifiable storage procedures for customers' financial transaction records. Libraries and other archivists must be able to prove that their documents are authentic. State voting authorities must guarantee that the voting machine software has not been altered from its original installation. Public safety 911 centers must confirm that the software which runs the 911 systems has not been altered. Auditors, corporate CEOs and CFOs must be able to validate data that forms the basis for audit reports, tax returns, employee pension plan records, and financial statements filed with the Securities and Exchange Commission. The Sarbanes-Oxley Act of 2002, section 1102 and others, served to heighten awareness of just how critical such validation capabilities are. And, in every setting, software users handling sensitive data must have a way to confirm that the software they originally installed hasn't been altered, contaminated, or tampered with.
Computerized records require that security measures be implemented at every juncture: when the vendor's software is originally installed; through the data collection process; and on to the transmittal, storage, and retrieval processes. Despite the best security measures, however, eventually the question arises: How do you know these "secure" records are the real thing? And can you prove it?
Today there are vast areas in which such verification challenges arise, but the problem has been quietly addressed in certain technical circles for some time. In a process somewhat similar to using fingerprints as an identifying marker to compare one sample to another, electronic data can likewise be examined for authenticity. Specialists in computer forensics and other fields that deal with demonstrating data integrity have proven the effectiveness of what mathematicians call a "hashing algorithm." Using "hashing", as it is called, they can authenticate electronic data and the software used to store and maintain it.
For instance, at the National Institute of Standards & Technology (part of the U.S. Department of Commerce), hashing procedures are routinely used in the work of developing reliable standards and technology for industry. In a white paper, Tim Boland and Gary Fisher explained:
"Hashing is an extremely good way to verify the integrity of a sequence of data bits (e.g., to make sure the contents of the sequence haven't been changed inadvertently). The sequence might make up a character string, a file, a directory, or a message representing data (binary 1s or 0s) stored in a computer system. The word "hash" means to "chop into small pieces" (REF1). A hashing algorithm is a mathematical function (or a series of functions) taking as input the aforementioned sequence of bits and generating as output a code (value) produced from the data bits and possibly including both code and data bits. Two files with exactly the same bit patterns should hash to the same code using the same hashing algorithm. If a hash for a file stays the same, there is only an extremely small probability that the file has been changed. On the other hand, if the hashes for the files do not match, then the files are not the same." 3
The calculated hash value is often called a message digest and is referred to as a "digital fingerprint" of the data.
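The fingerprint idea can be illustrated with a minimal sketch using Python's standard hashlib module. The sample records and the fingerprint helper are made up for this illustration, and SHA-256 is chosen only as a convenient example algorithm.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 message digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical records, invented for this illustration.
original = b"Transaction 1001: deposit $250.00"
same_copy = b"Transaction 1001: deposit $250.00"
altered = b"Transaction 1001: deposit $950.00"  # a single character differs

print(fingerprint(original))
# Identical bit patterns produce identical digests.
print(fingerprint(original) == fingerprint(same_copy))
# A change to the data produces a different digest.
print(fingerprint(original) == fingerprint(altered))
```

The digest has the same length no matter how large the input is, which is what makes it practical to record and compare.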
The MD5 hashing algorithm generates a 128-bit "fingerprint" or value. The number of possible calculated values is 2^128, or roughly 3.4 x 10^38 (34 followed by 37 zeros). This generally means that there is about a 1 in 10^38 chance that two different, unrelated files will generate the same MD5 hash. "It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest."1
SHA (Secure Hash Algorithm) was developed by NIST and the National Security Agency (NSA).3 FIPS 180-1 describes the SHA algorithm in detail. This document also contains sample text and the expected calculated SHA value. (Note: these values have been compiled and are included on the Maresware CRC CD, which is soon to be discontinued.)
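Published sample values like these give a quick way to confirm that a hashing tool is computing what it claims. The sketch below checks Python's hashlib SHA-1 against two sample messages from FIPS 180-1. The function name is hypothetical, and this is only an illustration of the idea, not a formal validation procedure.

```python
import hashlib

# Sample messages and their expected SHA-1 values as published in FIPS 180-1.
TEST_VECTORS = {
    b"abc": "A9993E364706816ABA3E25717850C26C9CD0D89D",
    b"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq":
        "84983E441C3BD26EBAAE4AA1F95129E5E54670F1",
}

def check_sha1_implementation() -> bool:
    """Return True if hashlib's SHA-1 reproduces every published value."""
    for message, expected in TEST_VECTORS.items():
        actual = hashlib.sha1(message).hexdigest().upper()
        if actual != expected:
            return False
    return True

print(check_sha1_implementation())
```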
While the MD5 algorithm generates a 128-bit signature, the SHA algorithm generates a 160-bit signature, which is a value in the range 2^160. Both algorithms are sound and generally accepted as providing adequate validation of a file's authenticity. However, the SHA is the only set of algorithms in this group which NIST or the government recognizes.
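The signature sizes can be checked directly. The following sketch asks Python's hashlib for the digest size of each algorithm and prints the approximate number of possible values. The digest_bits helper is a name invented here, and some locked-down (FIPS-mode) Python builds may refuse to provide MD5.

```python
import hashlib

def digest_bits(name: str) -> int:
    """Return the digest length, in bits, of the named hashlib algorithm."""
    return hashlib.new(name).digest_size * 8

for name in ("md5", "sha1", "sha256", "sha384", "sha512"):
    bits = digest_bits(name)
    # 2**bits is the count of distinct values a digest of this length can take.
    print(f"{name:7s} {bits:4d} bits  about {float(2 ** bits):.1e} possible values")
```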
In essence, MD5 and SHA are ways to verify data integrity, and are more reliable than checksums and many other commonly used methods.1 These algorithms (MD5 and SHA) provide the foundation for the Maresware Hash and SHA_V programs. At least one state has already developed a protocol to validate the software used in its voting machines (computers). This protocol, which uses Maresware's Hash and Hashcmp programs together (see the description that follows), can quickly be modified or adapted to authenticate many different types of electronic records.
Maresware's Hash program is designed to calculate the MD5 hash values of source files. (It can also calculate the SHA1 (160-bit) and SHA2 (256-, 384-, and 512-bit) values.) The data that the Maresware Hash program works on is the contents of a source file. The hash values and other information about the source file are written (by default) to the screen. All Maresware programs are command line driven for easy use and customization of their operation. This means command line options can easily be changed so that the data produced by the Hash program is placed in a text output file; that output file can be further processed or printed. Forensic examiners and others who must determine or record the authenticity of a source file find the Hash program very useful. A simple procedure to determine a source file's authenticity would be:
1. Calculate and record the hash value of the original source file.
2. At a later time, calculate the hash value of the same source file again.
3. Compare the original hash value with the current hash value.
If the hashes of the source files match, the files are identical and unchanged. If the hash values are not identical, the contents of the source file have been altered. Here are sample output records (rows) from the Hash program. Headings were added and some formatting was modified to facilitate display here. The hash command line to generate a default output file would be:
C:\hash -p c:\ -o output.txt

Filename               MD5 or SHA Hash (fingerprint)        Size  Date        Time
C:\FOLDER\source1.ext  893C5990B1029171F8FDB262AF5ABDD0...  5741  2003/01/28  08:22:36
C:\FOLDER\source2.ext  893C5990C1023171F8123262AF5ABDD0...  9941  2003/01/22  08:24:36
C:\FOLDER\source3.ext  893C5990B102917EFDCDB262AF5ABDD0...  1234  2003/03/28  08:22:36
C:\FOLDER\source4.ext  893ABCD0C1023171F8123262AF5ABDD0...  5678  2003/02/28  08:24:36
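For readers who want to experiment with the idea, here is a rough Python sketch that produces similar records (filename, hash, size, date and time) for every file under a folder. It is not the Maresware Hash program and does not reproduce its output format. The function names, the tab-separated layout, and the SHA-256 default are all assumptions made for this illustration.

```python
import hashlib
import os
import time

def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files need little memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()

def manifest_records(root, algorithm="sha256"):
    """Yield one tab-separated record per file found under root."""
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            info = os.stat(path)
            # Last-modified time, formatted like the sample rows above.
            stamp = time.strftime("%Y/%m/%d %H:%M:%S",
                                  time.localtime(info.st_mtime))
            yield f"{path}\t{hash_file(path, algorithm)}\t{info.st_size}\t{stamp}"

def write_manifest(root, out_path, algorithm="sha256"):
    """Write the records for root to a text file (keep out_path outside root)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for record in manifest_records(root, algorithm):
            out.write(record + "\n")
```

A call such as write_manifest("C:\\FOLDER", "output.txt") would write one record per file. The output file should live outside the folder being hashed so that it does not end up in its own listing.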
Step 3 of the hashing procedure requires the comparison of the original hash value with a current hash value of the specific file. Hashcmp is specially designed to compare the output files produced by the Hash program. Hashcmp will very quickly compare the two output files (an initial one and a current one) created by the Hash program. It then displays information on source files whose hashes do not match. The information displayed is the record, or line, in the hash output files for each source file whose hashes do not match. In tests on a 2.8 GHz CPU with a reasonably fast hard drive, Hashcmp can compare two files containing upwards of 30,000 records each in under 10 seconds.
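The comparison step can be sketched in a few lines of Python. This is an illustration of the general technique only and says nothing about how Hashcmp is implemented. It assumes hypothetical tab-separated records whose first two fields are the filename and the hash, and the sample records below are invented.

```python
def load_manifest(lines):
    """Map filename -> hash from 'filename<TAB>hash<TAB>...' records."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        table[fields[0]] = fields[1]
    return table

def compare_manifests(initial_lines, current_lines):
    """Return (changed, missing, added) filename lists between two runs."""
    initial = load_manifest(initial_lines)
    current = load_manifest(current_lines)
    changed = sorted(f for f in initial
                     if f in current and initial[f] != current[f])
    missing = sorted(f for f in initial if f not in current)
    added = sorted(f for f in current if f not in initial)
    return changed, missing, added

# Invented sample records: b.ext was altered, c.ext removed, d.ext added.
initial_run = ["a.ext\tAAAA\t10\n", "b.ext\tBBBB\t20\n", "c.ext\tCCCC\t30\n"]
current_run = ["a.ext\tAAAA\t10\n", "b.ext\tBBB0\t20\n", "d.ext\tDDDD\t40\n"]

print(compare_manifests(initial_run, current_run))
```

Loading each file into a dictionary keyed by filename keeps the comparison to a single pass over each list, so tens of thousands of records are not a burden.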
The procedure of comparing a reference, or original, hash value of a source file with a current hash value of that file has many uses. One interesting application recently brought to our attention is its use by university library archivists to ensure that the copies of documents they are maintaining have not been altered or tampered with.
Voting Machine/Software Validation
Before you get into this section, you should familiarize yourself with some of the following documents.
U.S. Voting Commission Voluntary Voting Systems Guidelines, and especially section
7.4.6 Software Setup Validation, paragraph 'd', which references validated SHA FIPS 140-2
software and the use of COTS software to perform the tasks.
Also take note that references to the Maresware HASH program will soon be changed to the
Maresware SHA_V program. As of September 2008, a new Maresware program called SHA_V (for
SHA_V(alidate)) is being tested by a government lab, and we expect it will soon be on the
accepted software list.
In its simplest implementation, you can use the HASH or SHA_V program in your software validation test suite. You would hash the vendor-supplied software at day 1, and later, at day X (voting day), you would hash it again and confirm that no changes to significant software have occurred.
For a number of years, some states and validating companies have been using the Maresware Hash and Hashcmp programs to assist in validating voting machine software. The process of calculating the hash, and then comparing hash values with standards or originals, is a good starting point to help convince people that the software has not been tampered with. With the addition of software hash values from election machine vendors to the hash sets maintained at NIST, it becomes easier and more reliable to confirm valid software.
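Checking installed software against a reference hash set amounts to a set lookup. The sketch below is a simplified illustration: the file names, file contents, and helper functions are hypothetical, and it does not read the actual NIST hash set files.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as uppercase hex."""
    return hashlib.sha1(data).hexdigest().upper()

def unrecognized_files(installed, reference_hashes):
    """Return names of installed files whose SHA-1 is not in the reference set."""
    return sorted(name for name, data in installed.items()
                  if sha1_hex(data) not in reference_hashes)

# Hypothetical reference set built from the vendor's original release.
vendor_release = {
    "tally.exe": b"original tally code",
    "ballot.dll": b"original ballot code",
}
reference_hashes = {sha1_hex(data) for data in vendor_release.values()}

# Hypothetical state of a machine on voting day: one file has been altered.
installed = {
    "tally.exe": b"original tally code",
    "ballot.dll": b"patched ballot code",
}

print(unrecognized_files(installed, reference_hashes))
```

A lookup like this only shows that a file's contents match some known file. A full validation would also confirm that each expected file is present under the expected name.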
Special processing note:
The following are sources of technical documents with information relating to the MD5 or SHA hashing algorithm:
http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
http://andrew2.andrew.cmu.edu/rfc/rfc1321.html, a document from MIT.
FIPS 180-1 PDF
FIPS 140-2 PDF
validated software listings