This article deals with the March 2024 NSRL release, SQL version 2024.03.xxx. The massaged data discussed here is the result of my re-processing the NSRL data into a more usable format containing only the MD5 values. The NSRL goes back many years, and previously the data files were named in sequence, like 267, 273, 274 (<275), etc. In some cases below, I have processed and included those older versions because some of the MD5s were unique and warranted being added. However, where totals are posted, some may seem confusing, as some include the old-version MD5s and others include only the current 202403 items. So if the numbers seem irregular, that is probably the reason they don't match.
See the NSRL - NIST site for explanation of their processes and definition of what is included in the data sets.
The various NSRL segments (IOS, ANDROID, Legacy, Modern) contain a total of approximately 1.3 billion MD5 values. I have merged and uniqued the Legacy, Modern, ANDROID and IOS segments to obtain a total of 186+ million MD5 items from the 202403 set.
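The merge-and-unique step can be pictured with a short Python sketch. This is only an illustration and not necessarily how the data was actually processed: it assumes each input file has already been reduced to one MD5 per line and sorted, and the function name merge_unique is hypothetical.

```python
import heapq


def merge_unique(input_paths, output_path):
    """Merge already-sorted MD5 files into one sorted file without duplicates.

    Assumption: every input file is sorted and holds one MD5 per line.
    Returns the number of records written.
    """
    files = [open(path, "rb") for path in input_paths]
    written = 0
    previous = None
    try:
        with open(output_path, "wb") as out:
            # heapq.merge lazily walks all sorted inputs in overall sorted order
            for record in heapq.merge(*files):
                if record != previous:  # duplicates are adjacent, so skip repeats
                    out.write(record)
                    written += 1
                    previous = record
    finally:
        for handle in files:
            handle.close()
    return written
```

Because the inputs are read as streams, memory use stays small even when the files hold hundreds of millions of records.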
The record format of the file I have massaged is simply the 32-byte MD5 value followed by a carriage return/line feed, making the entire fixed-width record 34 characters (32 for the MD5 and 2 for CR/LF). Since there are no collisions, I didn't feel it necessary to add the substantial (doubling) size that including the SHA values would require.
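As a minimal sketch of that record layout, the Python below checks a single record and works out how many records a file of a given size should hold. The function names are hypothetical, and the uppercase-hex assumption comes only from the sample values shown in this article.

```python
import re

RECORD_LEN = 34  # 32 hex characters + CR + LF

# Assumption: MD5 values are stored as uppercase hex
MD5_RECORD = re.compile(rb"[0-9A-F]{32}\r\n")


def is_valid_record(record: bytes) -> bool:
    """True if the record is exactly 32 uppercase hex digits followed by CR/LF."""
    return MD5_RECORD.fullmatch(record) is not None


def record_count(file_size: int) -> int:
    """Number of 34-byte records implied by a file size in bytes."""
    count, remainder = divmod(file_size, RECORD_LEN)
    if remainder:
        raise ValueError(f"file size is not a multiple of {RECORD_LEN}")
    return count
```

For example, the input file size reported in the BSEARCH stats, 8,174,562,404 bytes, divides evenly by 34 to give 240,428,306 records.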
I would have liked to include the application_id in the record, but since I can hardly spell database, I couldn't rework the data to include it. If you research the NSRL web site you will probably have some questions about the files which are actually included in the data set. Enough said, do your own research. You are after all a forensicator.
Go to this page and scroll down about 14 sections to the section on which hashes are for known bad files.
The data files I am making available are sorted on the MD5 value, and each record is a fixed length of 34 characters. Using a reliable binary search engine such as Maresware BSEARCH, I searched the 240+ million records for 20 MD5 values and it took less than a second. A sequential search of the 240 million records took a little over 1 minute. Depending on the speed of your drive and machine, the times will obviously be different for you.
Output stats from the linear SEARCH program:
Output record length is 34
No of records read = 240,428,306
No of records wrote= 20
Elapsed time: 0 hrs. 1 mins. 21 secs
A binary search (BSEARCH) of the MD5 values, by contrast, is as fast as a traditional indexed search.
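To show why a binary search over a sorted, fixed-width file is so quick, here is a minimal Python sketch. It is not how BSEARCH itself is implemented; the function name is hypothetical, and it assumes the file is sorted in plain byte order with uppercase hex values.

```python
import os

RECORD_LEN = 34  # 32 hex characters + CR/LF
KEY_LEN = 32


def md5_in_sorted_file(path: str, md5_hex: str) -> bool:
    """Binary-search a sorted fixed-width MD5 file for one value."""
    # Assumption: stored values are uppercase, so normalize the search key
    key = md5_hex.strip().upper().encode("ascii")
    if len(key) != KEY_LEN:
        raise ValueError("an MD5 value must be 32 hex characters")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)  # jump straight to record number mid
            candidate = f.read(KEY_LEN)
            if candidate == key:
                return True
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
    return False
```

Each probe halves the remaining range, so a file of 240,428,306 records needs at most 28 reads per lookup, whereas a linear search may have to read every record.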
The current (March 2024) RDS_2024.03 (combined/uniqued) values from NIST are 186+ million unique MD5's.
1D6EBB5A789ABD108FF578263E1F40F3
9B3702B0E788C6D62996392FE3C9786A
See the NIST - NSRL site for explanation of their processes and definition of what is included in the data sets. NSRL-NIST overview.
Current March 2024 Hash Counts (before combining and uniquing)
                 Total        Unique
Modern:    879,510,365    69,437,521
Legacy:    289,938,900    61,814,050
Android:    97,148,886    29,406,405
IOS:        89,176,502    26,148,185
         =============   ===========
Total:   1,355,774,653   186,806,161
Older versions (V231-277, before 202403) were combined to build the more complete zip files below. The total number of unique MD5 values is 240,428,306.
I have split the combined 240+ million items into 4 smaller zipped files and made them
available for download. Each zip file is just over 760 Meg in size.
NSRL_0-3.zip contains 60,117,305 items with first character 0-3
NSRL_4-7.zip contains 60,108,371 items with first character 4-7
NSRL_8-B.zip contains 60,102,830 items with first character 8-B
NSRL_C-F.zip contains 60,099,800 items with first character C-F
NSRL_DEMO.zip contains sample command lines and batch file to run maresware.
You should unzip them, and then merge them in sorted order to restore the entire data set. Make certain the sort order is intact, or else you can't use a binary search or compare.
C:> copy /b NSRL_0-3.MD5 + NSRL_4-7.MD5 + NSRL_8-B.MD5 + NSRL_C-F.MD5 COMPLETE_SORTED_MD5
If you need help doing the merge, let me know. Once you merge the files, I suggest you use the
sortchek.exe program (see its help file) to verify that the total set is still sorted.
The suggested command line for sortchek is:
D:>sortchek COMPLETE_SORTED_MD5 -r 34 -p 0 -l 32
Replace the COMPLETE_SORTED_MD5 name with whatever yours is named.
If it finds a record out of order, it will show you.
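If sortchek is not at hand, the same kind of check can be sketched in a few lines of Python. This is an illustration only, not a description of how sortchek works: the function name is hypothetical, and it assumes the options above describe a 34-byte record with a 32-byte key starting at position 0 (consult the sortchek help file to confirm).

```python
RECORD_LEN = 34  # assumed meaning of -r 34
KEY_LEN = 32     # assumed meaning of -l 32, with the key at offset 0 (-p 0)


def first_out_of_order(path):
    """Return the 0-based index of the first record whose key sorts before
    the key of the record preceding it, or None if the file is in order."""
    previous = None
    index = 0
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_LEN)
            if not record:
                return None  # reached end of file without finding a problem
            if len(record) != RECORD_LEN:
                raise ValueError(f"short record at index {index}")
            key = record[:KEY_LEN]
            if previous is not None and key < previous:
                return index
            previous = key
            index += 1
```

Equal neighbouring keys are treated as in order here, so duplicates are not reported. On the full 240+ million record file a pure Python loop like this will be much slower than a compiled tool.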
SAMPLE RUNS:
If you would like a sample MARESWARE batch file that demonstrates how to run MARESWARE when processing the NSRL MD5
data referenced above, send me an email request and I'll send the sample batch file.
BSEARCH stats:
Input filesize = 8,174,562,404
Number input records = 240,428,306
Records Written = 20
Finished: Fri Jun 07 14:00:41 2024
Elapsed time: 0 hrs. 0 mins. 1 secs
===========================================
Linear SEARCH stats:
No of records read = 240,428,306
No of records wrote= 20
Elapsed time: 0 hrs. 1 mins. 21 secs
===========================================
COMPARE stats from a smaller group
182,078,612 Records in IOS_MOD_ANDR_LEG.MD5
262 Records in Maresware_hashes
Total records file 1 = 182,078,612
Total records file 2 = 262
Read 181,597,969 records from: IOS_MOD_ANDR_LEG.MD5
Read 262 records from: Maresware_hashes
Wrote 40 records to junk
Final record length is = 34
Elapsed time: 0 hrs. 1 mins. 2 secs
A reference page for algorithms and other documents may be found at: NIST. Research the documents link.