HASH_DUP

Author: Dan Mares, dmares @ maresware . com (you will be asked for e-mail address confirmation)
Portions Copyright © 1998-2021 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last update: June 8, 2014

PURPOSE

The program HASH_DUP.exe is designed to find duplicates in a file of hashes created using the HASH program. The file containing the hashes should not have any headers or footers. It should contain only raw data records.

With proper planning, thought, and the correct input file format, it can also be used to identify duplicates in other types of files containing a hash value.

OPERATION

The user provides HASH_DUP with the name of the file containing the hash records, and the name of an output file (using the -i and -o options). The program then proceeds to find all duplicate hash values and outputs those records to the output file.

The input file MUST be of a fixed record length with no short or blank lines. This is a very specifically designed program which requires specific input formats. However, there exist a number of Maresware filter programs (pipefix, filbreak, hash, diskcat) which can reformat many files to fit this requirement.
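
The fixed-length requirement can be checked before the file is handed to HASH_DUP. The following Python sketch is not part of Maresware. The function name check_fixed_length is hypothetical, and the sketch assumes that records are newline-terminated text lines and that the first non-blank record has the correct length.

```python
def check_fixed_length(lines):
    """Return (line_number, length) for every record that breaks the
    fixed-length rule: blank lines, or lines whose length differs from
    the length of the first non-blank record."""
    problems = []
    expected = None
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\r\n")
        if expected is None and record:
            expected = len(record)  # assumption: first record is correct
        if not record or len(record) != expected:
            problems.append((number, len(record)))
    return problems


# Hypothetical sample data: two good records, one blank, one short.
sample = ["aaaa|f1\n", "bbbb|f2\n", "\n", "cc|f3\n"]
print(check_fixed_length(sample))
```

An empty result means every record has the same length. Any entries point at lines that would need to be repaired first, for example with one of the filter programs named above.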

Before any comparison is made, the file MUST BE sorted on the hash field. Because of this sort process, the filenames and paths are not included in the test; only the hash value field is. The hash value field can be either the MD5 or SHA value.

Even though only the hash field is sorted on and then compared, the -d option can often be used to trick the program into using another field as the sort field. Fooling the program this way takes some thought and practice. But as we all know, computers are stupid and do what they are told, even if it is the wrong thing.
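
The sort-then-compare idea described above can be sketched in Python. This is an illustration of the general technique, not HASH_DUP's actual implementation. The function name find_duplicates, the sample records, and the layout (a shortened hash, a "|" delimiter, then a filename) are all assumptions made for the example.

```python
from itertools import groupby


def find_duplicates(records, key):
    """Sort records on key(record), then return every record whose key
    occurs more than once. Equal keys end up adjacent after the sort."""
    ordered = sorted(records, key=key)
    duplicates = []
    for _, group in groupby(ordered, key=key):
        members = list(group)
        if len(members) > 1:
            duplicates.extend(members)
    return duplicates


# Hypothetical records: shortened hash values followed by a filename.
records = [
    "0cc175b9|a.txt",
    "92eb5ffe|b.txt",
    "0cc175b9|c.txt",
    "4a8a08f0|d.txt",
    "92eb5ffe|e.txt",
]

# Only the hash field takes part in the comparison, never the filename.
for record in find_duplicates(records, key=lambda r: r.split("|")[0]):
    print(record)
```

Because the key function alone decides what is compared, pointing it at a different field changes what counts as a duplicate, which is the same effect the -d trick relies on.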

The maximum number of records/files it can process per input file is 250,000.

OUTPUT

The output is a file containing the records of all files that have duplicate hash values.

OPTIONS

Usage: HASH_DUP -[options] -?

-i + inputfile: Name of the file containing the hashes to be checked for duplicates. This file should not have any headers or footers. It should contain only data records. The correct hash option to generate this type of headerless file is -v.

-o + outputfile: Name of the output file in which to place the duplicate records. All duplicate records are placed in this file. If you don't want the first occurrence to show up, add the -m option.

-m: Normally, all duplicate records are included in the output file. In some instances the user may want to see only the succeeding duplicate records, and not the first occurrence. If this option is included, the first occurrence of any duplicate is not included in the output file. The resulting output file can be passed directly to (rmd -S) to remove all the duplicates; because the first occurrence was not included in this list, only one copy of each file will be left on the source drive.

-d + #: The hash field begins here. The program looks for a string of hex characters and stops at the first blank or non-hex character and considers that the field.

-1 + accounting file: (that’s a one, not ell) Create an accounting file to contain statistics of the run.
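
The -d and -m descriptions above can be illustrated with a short Python sketch. It is an approximation based only on the wording of this page, not HASH_DUP's code. The names hash_field and drop_first are hypothetical, and the sketch assumes the -d number is a 0-based character offset, which the page does not state.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def hash_field(record, start):
    """Return the run of hex characters that begins at offset `start`
    (0-based, an assumption) and stops at the first blank or other
    non-hex character."""
    end = start
    while end < len(record) and record[end] in HEX_DIGITS:
        end += 1
    return record[start:end]


def drop_first(grouped_duplicates, key):
    """Given duplicate records already grouped by key, omit the first
    record of each group, as the -m description suggests."""
    kept = []
    previous = object()  # sentinel that never equals a real key
    for record in grouped_duplicates:
        current = key(record)
        if current == previous:
            kept.append(record)
        previous = current
    return kept


# Hypothetical record: an 8-character name, two blanks, a short hash.
print(hash_field("file.txt  0cc1 x", 10))

dupes = ["aa|1", "aa|2", "bb|3", "bb|4", "bb|5"]
print(drop_first(dupes, key=lambda r: r.split("|")[0]))
```

Note that hash_field accepts any run of hex characters, so an offset pointing at some other hex-looking field would be taken as the comparison field just as readily.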

COMMAND LINES

First you need to generate a hash file. Using the hash program, the simplest way is this command line:

hash  -v -o hash_values.txt -d "|" -w 255

Then run HASH_DUP on the resulting hash output file.

c:>HASH_DUP -i hash_values.txt -o dupes.txt
    Find the duplicate records in hash_values.txt and write them to dupes.txt.

c:>HASH_DUP -i hash_values.txt -o dupes.txt -m
    Do the same, but don't include (-m) the first occurrence of every duplicate.

RELATED PROGRAMS

CRCKIT
HASH
DISKCAT
MD5
