STRSRCH

PURPOSE   OPERATION   OPTIONS   COMMAND LINES   RELATED PROGRAMS


Author: Dan Mares, dmares @ maresware . com
Portions Copyright © 1998-2016 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last update: Date: 03-11-2012

This is a command line program.
MUST be run within a command window as administrator.


PURPOSE

Also available in the Linux suite

The program is designed to do multiple string searches of files contained on a disk. It cannot perform the search within a container such as a zip or docx type of file.

It is an excellent addition if you export drive slack and freespace from any of the many forensic tools such as FTK or X-Ways software.

The output format is fixed length records (using the -w option), which can be imported easily into a spreadsheet to allow for additional e-discovery and attorney review.

The user supplies a text file which contains the strings to search for. Then the STRSRCH program searches files for those target strings. (It has been tested while using over 9000 keys with very little degradation in performance. However, overall performance of any keyword search program is also affected by the amount of output which is generated. It is not unusual to have over half a million hits with incorrectly chosen keywords.)

The program can search individual directories or do recursive searches of directories. By default, it searches all files (*.*) but can be requested to search only in files of specific names or types, as specified by the file type parameters of the -f option (-f *.doc *.txt *.etc).

Its output can be specifically tailored for easy import into data bases. IT CANNOT OPEN COMPRESSED OR OTHER CONTAINER TYPE FILES FOR SEARCHING.

It has also been used as a tool to assist in e-discovery processing of files. String searching is often one of the most cumbersome and tedious tasks, even with the sophisticated search capabilities of the integrated forensic programs. Often it becomes necessary to "extract" files, freespace, or slack space and search these items separately. This program may be easier to use to process certain types of data than the mainstream integrated packages. It is also extremely fast, and has output which can be customized to create a load file for Summation. If you don't know what that is, you don't need to worry about it. ☺

There is a custom alternative to the strsrch program called str_spec.exe. This was created to process the html e-mail outputs from the report generation wizard of the FTK program. Simply put, the str_spec program will search through all the e-mail html report files and produce a "log" of the header information found in each of the html files. The log contains references to the From:, To:, CC:, etc. fields, all the way up to and including the Subject line. This program is useful when preparing e-discovery data for a party, but not wishing to provide e-mail content until properly reviewed.

Also, for specific header processing of .eml files you might take a look at the eml_process.exe program.



OPERATION

The user provides the program with a text (ASCII) file containing the strings to search for (the -s option), an output file name (the -o option), and a list of input file name(s) or types to search (the -f filetype option).

The program reads the strings from the input ASCII text file, and proceeds to search through all the appropriate files for matching strings. The search is not case dependent and the number of strings searched for is currently limited to 10,000. However, that is easily modified. If you have more than 10,000 keys, first, consider what you are searching for, then give us a call.

Even though the file containing the strings to search for is an ASCII file, you can place binary (or hex) characters in the file for searching. The only requirement is that you CANNOT search for the typical carriage return. This is because when strsrch sees the carriage return in the strings file, it terminates that line. Any other binary characters should be OK. You may have to use a hex editor to get those characters into the text file, because it does not accept formats like: 0Xaabbcc.

The program opens each file and does a search for each of the strings. When it finds a match, it creates an output record containing a number of fields. The fields it places in the output record are: the string it found, the name of the file it was found in, the location (byte number) of the string within the file, AND most importantly, 80 surrounding characters. This 80-character limit can be adjusted by using the -m option. (We have had clients request as many as 5000 characters of surrounding text when searching through extracted freespace. And had as many as 1.5 million output hits. They obviously didn't take our suggestion that their key list was too broad.) This way you can easily look at the output and see if it needs further examination.
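The record-building step described above can be sketched in Python. This is only an illustration of the idea, not the program's actual implementation: the function name search_buffer, the tuple layout, and the clamping at the start of the buffer are assumptions made for the example.

```python
from typing import List, Tuple


def search_buffer(data: bytes, keys: List[bytes], name: str,
                  width: int = 80) -> List[Tuple[str, str, int, str]]:
    """Return (key, filename, byte offset, context) for every hit in data."""
    lowered = data.lower()  # case-insensitive: compare lower-cased copies
    records = []
    for key in keys:
        needle = key.lower()
        if not needle:
            continue  # an empty key would match everywhere
        start = lowered.find(needle)
        while start != -1:
            # Center the hit in a window of `width` bytes, clamped at offset 0.
            pad = max(width - len(needle), 0) // 2
            lo = max(start - pad, 0)
            context = data[lo:lo + width]
            # Show unprintable bytes as dots for legibility.
            text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in context)
            records.append((key.decode("latin-1"), name, start, text))
            start = lowered.find(needle, start + 1)
    return records


for rec in search_buffer(b"xx Mares yy\x00MARES", [b"mares"], "sample.bin", width=9):
    print(rec)
```

Note that a hit near the end of the buffer produces a shorter context, which is the short last record situation that matters when importing into a database.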


STRSRCH doesn't open compressed files (e.g. zip), docx, xlsx, pdf or Outlook pst files to search within their contents.

COMPRESSED EXECUTABLES can be identified by using strsrch to look for the following strings: (for 7-zip, the offsets and strings may change by version, and the -U unicode option should be used.)

Program   STRING   LOCATION
pkzip:    PKWARE          565
Winzip:   winzip          682
7-zip     7-ZIP        144398  (this is a unicode string)
7-zip     Igor Pavlov  144400  (this is a unicode string)
Place these strings (and any others you need to research from other zipped executables) in a file called zip_strings.txt and run the following command line:

C:>strsrch -f *.exe -D 500 -E 150000 -N -w -o outputfilename.txt -V -s zip_strings.txt -v -i -U -d "|"
-f *.exe == look for these type files only. add *.dll etc as necessary
-D 500 == start 500 characters into the file (not really needed)
-E 150000 == stop after reading this many characters, so you don't need to read the entire file
-i == if a hit is found, proceed to next file. no need to check further and waste time
-N == place the filename at the beginning of the output, for inclusion to upcopy
-w == make the output single line 'w'ide
-o outputfilename.txt == this is the output filename
-V == make the "hit" filename its full length so it can be handed to upcopy
-s    zip_strings.txt == this is the text file containing the strings to look for
-v == do not include any header-footer in output file. clean text output
-U == process as Unicode. This is needed if you think there are 7-zip executables
-d "|" == create a delimited output, for input to upcopy, or Excel


You can then pass the output file to the upcopy command to copy and isolate those exe's for additional processing.
upcopy -S outputfilename.txt  -d  D:\destination\path\towriteto  [any other options]


STRSRCH doesn't currently have grep capabilities. (Examine the logic of the following grep expression on a binary file with no carriage returns: grep http:.* The output would match the entire file's contents from the first occurrence of http. Not really efficient.)

Output files are NEVER overwritten. If a file of the same name and extension is found, the program appends output to that existing file. The date, time and command line used are also placed in the output file for accounting purposes.

When creating output for input into a database you should take note of the last record in the output file. If, by coincidence, the last string hit is close to the end of the input file, then the last output record may not contain enough characters from the input file to fill the entire output record. This is especially likely if very large (-m) output records are requested. Many data bases will drop records which do not completely fill their data area. The last record may have to be manually manipulated to allow the data base to view it as a full record.
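One way to handle the short last record before a database import is to pad it to the fixed record width. The following Python sketch shows the idea; the function name pad_record and the choice of a space as the fill character are assumptions for illustration, and your database may expect a different filler.

```python
def pad_record(record: str, width: int, fill: str = " ") -> str:
    """Right-pad a short record so it fills the fixed width a database expects."""
    if len(record) >= width:
        return record  # already full width (or longer); leave untouched
    return record + fill * (width - len(record))


records = ["mares     |file1.txt |hit text A", "mares     |file2.txt |hit"]
width = max(len(r) for r in records)
fixed = [pad_record(r, width) for r in records]
print([len(r) for r in fixed])
```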

The -[cC] (compress) options remove hex00 and white space from the input buffer, allowing you to search for multi word strings (like certified public accountant) without the intermediate spaces, carriage returns, tabs, etc. This could be used to eliminate missed strings separated by special characters and end of line sequences. In a roundabout way, this compress option will also crudely allow for searching ascii unicode characters. But the -U option is more appropriate for that use.

After removing all white space, it then processes the buffer as contiguous text. Because it physically moves data within the input buffer, this option gives erroneous byte locations in the output file. The byte location provided will be within 64000 bytes of where the data actually is located in the input file. It will appear as an approximation (i.e. ~2560).

The -[cC] options have the following restriction: the line entered in the strings file must also be entered without any spaces (certifiedpublicaccountant).

The -c option is also useful when dealing with files that might be UNICODE formatted files. It will compress out all the NULL (0x00) characters, which make up every other byte of plain ASCII text stored as 16-bit Unicode. The -C (upper case version) will not only compress out hex00, but will also compress all whitespace. This should be used conservatively.
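The effect of the two compress options, and the reason the reported byte locations become approximate, can be illustrated with a short Python sketch. The function name compress and its strip_whitespace flag are hypothetical stand-ins for the -c and -C behavior described above; this is not the program's own code.

```python
def compress(buf: bytes, strip_whitespace: bool = False) -> bytes:
    """Drop 0x00 bytes (like -c); optionally drop whitespace too (like -C)."""
    drop = {0x00}
    if strip_whitespace:
        drop.update(b" \t\r\n\x0b\x0c")  # space, tab, CR, LF, VT, FF
    return bytes(b for b in buf if b not in drop)


utf16 = "certified public".encode("utf-16-le")  # ASCII text stored as 16-bit Unicode
plain = compress(utf16)
print(plain)                   # the NULs are gone, so an ASCII key can match
print(utf16.find("public".encode("utf-16-le")), plain.find(b"public"))

wrapped = b"certified\r\n public\taccountant"
print(compress(wrapped, strip_whitespace=True))
```

The second print shows the offset in the original buffer next to the offset in the compressed buffer. They differ because bytes were removed, which is why the location in the output is only an approximation when these options are used.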

(-X eXtract special option). The program can also take the number of characters selected by the -m xx option which surround the keyword that was found, and place them in additional separate output files. Each file will be named uniquely but the filename generated will contain a reference to the original source file where the "hits" were found. Each of these files will contain all the hits found in a specific input file, and the hit will be surrounded by the -m xx number of characters. So if hits were found in 10 different source files, then there would be 10 different output files.

So if you have a word "mares" and a file contains 10 instances of the word, then a single output file is created, the hits are eXtracted, and 10 sections are created, each with the appropriate number of -m xx characters in it. This option is especially useful for persons providing strsrch output to e-discovery reviewers and processors. You wouldn't want to provide a 20 Meg file of freespace extract to a reviewer when only 100 characters are needed. Additional enhancements to this option are also available via the traditional maresware INI type of option enhancements.

Some additional capabilities are available only via the .INI file. One is control of multiple extraction directories, so that the extract directory doesn't get so many files in it; see DIRFILES in the .INI options. There is also a TYPE .INI directive which will include in the output extracted file the "type" of file the hit comes from. This is a loose definition of TYPE, and is only meaningful for the person who is reviewing the output. For all the ini enhancements see the INI section below the options section.

This program is INI capable, and it is suggested that you use its capabilities if you are doing e-discovery work. Read about the generic ini capability in the section INI found in the main Maresware help file.



OPTIONS

Options are generally not positionally sensitive, meaning the options can be used in any order and may be grouped. The only restriction is that an option which takes a modifier, such as -o (-o outputfile.txt), must have its modifier immediately following the option. So the -m option would need the value directly after the -m, such as: -m 80. For novices using command line options, it would probably be better to segregate each -option and not try to group them.

-p: + drive\path:  Path to start the search at [-p d:\path]. The default is to use the current directory. You can include multiple paths by separating each path with a space. There must be a space after the -p path.    [PATH=path]

-f + file(s)_type:  Include any file types you want to restrict the search to.     [FILE=filetype]

Sample: [-f *.c myfile.exe junk.bat *.obj] The default is (*), which searches for all files, and is identical to using the -a.

-x + file_type(s):  E(x)clude these file types from the search. You might wish to exclude all .exes or .dlls.     [EXCLUDE=filetype]

-s[S]W[xx] + string_file  The upper case 'W' immediately tied to the -s indicates that the strings are to be treated as whole words. This means that, in order to create a hit, the string will be tested for non-printing characters or whitespace surrounding it. Be careful, as one person's interpretation of a word may not be what the program interprets. The use of an UPPER CASE S instead of the lower case makes the test case sensitive. This is not recommended.

The use of the W should be tested and verified for applicable output based on your needs. NOTE: if the filename provided, does not exist, or the program can't find it, the program assumes that the string of characters is that which is searched for. Be careful that you correctly point to the file containing the strings.

A numeric value after the -s or -sW ( -s20, -sW20 ) will make the keyword field in the output record this many characters wide. The default width if all are less than 20 characters is 20 characters. If some are more the program will calculate the maximum needed and adjust accordingly. However, if you intend to merge outputs from two different runs of differing keyword searches, and one list contains a very long keyword then the output records will not be properly sized. The first field in the run with the large keywords will be adjusted upward and be a different width from any other runs. This can cause problems when reprocessing the data using other maresware software. To guarantee a consistent width, use an explicit value large enough to cover the widest key. This option can also be implemented in the ini file with the line: MAXSTRING=xx or STRINGWIDTH=xx.
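The whole word test described for -sW can be read literally as "the byte on each side of the hit is whitespace or non-printing". The Python sketch below implements that literal reading so you can see what it does and does not count as a word. The function names and the treatment of the buffer edges as boundaries are assumptions for illustration; the program's own rule may differ, which is exactly why the output should be verified.

```python
from typing import List, Optional


def is_boundary(byte: Optional[int]) -> bool:
    """True if the neighbour is missing (buffer edge), whitespace, or non-printing."""
    return byte is None or byte <= 0x20 or byte >= 0x7F


def whole_word_hits(data: bytes, key: bytes) -> List[int]:
    """Return offsets where key occurs with a boundary byte on both sides."""
    lowered, needle = data.lower(), key.lower()
    hits: List[int] = []
    if not needle:
        return hits
    pos = lowered.find(needle)
    while pos != -1:
        end = pos + len(needle)
        before = data[pos - 1] if pos > 0 else None
        after = data[end] if end < len(data) else None
        if is_boundary(before) and is_boundary(after):
            hits.append(pos)
        pos = lowered.find(needle, pos + 1)
    return hits


print(whole_word_hits(b"mares maresware (mares) Mares\x00", b"mares"))
```

Under this reading, "maresware" is rejected as expected, but so is "(mares)", because parentheses are printable characters rather than whitespace.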

-i:  "i"mmediate exit. When a "hit" is found in a file, the program stops processing that file immediately. This leads to a single output line indicating the file name containing hits. Use this to get a listing of files which contain hits. Then process the files separately.

-o + dr:output_filename:   (REQUIRED) Output filename is name of the output file you want the output to be placed into. ********** NOTE, DO NOT PLACE OUTPUT FILE IN A LOCATION WHERE IT IS IN THE DIRECTORIES TO BE SEARCHED. IT WILL GET INTO A LOOP AND CONTINUE TO SEARCH AND INCREASE ITS OWN SIZE *********     [OUTPUT=filename]

-m + #:  replace the # with a number representing the new maximum line length of the output record. This length will be the number of characters printed from the input buffer where the string was found. The string will be centered within this area. The normal default is 80 character output width. If this option is used, the string found will be surrounded by ascii characters 174 and 175 which look somewhat like << and >>. If the << and >> are not wanted in the output record, use the INI option: CHEVRON=OFF. [WIDTH=value]

In the output record, use of the following characters: (C, L, or R) will cause the search string hit, to be placed in either the 'L'eft most, 'R'ight most, or 'C'enter of the output record. Upper and lower case CLR values have different meanings. See the program help screen for explanations.

The default is to place the string in the center of the output record and show characters surrounding the string. However, the user now has the capability of placing the string at either end of the output.
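The three placements can be pictured as three different choices of where the output window starts relative to the hit. The Python sketch below is a hypothetical illustration only: the function name place_hit and its where parameter are not program options, and the separate meanings of the upper and lower case C, L and R values are not modeled here.

```python
def place_hit(data: bytes, start: int, length: int, width: int,
              where: str = "C") -> bytes:
    """Return a window of `width` bytes with the hit at the Left, Right or Center."""
    if where == "L":
        lo = start                            # hit begins the record
    elif where == "R":
        lo = start + length - width           # hit ends the record
    else:
        lo = start - (width - length) // 2    # hit is centered
    lo = max(lo, 0)  # clamp so the window never starts before the buffer
    return data[lo:lo + width]


data = b"0123456789mares0123456789"
for where in ("C", "L", "R"):
    print(where, place_hit(data, 10, 5, 11, where))
```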

-d + delimiter  If you intend to import the file into a database, it is suggested you insert delimiters between fields. Use the character, or the ascii value, of the delimiter wanted. The pipe ‘|’ symbol is a fairly standard default for most data bases, and you can always inform the data base what delimiter was used.     [DELIMITER=value]

-wW:
-w #: Make the output record a single line output. Use this for preparation to importing the output into a data base. If the # is replaced with a value, then the path/filename is restricted/enlarged to this value. The default width is 60 characters. The use of the # is optional. The upper case -W eliminates the traditional header and footer from the output file, making it a little cleaner for importation into a data base. [SINGLE=[ON|OFF]]

-N: List the path/filename first in the output record. Use this with the -d delimiter option, for creating an output that can be processed by rm, upcopy, and other Maresware programs.

-1 + filename:  (that's a one, not ell) The filename here is a file which will contain accounting/log information about the run. It is always appended to, and contains the command line, and statistics about how many files and time of run. The file can later be used as a batch file for duplicating the runs. The ACCT environment variable can also be set (SET ACCT=logfilename), or use the .INI option [ACCT=filename]. The order of priority is: Environment, INI file, Command Line option. To explicitly turn it off, use a +1.

-r:  DO NOT do a recursive search. The default is to recurse through the entire structure from default path. If you start at root, the entire drive is searched.      [RECURSE=[ON|OFF]]

-R:  reset file access time. (NT and WIN9X version only). This will reset the file access date and time to the value before the file was analyzed by strsrch.       [RESET=[ON|OFF]]

-c:  (compress input buffer) This -c option removes from the input buffer all hex00 values. This effectively allows the user to search ANSI UNICODE files. The resulting output file positions are not accurate because of the location change of the characters within the processing buffer.

-C:  (compress input buffer) This is more powerful than the -c option. It not only compresses hex00 (UNICODE) but also compresses out all whitespace: tabs, carriage returns, linefeeds and any other whitespace which it finds. [COMPRESS=[ON|OFF]]

-q:  Performs QUIET operation. Only prints hits on screen.[SILENT=[ON|OFF]]

-I:  (NO LONGER SUPPORTED) Indicates the file(s) being searched were created using DISKIMAG.exe. The hits will not only be identified by displacement in the file, but also by a sector number which relates to the sector of the floppy disk image. This is useful as a tool when using NORTON, dd, hex_sect, viewsect to later find the actual sector of the disk for additional review. This option is archaic, and not of any current use.[IMAGE=[ON|OFF]]

LINUX VERSION (Not currently supported) WINDOWS VERSION

-lL+#: (not valid on LINUX version) Process files less than # days old. Based on -t[acw] option. Default last modify (write) date.

-gG+#: (not valid on LINUX version) Process files greater than # days old. Default last modify (write) date.

-t[acw]:   Process file dates options -[lLgG] according to ‘a’ccess, ‘c’hange, or ‘m’/‘w’ modification/write times. Linux uses a funny way of representing the ‘c’hange and ‘m’odification times (an ‘m’ or ‘w’ means the same thing to the Linux version of the program). The ‘c’hange is listed as status change time (ls -cl). I don’t really know what constitutes a change, but it shows up. The ‘m’odification should show last write time. But I haven’t been able to figure that one out either. The modification is the default time listed by ls -l. And ls -ul gets the last ‘u’pdate or access time.

-h:  Bypass (h)ex values that are not printable. This effectively bypasses formatted disks and programs. It can increase speed up to 5 times when there is nothing to search. It bypasses characters < 0x20 and > 0x7f.
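The idea of bypassing unprintable bytes can be sketched in Python as "only look at runs of bytes in the range 0x20 through 0x7f". This is an illustration of the principle, not the program's implementation; the names PRINTABLE_RUN and printable_runs are made up for the example.

```python
import re
from typing import Iterator, Tuple

# Runs of bytes that are NOT below 0x20 and NOT above 0x7f.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7f]+")


def printable_runs(data: bytes, min_len: int = 1) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, run) for each printable run of at least min_len bytes."""
    for match in PRINTABLE_RUN.finditer(data):
        if match.end() - match.start() >= min_len:
            yield match.start(), match.group()


for offset, run in printable_runs(b"\x00\x00abc\x01\xffde"):
    print(offset, run)
```

A buffer that contains nothing but unprintable bytes, such as a freshly formatted disk area, yields no runs at all, so there is nothing to compare against the key list.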

-bB:  Normally in the output record any unprintable characters are converted to the usual dot (.) for legibility. However, in some cases this conversion is not wanted, most often when the output is of data that the user wants intact to place into a data base. The [bB] option does not convert these unprintable characters to dots. It leaves them alone. Use of the upper case B will also automatically install the -W option, which makes a single line output record without headers or footers in the file, again assisting in the import into a data base. [BINARY=[ON|OFF]] The use of the .ini indicator also installs the -W option.

-v:   No Verbose. Do not add any headers or footers to the output file. The output file contains only the data records. It is suggested that if this option be used, the -w option also be used.

-X + eXtract_folder_name:   (available only after 6/2006) A directory name (it must exist, the program doesn't create this folder if it is not already there), where separate files are to be created corresponding to input files where "hits" were found. For every file where there is a "hit", a corresponding file in this directory is created. In these files are placed (paragraphs) of the appropriate number of characters (-m XX option) surrounding each hit. So if a file contains 10 hits, a single output is created with 10 "paragraphs", one for each keyword hit. The output file names are given unique names. The name is formed using the name of the file the hit was found in, and then a unique serialization number is appended. This is to keep unique any possible duplication of filenames from source directory to source directory. Also, if an FTK file export is used to generate the source files to search, and the FTK option is chosen to keep the index number, that index number [12345] is maintained in the output file name. The INI option for this is: [INDIVIDUAL=destination_folder_for_files].

Special INI enhancements only available with the -X option

Some special enhancements to this eXtract option are available only when using a .ini file.

The ini file is named strsrch.ini and is located in the directory where the executable is, or in the current directory. The current directory version takes precedence if the file exists in both locations.
Directives are one item per line.

They are:
[INDIVIDUAL=directory]. Installs the -X extract_directory option from the INI file. This item is identical to using the command line -X extract_directory\location. It is the only one that is available from both command line and ini file content.

[OFFSET=ON]. If this setting is included, then in addition to the output text, the output file also has the offset within the file that the data was located. This is useful for later locating the data in large files.

[DIRFILES=XX]. Replace XX with a number. This will create sub-directories within the top level -X extract folder, and place only XX number of files in each folder. This is useful when you expect to get thousands of files as output. Separating them makes working with the output files a little easier. Each file is uniquely named and the filename also includes a unique index number to prevent duplication of names.

[TYPE=text]. Replace text with a short descriptive word. This would indicate the type of keyword hit we have. It is included in every record. A type more often is used when you have extracted freespace, slackspace or other generic large data files to search in. Type might be something like: FREESPACE.

[DOC_ID=text] Replace text with a short descriptive word. Used mostly for e-discovery Summation and Concordance load files. The DOC_ID string that you enter here is included in every record of the files created in the eXtract directory. It is also sequentially indexed for unique identification of each record. (A record is considered a keyword hit).

[STARS=[ON[xx]]] Between each record in the output eXtract file, include a line of stars (******). If you use the word ON, the default of 20 stars is used. If you use a number, then that number of stars is used, e.g. STARS=40

[SINGLE=[ON]] Create a "single" or separate output file for EACH hit found. This option has the capability of creating a substantial number of files. So it is suggested that the DIRFILES option also be set to restrict the number of files in each directory. If not, the user stands a good chance of choking the operating system with tens of thousands of files in a single directory.

With the exception of the "-X directory_name" option, all of the above ini options are only available if the -eXtract option is used. In other words, these items are only installed if the INI file is used.
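To picture what DIRFILES does, the following Python sketch assigns each extracted file to a numbered sub-directory so that no folder holds more than XX files. The sub-directory and file naming shown here is purely hypothetical (the actual names the program generates are not documented in this section); only the "XX files per folder" arithmetic is the point.

```python
from pathlib import PureWindowsPath


def bucket_path(top: str, serial: int, name: str, per_dir: int) -> PureWindowsPath:
    """Return top\\<bucket>\\<name>_<serial>, with at most per_dir files per bucket."""
    if per_dir < 1:
        raise ValueError("per_dir must be at least 1")
    bucket = serial // per_dir  # files 0..per_dir-1 go to bucket 0, and so on
    return PureWindowsPath(top) / f"{bucket:04d}" / f"{name}_{serial:05d}"


for serial in (0, 99, 100, 250):
    print(bucket_path("HIT_DIR", serial, "bates_no", 100))
```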

Below are samples of ini files, the command line, and sample output. The header containing the word FILENAME is always included. The FILENAME is the file containing the keyword hits.

INI SAMPLES

[STRSRCH]
; SAMPLE 1 ini file contents to obtain a single output file for each hit
;
CHEVRON=OFF
EXTRACT=HIT_DIR
SINGLE=ON
DOC_ID=SINGLE_HIT_FILES
MAXSTRING=30
; command line for this option: strsrch -p c: -1 logfile -s mares -o outputfilename -X HIT_DIR
; this ini file will create a single output file for each HIT 
; All the hits will be placed in SINGLE files in the HIT_DIR folder
;
; this process is time consuming because of the possible large number
; of output files created
;
; SAMPLE single output file contents
;
;FILENAME:  HIT_DIR\bates_no_exe_09999_047
;
;DOC_ID:  SINGLE_HIT_FILES_00047
;STRING: mares
;REGISTERED/ILLEGAL COPY~PROPERY OF MARES AND COMPANY,LLC.*.*.                  
;
;*****  END SEGMENT  *****

SAMPLE #2

[STRSRCH]
; ini file contents to obtain single output file for each input file
;
CHEVRON=OFF
INDIVIDUAL=HIT_DIR
MAXSTRING=30
;
;
; command line for this option:  strsrch -p c: -1 logfile -s mares -o outputfilename
;
; this ini file will create a single file output for each file which 
; contains hit(s). All the hits for an individual file will be placed 
; in a single output file referencing the source file.
;
; this process is time consuming because of the possible large number
; of output files created
;
; SAMPLE OUTPUT: (a file with the following content)

;FILENAME:  D:\TEMP\HIT_DIR\bates_no_09999_087

;DOC_ID:  _00527
;STRING: mares
;N TO CONTINUE, ^C or x==exit.......Maresware .Error processing file.............

;DOC_ID:  _00528
;STRING: mares
;al Copy ...........................Maresware Unregistered/Unlicensed Illegal Cop

;DOC_ID:  _00529
;STRING: mares
;..............Doesn't have a valid Maresware registration.>...y...h...G...T...Ca



COMMAND LINES

strsrch  -p c:\  -s string.fle  -o d:output.fle
search entire C: drive, output on d: THIS IS PREFERRED DEFAULT

strsrch  -p c:\  -f *.txt *.c  -s strings.fle  -o  a:output
search all .txt and .c files on drive C: output to a:

strsrch  -s string.fle  -o a:output.fle
search all files in current directory. output to A: drive

strsrch  -p c:\work  -s string.fle  -o c:\output.fle
search c:\work and all subdirectories (recursion is the default)

strsrch  -p c:\work  -r  -s string.fle  -o c:\output.fle
search c:\work directory only (-r turns off recursion)

strsrch  -p c:\work c:\test  -r -s string.fle  -o c:\output.fle
search c:\work and c:\test only, without subdirectories

strsrch  -p c:\work  -s string.fle  -o c:\output.fle -X c:\top_level_directory
include in c:\top_level_directory individual files containing the hit strings + the -m xx number of characters.
An ini entry of OFFSET=ON, will include the offsets of the hits in the output files.

strsrch  -p x:\   -s c:\tmp\string.fle  -o c:\output.fle -d "|" -v -i -1 c:\logfile.txt -N -m 160 -w 200

This is the suggested command line for input to upcopy.


RELATED PROGRAMS

DISKIMAG  Obsolete

HEX_SECT  Obsolete

NT_SS
