shell - Recursively search directory of binary files for hexadecimal sequence?


The current commands I'm using to search for some hex values (say 0a 8b 02) involve:

find . -type f -not -name "*.png" -exec xxd -p {} \; | grep "0a8b02" || xargs -0 -P 4

Is it possible to improve this given the following goals:

  • search files recursively
  • display the offset and filename
  • exclude files with certain extensions (the above example will not search .png files)
  • speed: the search needs to handle 200,000 files (around 50 KB to 1 MB each) in a directory totaling ~2 GB.

I'm not confident that xargs is working properly across 4 processors. I'm also having difficulty printing the filename when grep finds a match, since its input is piped from xxd. Any suggestions?

If:

  • you have GNU grep
  • and the hex bytes you search for never contain newlines (0xa)[1]
    • if they contain NUL (0x0), you must provide the grep search string via a file (-f) rather than by direct argument.

then the following command should get you there, using the example of searching for 0e 8b 02:

LC_ALL=C find . -type f -not -name "*.png" -exec grep -FHoab $'\x0e\x8b\x02' {} + | LC_ALL=C cut -d: -f1-2

The grep command produces output lines as follows:

<filename>:<byte-offset>:<matched-bytes> 

which LC_ALL=C cut -d: -f1-2 then reduces to <filename>:<byte-offset>
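To make the output format concrete, here is a minimal sketch that runs the pipeline against a throwaway directory. It assumes bash and GNU grep; the file names a.bin and skip.png and the padding bytes are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash (for $'...' quoting) and GNU grep.
# File names and contents below are hypothetical test data.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
printf 'ZZZZ\x0e\x8b\x02ZZZZ' > "$tmp/a.bin"     # sequence starts at byte offset 4
printf '\x0e\x8b\x02' > "$tmp/sub/skip.png"      # contains the sequence, but is excluded by name
cd "$tmp" || exit 1

# Same pipeline as above: grep prints <filename>:<byte-offset>:<matched-bytes>,
# cut keeps only the first two colon-separated fields.
result=$(LC_ALL=C find . -type f -not -name "*.png" \
           -exec grep -FHoab $'\x0e\x8b\x02' {} + | LC_ALL=C cut -d: -f1-2)
printf '%s\n' "$result"
```

With GNU grep this should print a single line for a.bin with offset 4; the .png file is never passed to grep.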

The command almost works with BSD grep, except that the byte offset reported is invariably the start of the line that the pattern was matched on.
In other words: the byte offset is only correct if no newlines precede the match in the file.
Also, BSD grep doesn't support specifying NUL (0x0) bytes as part of the search string, not even when provided via a file with -f.
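For the NUL case with GNU grep, the search bytes can be written to a pattern file and passed with -f, since a shell argument cannot carry a NUL byte. The following is a sketch under that assumption; pattern.bin, the data directory, and the sample bytes 0e 00 02 are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash and GNU grep. All names and bytes are illustrative.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir data
printf 'ZZZ\x0e\x00\x02ZZZ' > data/nul.bin   # sequence starts at byte offset 3

# Write the raw search bytes, including the NUL, to a file.
# No trailing newline is written, so the file holds exactly one pattern.
printf '\x0e\x00\x02' > pattern.bin

# -f reads the search string from the file; the other options are unchanged.
nul_result=$(LC_ALL=C find data -type f \
               -exec grep -FHoab -f pattern.bin {} + | LC_ALL=C cut -d: -f1-2)
printf '%s\n' "$nul_result"
```

The pattern file is kept outside the searched directory so that it does not match itself.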

  • Note that there'll be no parallel processing, but only a few grep invocations, based on using find's -exec ... +, which, like xargs, passes as many filenames as will fit on a command line to grep at once.
  • By letting grep search for the byte sequence directly, there is no need for xxd:
    • The sequence is specified as an ANSI C-quoted string, which means that the escape sequences are expanded to literals by the shell, enabling grep to search for the resulting string as a literal (via -F), which is faster.
      ANSI C quoting is documented in the Bash manual, but it works in zsh (and ksh) too.
      • A GNU grep alternative is to use -P (support for PCREs, Perl-compatible regular expressions) with non-pre-expanded escape sequences, but this will be slower: grep -PHoab '\x{0e}\x{8b}\x{02}'
    • LC_ALL=C ensures that grep treats each byte as its own character without applying any encoding rules.
    • -F treats the search strings as literals (rather than regexes)
    • -H prepends the relevant input filename to each output line; note that grep does this implicitly when given more than 1 filename argument
    • -o reports only the matched strings (byte sequences), not the whole line (the concept of a line has no meaning in binary files anyway)[2]
    • -a treats binary files as if they were text files (without this, grep would only print the text Binary file <filename> matches for binary input files with matches)
    • -b reports the byte offsets of matches
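If parallel processing is wanted after all, one possible variation is to have find emit NUL-separated filenames and let xargs -0 -P 4 run several grep processes at once. This is a sketch, not a measured improvement: the batch size (-n 2) is deliberately tiny so that the demo data triggers more than one grep invocation, and the file names are hypothetical. On a real tree, a much larger -n value would be more sensible, and output from concurrent grep processes may arrive in any order.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, GNU grep, and an xargs that supports -0 and -P.
tmp=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  printf 'ZZ\x0e\x8b\x02ZZ' > "$tmp/f$i.bin"   # sequence starts at byte offset 2
done
printf '\x0e\x8b\x02' > "$tmp/skip.png"        # excluded by name
cd "$tmp" || exit 1

# -print0 / -0 keep unusual filenames intact; -P 4 allows up to 4 greps at a time.
# -n 2 is for demonstration only, to force several batches.
parallel_result=$(find . -type f -not -name "*.png" -print0 |
  LC_ALL=C xargs -0 -n 2 -P 4 grep -FHoab $'\x0e\x8b\x02' |
  LC_ALL=C cut -d: -f1-2 | LC_ALL=C sort)
printf '%s\n' "$parallel_result"
```

The final sort is only there to make the order deterministic. Whether this is faster than the single -exec ... + pipeline depends on disk and CPU, so it is worth timing both on the actual data.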

If it's sufficient to find at most 1 match in a given input file, add -m 1.


[1] Newlines cannot be used, because grep invariably treats newlines in a search-pattern string as separating multiple search patterns. Also, grep is line-based, so you can't match across lines; GNU grep's --null-data option, which splits the input by NUL bytes, could help, but only if your search byte sequence doesn't also comprise NUL bytes; you'd also have to represent your byte values as escape sequences in a regex combined with -P, because you'll need to use the escape sequence \n in lieu of actual newlines.

[2] -o is needed to make -b report the byte offset of the match as opposed to that of the beginning of the line (as stated, BSD grep always does the latter, unfortunately); additionally, it is beneficial to report only the matches themselves here, as an attempt to print the entire line would result in unpredictably long output lines, given that there's no concept of lines in binary files; either way, however, outputting bytes from a binary file may cause strange rendering behavior in the terminal.

