Bash script to remove duplicates

A useful and (imho) elegant bash script, found here, that removes duplicate files based on their md5 sum. I use it to get rid of duplicate emails in my mailbox.

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
  [[ -f "$file" ]] || continue

  read cksm _ < <(md5sum "$file")
  if ((arr[$cksm]++)); then 
    echo "rm $file"
  fi
done

As written, the script only prints the rm commands it would run. To make it actually delete files, replace the echo on line 10 with rm "$file". I actually replace the rm with mv "$file" ~/.local/share/Trash/files/, for more safety.
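Here is a minimal sketch of that safer variant, not the original author's code. The function name move_duplicates and its two parameters (the directory to scan and the trash directory) are made up for illustration. It assumes GNU md5sum and GNU mv, whose --backup=numbered option keeps two duplicates with the same name from overwriting each other in the trash.

```shell
#!/usr/bin/env bash
# Sketch: move duplicates into a trash directory instead of deleting them.
# move_duplicates, root and trash_dir are hypothetical names for this example.
move_duplicates() {
  local root=$1 trash_dir=$2
  local file cksm
  local -A seen=()

  shopt -s globstar nullglob
  mkdir -p -- "$trash_dir"

  for file in "$root"/**; do
    [[ -f $file ]] || continue

    read -r cksm _ < <(md5sum -- "$file")
    # GNU md5sum starts the line with a backslash when the file name
    # contains a backslash or a newline; strip it so keys stay comparable.
    cksm=${cksm#\\}

    # Post-increment is 0 (false) the first time a checksum is seen.
    if (( seen[$cksm]++ )); then
      mv --backup=numbered -- "$file" "$trash_dir"/
    fi
  done
}
```

It could be called as move_duplicates ~/Mail ~/.local/share/Trash/files, where ~/Mail is only an example path. The glob is sorted, so the copy that survives is the one whose path sorts first.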

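It took me some searching to understand the read cksm _ < <(md5sum "$file") statement. read splits its input into words: the first word of the md5sum "$file" output (the md5 sum itself) goes into the cksm variable, and everything else (here, the filename, which md5sum prints after the sum) goes into a variable named _. In effect, this throws away anything but the first word. One could just as well write read cksm throwaway < <(md5sum "$file"). Using _ as the name is just a convention for a value nobody will read; bash itself resets $_ to the last argument of each command it runs, so nothing useful stays there anyway.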
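To see the word splitting in isolation, here is a small sketch that feeds read a hard-coded line instead of real md5sum output. The checksum and file name are made-up sample data, and rest is just an illustrative variable name.

```shell
#!/usr/bin/env bash
# Stand-in for one line of md5sum output: checksum, two spaces, file name.
line='d41d8cd98f00b204e9800998ecf8427e  some file.txt'

# The first word goes into cksm, the remainder of the line into _.
read -r cksm _ < <(printf '%s\n' "$line")
echo "$cksm"    # prints d41d8cd98f00b204e9800998ecf8427e

# With an ordinary variable name the remainder is kept and can be inspected.
read -r cksm rest < <(printf '%s\n' "$line")
echo "$rest"    # prints some file.txt
```

The last variable given to read always receives the rest of the line, which is why a file name containing spaces does not break the checksum.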

It also turns out that only the first < is an actual redirection: it sets stdin to what follows. The second one is part of <(cmd), which is a process substitution (not a command substitution, which is written $(cmd)). According to this answer, it expands to the name of a file from which the output of the enclosed command can be read. For example:

vic@bufni:~$ echo <(date)
/dev/fd/63
vic@bufni:~$ cat <(date)
Mon Sep 28 11:43:32 CEST 2015
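One likely reason the script uses < <(cmd) rather than the more familiar cmd | read is that, in bash's default configuration, every part of a pipeline runs in a subshell, so a variable set by read would be gone once the pipeline ends. The sketch below compares the two forms; the variable names piped and redirected and the sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Pipe: read runs in a subshell, so the assignment never reaches this shell.
piped=''
printf 'abc123  file.txt\n' | read -r piped _
echo "piped: '$piped'"              # empty unless the lastpipe option is on

# Redirection from <(cmd): read runs in the current shell and the value stays.
read -r redirected _ < <(printf 'abc123  file.txt\n')
echo "redirected: '$redirected'"    # prints redirected: 'abc123'
```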

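Returning to the script itself, the way the array is used is interesting: it does not store the md5 as a value but as a key of an associative array, with a counter as the value. ((arr[$cksm]++)) evaluates to the count before the increment, so the test is false the first time a sum is seen and true for every later file with the same sum. Smart!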
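The same idiom works for any kind of key, not just checksums. As a small sketch, the loop below runs it over a hard-coded list of words; the names seen, dups and the sample words are invented for the example.

```shell
#!/usr/bin/env bash
declare -A seen   # key -> number of times the key has been met
dups=()           # every occurrence after the first, in order

for key in aaa bbb aaa ccc bbb aaa; do
  # seen[$key]++ yields the old count: 0 (false) on the first encounter,
  # non-zero (true) on every later one.
  if (( seen[$key]++ )); then
    dups+=("$key")
  fi
done

echo "${dups[*]}"     # prints aaa bbb aaa
echo "${seen[aaa]}"   # prints 3
```

A side effect of counting instead of storing a plain flag is that the array also records how many copies of each key were found.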