A useful and (imho) elegant bash script, found here, that removes duplicate files based on their md5 sum. I use it to get rid of duplicate emails in my mailbox.
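#!/bin/bash
# Associative array (bash 4+): md5 sum -> number of files seen with that sum
declare -A arr
shopt -s globstar   # let ** match files in all subdirectories
for file in **; do
    [[ -f "$file" ]] || continue   # skip directories and other non-regular files
    # keep only the checksum; md5sum also prints the file name
    read -r cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then   # non-zero old count: this content was seen before
        echo "rm $file"
    fi
done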
Of course, as written the script only prints the rm commands; remove the echo to make it really delete the files. I replace the rm with mv "$file" ~/.local/share/Trash/files/, for more safety.
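To make the "move instead of delete" idea concrete, here is a minimal sketch of that variant wrapped in a function. The name quarantine_duplicates and its two arguments (the directory to scan and a holding directory) are made up for this illustration, and the sketch assumes GNU md5sum, bash 4+ and a holding directory that lives outside the scanned one.

```shell
#!/bin/bash
# Sketch only: quarantine_duplicates, root and dest are hypothetical names.
# Moves every file under root whose content was already seen into dest.
quarantine_duplicates() {
    local root=$1 dest=$2 file cksm
    local -A seen                      # md5 sum -> number of files seen so far
    shopt -s globstar                  # let ** recurse into subdirectories
    mkdir -p -- "$dest"
    for file in "$root"/**; do
        [[ -f "$file" ]] || continue   # skip directories
        read -r cksm _ < <(md5sum -- "$file")
        if ((seen[$cksm]++)); then
            # -n: never overwrite a file that already sits in dest
            mv -n -- "$file" "$dest"/
        fi
    done
}

# Example call (commented out so that sourcing this file moves nothing):
# quarantine_duplicates ~/Mail ~/.local/share/Trash/files
```

Because of mv -n, a duplicate whose name already exists in the holding directory is simply left where it is, which errs on the side of not losing anything.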
It took me some searching to understand the read cksm _ < <(md5sum "$file") statement. It stores the first word of the md5sum "$file" output (the md5 sum itself) in the cksm variable and everything else (here, the file name, which md5sum prints after the sum) in a variable named _. In effect, this throws away everything but the first word. One could just as well write read cksm throwaway < <(md5sum "$file"). Using _ is just a conventional name for a value you do not care about (bash itself keeps overwriting $_ with the last argument of the previous command, so nothing useful stays there).
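The word splitting done by read is easy to try in isolation. The sketch below feeds read a hard-coded sample line through a here-string (<<<) instead of running md5sum; the file name is invented, and the checksum is the md5 of empty input.

```shell
#!/bin/bash
# Sample line in the "<checksum>  <file name>" shape that md5sum prints.
# The file name is made up for this illustration.
line='d41d8cd98f00b204e9800998ecf8427e  some file.txt'

# First word goes to cksm, the whole remainder of the line to rest.
read -r cksm rest <<< "$line"
echo "checksum: $cksm"
echo "rest:     $rest"

# Same split, but the remainder lands in the dummy name _ and is ignored.
read -r only_sum _ <<< "$line"
echo "checksum: $only_sum"
```

Note that the last variable named in a read receives everything that is left, spaces included, which is why file names containing spaces do not break the script.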
It also looks as if only the first < is an actual redirect. It sets stdin to what follows. The second one is part of <(cmd), which is a process substitution (not to be confused with command substitution, which is written $(cmd)). According to this answer, it expands to the name of a file from which the output of the enclosed command can be read. So for example:
vic@bufni:~$ echo <(date)
/dev/fd/63
vic@bufni:~$ cat <(date)
Mon Sep 28 11:43:32 CEST 2015
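One likely reason the script uses < <(cmd) rather than the more familiar cmd | read is that, with bash's default settings, every part of a pipeline runs in a subshell, so a variable set by read there is gone once the pipeline ends. A small sketch of the difference, with made-up variable names and assuming the lastpipe option is left off:

```shell
#!/bin/bash
# With a pipe, read runs in a subshell: the assignment is lost afterwards.
piped=""
echo "hello world" | read -r piped _
echo "after pipe:      '$piped'"

# With process substitution, read runs in the current shell: the value stays.
read -r substituted _ < <(echo "hello world")
echo "after < <(...): '$substituted'"
```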
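Returning to the script itself, the way the array is used is interesting: it does not store the md5 sum as a value but as a key. Then you just need to test whether that key has been seen before: arr[$cksm]++ evaluates to the old count, which is 0 (false) the first time and non-zero afterwards. Smart!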
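The counting trick can be tried on its own, without any files. In this sketch the short strings stand in for checksums and the array name seen is arbitrary:

```shell
#!/bin/bash
declare -A seen   # key -> how many times the key has come up

# Stand-ins for checksums; some of them repeat.
for key in aaa bbb aaa ccc bbb aaa; do
    # Post-increment yields the old count: 0 (false) on first sight,
    # non-zero (true) on every later one.
    if ((seen[$key]++)); then
        echo "duplicate: $key"
    else
        echo "first:     $key"
    fi
done
```

A side effect of counting instead of merely flagging is that the array ends up holding, for each key, how many times it occurred.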