Useful and (imho) elegant bash script found here to remove duplicate files based on their md5 sum. I use it to get rid of duplicate emails in my mailbox.
#!/bin/bash
declare -A arr
shopt -s globstar
for file in **; do
    [[ -f "$file" ]] || continue
    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"
    fi
done
Of course, as written the script only prints what it would delete. To put the rm in effect, replace the echo "rm $file" line with rm -- "$file". I actually use mv -- "$file" ~/.local/share/Trash/files/ instead of the rm, for more safety.
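Here is a minimal sketch of that safer variant, wrapped in a function so that the directory to scan and the destination are explicit. The function name dedupe_to and its two arguments are my own naming, not part of the original script, and I hash the file through stdin so that md5sum never has to print (and escape) the file name. It assumes bash 4 or later and GNU md5sum.

```bash
#!/bin/bash
# Sketch: move every duplicate found under $1 into directory $2 instead of deleting it.
# dedupe_to is a hypothetical helper name, not part of the original script.
dedupe_to() {
    local src=$1 dest=$2 file cksm
    local -A seen                      # checksum -> number of times seen
    shopt -s globstar                  # make ** recurse into subdirectories
    mkdir -p -- "$dest"
    for file in "$src"/**; do
        [[ -f $file ]] || continue     # skip directories and other non-files
        # Reading from stdin makes md5sum print "<sum>  -", so only the sum matters.
        read -r cksm _ < <(md5sum < "$file")
        if (( seen[$cksm]++ )); then   # non-zero means this sum was seen before
            mv -- "$file" "$dest"/
        fi
    done
}
```

It would be called as dedupe_to ~/Mail ~/.local/share/Trash/files (the first path is only an example). Keep in mind that two duplicates with the same base name would overwrite each other in the destination, and that the destination should not sit inside the scanned directory.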
It took me some searching to understand the read cksm _ < <(md5sum "$file") statement. It stores the first word of the md5sum "$file" output (the md5 sum itself) in the cksm variable and everything else (here, the file name, which md5sum prints after the sum) in the variable named _. In effect, this throws away anything but the first word. One could just as well write read cksm throwaway < <(md5sum "$file"). The name _ is simply a conventional placeholder for a value you do not care about; bash overwrites $_ itself after every command, so nothing useful stays in it.
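To see the splitting on its own, here is a small sketch that feeds read a made-up line shaped like md5sum output (the checksum and the file name are sample values I chose, and rest and only_sum are illustrative variable names):

```bash
#!/bin/bash
# Sample line in md5sum's format: checksum, two spaces, file name.
line='d41d8cd98f00b204e9800998ecf8427e  some file.eml'

# The first word goes to cksm; the whole remainder goes to the last variable.
read -r cksm rest <<< "$line"
echo "cksm=$cksm"
echo "rest=$rest"

# Same split, with the remainder discarded into the conventional name _.
read -r only_sum _ <<< "$line"
echo "only_sum=$only_sum"
```

The last variable always receives everything that is left over, which is why a file name containing spaces does not break the extraction of the checksum.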
It also turns out that only the first < is an actual redirection: it sets stdin to what follows. The second one is part of <(cmd), which is a process substitution (not a command substitution, which is written $(cmd)). According to this answer, it expands to the name of a file from which the output of the enclosed command can be read. So for example
vic@bufni:~$ echo <(date)
/dev/fd/63
vic@bufni:~$ cat <(date)
Mon Sep 28 11:43:32 CEST 2015
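As far as I understand, this is also why the script does not simply pipe md5sum into read: in bash's default configuration every part of a pipeline runs in a subshell, so a variable set there is gone once the pipeline ends. A small sketch of the difference (the sample text and the variable names from_pipe and from_procsub are made up for the demonstration):

```bash
#!/bin/bash
# With a pipe, read runs in a subshell, so the variable is lost afterwards
# (bash's default behaviour, i.e. with the lastpipe option off).
unset from_pipe
echo "abc123  file" | read -r from_pipe _
echo "from_pipe='${from_pipe:-}'"

# With process substitution, read runs in the current shell and the value stays.
read -r from_procsub _ < <(echo "abc123  file")
echo "from_procsub='$from_procsub'"
```

The first echo shows an empty value, the second shows abc123.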
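Returning to the script itself, the way the array is used is interesting: it does not store the md5 sum as a value but as a key (a label) of an associative array, and the value simply counts how often that sum has been seen. Then you just need to test whether the key was already there. Smart!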
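The counting trick can be tried in isolation. This is only a sketch with made-up keys standing in for checksums; seen and dupes are names I picked for the illustration:

```bash
#!/bin/bash
declare -A seen        # key -> number of times the key has appeared
dupes=()
for key in aaa bbb aaa ccc bbb aaa; do
    # Post-increment: the expression yields the old count, so it is 0 (false)
    # the first time a key appears and non-zero (true) on every repeat.
    if (( seen[$key]++ )); then
        dupes+=("$key")
    fi
done
echo "repeats, in order: ${dupes[*]}"
echo "aaa was seen ${seen[aaa]} times"
```

Only the second and later occurrences end up in dupes, which is exactly the behaviour the script relies on: the first file with a given checksum is kept and every later one is reported.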