$HOME/Google-Drive-md5-$(date +%F).txt
This may take some time depending on the number of files, but it finished sooner than I expected.
The file contents look like this:
39044094333de4a47d7478227cfa22a9 Facebin/vgg-clean/clea_duvall/00000501.jpg
398d44f0b32d1b4855e838754c2c49fc Facebin/vgg-clean/Bob_Barker/00000445.jpg
c2090f9412b93d71fed884bb45b26518 Facebin/vgg-clean/Adam_Goldberg/00000432.jpg
cd6971f809a534ab02ee1b03eb6c1183 CHECK Uploads/Google Photos/2017/06/IMG_1461.JPG
1fb80cc9ef622410557cb42c4abf26a8 Facebin/vgg-clean/Angell_Conwell/00000510.jpg
160793fd8f24fcf27efb9a2b3698a9c8 Facebin/makimface-artifacts/dataset-images/user-test-v4/00056--/img-5c1fdc5d4fd82c1e12a7d49937d7f47b.png
bcddf6ebbc1eed83950d908c2ccbac4a Facebin/vgg-clean/danny_pino/00000102.jpg
eb2746a9559e7a93fd0002f4bfd90517 Facebin/makimface-artifacts/dataset-images/user-test/00009--/ds-767cd16bfdf4e31860d597708a586979.png
9605d69aaf5fb83d309f10f9e2544630 Facebin/vgg-clean/Adam_Beach/00000953.jpg
68e6d3ec3cca8a728a78fb60c9a1ddba Facebin/vgg-clean/corey_stoll/00000137.jpg
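The first 32 characters of each line are the md5 hash of a file, and the second column is the file's path. rclone writes the listing in md5sum format, with two spaces between the columns, so the path starts at character 35.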
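As a small illustration of that layout, the two columns can be pulled apart with `cut`. The helper names `md5_of_line` and `path_of_line` are made up for this sketch, and it assumes the two-space separator that the script's `cut -c 35-` also relies on.

```shell
# Hypothetical helpers (not part of rclone): split one listing line into its columns.
# Assumes a 32-character md5, two separator characters, then the path.
md5_of_line() {
    printf '%s\n' "$1" | cut -c 1-32
}

path_of_line() {
    printf '%s\n' "$1" | cut -c 35-
}
```

Calling `path_of_line` on a line works even when the path contains spaces, because the split is done by character position and not by whitespace.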
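You’ll probably see some lines where the md5 is blank. These are your Google Docs, Sheets, etc., which have no md5 hash. It’s important to remove them from the list: after sorting, lines with a blank hash end up next to each other and would be treated as duplicates of one another. We keep only the lines that start with a 32-character hex hash:
$ grep -E '^[0-9a-f]{32} ' $HOME/Google-Drive-md5-$(date +%F).txt > $HOME/Google-Drive-md5-cleaned.txt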
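Before trusting the cleaned list, it may be worth counting how many lines actually carry a hash and how many do not. The following is only a sketch: `summarize_md5_listing` is a hypothetical helper, and it assumes that a line without a hash simply does not start with 32 hex digits followed by a space.

```shell
# Hypothetical helper: count lines with and without a leading md5 hash.
# $1 is the path of a listing in "hash  path" format.
summarize_md5_listing() {
    total=$(awk 'END { print NR }' "$1")
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    hashed=$(grep -c -E '^[0-9a-f]{32} ' "$1" || true)
    echo "hashed=$hashed blank=$((total - hashed))"
}
```

Running it on the raw listing and again on the cleaned one should show the same `hashed` count, with `blank=0` for the cleaned file.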
Then we sort the file, so that lines with the same hash end up next to each other:
$ sort $HOME/Google-Drive-md5-cleaned.txt > $HOME/Google-Drive-md5-sorted.txt
We are keeping all intermediate files because you may want to take a look at the differences after the cleanup.
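The sorted file can also give a rough idea of how much there is to clean up before anything is deleted. A minimal sketch, assuming the same 32-character hash column; `count_extra_copies` is a name invented here.

```shell
# Hypothetical helper: count the lines whose hash already appeared on an
# earlier line, i.e. the number of surplus copies in a sorted listing.
# $1 is the path of the sorted listing.
count_extra_copies() {
    cut -c 1-32 "$1" \
        | uniq -c \
        | awk '{ extra += $1 - 1 } END { print extra + 0 }'
}
```

For example, `count_extra_copies $HOME/Google-Drive-md5-sorted.txt` prints a single number: the count of files that share a hash with an earlier line.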
Then we write the following script:
#!/bin/zsh
# Delete every file whose md5 equals that of the previous line in the sorted list.
PREV_MD5=""
PREV_PATH=""
MD5_FILE=$HOME/Google-Drive-md5-sorted.txt
# IFS= and -r keep leading spaces and backslashes in paths intact.
while IFS= read -r current_line ; do
    # Columns: 32-character md5, two spaces, then the path (from character 35).
    CURRENT_MD5=${current_line:0:32}
    CURRENT_PATH=${current_line:34}
    if [[ -n "$CURRENT_MD5" && "$CURRENT_MD5" == "$PREV_MD5" ]] ; then
        echo "EQUAL: $CURRENT_MD5 $PREV_MD5"
        echo "KEEP: drive:/$PREV_PATH"
        echo "DELETE: drive:/$CURRENT_PATH"
        rclone -v delete "drive:/$CURRENT_PATH"
    else
        # First occurrence of this hash: remember it and keep the file.
        PREV_MD5=$CURRENT_MD5
        PREV_PATH=$CURRENT_PATH
    fi
done < "$MD5_FILE"
You can save this script as google-drive-delete-duplicates.sh and run it:
$ chmod +x google-drive-delete-duplicates.sh
$ ./google-drive-delete-duplicates.sh
The script reads the sorted list line by line. Whenever a line has the same md5 as the line before it, the file on that line is deleted. Because the list is sorted by hash, all copies of a file sit on consecutive lines, so only the first copy is kept even when there are more than two.
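To see what this comparison would do without touching the drive, the same logic can be sketched as a dry run that only prints its decisions. `preview_duplicates` is a hypothetical helper, not part of the script above, and it assumes the same column layout (hash in characters 1-32, path from character 35).

```shell
# Hypothetical dry run: print which file would be kept and which would be
# deleted for every group of identical hashes. Nothing is deleted.
# $1 is the path of the sorted listing.
preview_duplicates() {
    awk '{
        hash = substr($0, 1, 32)
        path = substr($0, 35)
        if (hash != "" && hash == prev_hash) {
            # Print the kept file once per group, then each surplus copy.
            if (!printed) { print "KEEP   " prev_path; printed = 1 }
            print "DELETE " path
        } else {
            prev_hash = hash
            prev_path = path
            printed = 0
        }
    }' "$1"
}
```

With three copies of one file in the list, the output has one `KEEP` line for the first copy and two `DELETE` lines for the others, which matches the behaviour described above.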
I reclaimed about 200 GB by running this script.