Deleting duplicate files in Google Drive using rclone
2021-01-06
I'm a Google One user and my Google Drive holds about 1TB of content from various sources. A few years ago I used a third-party utility to sync my Linux boxes to Drive, and it created a lot of duplicate files. I ended up with 3-4 different versions of some directories, each with a different set of files. As a true procrastinator, I postponed the problem until Google One alerted me about my quota.
These days I'm using rclone.
It's my primary way of using Drive on Linux after that half-baked sync application.
I discovered that rclone can list server-side MD5 hashes for files, so I decided to write a script to delete the duplicate files in Drive.
First I get the MD5 hash of every file in Drive using
$ rclone md5sum drive:/ > $HOME/Google-Drive-md5-$(date +%F).txt
This may take some time depending on the number of files, but it finished quicker than I expected.
The file's contents look like this:
39044094333de4a47d7478227cfa22a9  Facebin/vgg-clean/clea_duvall/00000501.jpg
398d44f0b32d1b4855e838754c2c49fc  Facebin/vgg-clean/Bob_Barker/00000445.jpg
c2090f9412b93d71fed884bb45b26518  Facebin/vgg-clean/Adam_Goldberg/00000432.jpg
cd6971f809a534ab02ee1b03eb6c1183  CHECK Uploads/Google Photos/2017/06/IMG_1461.JPG
1fb80cc9ef622410557cb42c4abf26a8  Facebin/vgg-clean/Angell_Conwell/00000510.jpg
160793fd8f24fcf27efb9a2b3698a9c8  Facebin/makimface-artifacts/dataset-images/user-test-v4/00056--/img-5c1fdc5d4fd82c1e12a7d49937d7f47b.png
bcddf6ebbc1eed83950d908c2ccbac4a  Facebin/vgg-clean/danny_pino/00000102.jpg
eb2746a9559e7a93fd0002f4bfd90517  Facebin/makimface-artifacts/dataset-images/user-test/00009--/ds-767cd16bfdf4e31860d597708a586979.png
9605d69aaf5fb83d309f10f9e2544630  Facebin/vgg-clean/Adam_Beach/00000953.jpg
68e6d3ec3cca8a728a78fb60c9a1ddba  Facebin/vgg-clean/corey_stoll/00000137.jpg
The first 32 characters of each line are the MD5 hash of a file and the second column is the path of the file.
You'll probably see a blank MD5 for some files. These are your Google Docs, spreadsheets etc., which have no MD5 hash, so it's important to remove them from the list. We keep only the lines that start with a full 32-character hash:
$ grep -E '^[0-9a-f]{32} ' $HOME/Google-Drive-md5-$(date +%F).txt > $HOME/Google-Drive-md5-cleaned.txt
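To see what this filtering step does, here is a small sketch that runs on a made-up listing instead of the real one. The sample paths and the temporary file names are hypothetical, and the blank entry assumes rclone pads a missing hash with 32 spaces. Only lines that begin with 32 hex characters survive.

```shell
#!/bin/sh
# Hypothetical sample listing: two hashed files and one blank-hash entry
# (standing in for a Google Doc). Not real Drive output.
sample=$(mktemp)
cleaned=$(mktemp)
printf '%s  %s\n' \
  '39044094333de4a47d7478227cfa22a9' 'Facebin/vgg-clean/clea_duvall/00000501.jpg' \
  '                                ' 'Notes/My Doc.docx' \
  '398d44f0b32d1b4855e838754c2c49fc' 'Facebin/vgg-clean/Bob_Barker/00000445.jpg' > "$sample"

# Keep only lines that begin with a full 32-character hex hash.
grep -E '^[0-9a-f]{32} ' "$sample" > "$cleaned"

# Count the surviving lines (tr strips the padding some wc versions add).
kept=$(wc -l < "$cleaned" | tr -d ' ')
echo "kept $kept of 3 lines"
```

It's worth running `wc -l` on the real listing before and after filtering too, just to check that the number of dropped lines looks plausible.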
Then we sort the file using
$ sort $HOME/Google-Drive-md5-cleaned.txt > $HOME/Google-Drive-md5-sorted.txt
We are keeping all intermediate files because you may want to take a look at the differences after the cleanup.
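As a sketch of how that comparison could look: if you produce a second filtered and sorted listing after the cleanup, `comm` can show which entries disappeared. The two sample listings below are made up, and the variable names are only for illustration.

```shell
#!/bin/sh
# Hypothetical "before" and "after" listings, both already sorted.
before=$(mktemp)
after=$(mktemp)
printf '%s\n' \
  '1fb80cc9ef622410557cb42c4abf26a8  a/photo.jpg' \
  '1fb80cc9ef622410557cb42c4abf26a8  b/photo.jpg' \
  '9605d69aaf5fb83d309f10f9e2544630  a/unique.jpg' > "$before"
printf '%s\n' \
  '1fb80cc9ef622410557cb42c4abf26a8  a/photo.jpg' \
  '9605d69aaf5fb83d309f10f9e2544630  a/unique.jpg' > "$after"

# comm -23 prints lines that appear only in the first file,
# i.e. the entries that were removed by the cleanup.
# LC_ALL=C makes comm expect plain byte order.
removed=$(LC_ALL=C comm -23 "$before" "$after")
printf '%s\n' "$removed"
```

`comm` expects both inputs in the same sort order, so the real files should be sorted with the same locale setting (for example `LC_ALL=C sort`) before comparing.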
Then we write the following script and run it
#!/bin/zsh
# Delete every file whose MD5 matches the MD5 on the previous line.
# The input must be sorted so that identical hashes are adjacent.
PREV_MD5=""
MD5_FILE="$HOME/Google-Drive-md5-sorted.txt"

# IFS= and -r keep leading spaces and backslashes in paths intact
while IFS= read -r current_line ; do
    # Columns 1-32 hold the hash; the path starts at column 35
    CURRENT_MD5=$(printf '%s\n' "$current_line" | cut -c -32)
    CURRENT_PATH=$(printf '%s\n' "$current_line" | cut -c 35-)
    if [[ "$CURRENT_MD5" == "$PREV_MD5" ]] ; then
        echo "EQUAL: $CURRENT_MD5 $PREV_MD5"
        echo "DELETE: drive:/$CURRENT_PATH"
        rclone -v delete "drive:/$CURRENT_PATH"
    else
        # New hash: this file is the copy we keep
        PREV_MD5=$CURRENT_MD5
    fi
done < "$MD5_FILE"
You can save this script as google-drive-delete-duplicates.sh and run it:
$ chmod +x google-drive-delete-duplicates.sh
$ ./google-drive-delete-duplicates.sh
The script walks through the sorted list line by line, and whenever a line has the same MD5 as the previous one, it deletes that line's file. This keeps exactly one copy of each file, even when there are more than two duplicates.
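If you'd rather see what would be deleted before touching Drive, the same keep-the-first logic can be sketched with awk on the sorted list. This version only prints paths and never calls rclone. The sample data is made up, and it assumes the path starts at column 35 of each line.

```shell
#!/bin/sh
# Hypothetical sorted listing: three copies of one file plus a unique file.
sorted=$(mktemp)
printf '%s  %s\n' \
  '1fb80cc9ef622410557cb42c4abf26a8' 'a/photo.jpg' \
  '1fb80cc9ef622410557cb42c4abf26a8' 'b/photo copy.jpg' \
  '1fb80cc9ef622410557cb42c4abf26a8' 'c/photo.jpg' \
  '9605d69aaf5fb83d309f10f9e2544630' 'a/unique.jpg' > "$sorted"

# Print the path of every line whose hash equals the previous line's hash.
# The first line of each group is never printed, so one copy is always kept.
would_delete=$(awk '{
  hash = substr($0, 1, 32)
  if (hash == prev) print substr($0, 35)
  prev = hash
}' "$sorted")
printf '%s\n' "$would_delete"
```

Here `a/photo.jpg` is kept and the other two copies are listed. Which copy survives depends only on sort order, so skim the list if you care about which directory keeps the file.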
I freed up about 200GB by running this script.