Deleting duplicate files in Google Drive using rclone

I'm a Google One user and my Google Drive holds about 1TB of content from various sources. A few years ago I used a third-party utility to sync my Linux boxes to Drive, and it created even more duplicate files; I ended up with 3-4 different versions of some directories, each with a different set of files. True procrastinator that I am, I postponed the problem until Google One alerted me that I was running out of quota.

These days I'm using rclone. It has been my primary way of using Drive on Linux since I ditched that half-baked sync application. I discovered that rclone can read the server-side MD5 hashes that Drive stores for files, so I decided to write a script to delete the duplicates.
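As a quick sanity check, you can ask rclone which hash types it knows about; running hashsum without arguments prints the supported list (whether a given remote actually stores hashes server-side depends on the backend, but Google Drive does keep MD5s):

$ rclone hashsum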

First, I get the MD5 hashes of all files in Drive:

$ rclone md5sum drive:/ > $HOME/Google-Drive-md5-$(date +%F).txt

This may take a while depending on the number of files, but it finished more quickly than I expected.
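For a rough sense of how big the list is, a line count tells you how many files were hashed:

$ wc -l $HOME/Google-Drive-md5-$(date +%F).txt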

The file's contents look like this:

39044094333de4a47d7478227cfa22a9  Facebin/vgg-clean/clea_duvall/00000501.jpg
398d44f0b32d1b4855e838754c2c49fc  Facebin/vgg-clean/Bob_Barker/00000445.jpg
c2090f9412b93d71fed884bb45b26518  Facebin/vgg-clean/Adam_Goldberg/00000432.jpg
cd6971f809a534ab02ee1b03eb6c1183  CHECK Uploads/Google Photos/2017/06/IMG_1461.JPG
1fb80cc9ef622410557cb42c4abf26a8  Facebin/vgg-clean/Angell_Conwell/00000510.jpg
160793fd8f24fcf27efb9a2b3698a9c8  Facebin/makimface-artifacts/dataset-images/user-test-v4/00056--/img-5c1fdc5d4fd82c1e12a7d49937d7f47b.png
bcddf6ebbc1eed83950d908c2ccbac4a  Facebin/vgg-clean/danny_pino/00000102.jpg
eb2746a9559e7a93fd0002f4bfd90517  Facebin/makimface-artifacts/dataset-images/user-test/00009--/ds-767cd16bfdf4e31860d597708a586979.png
9605d69aaf5fb83d309f10f9e2544630  Facebin/vgg-clean/Adam_Beach/00000953.jpg
68e6d3ec3cca8a728a78fb60c9a1ddba  Facebin/vgg-clean/corey_stoll/00000137.jpg

The first 32 characters of each line are the MD5 hash of the file; the rest of the line, after two separating spaces, is the file's path.

You'll probably see lines with a blank MD5 field. It's important to remove these from the list: they are your Google Docs, Sheets, etc., which have no MD5 checksum.

$ grep -E '^[0-9a-f]{32}' $HOME/Google-Drive-md5-$(date +%F).txt > $HOME/Google-Drive-md5-cleaned.txt
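If you're curious which files were dropped here, inverting the match lists the entries without a hash:

$ grep -vE '^[0-9a-f]{32}' $HOME/Google-Drive-md5-$(date +%F).txt | head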

Then we sort the file, so that identical hashes land on consecutive lines:

$ sort $HOME/Google-Drive-md5-cleaned.txt > $HOME/Google-Drive-md5-sorted.txt
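Before deleting anything, you can count how many hashes actually occur more than once; since the file is now sorted, uniq -d does the job:

$ cut -c -32 $HOME/Google-Drive-md5-sorted.txt | uniq -d | wc -l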

We keep all the intermediate files because you may want to look at the differences after the cleanup.

Then we write the following script:

#!/bin/zsh

# Walk the sorted hash list; whenever two consecutive lines share an MD5,
# the file on the second line is a duplicate and gets deleted.
MD5_FILE="$HOME/Google-Drive-md5-sorted.txt"

PREV_MD5=""
PREV_PATH=""

# IFS= and read -r keep leading whitespace and backslashes in paths intact.
while IFS= read -r current_line ; do
    # Columns 1-32 are the hash; the path starts at column 35,
    # after the two separating spaces.
    CURRENT_MD5=$(echo "$current_line" | cut -c -32)
    CURRENT_PATH=$(echo "$current_line" | cut -c 35-)
    if [[ "$CURRENT_MD5" == "$PREV_MD5" ]] ; then
        echo "EQUAL: $CURRENT_MD5 (duplicate of $PREV_PATH)"
        echo "DELETE: drive:/$CURRENT_PATH"
        rclone -v delete "drive:/$CURRENT_PATH"
    else
        # New hash: remember it and keep this first copy.
        PREV_MD5=$CURRENT_MD5
        PREV_PATH=$CURRENT_PATH
    fi
done < "$MD5_FILE"

You can save this script as google-drive-delete-duplicates.sh, then make it executable and run it:

$ chmod +x google-drive-delete-duplicates.sh 
$ ./google-drive-delete-duplicates.sh
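If you'd rather see what would be deleted before committing to it, rclone has a global --dry-run flag that logs the operations without performing them; temporarily change the delete line in the script to:

    rclone -v --dry-run delete "drive:/$CURRENT_PATH"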

The script walks the list line by line, and whenever two consecutive lines carry the same MD5, the second file is deleted. Because the list is sorted, all copies of a file sit on consecutive lines, so only the first copy survives even when there are more than two.
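To review exactly what was removed, you can regenerate the list after the cleanup and diff it against the sorted one (the -after filename below is just an example name):

$ rclone md5sum drive:/ | grep -E '^[0-9a-f]{32}' | sort > $HOME/Google-Drive-md5-after.txt
$ diff $HOME/Google-Drive-md5-sorted.txt $HOME/Google-Drive-md5-after.txt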

I reclaimed about 200GB by running this script.