Find similar file names, the Python way

One of the challenges of keeping a hard drive clean is first removing any duplicate files we download, often under the misguided thought of “I’ll read it later.” There are a number of tools for doing this already. Afterwards, we are left with files that have similar, but not identical, names and no common attribute, such as file size or checksum.

As a programming exercise, I created a Python script to find files with similar, but different, names using difflib’s get_close_matches function.
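For readers who have not used it, here is a minimal sketch of what get_close_matches does on its own. The file names are invented for illustration, and the cutoff of 0.8 is just a value chosen for this example.

```python
import difflib

# Hypothetical file names, invented for this example.
names = ["report_2021.pdf", "report_2021 (1).pdf", "holiday.jpg", "notes.txt"]

# Return up to 5 candidates whose similarity ratio to the target is at
# least 0.8, best match first. The target matches itself with ratio 1.0.
matches = difflib.get_close_matches("report_2021.pdf", names, n=5, cutoff=0.8)
print(matches)
```

Note that the target itself comes back in the result when it is also in the candidate list, which is why the script has to filter it out.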


Start by creating a function that calls get_close_matches using threads.

import concurrent.futures
import difflib

def find_most_similar_phrases_threaded(phrase_list, cutoff=0.9):
    matched_list = {}
    # Lowercase before de-duplicating so names that differ only by case collapse.
    phrase_list = sorted({p.lower() for p in phrase_list})
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        # Compare each phrase only with itself and later phrases, so a pair is reported once.
        futures = {
            executor.submit(difflib.get_close_matches, phrase, phrase_list[i:], n=100, cutoff=cutoff): phrase
            for i, phrase in enumerate(phrase_list)
        }
        for future in concurrent.futures.as_completed(futures):
            item = futures[future]
            # Drop the phrase itself, which always matches with a ratio of 1.0.
            result = [match for match in future.result() if match != item]
            if result:
                matched_list[item] = result
    return matched_list
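To make the role of the phrase_list[i:] slice easier to see, here is a single-threaded sketch of the same pairing idea. The name find_similar_sequential and the sample file names are my own, made up for illustration. It slices from i + 1 instead of i, so the name never gets compared with itself. Since difflib is pure Python, threads are unlikely to speed this up on a standard CPython build with the GIL, so a plain loop is a reasonable baseline to compare against.

```python
import difflib

def find_similar_sequential(names, cutoff=0.9):
    """Single-threaded sketch of the pairing logic (hypothetical helper)."""
    names = sorted({n.lower() for n in names})
    matched = {}
    for i, name in enumerate(names):
        # Look only at names later in the sorted list, so each similar
        # pair is reported once rather than in both directions.
        later = names[i + 1:]
        close = difflib.get_close_matches(name, later, n=100, cutoff=cutoff)
        if close:
            matched[name] = close
    return matched

# Invented sample: two names differ only by case and collapse to one.
sample = ["Invoice_March.pdf", "invoice_march(1).pdf",
          "invoice_march.PDF", "budget.xlsx"]
similar = find_similar_sequential(sample)
print(similar)
```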

Next, walk the folder and collect the file names to compare.

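import os

if __name__ == '__main__':
    input_folder = '/data/'
    # Collect bare file names (no directory part) from the whole tree.
    files = [x for (root, dirs, file) in os.walk(input_folder) for x in file]
    # Lowercase before de-duplicating so 'A.txt' and 'a.txt' collapse to one entry.
    files = list({f.lower() for f in files})
    for name, similar in sorted(find_most_similar_phrases_threaded(files).items()):
        print(name, '->', similar)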
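The listing above keeps only bare file names, so a match does not tell you where the files live. One possible way around that, sketched here under my own assumptions, is to build an index from each lowercased name to its full paths. The helper index_paths_by_name is hypothetical, and the demo runs on a throwaway temporary directory rather than /data/.

```python
import os
import tempfile
from collections import defaultdict

def index_paths_by_name(folder):
    """Map each lowercased file name to every full path that carries it."""
    paths_by_name = defaultdict(list)
    for root, _dirs, file_names in os.walk(folder):
        for file_name in file_names:
            full_path = os.path.join(root, file_name)
            paths_by_name[file_name.lower()].append(full_path)
    return paths_by_name

# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for sub, file_name in (("a", "Report.pdf"), ("b", "report (1).pdf")):
        os.makedirs(os.path.join(tmp, sub))
        # Create an empty file with the invented name.
        open(os.path.join(tmp, sub, file_name), "w").close()
    index = index_paths_by_name(tmp)
    for name in sorted(index):
        print(name, index[name])
```

The keys of the index can be passed to the matching function, and each reported name can then be looked up to find the files on disk.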

Comments

One response to “Find similar file names, the Python way”

  1. Tim

    It works.