To create a Python duplicate file finder, you can use the os and hashlib modules to traverse a directory tree and generate a hash value for each file. Here's an example of how to create a simple duplicate file finder:
import os
import hashlib

def find_duplicate_files(directory):
    """
    Finds duplicate files in a directory
    """
    file_hash = {}
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            # Read the file and hash its contents
            with open(file_path, 'rb') as file:
                file_content = file.read()
            hash_value = hashlib.md5(file_content).hexdigest()
            # Group file paths by their hash value
            if hash_value in file_hash:
                file_hash[hash_value].append(file_path)
            else:
                file_hash[hash_value] = [file_path]
    # Keep only the hashes shared by more than one file
    return {hash_value: file_list for hash_value, file_list in file_hash.items() if len(file_list) > 1}

# example usage
directory = "/path/to/directory"
duplicate_files = find_duplicate_files(directory)
print(duplicate_files)
In this example, we first import the os and hashlib modules.
We define a function called find_duplicate_files() that takes a directory path as input and returns a dictionary of file hash values and lists of files with matching hash values.
We create an empty dictionary called file_hash to store the hash values and corresponding file paths for each file in the directory.
We use the os.walk() function to traverse the directory tree and generate a hash value for each file using the hashlib.md5() function. We add the hash value and file path to the file_hash dictionary.
If multiple files have the same hash value, we append their file paths to a list in the file_hash dictionary.
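As a side note, the same grouping step can be written a little more compactly with collections.defaultdict. This is only an equivalent sketch of the grouping logic, not part of the example above:

import collections
import hashlib
import os

def group_files_by_hash(directory):
    # defaultdict(list) creates the list automatically on first access,
    # so the explicit if/else check is no longer needed
    groups = collections.defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'rb') as file:
                hash_value = hashlib.md5(file.read()).hexdigest()
            groups[hash_value].append(file_path)
    return groups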
We return a new dictionary that contains only the hash values and file lists with more than one file, indicating that they are duplicates.
We call the find_duplicate_files() function with an example directory path and print the resulting dictionary of duplicate files to the console.
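If you want output that is easier to read than the raw dictionary, you could loop over the result and print each group of duplicates. The directory path here is just a placeholder:

duplicate_files = find_duplicate_files("/path/to/directory")
for hash_value, file_list in duplicate_files.items():
    print(f"Duplicate files with hash {hash_value}:")
    for file_path in file_list:
        print(f"  {file_path}")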
Note that this is just a basic example of how to create a Python duplicate file finder. You can customize the code to suit your specific needs, such as adding additional file metadata or providing more detailed output for each duplicate file. Additionally, keep in mind that this script may take a while to run on large directories, as it generates hash values for every file.
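For example, if the directory contains very large files, one way to reduce memory use is to hash each file in fixed-size chunks instead of reading it all at once. The helper below is a minimal sketch of that idea, not part of the original example, and the chunk size is an arbitrary choice:

import hashlib

def hash_file(file_path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as file:
        # Read the file a chunk at a time so large files
        # never have to fit entirely in memory
        for chunk in iter(lambda: file.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

You could then call hash_file(file_path) inside find_duplicate_files() in place of the line that reads the whole file and hashes it.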