match names in csv file to filename in folder


I have a list of about 7000 names in a csv file, arranged by surname, name, date of birth, etc. I also have a folder of 7000+ scanned documents (enrolment forms), each of which has the person's name as its filename.
The filenames may not exactly match the names in the csv, e.g. John Doe in the csv may have the filename John-Michael Doe.
How would I go about writing a program that looks through the csv and finds which filenames are missing from the scanned folder?
I am a complete novice in programming and any advice is appreciated.

The first thing you want to do is read the CSV into memory. You can do this with the csv module. The most useful tool there is csv.DictReader, which takes the first line of the file as keys in a dictionary, and reads the remainder:
import csv
with open('/path/to/yourfile.csv', 'r') as f:
    rows = list(csv.DictReader(f))
from pprint import pprint
pprint(rows[:100])
On Windows, the path would look different, something like c:/some folder/some other folder/ (note the forward slashes instead of backslashes).
The code above shows the first 100 rows from the file. For example, if you have columns named "First Name", "Last Name" and "Date of Birth", the output will look like:
[{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'}
...]
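If you want to try csv.DictReader before pointing it at the real file, here is a minimal sketch that reads from an in-memory string instead. The sample data and the read_rows helper are made up for illustration; the column names are the assumed ones from the example above.

```python
import csv
import io

# Hypothetical sample data standing in for the real csv file.
sample_csv = '''First Name,Last Name,Date of Birth
John,Doe,"Jan 1, 1970"
Jane,Roe,"Feb 2, 1980"
'''

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(sample_csv)
for row in rows:
    # Each row is addressed by column name rather than by position.
    print(row['First Name'], row['Last Name'], row['Date of Birth'])
```

Note that the quoted date keeps its comma, which is one reason to prefer the csv module over splitting lines by hand.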
Next you want to get a list of all the 7000 files, using os.listdir:
import os
images_directory = '/path/to/images/'
image_paths = [
    os.path.join(images_directory, filename)
    for filename in os.listdir(images_directory)]
Now you'll need some way to extract the names from the files. This depends crucially on the way the files are structured. The tricky-to-use but very powerful tool for this task is called a regular expression, but probably something simple will suffice. For example, if the files are named like "first-name last-name.pdf", you could write a simple parsing method like:
def parse_filename(filename):
    # Split on the last dot only, so a dot inside the name does not break the unpacking.
    name, extension = filename.rsplit('.', 1)
    first_name, last_name = name.split(' ')
    return first_name.replace('-', ' '), last_name.replace('-', ' ')
The exact implementation will depend on how the files are named, but the key things to get you started are str.split, str.strip and a few others in that same class. You might also take a look at the re module for handling regular expressions. As I said, that's a more advanced/powerful technique, so it may not be worth worrying about right now.
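For the "John-Michael Doe" case from the question, a slightly more forgiving parser might look like the sketch below. It is only one possible heuristic: parse_filename_loose is a hypothetical name, and it assumes the first word is the first name and the last word is the surname, which would mangle hyphenated surnames.

```python
import os
import re

def parse_filename_loose(filename):
    """Return (first_name, last_name) in lowercase, or None if fewer than two words."""
    stem, _extension = os.path.splitext(filename)
    # Treat hyphens, underscores and whitespace as word separators.
    words = [w for w in re.split(r'[-_\s]+', stem.lower()) if w]
    if len(words) < 2:
        return None
    # Assumption: first word is the first name, last word is the surname.
    return words[0], words[-1]

print(parse_filename_loose('John-Michael Doe.pdf'))
```

Returning None instead of raising makes it easy to collect the files that could not be parsed and look at them by hand.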
A simple algorithm to do the matching would be something like the following:
filenames = os.listdir(images_directory)
# Map each parsed (first name, last name) pair to its original filename.
name_to_filename = {parse_filename(filename.lower()): filename for filename in filenames}
matched_rows = []
for row in rows:
    name_key = (row['First Name'].lower(), row['Last Name'].lower())
    # Look up the file for this name; .get() returns None if there is no match.
    matching_file = name_to_filename.get(name_key)
    new_row = row.copy()
    if matching_file:
        new_row['File'] = matching_file
        print('Matched "%s" to %s' % (' '.join(name_key), matching_file))
    else:
        new_row['File'] = ''
        print('No match for "%s"' % (' '.join(name_key)))
    matched_rows.append(new_row)
with open('/path/to/output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, ['First Name', 'Last Name', 'Date of Birth', 'File'])
    writer.writeheader()
    writer.writerows(matched_rows)
This should give you an output spreadsheet in which every row that could be matched automatically has its filename filled in, and the remaining rows have a blank File column. Depending on how clean your data is, you might be able to match the remaining few entries by hand. There are only 7000, and the "dumb" heuristic will probably catch most of them. If you need more advanced heuristics, you might look at Jaccard similarity of the "words" in the name, and the difflib module for approximate string matching.
Of course most of this code won't quite work on your problem, but hopefully it's enough to get you started.
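As a rough illustration of the difflib suggestion, here is a minimal sketch that picks the most similar filename for a name that had no exact match. The closest_filename helper, the sample filenames and the 0.6 cutoff are assumptions for the example; you would need to tune the cutoff against your own data.

```python
import difflib

def closest_filename(name, filenames, cutoff=0.6):
    """Return the filename whose stem is most similar to name, or None."""
    # Compare lowercase stems so case and extensions do not affect the score.
    stems = {fn.rsplit('.', 1)[0].lower(): fn for fn in filenames}
    matches = difflib.get_close_matches(name.lower(), list(stems), n=1, cutoff=cutoff)
    return stems[matches[0]] if matches else None

# Hypothetical filenames for illustration.
files = ['John-Michael Doe.pdf', 'Jane Roe.pdf']
print(closest_filename('John Doe', files))
print(closest_filename('Zzz Qqq', files))
```

Approximate matches can be wrong, so it is probably worth printing them for review rather than accepting them blindly.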

Related

How to remove the duplicate dictionaries from the list of dictionaries where dictionary contains an another dictionary? [duplicate]

I have a somewhat complex list of dictionaries that looks like this:
[
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}, {'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}]},
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}, {'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}]}
]
The expected result is:
[
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}, {'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}]},
]
I want to remove the duplicate dictionaries in this list but I don't know how to deal with User Info because the order of the items might be different. A duplicate case would be where all the dictionary items are exactly the same and in the case of User Info order doesn't matter.

I think the best way is to compute a hash of User Info by summing the hash values of its elements (a sum is unaffected by a change in position).
def deepHash(value):
    if type(value) == list:
        return sum([deepHash(x) for x in value])
    if type(value) == dict:
        return sum([deepHash(x) * deepHash(y) for x, y in value.items()])
    return hash(str(value))
You can then simply compare the hashes of your inputs:
assert deepHash({"a": [1,2,3], "c": "d"}) == deepHash({"c": "d", "a": [3,2,1]})
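Equal hashes only suggest that two records are the same, since unrelated values can collide. A stricter variant, sketched below, builds an order-insensitive canonical string for each record and keeps the first record seen for each one. The names canonical and dedupe are hypothetical, and the sketch assumes that every list (not just User Info) should be compared without regard to order.

```python
def canonical(value):
    """Build a string that is identical for values that differ only in ordering."""
    if isinstance(value, dict):
        items = sorted(repr(str(k)) + ":" + canonical(v) for k, v in value.items())
        return "{" + ",".join(items) + "}"
    if isinstance(value, list):
        # Assumption: list order never matters, so sort the canonical forms.
        return "[" + ",".join(sorted(canonical(x) for x in value)) + "]"
    return repr(value)

def dedupe(records):
    """Keep the first occurrence of each record, comparing by canonical form."""
    seen = set()
    unique = []
    for record in records:
        key = canonical(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

print(len(dedupe([{"a": [1, 2, 3], "c": "d"}, {"c": "d", "a": [3, 2, 1]}])))
```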

Is there a way to extract the selected value in a nested Dictionary using a for loop?

Using this dictionary, is there a way I can only extract the Name, Last Name, and Age of the boys?
myDict = {'boy1': {'Name': 'JM', 'Last Name':'Delgado', 'Middle Name':'Goneza', 'Age':'21',
'Birthday':'8/22/2001', 'Gender':'Male'},
'boy2': {'Name': 'Ralph', 'Last Name':'Tubongbanua', 'Middle Name':'Castro',
'Age':'21', 'Birthday':'9/5/2001', 'Gender':'Male'},}
for required in myDict.values():
    print (required ['Name', 'Last Name', 'Age'])
The output is:
JM
Ralph
What I have in mind is
JM Delgado 21
Ralph Tubongbanua 21

You have to extract the keys one by one:
myDict = {'boy1': {'Name': 'JM', 'Last Name':'Delgado', 'Middle Name':'Goneza', 'Age':'21',
'Birthday':'8/22/2001', 'Gender':'Male'},
'boy2': {'Name': 'Ralph', 'Last Name':'Tubongbanua', 'Middle Name':'Castro',
'Age':'21', 'Birthday':'9/5/2001', 'Gender':'Male'},}
for required in myDict.values():
    print(required['Name'], required['Last Name'], required['Age'])

This could be a solution:
myDict = {'boy1': {'Name': 'JM', 'Last Name':'Delgado', 'Middle Name':'Goneza', 'Age':'21', 'Birthday':'8/22/2001', 'Gender':'Male'},
'boy2': {'Name': 'Ralph', 'Last Name':'Tubongbanua', 'Middle Name':'Castro',
'Age':'21', 'Birthday':'9/5/2001', 'Gender':'Male'},}
for required in myDict.values():
    print(required['Name'], required['Last Name'], required['Age'])
When you pass several values to print separated by commas, a space is automatically inserted between them.
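If the set of fields is likely to change, one option is to keep the wanted keys in a list and look each one up in turn. This is just a sketch; wanted and select_fields are made-up names, and the data is the dictionary from the question.

```python
my_dict = {
    'boy1': {'Name': 'JM', 'Last Name': 'Delgado', 'Middle Name': 'Goneza',
             'Age': '21', 'Birthday': '8/22/2001', 'Gender': 'Male'},
    'boy2': {'Name': 'Ralph', 'Last Name': 'Tubongbanua', 'Middle Name': 'Castro',
             'Age': '21', 'Birthday': '9/5/2001', 'Gender': 'Male'},
}
wanted = ['Name', 'Last Name', 'Age']

def select_fields(record, fields):
    """Return the values of the given fields, in the order listed."""
    return [record[field] for field in fields]

for boy in my_dict.values():
    # The * unpacks the list so print receives the values as separate arguments.
    print(*select_fields(boy, wanted))
```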

Convert lines of string to a dictionary

I have an initial code like this:
record = "Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001
..."
I need to make it into a dictionary and its output should be something like this:
{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'}
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'}
...
Each line should have an incrementing key starting from 1.
I currently have no idea how to approach this, as I am still a student with no prior experience.
I have seen people use this for strings containing key-value pairs but my string does not contain those:
mydict = dict((k.strip(), v.strip()) for k,v in
(item.split('-') for item in record.split(',')))

Use split:
In [220]: ans = []
In [221]: record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
In [223]: l = record.split(';')
In [227]: for i in l:
     ...:     l1 = i.split(',')
     ...:     d = {'First Name': l1[0], 'Last Name': l1[1], 'Birthday': l1[2]}
     ...:     ans.append(d)
     ...:
In [228]: ans
Out[228]:
[{'First Name': 'Jane', 'Last Name': 'Doe', 'Birthday': '25/02/2002'},
{'First Name': 'James', 'Last Name': 'Poe', 'Birthday': '19/03/1998'},
{'First Name': 'Max', 'Last Name': 'Soe', 'Birthday': '16/12/2001'}]

To make the required dictionary for a single line, you can use split to chop up the line where there are commas (','), to get the values for the dictionary, and hard-code the keys. E.g.
line = "Jane,Doe,25/02/2002"
values = line.split(",")
d = {"First Name": values[0], "Last Name": values[1], "Birthday": values[2]}
Now to repeat that for each line in the record, a list of all the lines is needed. Again, you can use split in this case to chop up the input where there are semicolons (';'). E.g.
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
lines = record.split(";")
Now you can iterate the solution for one line over this lines list, collecting the results into another list.
results = []
for line in lines:
    values = line.split(",")
    results.append({"First Name": values[0], "Last Name": values[1], "Birthday": values[2]})
The incremental key requirement you mention seems strange, because you could just keep them in a list, where the index in the list is effectively the key. But of course, if you really need the indexed-dictionary thing, you can use a dictionary comprehension to do that.
results = {i + 1: results[i] for i in range(len(results))}
Finally, the whole thing might be made more concise (and nicer IMO) by using a combination of list and dictionary comprehensions, as well as a list of your expected keys.
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
keys = ["First Name", "Last Name", "Birthday"]
results = [dict(zip(keys, line.split(","))) for line in record.split(";")]
With the optional indexed-dictionary thingy:
results = {i + 1: results[i] for i in range(len(results))}
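The numbering step can also be written with enumerate, whose start argument sets the first key. The sketch below wraps the whole thing in a hypothetical parse_record function and additionally strips whitespace and skips empty chunks, on the assumption that the real input may contain line breaks after each semicolon as in the question.

```python
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
keys = ["First Name", "Last Name", "Birthday"]

def parse_record(text):
    """Return {1: {...}, 2: {...}, ...} for a ';'-separated record string."""
    # Strip surrounding whitespace and drop empty chunks (e.g. after a trailing ';').
    lines = [line.strip() for line in text.split(";") if line.strip()]
    return {n: dict(zip(keys, line.split(","))) for n, line in enumerate(lines, start=1)}

print(parse_record(record))
```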

This should work for your case:
lines = [line.replace('\n','').replace('.','').strip() for line in record.split(';')]
desired_dict = {}
# start=1 because the question asks for keys that count up from 1.
for i, line in enumerate(lines, start=1):
    words = line.split(',')
    desired_dict[i] = {
        'First name':words[0],
        'Last name':words[1],
        'Birthday':words[2]
    }

The .split() method is useful. First split the strings separated by ; and split each of the new strings by ,.
record = """Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001"""
out = []
for rec in record.split(';'):
    lst = rec.strip().split(',')
    dict_new = {}
    dict_new['First Name'] = lst[0]
    dict_new['Last Name'] = lst[1]
    dict_new['Birthday'] = lst[2]
    out.append(dict_new)
print(out)

The other answers are already quite clear. I just want to add that you can do it in one line (which is much less readable and not recommended, but arguably fancier). It also handles possible surrounding spaces with strip(); you can remove those calls if you don't want them. This gives you the list of dicts you need:
record_dict = [{'First name': val[0].strip(), 'Last name': val[1].strip(), 'Birthday': val[2].strip()} for val in (rec.strip().split(',') for rec in record.strip().split(';'))]

I think you are looking for:
record = """Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001"""
num = 0
out = dict()
for v in record.split(";"):
    v = v.strip().split(",")
    num += 1
    out[num] = {'First name':v[0],'Last name':v[1], 'Birthday':v[2]}
print(out)
prints:
{1: {'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
2: {'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
3: {'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}}

# raw string data
record = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
# list of lists
list_of_lists = [x.split(',') for x in record.split(';')]
# list of dicts
list_of_dicts = []
for x in list_of_lists:
    # assemble into dict
    d = {'First name': x[0],
         'Last name': x[1],
         'Birthday': x[2]}
    # append to list
    list_of_dicts.append(d)
output:
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]

Here is a step by step Pythonic way to achieve that:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = '''Jane,Doe,25/02/2002
... James,Poe,19/03/1998
... Max,Soe,16/12/2001'''
>>> records = records.split()
>>> pprint(records)
['Jane,Doe,25/02/2002',
'James,Poe,19/03/1998',
'Max,Soe,16/12/2001']
>>> records = [_.split(',') for _ in records]
>>> pprint(records)
[['Jane', 'Doe', '25/02/2002'],
['James', 'Poe', '19/03/1998'],
['Max', 'Soe', '16/12/2001']]
>>> records = [dict(zip(columns, _)) for _ in records]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
If you have all records in one line, delimited by a ; character, then you can do this:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
>>> records = records.split(';')
>>> pprint(records)
['Jane,Doe,25/02/2002',
'James,Poe,19/03/1998',
'Max,Soe,16/12/2001']
>>> records = [_.split(',') for _ in records]
>>> pprint(records)
[['Jane', 'Doe', '25/02/2002'],
['James', 'Poe', '19/03/1998'],
['Max', 'Soe', '16/12/2001']]
>>> records = [dict(zip(columns, _)) for _ in records]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
And finally you can put it all together in one line:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
>>> # All tasks in one line now
>>> records = [dict(zip(columns, _)) for _ in [_.split(',') for _ in records.split(';')]]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]

List comprehensions make it easy.
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
list_of_records = [item.split(',') for item in record.split(';')]
dict_of_records = [{'first_name':line[0], 'last_name':line[1], 'Birthday':line[2]} for line in list_of_records]
print(dict_of_records)
Output:
[{'first_name': 'Jane', 'last_name': 'Doe', 'Birthday': '25/02/2002'}, {'first_name': 'James', 'last_name': 'Poe', 'Birthday': '19/03/1998'}, {'first_name': 'Max', 'last_name': 'Soe', 'Birthday': '16/12/2001'}]

You can do it without writing any loops using sub() method of re and json:
import re
import json
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
sub_record = re.sub(r'\b;?([a-zA-Z]+),([a-zA-Z]+),(\d\d/\d\d/\d\d\d\d)',r',{"First name": "\1", "Last name": "\2", "Birthday": "\3"}',record)
mydict = json.loads('['+sub_record[1:]+']')
print(mydict)
Output:
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'}, {'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'}, {'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]

With some regex:
import re
[re.match(r'(?P<First_name>\w+),(?P<Last_name>\w+),(?P<Birthday>.+)', r).groupdict() for r in record.split(';')]
Unfortunately the underscores in First_name and Last_name are unavoidable, because regex group names must be valid Python identifiers.
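If the keys need spaces, one workaround is to rename them after matching. This is a small sketch rather than part of the answer above: parse is a hypothetical helper, and it simply replaces each underscore in a group name with a space and skips chunks that do not match.

```python
import re

record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
pattern = re.compile(r'(?P<First_name>\w+),(?P<Last_name>\w+),(?P<Birthday>.+)')

def parse(text):
    """Return a list of dicts whose keys use spaces instead of underscores."""
    results = []
    for chunk in text.split(';'):
        match = pattern.match(chunk.strip())
        if match is None:
            continue  # skip chunks that do not look like a record
        # Replace the underscore in each group name with a space.
        results.append({name.replace('_', ' '): value
                        for name, value in match.groupdict().items()})
    return results

print(parse(record))
```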

python else part in if runs many times , how to solve

Here is my code for showing the matching records and informing the user if nothing is found.
Problem: the else part runs as many times as the outer loop.
entries = [{'First Name': 'Sher', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '2989484'},
{'First Name': 'Ali', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '398439'},
{'First Name': 'Talha', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '3343434'}]
search = input("type your search: ")
print(search)
for person in entries:
    # print(person)
    if person["Last Name"] == search:
        print("Here are the records found for your search")
        for e in person:
            print(e, ":", person[e])
    else:
        print("There is no record found as you search Keyword")

That's because on each iteration you check only one person, and if that person doesn't match, you print that no record exists.
This is not the behavior you want.
A better solution is to first collect all the records that match and then decide what to print:
...
search = input("type your search: ")
# keep only the records whose last name matches, using a list comprehension
founds = [entry for entry in entries if entry["Last Name"] == search]
if founds:
    for found in founds:
        print("Here are the records found for your search")
        for e in found:
            print(e, ":", found[e])
else:
    print("There is no record found as you search Keyword")

First, check if the Last Name that the user enters is present in the dictionaries. If yes, then loop through them and print the respective records. Else, display no records found. Here is how you do it:
entries = [{'First Name': 'Sher', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '2989484'},
{'First Name': 'Ali', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '398439'},
{'First Name': 'Talha', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '3343434'}]
search = input("type your search: ")
print(search)
if search in [person['Last Name'] for person in entries]:
    for person in entries:
        if person["Last Name"] == search:
            print("Here are the records found for your search")
            for e in person:
                print(e, ":", person[e])
else:
    print("There is no record found as you search Keyword")
Output:
type your search: >? Khan
Khan
Here are the records found for your search
First Name : Sher
Last Name : Khan
Age : 22
Telephone : 2989484
Here are the records found for your search
First Name : Ali
Last Name : Khan
Age : 22
Telephone : 398439
Here are the records found for your search
First Name : Talha
Last Name : Khan
Age : 22
Telephone : 3343434
type your search: >? Jones
Jones
There is no record found as you search Keyword

Try it like this, using a boolean found variable:
entries = [{'First Name': 'Sher', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '2989484'},
{'First Name': 'Ali', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '398439'},
{'First Name': 'Talha', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '3343434'},
{'First Name': 'Talha', 'Last Name': 'Jones', 'Age': '22', 'Telephone': '3343434'}]
search = input("type your search: ")
found = False
print(search)
for person in entries:
    if person["Last Name"] == search:
        found = True
        print("Here are the records found for your search")
        for e in person:
            print(e, ":", person[e])
if not found:
    print("There is no record found as you search Keyword")

This might not be the best way of doing it, but I think it would work:
entries = [{'First Name': 'Sher', 'Last Name': 'Khan', 'Age': '22', 'Telephone':
'2989484'},
{'First Name': 'Ali', 'Last Name': 'Khan', 'Age': '22', 'Telephone': '398439'},
{'First Name': 'Talha', 'Last Name': 'Khan', 'Age': '22', 'Telephone':
'3343434'}]
inEntries = False
search = input("type your search: ")
print(search)
for person in entries:
    # print(person)
    if person["Last Name"] == search:
        inEntries = True
        print("Here are the records found for your search")
        for e in person:
            print(e, ":", person[e])
if not inEntries:
    print("There is no record found as you search Keyword")

Dictionaries within a list

suppose I have the following list of dictionaries:
database = [{'Job title': 'painter', 'Email address': 'xxx#yyy.com', 'Last name': 'Wright', 'First name': 'James', 'Company': 'Swift'},
{'Job title': 'plumber', 'Email address': 'xxx#yyy.com', 'Last name': 'Bright', 'First name': 'James', 'Company': 'ABD Plumbing'},
{'Job title': 'brick layer', 'Email address': 'xxx#yyy.com', 'Last name': 'Smith', 'First name': 'John', 'Company': 'Bricky brick'}]
I'm entering the following code so I can print information about a person given their first name (I will be changing this, to search for last name, company, job title etc, using a variable):
print(next(item for item in database if item['First name'] == 'James'))
The issue arises as I have two First name's which are equal, namely James. How do I adjust the code so that it prints out information about all the James's in the database?

Remove the next() call and use a list comprehension, which collects every match instead of stopping at the first one:
print([item for item in database if item['First name'] == 'James'])
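Since the question mentions searching by other fields through a variable, the same comprehension can be wrapped in a small function. This is only a sketch; find_all is a hypothetical helper, and it uses .get() on the assumption that some records might lack the field being searched.

```python
database = [
    {'Job title': 'painter', 'Email address': 'xxx#yyy.com', 'Last name': 'Wright',
     'First name': 'James', 'Company': 'Swift'},
    {'Job title': 'plumber', 'Email address': 'xxx#yyy.com', 'Last name': 'Bright',
     'First name': 'James', 'Company': 'ABD Plumbing'},
    {'Job title': 'brick layer', 'Email address': 'xxx#yyy.com', 'Last name': 'Smith',
     'First name': 'John', 'Company': 'Bricky brick'},
]

def find_all(records, field, value):
    """Return every record whose field equals value (an empty list if none match)."""
    return [item for item in records if item.get(field) == value]

for person in find_all(database, 'First name', 'James'):
    print(person)
```

An empty result list also gives you a natural way to report that nothing was found.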
