I have a somewhat complex list of dictionaries that looks like this:
[
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}, {'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}]},
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}, {'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}]}
]
The expected output is below:
[
{'Name': 'Something XYZ', 'Address': 'Random Address', 'Customer Number': '-', 'User Info': [{'Registration Number': '17002', 'First Name': 'John', 'Middle Name': '', 'Last Name': 'Denver'}, {'Registration Number': '27417', 'First Name': 'Robert', 'Middle Name': '', 'Last Name': 'Patson'}]},
]
I want to remove the duplicate dictionaries in this list, but I don't know how to deal with User Info because the order of its items might differ. Two entries count as duplicates when all their dictionary items are exactly the same, and for User Info the order doesn't matter.
I think the best way is to make a hash of User Info by summing the hash values of its elements (summing tolerates changes in position).
def deepHash(value):
    # Lists: sum the element hashes, so ordering does not matter
    if isinstance(value, list):
        return sum(deepHash(x) for x in value)
    # Dicts: combine key and value hashes, again order-insensitively
    if isinstance(value, dict):
        return sum(deepHash(k) * deepHash(v) for k, v in value.items())
    # Everything else: hash the string representation
    return hash(str(value))
and you can simply check the hash of your inputs:
assert deepHash({"a": [1,2,3], "c": "d"}) == deepHash({"c": "d", "a": [3,2,1]})
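A minimal sketch of how that hash could be used to drop the duplicates (data here just stands in for your original list; note that summed hashes can collide in principle, so for a stricter check you could also compare the dictionaries themselves when two hashes match):

def dedupe(dicts):
    seen = set()
    unique = []
    for d in dicts:
        h = deepHash(d)  # order-insensitive hash of the whole record
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

result = dedupe(data)  # 'data' is your original list of dictionaries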
Using this dictionary, is there a way I can extract only the Name, Last Name, and Age of the boys?
myDict = {'boy1': {'Name': 'JM', 'Last Name': 'Delgado', 'Middle Name': 'Goneza', 'Age': '21',
                   'Birthday': '8/22/2001', 'Gender': 'Male'},
          'boy2': {'Name': 'Ralph', 'Last Name': 'Tubongbanua', 'Middle Name': 'Castro',
                   'Age': '21', 'Birthday': '9/5/2001', 'Gender': 'Male'}}

for required in myDict.values():
    print(required['Name', 'Last Name', 'Age'])
The output is:
JM
Ralph
What I have in mind is
JM Delgado 21
Ralph Tubongbanua 21
You have to extract the keys one by one:
myDict = {'boy1': {'Name': 'JM', 'Last Name': 'Delgado', 'Middle Name': 'Goneza', 'Age': '21',
                   'Birthday': '8/22/2001', 'Gender': 'Male'},
          'boy2': {'Name': 'Ralph', 'Last Name': 'Tubongbanua', 'Middle Name': 'Castro',
                   'Age': '21', 'Birthday': '9/5/2001', 'Gender': 'Male'}}

for required in myDict.values():
    print(required['Name'], required['Last Name'], required['Age'])
This could be a solution:
myDict = {'boy1': {'Name': 'JM', 'Last Name': 'Delgado', 'Middle Name': 'Goneza', 'Age': '21',
                   'Birthday': '8/22/2001', 'Gender': 'Male'},
          'boy2': {'Name': 'Ralph', 'Last Name': 'Tubongbanua', 'Middle Name': 'Castro',
                   'Age': '21', 'Birthday': '9/5/2001', 'Gender': 'Male'}}

for required in myDict.values():
    print(required['Name'], required['Last Name'], required['Age'])
When you print multiple values separated by commas, print automatically puts a space between them.
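If you want those fields as data rather than just printed, here is a minimal sketch (the key list and the name selected are just illustrative choices, not part of the original question):

keys = ['Name', 'Last Name', 'Age']
selected = [{k: boy[k] for k in keys} for boy in myDict.values()]
# [{'Name': 'JM', 'Last Name': 'Delgado', 'Age': '21'},
#  {'Name': 'Ralph', 'Last Name': 'Tubongbanua', 'Age': '21'}]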
I have an initial string like this:
record = "Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001
..."
I need to make it into a dictionary and its output should be something like this:
{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'}
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'}
...
Each line should have an incrementing key starting from 1.
I currently have no idea how to approach this, as I am still a student with no prior experience.
I have seen people use this for strings containing key-value pairs but my string does not contain those:
mydict = dict((k.strip(), v.strip()) for k, v in
              (item.split('-') for item in record.split(',')))
Use split:
In [220]: ans = []
In [221]: record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
In [223]: l = record.split(';')
In [227]: for i in l:
     ...:     l1 = i.split(',')
     ...:     d = {'First Name': l1[0], 'Last Name': l1[1], 'Birthday': l1[2]}
     ...:     ans.append(d)
     ...:
In [228]: ans
Out[228]:
[{'First Name': 'Jane', 'Last Name': 'Doe', 'Birthday': '25/02/2002'},
{'First Name': 'James', 'Last Name': 'Poe', 'Birthday': '19/03/1998'},
{'First Name': 'Max', 'Last Name': 'Soe', 'Birthday': '16/12/2001'}]
To make the required dictionary for a single line, you can use split to chop up the line where there are commas (','), to get the values for the dictionary, and hard-code the keys. E.g.
line = "Jane,Doe,25/02/2002"
values = line.split(",")
d = {"First Name": values[0], "Last Name": values[1], "Birthday": values[2]}
Now to repeat that for each line in the record, a list of all the lines is needed. Again, you can use split in this case to chop up the input where there are semicolons (';'). E.g.
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
lines = record.split(";")
Now you can iterate the solution for one line over this lines list, collecting the results into another list.
results = []
for line in lines:
    values = line.split(",")
    results.append({"First Name": values[0], "Last Name": values[1], "Birthday": values[2]})
The incremental key requirement you mention seems strange, because you could just keep them in a list, where the index in the list is effectively the key. But of course, if you really need the indexed-dictionary thing, you can use a dictionary comprehension to do that.
results = {i + 1: results[i] for i in range(len(results))}
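An equivalent and arguably more idiomatic way to write that same re-keying is enumerate with a start value (an alternative to the comprehension above, not a second step to run after it):

results = dict(enumerate(results, start=1))  # same effect as the comprehension above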
Finally, the whole thing might be made more concise (and nicer IMO) by using a combination of list and dictionary comprehensions, as well as a list of your expected keys.
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
keys = ["First Name", "Last Name", "Birthday"]
results = [dict(zip(keys, line.split(","))) for line in record.split(";")]
With the optional indexed-dictionary thingy:
results = {i + 1: results[i] for i in range(len(results))}
This should work for your case:
lines = [line.replace('\n', '').replace('.', '').strip() for line in record.split(';')]

desired_dict = {}
for i, line in enumerate(lines, start=1):  # keys start at 1, as requested
    words = line.split(',')
    desired_dict[i] = {
        'First name': words[0],
        'Last name': words[1],
        'Birthday': words[2],
    }
The .split() method is useful here: first split the string on ';', then split each of the resulting pieces on ','.
record = """Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001"""
out = []
for rec in record.split(';'):
    lst = rec.strip().split(',')
    dict_new = {}
    dict_new['First Name'] = lst[0]
    dict_new['Last Name'] = lst[1]
    dict_new['Birthday'] = lst[2]
    out.append(dict_new)
print(out)
The other answers are already quite clear; I just want to add that you can do it in one line (much less readable and not recommended, but arguably fancier). It also takes possible spaces into account with strip(); you can remove those calls if you don't want them. This gives you the list of dicts you need:
record_dict = [{'First name': val[0].strip(), 'Last name': val[1].strip(), 'Birthday': val[2].strip()} for val in (rec.strip().split(',') for rec in record.strip().split(';'))]
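With the sample record from the question, that one-liner produces (wrapped here for readability):

[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
 {'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
 {'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]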
I think you are looking for:
record = """Jane,Doe,25/02/2002;
James,Poe,19/03/1998;
Max,Soe,16/12/2001"""
num = 0
out = dict()
for v in record.split(";"):
    v = v.strip().split(",")
    num += 1
    out[num] = {'First name': v[0], 'Last name': v[1], 'Birthday': v[2]}
print(out)
prints:
{1: {'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
2: {'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
3: {'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}}
# raw string data
record = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
# list of lists
list_of_lists = [x.split(',') for x in record.split(';')]
# list of dicts
list_of_dicts = []
for x in list_of_lists:
    # assemble into dict
    d = {'First name': x[0],
         'Last name': x[1],
         'Birthday': x[2]}
    # append to list
    list_of_dicts.append(d)
Output:
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
Here is a step-by-step, Pythonic way to achieve that:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = '''Jane,Doe,25/02/2002
... James,Poe,19/03/1998
... Max,Soe,16/12/2001'''
>>> records = records.split()
>>> pprint(records)
['Jane,Doe,25/02/2002',
'James,Poe,19/03/1998',
'Max,Soe,16/12/2001']
>>> records = [_.split(',') for _ in records]
>>> pprint(records)
[['Jane', 'Doe', '25/02/2002'],
['James', 'Poe', '19/03/1998'],
['Max', 'Soe', '16/12/2001']]
>>> records = [dict(zip(columns, _)) for _ in records]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
If you have all records in one line, delimited by a ';' character, then you can do this:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
>>> records = records.split(';')
>>> pprint(records)
['Jane,Doe,25/02/2002',
'James,Poe,19/03/1998',
'Max,Soe,16/12/2001']
>>> records = [_.split(',') for _ in records]
>>> pprint(records)
[['Jane', 'Doe', '25/02/2002'],
['James', 'Poe', '19/03/1998'],
['Max', 'Soe', '16/12/2001']]
>>> records = [dict(zip(columns, _)) for _ in records]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
And finally you can put it all together in one line:
>>> from pprint import pprint # just to have a fancy print
>>> columns = ['First name', 'Last name', 'Birthday']
>>> records = 'Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001'
>>> # All tasks in one line now
>>> records = [dict(zip(columns, _)) for _ in [_.split(',') for _ in records.split(';')]]
>>> pprint(records)
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'},
{'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'},
{'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
List comprehensions make it easy:
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
list_of_records = [item.split(',') for item in record.split(';')]
dict_of_records = [{'first_name':line[0], 'last_name':line[1], 'Birthday':line[2]} for line in list_of_records]
print(dict_of_records)
Output:
[{'first_name': 'Jane', 'last_name': 'Doe', 'Birthday': '25/02/2002'}, {'first_name': 'James', 'last_name': 'Poe', 'Birthday': '19/03/1998'}, {'first_name': 'Max', 'last_name': 'Soe', 'Birthday': '16/12/2001'}]
You can do it without writing any explicit loops, using re.sub() and json:
import re
import json
record = "Jane,Doe,25/02/2002;James,Poe,19/03/1998;Max,Soe,16/12/2001"
sub_record = re.sub(r'\b;?([a-zA-Z]+),([a-zA-Z]+),(\d\d/\d\d/\d\d\d\d)',r',{"First name": "\1", "Last name": "\2", "Birthday": "\3"}',record)
mydict = json.loads('['+sub_record[1:]+']')
print(mydict)
Output:
[{'First name': 'Jane', 'Last name': 'Doe', 'Birthday': '25/02/2002'}, {'First name': 'James', 'Last name': 'Poe', 'Birthday': '19/03/1998'}, {'First name': 'Max', 'Last name': 'Soe', 'Birthday': '16/12/2001'}]
With some regex:
import re
[re.match(r'(?P<First_name>\w+),(?P<Last_name>\w+),(?P<Birthday>.+)', r).groupdict() for r in record.split(';')]
The underscores in First_name and Last_name are unavoidable, unfortunately, because named groups in a regex cannot contain spaces.
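If you do want spaces in the keys, here is a small follow-up sketch that renames them afterwards (the rename mapping is just an assumption about the key names you want):

import re

result = [re.match(r'(?P<First_name>\w+),(?P<Last_name>\w+),(?P<Birthday>.+)', r).groupdict()
          for r in record.split(';')]
rename = {'First_name': 'First name', 'Last_name': 'Last name'}
result = [{rename.get(k, k): v for k, v in d.items()} for d in result]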
I have a list of about 7000 names in a CSV file, arranged by surname, name, date of birth, etc. I also have a folder of 7000+ scanned documents (enrolment forms), each with the person's name as the filename.
The filenames may not exactly match the names in the CSV, e.g. John Doe in the CSV might correspond to a file named John-Michael Doe, etc.
How would I go about writing a program that looks through the CSV and sees which filenames are missing from the scanned folder?
I am a complete novice in programming and any advice is appreciated.
The first thing you want to do is read the CSV into memory. You can do this with the csv module. The most useful tool there is csv.DictReader, which takes the first line of the file as keys in a dictionary, and reads the remainder:
import csv

with open('/path/to/yourfile.csv', 'r') as f:
    rows = list(csv.DictReader(f))

from pprint import pprint
pprint(rows[:100])
On Windows, the path would look different, something like c:/some folder/some other folder/ (note the forward slashes instead of backslashes).
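For example, either of these spellings refers to the same (hypothetical) Windows path:

path1 = 'c:/some folder/yourfile.csv'    # forward slashes work fine on Windows
path2 = r'c:\some folder\yourfile.csv'   # raw string avoids backslash escaping issues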
This will show the first 100 rows from the file. For example if you have columns named "First Name", "Last Name", "Date of Birth", this will look like:
[{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'}
...]
Next you want to get a list of all the 7000 files, using os.listdir:
import os

images_directory = '/path/to/images/'
filenames = os.listdir(images_directory)
image_paths = [os.path.join(images_directory, filename) for filename in filenames]
Now you'll need some way to extract the names from the files. This depends crucially on the way the files are structured. The tricky-to-use but very powerful tool for this task is called a regular expression, but probably something simple will suffice. For example, if the files are named like "first-name last-name.pdf", you could write a simple parsing method like:
def parse_filename(filename):
    # Split off the extension, then split the remaining name on the space
    name, extension = filename.split('.')
    first_name, last_name = name.split(' ')
    # Turn hyphens back into spaces, so "John-Michael" becomes "John Michael"
    return first_name.replace('-', ' '), last_name.replace('-', ' ')
The exact implementation will depend on how the files are named, but the key things to get you started are str.split, str.strip and a few others in that same class. You might also take a look at the re module for handling regular expressions. As I said, that's a more advanced/powerful technique, so it may not be worth worrying about right now.
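For instance, on a file named in that "first-name last-name.pdf" style, the sketch above would behave like this (the example filename is made up):

first, last = parse_filename('John-Michael Doe.pdf')
# first == 'John Michael', last == 'Doe'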
A simple algorithm to do the matching would be something like the following:
name_to_filename = {parse_filename(filename.lower()): filename for filename in filenames}

matched_rows = []
unmatched_files = []

for row in rows:
    name_key = (row['First Name'].lower(), row['Last Name'].lower())
    # See if we have a matching file name; .get() returns None otherwise.
    matching_file = name_to_filename.get(name_key)
    new_row = row.copy()
    if matching_file:
        new_row['File'] = matching_file
        print('Matched "%s" to %s' % (' '.join(name_key), matching_file))
    else:
        new_row['File'] = ''
        print('No match for "%s"' % (' '.join(name_key)))
    matched_rows.append(new_row)

with open('/path/to/output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, ['First Name', 'Last Name', 'Date of Birth', 'File'])
    writer.writeheader()
    writer.writerows(matched_rows)
This should give you an output spreadsheet with whatever rows could be matched automatically filled in, and the remaining ones blank. Depending on how clean your data is, you might be able to just match the remaining few entries by hand. There are only 7000, and the "dumb" heuristic will probably catch most of them. If you need more advanced heuristics, you might look at the Jaccard similarity of the "words" in the name, and the difflib module for approximate string matching.
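If you want to try the difflib idea on the leftover rows, here is a minimal sketch reusing the name_to_filename and matched_rows variables from above; the 0.6 cutoff is only a starting guess and not tuned to your data:

import difflib

# Map "first last" strings (parsed from the file names) back to the original filenames
candidates = {'%s %s' % key: fname for key, fname in name_to_filename.items()}

for row in matched_rows:
    if row['File']:
        continue  # already matched exactly
    csv_name = '%s %s' % (row['First Name'].lower(), row['Last Name'].lower())
    close = difflib.get_close_matches(csv_name, list(candidates), n=1, cutoff=0.6)
    if close:
        print('Possible match for "%s": %s' % (csv_name, candidates[close[0]]))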
Of course most of this code won't quite work on your problem, but hopefully it's enough to get you started.