I'm trying to update a CSV file with some student figures provided by other sources; however, they've formatted their CSV data slightly differently to ours.
It needs to match students based on three criteria: their name, their class, and the first few letters of the location. For example, the first few students from Class B are from "Dumpt", which is actually Dumpton Park.
When matches are found:
If a student's Scorecard in CSV 2 is 0 or blank, then it shouldn't update the Score column in CSV 1.
If a student's Number in CSV 2 is 0 or blank, then it shouldn't update the No column in CSV 1.
Otherwise, it should import the numbers from CSV 2 to CSV 1.
Below is some example data:
CSV 1
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
CSV 2
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
CSV 1 UPDATED (This is the desired output)
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
I would really appreciate any help with this problem. Thanks Oliver
Here are two solutions: a pandas solution and a plain Python solution. The pandas one unsurprisingly looks a whole lot like the other pandas solutions...
Start by loading in the data:
import pandas
import numpy as np
cdf1 = pandas.read_csv('csv1', dtype=object) # dtype=object preserves the numeric formats
cdf2 = pandas.read_csv('csv2', dtype=object)
col_order = cdf1.columns # pandas will shuffle the column order at some point---this lets us reset to the original column order
At this point the data frames will look like
In [6]: cdf1
Out[6]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32 NaN
1 Class A York Jim x x 10 NaN
2 Class A York Sam x x 32 NaN
3 Class B Dumpton Park Sarah x x NaN NaN
4 Class B Dumpton Park Bob x x NaN NaN
5 Class B Dumpton Park Bill x x NaN NaN
6 Class A Dover Andy x x NaN NaN
7 Class A Dover Hannah x x NaN NaN
8 Class B London Jemma x x NaN NaN
9 Class B London James x x NaN NaN
In [7]: cdf2
Out[7]:
Class Location Student Scorecard Number
0 Class A York Jim 0 742
1 Class A York Sam 0 931
2 Class A York Tom 0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23 983
7 Class A Dover Hannah 1 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
Next, manipulate both data frames into matching formats.
dcol = cdf2.Location
cdf2['Location'] = dcol.apply(lambda x: x[0:4]) #Replacement in cdf2 since we don't need original data
dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4]) #Here we add a column leaving 'Local' because we'll need it for the final output
cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan) #Replacing '0' by np.nan means zeros don't overwrite
cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Now cdf1 and cdf2 look like
In [16]: cdf1
Out[16]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 NaN
Jim York x x 10 NaN
Sam York x x 32 NaN
Class B Dump Sarah Dumpton Park x x NaN NaN
Bob Dumpton Park x x NaN NaN
Bill Dumpton Park x x NaN NaN
Class A Dove Andy Dover x x NaN NaN
Hannah Dover x x NaN NaN
Class B Lond Jemma London x x NaN NaN
James London x x NaN NaN
In [17]: cdf2
Out[17]:
Score No
Class Location Name
Class A York Jim NaN 742
Sam NaN 931
Tom NaN 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23 983
Hannah 1 293
Class B Lond Jemma 32.2 NaN
James 32.0 NaN
Updating the data in cdf1 with the data in cdf2
cdf1.update(cdf2, overwrite=False)
results in
In [19]: cdf1
Out[19]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 653
Jim York x x 10 742
Sam York x x 32 931
Class B Dump Sarah Dumpton Park x x 23.5 12
Bob Dumpton Park x x 23.1 299
Bill Dumpton Park x x 23.4 198
Class A Dove Andy Dover x x 23 983
Hannah Dover x x 1 293
Class B Lond Jemma London x x 32.2 NaN
James London x x 32.0 NaN
Finally, return cdf1 to its original form and write it to a csv file.
cdf1 = cdf1.reset_index() #These two steps allow us to remove the 'Location' column
del cdf1['Location']
cdf1 = cdf1[col_order] #This will switch Local and Name back to their original order
cdf1.to_csv('temp.csv',index = False)
Two notes. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) etc., I'd strongly recommend adding some checksumming to make sure that when shifting from Location to the first few letters of Location, you aren't accidentally eliminating a location. Second, I sincerely hope there is a typo on line 4 of your desired output.
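For example, a quick stand-alone check (plain Python, independent of the DataFrames) that the 4-letter truncation keeps every location distinct:

```python
# Sanity check: truncating every known location to 4 characters
# must not collapse two different towns into one key.
towns = ['York', 'Dumpton Park', 'Dover', 'London']
prefixes = {t[:4] for t in towns}
assert len(prefixes) == len(towns), "two locations share a 4-letter prefix!"
```

If that assertion ever fires, matching on the prefix will silently merge two towns, so it's worth running whenever a new location appears in the data.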
On to the plain Python solution. In the following, adjust the filenames as needed.
#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')
#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()
#Read csv1 into a dictionary keyed by (Class, first four letters of Local, Name), keeping a list of keys to preserve the line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.rstrip('\n').split(',') #strip the newline so updated and untouched rows write out the same way
    this_key = (s[0], s[1][0:4], s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)
#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.rstrip('\n').replace('"', '').split(',')
    this_key = (s[0], s[1][0:4], s[2])
    if this_key in line_dict: #Lowers the crash rate...
        #Check if we need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if we need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print("Line not in csv1: %s" % line)
#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
    csvout.write('\n')
#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Hopefully this code is a bit more readable. ;) The backport for Python's new Enum type is available on PyPI as enum34.
from enum import Enum # see PyPI for the backport (enum34)

class Field(Enum):
    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1
    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input) # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input) # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
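One design note worth spelling out: score = -2 and number = -1 are negative values, and __index__ lets a Field member index a list directly, counting from the end. That's why the same Enum addresses the last two columns of both files even though the rows have different lengths. A minimal sketch of just that trick (the two example rows are hypothetical, not read from the files):

```python
from enum import Enum

class Field(Enum):
    score = -2
    number = -1
    def __index__(self):
        # allows a Field member to be used directly as a list index
        return self._value_

ours = ['Class A', 'York', 'Tom', 'x', 'x', '32', '']  # 7 columns
theirs = ['Class A', 'York', 'Tom', '0', '653']        # 5 columns
assert ours[Field.score] == '32'      # negative index counts from the end
assert theirs[Field.number] == '653'  # same Enum works for the shorter row
```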
You could use fuzzywuzzy to do the matching of town names, and append as a column to df2:
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)
towns = df1.Local.unique() # assuming this is complete list of towns
from fuzzywuzzy.fuzz import partial_ratio
In [11]: df2['Local'] = df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))
In [12]: df2
Out[12]:
Class Location Student Scorecard Number Local
0 Class A York Jim 0.0 742 York
1 Class A York Sam 0.0 931 York
2 Class A York Tom 0.0 653 York
3 Class B Dumpt Bob 23.1 299 Dumpton Park
4 Class B Dumpt Bill 23.4 198 Dumpton Park
5 Class B Dumpt Sarah 23.5 12 Dumpton Park
6 Class A Dover Andy 23.0 983 Dover
7 Class A Dover Hannah 1.0 293 Dover
8 Class B Lond Jemma 32.2 0 London
9 Class B Lond James 32.0 0 London
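If you'd rather not add a dependency, here is a rough stand-in for the same idea using only the standard library's difflib (a sketch of prefix matching, not fuzzywuzzy's partial_ratio, which is more robust for messy input):

```python
import difflib

towns = ['York', 'Dumpton Park', 'Dover', 'London']

def closest_town(short_location):
    # compare the abbreviation against each town's prefix of the same length
    return max(towns, key=lambda t: difflib.SequenceMatcher(
        None, short_location, t[:len(short_location)]).ratio())

print(closest_town('Dumpt'))  # Dumpton Park
print(closest_town('Lond'))   # London
```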
Make the column names consistent (at the moment Student and Name refer to the same column):
In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)
Now you can merge (on the overlapping columns):
In [14]: res = df1.merge(df2, how='outer')
In [15]: res
Out[15]:
Class Local Name DPE JJK Score No Location Scorecard Number
0 Class A York Tom x x 32 NaN York 0.0 653
1 Class A York Jim x x 10 NaN York 0.0 742
2 Class A York Sam x x 32 NaN York 0.0 931
3 Class B Dumpton Park Sarah x x NaN NaN Dumpt 23.5 12
4 Class B Dumpton Park Bob x x NaN NaN Dumpt 23.1 299
5 Class B Dumpton Park Bill x x NaN NaN Dumpt 23.4 198
6 Class A Dover Andy x x NaN NaN Dover 23.0 983
7 Class A Dover Hannah x x NaN NaN Dover 1.0 293
8 Class B London Jemma x x NaN NaN Lond 32.2 0
9 Class B London James x x NaN NaN Lond 32.0 0
One bit to clean up is the Score; I think I would take the max of the two:
In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)
In [17]: del res['Scorecard']
del res['No']
del res['Location']
Then you're left with the columns you want:
In [18]: res
Out[18]:
Class Local Name DPE JJK Score Number
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 931
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 0
9 Class B London James x x 32.0 0
In [18]: res.to_csv('foo.csv')
Note: to force the dtype to object (and have mixed dtypes, ints and floats, rather than all floats) you can use an apply. I would recommend against this if you're doing any analysis!
res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
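That lambda's whole-number trick in isolation (a sketch with plain floats, outside pandas):

```python
def tidy(x):
    # show whole-valued floats as ints, keep genuine decimals as floats
    return int(x) if int(x) == x else x

print(tidy(32.0))  # 32
print(tidy(23.5))  # 23.5
```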
Python dictionaries are the way to go here. Note that the key has to use only the first four letters of the location, otherwise 'Dumpton Park' in CSV 1 will never match 'Dumpt' in CSV 2:
studentDict = {}
keyOrder = []
with open(<csv1>, 'r') as f:
    header = f.readline() # keep the header for the output file
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        key = (LL[0], LL[1][:4], LL[2]) # truncate Local so both files produce the same key
        studentDict[key] = LL
        keyOrder.append(key) # remember the original row order
with open(<csv2>, 'r') as f:
    f.readline() # skip the header
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        key = (LL[0], LL[1][:4], LL[2])
        if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
        if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]
with open(<outFile>, 'w') as f:
    f.write(header)
    for key in keyOrder:
        f.write(','.join(studentDict[key]) + '\n')
pandas makes this sort of task a bit more convenient.
EDIT: Okay, since you can't rely on renaming the columns manually, Roman's suggestion to just match on the first few letters is a good one. We have to change a couple of things before that though.
In [62]: df1 = pd.read_clipboard(sep=',')
In [63]: df2 = pd.read_clipboard(sep=',')
In [68]: df1
Out[68]:
Class Location Student Scorecard Number
0 Class A York Jim 0.0 742
1 Class A York Sam 0.0 931
2 Class A York Tom 0.0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23.0 983
7 Class A Dover Hannah 1.0 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
In [69]: df2
Out[69]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 653
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 NaN
9 Class B London James x x 32.0 NaN
Get the columns named the same.
In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
Now for the locations. Save the originals in df2 to a separate Series.
In [71]: locations = df2['Local']
In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)
In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)
Use the string methods to truncate each location to its first 4 characters (assuming this won't cause any false matches).
Now set the indices:
In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])
In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])
In [80]: df1
Out[80]:
Score No
Class Local Name
Class A York Jim 0.0 742
Sam 0.0 931
Tom 0.0 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23.0 983
Hannah 1.0 293
Class B Lond Jemma 32.2 0
James 32.0 0
In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)
Finally, update the scores as before:
In [85]: df1.update(df2, overwrite=False)
You can get the original locations back by doing:
In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations
And you can write the output to csv (and a bunch of other formats) with df1.to_csv('path/to/csv').
You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I used the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)
If you use the csv module, you don't have to worry too much about the quotes, and it also lets you read the rows directly into dictionaries with the column names as keys.
import csv
# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', 'rb') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                                       'DPE', 'JJK', 'Score', 'No'])
    next(table1) #skip header row
    first_table = {}
    original_order = [] #list of keys to save the original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)
# Repeat for the second csv, but don't worry about order
with open('csv2.csv', 'rb') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                                       'Student', 'Scorecard', 'Number'])
    next(table2)
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row
with open('student_data.csv', 'wb') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                                         'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data from the first csv with data from the second when the conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] not in ("0", ""):
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] not in ("0", ""):
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
Hope this helps.
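As a quick illustration of the quoting point above: the csv reader strips the double quotes for you, which the split(',')-based answers had to handle with replace (io.StringIO stands in for a real file here):

```python
import csv
import io

sample = '"Class B","Dumpt","Bob","23.1","299"\n'
row = next(csv.reader(io.StringIO(sample)))
print(row)  # ['Class B', 'Dumpt', 'Bob', '23.1', '299']
```

It also handles the harder cases, like commas inside quoted fields, that a plain split would get wrong.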