I'm trying to update a CSV file with some student figures provided by other sources; however, they've formatted their CSV data slightly differently to ours.
Students need to be matched on three criteria: their name, their class, and the first few letters of the location. For example, the first few students from Class B are from "Dumpt", which is actually Dumpton Park.
When matches are found:
If a student's Scorecard in CSV 2 is 0 or blank, then it shouldn't update the Score column in CSV 1.
If a student's Number in CSV 2 is 0 or blank, then it shouldn't update the No column in CSV 1.
Otherwise it should import the numbers from CSV 2 into CSV 1.
Below is some example data:
CSV 1
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
CSV 2
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
CSV 1 UPDATED (This is the desired output)
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
I would really appreciate any help with this problem. Thanks Oliver
Here are two solutions: a pandas solution and a plain Python solution. The pandas one unsurprisingly looks a whole lot like the other pandas answers here.
First, load in the data:
import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1', dtype=object)  # dtype=object lets us preserve the numeric formats
cdf2 = pandas.read_csv('csv2', dtype=object)
col_order = cdf1.columns  # pandas will shuffle the column order at some point -- this lets us restore the original order
At this point the data frames will look like
In [6]: cdf1
Out[6]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32 NaN
1 Class A York Jim x x 10 NaN
2 Class A York Sam x x 32 NaN
3 Class B Dumpton Park Sarah x x NaN NaN
4 Class B Dumpton Park Bob x x NaN NaN
5 Class B Dumpton Park Bill x x NaN NaN
6 Class A Dover Andy x x NaN NaN
7 Class A Dover Hannah x x NaN NaN
8 Class B London Jemma x x NaN NaN
9 Class B London James x x NaN NaN
In [7]: cdf2
Out[7]:
Class Location Student Scorecard Number
0 Class A York Jim 0 742
1 Class A York Sam 0 931
2 Class A York Tom 0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23 983
7 Class A Dover Hannah 1 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
Next, manipulate both data frames into matching formats:
dcol = cdf2.Location
cdf2['Location'] = dcol.apply(lambda x: x[0:4]) #Replacement in cdf2 since we don't need original data
dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4]) #Here we add a new column, leaving 'Local' intact because we'll need it for the final output
cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan) #Replacing '0' by np.nan means zeros don't overwrite
cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Now cdf1 and cdf2 look like
In [16]: cdf1
Out[16]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 NaN
Jim York x x 10 NaN
Sam York x x 32 NaN
Class B Dump Sarah Dumpton Park x x NaN NaN
Bob Dumpton Park x x NaN NaN
Bill Dumpton Park x x NaN NaN
Class A Dove Andy Dover x x NaN NaN
Hannah Dover x x NaN NaN
Class B Lond Jemma London x x NaN NaN
James London x x NaN NaN
In [17]: cdf2
Out[17]:
Score No
Class Location Name
Class A York Jim NaN 742
Sam NaN 931
Tom NaN 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23 983
Hannah 1 293
Class B Lond Jemma 32.2 NaN
James 32.0 NaN
Updating the data in cdf1 with the data in cdf2
cdf1.update(cdf2, overwrite=False)
results in
In [19]: cdf1
Out[19]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 653
Jim York x x 10 742
Sam York x x 32 931
Class B Dump Sarah Dumpton Park x x 23.5 12
Bob Dumpton Park x x 23.1 299
Bill Dumpton Park x x 23.4 198
Class A Dove Andy Dover x x 23 983
Hannah Dover x x 1 293
Class B Lond Jemma London x x 32.2 NaN
James London x x 32.0 NaN
Finally, return cdf1 to its original form and write it to a CSV file.
cdf1 = cdf1.reset_index() #These two steps allow us to remove the 'Location' column
del cdf1['Location']
cdf1 = cdf1[col_order] #This will switch Local and Name back to their original order
cdf1.to_csv('temp.csv',index = False)
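One caveat on the cdf2.replace('0', np.nan) step above: it only catches the literal string '0'. If zeros might arrive formatted as '0.0' or '00', a numeric guard (my own suggestion, not part of the original recipe), applied before setting the index, would be safer:
numeric = cdf2[['Score', 'No']].apply(pandas.to_numeric, errors='coerce')
cdf2[['Score', 'No']] = cdf2[['Score', 'No']].mask(numeric == 0)  # blank out any numeric zero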
Two notes: First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) etc., I'd strongly recommend adding some checksumming to make sure that when shifting from Location to the first few letters of Location, you aren't accidentally collapsing two distinct locations into one. Secondly, I sincerely hope there is a typo on line 4 of your desired output.
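For the first note, a minimal sketch of such a check (using the names from the code above) could be:
full_locations = cdf1.Local.value_counts()
short_locations = cdf1.Local.apply(lambda x: x[0:4]).value_counts()
# If truncation merged two distinct locations, there will be fewer unique short names.
assert len(full_locations) == len(short_locations), 'truncation collapsed two locations'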
On to a plain Python solution. In the following, adjust the filenames as needed.
#Open all of the necessary files
csv1 = open('csv1', 'r')
csv2 = open('csv2', 'r')
csvout = open('csv_out', 'w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary keyed by Class, the first four letters of Local, and Name,
#and keep a list of keys to preserve the line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.strip('\n').split(',')  # strip the newline so it doesn't stick to the last field
    this_key = (s[0], s[1][0:4], s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.strip('\n').replace('"', '').split(',')
    this_key = (s[0], s[1][0:4], s[2])
    if this_key in line_dict:  #Lowers the crash rate...
        #Check if we need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if we need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print("Line not in csv1: %s" % line)

#Write the updated line_dict to csvout
for key in line_keys:
    csvout.write(','.join(line_dict[key]) + '\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Hopefully this code is a bit more readable. ;) The backport of Python's new Enum type is available on PyPI as enum34.
from enum import Enum  # see PyPI for the backport (enum34)

class Field(Enum):
    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1
    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
            )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
            )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
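One thing to be aware of: merge_data assumes every key in ours also appears in theirs, and raises a KeyError otherwise. A defensive variant (my adjustment, not part of the original answer) simply skips students with no match:
def merge_data(ours, theirs):
    "their data is only used if present, not blank, and non-zero"
    for key, our_data in ours.items():
        their_data = theirs.get(key)  # None when the student is missing from their file
        if their_data is None:
            continue
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]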
You could use fuzzywuzzy to do the matching of town names, and append as a column to df2:
import pandas as pd
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)
towns = df1.Local.unique()  # assuming this is the complete list of towns
from fuzzywuzzy.fuzz import partial_ratio
In [11]: df2['Local'] = df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))
In [12]: df2
Out[12]:
Class Location Student Scorecard Number Local
0 Class A York Jim 0.0 742 York
1 Class A York Sam 0.0 931 York
2 Class A York Tom 0.0 653 York
3 Class B Dumpt Bob 23.1 299 Dumpton Park
4 Class B Dumpt Bill 23.4 198 Dumpton Park
5 Class B Dumpt Sarah 23.5 12 Dumpton Park
6 Class A Dover Andy 23.0 983 Dover
7 Class A Dover Hannah 1.0 293 Dover
8 Class B Lond Jemma 32.2 0 London
9 Class B Lond James 32.0 0 London
Make the column names consistent (at the moment the same column is called Student in df2 but Name in df1):
In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)
Now you can merge (on the overlapping columns):
In [14]: res = df1.merge(df2, how='outer')
In [15]: res
Out[15]:
Class Local Name DPE JJK Score No Location Scorecard Number
0 Class A York Tom x x 32 NaN York 0.0 653
1 Class A York Jim x x 10 NaN York 0.0 742
2 Class A York Sam x x 32 NaN York 0.0 931
3 Class B Dumpton Park Sarah x x NaN NaN Dumpt 23.5 12
4 Class B Dumpton Park Bob x x NaN NaN Dumpt 23.1 299
5 Class B Dumpton Park Bill x x NaN NaN Dumpt 23.4 198
6 Class A Dover Andy x x NaN NaN Dover 23.0 983
7 Class A Dover Hannah x x NaN NaN Dover 1.0 293
8 Class B London Jemma x x NaN NaN Lond 32.2 0
9 Class B London James x x NaN NaN Lond 32.0 0
One bit to clean up is the Score; I think I would take the max of the two:
In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)
In [17]: del res['Scorecard']
del res['No']
del res['Location']
Then you're left with the columns you want:
In [18]: res
Out[18]:
Class Local Name DPE JJK Score Number
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 931
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 0
9 Class B London James x x 32.0 0
In [18]: res.to_csv('foo.csv')
Note: to force the dtype to object (and have mixed dtypes, ints and floats, rather than all floats) you can use an apply. I would recommend against this if you're doing any analysis!
res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
Python dictionaries are the way to go here:
studentDict = {}
keyOrder = []  # preserve the original row order
with open(<csv1>, 'r') as f:
    header = f.readline()  # keep the header line for the output file
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        key = (LL[0], LL[1][:4], LL[2])  # class, first four letters of location, name
        studentDict[key] = LL
        keyOrder.append(key)
with open(<csv2>, 'r') as f:
    f.readline()  # skip the header
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        key = (LL[0], LL[1][:4], LL[2])
        if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
        if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]
with open(<outFile>, 'w') as f:
    f.write(header)
    for key in keyOrder:
        f.write(','.join(studentDict[key]) + '\n')
pandas makes this sort of task a bit more convenient.
EDIT: Okay, since you can't rely on renaming the columns manually, Roman's suggestion to just match on the first few letters is a good one. We have to change a couple of things first though.
In [62]: df1 = pd.read_clipboard(sep=',')
In [63]: df2 = pd.read_clipboard(sep=',')
In [68]: df1
Out[68]:
Class Location Student Scorecard Number
0 Class A York Jim 0.0 742
1 Class A York Sam 0.0 931
2 Class A York Tom 0.0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23.0 983
7 Class A Dover Hannah 1.0 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
In [69]: df2
Out[69]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 653
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 NaN
9 Class B London James x x 32.0 NaN
Get the columns named the same.
In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
Now for the locations. Save the originals in df2 to a separate Series.
In [71]: locations = df2['Local']
In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)
In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)
Use the string methods to truncate to the first 4 (assuming this won't cause any false matches).
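A quick way to verify that assumption (an extra check, my own addition) is to confirm that truncation doesn't collapse two distinct towns into one:
In [74]: assert locations.str.slice(0, 4).nunique() == locations.nunique()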
Now set the indices:
In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])
In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])
In [80]: df1
Out[80]:
Score No
Class Local Name
Class A York Jim 0.0 742
Sam 0.0 931
Tom 0.0 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23.0 983
Hannah 1.0 293
Class B Lond Jemma 32.2 0
James 32.0 0
In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)
Finally, update the scores as before:
In [85]: df1.update(df2, overwrite=False)
You can get the original locations back by doing:
In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations
And you can write the output to CSV (and a bunch of other formats) with df1.to_csv('path/to/csv').
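For example, with an assumed output filename (index=False keeps the row numbers out of the file, matching the desired output):
In [93]: df1.to_csv('csv1_updated.csv', index=False)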
You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I used the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)
If you use the csv module, you don't have to worry too much about the quotes, and it also allows you to read the rows directly into dictionaries with the column names as keys.
import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                                       'DPE', 'JJK', 'Score', 'No'])
    next(table1)  # skip header row
    first_table = {}
    original_order = []  # list of keys to save the original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', newline='') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                                       'Student', 'Scorecard', 'Number'])
    next(table2)
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'w', newline='') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                                         'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
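A small simplification worth knowing (not used above): if you omit the fieldnames argument, DictReader takes the column names from the header row itself, so there is no header row to skip:
import csv
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1)  # fieldnames come from the header row
    rows = list(table1)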
Hope this helps.
Related
Edited to add an easier-to-reproduce dataframe
I have two dataframes that look something like this:
df1
import numpy as np
import pandas as pd

index = [0, 1, 2, 3, 4, 5, 6, 7, 8]
a = pd.Series(["John Smith", "John Smith", "John Smith", "Kobe Bryant", "Kobe Bryant", "Kobe Bryant", "Jeff Daniels", "Jeff Daniels", "Jeff Daniels"], index=index)
b = pd.Series(["7/29/2022", "8/7/2022", "8/29/2022", "7/9/2022", "7/29/2022", "8/9/2022", "7/28/2022", "8/8/2022", "8/28/2022"], index=index)
c = pd.Series([185, 187, 186.5, 212.5, 217.5, 220.5, 211.1, 210.5, 213], index=index)
df1 = pd.DataFrame(np.c_[a, b, c], columns=["Name", "Date", "Weight"])
df1["Goal"] = np.nan  # the Goal column starts out empty
or df1 in this format:

Name          Date       Weight  Goal
John Smith    7/29/2022  185     NaN
John Smith    8/7/2022   187     NaN
John Smith    8/29/2022  186.5   NaN
Kobe Bryant   7/9/2022   212.5   NaN
Kobe Bryant   7/29/2022  217.5   NaN
Kobe Bryant   8/9/2022   220.5   NaN
Jeff Daniels  7/28/2022  211.1   NaN
Jeff Daniels  8/8/2022   210.5   NaN
Jeff Daniels  8/28/2022  213     NaN
df2
index = [0, 1, 2]
a = pd.Series(["John Smith", "Kobe Bryant", "Jeff Daniels"], index=index)
b = pd.Series([195, 230, 220], index=index)
df2 = pd.DataFrame(np.c_[a, b], columns=["Name", "Weight Goal"])
or df2 in this format:

Name          Weight Goal
John Smith    195
Kobe Bryant   230
Jeff Daniels  220
What I want to do is iterate through df1 and set the respective weight goal from df2 for each player...but I only want to do this for the August dates; I want to ignore the July dates.
I know that I shouldn't be using a for loop with a dataframe in pandas, but I think showing my mental thought process with one might make the intent of my code attempts clearer.
for player in df1['Name']:
df1 = df1.loc[(df1['Name'] == f'{player}') & (df1['Date'] > '8/1/2022')]
df1.at[df2['Name'] == f'{player}', 'Goal'] = (df2.loc[df2.Name == f'{player}']['Weight Goal'])
This just ends up delivering an empty dataframe and a SettingWithCopyWarning. I know this is not the right way to do this, but I thought it might help to direct me.
Thank You.
If I correctly understand the output you are after (Stack Overflow tip: it can be useful to provide a sample of your desired output to help people trying to answer your question), then this should work:
# make the Date column into datetime type so it is easier to filter on
df1 = df1.assign(Date=pd.to_datetime(df1.Date))
# separate out the august rows from the other months
df1_august = df1.loc[df1.Date.apply(lambda x: x.month == 8)]
df1_other_months = df1.loc[df1.Date.apply(lambda x: x.month != 8)]
# use a merge rather than a loop to get WeightGoal column in place
df1_august_merged = df1_august.merge(df2, on="Name")
# finally add the rows for the other months back in
final_df = pd.concat([df1_august_merged, df1_other_months])
print(final_df)
Name Date Weight Goal Weight Goal
0 John Smith 2022-08-07 187.0 NaN 195.0
1 John Smith 2022-08-29 186.5 NaN 195.0
2 Kobe Bryant 2022-08-09 220.5 NaN 230.0
3 Jeff Daniels 2022-08-08 210.5 NaN 220.0
4 Jeff Daniels 2022-08-28 213.0 NaN 220.0
0 John Smith 2022-07-29 185.0 NaN NaN
3 Kobe Bryant 2022-07-09 212.5 NaN NaN
4 Kobe Bryant 2022-07-29 217.5 NaN NaN
6 Jeff Daniels 2022-07-28 211.1 NaN NaN
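If the leftover all-NaN Goal column is unwanted, one possible cleanup (my addition, assuming you want a single goal column named Goal) is:
final_df = final_df.drop(columns='Goal').rename(columns={'Weight Goal': 'Goal'})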
I have a data frame which looks like this:
ga:country ga:hostname ga:pagePathLevel1 ga:pagePathLevel2 ga:keyword ga:adMatchedQuery ga:operatingSystem ga:hour ga:exitPagePath ga:sessions
0 (not set) de.google.com /beste-sms/ / +sms sms Germany best for Android 09 /beste-sms/ 1
1 (not set) de.google.com /beste-sms/ / +sms sms argentinien Macintosh 14 /beste-sms/ 1
2 (not set) de.google.com /beste-sms/ / +sms sms skandinav Android 18 /beste-sms/ 1
3 (not set) de.google.com /beste-sms/ / +sms sms skandinav Macintosh 20 /beste-sms/ 1
4 (not set) de.google.com /beste-sms/ / sms sms iOS 22 /beste-sms/ 1
... ... ... ... ... ... ... ... ... ... ...
85977 Yemen google.com /reviews/ /iphone/ 45to54 (not set) Android 23 /reviews/iphone/ 1
85978 Yemen google.com /tr/ /best-sms/ sms sms Windows 10 /tr/best-sms/ 1
85979 Zambia google.com /best-sms/ /iphone/ +best +sms (not set) Android 16 /best-sms/iphone/ 1
85980 Zimbabwe google.com /reviews/ /testsms/ test test Windows 22 /reviews/testsms/ 1
85981 Zimbabwe google.com /reviews/ /testsms/ testsms testsms Windows 23 /reviews/testsms/ 1
I would like to group them by the column ga:adMatchedQuery and get counts of each column's values for each group in ga:adMatchedQuery.
This question is a follow-up to an earlier question, which may provide more information about what I am trying to achieve.
After using the same code structure that @jezrael suggested:
def f(x):
x = x.value_counts()
y = x.index.astype(str) + ' (' + x.astype(str) + ')'
return y.reset_index(drop=True)
df = df.groupby(['ga:adMatchedQuery']).apply(lambda x: x.apply(f))
print(df)
I get this result:
ga:country ga:hostname ga:pagePathLevel1 ga:pagePathLevel2 ga:keyword ga:adMatchedQuery ga:operatingSystem ga:hour ga:exitPagePath ga:sessions
United States(5683) google.com(14924) /us/(4187) /best-sms/(4565) Undetermined(1855) (not set)(15327) Windows(7616) 18(806) /reviews/testsms/(1880) 1(14005)
United Kingdom(1691) zh.google.com(170) /reviews/(4093) /testsms/(3561) free sms(1729) Android(4291) 20(805) /reviews/scandina/(1307) 2(815)
Canada(1201) t.google.com(80) /best-sms/(2169) /free-sms/(2344) +sms(1414) iOS(2136) 19(804) /best-sms/(1291) 3(231)
Indonesia(445) es.google.com(33) /coupons/(1264) /scandina/(1751) +free +sms(1008) Macintosh(978) 17(787) /coupons/testsms/holiday-deal/(760) 4(92)
Hong Kong(443) pl.google.com(33) /uk/(1172) /(1508) 25to34(988) Linux(160) 21(779) /coupons/scandina/holiday-deal/(239) 6(40)
Australia(353) fr.google.com(27) /ca/(886) /windows/(365) best sms(803) Chrome OS(73) 16(766) (not set)(112) 5(38)
Whereas I am trying to achieve this:
ga:adMatchedQuery ga:country ga:hostname
Undetermined(1855) United States(100) google.com(1000)
United Kingdom(200) zh.google.com(12)
free sms(1855) United States(100) google.com(1000)
United Kingdom(200) zh.google.com(12)
...
Thank you for your suggestions.
I think only the column order has changed; you can reorder the columns before applying my solution:
cols = df.columns.difference(['ga:adMatchedQuery'], sort=False).tolist()
df = df[['ga:adMatchedQuery'] + cols]
Sample with data from the previous answer:
Here the data is grouped by the F column, with the order of the column names unchanged:
def f(x):
x = x.value_counts()
y = x.index.astype(str) + '(' + x.astype(str) + ')'
return y.reset_index(drop=True)
df1 = df.groupby(['F']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print (df1)
B C D E F
0 Honda(1) Canada(1) 2011(1) Salt Lake(1) Crashed(1)
1 Ford(2) Italy(1) 2014(1) Washington(2) New(3)
2 Honda(1) Canada(1) 2005(1) Rome(1) NaN
3 NaN USA(1) 2000(1) NaN NaN
4 Honda(2) USA(3) 2001(2) Salt Lake(2) Used(3)
5 Toyota(1) NaN 2010(1) Ney York(1) NaN
Now the column order is changed:
cols = df.columns.difference(['F'], sort=False).tolist()
df = df[['F'] + cols]
print (df)
F B C D E
1 New Honda USA 2000 Washington
2 Used Honda USA 2001 Salt Lake
3 New Ford Canada 2005 Washington
4 Used Toyota USA 2010 Ney York
5 Used Honda USA 2001 Salt Lake
6 Crashed Honda Canada 2011 Salt Lake
7 New Ford Italy 2014 Rome
def f(x):
x = x.value_counts()
y = x.index.astype(str) + '(' + x.astype(str) + ')'
return y.reset_index(drop=True)
df1 = df.groupby(['F']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print (df1)
F B C D E
0 Crashed(1) Honda(1) Canada(1) 2011(1) Salt Lake(1)
1 New(3) Ford(2) Italy(1) 2014(1) Washington(2)
2 NaN Honda(1) Canada(1) 2005(1) Rome(1)
3 NaN NaN USA(1) 2000(1) NaN
4 Used(3) Honda(2) USA(3) 2001(2) Salt Lake(2)
5 NaN Toyota(1) NaN 2010(1) Ney York(1)
Hi, I am trying to assign certain values to columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
The tail of my dataframe looks as follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
The name of my dataframe is full, and I want to change the names in Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Use loc-based indexing to set matching row values:
miss = ['Mlle', 'Ms']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
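An equivalent one-pass alternative (just a sketch, using the full rare_title list from the question) is Series.replace with a dict:
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
mapping.update({t: 'Rare Title' for t in rare_title})
df['Title'] = df['Title'].replace(mapping)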
I have a pandas data frame from which I computed the mean scores of students. Student scores are stored in data as below:
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
By using meanscore = data.groupby("name").mean()
I obtain
score
name
John 95
Mary 87.5
Suzie 90
I would like to query, for instance, meanscore['score'][meanscore['name'] == 'John']. This line yields KeyError: 'name'.
I know my way of doing it is not nice, as I can actually find the mean score of John by using meanscore['score'][0].
My question is: is there a way to find the corresponding index value of each name (e.g. [0] for John, [1] for Mary and [2] for Suzie) in my query? Thank you!!
You can use loc:
In [11]: meanscore
Out[11]:
score
name
John 95.0
Mary 87.5
Suzie 90.0
In [12]: meanscore.loc["John", "score"]
Out[12]: 95.0
You can do:
meanscore['score']['John']
Example:
>>> df
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
>>> meanscore = df.groupby('name').mean()
>>> meanscore
score
name
John 95.0
Mary 87.5
Suzie 90.0
>>> meanscore['score']['John']
95.0
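If you would rather keep name as a regular column (so that a boolean-mask query like the one in the question works), one option (my suggestion) is as_index=False:
>>> meanscore = df.groupby('name', as_index=False).mean()
>>> meanscore.loc[meanscore['name'] == 'John', 'score']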
Given dataset 1
name,x,y
st. peter,1,2
big university portland,3,4
and dataset 2
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge on
d1.merge(d2, on="name", how="left")
There are no exact matches on name, though, so I'm looking to do a kind of fuzzy matching. The technique does not matter in this case; it's more about how to incorporate it efficiently into pandas.
For example, st. peter might match saint peter in the other, but big university portland might be too much of a deviation that we wouldn't match it with uni portland.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.
Is there a way to do this kind of merging using pandas?
Did you look at fuzzywuzzy?
You might do something like:
import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2.name)

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # note: use row['name'] here, since row.name is the row's index label
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)

merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat Emptor: I haven't tried to run this.
Let's say you have a function which returns the best match if any, and None otherwise:
def best_match(s, candidates):
''' Return the item in candidates that best matches s.
Will return None if a good enough match is not found.
'''
# Some code here.
Then you can join on the values it returns, but you can do this in different ways that would lead to different output (so I think; I did not look closely at this issue):
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
    .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
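For completeness, here is one possible best_match, an assumed implementation built on the python-Levenshtein package and the 5-edit threshold from the question:
from Levenshtein import distance

def best_match(s, candidates, max_edits=5):
    '''Return the candidate with the smallest Levenshtein distance to s,
    or None if even the best candidate needs more than max_edits edits.'''
    best = min(candidates, key=lambda c: distance(s, c))
    return best if distance(s, best) <= max_edits else None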
The simplest idea I can come up with now is to create a special dataframe with the distances between all names:
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
name1 name2 x2 y2
0 st. peter saint peter 3 4
1 st. peter uni portland 5 6
2 big university portland saint peter 3 4
3 big university portland uni portland 5 6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
1 st. peter uni portland 5 6 9
2 big university portland saint peter 3 4 18
3 big university portland uni portland 5 6 11
>>> merger = merger[merger['res'] <= 5]
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
name x y name1 name2 x2 y2
0 st. peter 1 2 st. peter saint peter 3 4
1 big university portland 3 4 NaN NaN NaN NaN