'DataFrame' object has no attribute 'melt' - python

I want to use the melt function in pandas but I keep on getting the same error.
I am just typing the example provided by the documentation:
cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})
I just get the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-119-dc0a0b96cf46> in <module>()
----> 1 cheese.melt(id_vars=['first', 'last'])
C:\Anaconda2\lib\site-packages\pandas\core\generic.pyc in __getattr__(self, name)
2670 if name in self._info_axis:
2671 return self[name]
-> 2672 return object.__getattribute__(self, name)
2673
2674 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'melt'

Your pandas version is below 0.20.0, so you need pandas.melt instead of DataFrame.melt:
df = pd.melt(cheese, id_vars=['first', 'last'])
print (df)
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
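If you can upgrade to pandas 0.20.0 or newer, the method form from the documentation works as-is. A quick sketch to check the installed version first:
import pandas as pd
print(pd.__version__)   # DataFrame.melt exists from 0.20.0 onwards
# after upgrading (e.g. pip install --upgrade pandas) the original call works:
# cheese.melt(id_vars=['first', 'last'])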

An equivalent reshape with set_index and stack also works on older versions:
def grilled(d):
    return d.set_index(['first', 'last']) \
            .rename_axis('variable', axis=1) \
            .stack().reset_index(name='value')
grilled(cheese)
first last variable value
0 John Doe height 5.5
1 John Doe weight 130.0
2 Mary Bo height 6.0
3 Mary Bo weight 150.0

Related

How to extract first word from DataFrame

Background
I have created the below data frame combining two dataset from Kaggle.
Titanic: Machine Learning from Disaster
(input/titanic/train.csv)
titanic-nationalities
DataFrame name: output
PassengerId Nationality Name
0 1 CelticEnglish Braund, Mr. Owen Harris
1 2 CelticEnglish Cumings, Mrs. John Bradley (Florence Briggs Th...
2 3 Nordic,Scandinavian,Sweden Heikkinen, Miss. Laina
3 4 CelticEnglish Futrelle, Mrs. Jacques Heath (Lily May Peel)
....
What I hoped to transform
PassengerId Nationality Name
0 1 CelticEnglish Braund
1 2 CelticEnglish Cumings
2 3 Nordic Heikkinen
3 4 CelticEnglish Futrelle
....
Problem
I tried to execute the code below, but I have no idea how to fix the error.
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
----> 1 output['Nationality'].split('\n', 1)[0]
2 output['Name'].split('\n', 1)[0]
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'split'
Code
output['Nationality'].split('\n', 1)[0]
output['Name'].split('\n', 1)[0]
What I tried to do
I tried converting the column types first, but the result did not change.
output['Nationality'] = output['Nationality'].astype(str)
output['Name'] = output['Name'].astype(str)
output['Nationality'] = output['Nationality'].str.split('\n', expand=True)[0]
output['Name'] = output['Name'].str.split('\n', expand=True)[0]
output
PassengerId Nationality Name
0 1 CelticEnglish Braund, Mr. Owen Harris
1 2 CelticEnglish Cumings, Mrs. John Bradley (Florence Briggs Th...
2 3 Nordic,Scandinavian,Sweden Heikkinen, Miss. Laina
3 4 CelticEnglish Futrelle, Mrs. Jacques Heath (Lily May Peel)
Environment
Kaggle Notebook
A Series object doesn't have a split method. You're trying to split the strings inside the column, so you need the .str accessor on the Series (converting the column to string first with astype(str) if it isn't already).
check the data type of the columns with df.dtypes (dtypes is an attribute, not a method, so no parentheses)
convert a column with output['Nationality'] = output['Nationality'].astype(str)
Try with .str.split(). Note the sample values contain no newline characters; to keep the part before the first comma, as in the desired output, split on ',':
output['Nationality'] = output['Nationality'].str.split(',', expand=True)[0]
output['Name'] = output['Name'].str.split(',', expand=True)[0]
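For instance, a minimal sketch with two of the complete sample rows, splitting on ',' since that is the delimiter actually present in the values:
import pandas as pd
output = pd.DataFrame({
    'PassengerId': [1, 3],
    'Nationality': ['CelticEnglish', 'Nordic,Scandinavian,Sweden'],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
})
output['Nationality'] = output['Nationality'].str.split(',', expand=True)[0]
output['Name'] = output['Name'].str.split(',', expand=True)[0]
print(output)
#    PassengerId    Nationality       Name
# 0            1  CelticEnglish     Braund
# 1            3         Nordic  Heikkinen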

How can I count how many male/female are in each title?

I am a newbie to data science and I want to count how many female/male passengers there are in each Title.
I tried the following piece of code:
newdf = pd.DataFrame()
newdf['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis = 0,inplace=True)
print(newdf.head())
What I get is:
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
Then I try this to add the #male/#female columns:
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
Error message that I have:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What I expected is to have the columns Title, Age (average), #People, #male, #female. So I want to know how many of those #People are male and female.
P.S. Without these lines:
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
everything works fine, and I get:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
But without #male,#female.
Use GroupBy.agg to aggregate the mean together with the group size, then add the male/female counts from a crosstab with DataFrame.join:
df1 = (df.groupby('Title')['Age']
         .agg([('Age', 'mean'), ('#People', 'size')])
         .sort_values(by='Age', ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex'])
df = df1.join(df2)
print (df)
Age #People female male
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0
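In pandas 0.25 and later the same aggregation can also be written with named aggregation; a sketch, assuming df is the same Title/Age/Sex frame as above:
df1 = (df.groupby('Title')
         .agg(**{'Age': ('Age', 'mean'), '#People': ('Age', 'size')})
         .sort_values(by='Age', ascending=False))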

Cannot assign a value to certain columns in Pandas

Hi, I am trying to assign certain values to columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
My tail of dataframe looks like follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
My dataframe is named full and I want to change some of the Title values.
Here is the code I wrote:
# Create a variable rate_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Chained indexing like full[full.Title == "Mlle"].Title = "Miss" assigns to a temporary copy, so full is never modified. Use loc-based indexing to set the matching row values in place (note Mme maps to Mrs, not Miss):
miss = ['Mlle', 'Ms']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
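An equivalent single pass, if you prefer, is Series.replace with a mapping dict; a sketch, assuming the full rare_title list from the question:
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
mapping.update({t: 'Rare Title' for t in rare_title})
full['Title'] = full['Title'].replace(mapping)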

Query based on index value or value in a column in python

I have a pandas data frame from which I computed the mean scores of students. Student scores are stored in data as below:
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
By using meanscore = data.groupby("name").mean()
I obtain
score
name
John 95
Mary 87.5
Suzie 90
I would like to query, for instance, meanscore['score'][meanscore['name'] == 'John']. This line yields KeyError: 'name'.
I know my way of doing it is not nice, as I can actually find out the mean score of John by using meanscore['score'][0].
My question is: is there a way to find the corresponding index value of each name (e.g. [0] for John, [1] for Mary and [2] for Suzie) in my query? Thank you!!
After the groupby, name is the index of meanscore rather than a column, which is why meanscore['name'] raises a KeyError. You can use loc with the index label:
In [11]: meanscore
Out[11]:
score
name
John 95.0
Mary 87.5
Suzie 90.0
In [12]: meanscore.loc["John", "score"]
Out[12]: 95.0
You can do:
meanscore['score']['John']
Example:
>>> df
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
>>> meanscore = df.groupby('name').mean()
>>> meanscore
score
name
John 95.0
Mary 87.5
Suzie 90.0
>>> meanscore['score']['John']
95.0
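If you would rather keep name as a regular column, so that meanscore['name'] works, pass as_index=False to groupby (or call reset_index afterwards):
meanscore = data.groupby('name', as_index=False).mean()
meanscore.loc[meanscore['name'] == 'John', 'score']   # 0    95.0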

Updating csv with data from a csv with different formatting

I'm trying to update a csv file with some student figures provided by other sources; however, they've formatted their csv data slightly differently to ours.
It needs to match students based on three criteria: their name, their class, and the first few letters of the location. So, for example, the first few students from Class B are listed under Dumpt, which is actually Dumpton Park.
When matches are found
If a student's Scorecard in CSV 2 is 0 or blank then it shouldn't update the score column in CSV 1
If a student's Number in CSV 2 is 0 or blank then it shouldn't update the No column in CSV 1
Otherwise it should import the numbers from CSV 2 to CSV1
Below is some example data:
CSV 1
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
CSV 2
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
CSV 1 UPDATED (This is the desired output)
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
I would really appreciate any help with this problem. Thanks Oliver
Here are two solutions: a pandas solution and a plain python solution. First a pandas solution which unsurprisingly looks a whole lot like the other pandas solutions...
First load in the data
import pandas
import numpy as np
cdf1 = pandas.read_csv('csv1', dtype=object)  # dtype=object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2', dtype=object)
col_order = cdf1.columns  # pandas will shuffle the column order at some point; this lets us restore the original order
At this point the data frames will look like
In [6]: cdf1
Out[6]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32 NaN
1 Class A York Jim x x 10 NaN
2 Class A York Sam x x 32 NaN
3 Class B Dumpton Park Sarah x x NaN NaN
4 Class B Dumpton Park Bob x x NaN NaN
5 Class B Dumpton Park Bill x x NaN NaN
6 Class A Dover Andy x x NaN NaN
7 Class A Dover Hannah x x NaN NaN
8 Class B London Jemma x x NaN NaN
9 Class B London James x x NaN NaN
In [7]: cdf2
Out[7]:
Class Location Student Scorecard Number
0 Class A York Jim 0 742
1 Class A York Sam 0 931
2 Class A York Tom 0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23 983
7 Class A Dover Hannah 1 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
Next manipulate both the data frames into matching formats.
dcol = cdf2.Location
cdf2['Location'] = dcol.apply(lambda x: x[0:4]) #Replacement in cdf2 since we don't need original data
dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4]) #Here we add a column leaving 'Local' because we'll need it for the final output
cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan) #Replacing '0' by np.nan means zeros don't overwrite
cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Now cdf1 and cdf2 look like
In [16]: cdf1
Out[16]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 NaN
Jim York x x 10 NaN
Sam York x x 32 NaN
Class B Dump Sarah Dumpton Park x x NaN NaN
Bob Dumpton Park x x NaN NaN
Bill Dumpton Park x x NaN NaN
Class A Dove Andy Dover x x NaN NaN
Hannah Dover x x NaN NaN
Class B Lond Jemma London x x NaN NaN
James London x x NaN NaN
In [17]: cdf2
Out[17]:
Score No
Class Location Name
Class A York Jim NaN 742
Sam NaN 931
Tom NaN 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23 983
Hannah 1 293
Class B Lond Jemma 32.2 NaN
James 32.0 NaN
Updating the data in cdf1 with the data in cdf2
cdf1.update(cdf2, overwrite=False)
results in
In [19]: cdf1
Out[19]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 653
Jim York x x 10 742
Sam York x x 32 931
Class B Dump Sarah Dumpton Park x x 23.5 12
Bob Dumpton Park x x 23.1 299
Bill Dumpton Park x x 23.4 198
Class A Dove Andy Dover x x 23 983
Hannah Dover x x 1 293
Class B Lond Jemma London x x 32.2 NaN
James London x x 32.0 NaN
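The overwrite=False flag is what protects the existing numbers: update only fills in positions that are NaN in cdf1. In miniature:
a = pandas.DataFrame({'x': [1.0, np.nan]})
b = pandas.DataFrame({'x': [9.0, 9.0]})
a.update(b, overwrite=False)
print(a['x'].tolist())   # [1.0, 9.0]; the existing 1.0 survives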
Finally return cdf1 to its original form and write it to a csv file.
cdf1 = cdf1.reset_index() #These two steps allow us to remove the 'Location' column
del cdf1['Location']
cdf1 = cdf1[col_order] #This will switch Local and Name back to their original order
cdf1.to_csv('temp.csv',index = False)
Two notes: first, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) etc., I'd strongly recommend adding some checksumming to make sure that when shifting from Location to the first few letters of Location, you aren't accidentally eliminating a location. Secondly, I sincerely hope there is a typo on line 4 of your desired output.
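For the first note, one possible sanity check (a sketch, assuming cdf1 as loaded above): if two different locations shared a 4-letter prefix they would be silently merged, so the distinct counts should agree.
assert cdf1.Local.nunique() == cdf1.Local.str[:4].nunique(), 'location prefix collision'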
Onto a plain python solution. In the following, adjust the filenames as needed.
#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')
#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()
#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.rstrip('\n').split(',')  # strip the newline so it doesn't stick to the last field
    this_key = (s[0], s[1][0:4], s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)
#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.rstrip('\n').replace('\"', '').split(',')
    this_key = (s[0], s[1][0:4], s[2])
    if this_key in line_dict:  # Lowers the crash rate...
        # Check if we need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        # Check if we need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print "Line not in csv1: %s" % line
#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
    csvout.write('\n')
#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Hopefully this code is a bit more readable. ;) The backport of Python's new Enum type is the enum34 package on PyPI.
from enum import Enum # see PyPI for the backport (enum34)
class Field(Enum):
    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1
    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
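The __index__ hook is what lets a Field member index a plain list directly, for example:
row = 'Class A,York,Tom,x,x,32,653'.split(',')
print(row[Field.student])   # 'Tom'  (Field.student.__index__() returns 2)
print(row[Field.number])    # '653'  (negative values index from the end)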
You could use fuzzywuzzy to do the matching of town names, and append as a column to df2:
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)
towns = df1.Local.unique() # assuming this is complete list of towns
from fuzzywuzzy.fuzz import partial_ratio
In [11]: df2['Local'] = df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))
In [12]: df2
Out[12]:
Class Location Student Scorecard Number Local
0 Class A York Jim 0.0 742 York
1 Class A York Sam 0.0 931 York
2 Class A York Tom 0.0 653 York
3 Class B Dumpt Bob 23.1 299 Dumpton Park
4 Class B Dumpt Bill 23.4 198 Dumpton Park
5 Class B Dumpt Sarah 23.5 12 Dumpton Park
6 Class A Dover Andy 23.0 983 Dover
7 Class A Dover Hannah 1.0 293 Dover
8 Class B Lond Jemma 32.2 0 London
9 Class B Lond James 32.0 0 London
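partial_ratio scores the best-matching substring, so a truncated name that appears verbatim inside the full town name gets a perfect score; a quick illustration:
from fuzzywuzzy.fuzz import partial_ratio
partial_ratio('Dumpt', 'Dumpton Park')   # 100: 'Dumpt' is an exact substring
partial_ratio('Dumpt', 'Dover')          # much lower, so 'Dumpton Park' wins the max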
Make the column names consistent (at the moment the same data is called Student in one frame and Name in the other):
In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)
Now you can merge (on the overlapping columns):
In [14]: res = df1.merge(df2, how='outer')
In [15]: res
Out[15]:
Class Local Name DPE JJK Score No Location Scorecard Number
0 Class A York Tom x x 32 NaN York 0.0 653
1 Class A York Jim x x 10 NaN York 0.0 742
2 Class A York Sam x x 32 NaN York 0.0 931
3 Class B Dumpton Park Sarah x x NaN NaN Dumpt 23.5 12
4 Class B Dumpton Park Bob x x NaN NaN Dumpt 23.1 299
5 Class B Dumpton Park Bill x x NaN NaN Dumpt 23.4 198
6 Class A Dover Andy x x NaN NaN Dover 23.0 983
7 Class A Dover Hannah x x NaN NaN Dover 1.0 293
8 Class B London Jemma x x NaN NaN Lond 32.2 0
9 Class B London James x x NaN NaN Lond 32.0 0
One bit to clean up is the Score, I think I would take the max of the two:
In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(axis=1)
In [17]: del res['Scorecard']
del res['No']
del res['Location']
Then you're left with the columns you want:
In [18]: res
Out[18]:
Class Local Name DPE JJK Score Number
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 931
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 0
9 Class B London James x x 32.0 0
In [18]: res.to_csv('foo.csv')
Note: to force the dtype to object (and have mixed dtypes, ints and floats, rather than all floats) you can use an apply. I would recommend against this if you're doing any analysis!
res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
Python dictionaries are the way to go here:
studentDict = {}
with open(<csv1>, 'r') as f:
    header = f.readline()  # hold on to the header for the output file
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        # key on the first four letters of the location so both files can match
        studentDict[LL[0], LL[1][:4], LL[2]] = LL
with open(<csv2>, 'r') as f:
    f.readline()  # skip the header
    for line in f:
        LL = line.rstrip('\n').replace('"', '').split(',')
        if LL[-2] not in ('0', ''): studentDict[LL[0], LL[1][:4], LL[2]][-2] = LL[-2]
        if LL[-1] not in ('0', ''): studentDict[LL[0], LL[1][:4], LL[2]][-1] = LL[-1]
with open(<outFile>, 'w') as f:
    f.write(header)
    for k in studentDict:
        f.write(','.join(studentDict[k]) + '\n')
pandas makes this sort of task a bit more convenient.
EDIT: Okay since you can't rely on renaming columns manually, Roman's suggestion to just match on the first few letters is a good one. We have to change a couple things before that though.
In [62]: df1 = pd.read_clipboard(sep=',')
In [63]: df2 = pd.read_clipboard(sep=',')
In [68]: df1
Out[68]:
Class Location Student Scorecard Number
0 Class A York Jim 0.0 742
1 Class A York Sam 0.0 931
2 Class A York Tom 0.0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23.0 983
7 Class A Dover Hannah 1.0 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
In [69]: df2
Out[69]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 653
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 NaN
9 Class B London James x x 32.0 NaN
Get the columns named the same.
In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
Now for the locations. Save the originals in df2 to a separate Series.
In [71]: locations = df2['Local']
In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)
In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)
Use the string methods to truncate to the first 4 characters (assuming this won't cause any false matches).
Now set the indices:
In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])
In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])
In [80]: df1
Out[80]:
Score No
Class Local Name
Class A York Jim 0.0 742
Sam 0.0 931
Tom 0.0 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23.0 983
Hannah 1.0 293
Class B Lond Jemma 32.2 0
James 32.0 0
In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)
Finally, update the scores as before:
In [85]: df1.update(df2, overwrite=False)
You can get the original locations back by doing:
In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations
And you can write the output to csv (and a bunch of other formats) with df1.to_csv('path/to/csv').
You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I used the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to save the order).
If you use the csv module, you don't have to worry too much about the quotes, and it also allows you to read the rows directly into dictionaries with the column names as keys.
import csv
# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', 'rb') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                                       'DPE', 'JJK', 'Score', 'No'])
    table1.next()  # skip header row
    first_table = {}
    original_order = []  # list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', 'rb') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                                       'Student', 'Scorecard', 'Number'])
    table2.next()
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'wb') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                                         'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
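Note this is Python 2 style ('rb'/'wb' file modes and table1.next()). Under Python 3 you would open the files in text mode with newline='' and advance with the builtin next() instead, roughly:
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                                       'DPE', 'JJK', 'Score', 'No'])
    next(table1)   # skip header row
    ...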
Hope this helps.
