Append two multi indexed data frames in pandas - python

I am working with this simple setup of variables:
In [94]: cc
Out[94]:
d0 d1
class sample
5 66 0.128320 0.970817
66 0.160488 0.969077
77 0.919263 0.008597
6 77 0.811914 0.123960
88 0.639887 0.262943
88 0.312303 0.660786
In [101]: bb
Out[101]:
d0 d1
class sample
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
In [102]: aa
Out[102]:
d0 d1
class sample
0 00 0.190409 0.789750
11 0.588001 0.250663
1 22 0.888343 0.428968
33 0.185525 0.450020
I can perform the following command:
In [103]: aa.append(bb)
Out[103]:
d0 d1
class sample
0 00 0.190409 0.789750
11 0.588001 0.250663
1 22 0.888343 0.428968
33 0.185525 0.450020
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
Why can't I perform the following command in the same manner?
aa.append(cc)
I get the following exception:
ValueError: all arrays must be same length
UPDATE:
It works fine if I do not provide column names, but if, for example, I have 4 columns with names ['d0','d0','d1','d1'] for 4X4 and 8X4, it does not work anymore.
Here is the code for reproducing the error:
import numpy as np
import pandas

y1 = [['0','0','1','1'], ['00','11','22','33']]
y2 = [['2','2','3','3','4','4'], ['44','55','66','77','88','99']]
x1 = np.random.rand(4,4)
x2 = np.random.rand(6,4)
cols = ['d1']*2 + ['d2']*2
names = ['class','idx']
aa = pandas.DataFrame(x1, index=y1, columns=cols)
aa.index.names = names
print(aa)
bb = pandas.DataFrame(x2, index=y2, columns=cols)
bb.index.names = names
print(bb)
aa.append(bb)
What should I do to get this running?
Thanks

You can use pd.concat to combine the frames:
concatenated = pd.concat([bb, cc])
concatenated
d0 d1
class sample
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
5 66 0.128320 0.970817
66 0.160488 0.969077
77 0.919263 0.008597
6 77 0.811914 0.123960
88 0.639887 0.262943
88 0.312303 0.660786
Answer To Your Edited Question
So to answer your edited question, the problem lies with your column names having duplicates.
cols = ['d1']*2 + ['d2']*2 # <-- this creates ['d1', 'd1', 'd2', 'd2']
and your dataframes end up with what are considered duplicate columns, i.e.
In [62]: aa
Out[62]:
d1 d1 d2 d2
class idx
0 00 0.805445 0.442059 0.296162 0.041271
11 0.384600 0.723297 0.997918 0.006661
1 22 0.685997 0.794470 0.541922 0.326008
33 0.117422 0.667745 0.662031 0.634429
and
In [64]: bb
Out[64]:
d1 d1 d2 d2
class idx
2 44 0.465559 0.496039 0.044766 0.649145
55 0.560626 0.684286 0.929473 0.607542
3 66 0.526605 0.836667 0.608098 0.159471
77 0.216756 0.749625 0.096782 0.547273
4 88 0.619338 0.032676 0.218736 0.684045
99 0.987934 0.349520 0.346036 0.926373
DataFrame.append() (and pd.concat()) can only append correctly if the column names are unique.
Try this and you will not get any error:
cols2 = ['d1', 'd2', 'd3', 'd4']
cc = pandas.DataFrame(x1, index=y1, columns=cols2)
cc.index.names = names
dd = pandas.DataFrame(x2, index=y2, columns=cols2)
dd.index.names = names
Now...
In [70]: cc.append(dd)
Out[70]:
d1 d2 d3 d4
class idx
0 00 0.805445 0.442059 0.296162 0.041271
11 0.384600 0.723297 0.997918 0.006661
1 22 0.685997 0.794470 0.541922 0.326008
33 0.117422 0.667745 0.662031 0.634429
2 44 0.465559 0.496039 0.044766 0.649145
55 0.560626 0.684286 0.929473 0.607542
3 66 0.526605 0.836667 0.608098 0.159471
77 0.216756 0.749625 0.096782 0.547273
4 88 0.619338 0.032676 0.218736 0.684045
99 0.987934 0.349520 0.346036 0.926373
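Note that in recent pandas releases DataFrame.append was deprecated (1.4) and removed (2.0), so the same append is now written with pd.concat. A minimal sketch, reusing the cc and dd frames with unique column names from the snippet above:
import pandas as pd

# Equivalent of cc.append(dd) on pandas >= 2.0, where DataFrame.append no longer exists
result = pd.concat([cc, dd])
print(result)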

Related

summing two columns in a dataframe

My df looks as follows:
Roll Name Age Physics English Maths
0 A1 Max 16 87 79 90
1 A2 Lisa 15 47 75 60
2 A3 Luna 17 83 49 95
3 A4 Ron 16 86 79 93
4 A5 Silvia 15 57 99 91
I'd like to add the columns Physics, English, and Maths and display the results in a separate column 'Grade'.
I've tried the code:
df['Physics'] + df['English'] + df['Maths']
But it just concatenates the values. I have not been taught about lambda functions yet.
How do I go about this?
df['Grade'] = df['Physics'] + df['English'] + df['Maths']
If it concatenates, your data is probably stored as strings; just convert it to float or integer.
Check the data types first by using df.dtypes
Try:
df["total"] = df[["Physics", "English", "Maths"]].sum(axis=1)
df
Check the code below. It is possible your columns are in string format; the below will solve that:
import pandas as pd
df = pd.DataFrame({"Physics":['1','2','3'],"English":['1','2','3'],"Maths":['1','2','3']})
df['Total'] = df['Physics'].astype('int') +df['English'].astype('int') +df['Maths'].astype('int')
df
Output:
  Physics English Maths  Total
0       1       1     1      3
1       2       2     2      6
2       3       3     3      9
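If some of the string values might not convert cleanly, a slightly more defensive variant (a sketch, not from the original answers; the 'x' value is only an illustration) is pd.to_numeric with errors='coerce':
import pandas as pd

df = pd.DataFrame({"Physics": ['1','2','x'], "English": ['1','2','3'], "Maths": ['1','2','3']})
# Anything unparseable (like 'x') becomes NaN instead of raising an error
numeric = df[["Physics", "English", "Maths"]].apply(pd.to_numeric, errors='coerce')
# sum(axis=1) skips NaN by default
df["Total"] = numeric.sum(axis=1)
df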

How to calculate cumulative sum and average on file data in python

I have the below data in a file:
NAME,AGE,MARKS
A1,12,40
B1,13,54
C1,15,67
D1,11,41
E1,16,59
F1,10,60
If the data were in a database table, I would have used the Sum and Average functions to get the cumulative sum and average.
But how to get it with Python is a bit challenging, as I am a learner.
Expected output :
NAME,AGE,MARKS,CUM_SUM,AVG
A1,12,40,40,40
B1,13,54,94,47
C1,15,67,161,53.66
D1,11,41,202,50.5
E1,16,59,261,43.5
F1,10,60,321,45.85
IIUC use:
df = pd.read_csv('file')
df['CUM_SUM'] = df['MARKS'].cumsum()
df['AVG'] = df['MARKS'].expanding().mean()
print (df)
NAME AGE MARKS CUM_SUM AVG
0 A1 12 40 40 40.000000
1 B1 13 54 94 47.000000
2 C1 15 67 161 53.666667
3 D1 11 41 202 50.500000
4 E1 16 59 261 52.200000
5 F1 10 60 321 53.500000
Lastly, use:
df.to_csv('file.csv', index=False)
Or:
out = df.to_string(index=False)
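If you would rather not use pandas at all, a plain-Python sketch with the csv module gives the same running totals (the file names 'file.csv' and 'out.csv' are assumptions):
import csv

total = 0
rows_out = []
with open('file.csv', newline='') as f:        # assumed input file name
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, start=1):
        total += int(row['MARKS'])
        row['CUM_SUM'] = total
        row['AVG'] = round(total / i, 2)       # running average of MARKS so far
        rows_out.append(row)

with open('out.csv', 'w', newline='') as f:    # assumed output file name
    writer = csv.DictWriter(f, fieldnames=rows_out[0].keys())
    writer.writeheader()
    writer.writerows(rows_out)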

Python manipulation

I have 3 models of the same kind (M1, M2, M3), each with results for 5 customers (x1, x2, x3, x4, x5). The business has now told me that one model has been chosen for each customer; the chosen models can be seen in the Best_Models dataframe. I now have to select, for each customer, the result of the model chosen by the business, which can be seen in the Output dataframe. How can I do that?
import pandas as pd
data1 = {'x1': [86,23,32,13,45,12],
'x2': [96,98,34,12,22,19],
'x3': [56,23,44,12,32,33],
'x4': [96,43,84,72,42,97],
'x5': [16,33,64,82,92,44]
}
Model1 = pd.DataFrame(data1,
columns=['x1','x2','x3','x4','x5']
)
data2 = {'x1': [36,23,32,13,66,12],
'x2': [56,98,64,12,22,19],
'x3': [86,23,44,52,32,33],
'x4': [96,43,74,72,42,97],
'x5': [16,53,64,82,77,44]
}
Model2 = pd.DataFrame(data2,
columns=['x1','x2','x3','x4','x5'])
data3 = {'x1': [36,43,32,13,66,12],
'x2': [56,48,64,12,22,19],
'x3': [86,23,44,54,32,33],
'x4': [96,44,74,44,42,97],
'x5': [16,53,64,82,44,44]
}
Model3 = pd.DataFrame(data3,
columns=['x1','x2','x3','x4','x5'])
Model3
data4 = {"Customer":["x1","x2","x3","x4","x5"],
"Best_Model":["M2","M3","M1","M2","M3"]
}
Best_Models = pd.DataFrame(data4, columns=['Customer', 'Best_Model'])
Best_Models
data5 = {'x1': [36,23,32,13,66,12],
'x2': [56,48,64,12,22,19],
'x3': [56,23,44,12,32,33],
'x4': [96,43,74,72,42,97],
'x5': [16,53,64,82,44,44]
}
Output = pd.DataFrame(data5,
columns=['x1','x2','x3','x4','x5'],
index=['I1', 'I2','I3','I4','I5','I6'])
Output
What I tried:
I tried to pivot the Best_Models dataframe and then map the results, but that did not work for me. Could anyone suggest a better way to code this?
Let's try concat and then loc:
(pd.concat([Model1,Model2,Model3], keys=['M1','M2','M3'], axis=1)
.loc[:,[(m,c) for m,c in zip(Best_Models.Best_Model, Best_Models.Customer)]]
)
Output:
M2 M3 M1 M2 M3
x1 x2 x3 x4 x5
0 36 56 56 96 16
1 23 48 23 43 53
2 32 64 44 74 64
3 13 12 12 72 82
4 66 22 32 42 44
5 12 19 33 97 44
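If you want the columns to be just the customer names (as in the desired Output) rather than the (model, customer) pairs, one option is to drop the model level afterwards; a small sketch building on the concat above:
result = pd.concat([Model1, Model2, Model3], keys=['M1', 'M2', 'M3'], axis=1)
result = result.loc[:, [(m, c) for m, c in zip(Best_Models.Best_Model, Best_Models.Customer)]]
# Keep only the customer names as column labels
result.columns = result.columns.droplevel(0)
result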
Best_Models.apply(
    lambda r: {'M1': Model1, 'M2': Model2, 'M3': Model3}[r['Best_Model']][r['Customer']],
    axis=1
).T.rename(columns=Best_Models.Customer)
Output:
x1 x2 x3 x4 x5
0 36 56 56 96 16
1 23 48 23 43 53
2 32 64 44 74 64
3 13 12 12 72 82
4 66 22 32 42 44
5 12 19 33 97 44
Create a dictionary to map best-model names to the actual model dataframes.
Since the customer names in Best_Models and in the model dataframes match, we can index them directly.
Finally, rename the result with the corresponding customer names.
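A more explicit way to spell out those same steps (a sketch, not taken from the original answer) is to build the result column by column:
models = {'M1': Model1, 'M2': Model2, 'M3': Model3}   # map model names to dataframes
# For each customer, take that customer's column from the model the business chose
result = pd.DataFrame({
    row.Customer: models[row.Best_Model][row.Customer]
    for row in Best_Models.itertuples()
})
result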

Manipulating data in Pandas

This is my data:
Number Name Points Math Points BG Wish
0 1 Огнян 50 65 MT
1 2 Момчил 61 27 MT
2 3 Радослав 68 68 MT
3 4 Павел 28 16 MT
4 10 Виктор 67 76 MT
5 11 Петър 26 68 BT
6 12 Антон 64 58 BT
7 13 Васил 29 42 BT
8 20 Виктория 62 67 BT
That's my code:
df = pd.read_csv('Input_data.csv', encoding='utf-8-sig')
df['Total'] = df.iloc[:, 2:].sum(axis=1)
df = df.sort_values(['Total', 'Name'], ascending=[0, 1])
df.to_excel("BT RANKING_5.xlsx", encoding='utf-8-sig', index=False)
For each person who has Wish == MT, I want to double the score in the Points Math column.
I tried:
df.loc[df['Wish'] == 'MT', 'Points Math'] = df.loc[df['Points Math'] * 2]
but this didn't work. I also tried an if statement and a for loop, but they didn't work either.
What's the appropriate syntax to do this?
Use this:
import numpy as np

df['Points_Math'] = np.where(df['Wish'] == 'MT', df['Points Math'] * 2, df['Points Math'])
A new column 'Points_Math' will be created with the desired results, or you can overwrite the original column by replacing 'Points_Math' with 'Points Math'.
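An in-place alternative, close to what you originally tried, is a single .loc assignment (a sketch using only pandas indexing):
# Double Points Math only on the rows where Wish is 'MT'
df.loc[df['Wish'] == 'MT', 'Points Math'] *= 2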

Writing a dict of large dataframes to excel

I am creating dicts where the keys are strings and the values are large-ish pandas DataFrames. I would like to write these dicts to an Excel file, but the issue I'm having is that when Python writes a DataFrame to a csv it cuts out parts. Code:
import pandas as pd
import numpy as np

def create_random_df():
    return pd.DataFrame(np.random.randint(0, 100, size=(70, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

dic = {'key1': create_random_df(), 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        f.write("%s,%s\n" % (key, dic[key]))
This sort of outputs the format I'd like except for the following:
All of the dataframe columns are in cell B1 and they're not complete; it's
A B C D E F G H I ... R S T U V W X Y Z
and then the indexes and dataframe elements are all in column A, i.e. cells A2:A4 are
0 55 96 60 47 11 3 2 69 50 ... 3 23 26 3 15 53 78 95 49
1 72 48 12 25 32 57 11 84 5 ... 11 43 56 0 68 55 95 64 84
2 80 56 78 58 79 72 67 97 58 ... 84 34 18 21 71 20 72 36 37
I'd like the dataframes to be written to the csv in their entirety, and obviously with the values in discrete cells.
You can try:
dic = {'key1': create_random_df(), 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        df = dic[key]
        df.insert(0, 'Key', pd.Series([key]))
        df.Key = df.Key.fillna('')
        f.write(df.to_csv(index=False))
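Since the title mentions Excel, another option is to skip the CSV round-trip and write each DataFrame to its own sheet with pd.ExcelWriter; a minimal sketch, assuming one sheet per key and an xlsx engine such as openpyxl installed:
import pandas as pd

# One sheet per dict key; to_excel writes the full frame, so nothing is truncated
with pd.ExcelWriter('test.xlsx') as writer:
    for key, df in dic.items():
        df.to_excel(writer, sheet_name=key)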
