Python manipulation - python

I have three versions of the same model (M1, M2, M3), each scoring five customers (x1, x2, x3, x4, x5). The business has told me that, for each customer, one of the models has been chosen; the chosen model per customer is in the Best_Models dataframe. Now I have to select, for each customer, the results of the model the business chose, producing the Output dataframe below. How can I do that?
import pandas as pd

data1 = {'x1': [86,23,32,13,45,12],
         'x2': [96,98,34,12,22,19],
         'x3': [56,23,44,12,32,33],
         'x4': [96,43,84,72,42,97],
         'x5': [16,33,64,82,92,44]}
Model1 = pd.DataFrame(data1, columns=['x1','x2','x3','x4','x5'])

data2 = {'x1': [36,23,32,13,66,12],
         'x2': [56,98,64,12,22,19],
         'x3': [86,23,44,52,32,33],
         'x4': [96,43,74,72,42,97],
         'x5': [16,53,64,82,77,44]}
Model2 = pd.DataFrame(data2, columns=['x1','x2','x3','x4','x5'])

data3 = {'x1': [36,43,32,13,66,12],
         'x2': [56,48,64,12,22,19],
         'x3': [86,23,44,54,32,33],
         'x4': [96,44,74,44,42,97],
         'x5': [16,53,64,82,44,44]}
Model3 = pd.DataFrame(data3, columns=['x1','x2','x3','x4','x5'])
Model3

data4 = {"Customer": ["x1","x2","x3","x4","x5"],
         "Best_Model": ["M2","M3","M1","M2","M3"]}
Best_Models = pd.DataFrame(data4, columns=['Customer', 'Best_Model'])
Best_Models

data5 = {'x1': [36,23,32,13,66,12],
         'x2': [56,48,64,12,22,19],
         'x3': [56,23,44,12,32,33],
         'x4': [96,43,74,72,42,97],
         'x5': [16,53,64,82,44,44]}
Output = pd.DataFrame(data5, columns=['x1','x2','x3','x4','x5'],
                      index=['I1','I2','I3','I4','I5','I6'])
Output
What I tried:
I tried pivoting the Best_Models dataframe and then mapping the results, but that did not work. Could anyone suggest a better way to code this?

Let's try pd.concat and then select with loc:
(pd.concat([Model1, Model2, Model3], keys=['M1','M2','M3'], axis=1)
   .loc[:, [(m, c) for m, c in zip(Best_Models.Best_Model, Best_Models.Customer)]])
Output:
  M2  M3  M1  M2  M3
  x1  x2  x3  x4  x5
0 36  56  56  96  16
1 23  48  23  43  53
2 32  64  44  74  64
3 13  12  12  72  82
4 66  22  32  42  44
5 12  19  33  97  44
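If you want the columns labelled by customer alone, as in the expected Output frame, a small follow-on sketch (an addition, not part of the original answer) drops the model level afterwards:
result = (pd.concat([Model1, Model2, Model3], keys=['M1','M2','M3'], axis=1)
          .loc[:, [(m, c) for m, c in zip(Best_Models.Best_Model, Best_Models.Customer)]])
result.columns = result.columns.droplevel(0)   # keep only the customer names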

Best_Models.apply(
    lambda r: {'M1': Model1, 'M2': Model2, 'M3': Model3}[r['Best_Model']][r['Customer']],
    axis=1
).T.rename(columns=Best_Models.Customer)
output:
   x1  x2  x3  x4  x5
0  36  56  56  96  16
1  23  48  23  43  53
2  32  64  44  74  64
3  13  12  12  72  82
4  66  22  32  42  44
5  12  19  33  97  44
Create a dictionary mapping best-model names to the actual model DataFrames.
Since the customer names in Best_Models match the model columns, we can index them directly.
Finally, rename the result's columns with the corresponding customer names.
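The same idea also works without apply; a minimal sketch under the same assumptions:
models = {'M1': Model1, 'M2': Model2, 'M3': Model3}
Output = pd.concat(
    [models[m][c] for m, c in zip(Best_Models.Best_Model, Best_Models.Customer)],
    axis=1)   # each models[m][c] is a Series already named after its customer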

Related

Optimize dataframe fill and refill in Python pandas

I rename the column names and add new columns too.
I have a numpy array whose values I have to fill into the respective dataframe columns.
Filling the dataframe with the following code is slow:
import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")
df = df.tail(1000)

DISPLAY_IN_TRAINING = []
Slice_Middle_Piece_X = slice(None, -1, None)
Slice_Middle_Piece_Y = slice(-1, None)
input_slicer = slice(None, None)
output_slice = slice(None, None)
seq_len = 15  # choose sequence length
n_steps = seq_len - 1
Disp_Data = df

def Generate_DataSet(stock, df_clone, seq_len):
    global DISPLAY_IN_TRAINING
    data_raw = stock.values  # convert to numpy array
    data = []
    len_data_raw = data_raw.shape[0]
    for index in range(0, len_data_raw - seq_len + 1):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data)
    test_set_size = int(np.round(30 / 100 * data.shape[0]))
    train_set_size = data.shape[0] - test_set_size
    x_train, y_train = Get_Data_Chopped(data[:train_set_size])
    print("Training Sliced Successful....!")
    df_train_candle = df_clone[n_steps : train_set_size + n_steps]
    if len(DISPLAY_IN_TRAINING) == 0:
        DISPLAY_IN_TRAINING = list(df_clone)
    df_train_candle.columns = DISPLAY_IN_TRAINING
    return [x_train, y_train, df_train_candle]

def Get_Data_Chopped(data_related_to):
    x_values = []
    y_values = []
    for index, iter_values in enumerate(data_related_to):
        x_values.append(iter_values[Slice_Middle_Piece_X, input_slicer])
        y_values.append([item for sublist in iter_values[Slice_Middle_Piece_Y, output_slice] for item in sublist])
    x_values = np.asarray(x_values)
    y_values = np.asarray(y_values)
    return [x_values, y_values]

x_train, y_train, df_train_candle = Generate_DataSet(df, Disp_Data, seq_len)
df_train_candle.reset_index(drop=True, inplace=True)

df_columns = list(df_train_candle)
df_outputs_name = []
OUTPUT_COLUMN = df.columns
for output_column_name in OUTPUT_COLUMN:
    df_outputs_name.append(output_column_name + "_pred")
    for i in range(len(df_columns)):
        if df_columns[i] == output_column_name:
            df_columns[i] = output_column_name + "_orig"
            break

df_train_candle.columns = df_columns
df_pred_names = pd.DataFrame(columns=df_outputs_name)
df_train_candle = df_train_candle.join(df_pred_names, how="outer")

for row_index, row_value in enumerate(y_train):
    for valueindex, output_label in enumerate(OUTPUT_COLUMN):
        df_train_candle.loc[row_index, output_label + "_orig"] = row_value[valueindex]
        df_train_candle.loc[row_index, output_label + "_pred"] = row_value[valueindex]

print(df_train_candle.head())
The shape of my y_train is (195, 24) and the dataframe shape is (195, 48). Now I am trying to optimize the process and make it faster. y_train may change shape to, say, (195, 1) or (195, 5).
Can someone suggest a more optimized way of doing the above? I want a general solution that fits any shape without losing data integrity and is faster too.
If the data size increases from 1000 to 2000 rows, the process becomes slow. Please advise how to make it faster.
Sample data: df looks like this, with shape (1000, 8):
A B C D E F G H
64272 195 215 239 272 22 11 33 55
64273 196 216 240 273 22 11 33 55
64274 197 217 241 274 22 11 33 55
64275 198 218 242 275 22 11 33 55
64276 199 219 243 276 22 11 33 55
The output looks like this:
A_orig B_orig C_orig D_orig E_orig F_orig G_orig H_orig A_pred B_pred C_pred D_pred E_pred F_pred G_pred H_pred
0 10 30 54 87 22 11 33 55 10 30 54 87 22 11 33 55
1 11 31 55 88 22 11 33 55 11 31 55 88 22 11 33 55
2 12 32 56 89 22 11 33 55 12 32 56 89 22 11 33 55
3 13 33 57 90 22 11 33 55 13 33 57 90 22 11 33 55
4 14 34 58 91 22 11 33 55 14 34 58 91 22 11 33 55
Please generate a CSV with 1000 or more rows and you will see the program become slower. I want to make it faster. I hope this is enough to understand the problem.
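No answer is included here, but the usual fix is to avoid the cell-by-cell .loc writes, which dominate the runtime. A hedged sketch, assuming y_train's columns line up with OUTPUT_COLUMN as they do in the code above:
# Assign whole blocks of columns at once instead of one cell at a time.
orig_cols = [c + "_orig" for c in OUTPUT_COLUMN]
pred_cols = [c + "_pred" for c in OUTPUT_COLUMN]
df_train_candle[orig_cols] = y_train   # one vectorized write per block
df_train_candle[pred_cols] = y_train
This stays correct for any y_train width, e.g. (195, 1) or (195, 5), as long as OUTPUT_COLUMN matches.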

How to do complex calculations in a pandas dataframe

sample dataframe:
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})
I want to compute df['L2'] as shown below.
I studied pandas rolling, groupby, etc., but could not solve it.
Please read the L2 formula and give me your opinion.
L2 formula
L2(Jan-20) = 24
-------------------
sales 2020-01
0 2020-01 24
-------------------
L2(Feb-20) = 132 (sum of the 2x2 matrix below)
sales 2020-01 2020-02
0 2020-01 24 24
1 2020-02 42 42
-------------------
L2(Mar-20) = 154 (sum of the 2x2 matrix below)
sales 2020-02 2020-03
0 2020-02 42 24
1 2020-03 18 70
-------------------
L2(Apr-20) = 187 (sum of the 2x2 matrix below)
sales 2020-03 2020-04
0 2020-03 70 44
1 2020-04 70 3
output
Unnamed: 0 sales Jan-20 Feb-20 Mar-20 Apr-20 May-20 L2 L3
0 0 Jan-20 24 24 64 22 11 24 24
1 1 Feb-20 42 42 24 11 35 132 132
2 2 Mar-20 18 18 70 44 74 154 326
3 3 Apr-20 68 68 70 3 12 187 350
4 4 May-20 24 24 88 5 69 89 545
5 5 Jun-20 30 30 57 78 51 203 433
Values = f.values[:, 1:]   # f is the question's dataframe
L2 = []
RANGE = Values.shape[0]
for a in range(RANGE):
    if a == 0:
        result = Values[a, a]
    else:
        if Values[a-1:a+1, a-1:a+1].shape == (2, 1):   # window runs past the last column
            result = np.sum(Values[a-1:a+1, a-2:a])    # shift the window one column left
        else:
            result = np.sum(Values[a-1:a+1, a-1:a+1])
    L2.append(result)
print(L2)
L2 output: [24, 132, 154, 187, 89, 203]
f["L2"] = L2
f:
import pandas as pd
import numpy as np

# make a dataset
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})
print(df)

# datawork (L2)
for i in range(0, df.shape[0]):
    if i == 0:
        df.loc[i, 'L2'] = df.loc[i, '2020-01']
    else:
        if i != df.shape[0] - 1:
            df.loc[i, 'L2'] = df.iloc[i-1:i+1, i:i+2].sum().sum()
        if i == df.shape[0] - 1:
            df.loc[i, 'L2'] = df.iloc[i-1:i+1, i-1:i+1].sum().sum()
print(df)
# sales 2020-01 2020-02 2020-03 2020-04 2020-05 L2
#0 2020-01 24 24 64 22 11 24.0
#1 2020-02 42 42 24 11 35 132.0
#2 2020-03 18 18 70 44 74 154.0
#3 2020-04 68 68 70 3 12 187.0
#4 2020-05 24 24 88 5 69 89.0
#5 2020-06 30 30 57 78 51 203.0
I tried another method.
This method uses reshape long (in Python: melt), but I applied reshape long twice, because the time frequency of sales and the other columns in df is monthly rather than daily, so I did one more reshape long to create an integer column corresponding to the monthly dates.
(I have used Stata more often than Python; in Stata I only need one reshape long because it understands monthly time frequency, and reshaping is much easier there than in pandas.)
If you are interested, take a look:
# 00.module
import pandas as pd
import numpy as np
from order import order # https://stackoverflow.com/a/68464246/16478699
# 0.make a dataset
df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
'2020-01': [24, 42, 18, 68, 24, 30],
'2020-02': [24, 42, 18, 68, 24, 30],
'2020-03': [64, 24, 70, 70, 88, 57],
'2020-04': [22, 11, 44, 3, 5, 78],
'2020-05': [11, 35, 74, 12, 69, 51]}
)
df.to_stata('dataset.dta', version=119, write_index=False)
print(df)
# 1.reshape long(in python: melt)
t = list(df.columns)
t.remove('sales')
df_long = df.melt(id_vars='sales', value_vars=t, var_name='var', value_name='val')
df_long['id'] = list(range(1, df_long.shape[0] + 1)) # make id for another resape long
print(df_long)
# 2.another reshape long(in python: melt, reason: make int(col name: tid) corresponding to monthly date of sales and monthly columns in df)
df_long2 = df_long.melt(id_vars=['id', 'val'], value_vars=['sales', 'var'])
df_long2['tid'] = df_long2['value'].apply(lambda x: 1 + list(df_long2.value.unique()).index(x))
print(df_long2)
# 3.back to wide form with tid(in python: pd.pivot)
df_wide = pd.pivot(df_long2, index=['id', 'val'], columns='variable', values=['value', 'tid'])
df_wide.columns = df_wide.columns.map(lambda x: x[1] if x[0] == 'value' else f'{x[0]}_{x[1]}') # change multiindex columns name into just normal columns name
df_wide = df_wide.reset_index()
print(df_wide)
# 4.make values of L2
for i in df_wide.tid_sales.unique():
    if list(df_wide.tid_sales.unique()).index(i) + 1 == len(df_wide.tid_sales.unique()):
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[
            ((df_wide['tid_sales'] == i) | (df_wide['tid_sales'] == i - 1))
            & ((df_wide['tid_var'] == i - 1) | (df_wide['tid_var'] == i - 2)), 'val'].sum()
    else:
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[
            ((df_wide['tid_sales'] == i) | (df_wide['tid_sales'] == i - 1))
            & ((df_wide['tid_var'] == i) | (df_wide['tid_var'] == i - 1)), 'val'].sum()
print(df_wide)
# 5.back to shape of df with L2(reshape wide, in python: pd.pivot)
df_final = df_wide.drop(columns=df_wide.filter(regex='^tid').columns)  # no more columns starting with tid needed
df_final = pd.pivot(df_final, index=['sales', 'L2'], columns='var', values='val').reset_index()
df_final = order(df_final, 'L2', f_or_l='last') # order function is made by me
print(df_final)
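For comparison, the first answer's logic can be condensed into a few lines of numpy (a sketch of equivalent logic, not from the original answers; it assumes the question's df before any L2 column has been added): walk 2x2 windows down the diagonal, shifting the column window left by one at the right edge:
vals = df.iloc[:, 1:].to_numpy()   # drop the 'sales' column
n_rows, n_cols = vals.shape
L2 = [vals[0, 0]]                  # first month is a single value
for a in range(1, n_rows):
    c = a - 1 if a + 1 <= n_cols else a - 2   # shift left when past the last column
    L2.append(vals[a-1:a+1, c:c+2].sum())
df['L2'] = L2   # [24, 132, 154, 187, 89, 203]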

How to make nested boxplots with groupby

I have a dataset of more than 50 features that correspond to a specific movement during leg rehabilitation. I am comparing the group that used our rehabilitation device with a group recovering without it. Each group includes patients with 3 diagnoses, and I want to compare boxplots of before (red) and after (blue) for each diagnosis.
Control group data:
dataKONTR
Row DG DKK ... LOS_DCL_LB LOS_DCL_L LOS_DCL_LF
0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
5 Chavender2 distorze 1.0 ... 54 74 80
6 Bendis1 distorze 1.0 ... 32 57 97
7 Bendis2 distorze 1.0 ... 55 69 79
8 Shawn1 AS 1.0 ... 15 74 75
9 Shawn2 AS 1.0 ... 67 86 79
10 Cichy1 LCA 0.0 ... 45 83 80
This is the snippet I was using and the output I am getting.
temp = "c:/Users/novos/ŠKOLA/Statistika/data Mariana/%s.xlsx"
dataKU = pd.read_excel(temp % "VestlabEXP_KU", engine = "openpyxl", skipfooter= 85) # patients using our rehabilitation tool
dataKONTR = pd.read_excel(temp % "VestlabEXP_kontr", engine = "openpyxl", skipfooter=51) # control group
dataKU_diag = dataKU.dropna()
dataKONTR_diag = dataKONTR.dropna()
dataKUBefore = dataKU_diag[dataKU_diag['Row'].str.contains("1")] # Patients data ending with 1 are before rehab
dataKUAfter = dataKU_diag[dataKU_diag['Row'].str.contains("2")] # Patients data ending with 2 are before rehab
dataKONTRBefore = dataKONTR_diagL[dataKONTR_diag['Row'].str.contains("1")]
dataKONTRAfter = dataKONTR_diagL[dataKONTR_diag['Row'].str.contains("2")]
b1 = dataKUBefore.boxplot(column=list(dataKUBefore.filter(regex='LOS_RT')), by="DG", rot = 45, color=dict(boxes='r', whiskers='r', medians='r', caps='r'),layout=(2,4),return_type='axes')
plt.ylim(0.5, 1.5)
plt.suptitle("")
plt.suptitle("Before, KU")
b2 = dataKUAfter.boxplot(column=list(dataKUAfter.filter(regex='LOS_RT')), by="DG", rot = 45, color=dict(boxes='b', whiskers='b', medians='b', caps='b'),layout=(2,4),return_type='axes')
# dataKUPredP
plt.suptitle("")
plt.suptitle("After, KU")
plt.ylim(0.5, 1.5)
plt.show()
The output is two separate figures (red boxplots for all the "before rehab" data, blue boxplots for all the "after rehab" data).
Can you help me put the red and blue boxplots next to each other for each diagnosis?
Thank you for any help.
You can try to concatenate the data into a single dataframe:
dataKUPlot = pd.concat({
'Before': dataKUBefore,
'After': dataKUAfter,
}, names=['When'])
You should see an additional index level named When in the output.
Using the example data you posted it looks like this:
>>> pd.concat({'Before': df, 'After': df}, names=['When'])
Row DG DKK ... LOS_DCL_LB LOS_DCL_L LOS_DCL_LF
When
Before 0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
After 0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
Then you can plot all of the boxes with a single command, and thus on the same plots, by modifying the by grouper:
dataKUPlot.boxplot(column=dataKUPlot.filter(regex='LOS_RT').columns.to_list(), by=['DG', 'When'], rot=45, layout=(2, 4), return_type='axes')
I believe that's the only "simple" way, and I'm afraid the result looks a little cluttered.
Any other way implies manual plotting with matplotlib, and thus better control. For example, iterate over all desired columns:
fig, axes = plt.subplots(nrows=2, ncols=3, sharey=True, sharex=True)
pos = 1 + np.arange(max(dataKUBefore['DG'].nunique(), dataKUAfter['DG'].nunique()))
redboxes = {f'{x}props': dict(color='r') for x in ['box', 'whisker', 'median', 'cap']}
blueboxes = {f'{x}props': dict(color='b') for x in ['box', 'whisker', 'median', 'cap']}
ax_it = axes.flat
for colname, ax in zip(dataKUBefore.filter(regex='LOS_RT').columns, ax_it):
    # Making a dataframe here to ensure the same ordering
    show = pd.DataFrame({
        'before': dataKUBefore[colname].groupby(dataKUBefore['DG']).agg(list),
        'after': dataKUAfter[colname].groupby(dataKUAfter['DG']).agg(list),
    })
    ax.boxplot(show['before'].values, positions=pos - .15, **redboxes)
    ax.boxplot(show['after'].values, positions=pos + .15, **blueboxes)
    ax.set_xticks(pos)
    ax.set_xticklabels(show.index, rotation=45)
    ax.set_title(colname)
    ax.grid(axis='both')
# Hide remaining axes:
for ax in ax_it:
    ax.axis('off')
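Since the manual version issues two boxplot calls per axes, matplotlib will not build a legend automatically. If you want one, a small optional addition (an assumption, not part of the original answer) uses proxy artists:
import matplotlib.patches as mpatches
fig.legend(handles=[mpatches.Patch(color='r', label='Before'),
                    mpatches.Patch(color='b', label='After')])
plt.show()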
You could add a new column to separate 'Before' and 'After'. Seaborn's boxplots can use that new column as hue. sns.catplot(kind='box', ...) creates a grid of boxplots:
import seaborn as sns
import pandas as pd
import numpy as np
names = ['Adams', 'Arthur', 'Buchanan', 'Buren', 'Bush', 'Carter', 'Cleveland', 'Clinton', 'Coolidge', 'Eisenhower', 'Fillmore', 'Ford', 'Garfield', 'Grant', 'Harding', 'Harrison', 'Hayes', 'Hoover', 'Jackson', 'Jefferson', 'Johnson', 'Kennedy', 'Lincoln', 'Madison', 'McKinley', 'Monroe', 'Nixon', 'Obama', 'Pierce', 'Polk', 'Reagan', 'Roosevelt', 'Taft', 'Taylor', 'Truman', 'Trump', 'Tyler', 'Washington', 'Wilson']
rows = np.array([(name + '1', name + '2') for name in names]).flatten()
dataKONTR = pd.DataFrame({'Row': rows,
'DG': np.random.choice(['AS', 'Distorze', 'LCA'], len(rows)),
'LOS_RT_A': np.random.randint(15, 100, len(rows)),
'LOS_RT_B': np.random.randint(15, 100, len(rows)),
'LOS_RT_C': np.random.randint(15, 100, len(rows)),
'LOS_RT_D': np.random.randint(15, 100, len(rows)),
'LOS_RT_E': np.random.randint(15, 100, len(rows)),
'LOS_RT_F': np.random.randint(15, 100, len(rows))})
dataKONTR = dataKONTR.dropna()
dataKONTR['When'] = ['Before' if r[-1] == '1' else 'After' for r in dataKONTR['Row']]
cols = [c for c in dataKONTR.columns if 'LOS_RT' in c]
df_long = dataKONTR.melt(value_vars=cols, var_name='Which', value_name='Value', id_vars=['When', 'DG'])
g = sns.catplot(kind='box', data=df_long, x='DG', col='Which', col_wrap=3, y='Value', hue='When')
g.set_axis_labels('', '') # remove the x and y labels

Manipulating data in Pandas

This is my dataframe:
Number Name Points Math Points BG Wish
0 1 Огнян 50 65 MT
1 2 Момчил 61 27 MT
2 3 Радослав 68 68 MT
3 4 Павел 28 16 MT
4 10 Виктор 67 76 MT
5 11 Петър 26 68 BT
6 12 Антон 64 58 BT
7 13 Васил 29 42 BT
8 20 Виктория 62 67 BT
That's my code:
df = pd.read_csv('Input_data.csv', encoding='utf-8-sig')
df['Total'] = df.iloc[:, 2:].sum(axis=1)
df = df.sort_values(['Total', 'Name'], ascending=[0, 1])
df.to_excel("BT RANKING_5.xlsx", encoding='utf-8-sig', index=False)
I want to double the score in the Points Math column for each person who has Wish == MT.
I tried:
df.loc[df['Wish'] == 'MT', 'Points Math'] = df.loc[df['Points Math'] * 2]
but this didn't work. I also tried an if statement and a for loop, but they didn't work either.
What's the appropriate syntax for this logic?
Use this (it needs import numpy as np):
df['Points_Math'] = np.where(df['Wish'] == 'MT', df['Points Math'] * 2, df['Points Math'])
This creates a new column 'Points_Math' with the desired results; to overwrite the original column, assign to 'Points Math' instead.
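Alternatively, your original attempt was close; a minimal sketch of the usual in-place idiom, which filters only the left-hand side:
# Double Points Math only where Wish is 'MT', in place
df.loc[df['Wish'] == 'MT', 'Points Math'] *= 2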

Append two multi indexed data frames in pandas

I am trying this simple setup of variables:
In [94]: cc
Out[94]:
d0 d1
class sample
5 66 0.128320 0.970817
66 0.160488 0.969077
77 0.919263 0.008597
6 77 0.811914 0.123960
88 0.639887 0.262943
88 0.312303 0.660786
In [101]: bb
Out[101]:
d0 d1
class sample
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
In [102]: aa
Out[102]:
d0 d1
class sample
0 00 0.190409 0.789750
11 0.588001 0.250663
1 22 0.888343 0.428968
33 0.185525 0.450020
I can perform the following command
In [103]: aa.append(bb)
Out[103]:
d0 d1
class sample
0 00 0.190409 0.789750
11 0.588001 0.250663
1 22 0.888343 0.428968
33 0.185525 0.450020
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
Why can't I perform the following command in the same manner?
aa.append(cc)
[I get the following exception]
ValueError: all arrays must be same length
UPDATE:
It works fine if I do not provide column names, but if, for example, I have 4 columns with names like ['d0','d0','d1','d1'] for the 4x4 and 8x4 arrays, it does not work anymore.
Here is code to reproduce the error:
import pandas
import numpy as np

y1 = [['0','0','1','1'], ['00','11','22','33']]
y2 = [['2','2','3','3','4','4'], ['44','55','66','77','88','99']]
x1 = np.random.rand(4, 4)
x2 = np.random.rand(6, 4)
cols = ['d1']*2 + ['d2']*2
names = ['class', 'idx']
aa = pandas.DataFrame(x1, index=y1, columns=cols)
aa.index.names = names
print(aa)
bb = pandas.DataFrame(x2, index=y2, columns=cols)
bb.index.names = names
print(bb)
aa.append(bb)
What should I do to get this running?
Thanks
Use pd.concat instead:
concatenated = pd.concat([bb, cc])
concatenated
              d0        d1
class sample
2 22 0.730631 0.656266
33 0.871292 0.942768
3 44 0.081831 0.714360
55 0.600095 0.770108
5 66 0.128320 0.970817
66 0.160488 0.969077
77 0.919263 0.008597
6 77 0.811914 0.123960
88 0.639887 0.262943
88 0.312303 0.660786
Answer to your edited question:
The problem lies in your column names having duplicates.
cols = ['d1']*2 + ['d2']*2 # <-- this creates ['d1', 'd1', 'd2', 'd2']
so your dataframes end up with what are considered duplicated columns, i.e.
In [62]: aa
Out[62]:
d1 d1 d2 d2
class idx
0 00 0.805445 0.442059 0.296162 0.041271
11 0.384600 0.723297 0.997918 0.006661
1 22 0.685997 0.794470 0.541922 0.326008
33 0.117422 0.667745 0.662031 0.634429
and
In [64]: bb
Out[64]:
d1 d1 d2 d2
class idx
2 44 0.465559 0.496039 0.044766 0.649145
55 0.560626 0.684286 0.929473 0.607542
3 66 0.526605 0.836667 0.608098 0.159471
77 0.216756 0.749625 0.096782 0.547273
4 88 0.619338 0.032676 0.218736 0.684045
99 0.987934 0.349520 0.346036 0.926373
pandas' append() (or the concat() method) can only append correctly if you have unique column names.
Try this and you will not get any error:
cols2 = ['d1', 'd2', 'd3', 'd4']
cc = pandas.DataFrame(x1, index=y1, columns=cols2)
cc.index.names = names
dd = pandas.DataFrame(x2, index=y2, columns=cols2)
dd.index.names = names
Now...
In [70]: cc.append(dd)
Out[70]:
d1 d2 d3 d4
class idx
0 00 0.805445 0.442059 0.296162 0.041271
11 0.384600 0.723297 0.997918 0.006661
1 22 0.685997 0.794470 0.541922 0.326008
33 0.117422 0.667745 0.662031 0.634429
2 44 0.465559 0.496039 0.044766 0.649145
55 0.560626 0.684286 0.929473 0.607542
3 66 0.526605 0.836667 0.608098 0.159471
77 0.216756 0.749625 0.096782 0.547273
4 88 0.619338 0.032676 0.218736 0.684045
99 0.987934 0.349520 0.346036 0.926373
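Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same result is written with pd.concat:
combined = pd.concat([cc, dd])  # equivalent to the cc.append(dd) above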
