I've got two dataframes df1 and df2 that look like this:
#df1
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
#df2
counts freqs
categories
Straight Engine 18 0.5625
V engine 14 0.4375
Could anyone explain why pd.concat([df1, df2], axis = 0) will not give me this:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.5625
V engine 14 0.4375
Here is what I've tried:
1 - Using pd.concat()
I'm suspecting that the way I've built these dataframes may be the source of the issue.
And here is how I've ended up with these particular dataframes:
# imports
import pandas as pd
from pydataset import data # pip install pydataset to get datasets from R
# load data
df_mtcars = data('mtcars')
# recode dummy variables to more descriptive values
# (.loc avoids the chained-assignment warning/failure):
df_mtcars.loc[df_mtcars['am'] == 0, 'am'] = 'manual'
df_mtcars.loc[df_mtcars['am'] == 1, 'am'] = 'automatic'
df_mtcars.loc[df_mtcars['vs'] == 0, 'vs'] = 'Straight Engine'
df_mtcars.loc[df_mtcars['vs'] == 1, 'vs'] = 'V engine'
# describe categorical variables
df1 = pd.Categorical(df_mtcars['am']).describe()
df2 = pd.Categorical(df_mtcars['vs']).describe()
I understand that the categorical index ('categories') is what is causing the problems here, since df_con = pd.concat([df1, df2], axis = 0) raises this error:
TypeError: categories must match existing categories when appending
But it confuses me that this is ok:
# code
df_con = pd.concat([df1, df2], axis = 1)
# output:
counts freqs counts freqs
categories
automatic 13.0 0.40625 NaN NaN
manual 19.0 0.59375 NaN NaN
Straight Engine NaN NaN 18.0 0.5625
V engine NaN NaN 14.0 0.4375
2 - Using df.append() raises the same error as pd.concat()
3 - Using pd.merge() sort of works, but I'm losing the index:
# Code
df_merge = pd.merge(df1, df2, how = 'outer')
# Output
counts freqs
0 13 0.40625
1 19 0.59375
2 18 0.56250
3 14 0.43750
4 - Using pd.concat() on transposed dataframes
Since pd.concat() worked with axis = 1, I thought I would get there using transposed dataframes.
# df1.T
categories automatic manual
counts 13.00000 19.00000
freqs 0.40625 0.59375
# df2.T
categories Straight Engine V engine
counts 18.0000 14.0000
freqs 0.5625 0.4375
But still no success:
# code
df_con = pd.concat([df1.T, df2.T], axis = 1)
>>> TypeError: categories must match existing categories when appending
By the way, what I was hoping for here is this:
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 18.0000 14.0000
freqs 0.40625 0.59375 0.5625 0.4375
Still kind of works with axis = 0 though:
# code
df_con = pd.concat([df1.T, df2.T], axis = 0)
# Output
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 NaN NaN
freqs 0.40625 0.59375 NaN NaN
counts NaN NaN 18.0000 14.0000
freqs NaN NaN 0.5625 0.4375
But that is still far from what I'm trying to accomplish.
Now I'm thinking that it would be possible to strip the 'category' info from df1 and df2, but I haven't been able to find out how to do that yet.
Thank you for any other suggestions!
Try this:
pd.concat([df1.reset_index(), df2.reset_index()], ignore_index=True)
Output:
categories counts freqs
0 automatic 13 0.40625
1 manual 19 0.59375
2 Straight Engine 18 0.56250
3 V engine 14 0.43750
To get 'categories' back as the index, chain a set_index call:
pd.concat([df1.reset_index(), df2.reset_index()], ignore_index=True).set_index('categories')
Output:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.56250
V engine 14 0.43750
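You can also do what you suspected and strip the 'category' info directly: casting the CategoricalIndex to a plain object index sidesteps the category check entirely. A minimal sketch (assuming df1 and df2 as built above; untested across pandas versions):
# Cast the categorical indexes to plain object indexes, then stack rows.
df1.index = df1.index.astype(object)
df2.index = df2.index.astype(object)
df_con = pd.concat([df1, df2], axis=0)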
For more details, see the pandas concat documentation.
Related
I'm new to using pandas and running into an issue when trying to extract values from a pivot table.
pivot = sp.pivot_table(
    index="story_points",
    values=["duration"],
    aggfunc={"min", "mean", "median", "max", "count"},
)
df = pivot.iloc[:, "duration"]
zero = df.iloc[0, "median"]
one = df.iloc[1.0, "median"]
two = df.iloc[2.0, "median"]
three = df.iloc[3.0, "median"]
duration
count max mean median min
story_points_raw
1.0 227 185.125382 7.241453 2.190405 0.000139
2.0 144 68.849965 11.048451 6.106875 0.000058
3.0 45 126.131181 19.560792 13.558588 0.983241
4.0 16 49.043889 11.948981 6.932321 1.878125
5.0 5 128.653403 44.487847 15.489850 2.935891
8.0 8 79.792199 31.325800 17.299543 7.774792
I get an error on the line df = pivot.iloc[:, "duration"]. Is there possibly a better way to create the table to allow easier direct access to the values?
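iloc is strictly positional, so string labels like "duration" or "median" will always raise; label-based access goes through loc (or plain []), with the statistic names in the second level of the MultiIndex columns. A minimal sketch, assuming the pivot table shown above:
# Select the "duration" sub-frame, then index rows/columns by label.
stats = pivot["duration"]          # columns: count, max, mean, median, min
one = stats.loc[1.0, "median"]
two = stats.loc[2.0, "median"]
three = stats.loc[3.0, "median"]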
I am working on the Kaggle insurance pricing competition (https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq). The data looks like this:
IDpol ClaimNb Exposure Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Density Region
0 1.0 1 0.10 D 5 0 55 50 B12 Regular 1217 R82
1 3.0 1 0.77 D 5 0 55 50 B12 Regular 1217 R82
2 5.0 1 0.75 B 6 2 52 50 B12 Diesel 54 R22
3 10.0 1 0.09 B 7 0 46 50 B12 Diesel 76 R72
4 11.0 1 0.84 B 7 0 46 50 B12 Diesel 76 R72
For preprocessing etc. I am using a Pipeline including a ColumnTransformer:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, KBinsDiscretizer,
                                   OneHotEncoder, StandardScaler)

pt_columns = ['BonusMalus', 'RBVAge']
log_columns = ['Density']
kbins_columns = ['VehAge', 'VehPower', 'DrivAge']
cat_columns = ['Area', 'Region', 'VehBrand', 'VehGas']

X_train['RBVAge'] = 0
X_train.loc[(X_train['VehGas'] == 'Regular') &
            (X_train['VehBrand'] == 'B12') &
            (X_train['VehAge'] == 0), 'RBVAge'] = 1

ct = ColumnTransformer([('pt', 'passthrough', pt_columns),
                        ('log', FunctionTransformer(np.log1p, validate=False), log_columns),
                        ('kbins', KBinsDiscretizer(), kbins_columns),
                        ('ohe', OneHotEncoder(), cat_columns)])

pipe_poisson_reg = Pipeline([('cf_trans', ct),
                             ('ssc', StandardScaler(with_mean=False)),
                             ('poisson_regressor', PoissonRegressor())])
Once the model is fitted I'd like to display the feature importances and names of the features like this:
name       feature_importances_
Area_A     0.25
Area_B     0.10
VehAge     0.30
...        ...
The problem I am facing is getting the feature names while using a ColumnTransformer. In particular, getting the feature names out of the KBinsDiscretizer, which uses one-hot encoding as well, does not work easily. What I have tried so far is creating a numpy array with the feature names manually, but as I said, I cannot manage to get the feature names from the KBinsDiscretizer, and this solution does not seem very elegant.
columnNames = np.append(pt_columns, pipe_poisson_reg['cf_trans'].transformers_[3][1].get_feature_names(cat_columns))
Is there a simple (maybe even built-in) way to create a DataFrame including both the feature names and feature importances?
Well, and since we are already here (this might be a bit off-topic):
Is there a simple way to create a custom ColumnTransformer which adds the new column 'RBVAge', which I currently add manually?
This will be a whole lot easier in the upcoming v1.0 (or you can get the current GitHub code), which includes PR18444, adding a get_feature_names_out to KBinsDiscretizer as well as Pipeline (and others).
If you are stuck on your version of sklearn, you can have a look at the source code that does that to help you patch together the same; it appears to not do very much itself, just calling to the internal _encoder object to do the work, and OneHotEncoder has had a get_feature_names for a while (soon to be replaced by get_feature_names_out).
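A minimal sketch of the v1.0 route, assuming the pipe_poisson_reg pipeline from the question has been fitted and scikit-learn >= 1.0 is installed (note that PoissonRegressor exposes its coefficients as coef_; a linear model has no feature_importances_):
import pandas as pd

# Every step except the final estimator can report its output feature names.
feature_names = pipe_poisson_reg[:-1].get_feature_names_out()
coefs = pipe_poisson_reg.named_steps['poisson_regressor'].coef_
df_features = pd.DataFrame({'name': feature_names, 'coefficient': coefs})
For the side question, the usual way to fold the 'RBVAge' step into the pipeline is a FunctionTransformer wrapping a small function that copies X and adds the column.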
I want to add an extra column with the maximum relative difference [-] between the row values and the mean of those rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = max(abs(highest energy use - mean energy use), abs(lowest energy use - mean energy use)) / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple, I just don't know how to get the maximum values
from the tuples and divide them then by the mean in the proper Phytonic way
I feel like the whole thing can be:
s = df.filter(like='y')
s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
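To attach it as the new column from the question, assign the same expression back (a small follow-up, same assumptions as above):
s = df.filter(like='y')
df['max_rel_dif'] = s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)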
I am trying to calculate the means of all previous rows for each column of the DataFrame and add the calculated mean column to the DataFrame.
I am using a set of NBA games data that contains 20+ features (columns) for which I am trying to calculate the means. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean())
df['OpponentPoints_mean'] = df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean())
Again, this code calculates the means and adds the columns to the DataFrame one at a time. Is there a way to compute the column means and add them to the DataFrame without doing it one at a time? A for loop? An example of what I am looking for is below.
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean ...("..." = mean columns of rest of the feature columns)
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) side table to get all the means on the side (I didn't find a cumulative mean function, so went with cumsum + count)
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
... df_side["{}_mean".format(el)]=df_side[el]/df_side.col_temp
>>> df_side=df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
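For what it's worth, pandas does have a running-mean building block after all: expanding().mean(). A sketch that should reproduce the side table in one call (like the cumsum/count approach above, it includes the current row; chain .shift() first to exclude it, as the question's own code does):
# Running mean per column, joined back with a suffix.
df_final = df.join(df.expanding().mean(), rsuffix="_mean")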
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
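Putting the pieces together for the question's actual setup (per-Team means of all previous rows, for every numeric column at once); a sketch, assuming 'Team' is a regular column:
import numpy as np

# Per-group running mean of the preceding rows, for all numeric columns.
num_cols = df.select_dtypes(include=np.number).columns
means = df.groupby('Team')[num_cols].transform(lambda s: s.shift().expanding().mean())
df_final = df.join(means.add_suffix('_mean'))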
I have a pandas dataframe containing 16 columns, of which 14 represent variables on which I perform a looped ANOVA test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what I'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('features_total.csv')

for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table printed for each variable the ANOVA is performed on. Basically, what I want is to print one single table with the summarized results, something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem, because this method always outputs the 'Label' nomenclature before the actual values, not the variable name in question (as I've shown above, I would like to have the variable name above each 'Residual'). Is this even possible with the statsmodels approach?
I'm fairly new to Python, so excuse me if this has nothing to do with statsmodels - in that case, please do enlighten me on what I should be trying.
You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:
keys = []
tables = []

for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    keys.append(variable)
    tables.append(anova_table)

df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding one, but considering you are performing numerous statistical tests, it makes sense to account for the probability that at least one of them will produce a false positive.
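A minimal sketch of that correction, assuming the df_anova frame built above (its index has the variable names on the first level and 'Label'/'Residual' on the second, so the per-variable p-values sit in the 'Label' rows):
from statsmodels.stats.multitest import multipletests

# Benjamini-Hochberg FDR adjustment of the per-variable p-values.
pvals = df_anova.xs('Label', level=1)['PR(>F)']
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')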