When I summarize a dataframe and join the summary back onto the original dataframe, I have trouble working with the column names.
This is the original dataframe:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
Now I calculate some statistics and merge the summary back in:
group_summary = df.groupby('col1', as_index=False).agg({'col2': ['mean', 'count']})
df = pd.merge(df, group_summary, on='col1')
The dataframe has some strange column names now:
df
Out:
col1 col2 (col2, mean) (col2, count)
0 a 0 0.75 4
1 a 4 0.75 4
2 a -5 0.75 4
3 a 4 0.75 4
4 b 3 3.00 2
5 b 3 3.00 2
I know I can use the columns positionally, like df.iloc[:, 2], but I would also like to use them by name, like df['(col2, mean)']; however, this raises a KeyError.
Source: This grew out of this previous question.
It's because your GroupBy.agg operation returns a DataFrame with a MultiIndex in the columns, and when a single-level-header DataFrame is merged with a MultiIndexed one, the MultiIndex is flattened into tuples. The key of each odd-looking column is therefore the tuple ('col2', 'mean'), not the string '(col2, mean)', which is why the string lookup fails.
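You can verify this by inspecting the columns and indexing with the tuple key directly (a quick check of the state described above, not a fix; the exact repr may vary by pandas version):
df.columns
# Index(['col1', 'col2', ('col2', 'mean'), ('col2', 'count')], dtype='object')
df[('col2', 'mean')]   # works, because the column name is the tuple itself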
Fix your groupby code as follows:
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
The merge now produces flat column names:
df.merge(group_summary, on='col1')
col1 col2 mean count
0 a 0 0.75 4
1 a 4 0.75 4
2 a -5 0.75 4
3 a 4 0.75 4
4 b 3 3.00 2
5 b 3 3.00 2
Better still, use transform to broadcast the aggregated values back to the shape of the input (note that the original row order is preserved):
g = df.groupby('col1', as_index=False)['col2']
df.assign(mean=g.transform('mean'), count=g.transform('count'))
col1 col2 mean count
0 a 0 0.75 4
1 a 4 0.75 4
2 b 3 3.00 2
3 a -5 0.75 4
4 b 3 3.00 2
5 a 4 0.75 4
Pro-tip: you can use describe to compute several useful statistics in a single function call.
df.groupby('col1').describe()
col2
count mean std min 25% 50% 75% max
col1
a 4.0 0.75 4.272002 -5.0 -1.25 2.0 4.0 4.0
b 2.0 3.00 0.000000 3.0 3.00 3.0 3.0 3.0
Also see Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
So I have a data frame called df. It looks like this:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
I want to sum up the columns and divide the sum of the columns by the sum of the rows.
So for example:
row 1, column 0: (1+4+7)/(1+2+3)
row 2, column 0: (1+4+7)/(4+5+6)
so on and so forth.
so that my final result is like this:
     0      1     2
0  2.0  2.500  3.00
1  0.8  1.000  1.20
2  0.5  0.625  0.75
How do I do it in python using pandas and dataframe?
You can also do it this way:
import numpy as np

a = df.to_numpy()
# np.divide is a ufunc (universal function) in numpy, and all ufuncs
# support outer functionality. outer(col_sums, row_sums)[i, j] gives
# col_sums[i] / row_sums[j], so transpose to put col_sums[j] / row_sums[i]
# in cell (i, j), as the expected output requires.
b = np.divide.outer(a.sum(0), a.sum(1)).T
out = pd.DataFrame(b, index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75
You can use the underlying numpy array:
a = df.to_numpy()
out = pd.DataFrame(a.sum(0) / a.sum(1)[:, None],
                   index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75
Given two grouped dataframes (df_train & df_test), how do I fill the missing values of df_test using values derived from df_train? For this example, I used median.
df_train=pd.DataFrame({'col_1':['A','B','A','A','C','B','B','A','A','B','C'], 'col_2':[float('NaN'),2,1,3,1,float('NaN'),2,3,2,float('NaN'),1]})
df_test=pd.DataFrame({'col_1':['A','A','A','A','B','C','C','B','B','B','C'], 'col_2':[3,float('NaN'),1,2,2,float('NaN'),float('NaN'),3,2,float('NaN'),float('NaN')]})
# These are the median values derived from df_train which I would like to impute into df_test based on the column col_1.
values_used_in_df_train = df_train.groupby(by='col_1')['col_2'].median()
values_used_in_df_train
col_1
A 2.5
B 2.0
C 1.0
Name: col_2, dtype: float64
# For df_train, I can simply do the following:
df_train.groupby('col_1')['col_2'].transform(lambda x : x.fillna(x.median()))
I tried df_test.groupby('col_1')['col_2'].transform(lambda x : x.fillna(values_used_in_df_train)) which does not work.
So I want:
df_test
col_1 col_2
0 A 3.0
1 A NaN
2 A 1.0
3 A 2.0
4 B 2.0
5 C NaN
6 C NaN
7 B 3.0
8 B 2.0
9 B NaN
10 C NaN
to become
df_test
col_1 col_2
0 A 3.0
1 A 2.5
2 A 1.0
3 A 2.0
4 B 2.0
5 C 1.0
6 C 1.0
7 B 3.0
8 B 2.0
9 B 2.0
10 C 1.0
Below are just my thoughts; feel free to ignore them if they are irrelevant or confusing.
I guess I could use an if-else approach to match rows one by one against the index of values_used_in_df_train, but I am trying to achieve this within groupby.
Try merging df_test and values_used_in_df_train:
df_test = df_test.merge(values_used_in_df_train.reset_index(),
                        on='col_1', how='left', suffixes=('', '_y'))
Finally fill missing values by using fillna():
df_test['col_2'] = df_test['col_2'].fillna(df_test.pop('col_2_y'))
OR
Another way (if row order is not important):
concatenate df_test and values_used_in_df_train and then drop NaN's:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df_test = (pd.concat([df_test, values_used_in_df_train.reset_index()])
             .dropna(subset=['col_2'])
             .reset_index(drop=True))
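A more direct variant, close to the row-by-row matching you describe (my own sketch, not part of the original answers): Series.map looks up each col_1 label in the index of values_used_in_df_train, and fillna aligns the result row by row, so no merge is needed.
df_test['col_2'] = df_test['col_2'].fillna(
    df_test['col_1'].map(values_used_in_df_train))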
I am working on a dataset which is in the following dataframe.
#print(old_df)
col1 col2 col3
0 1 10 1.5
1 1 11 2.5
2 1 12 5,6
3 2 10 7.8
4 2 24 2.1
5 3 10 3.2
6 4 10 22.1
7 4 11 1.3
8 4 89 0.5
9 4 91 3.3
I am trying to generate another data frame that uses selected col1 values as the index and selected col2 values as the columns, and holds the respective col3 values.
Eg:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
The new data frame should look like this:
#print(selected_df)
    10   11   24
1  1.5  2.5  NaN
2  7.8  NaN  2.1
I have tried the following method:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]
But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?
First filter the rows by membership with Series.isin on both columns, chained by & for bitwise AND, and then use DataFrame.pivot:
df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 24
col1
1 1.5 2.5 NaN
2 7.8 NaN 2.1
If duplicated pairs of col1 and col2 are possible after filtering, use DataFrame.pivot_table instead:
df = df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean')
EDIT:
If you use | for bitwise OR, you get a different output:
df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 12 24
col1
1 1.5 2.5 5,6 NaN
2 7.8 NaN NaN 2.1
3 3.2 NaN NaN NaN
4 22.1 1.3 NaN NaN
I have NaN values in my dataframe and I want to replace them with an empty string.
What I've tried so far, which isn't working:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';', encoding='utf-8')
df_conbid_N_1['Excep_Test'] = df_conbid_N_1['Excep_Test'].replace("NaN","")
Use fillna (docs). Your replace call does not work because the missing entries are actual NaN values, not the literal string "NaN".
An example:
import numpy as np

df = pd.DataFrame({'no': [1, 2, 3],
                   'Col1': ['State', 'City', 'Town'],
                   'Col2': ['abc', np.nan, 'defg'],
                   'Col3': ['Madhya Pradesh', 'VBI', 'KJI']})
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City NaN VBI
2 3 Town defg KJI
df['Col2'] = df['Col2'].fillna('')
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City VBI
2 3 Town defg KJI
Simple! You can do it this way:
df_conbid_N_1 = pd.read_csv("test-2019.csv", dtype=str, sep=';', encoding='utf-8').fillna("")
We have pandas' fillna to fill missing values.
Let's go through some use cases with a sample dataframe:
df = pd.DataFrame({'col1':['John', np.nan, 'Anne'], 'col2':[np.nan, 3, 4]})
col1 col2
0 John NaN
1 NaN 3.0
2 Anne 4.0
As mentioned in the docs, fillna accepts the following as fill values:
value: scalar, dict, Series, or DataFrame
So we can replace with a constant value, such as an empty string with:
df.fillna('')
col1 col2
0 John
1 3
2 Anne 4
You can also replace with a dictionary mapping column_name:replace_value:
df.fillna({'col1':'Alex', 'col2':2})
col1 col2
0 John 2.0
1 Alex 3.0
2 Anne 4.0
Or you can also replace with another pd.Series or pd.DataFrame:
df_other = pd.DataFrame({'col1':['John', 'Franc', 'Anne'], 'col2':[5, 3, 4]})
df.fillna(df_other)
col1 col2
0 John 5.0
1 Franc 3.0
2 Anne 4.0
This is very useful, since it allows you to fill missing values in the dataframe's columns using a statistic extracted from those same columns, such as the mean or mode. Say we have:
df = pd.DataFrame(np.random.choice(np.r_[np.nan, np.arange(3)], (3,5)))
print(df)
0 1 2 3 4
0 NaN NaN 0.0 1.0 2.0
1 NaN 2.0 NaN 2.0 1.0
2 1.0 1.0 2.0 NaN NaN
Then we can easily do:
df.fillna(df.mean())
0 1 2 3 4
0 1.0 1.5 0.0 1.0 2.0
1 1.0 2.0 1.0 2.0 1.0
2 1.0 1.0 2.0 1.5 1.5
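Filling with the per-column mode works along the same lines (a small sketch of my own, not from the original answer); df.mode() returns a DataFrame because there can be ties, so take its first row:
df.fillna(df.mode().iloc[0])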
I have a large data frame composed of 450 columns with 550 000 rows.
Among the columns I have:
73 float columns
30 date columns
the remaining columns are object
I would like to produce a description of my variables, not just the usual describe output, but with additional statistics in the same matrix. The final result should be a description matrix covering all 450 variables, with details such as:
- dtype
- count
- null value count
- % of null values
- max
- min
- 50%
- 75%
- 25%
- ......
For now, I just have a basic call that describes my data like this:
Dataframe.describe(include='all')
Do you have a function or method to produce this more extensive description?
Thanks.
You need to compute custom statistics and add them as rows to the final describe DataFrame.
Notice:
The first row of the final df is count, which counts non-NaN values.
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7 1 5 a
1 b NaN 8 3 3 a
2 c NaN 9 5 6 a
3 d 5.0 4 7 9 b
4 e 5.0 2 1 2 b
5 f 4.0 3 0 4 b
df1 = df.describe(include='all')
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
df1.loc['% count'] = df.isnull().mean()  # fraction of missing values per column
print (df1)
A B C D E F
count 6 4 6 6 6 6
unique 6 NaN NaN NaN NaN 2
top e NaN NaN NaN NaN b
freq 1 NaN NaN NaN NaN 3
mean NaN 4.5 5.5 2.83333 4.83333 NaN
std NaN 0.57735 2.88097 2.71416 2.48328 NaN
min NaN 4 2 0 2 NaN
25% NaN 4 3.25 1 3.25 NaN
50% NaN 4.5 5.5 2 4.5 NaN
75% NaN 5 7.75 4.5 5.75 NaN
max NaN 5 9 7 9 NaN
dtype object float64 int64 int64 int64 object
size 6 6 6 6 6 6
% count 0 0.333333 0 0 0 0
In pandas there is no single alternative function to describe(), but it clearly isn't displaying all the values that you need. You can use the various parameters of describe() accordingly.
By default, describe() on a DataFrame only summarizes numeric columns. If you think a variable is numeric but it doesn't show up in describe(), change its type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
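For instance, a string column could be converted like this (the column name and mapping here are made up for illustration):
# hypothetical example: encode string categories as numbers with a dict + map
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['size_num'] = df['size'].map(size_map)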
describe() on a non-numeric Series will give you some statistics (like count, unique and the most frequently-occurring value).
To call describe() on just the object (string) columns, use describe(include=['O']).
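For example (a minimal sketch with made-up data):
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'x': [1, 2, 3]})
df.describe(include=['O'])   # count, unique, top, freq for the object column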