I have a dataframe on which I would like to perform a division based on the entries. To illustrate the problem, say I have the following dataframe:
import pandas as pd
df = pd.DataFrame([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]],
                  columns=['A', 'B', 'C', 'D'], index=['x', 'y', 'z'])
So I get the following as df:
A B C D
x 1 2 3 4
y 5 6 7 8
z 9 10 11 12
What I would like to do is see by what factor each value in column D changed going from x to y, and then from y to z.
The dataframe I'd get would be:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
How do I do this in a systematic way?
You can use div with the column shifted:
In [21]:
df['D'] = df['D'].div(df['D'].shift())
df
Out[21]:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
Or more succinctly:
In [23]:
df['D'] /= df['D'].shift()
df
Out[23]:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
You can use pct_change() (adding 1 converts the fractional change back into the ratio):
In [57]: df.D.pct_change() + 1
Out[57]:
x NaN
y 2.0
z 1.5
Name: D, dtype: float64
or, as a DataFrame built on the fly with assign:
In [58]: df.assign(D=df.D.pct_change() + 1)
Out[58]:
A B C D
x 1.0 2.0 3.0 NaN
y 5.0 6.0 7.0 2.0
z 9.0 10.0 11.0 1.5
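If you wanted the same row-over-row ratio for every column at once, the shifted division extends to the whole frame; a minimal sketch:

# divide each row by the previous row, for every column
df.div(df.shift())  # row x becomes all NaN; e.g. column A yields NaN, 5.0, 1.8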
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean rate of every 'id'.
2) Give the number of ids whose mean is >= 3.
3) Give back all rows of the dataframe where the id's mean is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>>> df  # dataframe where mean(id) >= 3
  name  id  rate
0    A   1   3.5
1    D   1   4.5
2    M   1   2.0
3    T   2   5.0
4    B   2   4.0
9    L   5   5.0
Use GroupBy.transform to broadcast the group means back to the same size as the original DataFrame, which makes it possible to filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >= 3]
print(df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print(df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
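The question also asks for the number of ids whose mean is >= 3; a minimal sketch covering that part (run on the original df, before filtering):

means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())
# Number of ids (length) where mean >= 3: 3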
I have a dataframe:
A B C V
1 4 7 T
2 6 8 T
3 9 9 F
and I want to create a new column D, summing the rows where V is 'T'.
So I want
A B C V D
1 4 7 T 12
2 6 8 T 16
3 9 9 F
Is there any way to do this without iteration?
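For reference, the snippets below assume a frame built roughly like this (values taken from the question; numpy is imported because the answers use np.number and np.where):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 6, 9],
                   'C': [7, 8, 9],
                   'V': ['T', 'T', 'F']})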
Mask out the rows where V is not 'T', either after summing or before it (using skipna=False so fully masked rows stay NaN):
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
# Or,
df.select_dtypes(np.number).mask(df['V'] != 'T').sum(axis=1, skipna=False)
0 12.0
1 16.0
2 NaN
dtype: float64
df['D'] = df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
If you actually wanted blanks, use
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T', '')
0    24
1    32
2
dtype: object
which returns an object column (not recommended). Note the sums are doubled here because the D column assigned above is now numeric and gets picked up by select_dtypes.
Alternatively, using np.where:
np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
# array([12., 16., nan])
df['D'] = np.where(
df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
Use numpy.where (note that passing None as the fallback yields an object column):
import numpy as np
df['D'] = np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), None)
Or rely on index alignment, summing only the selected rows; the assignment fills the remaining rows with NaN:
df['D'] = df[['A', 'B', 'C']][df['V'] == 'T'].sum(axis=1)
In [51]: df
Out[51]:
A B C V D
0 1 4 7 T 12.000
1 2 6 8 T 16.000
2 3 9 9 F nan
I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
There is no error in executing the above code, but when doing df.apply(check), there are a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract the allowed number of NA values from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print(df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
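Detail — the per-column NA counts that thresh is compared against (for the sample df above):

print(df.isnull().sum())
A    0
B    4
C    3
D    0
E    1
F    0
dtype: int64

Columns B and C have more than N = 2 missing values, so they are the ones dropped.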
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation fails with dynamic column names held in a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})
print(df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df

print(df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
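A small caveat on the loop above: dropping columns inplace while iterating over the same DataFrame works here but is fragile in general; a sketch that collects the offending columns first and drops them in one call:

def check(df):
    # gather columns with more than 2 missing values, then drop them at once
    to_drop = [column for column in df if df[column].isnull().sum() > 2]
    return df.drop(to_drop, axis=1)

print(df.pipe(check))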
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
After merging of two data frames:
output = pd.merge(df1, df2, on='ID', how='outer')
I have a data frame like this:
index x y z
0 2 NaN 3
0 NaN 3 3
1 2 NaN 4
1 NaN 3 4
...
How can I merge rows with the same index?
Expected output:
index x y z
0 2 3 3
1 2 3 4
Perhaps you could take the mean of them:
In [418]: output.groupby('index', as_index=False).mean()
Out[418]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
We can group the DataFrame by 'index' and then take, for example, the first values with .first() (which picks the first non-null value in each column), or the minimum with .min(), etc., depending on the case. What do you want to get if the values in z differ?
In [28]: gr = df.groupby('index', as_index=False)
In [29]: gr.first()
Out[29]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [30]: gr.max()
Out[30]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [31]: gr.min()
Out[31]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [32]: gr.mean()
Out[32]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
This was very difficult to phrase. But let me show you what I'm trying to accomplish.
df
Y X
a 10
a 5
a NaN
b 12
b 13
b NaN
c 5
c NaN
c 5
c 6
Y: 10 non-null object
X: 7 non-null int64
Take category 'a' from column Y: its median X value is (10+5)/2 = 7.5, and the missing value for 'a' must be filled with this median.
Similarly, for category 'b' from column Y, the median of the non-missing X values is (12+13)/2 = 12.5.
For category 'c' from column Y, the median of the non-missing X values is 5 (the middle value).
I used a very long, repetitive code as follows.
grouped = df.groupby(['Y'])[['X']]
grouped.agg([np.median])
X
median
Y
a 7.5
b 12.5
c 5.0
df.X = df.X.fillna(-1)
df.loc[(df['Y'] == 'a') & (df['X'] == -1), 'X'] = 7.5
df.loc[(df['Y'] == 'b') & (df['X'] == -1), 'X'] = 12.5
df.loc[(df['Y'] == 'c') & (df['X'] == -1), 'X'] = 5
I was told that there is not only repetition but also the use of magic numbers, which should be avoided.
I want to write a function that does this filling efficiently.
Use groupby and transform
The transform looks like:
df.groupby('Y').X.transform('median')
0 7.5
1 7.5
2 7.5
3 12.5
4 12.5
5 12.5
6 5.0
7 5.0
8 5.0
9 5.0
Name: X, dtype: float64
And this has the same index as before, so we can pass it straight to fillna:
df.X.fillna(df.groupby('Y').X.transform('median'))
0 10.0
1 5.0
2 7.5
3 12.0
4 13.0
5 12.5
6 5.0
7 5.0
8 5.0
9 6.0
Name: X, dtype: float64
You can either make a new copy of the dataframe:
df.assign(X=df.X.fillna(df.groupby('Y').X.transform('median')))
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
Or fill the NaN values in place:
df.X.fillna(df.groupby('Y').X.transform('median'), inplace=True)
df
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
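If several columns needed the same treatment, the transform idea extends to the whole frame; a minimal sketch, assuming all numeric columns should be filled with their per-group medians:

import numpy as np

# fill every numeric column's NaNs with the median of its Y group
num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].fillna(df.groupby('Y')[num_cols].transform('median'))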