Using a dictionary to modify a DataFrame's values - Python

I have a df like this:
xx yy zz
A 6 5 2
B 4 4 5
B 5 6 7
C 6 6 6
C 7 7 7
Then I have a dictionary with some keys (which correspond to the index names of the df) and values (column names):
{'A':['xx'],'B':['yy','zz'],'C':['xx','zz']}
I would like to use the dictionary so that, for each row, any column whose name does not appear in the dict values for that row's index label is set to zero, generating this output:
xx yy zz
A 6 0 0
B 0 4 5
B 0 6 7
C 6 0 6
C 7 0 7
How could I use the dictionary to generate the desired output?
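For reference, a minimal setup reproducing the frame and the dictionary (the names df and d are assumptions that match the answers below):
import pandas as pd

df = pd.DataFrame({'xx': [6, 4, 5, 6, 7],
                   'yy': [5, 4, 6, 6, 7],
                   'zz': [2, 5, 7, 6, 7]},
                  index=['A', 'B', 'B', 'C', 'C'])
d = {'A': ['xx'], 'B': ['yy', 'zz'], 'C': ['xx', 'zz']}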

You may use boolean indexing:
mask = (pd.DataFrame(d.values(), index=d.keys())
          .stack()
          .reset_index(level=1, drop=True)
          .str.get_dummies()
          .groupby(level=0).sum()
          .astype(bool)
        )
df[mask].fillna(0)
xx yy zz
A 6.0 0.0 0.0
B 0.0 4.0 5.0
B 0.0 6.0 7.0
C 6.0 0.0 6.0
C 7.0 0.0 7.0

What I would do: build a 0/1 indicator table with crosstab, then overwrite with update. Since update skips NaN values, masking the 1s (the columns to keep) to NaN means only the unlisted columns get overwritten with 0.
s = pd.Series(d).explode()
s = pd.crosstab(s.index, s)
df.update(s.mask(s == 1))
df
xx yy zz
A 6.0 0.0 0.0
B 0.0 4.0 5.0
B 0.0 6.0 7.0
C 6.0 0.0 6.0
C 7.0 0.0 7.0

Related

Pandas split/group dataframe by row values

I have a dataframe of the following form
In [1]: df
Out [1]:
A B C D
1 0 2 6 0
2 6 1 5 2
3 NaN NaN NaN NaN
4 9 3 2 2
...
15 2 12 5 23
16 NaN NaN NaN NaN
17 8 1 5 3
I'm interested in splitting the dataframe into multiple dataframes (or grouping it) by the NaN rows.
So resulting in something as follows
In [2]: df1
Out [2]:
A B C D
1 0 2 6 0
2 6 1 5 2
In [3]: df2
Out [3]:
A B C D
1 9 3 2 2
...
12 2 12 5 23
In [4]: df3
Out [4]:
A B C D
1 8 1 5 3
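For reference, a sketch reconstructing the example frame from just the rows shown (the elided rows between 4 and 15 are left out):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 6, np.nan, 9, 2, np.nan, 8],
                   'B': [2, 1, np.nan, 3, 12, np.nan, 1],
                   'C': [6, 5, np.nan, 2, 5, np.nan, 5],
                   'D': [0, 2, np.nan, 2, 23, np.nan, 3]},
                  index=[1, 2, 3, 4, 15, 16, 17])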
You could use the compare-cumsum-groupby pattern, where we find the all-null rows, cumulative sum those to get a group number for each subgroup, and then iterate over the groups:
In [114]: breaks = df.isnull().all(axis=1)
In [115]: groups = [group.dropna(how='all') for _, group in df.groupby(breaks.cumsum())]
In [116]: for group in groups:
...: print(group)
...: print("--")
...:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
--
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
--
A B C D
17 8.0 1.0 5.0 3.0
--
You can use locals() with groupby to split:
variables = locals()
for x, y in df.dropna(0).groupby(df.isnull().all(1).cumsum()[~df.isnull().all(1)]):
    variables["df{0}".format(x + 1)] = y
df1
Out[768]:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
df2
Out[769]:
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
I'd use a dictionary with groupby and cumsum:
dictofdfs = {}
for n, g in df.groupby(df.isnull().all(1).cumsum()):
    dictofdfs[n] = g.dropna()
Output:
dictofdfs[0]
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
dictofdfs[1]
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
dictofdfs[2]
A B C D
17 8.0 1.0 5.0 3.0

Pandas: convert first value in group to np.nan

I have the following DataFrame:
df = pd.DataFrame({'series1': ['A','A','A','A','B','B','B','C','C','C','C'],
                   'series2': [0, 1, 10, 99, -9, 9, 0, 10, 20, 10, 10]})
series1 series2
0 A 0.0
1 A 1.0
2 A 10.0
3 A 99.0
4 B -9.0
5 B 9.0
6 B 0.0
7 C 10.0
8 C 20.0
9 C 10.0
10 C 10.0
What I want:
df2 = pd.DataFrame({'series1': ['A','A','A','A','B','B','B','C','C','C','C'],
                    'series2': [np.nan, 1, 10, 99, np.nan, 9, 0, np.nan, 20, 10, 10]})
series1 series2
0 A NaN
1 A 1.0
2 A 10.0
3 A 99.0
4 B NaN
5 B 9.0
6 B 0.0
7 C NaN
8 C 20.0
9 C 10.0
10 C 10.0
I have a feeling this could be done with pandas' .groupby function:
df.groupby('series1').first()
series2
series1
A 0
B -9
C 10
which gives me the observations I want to convert to NaNs, but I can't figure out a way to easily replace this in the original DataFrame.
This is just a simple example; the actual dataframe I'm working with has >8,000,000 observations.
There's probably a slicker way to do this, but the first element in each group is the 0th element in that group, and cumcount numbers the elements within each group. So:
In [19]: df.loc[df.groupby('series1').cumcount() == 0, 'series2'] = np.nan
In [20]: df
Out[20]:
series1 series2
0 A NaN
1 A 1.0
2 A 10.0
3 A 99.0
4 B NaN
5 B 9.0
6 B 0.0
7 C NaN
8 C 20.0
9 C 10.0
10 C 10.0
You want to locate discontinuities in series1 by shifting it down and comparing to itself:
df.loc[df['series1'].shift() != df['series1'], 'series2'] = np.nan
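To see what that selects, the intermediate mask can be inspected on its own (a small sketch on the same df):
mask = df['series1'].shift() != df['series1']
# True at index 0, 4 and 7: the first row of each run of identical series1 values
# (note this relies on the groups being contiguous, as they are here)
df.loc[mask, 'series2'] = np.nan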
Another option by shifting the column:
df['series2'] = df.groupby('series1').series2.transform(lambda x: x.shift(-1).shift())
df
# series1 series2
#0 A NaN
#1 A 1.0
#2 A 10.0
#3 A 99.0
#4 B NaN
#5 B 9.0
#6 B 0.0
#7 C NaN
#8 C 20.0
#9 C 10.0
#10 C 10.0
Or you can use head (or, similarly, first or nth) to pick the first row of each group and slice by the resulting index:
df.loc[df.groupby('series1',as_index=False).head(1).index,'series2'] = np.nan
#df.loc[df.groupby('series1',as_index=False).first().index,'series2'] = np.nan
#df.loc[df.groupby('series1',as_index=False).nth(0).index,'series2'] = np.nan

Fill NaN values of column X with the median value of X for each categorical variable in another column Y

This was very difficult to phrase. But let me show you what I'm trying to accomplish.
df
Y X
a 10
a 5
a NaN
b 12
b 13
b NaN
c 5
c NaN
c 5
c 6
Y: 10 non-null object
X: 7 non-null int64
Take category 'a' from column Y: among the non-missing values in column X, the median is (10+5)/2 = 7.5, and the missing value for 'a' must be filled with this median.
Similarly, for category 'b' from column Y, the median of the non-missing values in column X is (12+13)/2 = 12.5.
For category 'c' from column Y, the median of the non-missing values in column X is 5 (the middle value).
I used very long, repetitive code, as follows:
grouped = df.groupby(['Y'])[['X']]
grouped.agg([np.median])
X
median
Y
a 7.5
b 12.5
c 5
df.X = df.X.fillna(-1)
df.loc[(df['Y'] == 'a') & (df['X'] == -1), 'X'] = 7.5
df.loc[(df['Y'] == 'b') & (df['X'] == -1), 'X'] = 12.5
df.loc[(df['Y'] == 'c') & (df['X'] == -1), 'X'] = 5
I was told that there is not only repetition but also the use of magic numbers, which should be avoided.
I want to write a function that does this filling efficiently.
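For reference, a construction of the example frame (a sketch inferred from the table shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Y': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                   'X': [10, 5, np.nan, 12, 13, np.nan, 5, np.nan, 5, 6]})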
Use groupby and transform
The transform looks like
df.groupby('Y').X.transform('median')
0 7.5
1 7.5
2 7.5
3 12.5
4 12.5
5 12.5
6 5.0
7 5.0
8 5.0
9 5.0
Name: X, dtype: float64
And this has the same index as before. Therefore we can easily use it to fillna
df.X.fillna(df.groupby('Y').X.transform('median'))
0 10.0
1 5.0
2 7.5
3 12.0
4 13.0
5 12.5
6 5.0
7 5.0
8 5.0
9 6.0
Name: X, dtype: float64
You can either make a new copy of the dataframe
df.assign(X=df.X.fillna(df.groupby('Y').X.transform('median')))
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
Or fillna values in place
df.X.fillna(df.groupby('Y').X.transform('median'), inplace=True)
df
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
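If a reusable function is wanted, a minimal sketch wrapping the same transform idea (function and parameter names are illustrative, not from the answer above):
def fill_group_median(frame, group_col, value_col):
    # per-group median, broadcast back onto the original index by transform
    medians = frame.groupby(group_col)[value_col].transform('median')
    return frame.assign(**{value_col: frame[value_col].fillna(medians)})

df = fill_group_median(df, 'Y', 'X')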

How to implement sql coalesce in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' that takes the first non-null value among A, B and C (like SQL COALESCE). Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
...     return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
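Applied to the question's frame, the D column can then be added with this usage sketch:
>>> df['D'] = coalesce(df.A, df.B, df.C)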
I think you need bfill, then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Naive time test (timing plots over the given data and over larger data are omitted here).
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just chain them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0

"A value is trying to be set on a copy of a slice from a DataFrame" warning while trying to set dataframe values

I was just trying to do a simple value-modification operation in a pandas DataFrame.
import pandas as pd
import numpy as np
x = np.linspace(1,10,10)
y = x * 2
z = [-1,-2,-3,4,5,6,7,8,9,10]
df = pd.DataFrame(columns=['x','y','z'])
df['x'] = x
df['y'] = y
df['z'] = z
for i in range(len(df['z'])):
    if df['z'].iloc[i] < 0:
        df['x'].iloc[i] *= -1
        df['y'].iloc[i] *= -1
        df['z'].iloc[i] *= -1
However it warned: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I was not aware of how chained assignment was involved in this case.
It gave me the right answer, but it was significantly slower.
Thanks
Looping is slow, so it is best to avoid it and use vectorized pandas functions if possible.
I think you can use mask: where the condition is True, multiply by -1:
df = df.mask(df['z'] < 0, df.mul(-1))
print (df)
x y z
0 -1.0 -2.0 1
1 -2.0 -4.0 2
2 -3.0 -6.0 3
3 4.0 8.0 4
4 5.0 10.0 5
5 6.0 12.0 6
6 7.0 14.0 7
7 8.0 16.0 8
8 9.0 18.0 9
9 10.0 20.0 10
Another solution is to select by the condition and multiply by -1:
df.loc[df['z'] < 0] *= -1
print (df)
x y z
0 -1.0 -2.0 1
1 -2.0 -4.0 2
2 -3.0 -6.0 3
3 4.0 8.0 4
4 5.0 10.0 5
5 6.0 12.0 6
6 7.0 14.0 7
7 8.0 16.0 8
8 9.0 18.0 9
9 10.0 20.0 10
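If a row-by-row loop is really needed, the warning can also be avoided by replacing the chained indexing with a single .loc call per assignment (a sketch kept close to the original loop):
for i in df.index:
    if df.loc[i, 'z'] < 0:
        # one indexer per assignment, so no chained-assignment warning
        df.loc[i, ['x', 'y', 'z']] *= -1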
