Pandas: split/group dataframe by row values - python

I have a dataframe of the following form
In [1]: df
Out [1]:
A B C D
1 0 2 6 0
2 6 1 5 2
3 NaN NaN NaN NaN
4 9 3 2 2
...
15 2 12 5 23
16 NaN NaN NaN NaN
17 8 1 5 3
I'm interested in splitting the dataframe into multiple dataframes (or grouping it) by the NaN rows.
So the result would be something like the following:
In [2]: df1
Out [2]:
A B C D
1 0 2 6 0
2 6 1 5 2
In [3]: df2
Out [3]:
A B C D
1 9 3 2 2
...
12 2 12 5 23
In [4]: df3
Out [4]:
A B C D
1 8 1 5 3

You could use the compare-cumsum-groupby pattern: find the all-null rows, take their cumulative sum to get a group number for each subgroup, and then iterate over the groups:
In [114]: breaks = df.isnull().all(axis=1)
In [115]: groups = [group.dropna(how='all') for _, group in df.groupby(breaks.cumsum())]
In [116]: for group in groups:
...: print(group)
...: print("--")
...:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
--
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
--
A B C D
17 8.0 1.0 5.0 3.0
--
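If you prefer to keep the pieces keyed by group number instead of a list, the same pattern fits in a dict comprehension (a minimal sketch reusing the breaks series above):
pieces = {k: g.dropna(how='all') for k, g in df.groupby(breaks.cumsum())}
pieces[0]  # first sub-frame (rows 1-2), pieces[1] the second, and so on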

You can use locals() together with groupby to split the frame into separate dfN variables:
variables = locals()
for x, y in df.dropna(axis=0).groupby(df.isnull().all(1).cumsum()[~df.isnull().all(1)]):
    variables["df{0}".format(x + 1)] = y
df1
Out[768]:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
df2
Out[769]:
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0

I'd use a dictionary, with groupby and cumsum:
dictofdfs = {}
for n, g in df.groupby(df.isnull().all(1).cumsum()):
    dictofdfs[n] = g.dropna()
Output:
dictofdfs[0]
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
dictofdfs[1]
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
dictofdfs[2]
A B C D
17 8.0 1.0 5.0 3.0

Related

Using a dictionary to modify the df's values

I have a df like this:
xx yy zz
A 6 5 2
B 4 4 5
B 5 6 7
C 6 6 6
C 7 7 7
Then I have a dictionary with some keys (which correspond to the index names of the df) and values (column names):
{'A':['xx'],'B':['yy','zz'],'C':['xx','zz']}
I would like to use the dictionary so that, for each row, the columns that do not appear in that index's dict values are set to zero, generating this output:
xx yy zz
A 6 0 0
B 0 4 5
B 0 6 7
C 6 0 6
C 7 0 7
How could I use the dictionary to generate the desired output?
You may use boolean indexing:
mask = (pd.DataFrame(d.values(), index=d.keys())
.stack()
.reset_index(level=1, drop=True)
.str.get_dummies()
.groupby(level=0).sum()
.astype(bool)
)
df[mask].fillna(0)
xx yy zz
A 6.0 0.0 0.0
B 0.0 4.0 5.0
B 0.0 6.0 7.0
C 6.0 0.0 6.0
C 7.0 0.0 7.0
What I would do:
s=pd.Series(d).explode()
s=pd.crosstab(s.index,s)
df.update(s.mask(s==1))
df
xx yy zz
A 6.0 0.0 0.0
B 0.0 4.0 5.0
B 0.0 6.0 7.0
C 6.0 0.0 6.0
C 7.0 0.0 7.0
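If readability matters more than vectorization, a plain loop over the dictionary gives the same result; this is only a sketch assuming the df and the dict d from the question (note it keeps the original integer dtypes):
out = df.copy()
for label, keep in d.items():
    out.loc[label, out.columns.difference(keep)] = 0  # zero out columns not listed for this index label
out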

Perform arithmetic operations on null values

When I try to do an arithmetic operation involving two or more columns, I run into problems with null values.
One more thing I want to mention: I don't want to fill the missing/null values.
What I actually want is something like 1 + np.nan = 1, but adding the columns directly gives np.nan (the c column below). I tried to solve it with np.nansum, but that didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum of the entire result as a single scalar (hence 6.0 in every row). You do not want that; you probably want to call np.nansum on the two columns stacked together, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
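If you only want to add the two columns while treating NaN as 0 (rather than summing every numeric column in the row), Series.add with fill_value is another option; a minimal sketch with the a and b columns from the question:
df['c'] = df.a.add(df.b, fill_value=0)  # 3 + NaN -> 3.0, 4 + NaN -> 4.0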

How to remove certain features that have a low completeness rate in a DataFrame (Python)

I have a DataFrame with more than 450 variables and more than 500,000 rows. However, some variables are over 90% null. I would like to delete the features with more than 90% empty rows.
I made a description of my variables:
Data Frame :
df = pd.DataFrame({
    'A': list('abcdefghij'),
    'B': [4, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'C': [7, 8, np.nan, 4, 2, 3, 6, 5, 4, 6],
    'D': [1, 3, 5, np.nan, 1, 0, 10, 7, np.nan, 5],
    'E': [5, 3, 6, 9, 2, 4, 7, 3, 5, 9],
    'F': list('aaabbbckfr'),
    'G': [np.nan, 8, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
print(df)
A B C D E F G
0 a 4.0 7 1 5 a NaN
1 b NaN 8 3 3 a 8.0
2 c NaN NaN 5 6 a NaN
3 d NaN 4 NaN 9 b NaN
4 e NaN 2 1 2 b NaN
5 f NaN 3 0 4 b NaN
6 g NaN 6 10 7 c NaN
7 h NaN 5 7 3 k NaN
8 i NaN 4 NaN 5 f NaN
9 j NaN 6 5 9 r NaN
Describe:
desc = df.describe(include = 'all')
d1 = desc.loc['varType'] = desc.dtypes
d3 = desc.loc['rowsNull'] = df.isnull().sum()
d4 = desc.loc['%rowsNull'] = round((d3/len(df))*100, 2)
print(desc)
A B C D E F G
count 10 1 10 10 10 10 1
unique 10 NaN NaN NaN NaN 6 NaN
top i NaN NaN NaN NaN b NaN
freq 1 NaN NaN NaN NaN 3 NaN
mean NaN 4 5.4 4.3 5.3 NaN 8
std NaN NaN 2.22111 3.16403 2.45176 NaN NaN
min NaN 4 2 0 2 NaN 8
25% NaN 4 4 1.5 3.25 NaN 8
50% NaN 4 5.5 4.5 5 NaN 8
75% NaN 4 6.75 6.5 6.75 NaN 8
max NaN 4 9 10 9 NaN 8
varType object float64 float64 float64 float64 object float64
rowsNull 0 9 1 2 0 0 9
%rowsNull 0 90 10 20 0 0 90
In this example we have just 2 features to delete: 'B' and 'G'.
But in my real case I find 40 variables whose '%rowsNull' is greater than 90%. How can I leave these variables out of my modeling?
I have no idea how to do this.
Please help me.
Thanks.
First compare missing values, then take the mean (this works because True values are treated as 1s), and finally filter with boolean indexing via loc, since we are removing columns:
df = df.loc[:, df.isnull().mean() <.9]
print (df)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Detail:
print (df.isnull().mean())
A 0.0
B 0.9
C 0.1
D 0.2
E 0.0
F 0.0
G 0.9
dtype: float64
You can find the columns with 90% or more null values and drop them:
cols_to_drop = df.columns[df.isnull().sum()/len(df) >= .90]
df.drop(cols_to_drop, axis = 1, inplace = True)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Based on your code, you could do something like:
keepCols = desc.columns[desc.loc['%rowsNull'] < 90]
df = df[keepCols]
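DataFrame.dropna with a thresh can express the same rule in one call; a sketch (thresh counts the required non-null values per column, and the +1 makes the cut "strictly more than 10% non-null", which drops B and G as in the outputs above):
min_non_null = int(len(df) * 0.1) + 1  # a column needs at least this many non-null values to survive
df = df.dropna(axis=1, thresh=min_non_null)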

How to implement SQL COALESCE in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' holding the first non-null value from A, B and C, in that order. The expected output is:
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
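Applied to the question's frame, the helper then builds the D column directly (a usage sketch):
>>> df['D'] = coalesce(df.A, df.B, df.C)
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0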
I think you need bfill along axis=1, then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Option 1: pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Option 2: numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Naive time test (timing plots over the given data and over larger data omitted).
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just chain them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0

Python Pandas: Get single max numeric value from df with floats and characters/alphas

I want to get the single max numeric value from this df below which contains a mix of floats and characters/alphas.
Here is my df:
df = pd.DataFrame({'group1': ['a','a','a','b','b','b','c','c','d','d','d','d','d'],
'group2': ['c','c','d','d','d','e','f','f','e','d','d','d','e'],
'value1': [1.1,2,3,4,5,6,7,8,9,1,2,3,4],
'value2': [7.1,8,9,10,11,12,43,12,34,5,6,2,3]})
This is what it looks like:
group1 group2 value1 value2
0 a c 1.1 7.1
1 a c 2.0 8.0
2 a d 3.0 9.0
3 b d 4.0 10.0
4 b d 5.0 11.0
5 b e 6.0 12.0
6 c f 7.0 43.0
7 c f 8.0 12.0
8 d e 9.0 34.0
9 d d 1.0 5.0
10 d d 2.0 6.0
11 d d 3.0 2.0
12 d e 4.0 3.0
Expected outcome:
43.0
At the moment I am creating a new df which excludes "group1" and "group2" but there must be a better way to pull the max numeric value?
Note: this thread is linked to http://goo.gl/ZJoR9V
Thanks
Using pandas 0.14.1, this is a nice clean way:
In [6]: df.select_dtypes(exclude=['object']).max().max()
Out[6]: 43.0
Or
In [6]: df.select_dtypes(exclude=['object']).unstack().max()
Out[6]: 43.0
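On more recent pandas versions the same idea can be written with include='number' and a single NumPy max (a minimal sketch, not from the original answer):
In [7]: df.select_dtypes(include='number').to_numpy().max()
Out[7]: 43.0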
