Pandas sumif with repeated column names - python

What's the best way to sum the columns of df2, grouped by the column labels of df3, in the example below?
df = pd.DataFrame(np.random.rand(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.random.rand(15).reshape((5,3)),index = ['A','B','C','D','E'])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.random.rand(25).reshape((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
The answer would be in the shape of df3.
EDIT for clarity:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(15).reshape((5,3))*2,index = ['A','B','C','D','E'],columns = [1,3,4])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.empty((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
print(df2)
0 1 2 3 4 1 3 4
A 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
B 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
C 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
D 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
E 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
The desired result would be:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0

You can group your DF by columns:
In [57]: df2.groupby(axis=1, by=df2.columns).sum()
Out[57]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
You can specify the axis name explicitly:
In [58]: df2.groupby(axis='columns', by=df2.columns).sum()
Out[58]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
Or a short version from @piRSquared:
df2.groupby(df2.columns, 1).sum()
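A hedged aside for newer pandas: the axis=1 form of groupby has been deprecated (around pandas 2.1), so a forward-compatible spelling groups the transpose by its index instead, which is exactly what the next answer does:
# Same result without groupby(axis=1): move the duplicated column labels onto
# the index, group and sum there, then transpose back.
df2.T.groupby(level=0).sum().T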

Let's use T (transpose), groupby, and sum:
df2.T.groupby(level=0).sum().T
Original df2:
0 1 2 3 4 0 1 \
A 0.627278 0.008150 0.285077 0.931831 0.683035 0.691318 0.873139
B 0.246861 0.108021 0.903743 0.030373 0.870753 0.143835 0.251623
C 0.367309 0.551530 0.193623 0.704314 0.136061 0.102401 0.287334
D 0.580771 0.592600 0.949666 0.806875 0.288331 0.794173 0.034380
E 0.088984 0.838401 0.988919 0.636134 0.353484 0.584571 0.090235
2
A 0.763687
B 0.735570
C 0.405304
D 0.446789
E 0.542930
new_df2 = df2.T.groupby(level=0).sum().T
print(new_df2)
Output new df2:
0 1 2 3 4
A 1.318595 0.881289 1.048764 0.931831 0.683035
B 0.390697 0.359644 1.639314 0.030373 0.870753
C 0.469710 0.838864 0.598927 0.704314 0.136061
D 1.374944 0.626980 1.396455 0.806875 0.288331
E 0.673555 0.928636 1.531849 0.636134 0.353484

solution 1
numpy.dot + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    df2.values.dot(pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
solution 2
numpy.einsum + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    np.einsum('ij,jk->ik', df2.values, pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
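Both solutions rely on the same trick: pd.get_dummies(cols) builds an indicator matrix with one row per (possibly repeated) column of df2 and one column per unique label, so the matrix product adds up every df2 column that shares a label. A quick way to see the indicator (a sketch, assuming the df2 from the EDIT above):
cols = df2.columns.values
print(pd.get_dummies(cols))
# 8 rows (one per original column label 0, 1, 2, 3, 4, 1, 3, 4) by 5 columns
# (one per unique label); each row marks the unique label that the original
# column is folded into by the dot product.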
naive timing
setup
df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5,  # five identical rows, matching the df2 in the EDIT
    list('ABCDE'),
    [0, 1, 2, 3, 4, 1, 3, 4]
)

Is this what you mean:
new_df = pd.DataFrame()
for c in df3.columns:
    try:
        # a duplicated label returns a DataFrame, so sum each row across it
        new_df[c] = [sum(x) for x in df2[c].values]
    except TypeError:
        # a unique label returns a Series; take its values as-is
        new_df[c] = df2[c].values
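A variant of the same idea that avoids the try/except (a sketch, assuming the df2 and df3 from the question): selecting a duplicated label returns a DataFrame while a unique label returns a Series, so the branch can be made explicit instead:
new_df = pd.DataFrame(index=df2.index)
for c in df3.columns:
    sub = df2[c]
    # duplicated labels come back as a DataFrame; sum those across each row
    new_df[c] = sub.sum(axis=1) if isinstance(sub, pd.DataFrame) else sub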

Related

Perform arithmetic operations on null values

When I try to do arithmetic operations involving two or more columns, I run into a problem with null values.
One more thing I want to mention: I don't want to fill the missing/null values.
I actually want something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum, but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df["c"] = df.a + df.b
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum of the entire array. You do not want that; you probably want to call np.nansum on the two columns instead, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yield the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

Split pandas dataframe into multiple dataframes based on null columns

I have a pandas dataframe as follows:
a b c
0 1.0 NaN NaN
1 NaN 7.0 5.0
2 3.0 8.0 3.0
3 4.0 9.0 2.0
4 5.0 0.0 NaN
Is there a simple way to split the dataframe into multiple dataframes based on non-null values?
a
0 1.0
b c
1 7.0 5.0
a b c
2 3.0 8.0 3.0
3 4.0 9.0 2.0
a b
4 5.0 0.0
Using groupby with dropna:
for _, x in df.groupby(df.isnull().dot(df.columns)):
    print(x.dropna(1))
a b c
2 3.0 8.0 3.0
3 4.0 9.0 2.0
b c
1 7.0 5.0
a
0 1.0
a b
4 5.0 0.0
We can save them in a dict:
d = {y : x.dropna(1) for y, x in df.groupby(df.isnull().dot(df.columns))}
More info: dot is used here to get the null-column pattern for each row; rows with the same pattern are combined together:
df.isnull().dot(df.columns)
Out[1250]:
0 bc
1 a
2
3
4 c
dtype: object
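Each piece can then be pulled back out of the dict by its null-column key (a small usage sketch; the keys are exactly the strings printed above):
d['']    # rows with no nulls -> columns a, b, c (rows 2 and 3)
d['bc']  # rows where b and c are null -> only column a (row 0)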
So here is a possible solution
import pandas as pd
import numpy as np

def getMap(some_list):
    return "".join(["1" if np.isnan(x) else "0" for x in some_list])

df = pd.DataFrame([[1, np.NaN, np.NaN], [np.NaN, 7, 5], [3, 8, 3], [4, 9, 2], [5, 0, np.NaN]])
print(df.head())

x = df[[0, 1, 2]].apply(lambda x: x.tolist(), axis=1).tolist()
nullMap = [getMap(y) for y in x]
nullSet = set(nullMap)
some_dict = {y: [] for y in nullSet}
for y in x:
    some_dict[getMap(y)] = [*some_dict[getMap(y)], [z for z in y if ~np.isnan(z)]]
dfs = [pd.DataFrame(y) for y in some_dict.values()]
for df in dfs:
    print(df)
This gives the exact output for the input you gave. :)
a
1.0
b c
7.0 5.0
a b c
3.0 8.0 3.0
4.0 9.0 2.0
a b
5.0 0.0

Pandas split /group dataframe by row values

I have a dataframe of the following form
In [1]: df
Out [1]:
A B C D
1 0 2 6 0
2 6 1 5 2
3 NaN NaN NaN NaN
4 9 3 2 2
...
15 2 12 5 23
16 NaN NaN NaN NaN
17 8 1 5 3
I'm interested in splitting the dataframe into multiple dataframes (or grouping it) by the NaN rows.
So resulting in something as follows
In [2]: df1
Out [2]:
A B C D
1 0 2 6 0
2 6 1 5 2
In [3]: df2
Out [3]:
A B C D
1 9 3 2 2
...
12 2 12 5 23
In [4]: df3
Out [4]:
A B C D
1 8 1 5 3
You could use the compare-cumsum-groupby pattern, where we find the all-null rows, cumulative sum those to get a group number for each subgroup, and then iterate over the groups:
In [114]: breaks = df.isnull().all(axis=1)
In [115]: groups = [group.dropna(how='all') for _, group in df.groupby(breaks.cumsum())]
In [116]: for group in groups:
...: print(group)
...: print("--")
...:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
--
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
--
A B C D
17 8.0 1.0 5.0 3.0
--
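To see why the cumulative sum produces the group ids, it helps to look at the helper series on its own (a sketch; df is the frame from the question):
breaks = df.isnull().all(axis=1)  # True only on the all-NaN separator rows
print(breaks.cumsum())            # 0 up to the first separator, then 1, then 2, ...
# Every separator bumps the counter, so the data rows between two separators
# share a group id; the separator rows fall into the same group as the rows
# after them and are stripped by dropna(how='all') above.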
You can use locals() with groupby to split:
variables = locals()
for x, y in df.dropna(0).groupby(df.isnull().all(1).cumsum()[~df.isnull().all(1)]):
    variables["df{0}".format(x + 1)] = y
df1
Out[768]:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
df2
Out[769]:
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
I'd use a dictionary with groupby and cumsum:
dictofdfs = {}
for n, g in df.groupby(df.isnull().all(1).cumsum()):
    dictofdfs[n] = g.dropna()
Output:
dictofdfs[0]
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
dictofdfs[1]
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
dictofdfs[2]
A B C D
17 8.0 1.0 5.0 3.0

pandas rolling apply to allow nan

I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply() I cannot get it to work by ignoring NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do in order to both use .apply() and get the result in the first output? My actual problem is more complicated and has to be solved with .apply(), but it boils down to this issue.
You can use np.nanmean()
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to process the nans explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
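A small aside on performance (hedged; the raw keyword exists from pandas 0.23 on): rolling.apply can pass each window to the function as a plain numpy array rather than a Series, which is typically faster and works fine with np.nanmean:
# same values as the output above; each window arrives as an ndarray
xx.rolling(3, 1).apply(np.nanmean, raw=True)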

How to implement sql coalesce in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D'. Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan  # start from an all-NaN column so the fillna chain below can fill it
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
...     return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
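Applied to the frame from the question, the helper fills the new column in one call (a usage sketch):
>>> df['D'] = coalesce(df.A, df.B, df.C)
which reproduces the expected A-then-B-then-C lookup shown in the question.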
I think you need bfill and then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
naive time test (the timing results over the given data and over larger data are not reproduced here)
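A caveat for readers on current pandas (an assumption about versions: lookup was deprecated in 1.2 and removed in 2.0): option 1's df.lookup no longer exists there, but the same row-wise first-non-null pick can be written with plain numpy indexing, much like option 2:
# assumes pandas >= 2.0, where DataFrame.lookup has been removed
first_valid = df.isnull().idxmin(axis=1)      # label of first non-null column per row
pos = df.columns.get_indexer(first_valid)     # labels -> integer positions
df.assign(D=df.to_numpy()[np.arange(len(df)), pos])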
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just chain them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
