Python dfply: unable to mask on multiple conditions

I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
4 5.0 NaN 1
And here is the result of the piped mask, df2:
a b c
0 NaN 6.0 5
4 5.0 NaN 1
However, I expect this instead:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
Why don't the "|" and "~" operators select the rows in which column "a" is NaN or column "b" is not NaN?
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in an error, presumably because X.a.isnull() is a deferred dfply expression rather than an actual array at the time np.logical_or tries to evaluate it:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__

Edit: tweaking the second condition to X.b.notnull() works. I still have no idea why the tilde is ignored after the pipe.
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))
print(df2)
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2

How about filter_by? In dfply it is an alias for mask, so it reads more like dplyr's filter:
df >> filter_by((X.a.isnull()) | (X.b.notnull()))
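For what it's worth, the same boolean logic behaves as expected in plain pandas, which suggests the problem lies in how dfply's deferred X expressions handle ~ rather than in pandas itself. A minimal cross-check, no dfply involved:
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
print(df[df['a'].isnull() | ~df['b'].isnull()])  # rows 0 through 3, as expected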

Related

Python Dataframe Duplicated Columns while Merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, using the main dataframe's column as a reference. I have successfully arrived at my desired answer, except that I see duplicated copies of the main dataframe's column. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
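As a quick illustration of what that boolean mask looks like against the four-column frame above:
>>> df.columns.duplicated()
array([False, False,  True, False])
The ~ inverts the array, so .loc[:, ~df.columns.duplicated()] keeps only the first occurrence of each column label.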
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to clean up your dataframe: pop removes both duplicated
# 'Ref' columns at once (as a DataFrame), and .iloc[:, 0] keeps the first one
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting the 'Ref' column as the index while building the dataframe list, and resetting the index at the end so you get Ref back as a column?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even functools.reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y],left_on='Ref',right_on=y,how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
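For completeness: since each merged column here simply echoes Ref wherever a match exists, the same output can be produced without any merge at all. A sketch, assuming the original df (just the Ref column) and df1 from the question, and that Ref and the df1 columns hold comparable values:
out = df.copy()
for col in df1.columns:
    # keep the Ref value only where it appears in df1[col], else NaN
    out[col] = df['Ref'].where(df['Ref'].isin(df1[col]))
print(out)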

Forward fill on custom value in pandas dataframe

I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to do the same thing, except triggered not by NaN but by a specific sentinel value (say "*").
Here's an example:
import pandas as pd
import numpy as np
d = [{"a":1, "b":10},
{"a":2, "b":"*"},
{"a":3, "b":"*"},
{"a":4, "b":"*"},
{"a":np.nan, "b":50},
{"a":6, "b":60},
{"a":7, "b":70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I simply replace "*" with np.nan and then ffill, the fill would also be applied to the genuine NaN in column a, which I want to keep.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", then replacing and filling.
You can use df.mask together with df.isin and df.replace:
df.mask(df.isin(['*']),df.replace('*',np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
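To unpack how that one-liner works, here is the same logic split into steps (a sketch; is_star and filled are just illustrative names):
is_star = df.isin(['*'])                  # True where the sentinel appears
filled = df.replace('*', np.nan).ffill()  # fill sentinels (and real NaNs)
df.mask(is_star, filled)                  # only sentinel cells take filled values
Because mask only replaces cells where is_star is True, the genuine NaN in column a is left untouched.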
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan

Compare two pandas dataframes and replace value based on condition

I have the following two pandas dataframes:
df1
A B C
0 1 2 1
1 7 3 6
2 3 10 11
df2
A B C
0 2 0 2
1 8 4 7
Where A,B and C are column headings of both dataframes.
I am trying to compare columns of df1 to columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column wise) needs to be replaced with NaN.
So in this example the output should be:
A B C
0 nan 2 nan
1 7 3 6
2 3 nan nan
As a first attempt I tried df1[df1 < df2] = np.nan, but this does not work. I have also tried .where() without success.
Would appreciate some help here, thanks.
IIUC
df=df1.where(df1.ge(df2.iloc[0])&df1.lt(df2.iloc[1]))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
A slightly different approach using between (recent pandas wants the string 'neither' where older versions took inclusive=False):
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)
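Note that the answers differ slightly on whether values exactly equal to a bound survive. The question's wording ("outside the lower and upper bound" becomes NaN) suggests the bounds themselves should be kept, in which case a fully inclusive sketch would be:
df1.where(df1.ge(df2.iloc[0]) & df1.le(df2.iloc[1]))
For the sample data it makes no difference, since no value in df1 equals a bound.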

Creating two shifted columns in grouped pandas data-frame

I have looked all over and I still can't find an example of how to create two shifted columns in a Pandas Dataframe within its groups.
I have done it with one column as follows:
data_frame['previous_category'] = data_frame.groupby('id')['category'].shift()
But I have to do it with 2 columns, shifting one upwards and the other downwards.
Any ideas?
It is possible with a custom function and GroupBy.apply, because one column needs to be shifted down and the other shifted up:
df = pd.DataFrame({
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'F':list('aaabbb')
})

def f(x):
    x['B'] = x['B'].shift()    # shift down
    x['C'] = x['C'].shift(-1)  # shift up
    return x
df = df.groupby('F').apply(f)
print (df)
B C F
0 NaN 8.0 a
1 4.0 9.0 a
2 5.0 NaN a
3 NaN 2.0 b
4 5.0 3.0 b
5 5.0 NaN b
If you want to shift both columns the same way, just select the list of columns:
df[['B','C']] = df.groupby('F')[['B','C']].shift()
print (df)
B C F
0 NaN NaN a
1 4.0 7.0 a
2 5.0 8.0 a
3 NaN NaN b
4 5.0 4.0 b
5 5.0 2.0 b
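If you'd rather avoid apply, the same two-direction result can be produced by assigning the shifted columns directly; a sketch against the original df above:
g = df.groupby('F')
df['B'] = g['B'].shift()     # shift down within each group
df['C'] = g['C'].shift(-1)   # shift up within each group
Each shift is a single vectorized groupby operation, which is usually faster than GroupBy.apply.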

Perform arithmetic operations on null values

When I try to do an arithmetic operation involving two or more columns, I run into problems with null values.
One more thing I want to mention: I don't want to fill in the missing/null values.
What I actually want is something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum, but that didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df["c"] = df.a + df.b  # plain addition propagates NaN
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated a single sum over the entire array, which was then broadcast to every row. You do not want that; you probably want to call np.nansum across the two columns, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected result:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1, selecting just the two columns (sum skips NaN by default):
df['c'] = df[['a','b']].sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
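A third option worth mentioning, a sketch using the same df: Series.add accepts a fill_value, which treats a missing operand as 0 while still returning NaN when both operands are missing:
df['c'] = df['a'].add(df['b'], fill_value=0)
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0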
