Following is what my dataframe looks like and Expected_Output is my desired column.
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
For a given Group, if Signal == 1, I am attempting to look at the next three rows (not the current row) and check whether Value1 < Value2. If that condition is true, I return a 1 in the Expected_Output column. If the Value1 < Value2 condition is satisfied more than once because a row falls within the next 3 rows of Signal == 1 for both rows 5 and 6 (Group 2), I still return a single 1 in Expected_Output.
I am assuming the right combination of a groupby object, np.where, any and shift could be the solution, but I can't quite get there.
N.B.: Alexander pointed out a conflict in the comments. Ideally, a value being set due to a signal in a prior row should supersede the current-row rule when they conflict in a given row.
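For reference, a minimal construction of the example frame (the dtypes are an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group':  [1, 1, 1, 1, 1, 2, 2, 2],
    'Signal': [0, 1, 0, 0, 0, 1, 1, 0],
    'Value1': [3, 4, 7, 8, 5, 3, 1, 3],
    'Value2': [1, 2, 4, 9, 3, 6, 2, 4],
})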
If you are going to be checking lots of previous rows, multiple shifts can quickly get messy, but here it's not too bad:
s = df.groupby('Group').Signal
condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
             & df.Value1.lt(df.Value2))
df.assign(out=np.where(condition, 1, np.nan))
Group Signal Value1 Value2 out
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
If you're concerned about the performance of using so many shifts, I wouldn't worry too much; here's a sample on roughly 1 million rows:
In [401]: len(df)
Out[401]: 960000
In [402]: %%timeit
...: s = df.groupby('Group').Signal
...:
...: condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
...: & df.Value1.lt(df.Value2))
...:
...: np.where(condition, 1, np.nan)
...:
...:
94.5 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Alexander identified a conflict in the rules; here is a version using a mask that fits that requirement:
s = (df.Signal.mask(df.Signal.eq(0)).groupby(df.Group)
       .ffill(limit=3).mask(df.Signal.eq(1)).fillna(0))
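For this sample, s flags the three rows after each signal within its group (the signal rows themselves excluded), so it should come out to:

0    0.0
1    0.0
2    1.0
3    1.0
4    1.0
5    0.0
6    0.0
7    1.0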
Now you can simply use this column along with your other condition:
np.where((s.eq(1) & df.Value1.lt(df.Value2)).astype(int), 1, np.nan)
array([nan, nan, nan, 1., nan, nan, nan, 1.])
You can create an index that matches your criteria, and then use it to set the expected output to 1.
It is not clear how to treat the expected output when the rules conflict. For example, on row 6 the expected output would be 1 because it satisfies the signal criterion from row 5 and falls within 'the subsequent three rows where Value1 < Value2'. However, that possibly conflicts with the rule that the signal row itself is ignored.
idx = (df
       .assign(
           grp=df['Signal'].eq(1).cumsum(),
           cond=df.eval('Value1 < Value2'))
       .pipe(lambda df: df[df['grp'] > 0])      # Ignore data preceding first signal.
       .groupby(['Group', 'grp'], as_index=False)
       .apply(lambda df: df.iloc[1:4, :])       # Ignore current row, get rows 1-3.
       .pipe(lambda df: df[df['cond']])         # Find rows where condition is met.
       .index.get_level_values(1)
       )
df['Expected_Output'] = np.nan
df.loc[idx, 'Expected_Output'] = 1
>>> df
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 NaN # <<< Intended difference vs. "expected"
7 2 0 3 4 1.0
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the value from the row preceding the run of consecutive NaNs and divide it by the number of consecutive NaNs + 1; the result replaces both that row and the NaNs. For example, rows 3 and 4 should both be replaced with 12 (24/2), and rows 6, 7 and 8 should be replaced with 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
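For reference, one construction of this frame (treating Index as a regular column is an assumption, but it matches the output shown below):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column 1': [10, 12, 24, np.nan, 20, 15, np.nan, np.nan, 2],
})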
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows where the next value is NaN or the previous value is NaN, but the value itself is not NaN. Those rows form the first row of each such group.
So the m in above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the form [True, <all Falses>], because those are the groups I want to take the average of. For that, use cumsum.
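For this data, m.cumsum() gives the group labels:

0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5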
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what are the groups.
Now for each group you can get the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, assign the mean after filling the NaNs with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? Its method argument would probably fit your needs.
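A minimal sketch of that idea (interpolate is linear by default; note it interpolates between neighbouring values rather than applying the divide-by-run-length rule described above):

df['Column 1'] = df['Column 1'].interpolate()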
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np
df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        # Walk forward over the run of consecutive NaNs following this row.
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            # Split the leading value evenly across itself and the NaNs.
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue

df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
Let's say I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
'Est': [1.18,1.83,2.08,2.30,2.45,3.21,3.26,3.54,3.87,4.58,4.59,4.98],
'Buy': [0,1,1,1,0,1,1,0,1,0,0,1]
})
Est Buy
0 1.18 0
1 1.83 1
2 2.08 1
3 2.30 1
4 2.45 0
5 3.21 1
6 3.26 1
7 3.54 0
8 3.87 1
9 4.58 0
10 4.59 0
11 4.98 1
I would like to create a new dataframe with two columns and 4 rows, in the following format: the first row contains how many 'Est' values are between 1 and 2 and how many 1s the 'Buy' column has for those rows; the second row the same for 'Est' values between 2 and 3; the third row for values between 3 and 4, and so on. So my output should be
A B
0 2 1
1 3 2
2 4 3
3 3 1
I tried to use the where clause in pandas (or np.where) to create new columns with restrictions like df['Est'] >= 1 & df['Est'] <= 2 and then count. But is there an easier and cleaner way to do this? Thanks
Sounds like you want to group by the floor of the first column:
g = df.groupby(df['Est'] // 1)
You count the Est column:
count = g['Est'].count()
And sum the Buy column:
buys = g['Buy'].sum()
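To assemble these into the two-column frame from the question (using the column names A and B from the expected output), one way is:

out = pd.DataFrame({'A': g['Est'].count(), 'B': g['Buy'].sum()}).reset_index(drop=True)
print(out)
   A  B
0  2  1
1  3  2
2  4  3
3  3  1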
I have a DataFrame that looks like this table:
index     x  y  value_1  cumsum_1  cumsum_2
    0  0.10  1       12        12         0
    1  1.20  1       10        12        10
    2  0.25  1        7        19        10
    3  1.00  2        3         0         3
    4  0.72  2        5         5        10
    5  1.50  2       10         5        13
So my aim is to calculate the cumulative sum of value_1. But there are two conditions that must be taken into account.
First: if the value of x is less than 1, the cumsum() is written to column cumsum_1, and if x is greater than or equal to 1, to column cumsum_2.
Second: column y indicates groups (1, 2, 3, ...). When the value in y changes, the cumsum() operation starts all over again. I think the groupby() method would help.
Does somebody have any idea?
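For reference, a construction of the input columns from the table (an assumption; the cumsum columns are what we want to compute):

import pandas as pd

df = pd.DataFrame({
    'x': [0.10, 1.20, 0.25, 1.00, 0.72, 1.50],
    'y': [1, 1, 1, 2, 2, 2],
    'value_1': [12, 10, 7, 3, 5, 10],
})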
You can use .where() with the conditions x < 1 and x >= 1 to temporarily set the values of value_1 to 0 where the condition does not hold, and then take a groupby cumsum, as follows.
The second requirement is handled by the .groupby call, while the first is handled by .where():
.where() keeps the column values where the condition is true and replaces them (with 0 here) where it is false. For the first line, wherever x < 1, value_1 keeps its values and feeds them into the subsequent cumsum. Wherever x < 1 is False, value_1 is masked to 0, and passing those 0s to cumsum has the same effect as leaving the original values of value_1 out of the accumulation into column cumsum_1.
The second line accumulates value_1 into column cumsum_2 with the opposite condition, x >= 1. Together, the two lines allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.
(Thanks to @tdy for the suggestion to simplify the code.)
df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
Result:
print(df)
x y value_1 cumsum_1 cumsum_2
0 0.10 1 12 12 0
1 1.20 1 10 12 10
2 0.25 1 7 19 10
3 1.00 2 3 0 3
4 0.72 2 5 5 3
5 1.50 2 10 5 13
Here is another approach using a pivot:
(df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
output:
ge1 cumsum_1 cumsum_2
0 12 0
1 12 10
2 19 10
3 0 3
4 5 3
5 5 13
full code:
df[['cumsum_1', 'cumsum_2']] = (df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
(or use pd.concat to concatenate)
output:
index x y value_1 cumsum_1 cumsum_2
0 0 0.10 1 12 12 0
1 1 1.20 1 10 12 10
2 2 0.25 1 7 19 10
3 3 1.00 2 3 0 3
4 4 0.72 2 5 5 3
5 5 1.50 2 10 5 13
Similar to the approaches above, but a little more chained. Note that v2 holds value_1 where x < 1 (the cumsum_1 case) and v3 holds it where x >= 1 (the cumsum_2 case), so the selection order lines up with the target column names:
df[['cumsum_1a', 'cumsum_2a']] = (df.
    assign(
        v1 = lambda temp: temp.x >= 1,               # True where x >= 1
        v2 = lambda temp: ~temp.v1 * temp.value_1,   # value_1 where x < 1, else 0
        v3 = lambda temp: temp.v1 * temp.value_1     # value_1 where x >= 1, else 0
    ).
    groupby('y')[['v2', 'v3']].
    cumsum()
)
I have this dataframe
Id,ProductId,Product
1,100,a
1,100,x
1,100,NaN
2,150,NaN
3,150,NaN
4,100,a
4,100,x
4,100,NaN
Here I want to remove some of the rows which contain NaN, but not all of them.
The removal criterion is as follows:
I want to remove only those NaN rows whose Id already has a value in the Product column.
For example, Id 1 already has values in the Product column and still contains a NaN, so I want to remove that row.
But for Id 2, there is only NaN in the Product column, so I don't want to remove that one. Similarly for Id 3, there are only NaN values in the Product column and I want to keep that one too.
The final output would look like this:
Id,ProductId,Product
1,100,a
1,100,x
2,150,NaN
3,150,NaN
4,100,a
4,100,x
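For reference, a construction of this frame (an assumption, with NaN as np.nan):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 1, 2, 3, 4, 4, 4],
    'ProductId': [100, 100, 100, 150, 150, 100, 100, 100],
    'Product': ['a', 'x', np.nan, np.nan, np.nan, 'a', 'x', np.nan],
})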
Don't use groupby if an alternative exists, because it is slower.
vals = df.loc[df['Product'].notnull(), 'Id'].unique()
df = df[~(df['Id'].isin(vals) & df['Product'].isnull())]
print (df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Explanation:
First get all Id with some non missing values:
print (df.loc[df['Product'].notnull(), 'Id'].unique())
[1 4]
Then check these groups with missing values:
print (df['Id'].isin(vals) & df['Product'].isnull())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
Invert boolean mask:
print (~(df['Id'].isin(vals) & df['Product'].isnull()))
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
And last filter by boolean indexing:
print (df[~(df['Id'].isin(vals) & df['Product'].isnull())])
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
You can group the dataframe by Id and drop the NaNs if the group has more than one element (this assumes Id is set as the index, which matches the output below):
>>> df.set_index('Id').groupby(level='Id', group_keys=False
...                            ).apply(lambda x: x.dropna() if len(x) > 1 else x)
ProductId Product
Id
1 100 a
1 100 x
2 150 NaN
3 150 NaN
4 100 a
4 100 x
Calculate the groups (Id) whose values (Product) are all null, then remove the required rows via boolean indexing with the loc accessor:
nulls = df.groupby('Id')['Product'].apply(lambda x: x.isnull().all())
nulls_idx = nulls[nulls].index
df = df.loc[~(~df['Id'].isin(nulls_idx) & df['Product'].isnull())]
print(df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Use groupby + transform with 'count', then boolean indexing using isnull on the Product column:
count = df.groupby('Id')['Product'].transform('count')
df = df[~(count.ne(0) & df.Product.isnull())]
print(df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Here is a fictitious example:
id cluster
1 3
2 3
3 3
4 1
5 5
So the cluster for id 4 and 5 should be replaced by some text.
So, I'm able to find which values have a frequency of less than 3 using:
counts = distclust.groupby("cluster")["cluster"].count()
counts[counts < 3].index.values
Now I'm not sure how to go and replace these values in my dataframe with some arbitrary text (e.g. "noise").
I think that is enough information; let me know if you'd like me to include anything else.
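For reference, a construction of the example frame (the answers below call it df; the question's distclust is assumed to be the same data):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'cluster': [3, 3, 3, 1, 5]})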
In [82]: df.groupby('cluster').filter(lambda x: len(x) <= 2)
Out[82]:
id cluster
3 4 1
4 5 5
updating:
In [95]: idx = df.groupby('cluster').filter(lambda x: len(x) <= 2).index
In [96]: df.loc[idx, 'cluster'] = -999
In [97]: df
Out[97]:
id cluster
0 1 3
1 2 3
2 3 3
3 4 -999
4 5 -999
df.cluster.replace((df.cluster.value_counts()<=1).replace({True:'noise',False:np.nan}).dropna())
Out[627]:
0 3
1 3
2 3
3 noise
4 noise
Name: cluster, dtype: object
Then assign it back:
df.cluster=df.cluster.replace((df.cluster.value_counts()<=1).replace({True:'noise',False:np.nan}).dropna())
df
Out[629]:
id cluster
0 1 3
1 2 3
2 3 3
3 4 noise
4 5 noise