Pandas: Cumulative sum within group with two conditions - python

I have a DataFrame that looks like this table:
index     x  y  value_1  cumsum_1  cumsum_2
    0  0.10  1       12        12         0
    1  1.20  1       10        12        10
    2  0.25  1        7        19        10
    3  1.00  2        3         0         3
    4  0.72  2        5         5         3
    5  1.50  2       10         5        13
My aim is to calculate the cumulative sum of value_1, but there are two conditions that must be taken into account.
First: if x is less than 1, the cumsum() is written to column cumsum_1; if x is greater than or equal to 1, it is written to column cumsum_2.
Second: column y indicates groups (1, 2, 3, ...). Whenever the value in y changes, the cumsum() operation starts over. I think the groupby() method would help.
Does somebody have an idea?

You can use .where() with the conditions x < 1 and x >= 1 to temporarily set the values of value_1 to 0 where the condition does not hold, and then apply a grouped cumsum, as follows.
The second condition is handled by the .groupby() call and the first by .where(), as detailed below:
.where() keeps the column values where the condition is True and replaces them (with 0 here) where it is False. For the first condition, x < 1, value_1 keeps its values in the matching rows, and these are fed to the subsequent cumsum step; in rows where x < 1 is False, value_1 is masked to 0, so those rows add nothing to the running total that becomes column cumsum_1.
The second line of code accumulates value_1 into column cumsum_2 with the opposite condition, x >= 1. Together, these two lines allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.
(Thanks to @tdy for the suggestion to simplify the code.)
df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
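To make the intermediate step concrete, this is what the masked Series and its grouped cumsum look like for the first line (a sketch; the values follow from the sample data above):
df['value_1'].where(df['x'] < 1, 0)                            # 12, 0, 7, 0, 5, 0
df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()  # 12, 12, 19, 0, 5, 5  -> cumsum_1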
Result:
print(df)
x y value_1 cumsum_1 cumsum_2
0 0.10 1 12 12 0
1 1.20 1 10 12 10
2 0.25 1 7 19 10
3 1.00 2 3 0 3
4 0.72 2 5 5 3
5 1.50 2 10 5 13
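For a self-contained run, the sample DataFrame can be built directly from the values in the question (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'x': [0.10, 1.20, 0.25, 1.00, 0.72, 1.50],
    'y': [1, 1, 1, 2, 2, 2],
    'value_1': [12, 10, 7, 3, 5, 10],
})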

Here is another approach using a pivot:
(df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
output:
ge1 cumsum_1 cumsum_2
0 12 0
1 12 10
2 19 10
3 0 3
4 5 3
5 5 13
full code:
df[['cumsum_1', 'cumsum_2']] = (df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
(or use pd.concat to concatenate)
output:
index x y value_1 cumsum_1 cumsum_2
0 0 0.10 1 12 12 0
1 1 1.20 1 10 12 10
2 2 0.25 1 7 19 10
3 3 1.00 2 3 0 3
4 4 0.72 2 5 5 3
5 5 1.50 2 10 5 13

Similar to the approaches above, but a little more chained.
df[['cumsum_1a', 'cumsum_2a']] = (
    df.assign(
        v1=lambda temp: temp.x >= 1,              # True where x >= 1
        v2=lambda temp: ~temp.v1 * temp.value_1,  # value_1 where x < 1, else 0
        v3=lambda temp: temp.v1 * temp.value_1    # value_1 where x >= 1, else 0
    )
    .groupby('y')[['v2', 'v3']]
    .cumsum()
)

Related

Resolving conflicts in Pandas dataframe

I am performing record linkage on a dataframe such as:
ID_1 ID_2 Predicted Link Probability
1 0 1 0.9
1 1 1 0.5
1 2 0 0
2 1 1 0.8
2 5 1 0.8
3 1 0 0
3 2 1 0.5
When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link), I want to resolve the conflicts based on the Probability value. If one predicted link has a higher probability than the others, I want to keep its 1 but reverse the other predicted link values for that ID_1 to 0. If the highest probabilities are of equal value, I want to reverse all the predicted link values to 0. If there is only one predicted link, the values should be left as they are.
The resulting dataframe would look like this:
ID_1 ID_2 Predicted Link Probability
1 0 1 0.9
1 1 0 0.5
1 2 0 0
2 1 0 0.8
2 5 0 0.8
3 1 0 0
3 2 1 0.5
I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!
For each ID_1, you want at most one row to keep Predicted Link = 1. Thus, grouping is a good start.
First, let's construct the data:
import pandas as pd
from io import StringIO
csvfile = StringIO(
"""ID_1\tID_2\tPredicted Link\tProbability
1\t0\t1\t0.9
1\t1\t1\t0.5
1\t2\t0\t0
2\t1\t1\t0.8
2\t5\t1\t0.8
3\t1\t0\t0
3\t2\t1\t0.5""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
We want to group by ID_1 and then look for the row holding the maximum Probability within each group. Let's create a mask:
mask_max_proba = df.groupby("ID_1")["Probability"].transform(lambda x: x.eq(x.max()))
mask_max_proba
Out[196]:
0 True
1 False
2 False
3 True
4 True
5 False
6 True
Name: Probability, dtype: bool
Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (there is only one maximum for those ID_1 values), but rows 3 and 4 are not. Let's build a mask that covers both conditions: True if the row holds the maximum and that maximum is unique.
To be precise, for each ID_1, a Probability value that is duplicated cannot be the single winning link, so we build a second mask that excludes duplicated Probability values within each ID_1 group:
mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1)
mask_unique
Out[284]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Probability, dtype: bool
Finally, let's combine the two masks:
df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique)
df
Out[285]:
ID_1 ID_2 Predicted Link Probability
0 1 0 1 0.9
1 1 1 0 0.5
2 1 2 0 0.0
3 2 1 0 0.8
4 2 5 0 0.8
5 3 1 0 0.0
6 3 2 1 0.5
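For reference, the same two masks can also be written compactly with groupby transforms; this is an equivalent sketch of the logic above, not a different method:
# row holds the maximum Probability of its ID_1 group
mask_max_proba = df.groupby("ID_1")["Probability"].transform("max").eq(df["Probability"])
# that Probability value occurs only once within its ID_1 group
mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform("size").eq(1)
df["Predicted Link"] = (mask_max_proba & mask_unique).astype(int)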

Sample Pandas dataframe based on multiple values in column

I'm trying to even up a dataset for machine learning. There are great answers for how to sample a dataframe with two values in a column (a binary choice).
In my case I have many values in column x. I want an equal number of records in the dataframe where x is 0 or not 0, or, in a more complicated example, where the value in x is 0, 5 or some other value.
Examples
x
0 5
1 5
2 5
3 0
4 0
5 9
6 18
7 3
8 5
For the first case:
I have 2 rows where x = 0 and 7 rows where x != 0. The result should balance this up to 4 rows: the two with x = 0 and 2 randomly selected rows where x != 0. Preserving the same index for the sake of illustration:
1 5
3 0
4 0
6 18
For the second case:
I have 2 rows where x = 0, 4 rows where x = 5, and 3 rows where x != 0 && x != 5. The result should balance this up to 6 rows in total: two for each condition. Preserving the same index for the sake of illustration:
1 5
3 0
4 0
5 9
6 18
8 5
I've shown examples with 2 and 3 conditions; a solution that generalises to more would be good. Ideally it should detect the minimum number of rows per group (the count of x = 0 in this example) so I don't need to work that out before writing the condition.
How do I do this with pandas? Can I pass a custom function to .groupby() to do this?
IIUC, you could groupby on the condition whether "x" is 0 or not and sample the smallest-group-size number of entries from each group:
g = df.groupby(df['x']==0)['x']
out = g.sample(n=g.count().min()).sort_index()
An example output:
1 5
3 0
4 0
5 9
Name: x, dtype: int64
For the second case, we could use numpy.select and numpy.unique to get the groups (the rest is essentially the same as above):
import numpy as np
groups = np.select([df['x']==0, df['x']==5], [1,2], 3)
g = df.groupby(groups)['x']
out = g.sample(n=np.unique(groups, return_counts=True)[1].min()).sort_index()
An example output:
2 5
3 0
4 0
5 9
7 3
8 5
Name: x, dtype: int64
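The same idea generalises to any number of conditions. Here is a hedged sketch (the helper name balanced_sample is made up for illustration): build one group label per row from a list of boolean masks, then let groupby do the balancing.
import numpy as np

def balanced_sample(df, masks, random_state=None):
    # label each row by the first mask it matches; unmatched rows form group 0
    labels = np.select(masks, list(range(1, len(masks) + 1)), default=0)
    g = df.groupby(labels)
    # sample the size of the smallest group from every group
    return g.sample(n=g.size().min(), random_state=random_state).sort_index()

# second example: groups are x == 0, x == 5, and everything else
# balanced = balanced_sample(df, [df['x'].eq(0), df['x'].eq(5)])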
IIUC, you want any two non-zero records:
mask = df['x'].eq(0)
pd.concat([df[mask], df[~mask].sample(mask.sum())]).sort_index()
Output:
x
1 5
2 5
3 0
4 0
Part II:
mask0 = df['x'].eq(0)
mask5 = df['x'].eq(5)
pd.concat([df[mask0],
           df[mask5].sample(mask0.sum()),
           df[~(mask0 | mask5)].sample(mask0.sum())]).sort_index()
Output:
x
2 5
3 0
4 0
6 18
7 3
8 5

How to check if there is a row with same value combinations in a dataframe?

I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, is there another row with the same ProjektID, Jahr and Week where freq is not 0? If so, I want a new column "other" with the value 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the duplicated rows (keep=False also captures the first duplicated row) where freq is zero get other set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output:
   MitarbeiterID  ProjektID  Jahr  Monat  Week   mean  freq  last  other
0            583      83224  2020      1     2  3.875     4     0      0
1            373      17364  2020      1     3  5.000     0     4      1
2            923      19234  2020      1     4  5.000     3     3      0
3            643      17364  2020      1     3  4.000     2     2      0
I think the best way to solve this is not to use pandas too much :-) Converting things to sets and tuples should make it fast enough.
The idea is to build a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0 and then, for each line with freq == 0, check whether its triple belongs to that set. In the code below I'm creating a dummy dataset with:
x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1[['id', 'year', 'week']].values, which is a numpy array rather than a pandas DataFrame, so each row can be converted to a tuple. This is necessary because DataFrame rows, numpy arrays and lists are mutable objects that cannot be hashed, so they cannot be put into a set directly. Using a set instead of e.g. a list (which doesn't have this restriction) is purely for lookup efficiency.
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is essentially the column you want, except that we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) turns the values into 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
Looks like I am too late ...:
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
df.other.mask(df.freq == 0,
              df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
              inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
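The same membership test can also be phrased as one grouped check; this is a hedged sketch using the column names from the question, not a drop-in replacement for the answers above:
# True for every row whose (ProjektID, Jahr, Week) group contains a non-zero freq
has_nonzero = (df.groupby(['ProjektID', 'Jahr', 'Week'])['freq']
                 .transform(lambda s: (s != 0).any()))
df['other'] = ((df['freq'] == 0) & has_nonzero).astype(int)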

2-dimensional bins from a pandas DataFrame based on 3 columns

I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned together with all z <= 0 values that have a lower index within the same group N than the distance itself. Scatters is a copy of the index, left over from an earlier stage of my code that is not relevant here; nonetheless I use it instead of the index in the example below. The bins for the distances and for z are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan

# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True),
                                     pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (more than 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest calculating the criteria in extra columns and then using pandas' standard binning functions, like qcut. They can be applied separately along the two binning dimensions. Not the most elegant, but definitely vectorized.
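As a minimal illustration of that suggestion (it does not reproduce the full lower-index, within-group logic of the question), the two dimensions can be binned independently with pd.cut and counted in one pass. The bin edges below are assumptions based on the stated 0.1 m and 10 m steps, and df stands for the snippet above:
import numpy as np
import pandas as pd

bins_v = np.arange(0.0, 0.5 + 0.1, 0.1)     # depth bins, 0.1 m steps
bins_h = np.arange(0.0, 40.0 + 10.0, 10.0)  # distance bins, 10 m steps

# treating |z| as depth here; adapt to your own sign convention
depth_bin = pd.cut(df['z'].abs(), bins=bins_v, include_lowest=True)
dist_bin = pd.cut(df['Dist_first'], bins=bins_h, include_lowest=True)

# 2-D histogram: rows are depth bins, columns are distance bins
counts = pd.crosstab(depth_bin, dist_bin, dropna=False)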

pandas dataframe apply using additional arguments

With the example below:
df = pd.DataFrame({'signal':[1,0,0,1,0,0,0,0,1,0,0,1,0,0],'product':['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],'price':[1,2,3,4,5,6,7,1,2,3,4,5,6,7],'price2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals to Price if 'signal' is 1. Price_B equals previous row's Price_B if signal is 0. If the subgroup starts with a 0 'signal', then 'price_B' will be kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subgroups separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much simpler without the extra function:
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal == 1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
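To cover both price columns, as the question asks, one hedged option is to reuse fill_price in a loop over the columns (a sketch, not part of the original answers; the _B suffix naming is an assumption):
# fills price_B and price2_B per product group
for col in ['price', 'price2']:
    df[col + '_B'] = (df.groupby('product', group_keys=False)
                        .apply(lambda g, c=col: fill_price(g, 'signal', c)))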
