Following is the dataframe I have. The 'Target' column is the desired output.
Group Item Value Target
1 0 5 0
1 1 4 0
1 0 6 0
1 0 3 1
1 1 2 0
1 0 1 1
2 1 8 0
2 0 9 0
2 0 7 1
Within a given Group, for each row where Item == 1, I am trying to find the first subsequent row whose Value is less than that row's Value. For example, in the second row Item == 1 and its Value is 4; the first later row with a Value below 4 is the 4th row, which has a Value of 3, so the Target column marks that row with a 1. Two Item == 1 rows may be matched by the same later row; in that case Target is still just 1.
import pandas as pd
df = pd.DataFrame({'Group': [1,1,1,1,1,1,2,2,2], 'Item': [0,1,0,0,1,0,1,0,0], 'Value': [5,4,6,3,2,1,8,9,7], 'Target': [0,0,0,1,0,1,0,0,1]})
df['next_Value'] = df.groupby(['Group'])['Value'].shift(-1)
Create a helper key with cumsum, then use transform to get the first Value of each helper group, and compare each Value within the group to that first Value; if it is smaller, return 1.
df['helpkey'] = df.groupby('Group').Item.cumsum()
df['New'] = (df.Value < df.groupby(['Group', 'helpkey']).Value.transform('first')).astype(int)
df
Out[51]:
Group Item Value Target helpkey New
0 1 0 5 0 0 0
1 1 1 4 0 1 0
2 1 0 6 0 1 0
3 1 0 3 1 1 1
4 1 1 2 0 2 0
5 1 0 1 1 2 1
6 2 1 8 0 1 0
7 2 0 9 0 1 0
8 2 0 7 1 1 1
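As a follow-up, if the helper columns are not wanted in the final frame, a small cleanup sketch (assuming the df built above):

df['Target'] = df['New']                      # 'New' reproduces the desired 'Target' column
df = df.drop(columns=['helpkey', 'New'])      # drop the helpers once verified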
I have a data frame with a column containing only 0s and 1s. I need to create a flag column that marks places where the first column has a certain number of consecutive ones or more.
In the example below, the threshold is x >= 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column Group; we need to group by that column and then find the flag.
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from rows 10 to 14, but they belong to different groups. And the elements in a group can be in any order.
Not that hard: try cumsum to create the key, then do the transform count:
# each 0 starts a new key; a key's size counts that 0 plus the following run of 1s, so ge(5) means at least 4 consecutive ones
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
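The one-liner above treats the column as a single sequence, matching the first (ungrouped) example. For the grouped follow-up, here is a hedged sketch of the same run-key idea, restarting a run whenever either col1 or Group changes (column names taken from the example above):

import pandas as pd

df = pd.DataFrame({
    'Group': list('ABBBBBCCCCDDDEEE'),
    'col1':  [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
})

# A new run starts whenever the value changes or the Group changes,
# so the run key is the cumulative count of those change points.
run_key = (df['col1'].ne(df['col1'].shift())
           | df['Group'].ne(df['Group'].shift())).cumsum()

# Flag rows that are 1 and sit in a run of at least 4 identical values.
run_len = df.groupby(run_key)['col1'].transform('size')
df['Flag'] = (df['col1'].eq(1) & run_len.ge(4)).astype(int)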
You can achieve this in a couple of steps:
rolling(4).sum() to obtain the running sums over 4 consecutive values of your column
Use where to keep the 1s from "col1" whose summation window (from the previous step) is >= 4; turn the rest of the values into np.nan
bfill(limit=3) to backward-fill the kept 1s in your column by a maximum of 3 places
fillna(0) to fill whatever is left over with 0
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
.rolling(4)
.sum()
.ge(4)
.reset_index(level=0, drop=True))
df["my_flag"] = (df["col1"]
.where(grouped_counts_ge_4)
.bfill(limit=3) # Moving backwards from our leftover values, take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
key = (df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()
df['Flag'] = np.where(df['col1'].groupby(key).transform('size').ge(4), 1, 0)
I have a dataframe, shown below. The column named target is my desired output:
group value target
1 1 0
1 2 0
1 3 2
1 4 0
1 5 1
2 1 0
2 2 0
2 3 0
2 4 1
2 5 3
Now I want to find the first non-zero value in the target column for each group and remove rows before that row in each group. So the output should be like this:
group value target
1 3 2
1 4 0
1 5 1
2 4 1
2 5 3
I have seen this post, but I don't know how to change the code to get my desired result.
How can I do this?
In the groupby, set sort to False, take the cumulative sum, then filter for rows where it is not equal to 0:
df.loc[df.groupby(["group"], sort=False).target.cumsum() != 0]
group value target
2 1 3 2
3 1 4 0
4 1 5 1
8 2 4 1
9 2 5 3
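Note that the cumsum trick relies on target being non-negative; if negative values could cancel out earlier ones, a sketch that only looks for the first non-zero value per group (assuming the same frame) would be:

# True from the first non-zero target onwards, within each group.
mask = df.groupby('group')['target'].transform(lambda s: s.ne(0).cumsum().gt(0))
out = df.loc[mask]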
This should do it. I'm sure you can do it with fewer reset_index() calls, but that shouldn't affect the speed too much if your dataframe isn't too big:
idx = dff[dff.target.ne(0)].reset_index().groupby('group').index.first()
mask = (dff.reset_index().set_index('group')['index'].ge(idx.to_frame()['index'])).values
df_final = dff[mask]
Output:
0 group value target
3 1 3 2
4 1 4 0
5 1 5 1
9 2 4 1
10 2 5 3
I have a dataframe consisting of PartialRoutes (which together make up full routes) and a treatment variable, and I am trying to reduce the dataframe to full routes by grouping the partial routes together while keeping the treatment variable.
To make this more clear, the df looks like
PartialRoute Treatment
0 1
1 0
0 0
0 0
1 0
2 0
3 0
0 0
1 1
2 0
where every 0 in 'PartialRoute' starts a new group, meaning I always want to group all rows together until a new route starts, i.e. until the next 0.
So in this example there are 4 groups:
PartialRoute Treatment
0 1
1 0
-----------------
0 0
-----------------
0 0
1 0
2 0
3 0
-----------------
0 0
1 1
2 0
-----------------
and the result should look like
Route Treatment
0 1
1 0
2 0
3 1
Is there an elegant way to solve this?
Create groups by comparing to 0 with Series.eq and taking the cumulative sum with Series.cumsum, then aggregate per group, e.g. with sum or max:
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 1 1
1 2 0
2 3 0
3 4 1
Detail:
print (df['PartialRoute'].eq(0).cumsum())
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 4
8 4
9 4
Name: PartialRoute, dtype: int32
If first value of DataFrame is not 0 get different groups - starting by 0:
print (df)
PartialRoute Treatment
0 1 1
1 1 0
2 0 0
3 0 0
4 1 0
5 2 0
6 3 0
7 0 0
8 1 1
9 2 0
print (df['PartialRoute'].eq(0).cumsum())
0 0
1 0
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
Name: PartialRoute, dtype: int32
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 0 1
1 1 0
2 2 0
3 3 1
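To match the desired output exactly (a 0-based Route id instead of the cumsum key), a small follow-up sketch, assuming the frame from the example above:

key = df['PartialRoute'].eq(0).cumsum()
df1 = (df.groupby(key)['Treatment'].max()   # or .sum(), as above
         .reset_index(drop=True)            # 0-based ids regardless of where the key starts
         .rename_axis('Route')
         .reset_index())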
I have a dataset which contains the columns week, shop, Item number and price. I also have an array of unique numbers, which match the Item numbers but in a different order.
I want to add new columns to this dataset based on these unique numbers. First, I need to group the dataset by week and shop. Then, within a particular week and shop, I need to find the Item number that is equal to the new column's name (an element of the array of unique numbers); if there is no such row, fill it with null.
Then I should fill all the rows for that particular week and shop with the price of that Item number.
Here is some code that I've tried, but it runs very slowly because the number of rows is very large.
# real dataset
data2
weeks = data2['Week'].unique()
for k in range(len(Unique_numbers)):
    for i in range(len(weeks)):
        temp_array = data2.loc[data2["Week"] == weeks[i]]
        stores = temp_array['Shop'].unique()
        for j in range(len(stores)):
            temp_array2 = temp_array.loc[temp_array["Shop"] == stores[j]]
            price = temp_array2.loc[temp_array2["Item number"] == Unique_numbers[k], "Price"]
            if price.empty:
                price = 0
            else:
                price = price.values[0]
            data2.loc[(data2["Week"] == weeks[i]) & (data2["Shop"] == stores[j]), Unique_numbers[k]] = price
I want something like this
Unique_numbers = [0,1,2,3]
dataframe before
week; shop; Item number; price
1 1 0 2
1 2 1 3
2 1 3 4
2 1 2 5
3 4 1 6
3 1 2 7
dataframe after
week; shop; Item number; price; 0; 1; 2; 3
1 1 0 2 2 0 0 0
1 2 1 3 0 3 0 0
2 1 3 4 0 0 5 4
2 1 2 5 0 0 5 4
3 4 1 6 0 6 0 0
3 1 2 7 0 0 7 0
Setup
u = df['Item number'].to_numpy()
w = np.asarray(Unique_numbers)
g = [df.week, df.shop]
Using some broadcasted comparison here (assumes that all of your price values are greater than 0).
pd.DataFrame(
    np.equal.outer(u, w) * df['price'].to_numpy()[:, None]
).groupby(g).transform('max')
0 1 2 3
0 2 0 0 0
1 0 3 0 0
2 0 0 5 4
3 0 0 5 4
4 0 6 0 0
5 0 0 7 0
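To attach these new columns to the original frame, a possible follow-up (a sketch, reusing u, w and g from the Setup above and assuming Unique_numbers should become the column labels):

wide = (pd.DataFrame(np.equal.outer(u, w) * df['price'].to_numpy()[:, None],
                     columns=Unique_numbers, index=df.index)
          .groupby(g)
          .transform('max'))   # per (week, shop), broadcast each item's price across the group
out = df.join(wide)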
This turns out to be a combination of pivot_table and merge:
df.merge(df.pivot_table(index=['week', 'shop'],
columns='Item number',
values='price',
fill_value=0)
.reindex(Unique_numbers, axis=1),
left_on=['week', 'shop'],
right_index=True,
how='left'
)
Output:
week shop Item number price 0 1 2 3
0 1 1 0 2 2 0 0 0
1 1 2 1 3 0 3 0 0
2 2 1 3 4 0 0 5 4
3 2 1 2 5 0 0 5 4
4 3 4 1 6 0 6 0 0
5 3 1 2 7 0 0 7 0
I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s in group the same integer value, where the first such run of rows gets a 1, the next a 2, and so on. Whenever the continuity value of a row is 0, the counting should restart at 1.
Since this question is rather specific, I'm not sure how to tackle it in a vectorized way. Below is an example, where the first two columns are the input and the third column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
#get unique groups in both columns
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
#identify first 1
c = ~b.duplicated() & (df['group'] == 1)
#cumulative sum of the first-of-run flags per continuity block, only where group is 1, else 0
df['new'] = np.where(df['group'] == 1,
c.groupby(b['continuity']).cumsum(),
0).astype(int)
print (df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2