Identifying groups with the same column value and counting them - python

I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s in 'group' the same integer value: the first run of 1s gets 1, the next gets 2, and so on. After each row where the continuity value is 0, the counting should start again at 1.
Since this question is rather specific, I'm not sure how to tackle it in a vectorized way. Below is an example, where the first two columns are the input and the last column is the output I'd like to have.
continuity  group  group_id
1           0      0
1           1      1
1           1      1
1           1      1
1           0      0
1           1      2
1           1      2
1           1      2
1           0      0
1           0      0
1           1      3
1           1      3
0           1      1
0           0      0
1           1      1
1           1      1
1           0      0
1           0      0
1           1      2
1           1      2
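For reference, a minimal setup that reproduces the example input (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'continuity': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    'group':      [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})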

I believe you can use:
import numpy as np

#get unique run identifiers in both columns
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
#identify the first row of each run where group is 1
c = ~b.duplicated() & (df['group'] == 1)
#cumulative sum of those run starts within each continuity run; 0 where group is 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print(df)
    continuity  group  group_id  new
0            1      0         0    0
1            1      1         1    1
2            1      1         1    1
3            1      1         1    1
4            1      0         0    0
5            1      1         2    2
6            1      1         2    2
7            1      1         2    2
8            1      0         0    0
9            1      0         0    0
10           1      1         3    3
11           1      1         3    3
12           0      1         1    1
13           0      0         0    0
14           1      1         1    1
15           1      1         1    1
16           1      0         0    0
17           1      0         0    0
18           1      1         2    2
19           1      1         2    2
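If the intermediate steps are unclear, printing them side by side (a quick inspection sketch, reusing b and c from above) shows how the pieces combine: b labels every consecutive run in each column, c flags the first row of each run of 1s, and the grouped cumsum numbers those flags within each continuity run.
print(pd.concat([df[['continuity', 'group']],
                 b.add_suffix('_run'),
                 c.rename('is_run_start')], axis=1))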

Related

replacing the value of one column conditional on two other columns in pandas

I have a data-frame df:
year  ID  category
1     1   0
2     1   1
3     1   1
4     1   0
1     2   0
2     2   0
3     2   1
4     2   0
I want to create a new column such that, within each ID, once 'category' is 1 in a particular year, 'new_category' will be 1 for that year and all upcoming years:
year  ID  category  new_category
1     1   0         0
2     1   1         1
3     1   1         1
4     1   0         1
1     2   0         0
2     2   0         0
3     2   1         1
4     2   0         1
I have tried an if-else condition, but I am getting the same 'category' column:
for row in range(1, df.category[i-1]):
    df['new_category'] = df['category'].replace('0', df['category'].shift(1))
But I am not getting the desired column.
TRY:
df['new_category'] = df.groupby('ID')['category'].cummax()
OUTPUT:
   year  ID  category  new_category
0     1   1         0             0
1     2   1         1             1
2     3   1         1             1
3     4   1         0             1
4     1   2         0             0
5     2   2         0             0
6     3   2         1             1
7     4   2         0             1
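The trick is that cummax keeps the running maximum within each ID, so once category hits 1 it never drops back to 0. A minimal, self-contained sketch (data copied from the question):
import pandas as pd

df = pd.DataFrame({'year': [1, 2, 3, 4, 1, 2, 3, 4],
                   'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                   'category': [0, 1, 1, 0, 0, 0, 1, 0]})
# running maximum per ID: once a 1 appears, all later rows stay 1
df['new_category'] = df.groupby('ID')['category'].cummax()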

Flag creation based on count of consecutive ones in a column

I have a data frame with a column containing only 0s and 1s. I need to create a flag column that is 1 wherever there are at least a certain number x of consecutive ones in the first column.
In the example below x = 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
    col1  Flag
0      1     0
1      0     0
2      1     1
3      1     1
4      1     1
5      1     1
6      0     0
7      1     0
8      1     0
9      0     0
10     1     1
11     1     1
12     1     1
13     1     1
14     1     1
15     0     0
One change: let's say there is a new column, Group; we need to group by it and compute the flag.
    Group  col1  Flag
0       A     1     0
1       B     0     0
2       B     1     1
3       B     1     1
4       B     1     1
5       B     1     1
6       C     0     0
7       C     1     0
8       C     1     0
9       C     0     0
10      D     1     0
11      D     1     0
12      D     1     0
13      E     1     0
14      E     1     0
15      E     0     0
As you can see, there are consecutive ones from rows 10 to 14, but they belong to different groups, and the elements of a group can appear in any order.
Not that hard: use cumsum to create the run key, then do a transform count per group:
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0     0
1     0
2     1
3     1
4     1
5     1
6     0
7     0
8     0
9     0
10    1
11    1
12    1
13    1
14    1
15    0
Name: col1, dtype: int32
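A note on the threshold: the key df.col1.ne(1).cumsum() increments on every 0, so each run of ones is grouped together with the 0 that precedes it, and a run of four ones forms a group of size five; that is why the code checks ge(5) rather than ge(4). For the sample data the key looks like this:
df.col1.ne(1).cumsum().tolist()
# [0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4]
(One caveat: a qualifying run at the very start of the series has no leading 0 and would be missed by this check.)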
You can achieve this in a couple of steps:
rolling(4).sum() to compute the sum of each window of 4 consecutive values in your column
Use where to keep the 1s from "col1" whose rolling-window sum (from the previous step) is >= 4, turning all other values into NaN
bfill(limit=3) to backward-fill the leftover 1s in your column by a maximum of 3 places
fillna(0) to fill what's left over with 0
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
    col1  Flag  my_flag
0      1     0        0
1      0     0        0
2      1     1        1
3      1     1        1
4      1     1        1
5      1     1        1
6      0     0        0
7      1     0        0
8      1     0        0
9      0     0        0
10     1     1        1
11     1     1        1
12     1     1        1
13     1     1        1
14     1     1        1
15     0     0        0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
                       .rolling(4)
                       .sum()
                       .ge(4)
                       .reset_index(level=0, drop=True))
df["my_flag"] = (df["col1"]
                 .where(grouped_counts_ge_4)
                 .bfill(limit=3)  # moving backwards from the surviving values, fill in a maximum of 3 NaNs
                 .fillna(0)       # fill the remaining NaNs with 0
                 .astype(int))    # cast back to integer, since the intermediate steps work with floats
print(df)
    Group  col1  Flag  my_flag
0       A     1     0        0
1       B     0     0        0
2       B     1     1        1
3       B     1     1        1
4       B     1     1        1
5       B     1     1        1
6       C     0     0        0
7       C     1     0        0
8       C     1     0        0
9       C     0     0        0
10      D     1     0        0
11      D     1     0        0
12      D     1     0        0
13      E     1     0        0
14      E     1     0        0
15      E     0     0        0
Try this:
df['Flag'] = np.where(
    df['col1']
      .groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum())
      .transform('size')
      .ge(4),
    1, 0)
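This works because the grouping key (df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum() starts a new group at every value change and at every 0, so each run of consecutive ones gets its own group while every 0 sits in a group of size 1; transform('size').ge(4) then flags exactly the runs of at least four ones.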

Column number detection using dataframe pattern

I have a dataset appearing something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10
1        1      1      1      1      1      1      0      1      1      1
2        0      0      1      1      1      1      1      1      1      0
3        0      1      0      0      1      1      1      1      1      1
4        1      0      1      0      1      1      1      0      1      0
5        1      0      0      1      1      1      1      1      1      1
6        0      0      0      0      1      1      1      1      1      1
Each row is a record for an employee, identified by the 'empl_ID' column. I am trying to write Python code that finds the first occurrence of '1' in each record. For example, for empl_ID 1 the first '1' occurs in the day_1 column, so the label would be 1. For empl_ID 2, the first '1' occurs in column day_3, so the label will be 3. Similarly, the labels for the other employees will be 2, 1, 1 and 5 respectively. The resulting dataset looks something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10  label
1        1      1      1      1      1      1      0      1      1      1       1
2        0      0      1      1      1      1      1      1      1      0       3
3        0      1      0      0      1      1      1      1      1      1       2
4        1      0      1      0      1      1      1      0      1      0       1
5        1      0      0      1      1      1      1      1      1      1       1
6        0      0      0      0      1      1      1      1      1      1       5
If someone could please help me write Python code for the above problem, that would be very helpful. Thanks in advance!
s = df.set_index('empl_ID').idxmax(axis=1).str.split('_').str[-1]
empl_ID
1    1
2    3
3    2
4    1
5    1
6    5
dtype: object
df['new'] = s.values
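Note that idxmax(axis=1) returns the name of the first column holding the row-wise maximum, which for 0/1 data is the first column containing a 1 (this assumes every row has at least one 1). If you want the label as an integer rather than a string, a small variation of the same logic with a cast at the end:
df['label'] = (df.set_index('empl_ID')
                 .idxmax(axis=1)
                 .str.split('_').str[-1]
                 .astype(int)   # day number as an integer instead of a string
                 .values)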

ffill with a groupby and matching a condition

I want to forward fill the value of log for each id, starting from the first 1 found in that id's log column.
Example:
df
id  log
1   0
1   1
1   0
1   0
2   1
2   0
3   1
3   0
3   1
to
id  log  ffil_log
1   0    0
1   1    1
1   0    1
1   0    1
2   1    1
2   0    1
3   1    1
3   0    1
3   1    1
My try was:
df['ffil_log'] = df.log.where(df.log == 1).groupby(df.id).ffill()
You can use groupby with cummax:
df['ffil_log'] = df.groupby('id')['log'].cummax()
For each id, once a 1 is reached in a row, it stays the value for every row after, and you get the expected result:
   id  log  ffil_log
0   1    0         0
1   1    1         1
2   1    0         1
3   1    0         1
4   2    1         1
5   2    0         1
6   3    1         1
7   3    0         1
8   3    1         1
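For what it's worth, the original attempt was close: where() turns every row that is not 1 into NaN, so after the grouped ffill the rows before the first 1 stay NaN instead of becoming 0. Filling those in and casting back to int repairs it, though cummax is simpler:
df['ffil_log'] = (df.log.where(df.log == 1)
                        .groupby(df.id).ffill()
                        .fillna(0)     # rows before the first 1 stay NaN after ffill; make them 0
                        .astype(int))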

Create Duplicate Rows and Change Values in Specific Columns

How do I create x duplicates of a given row in the dataframe, changing one or more values in specific columns? The new rows are then appended to the end of the same dataframe.
   A  B  C  D  E  F
0  1  1  0  1  1  0
1  2  2  1  1  1  0
2  2  2  1  1  1  0
3  2  2  1  1  1  0
4  1  1  0  1  1  0   <- create 25 duplicates of this row (4) and change variable C to 1
5  1  1  0  1  1  0
6  2  2  1  1  1  0
7  2  2  1  1  1  0
8  2  2  1  1  1  0
9  1  1  0  1  1  0
I repeat only 10 times to keep the length of the result reasonable.
df.append(df.loc[[4] * 10].assign(C=1), ignore_index=True)  # [4] * 10 -> number of repeats
    A  B  C  D  E  F
0   1  1  0  1  1  0
1   2  2  1  1  1  0
2   2  2  1  1  1  0
3   2  2  1  1  1  0
4   1  1  0  1  1  0
5   1  1  0  1  1  0
6   2  2  1  1  1  0
7   2  2  1  1  1  0
8   2  2  1  1  1  0
9   1  1  0  1  1  0
10  1  1  1  1  1  0
11  1  1  1  1  1  0
12  1  1  1  1  1  0
13  1  1  1  1  1  0
14  1  1  1  1  1  0
15  1  1  1  1  1  0
16  1  1  1  1  1  0
17  1  1  1  1  1  0
18  1  1  1  1  1  0
19  1  1  1  1  1  0
Per comments, try:
df.append(df.loc[[4] * 10].assign(**{'C': 1}), ignore_index=True)
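A portability note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same result needs pd.concat, roughly:
new_rows = df.loc[[4] * 10].assign(C=1)            # ten copies of row 4 with C set to 1
df = pd.concat([df, new_rows], ignore_index=True)  # append and renumber the index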
I am using repeat and reindex:
s = df.iloc[[4]]                   # pick the row you want to repeat
s = s.reindex(s.index.repeat(45))  # repeat the row the given number of times
# s = pd.DataFrame([df.iloc[4].tolist()] * 25)  # use this line instead if you need more speed
s.loc[:, 'C'] = 1                  # change the value
pd.concat([df, s])                 # append to the original df
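Note that pd.concat([df, s]) keeps the repeated index label 4 on the new rows; pass ignore_index=True if you want the result renumbered from 0 as in the other answer.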
