Group one column of dataframe by variable index - python

I have a dataframe consisting of PartialRoutes (which together make up full routes) and a treatment variable, and I am trying to reduce the dataframe to the full routes by grouping the partial routes together while keeping the treatment variable.
To make this clearer, the df looks like
PartialRoute Treatment
0 1
1 0
0 0
0 0
1 0
2 0
3 0
0 0
1 1
2 0
where every 0 in 'PartialRoute' starts a new group, which means I always want to group all values until a new route starts, i.e. until a new 0 appears.
So in this example there are 4 groups
PartialRoute Treatment
0 1
1 0
-----------------
0 0
-----------------
0 0
1 0
2 0
3 0
-----------------
0 0
1 1
2 0
-----------------
and the result should look like
Route Treatment
0 1
1 0
2 0
3 1
Is there an elegant way to solve this?

Create groups by comparing to 0 with Series.eq and taking the cumulative sum with Series.cumsum, then aggregate per group, e.g. by sum or max:
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 1 1
1 2 0
2 3 0
3 4 1
Detail:
print (df['PartialRoute'].eq(0).cumsum())
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 4
8 4
9 4
Name: PartialRoute, dtype: int32
If the first value of the DataFrame is not 0, you get different group numbers, starting from 0:
print (df)
PartialRoute Treatment
0 1 1
1 1 0
2 0 0
3 0 0
4 1 0
5 2 0
6 3 0
7 0 0
8 1 1
9 2 0
print (df['PartialRoute'].eq(0).cumsum())
0 0
1 0
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
Name: PartialRoute, dtype: int32
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 0 1
1 1 0
2 2 0
3 3 1
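Since Treatment appears to be a 0/1 flag, aggregating with max instead of sum gives the same per-route result. A minimal self-contained sketch using the question's data:
import pandas as pd

df = pd.DataFrame({
    'PartialRoute': [0, 1, 0, 0, 1, 2, 3, 0, 1, 2],
    'Treatment':    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})
# Each 0 in PartialRoute starts a new route; the cumulative sum numbers the routes.
routes = df['PartialRoute'].eq(0).cumsum()
df1 = df.groupby(routes)['Treatment'].max().reset_index()
print(df1)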

Related

Increment the value in a new column based on a condition using an existing column

I have a pandas dataframe with two columns:
temp_1 flag
1 0
1 0
1 0
2 0
3 0
4 0
4 1
4 0
5 0
6 0
6 1
6 0
and I want to create a new column named "final" based on the following rule:
if "flag" has the value 1, then "temp_1" and all following values are incremented by 1; each further 1 in the flag column increments the previous "final" values by 1 again (please refer to the expected output).
I have tried using .cumsum() with filters but am not getting the desired result.
Expected output
temp_1 flag final
1 0 1
1 0 1
1 0 1
2 0 2
3 0 3
4 0 4
4 1 5
4 0 5
5 0 6
6 0 7
6 1 8
6 0 8
Just add the cumulative sum of flag to temp_1:
>>> df['final'] = df['temp_1'] + df['flag'].cumsum()
>>> df
temp_1 flag final
0 1 0 1
1 1 0 1
2 1 0 1
3 2 0 2
4 3 0 3
5 4 0 4
6 4 1 5
7 4 0 5
8 5 0 6
9 6 0 7
10 6 1 8
11 6 0 8
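Why this works: each 1 in flag permanently raises every later row by one, and a running total of flag is exactly that cumulative shift. A self-contained reproduction using the question's data:
import pandas as pd

df = pd.DataFrame({
    'temp_1': [1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6],
    'flag':   [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
# cumsum counts how many 1s have appeared so far; each 1 permanently
# shifts every subsequent row up by one.
df['final'] = df['temp_1'] + df['flag'].cumsum()
print(df)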

replacing the value of one column conditional on two other columns in pandas

I have a data-frame df:
year ID category
1 1 0
2 1 1
3 1 1
4 1 0
1 2 0
2 2 0
3 2 1
4 2 0
I want to create a new column such that, once 'category' is 1 in a particular 'year', 'new_category' stays 1 for all upcoming years of the same ID:
year ID category new_category
1 1 0 0
2 1 1 1
3 1 1 1
4 1 0 1
1 2 0 0
2 2 0 0
3 2 1 1
4 2 0 1
I have tried an if-else approach, but I keep getting the same values as in the 'category' column:
for row in range(1,df.category[i-1]):
df['new_category'] = df['category'].replace('0',df['category'].shift(1))
But I am not getting the desired column
TRY:
df['new_category'] = df.groupby('ID')['category'].cummax()
OUTPUT:
year ID category new_category
0 1 1 0 0
1 2 1 1 1
2 3 1 1 1
3 4 1 0 1
4 1 2 0 0
5 2 2 0 0
6 3 2 1 1
7 4 2 0 1
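Why this works: cummax carries the running maximum forward within each ID, so once category hits 1 it stays 1 for all later rows of that ID. A minimal self-contained check using the question's data:
import pandas as pd

df = pd.DataFrame({
    'year': [1, 2, 3, 4, 1, 2, 3, 4],
    'ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'category': [0, 1, 1, 0, 0, 0, 1, 0],
})
# Running maximum of 'category' within each ID: sticks at 1 once a 1 is seen.
df['new_category'] = df.groupby('ID')['category'].cummax()
print(df)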

Flag creation based on count of consecutive ones in a column

I have a data frame with a column containing only 0s and 1s. I need to create a flag column that is 1 wherever there are at least a certain number of consecutive ones in the first column.
In the example below, with x >= 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column, Group; we need to group by that as well when computing the flag:
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from rows 10 to 14, but they belong to different groups. The elements of a group can be in any order.
Not that hard: create the group key with cumsum, then count per group with transform:
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
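To see why ge(5) corresponds to four consecutive ones: each 0 starts a new group that also contains the run of 1s following it, so a run of four ones lives in a group of size five. A sketch inspecting the key on the question's data:
import pandas as pd

col1 = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
df = pd.DataFrame({'col1': col1})
key = df.col1.ne(1).cumsum()
print(key.tolist())
# [0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4]
# group 1 holds one 0 plus four 1s (size 5), group 3 one 0 plus five 1s (size 6);
# both pass ge(5), and the eq(1) mask then removes the leading 0s.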
You can achieve this in a couple of steps:
rolling(4).sum() to obtain sums over four consecutive rows of your column
Use where to keep the 1s from "col1" whose summation window (from the previous step) is >= 4, and turn the rest of the values into np.NaN
bfill(limit=3) to backward-fill the leftover 1s in your column by a maximum of 3 places
fillna(0) to fill whatever is left over with 0
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
                       .rolling(4)
                       .sum()
                       .ge(4)
                       .reset_index(level=0, drop=True))
df["my_flag"] = (df["col1"]
                 .where(grouped_counts_ge_4)
                 .bfill(limit=3)  # moving backwards from our leftover values, fill in a maximum of 3 NaNs
                 .fillna(0)       # fill in the rest of the NaNs with 0
                 .astype(int))    # cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
df['Flag'] = np.where(df['col1'].groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()).transform('size').ge(4), 1, 0)
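The question's grouped variant can be handled the same way by adding Group to the grouper, which breaks runs of 1s that cross group borders. A sketch of that extension (my own, not part of the original answer), using the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group': list('ABBBBBCCCCDDDEEE'),
    'col1': [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
})
# A new run starts whenever col1 changes or is 0.
runs = (df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum().rename('run')
# Grouping by (Group, run) splits runs of 1s at group borders.
size = df.groupby(['Group', runs])['col1'].transform('size')
df['Flag'] = np.where(size.ge(4) & df['col1'].eq(1), 1, 0)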

Identifying groups with same column value and count them

I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s in 'group' the same integer value: the first such group of rows gets 1, the next 2, etc. Each time the continuity value of a row is 0, the counting should start again at 1.
Since this question is rather specific, I'm not sure how to tackle it in a vectorized way. Below is an example, where the first two columns are the input and the third column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
#get unique groups in both columns
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
#identify first 1
c = ~b.duplicated() & (df['group'] == 1)
#cumulative sum of the first-row markers per continuity block where group is 1, else 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print (df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2
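To make the interplay of the helpers concrete, you can inspect c on the question's data: it is True at the first row of every run of 1s in 'group' (a run is also cut where continuity changes), and the cumulative sum of those markers per continuity block yields the restarting counter. A self-contained sketch of that inspection:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'continuity': [1]*12 + [0, 0] + [1]*6,
    'group': [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})
b = df[['continuity', 'group']].ne(df[['continuity', 'group']].shift()).cumsum()
c = ~b.duplicated() & (df['group'] == 1)
print(np.flatnonzero(c))  # [ 1  5 10 12 14 18]; row 12 counts as a new run because continuity changes there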

Efficiently finding pandas rows (or parts of rows) with unique values

Given a pandas dataframe with one row per individual/record, each row includes a property value and its evolution across time (0 to N).
A schedule includes the estimated values of a variable 'property' for a number of entities from day 1 to day 10 in the following example.
I want to filter the entities whose value stays constant (a single unique value) over a given period and get those values.
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from io import StringIO  # Python 3; the original used Python 2's StringIO module

import numpy as np
import pandas as pd

schedule = pd.read_csv(StringIO(csv), index_col=0)
print(schedule)
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find records/individuals for who property has not changed during a given period and the corresponding unique values
Here is what I came up with. I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10:
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
df = schedule.loc[res].iloc[:, begin:end]  # .ix is removed in modern pandas
print("df \n%s " % df)
We have :
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print("res : %s\n" % res)
df_f = df.loc[res]
print("df filtered %s \n" % df_f)
res = pd.Series(df_f.values.ravel()).unique().tolist()
print("unique values : %s " % res)
Giving :
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As these operations need to be run many times (millions of times) on a million-row dataframe, I need to be able to run them as quickly as possible.
(@MaxU): schedule can be seen as a database/repository that is updated many times. The repository is then also queried many times for unique values.
Would you have some ideas for improvements or alternative approaches?
Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to (any column will do, since the kept rows are constant across the window):
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe it clarifies things a bit in that direction.
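If the row-wise apply becomes the bottleneck, the constancy test itself can be fully vectorised by comparing every column to the first one. A sketch of that alternative (my own suggestion, not part of the original answer):
# Rows are constant exactly when every column equals the first column.
mask = df.eq(df.iloc[:, 0], axis=0).all(axis=1)
df_f = df[mask]
res = df_f.iloc[:, 0].unique().tolist()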
