I have this dataframe :
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and merge consecutive rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it's far too slow with over a million rows.
So the expected outcome would be :
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First, I can get the required difference with df['diff'] = df['start'].shift(-1) - df['end']. How can I group rows based on this condition within each id?
Thanks!
I believe you can create groups by subtracting the per-id shifted end (DataFrameGroupBy.shift) from start, checking whether the gap is greater than 10, and taking the cumulative sum; then pass the result to GroupBy.agg:
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id', g])
        .agg({'start': 'first', 'end': 'last'})
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
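To see why the grouping key works, here is a small sketch inspecting the intermediate values (rebuilding the sample DataFrame here is just for illustration):
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 2, 2],
                   'start': [1, 13, 30, 36, 2, 8, 25],
                   'end':   [2, 27, 35, 40, 5, 10, 30]})

# gap between this row's start and the previous row's end within each id
gap = df['start'].sub(df.groupby('id')['end'].shift())
print(gap.tolist())   # [nan, 11.0, 3.0, 1.0, nan, 3.0, 15.0]

# a gap greater than 10 starts a new group; cumsum labels the groups
g = gap.gt(10).cumsum()
print(g.tolist())     # [0, 1, 1, 1, 1, 1, 2]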
Let's consider this DataFrame:
$> df
a b
0 6 50
1 2 20
2 9 60
3 4 40
4 5 20
I want to compute column d based on the max value between:
the integer 0
a slice of column b from that row's index onward
So I have created a column c (all zeroes) in my DataFrame in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to slice the input values. The expected result would be:
$> df
a b c d
0 6 50 0 60
1 2 20 0 60
2 9 60 0 60
3 4 40 0 40
4 5 20 0 20
So essentially, the value of d at index 3 is computed (pseudo-code) as max(df[3:, "b"], df[3:, "c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row of d without having to loop, as looping is slow.
Seems like this could work: reverse "b", take the cumulative max, then reverse it back and assign it to "d". Then use where on "d" to replace any values less than 0 with 0:
df['d'] = df['b'][::-1].cummax()[::-1]
df['d'] = df['d'].where(df['d']>0, 0)
We can replace the last line with clip (thanks #Either), and drop the second reversal (assignment aligns on the index), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
a b d
0 6 50 60
1 2 20 60
2 9 60 60
3 4 40 40
4 5 20 20
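For reference, a minimal runnable sketch of the whole approach, using the sample data above:
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5],
                   'b': [50, 20, 60, 40, 20]})

# reversed cumulative max: for each row, the max of b from that row onward;
# assignment aligns on the index, so the second reversal can be dropped
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)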
I have two DataFrames and I am able to merge them, but I want to merge them in a specific format (column-wise). Below are the further details.
>df1
id A B C
0 1 20 0 1
1 2 23 1 2
>df2
id A B C
0 1 10 1 1
1 2 20 1 1
Below is my code and its output:
df = pd.merge(df1,df2,on='id',suffixes=('_Pre', '_Post'))
The output of this is:
id A_Pre B_Pre C_Pre A_Post B_Post C_Post
0 1 20 0 1 10 1 1
1 2 23 1 2 20 1 1
But the EXPECTED output should be as below. Can someone help or guide me with this?
id A_Pre A_Post B_Pre B_Post C_Pre C_Post
0 1 20 10 0 1 1 1
1 2 23 20 1 1 2 1
If subsequent manipulation is possible, you can do something like:
import numpy as np

df[np.array([[x + "_Pre", x + "_Post"] for x in df1.columns.drop("id")]).flatten()]
If you just want to modify the order of your columns, you can use reindex:
df = df.reindex(columns=['A_Pre','A_Post','B_Pre','B_Post','C_Pre','C_Post'])
You can order the columns in the new dataset using sorted, then add the "id" column back in a second statement:
order_col = sorted(df.columns[1:], key=lambda x:x[:3])
df_final = pd.concat([df['id'],df[order_col]], axis=1)
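Putting it all together, here is a minimal runnable sketch; the pairs comprehension is just a flattened variant of the np.array approach above:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'A': [20, 23], 'B': [0, 1], 'C': [1, 2]})
df2 = pd.DataFrame({'id': [1, 2], 'A': [10, 20], 'B': [1, 1], 'C': [1, 1]})

df = pd.merge(df1, df2, on='id', suffixes=('_Pre', '_Post'))

# interleave the suffixed columns pair by pair, keeping 'id' first
pairs = [c for col in df1.columns.drop('id') for c in (col + '_Pre', col + '_Post')]
print(df[['id'] + pairs])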
I want to group rows by 'Age', and return a count of 1) how many rows make up each group, and 2) how many of those rows meet a condition.
Given a DataFrame that looks like this:
Age Died
0 26 0
1 26 0
2 27 1
3 28 0
4 28 1
5 28 1
I want to return a DataFrame that looks like this:
Age Count Died_Count
26 2 0
27 1 1
28 3 2
I have tried numerous combinations of groupby, such as groupby(['Age', 'Died']), with different aggregators (sum, count), but can't seem to find a winning combination. Can someone point me in the right direction?
You can use named aggregation:
(
    df.groupby('Age')
      .agg(Count=('Died', 'size'),
           Died_Count=('Died', 'sum'))
      .reset_index()
)
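A quick check with the sample data (a sketch, rebuilding the frame for illustration):
import pandas as pd

df = pd.DataFrame({'Age':  [26, 26, 27, 28, 28, 28],
                   'Died': [0, 0, 1, 0, 1, 1]})
out = (df.groupby('Age')
         .agg(Count=('Died', 'size'),       # rows per group
              Died_Count=('Died', 'sum'))   # rows where Died == 1
         .reset_index())
print(out)
#    Age  Count  Died_Count
# 0   26      2           0
# 1   27      1           1
# 2   28      3           2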
Assuming your DataFrame is df:
res=df.groupby("Age").agg({'Age': 'count', 'Died': 'sum'}).rename(columns={"Age":"Count"})
Output:
Count Died
Age
26 2 0
27 1 1
28 3 2
You can reset the index and turn Age back into a column as well:
res = res.reset_index(drop=False)
Output:
Age Count Died
0 26 2 0
1 27 1 1
2 28 3 2
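If you also want the Died column labeled Died_Count, as in the expected output, a simple rename does it:
res = res.rename(columns={'Died': 'Died_Count'})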
In Python, I have a pandas DataFrame df:
ID Ref Dist
A 0 10
A 0 10
A 1 20
A 1 20
A 2 30
A 2 30
A 3 5
A 3 5
B 0 8
B 0 8
B 1 40
B 1 40
B 2 7
B 2 7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID Ref Dist
A 0 10
A 1 20
A 2 30
A 3 5
B 0 8
B 1 40
B 2 7
And I want to sum up the Dist column in each ID group.
ID Sum
A 65
B 55
I tried this for the first step, but it gives me just the original row index alongside Dist, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody could help me with this.
Thank you!
I believe this is what you're looking for.
For the first step, use first, since you want the first row of each group. Once you've done that, use reset_index() so you can group by ID afterwards and sum up Dist:
df.groupby(['ID', 'Ref'])['Dist'].first()\
  .reset_index().groupby(['ID'])['Dist'].sum()
ID
A 65
B 55
Just drop_duplicates before the groupby. The default behavior is to keep the first duplicate row, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
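If you want the exact ID / Sum layout from the question, you can convert the resulting Series with reset_index (a small sketch):
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum().reset_index(name='Sum')
#   ID  Sum
# 0  A   65
# 1  B   55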
I have this DataFrame df:
ID EVAL
11 1
11 0
22 0
11 1
33 0
44 0
22 1
11 1
I need to estimate the percentage of rows with EVAL equal to 1 and to 0 for two groups: group 1 contains the IDs that appear three or more times in df, and group 2 contains the IDs that appear fewer than three times in df.
The result should be this one:
GROUP EVAL_0 EVAL_1
1 25 75
2 75 25
You can get the percentage of IDs that are repeated three or more times with value_counts() followed by the mean of the resulting boolean mask:
>>> (df.ID.value_counts() >= 3).mean()
0.25
This is the gist of the work, but if you want output shaped like yours, you can just build a DataFrame from it:
>>> g1_perc = (df.ID.value_counts() >= 3).mean()
>>> pd.DataFrame(dict(group=[1, 2], perc_group=[g1_perc*100, (1-g1_perc)*100]))
group perc_group
0 1 25.0
1 2 75.0
The second column with the opposite percentage looks a bit needless to me.
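If you do need the full GROUP / EVAL_0 / EVAL_1 table from the question, one possible sketch (assuming group 1 means IDs appearing three or more times) uses crosstab:
import pandas as pd

df = pd.DataFrame({'ID':   [11, 11, 22, 11, 33, 44, 22, 11],
                   'EVAL': [1, 0, 0, 1, 0, 0, 1, 1]})

# label each row: group 1 if its ID occurs >= 3 times, else group 2
grp = df['ID'].map(df['ID'].value_counts()).ge(3).map({True: 1, False: 2})

# row-wise percentages of EVAL values within each group
print(pd.crosstab(grp, df['EVAL'], normalize='index').mul(100))
# EVAL     0     1
# ID
# 1     25.0  75.0
# 2     75.0  25.0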