I have a table, and I'm trying to get the second largest "percent" value by column "Day".
I can get the second largest value, but the 'Hour' value that comes with it is not the right one.
Table: df

        name  Day  Hour   percent
0  000_RJ_S1   26    10  0.908494
1  000_RJ_S1   26    11  0.831482
2  000_RJ_S1   26    12  0.843846
3  000_RJ_S1   26    13  0.877238
4  000_RJ_S1   26    17  0.163908
5  000_RJ_S1   26    18  0.230296
6  000_RJ_S1   26    19  0.359440
7  000_RJ_S1   26    20  0.379988
Script Used:
df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())
Output:
As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.
        name  Day  Hour   percent
   000_RJ_S1   26    10  0.877238
It should be:

        name  Day  Hour   percent
   000_RJ_S1   26    13  0.877238
I can't figure out what's wrong. Could you help me with this issue?
Thanks a lot
Sort by the percent column before grouping, and use nth instead:
(df.sort_values('percent', ascending=False)
   .groupby(['name', 'Day'], sort=False, as_index=False)
   .nth(1)
)
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
The reason you got 10 is the min() call.
The nlargest() in the lambda returns the two rows with the largest percent values; when you then apply min(), it takes the minimum of each column separately, which is what produced that output.
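For illustration, here is a minimal sketch of what the lambda receives for one group (values copied from the question's table) and what min() then does with it:

import pandas as pd

# one group, as the lambda receives it
g = pd.DataFrame({'Hour': [10, 11, 12, 13],
                  'percent': [0.908494, 0.831482, 0.843846, 0.877238]})

top2 = g.nlargest(2, columns='percent')
#    Hour   percent
# 0    10  0.908494
# 3    13  0.877238

print(top2.min())
# Hour       10.000000   <- smallest Hour of the two rows (comes from the 0.908494 row)
# percent     0.877238   <- smallest percent of the two rows (the value you wanted)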
You can use iloc[1] instead of min() to get the desired result
Here's the code using iloc:
df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').iloc[1])
One solution is to use a double groupby:
cols = ['name','Day']
# get the top 2 percent values per group (original row labels kept as the innermost index level)
s = df.groupby(cols)['percent'].nlargest(2)
# within each group's top 2, take the original index of the smaller value (the 2nd largest)
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice the original data with those indexes
df2 = df.loc[idx.values]
Output:
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
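For reference, a small reproduction of the question's frame (values copied from the table above); the sort-then-nth approach from the first answer can be run against it directly:

import pandas as pd

df = pd.DataFrame({
    'name': ['000_RJ_S1'] * 8,
    'Day': [26] * 8,
    'Hour': [10, 11, 12, 13, 17, 18, 19, 20],
    'percent': [0.908494, 0.831482, 0.843846, 0.877238,
                0.163908, 0.230296, 0.359440, 0.379988],
})

# second-largest percent per (name, Day), keeping the Hour from the same row
second = (df.sort_values('percent', ascending=False)
            .groupby(['name', 'Day'], sort=False, as_index=False)
            .nth(1))
print(second)
#         name  Day  Hour   percent
# 3  000_RJ_S1   26    13  0.877238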
Basically, I have the columns date and intensity, which I have grouped by date and intensity this way:
intensity = dataframe_scraped.groupby(["date","intensity"]).count()['sentiment']
which yielded the following results:
date     intensity
2021-01  negative           33
         neutral            72
         positive           44
         strong_negative    24
         strong_positive    22
                            ..
2022-05  positive           13
         strong_negative    20
         strong_positive    16
         weak_negative      12
         weak_positive      18
I want to calculate the percentages of these numerical values by date in order to bar-plot it later. Any ideas on how to achieve this?
I've tried something naïve along the lines of:
100 * dataframe_scraped.groupby(["date","intensity"]).count()['sentiment'] / dataframe_scraped.groupby(["date","intensity"]).count()['sentiment'].transform('sum')
I think this should work:
df.value_counts(subset=["date", "intensity"]) / df.value_counts(subset=["date"])
This counts the number of each value in the group, divided by the total number in the date group (so this would be negative's 33 / sum of 2021-01, for example).
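As a minimal sketch, the same per-date percentage can also be spelled out with groupby and transform, which makes the per-date denominator explicit (the miniature frame below just mimics the question's columns):

import pandas as pd

dataframe_scraped = pd.DataFrame({
    'date': ['2021-01', '2021-01', '2021-01', '2022-05', '2022-05'],
    'intensity': ['negative', 'negative', 'positive', 'positive', 'negative'],
    'sentiment': ['a', 'b', 'c', 'd', 'e'],   # any non-null column works for count()
})

counts = dataframe_scraped.groupby(['date', 'intensity'])['sentiment'].count()
pct = 100 * counts / counts.groupby(level='date').transform('sum')
print(pct)

# one grouped bar chart per date, as mentioned in the question
pct.unstack('intensity').plot.bar()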
The other interpretation of your question is that you want the proportion relative to the total count across the whole dataframe, in which case you could use this:
df.value_counts(subset=["date", "intensity"], normalize=True)
which returns each count's proportion of the grand total across all groups.
I have a pandas DataFrame df looking like this :
item year value
A 2010 20
A 2011 25
A 2012 32
B 2016 20
B 2019 40
B 2018 50
My goal is, for each item, to calculate the difference in value between the earliest and latest year. For example, I want to find 12 for item A (32 - 20, because the max year is 2012 and the min year is 2010) and 20 for item B (40 - 20, because the max year is 2019 and the min year is 2016).
I use the following code to get the min and max year for each item:
df.groupby("item").agg({'year':[np.min, np.max]})
This gives me the min and max year for each item, but I'm stuck on how to get from there to the difference in value.
Try sort_values by year, then you can groupby and select first for min and last for max:
g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()
Output:
item
A 12
B 20
Name: value, dtype: int64
Use:
def fun(x):
    # value at the latest year minus value at the earliest year
    return x[x.index.max()] - x[x.index.min()]

res = df.set_index("year").groupby("item").agg(fun)
print(res)
Output
value
item
A 12
B 20
Use iloc in agg to calculate the value difference; you can also concatenate the first and last year for each item to give a clear indication of the range:
(df.sort_values(by=['item', 'year'])
   .groupby('item')
   .agg(year=('year', lambda x: str(x.iloc[0]) + '-' + str(x.iloc[-1])),
        value=('value', lambda x: x.iloc[-1] - x.iloc[0])))
year value
item
A 2010-2012 12
B 2016-2019 20
I have this MultiIndex pandas dataframe:
           chamber_temp
month day
1     1        0.000000
      2        0.005977
      3        0.001439
      4       -0.000119
      5        0.000514
...                 ...
12    27       0.001799
      28       0.002346
      29      -0.001815
      30       0.001102
      31      -0.004189
What I want to get is which month has the highest cumulative sum.
That is, for each month there should be one value giving the cumulative sum of all the daily values in that month; that is the part I need help with.
You can use the level parameter of Series.sum when there is a MultiIndex, which avoids a groupby in such cases:
df['chamber_temp'].sum(level=0).idxmax()
Please try:
df.groupby('month')['chamber_temp'].sum().idxmax()
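If you literally want the running cumsum within each month first, note that its last value per month equals that month's plain sum, so idxmax() picks the same month either way. A minimal sketch, assuming the index levels are named month and day as shown above:

import pandas as pd

# tiny stand-in with the same (month, day) MultiIndex shape as the question
idx = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (2, 1), (2, 2)],
                                names=['month', 'day'])
df = pd.DataFrame({'chamber_temp': [0.001, 0.002, 0.004, -0.002]}, index=idx)

month_totals = (df['chamber_temp']
                .groupby(level='month').cumsum()   # running total within each month
                .groupby(level='month').last())    # final value of each month's run

print(month_totals.idxmax())   # same result as df.groupby('month')['chamber_temp'].sum().idxmax()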
I am using Python 3.4 on Jupyternotebook.
I am looking to select the max of each product type from the table below. I've found the groupby code written below, but I am struggling to figure out how to do the matching so that it takes the max across all boxes (Box_1 and Box_2), all bottles, and so on.
Perhaps this is best described as some sort of fuzzy matching?
Ideally my output should give me the max in each category:
box_2      18
bottles_3  31
...
How should I do this?
data = {'Product':['Box_1','Bottles_1','Pen_1','Markers_1','Bottles_2','Pen_2','Markers_2','Bottles_3','Box_2','Markers_2','Markers_3','Pen_3'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','Sales'])
df1
df1.groupby(['Product'])['Sales'].max()
If I understand correctly, you first have to look at the category and then retrieve both the name of the product and the maximum value. Here is how to do that:
df1=pd.DataFrame(data, columns=['Product','Sales'])
df1['Category'] = df1.Product.str.split('_').str.get(0)
df1["rank"] = df1.groupby("Category")["Sales"].rank("dense", ascending=False)
df1[df1["rank"]==1.0][['Product','Sales']]
The rank function ranks the products within each category according to their Sales value. Then you filter out any product that ranks lower than first within its category. That gives you the desired dataframe:
Product Sales
2 Pen_1 31
7 Bottles_3 31
8 Box_2 18
10 Markers_3 18
Here you go:
df1['Type'] = df1.Product.str.split('_').str.get(0)
df1.groupby(['Type'])['Sales'].max()
Output:
Type
Bottles 31
Box 18
Markers 18
Pen 31
Name: Sales, dtype: int64
You can split the values by _, select the first part with str[0], and pass that to groupby; then aggregate with idxmax to get the Product with the maximal Sales, together with max for the Sales value itself:
df1 = df1.set_index('Product')
df2 = (df1.groupby(df1.index.str.split('_').str[0])['Sales']
.agg([('Product','idxmax'), ('Sales','max')])
.reset_index(drop=True))
print (df2)
Product Sales
0 Bottles_3 31
1 Box_2 18
2 Markers_3 18
3 Pen_1 31
I originally have 3 columns: timestamp, response_time, and type. What I need to do is find the mean response_time for rows that share the same timestamp, so I grouped the timestamps together and applied the mean function. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then a mean() on that.
The problem I need to solve is: how do I extract each column of this series? Or is there a better way to calculate the mean of one column for all matching entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average response_time and plot it. I did that successfully, as shown in the series above (the first data), but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column of values; the first column you see is the index, and you can get it with your_series.index. If you want the timestamp to become a data column again, rather than the index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index = False).mean()
Or use your_series.reset_index().
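Putting it together for the raw data shown in the question, a minimal sketch (the file name and column order are assumptions, since the original file isn't shown):

import pandas as pd
import matplotlib.pyplot as plt

# assumed layout: timestamp, type, response_time with no header row
df = pd.read_csv('data.csv', header=None,
                 names=['timestamp', 'type', 'response_time'])

# keep timestamp as a regular column so both columns stay accessible
means = df.groupby('timestamp', as_index=False)['response_time'].mean()

plt.plot(means['timestamp'], means['response_time'])
plt.show()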
If it's a Series, you can use mean() on it directly:
your_series.mean()
you can extract the column by:
df['column_name']
then you can apply mean() to the series:
df['column_name'].mean()