How to obtain one column from a pandas Series object? - python

I originally have 3 columns: timestamp, response_time and type. I need to find the mean of response_time wherever the timestamps are the same, so I grouped the rows by timestamp and applied the mean function. I got the following Series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then mean() on the result.
The problem I need to solve is: how do I extract each column of this Series? Or is there a better way to calculate the mean of one column for all identical entries of another column?
ORIGINAL DATA:
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average of response_time and plot it. I did that successfully, as shown in the Series above (first data), but I cannot separate the timestamp and response_time columns anymore.

A Series always has just one column of values. The first column you see is the index. You can get it with your_series.index (it is an attribute, not a method). If you want the timestamp to become a data column again rather than the index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index=False).mean()
Or use your_series.reset_index().
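A minimal sketch of the full flow, assuming the raw file has no header row and its columns are ordered as in the original data (the file name and column names here are placeholders, not from the question):

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name and column names; adjust to your data.
df = pd.read_csv('data.csv', names=['timestamp', 'type', 'response_time'])

# Keep 'timestamp' as a regular column instead of the index.
means = df.groupby('timestamp', as_index=False)['response_time'].mean()

# Both columns are now separately accessible and can be plotted.
plt.plot(means['timestamp'], means['response_time'])
plt.xlabel('timestamp')
plt.ylabel('mean response_time')
plt.show()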

If it's a Series, you can directly use:
your_series.mean()
You can extract a column from a DataFrame with:
df['column_name']
and then apply mean() to the resulting Series:
df['column_name'].mean()

Related

Second largest value is showing the wrong value in a pandas DataFrame when using groupby

I have a table and I'm trying to get the second largest "percent" value per "Day".
I can get the second largest value, but the corresponding 'Hour' is not the right one.
Table: df

name       Day  Hour  percent
000_RJ_S1   26    10  0.908494
000_RJ_S1   26    11  0.831482
000_RJ_S1   26    12  0.843846
000_RJ_S1   26    13  0.877238
000_RJ_S1   26    17  0.163908
000_RJ_S1   26    18  0.230296
000_RJ_S1   26    19  0.359440
000_RJ_S1   26    20  0.379988
Script Used:
df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())
Output:
As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.
name       Day  Hour  percent
000_RJ_S1   26    10  0.877238
It should be:
name       Day  Hour  percent
000_RJ_S1   26    13  0.877238
I can't figure out what is wrong. Could you help me with this issue?
Thanks a lot.
Sort by the percent column before grouping, and use the nth function instead:
(df.sort_values('percent', ascending=False)
   .groupby(['name', 'Day'], sort=False, as_index=False)
   .nth(1)
)
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
The reason you got 10 is the min() function.
The nlargest() in the lambda returns the two rows with the largest percent values, and when you apply min() it selects the minimum value of each column separately, which is what gave you that output.
You can use iloc[1] instead of min() to get the desired result.
Here's the code using iloc:
df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').iloc[1])
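To see why min() mixes values from different rows, here is a small sketch reproducing the pitfall on the two rows that nlargest(2) returns for this group (values taken from the question's data):

import pandas as pd

# The two rows with the largest 'percent' in the group, as nlargest(2) returns them.
top2 = pd.DataFrame({'Hour': [10, 13], 'percent': [0.908494, 0.877238]})

# min() works column by column: Hour -> 10, percent -> 0.877238,
# pairing values that come from different rows.
print(top2.min())

# iloc[1] keeps the whole second row, so Hour and percent stay together.
print(top2.iloc[1])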
One solution is to use a double groupby:
cols = ['name','Day']
# get the top 2 per group
s = df.groupby(cols)['percent'].nlargest(2)
# get the index of min per group
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice original data with those indexes
df2 = df.loc[idx.values]
Output:
name Day Hour percent
3 000_RJ_S1 26 13 0.877238

Pandas Time Series: Remove Rows Per ID

I have a Pandas dataframe of the form:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/04/27 1 42
2019/04/28 1 41
2019/01/27 2 33
2019/08/27 2 23
What do I need to do?
Select rows that are at least 30 days older than the latest measurement for each ID.
i.e. the latest date for ID = 2 is 2019/08/27, so for ID = 2 I need to select rows which are at least 30 days older. So the row with 2019/08/27 for ID = 2 will itself be dropped.
Similarly, the latest date for ID = 1 is 2019/04/28. This means I can select rows for ID = 1 only if the date is less than 2019/03/28 (30 days older). So the row 2019/04/27 with ID = 1 will be dropped.
How do I do this in pandas? Any help is greatly appreciated.
Thank you.
Final dataframe will be:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/01/27 2 33
In your case, use groupby + transform('last') and filter the original df:
Yourdf = df[df.Date < df.groupby('ID').Date.transform('last') - pd.Timedelta('30 days')].copy()
Date ID Temp
0 2019-03-27 1 23
1 2019-04-27 2 32
4 2019-01-27 2 33
Notice I am adding .copy() at the end to prevent the SettingWithCopyWarning.
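A minimal runnable sketch of the same idea, assuming Date still has to be parsed to datetime; it uses transform('max') instead of transform('last') so the result does not depend on the rows being sorted (that swap is my assumption, not part of the answer above):

import pandas as pd

df = pd.DataFrame({
    'Date': ['2019/03/27', '2019/04/27', '2019/04/27',
             '2019/04/28', '2019/01/27', '2019/08/27'],
    'ID':   [1, 2, 1, 1, 2, 2],
    'Temp': [23, 32, 42, 41, 33, 23],
})
df['Date'] = pd.to_datetime(df['Date'])

# Latest measurement date per ID, broadcast back to every row of that ID.
latest = df.groupby('ID')['Date'].transform('max')

# Keep only rows at least 30 days older than that latest date.
result = df[df['Date'] < latest - pd.Timedelta('30 days')].copy()
print(result)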

Grouping data with pandas TimeGrouper with intervals from 5 to 25 min, 25 to 45 min, 45 to 05 min

I am new to python pandas and I am trying to group my data on a 20 minute interval. If I use Data.groupby(pd.TimeGrouper('20Min')) it works, but it gives me the data grouped from 0 to 20 min, 20 to 40 min, etc. I want to group my data from 5 to 25 min, 25 to 45 min, etc.
Can you help me achieve this with pandas TimeGrouper?
Thanks in advance.
Data.groupby(pd.TimeGrouper(freq='20Min', base=5, label='right'))
will give you data grouped by 20 min, shifted forward by 5 min (since we are using base=5 and label='right'), i.e. it will be grouped as :05 to :25, :25 to :45, etc. And if you use this:
Data.groupby(pd.TimeGrouper(freq='20Min', base=55, label='left'))
it will give you data grouped by 20 min shifted backward by 5 min (i.e. your data will be grouped as :55 to :15, :15 to :35, etc.).
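Note that pd.TimeGrouper has been removed from recent pandas versions; a minimal sketch of the same shifted binning with pd.Grouper and its offset parameter (this assumes pandas 1.1 or newer, and the data below is made up for illustration):

import numpy as np
import pandas as pd

# Synthetic minute-level data, only for illustration.
idx = pd.date_range('2016-07-13 07:00', periods=120, freq='1Min')
data = pd.DataFrame({'value': np.arange(120)}, index=idx)

# Bins start at :05, :25, :45 instead of :00, :20, :40.
grouped = data.groupby(pd.Grouper(freq='20Min', offset='5Min')).mean()
print(grouped.head())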

Subsetting a Pandas.DataFrame object only where there is a difference between two rows in python

I was wondering if there is an easy way in python to return a subset of my DataFrame rows, only where there is a change between two consecutive rows. For example, my dataframe object might look like this:
Date A B
20160713070000 20 21
20160713070100 20 23
20160713070128 20 23
20160713070128 21 24
20160713070134 23 24
In this case, I would want to return the following dataframe object:
Date A B
20160713070000 20 21
20160713070100 20 23
20160713070128 21 24
20160713070134 23 24
Thanks for the help!
I'd use the drop_duplicates() function:
In [262]: df.drop_duplicates(subset=['A','B'])
Out[262]:
Date A B
0 20160713070000 20 21
1 20160713070100 20 23
3 20160713070128 21 24
4 20160713070134 23 24
Assuming your dataframe is df, try the following:
sub_df = df[df.groupby('Date')['A'].transform(lambda x: x.index[-1])==df.index]
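Note that drop_duplicates() removes repeated rows anywhere in the frame, not only consecutive ones; it happens to give the desired result here. If you strictly need rows that differ from the row immediately before them, a shift-based comparison is one option. A minimal sketch with the question's data (this approach is my suggestion, not from the answers above):

import pandas as pd

df = pd.DataFrame({
    'Date': [20160713070000, 20160713070100, 20160713070128,
             20160713070128, 20160713070134],
    'A': [20, 20, 20, 21, 23],
    'B': [21, 23, 23, 24, 24],
})

cols = ['A', 'B']
# Keep a row if any watched column changed relative to the previous row.
changed = df[cols].ne(df[cols].shift()).any(axis=1)
sub_df = df[changed]
print(sub_df)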

Group by to create unique values appearing by date, as well as non-unique values by date

I have a data frame that looks like:
app_id subproduct date
0 23 3 2015-05-29
1 23 4 2015-05-29
2 25 5 2015-05-29
3 23 3 2015-05-29
4 24 7 2015-05-29
....
I run:
groupings = insightevents.groupby([insightevents['created_at_date'].dt.year,
                                   insightevents['created_at_date'].dt.month,
                                   insightevents['created_at_date'].dt.week,
                                   insightevents['created_at_date'].dt.day,
                                   insightevents['created_at_date'].dt.dayofweek])
inboxinsights = pd.DataFrame([groupings['app_id'].unique(), groupings['subproduct'].unique()]).transpose()
This gives me:
app_id subproduct
2015 5 22 29 4 [23,24,25] [3,4,5,7]
However, what I actually want is not just the unique values, but also all the app_id and subproduct values for that day as additional columns, so:
unique_app_id unique_subproduct subproduct app_id
2015 5 22 29 4 [23,24,25] [3,4,5,7] [3,3,4,5,7] [23,23,23,24,25]
I find that just doing:
inboxinsights=pd.DataFrame([groupings['app_id'].unique(), groupings['subproduct'].unique(),groupings['app_id'],groupings['subproduct']]).transpose()
doesn't work and just gives me:
AttributeError: 'Series' object has no attribute 'type'
If you wanted just the number of unique values, that's easy:
inboxinsights.groupby('date').agg({'app_id': 'nunique', 'subproduct': 'nunique'})
which returns the number of unique values per date. But it looks like you want the list of what those values actually are. I found this other SO question helpful:
not_unique_inboxinsights = inboxinsights.groupby('date').agg(lambda x: tuple(x))
And then you say you want both the unique and the non-unique values. For that, I would make two groupby dataframes and concatenate them, like this:
unique_inboxinsights = inboxinsights.groupby('date').agg(lambda x: set(tuple(x)))
Hope that helps.
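If you want both the unique values and the full lists out of a single groupby, named aggregation is one way to sketch it (the result column names are my own, and this assumes a raw frame with 'app_id', 'subproduct' and 'date' columns as in the first sample):

import pandas as pd

df = pd.DataFrame({
    'app_id':     [23, 23, 25, 23, 24],
    'subproduct': [3, 4, 5, 3, 7],
    'date':       pd.to_datetime(['2015-05-29'] * 5),
})

result = df.groupby('date').agg(
    unique_app_id=('app_id', lambda x: sorted(x.unique())),
    unique_subproduct=('subproduct', lambda x: sorted(x.unique())),
    app_id=('app_id', list),
    subproduct=('subproduct', list),
)
print(result)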
