I have this MultiIndex pandas dataframe:
chamber_temp
month day
1 1 0.000000
2 0.005977
3 0.001439
4 -0.000119
5 0.000514
...
12 27 0.001799
28 0.002346
29 -0.001815
30 0.001102
31 -0.004189
What I want to get is which month has the highest cumulative sum. For each month there should be one value giving the cumulative sum of all the daily values in that month; that is the problem I am trying to get help on.
You can leverage the level parameter in Series.sum when there's a MultiIndex, to avoid a groupby in such cases.
df['chamber_temp'].sum(level=0).idxmax()
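Note that the level parameter of Series.sum was deprecated in pandas 1.3 and removed in pandas 2.0; on recent versions the same result is spelled with groupby:
df['chamber_temp'].groupby(level=0).sum().idxmax()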
Please try:
df.groupby('month')['chamber_temp'].sum().idxmax()
I have 2 dataframes:
df_dec_light and df_rally.
df_dec_light.head():
log_return month year
1970-12-01 0.003092 12 1970
1970-12-02 0.011481 12 1970
1970-12-03 0.004736 12 1970
1970-12-04 0.006279 12 1970
1970-12-07 0.005351 12 1970
1970-12-08 -0.005239 12 1970
1970-12-09 0.000782 12 1970
1970-12-10 0.004235 12 1970
1970-12-11 0.003774 12 1970
1970-12-14 -0.005109 12 1970
df_rally.head():
rally_start rally_end
0 1970-12-18 1970-12-31
1 1971-12-17 1971-12-31
2 1972-12-15 1972-12-29
3 1973-12-21 1973-12-31
4 1974-12-20 1974-12-31
I need to filter df_dec_light based on the condition that df_dec_light.index is between the values of columns df_rally['rally_start'] and df_rally['rally_end'].
I've tried something like this:
df_dec_light[(df_dec_light.index >= df_rally['rally_start']) & (df_dec_light.index <= df_rally['rally_end'])]
I was expecting to receive a filtered df_dec_light dataframe whose indexes fall within the intervals between df_rally['rally_start'] and df_rally['rally_end'].
Something like this:
log_return month year
1970-12-18 0.001997 12 1970
1970-12-21 -0.003108 12 1970
1970-12-22 0.001111 12 1970
1970-12-23 0.000666 12 1970
1970-12-24 0.005644 12 1970
1970-12-28 0.005283 12 1970
1970-12-29 0.010810 12 1970
1970-12-30 0.002061 12 1970
1970-12-31 -0.001301 12 1970
Would really appreciate any help. Thanks!
Let's create an IntervalIndex from the start and end column values in the df_rally dataframe, then map the intervals onto the index of df_dec_light and use notna to check whether each index value is contained in any interval:
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
Then use the mask to filter the dataframe:
df_dec_light[mask]
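A minimal self-contained sketch of this approach, using a cut-down version of the sample data (the values are just for illustration):
import pandas as pd

df_rally = pd.DataFrame({
    'rally_start': pd.to_datetime(['1970-12-18']),
    'rally_end':   pd.to_datetime(['1970-12-31']),
})
df_dec_light = pd.DataFrame(
    {'log_return': [0.003092, 0.001997]},
    index=pd.to_datetime(['1970-12-01', '1970-12-18']),
)

# one closed interval per rally window, endpoints included
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
# map each date to its containing interval (NaN if none) and keep the hits
mask = df_dec_light.index.map(ix.to_series()).notna()
print(df_dec_light[mask])  # only the 1970-12-18 row falls inside a rally window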
To solve this we can first turn the ranges in df_rally into DatetimeIndex objects by calling pd.date_range on each row. This gives us each row of df_rally as a pd.DatetimeIndex.
Since we later want to check whether the index of df_dec_light is in any of the ranges, we combine all of these ranges. This is done with union.
We assert that the newly created pd.Series index_list is not empty and then select its first element. This element is the pd.DatetimeIndex on which we can now call union with all the other pd.DatetimeIndex objects.
We can then use pd.Index.isin to create a boolean array indicating whether each index date is found in the passed set of dates.
If we now apply this mask to df_dec_light, it returns only the entries that fall within one of the specified ranges of df_rally.
# build one DatetimeIndex per rally window
index_list = df_rally.apply(lambda x: pd.date_range(x['rally_start'], x['rally_end']), axis=1)
assert not index_list.empty
# union all the per-row ranges into a single DatetimeIndex
all_ranges = index_list.iloc[0]
for rng in index_list:  # 'rng' avoids shadowing the built-in range
    all_ranges = all_ranges.union(rng)
print(all_ranges)
# keep only the rows whose index falls inside one of the rally windows
mask = df_dec_light.index.isin(all_ranges)
print(df_dec_light[mask])
I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100, which I did.
The second task is grouping id by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframe, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the suggested solution from count consecutive days python dataframe,
but the ids are not grouped.
here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a count column, because afterwards I need to count ids by ranges of those consecutive-day counts in the count column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that's not part of my question.
Thank you a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy format instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be read as 2018-02-01 instead of the expected 2018-01-02, and the day diff with adjacent entries will be around 30 instead of 1.
We added a sort step on columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to make the Pandas series change back to a dataframe and also name the new column as count.
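For reference, here is a self-contained run on the sample data from the question (a sketch; the .copy() call is added to avoid pandas' SettingWithCopyWarning):
import pandas as pd

df = pd.DataFrame({
    'date': ['01/01/2018', '01/01/2018', '02/01/2018', '03/01/2018', '05/01/2018'],
    'id': ['03', '07', '03', '03', '07'],
    'sales': [101, 178, 120, 150, 205],
})
data = df[df['sales'] >= 100].copy()
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
# rows within a run of consecutive days share the same group number
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
print(new_frame)
#    id  count
# 0  03      3
# 1  07      1
# 2  07      1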
I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year, i.e. per the first level:
df.groupby([df.index.year, df.index.month]).sum().mean(level=0)
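As with Series.sum, the level parameter of mean() was deprecated in pandas 1.3 and removed in 2.0; the groupby spelling works on all versions:
df.groupby([df.index.year, df.index.month]).sum().groupby(level=0).mean()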
You can use groupby twice: once to get the monthly sum, once to get the mean of the monthly sums:
(df.groupby(pd.Grouper(freq='M')).sum()
.groupby(pd.Grouper(freq='Y')).mean()
)
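Note that pandas 2.2 deprecated the 'M' and 'Y' frequency aliases in favour of 'ME' (month end) and 'YE' (year end), so on newer versions this becomes:
(df.groupby(pd.Grouper(freq='ME')).sum()
 .groupby(pd.Grouper(freq='YE')).mean()
)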
I have a dataframe like the one below.
I am trying to sum the Week 7 and Week 8 SalesQuantity values for each of the ProductCodes [1-317], update the Week 7 rows' SalesQuantity with the new value, and delete the Week 8 rows from the dataframe.
The Week column range is [7-26], and every week includes all of the product codes [1-317], because the original data was grouped by [Week, ProductCode].
Week ProductCode SalesQuantity
7 1 49.285714
7 2 36.714286
7 3 33.285714
7 4 36.857143
7 5 42.714286
... ... ...
8 3 61.000000
26 314 4.285714
26 315 3.571429
26 316 6.142857
26 317 3.285714
Example result: from the table above, adding the Week 7 and Week 8 SalesQuantities for ProductCode 3 gives 61.000000 + 33.285714 = 94.285714, the updated Week 7 SalesQuantity value for ProductCode 3.
After that, the Week 8 row for ProductCode 3 needs to be deleted.
How can this be automated for all of the ProductCodes [1-317]?
Thanks
Use the groupby() method:
sumSales = data[['ProductCode', 'SalesQuantity']].groupby('ProductCode').sum()
This creates a new DataFrame with the sum of SalesQuantity, indexed by product code. The data[['ProductCode', 'SalesQuantity']] part creates a sub-selection of the original dataframe; otherwise the weeks would also get summed.
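If the goal is to fold only Week 8 into Week 7 while leaving the other weeks untouched, one possible sketch (assuming the frame is named df and has the columns shown in the question) is to relabel Week 8 as Week 7 and re-aggregate:
# relabel Week 8 rows as Week 7, then re-aggregate per (Week, ProductCode);
# the sum merges each ProductCode's old Week 7 and Week 8 quantities,
# and the separate Week 8 rows disappear in the process
df['Week'] = df['Week'].replace({8: 7})
result = df.groupby(['Week', 'ProductCode'], as_index=False)['SalesQuantity'].sum()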
I originally have 3 columns: timestamp, response_time and type. What I need to do is find the mean of response_time wherever the timestamps are the same, so I grouped all identical timestamps together and applied the mean function to them. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then mean() on that.
The problem I need to solve is how to extract each column of this series. Or is there a better way to calculate the mean of one column for all identical entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average of response_time and plot it. I did that successfully, as shown in the series above (first data), but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column of values; the first column you see is its index. You can get it with your_series.index (an attribute, not a method). If you want the timestamp to become a data column again, rather than the index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index=False).mean()
Or use your_series.reset_index().
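A minimal sketch of the full flow, assuming the raw data sits in a CSV file (log.csv is a hypothetical name) with the three columns shown above:
import pandas as pd

# file name and column names are assumptions based on the sample data
df = pd.read_csv('log.csv', names=['timestamp', 'type', 'response_time'])
res = df.groupby('timestamp', as_index=False)['response_time'].mean()
res.plot(x='timestamp', y='response_time')  # plotting requires matplotlib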
If it's a Series, you can directly use:
your_series.mean()
You can extract a column by:
df['column_name']
Then you can apply mean() to the resulting series:
df['column_name'].mean()