I have the following data:
df =
MONTH DAY HOUR DURATION
1 1 7 20
1 1 7 21
1 2 7 20
1 2 8 22
2 1 7 19
2 1 8 25
2 1 8 29
2 2 8 27
I want to get the mean DURATION grouped by HOUR and averaged over MONTH and DAY. In other words, I want to know the average DURATION per HOUR.
This is my current code. If I delete 'MONTH','DAY' from df.groupby(['MONTH','DAY','HOUR','DURATION']), then I get higher values of DURATION, which are not correct. Therefore I decided to keep 'MONTH','DAY'.
grouped = df.groupby(['MONTH','DAY','HOUR','DURATION']).size() \
.groupby(level=['HOUR','DURATION']).mean().reset_index()
grouped
However, it still gives me incorrect output. Here is an example on some random data (note that hour 8 is repeated several times, and a spurious column 0 appears):
HOUR DURATION 0
0 7 122.0 1.0
1 8 77.0 1.0
2 8 82.0 1.0
3 8 83.0 1.0
Have you tried:
df.groupby("HOUR").agg({'DURATION': 'mean'})
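For reference, a minimal runnable sketch on the sample frame from the question; since there is only one DURATION column, a plain per-HOUR mean is enough, and MONTH/DAY can simply be left out of the grouping key:

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    'MONTH':    [1, 1, 1, 1, 2, 2, 2, 2],
    'DAY':      [1, 1, 2, 2, 1, 1, 1, 2],
    'HOUR':     [7, 7, 7, 8, 7, 8, 8, 8],
    'DURATION': [20, 21, 20, 22, 19, 25, 29, 27],
})

# Average DURATION per HOUR, one row per hour.
grouped = df.groupby('HOUR', as_index=False)['DURATION'].mean()
# grouped:
#    HOUR  DURATION
# 0     7     20.00
# 1     8     25.75
```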
id zone price
0 0000001 1 33.0
1 0000001 2 24.0
2 0000001 3 34.0
3 0000001 4 45.0
4 0000001 5 51.0
I have the above pandas dataframe; there are multiple ids (only 1 id is shown here). Each id has 5 zones and 5 prices, and these prices should follow the pattern below:
p1 (price of zone 1) < p2< p3< p4< p5
if anything out of order we should identify and print anomaly records to a file.
In this example p3 < p4 < p5 holds, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).
Therefore the first 2 records should be printed to a file.
Likewise this has to be done across the entire dataframe, for all unique ids in it.
My dataframe is huge, what is the most efficient way to do this filtering and identify erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone']) # sort rows
.groupby('id') # group by id
['price'].diff() # compute the diff
.le(0) # flag those ≤ 0 (not increasing)
)
df[s|s.shift(-1)] # slice flagged rows + previous row
Example input:
id zone price
0 1 1 33.0
1 1 2 24.0
2 1 3 34.0
3 1 4 45.0
4 1 5 51.0
5 2 1 20.0
6 2 2 24.0
7 2 3 34.0
8 2 4 45.0
9 2 5 51.0
Example output (only the first two rows of id 1 are flagged):
id zone price
0 1 1 33.0
1 1 2 24.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
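Putting the pieces together on the example input, a self-contained sketch (using fill_value=False on the shift to keep the mask purely boolean):

```python
import pandas as pd

# Example input: id 1 has the anomaly (33 > 24), id 2 is strictly increasing.
df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'zone':  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'price': [33.0, 24.0, 34.0, 45.0, 51.0,
              20.0, 24.0, 34.0, 45.0, 51.0],
})

s = (df.sort_values(by=['id', 'zone'])   # make sure zones are in order
       .groupby('id')['price']
       .diff()                           # price change within each id
       .le(0))                           # True where not strictly increasing
anomalies = df[s | s.shift(-1, fill_value=False)]  # flagged rows + the row before
```

Here `anomalies` contains only rows 0 and 1 (prices 33.0 and 24.0), and can be written out with `anomalies.to_csv('incorrect_prices.csv', index=False)`.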
Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one using groupby.shift(), storing the result in a new column. Then you can just print out the prices that have fallen in value:
import numpy as np
import pandas as pd
df = df.sort_values(by=['id', 'zone'], ascending=True)  # assign the sorted frame back
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'], 'inc', 'dec'))
>>> df
id zone price increase
0 1 1 33 no change
1 1 2 24 dec
2 1 3 34 inc
3 1 4 45 inc
4 1 5 51 inc
5 2 1 34 no change
6 2 2 56 inc
7 2 3 22 dec
8 2 4 55 inc
9 2 5 77 inc
10 3 1 44 no change
11 3 2 55 inc
12 3 3 44 dec
13 3 4 66 inc
14 3 5 33 dec
>>> df.loc[df.increase.eq('dec')]
id zone price increase
1 1 2 24 dec
7 2 3 22 dec
12 3 3 44 dec
14 3 5 33 dec
I have added some extra ids to mimic your real data.
I have a large year-long dataframe of occurrences with month (1-12), week (1-52), day_of_week (0-6), and hour (0-23).
Below is just a snippet of the dataset. Each row is an occurrence.
The first part of the snippet below shows multiple occurrences captured with a date/timestamp of 2018-04-01 00:00:00 (a Sunday). The second part (after the first ellipsis) shows multiple occurrences in the following hour, the third part the hour after that, and so on.
month week day_of_week hour
0 4 13 6 0
1 4 13 6 0
2 4 13 6 0
3 4 13 6 0
4 4 13 6 0
...
100 4 13 6 1
101 4 13 6 1
102 4 13 6 1
...
...
300 4 13 6 2
301 4 13 6 2
302 4 13 6 2
...
I would like to be able to display a summary of this dataset showing the weekly average count of occurrences for each of the hours (0-23) as well as for each month.
For example:
month hour weekly_ave
4 0 100
4 1 175
4 2 250
...
4 23 500
5 0 90
How do I do this using pandas groupby and aggregate functions?
Thanks!
df.groupby(['month','hour'])['hour'].count()
Note that this gives the total count per month and hour; divide by the number of distinct weeks if you need a weekly average. Then, if you need this formatted a little bit nicer:
df.groupby(['month','hour'])['hour'].count().rename("weekly_ave").reset_index()
I was able to figure it out. I had to do a second groupby:
df.groupby(['month', 'hour', 'week']) \
.agg({'day_of_week': 'count'}) \
.groupby(['month', 'hour']).mean() \
.rename(columns={"day_of_week": "weekly_ave"}).reset_index()
This gave me what I needed but is there a more elegant way of doing this?
Thanks.
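A slightly more compact variant of the same two-step groupby uses size() instead of agg (a sketch on made-up data, with the column names from the question):

```python
import pandas as pd

# Hypothetical occurrences: month 4, hour 0 occurs twice in week 13
# and four times in week 14, so its weekly average should be 3.0.
df = pd.DataFrame({
    'month':       [4, 4, 4, 4, 4, 4, 4, 4],
    'week':        [13, 13, 14, 14, 14, 14, 13, 14],
    'day_of_week': [6, 6, 6, 0, 1, 2, 6, 3],
    'hour':        [0, 0, 0, 0, 0, 0, 1, 1],
})

weekly_ave = (df.groupby(['month', 'hour', 'week']).size()  # occurrences per week
                .groupby(level=['month', 'hour']).mean()    # average over the weeks
                .reset_index(name='weekly_ave'))
```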
How can I get the sum of a column, grouped by another column? E.g for the following DataFrame
Paid Week
5 1
2 1
7 2
7 2
How would I get the following output?
Paid Week
7 1
14 2
I have tried this, but it doesn't seem to actually add the values. It also prints some other columns along with it.
print df.groupby(['Paid','Week']).sum()
Week Paid
1 0.0
0.5
2.4
3.0
3.8
3.9
6.6
2 0.0
0.9
2.4
Use:
df.groupby('Week', as_index=False)['Paid'].sum()
Output:
Week Paid
0 1 7
1 2 14
You can use:
df.groupby(by=['Week']).sum()
Would give the following output:
Paid
Week
1 7
2 14
Use the following if you do not want the group labels as the index:
df.groupby(by=['Week'], as_index=False).sum()
Output:
Week Paid
0 1 7
1 2 14
df.groupby(['Week'])['Paid'].sum()
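As a quick runnable check on the example frame:

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({'Paid': [5, 2, 7, 7], 'Week': [1, 1, 2, 2]})

totals = df.groupby('Week', as_index=False)['Paid'].sum()
# totals:
#    Week  Paid
# 0     1     7
# 1     2    14
```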
I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, the excess should spill into the next bucket, making the value/day of the 2nd bucket an average over the 1st and 2nd rows.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5 + 20 = 25 (so its value_per_day is 25/10 = 2.5):
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,15,15,1
1,20,5,2
1,12,12,3
Note: this solution is expensive in time and memory, since each row is repeated days times.
newdf = df.reindex(df.index.repeat(df.days))  # one row per day
v = np.arange(sum(df.days)) // 10             # 10-day bucket label for each day
dd = pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                   'days': np.bincount(v)})
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not group by id here; if your real data needs that, loop over df.groupby('id') and apply the steps above within the loop.
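For example, the per-id version could be sketched like this (bucket_10_days is a hypothetical helper name; it applies the same repeat/bincount steps inside each group):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1],
                   'values': [15, 20, 12],
                   'days': [15, 5, 12],
                   'time': [1, 2, 3]})
df['value_per_day'] = df['values'] / df['days']

def bucket_10_days(g):
    expanded = g.reindex(g.index.repeat(g.days))  # one row per day
    v = np.arange(len(expanded)) // 10            # 10-day bucket label per day
    return pd.DataFrame({
        'value_per_day': expanded.groupby(v).value_per_day.mean(),
        'days': np.bincount(v),
    })

out = pd.concat({i: bucket_10_days(g) for i, g in df.groupby('id')}, names=['id'])
```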
I want to make a new column of the 5-day return for a stock, let's say. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows like I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 I want to find this: ((day 6 price) - (day 1 price))
7 20 then continue this down the list
8 19
9 21
10 22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the value at the given offset, which we subtract from the current row. fillna fills the NaN values that occur before the first valid calculation.
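If the frame held several stocks, the same shift-based idea works per group (a sketch with a hypothetical symbol column):

```python
import pandas as pd

df = pd.DataFrame({
    'symbol': ['A'] * 6 + ['B'] * 6,
    'price':  [10, 11, 15, 14, 12, 18,
               20, 21, 19, 22, 25, 30],
})

# shift(5) within each symbol, so returns never cross symbol boundaries.
df['5-day-return'] = (df['price'] - df.groupby('symbol')['price'].shift(5)).fillna(0)
```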