How can I group dates in pandas - python

Datos
2015-01-01 58
2015-01-02 42
2015-01-03 41
2015-01-04 13
2015-01-05 6
... ...
2020-06-18 49
2020-06-19 41
2020-06-20 23
2020-06-21 39
2020-06-22 22
2000 rows × 1 columns
I have this df, which is made up of a single column whose values are the average temperature for each day over an interval of years. I would like to know how to get the maximum for each day of the year (taking into account that the year has 365 days) and obtain a df similar to this:
Datos
1 40
2 50
3 46
4 8
5 26
... ...
361 39
362 23
363 23
364 37
365 25
365 rows × 1 columns
Forgive my ignorance and thank you very much for the help.

You can do this:
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(by=pd.Grouper(key='Date', freq='D')).max().reset_index()
df['Day'] = df['Date'].dt.dayofyear
print(df)
Date Temp Day
0 2015-01-01 58.0 1
1 2015-01-02 42.0 2
2 2015-01-03 41.0 3
3 2015-01-04 13.0 4
4 2015-01-05 6.0 5
... ... ... ...
1995 2020-06-18 49.0 170
1996 2020-06-19 41.0 171
1997 2020-06-20 23.0 172
1998 2020-06-21 39.0 173
1999 2020-06-22 22.0 174
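To reduce this to one row per day of year, as the question asks, you can finish with another groupby (a small addition not shown in the answer above, using the Temp/Day columns it created):
result = df.groupby('Day')[['Temp']].max()
This leaves at most 366 rows, one per day of year, each holding the maximum temperature observed on that day across all years.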

Make a new column (the dates are the index of your df, so take dayofyear from the index):
df["day of year"] = df.index.dayofyear
Then
df.groupby("day of year").max()


Monthly aggregated values, pandas dataframe

Here is some sample CSV data in which the first column is a timestamp (date + time):
2018-01-01 10:00:00,23,43
2018-01-02 11:00:00,34,35
2018-01-05 12:00:00,25,4
2018-01-10 15:00:00,22,96
2018-01-01 18:00:00,24,53
2018-03-01 10:00:00,94,98
2018-04-20 10:00:00,90,9
2018-04-10 10:00:00,45,51
2018-01-01 10:00:00,74,44
2018-12-01 10:00:00,76,87
2018-11-01 10:00:00,76,87
2018-12-12 10:00:00,87,90
I already wrote some code to do the monthly aggregation while waiting for suggestions.
Thanks #moys, anyway!
import pandas as pd
df1 = pd.read_csv('Sample.txt', header=None, names=['Timestamp', 'Value 1', 'Value 2'])
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['Monthly'] = df1['Timestamp'].dt.to_period('M')
grouper = pd.Grouper(key='Monthly')
df2 = df1.groupby(grouper)[['Value 1', 'Value 2']].sum().reset_index()
The output is:
Monthly Value 1 Value 2
0 2018-01 202 275
1 2018-03 94 98
2 2018-04 135 60
3 2018-11 76 87
4 2018-12 163 177
What if a dataset has more columns? How do I modify my code so that it automatically works on a dataset with more columns?
2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43
2018-02-10 15:00:00,22,96,24
2018-05-01 18:00:00,24,53,98
2018-02-01 10:00:00,94,98,32
2018-02-20 10:00:00,90,9,24
2018-07-10 10:00:00,45,51,32
2018-01-01 10:00:00,74,44,34
2018-12-04 10:00:00,76,87,53
2018-12-02 10:00:00,76,87,21
2018-12-12 10:00:00,87,90,98
You can do something like below:
df.groupby(pd.to_datetime(df['date']).dt.month).sum().reset_index()
Output (here the 'date' column is the month number):
date val1 val2
0 1 202 275
1 3 94 98
2 4 135 60
3 11 76 87
4 12 163 177
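For the wider dataset, nothing needs to change, because sum() aggregates every numeric column automatically. A sketch (the column names val1..val3 are made up, and 'Sample.txt' is assumed to hold the second sample):
import pandas as pd
df = pd.read_csv('Sample.txt', header=None, names=['date', 'val1', 'val2', 'val3'])
out = df.groupby(pd.to_datetime(df['date']).dt.month).sum(numeric_only=True).reset_index()
print(out)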

How to add up current row value to the sum of subsequent values (relative to the date corresponding to the row) in pandas?

Is there a way to add up the current row value and the sum of subsequent values (relative to the date corresponding to the row) in pandas?
I'd like to take the YTD of the corresponding row and add the sum of all the remaining Budget Values for 2019.
Let's suppose we are in the 4th month of 2019.
For example, for row 0, I'd like to have 101 + the sum of the subsequent Values that are under "Budget" and "2019".
For row 1, the same logic would apply (199 + the sum of the subsequent values), etc...
My current table is like this:
Value Type Date YTD YEP (year in projection)
0 100 Budget 2019-01-01 101 NaN
1 50 Budget 2019-02-01 199 NaN
2 20 Budget 2019-03-01 275 NaN
3 123 Budget 2019-04-01 332 NaN
4 56 Budget 2019-05-01 332 NaN
5 76 Budget 2019-06-01 332 NaN
6 98 Budget 2019-07-01 332 NaN
7 126 Budget 2019-08-01 332 NaN
8 90 Budget 2019-09-01 332 NaN
9 80 Budget 2019-10-01 332 NaN
10 67 Budget 2019-11-01 332 NaN
11 87 Budget 2019-12-01 332 NaN
12 101 Actual 2019-01-01 101 NaN
13 98 Actual 2019-02-01 199 NaN
14 76 Actual 2019-03-01 275 NaN
15 57 Actual 2019-04-01 332 NaN
Desired table:
Value Type Date YTD YEP (year in projection)
0 100 Budget 2019-01-01 101 974
1 50 Budget 2019-02-01 199 1022
2 20 Budget 2019-03-01 275 1078
3 123 Budget 2019-04-01 332 1012
4 56 Budget 2019-05-01 NaN NaN
5 76 Budget 2019-06-01 NaN NaN
6 98 Budget 2019-07-01 NaN NaN
7 126 Budget 2019-08-01 NaN NaN
8 90 Budget 2019-09-01 NaN NaN
9 80 Budget 2019-10-01 NaN NaN
10 67 Budget 2019-11-01 NaN NaN
11 87 Budget 2019-12-01 NaN NaN
12 101 Actual 2019-01-01 101 974
13 98 Actual 2019-02-01 199 1022
14 76 Actual 2019-03-01 275 1078
15 57 Actual 2019-04-01 332 1012
Here are two Excel screencaps to better grasp the calculation I'm talking about:
screencap1
screencap2
The Excel screencap illustrates what I want to do, even though it's not rigorously the same thing (since I can't visually delimit the area to sum; with pandas I want to set conditions instead).
Note that I know how to set conditions in Python, but here the problem is deeper, and that's precisely why I'm asking for your help.
Is there a function to say "hey, take the sum of this batch of numbers, but always starting from where you are positioned" (which is what relative references and dollar signs in Excel allow us to do)?
Thank you!
Alex
We can use GroupBy.cumsum after inverting the DataFrame with [::-1]:
df['Date'] = pd.to_datetime(df['Date'])
df['YEP'] = (df[::-1].loc[df['Type'].eq('Budget')]
                     .groupby(df['Date'].dt.year)
                     .Value
                     .cumsum()
                     .sub(df['Value'])
                     .add(df['YTD'])
                     .groupby(df['Date'])
                     .transform('first'))
print(df)
Value Type Date YTD YEP
0 100 Budget 2019-01-01 101 974.0
1 50 Budget 2019-02-01 199 1022.0
2 20 Budget 2019-03-01 275 1078.0
3 123 Budget 2019-04-01 332 1012.0
4 56 Budget 2019-05-01 332 956.0
5 76 Budget 2019-06-01 332 880.0
6 98 Budget 2019-07-01 332 782.0
7 126 Budget 2019-08-01 332 656.0
8 90 Budget 2019-09-01 332 566.0
9 80 Budget 2019-10-01 332 486.0
10 67 Budget 2019-11-01 332 419.0
11 87 Budget 2019-12-01 332 332.0
12 101 Actual 2019-01-01 101 974.0
13 98 Actual 2019-02-01 199 1022.0
14 76 Actual 2019-03-01 275 1078.0
15 57 Actual 2019-04-01 332 1012.0
Then we can use DataFrame.mask to blank out the repeated values:
df[['YTD','YEP']] = df[['YTD','YEP']].mask(df.assign(year=df['Date'].dt.year)
                                             .duplicated(['Type','YTD','year']))
# if there is only one year, this is enough:
# df[['YTD','YEP']] = df[['YTD','YEP']].mask(df.duplicated(['Type','YTD']))
print(df)
Value Type Date YTD YEP
0 100 Budget 2019-01-01 101.0 974.0
1 50 Budget 2019-02-01 199.0 1022.0
2 20 Budget 2019-03-01 275.0 1078.0
3 123 Budget 2019-04-01 332.0 1012.0
4 56 Budget 2019-05-01 NaN NaN
5 76 Budget 2019-06-01 NaN NaN
6 98 Budget 2019-07-01 NaN NaN
7 126 Budget 2019-08-01 NaN NaN
8 90 Budget 2019-09-01 NaN NaN
9 80 Budget 2019-10-01 NaN NaN
10 67 Budget 2019-11-01 NaN NaN
11 87 Budget 2019-12-01 NaN NaN
12 101 Actual 2019-01-01 101.0 974.0
13 98 Actual 2019-02-01 199.0 1022.0
14 76 Actual 2019-03-01 275.0 1078.0
15 57 Actual 2019-04-01 332.0 1012.0
Please note that this operation is carried out for each year, although this dataframe only shows 2019.
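To see why the inverted cumulative sum works: reversing with [::-1], taking cumsum, and reversing back gives, for each row, the sum of that row and everything after it. A toy illustration:
import pandas as pd
s = pd.Series([100, 50, 20, 123])
print(s[::-1].cumsum()[::-1])
# 0    293
# 1    193
# 2    143
# 3    123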

Python fill zeros in a timeseries dataframe

I have a list of dates and a dataframe. The dataframe has an id column and other value columns, but not every id has a row for every date. I want to fill zeros in all columns for the ids and dates where there is no data. Let me show you by example:
date id clicks conv rev
2019-01-21 234 34 1 10
2019-01-21 235 32 0 0
2019-01-24 234 56 2 20
2019-01-23 235 23 3 30
date list is like this:
[2019-01-01, 2019-01-02, 2019-01-03, ..., 2019-02-28]
What I want is to add zeros for all the missing dates in the dataframe for all ids. So the resultant df should look like:
date id clicks conv rev
2019-01-01 234 0 0 0
2019-01-01 235 0 0 0
. . . .
. . . .
2019-01-21 234 34 1 10
2019-01-21 235 32 0 0
2019-01-22 234 0 0 0
2019-01-22 235 0 0 0
2019-01-23 234 0 0 0
2019-01-23 235 23 3 30
2019-01-24 234 56 2 20
2019-01-24 235 0 0 0
. . . .
2019-02-28 0 0 0 0
Use set_index + reindex with the Cartesian product of dates and ids. Here I'll create the dates with pd.date_range to save some typing, and ensure the dates are datetimes:
import pandas as pd
df['date'] = pd.to_datetime(df.date)
my_dates = pd.date_range('2019-01-01', '2019-02-28', freq='D')
idx = pd.MultiIndex.from_product([my_dates, df.id.unique()], names=['date', 'id'])
df = df.set_index(['date', 'id']).reindex(idx).fillna(0).reset_index()
Output: df
date id clicks conv rev
0 2019-01-01 234 0.0 0.0 0.0
1 2019-01-01 235 0.0 0.0 0.0
...
45 2019-01-23 235 23.0 3.0 30.0
46 2019-01-24 234 56.0 2.0 20.0
47 2019-01-24 235 0.0 0.0 0.0
...
115 2019-02-27 235 0.0 0.0 0.0
116 2019-02-28 234 0.0 0.0 0.0
117 2019-02-28 235 0.0 0.0 0.0
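One caveat: reindex introduces NaN before fillna(0), which is why the numeric columns come back as floats in the output above. If integer dtypes matter, a cast afterwards restores them (column names from the example):
df[['clicks', 'conv', 'rev']] = df[['clicks', 'conv', 'rev']].astype(int)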

Customised start and end date of the month

I have a data frame which contains dates and values. I have to compute the sum of the values for each month,
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
But the problem is that in my data set each month starts on the 21st and ends on the 20th. Is there any way to tell pandas to group months from the 21st day to the 20th?
Assume my data frame's starting and ending dates are:
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
So far I have tried:
import datetime

starting_date = df['Date'].min()
ending_date = df['Date'].max()
month_wise_sum = []
while starting_date <= ending_date:
    temp = starting_date + datetime.timedelta(days=31)
    e_y = temp.year
    e_m = temp.month
    e_d = 20
    temp = datetime.datetime(e_y, e_m, e_d)
    month_wise_sum.append(df[df['Date'].between(starting_date, temp)]['Value'].sum())
    starting_date = temp + datetime.timedelta(days=1)
print(month_wise_sum)
My code above does the job, but I'm still looking for a more Pythonic way to achieve it.
My biggest problem is slicing the data frame month-wise,
for example,
2015-11-21 to 2015-12-20
Is there any pythonic way to achieve this?
Thanks in Advance.
For example, consider this as my dataframe. It contains dates from date_range(datetime.datetime(2017, 1, 21), datetime.datetime(2017, 10, 20)).
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
so that I can calculate each month's sum of the Value series:
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope I explained it thoroughly.
For the purpose of testing my solution, I generated some random data; the frequency is daily, but it should work for any frequency.
import numpy as np
import pandas as pd

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here you see that I passed an array of datetimes as the index. Indexing by dates unlocks a lot of added functionality in pandas. With your data you should do (if the Date column already contains only datetime values):
df = df.set_index('Date')
Then I would artificially realign your data by subtracting 20 days from the index:
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample the data to a monthly index, summing all data in the same month:
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month (for me, something like this):
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
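For reference, the realignment and resampling can also be written as one chain (a sketch, assuming the randomly generated df above):
monthly = (df.shift(-20, freq='D')   # shift the DatetimeIndex back 20 days
             .resample('M')
             .sum())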
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np
start_date = "2015-11-21"
# As #ALollz mentioned, the month containing the original end_date='2017-11-20' was missing:
# pd.date_range() only generates dates inside the specified range (between start= and end=),
# so '2017-11-30' (using freq='M') exceeds the original end='2017-11-20' and is cut off.
# A similar situation applies to start_date (using freq='MS'), where the start month might be cut off.
# An easy fix is to extend end_date into the next month, use the end of its own month
# ('2017-11-30'), or replace end= with periods=25.
end_date = "2017-12-20"
# create a testing dataframe
df = pd.DataFrame({ "date": pd.date_range(start_date, periods=710, freq='D'), "value": np.random.randn(710)})
# set up bins to include all dates to create expected date ranges
bins = [ d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M") ]
# group and sum using the ranges from the bins above
df.groupby(pd.cut(df.date, bins)).sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()
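If the interval labels are awkward to work with downstream, pd.cut also accepts explicit labels; a sketch labeling each window by the month of its right edge:
labels = [b.strftime('%Y-%m') for b in bins[1:]]
summary = df.groupby(pd.cut(df.date, bins, labels=labels)).value.sum()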

How do I group hourly data by day and count only values greater than a set amount in Pandas?

I am new to Pandas but have been working with Python for a few years now.
I have a large data set of hourly data with multiple columns. I need to group the data by day, then count how many times the value is above 85 for each day, for each column.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help would be greatly appreciated.
(D_RH>85).sum()
The above code gets me close, but I need a daily breakdown as well, not just the column counts.
One way would be to make date a DatetimeIndex and then group the result of the comparison with 85 by date. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
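Equivalently, since date is now a DatetimeIndex, resample can do the daily bucketing (a minor variation on the same idea):
>>> (df > 85).resample("D").sum()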
