How to get a monthly mean in pandas using groupby - python

I have the following DataFrame:
data=pd.read_csv('anual.csv', parse_dates='Fecha', index_col=0)
data
DatetimeIndex: 290 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00
Data columns (total 12 columns):
HR 290 non-null values
PreciAcu 290 non-null values
RadSolar 290 non-null values
T 290 non-null values
Presion 290 non-null values
Tmax 290 non-null values
HRmax 290 non-null values
Presionmax 290 non-null values
RadSolarmax 290 non-null values
Tmin 290 non-null values
HRmin 290 non-null values
Presionmin 290 non-null values
dtypes: float64(4), int64(8)
where:
data['HR']
Fecha
2011-01-01 37
2011-02-01 70
2011-03-01 62
2011-04-01 69
2011-05-01 72
2011-06-01 71
2011-07-01 71
2011-08-01 70
2011-09-01 40
...
2011-12-17 92
2011-12-18 78
2011-12-19 79
2011-12-20 76
2011-12-21 78
2011-12-22 80
2011-12-23 72
2011-12-24 70
In addition, some months are not always complete. My goal is to calculate the average of each month from daily data. This is achieved as follows:
monthly=data.resample('M', how='mean')
HR PreciAcu RadSolar T Presion Tmax
Fecha
2011-01-31 68.586207 3.744828 163.379310 17.496552 0 25.875862
2011-02-28 68.666667 1.966667 208.000000 18.854167 0 28.879167
2011-03-31 69.136364 3.495455 218.090909 20.986364 0 30.359091
2011-04-30 68.956522 1.913043 221.130435 22.165217 0 31.708696
2011-05-31 72.700000 0.550000 201.100000 18.900000 0 27.460000
2011-06-30 70.821429 6.050000 214.000000 23.032143 0 30.621429
2011-07-31 78.034483 5.810345 188.206897 21.503448 0 27.951724
2011-08-31 71.750000 1.028571 214.750000 22.439286 0 30.657143
2011-09-30 72.481481 0.185185 196.962963 21.714815 0 29.596296
2011-10-31 68.083333 1.770833 224.958333 18.683333 0 27.075000
2011-11-30 71.750000 0.812500 169.625000 18.925000 0 26.237500
2011-12-31 71.833333 0.160000 159.533333 17.260000 0 25.403333
The first error I find is in the precipitation column: all observations in January are 0, yet an average of 3.74 is obtained for that month.
When I compute the averages in Excel and compare them with the results above, there is significant variation. For example, the mean of HR for February is
mean HR using pandas=68.66
mean HR using excel=67
Another detail I found:
data['PreciAcu']['2011-01'].count()
29 (it should be 31)
Am I doing something wrong?
How can I fix this error?
Attached CSV file: https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv

Your date column is being misinterpreted, because it's in DD/MM/YYYY format. Set dayfirst=True instead:
>>> df = pd.read_csv('anual.csv', parse_dates=['Fecha'], dayfirst=True, index_col=0, sep=r"\s+")
>>> df['PreciAcu']['2011-01'].count()
31
>>> df.resample("M").mean()
HR PreciAcu RadSolar T Presion Tmax \
Fecha
2011-01-31 68.774194 0.000000 162.354839 16.535484 0 25.393548
2011-02-28 67.000000 0.000000 193.481481 15.418519 0 25.696296
2011-03-31 59.083333 0.850000 254.541667 21.295833 0 32.325000
2011-04-30 61.200000 1.312000 260.640000 24.676000 0 34.760000
2011-05-31 NaN NaN NaN NaN NaN NaN
2011-06-30 68.428571 8.576190 236.619048 25.009524 0 32.028571
2011-07-31 81.518519 11.488889 185.407407 22.429630 0 27.681481
2011-08-31 76.451613 0.677419 219.645161 23.677419 0 30.719355
2011-09-30 77.533333 2.883333 196.100000 21.573333 0 28.723333
2011-10-31 73.120000 1.260000 196.280000 19.552000 0 27.636000
2011-11-30 71.277778 -79.333333 148.555556 18.250000 0 26.511111
2011-12-31 73.741935 0.067742 134.677419 15.687097 0 24.019355
HRmax Presionmax Tmin
Fecha
2011-01-31 92.709677 0 10.909677
2011-02-28 92.111111 0 8.325926
2011-03-31 89.291667 0 13.037500
2011-04-30 89.400000 0 17.328000
2011-05-31 NaN NaN NaN
2011-06-30 92.095238 0 19.761905
2011-07-31 97.185185 0 18.774074
2011-08-31 96.903226 0 18.670968
2011-09-30 97.200000 0 16.373333
2011-10-31 97.000000 0 13.412000
2011-11-30 94.555556 0 11.877778
2011-12-31 94.161290 0 10.070968
[12 rows x 9 columns]
(Note, though, that dayfirst=True isn't strict; see here. Using date_parser or an explicit format may be safer.)
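For instance, here is a minimal sketch of strict parsing, assuming the file is whitespace-separated as in the snippet above and that Fecha is strictly DD/MM/YYYY (date_parser itself is deprecated in recent pandas, so parsing after read_csv is shown instead):
import pandas as pd

df = pd.read_csv('anual.csv', sep=r'\s+')
df['Fecha'] = pd.to_datetime(df['Fecha'], format='%d/%m/%Y')  # raises on anything malformed
df = df.set_index('Fecha')
monthly = df.resample('M').mean()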

Related

How to compare row values for diff columns for the entire df and obtain the latest record?

I have a df of about 100000 rows, a sample of which is as follows:
id commodity frequency ms_id created modified measuring_type tariff overshoot_delta timestamp time_series_id quantity type
0 12188 1 900 12191 2019-03-25 12:40:00 2019-11-19 05:38:00 29 0 0 2019-03-16 23:00:00 12188 50.25 220
1 12858 1 900 12861 2019-04-08 15:13:00 2019-11-19 05:39:00 29 0 0 2019-03-16 23:00:00 12858 50.25 220
2 12858 7 900 12861 2019-04-08 15:13:00 2019-11-19 05:39:00 29 0 0 2019-03-16 23:00:00 12858 50.25 220
3 12188 1 900 12191 2019-03-25 12:40:00 2019-11-19 05:38:00 29 10 0 2019-03-16 23:00:00 12188 50.25 250
4 12188 1 900 12191 2019-03-25 12:41:00 2019-11-19 05:38:00 29 10 0 2019-03-16 23:00:00 12188 50.25 250
What I would like to do is to check the values in the columns: commodity, measuring_type, tariff, timestamp, type and see if there are duplicates in any rows. If the values in the above-mentioned columns are exactly the same for any 2 rows, then I want to take the last value (greatest time) from the created column. Such a check has to be done for all the rows in the df.
From the above example, the expected output:
id commodity frequency ms_id created modified measuring_type tariff overshoot_delta timestamp time_series_id quantity type
0 12858 1 900 12861 2019-04-08 15:13:00 2019-11-19 05:39:00 29 0 0 2019-03-16 23:00:00 12858 50.25 220
1 12858 7 900 12861 2019-04-08 15:13:00 2019-11-19 05:39:00 29 0 0 2019-03-16 23:00:00 12858 50.25 220
2 12188 1 900 12191 2019-03-25 12:41:00 2019-11-19 05:38:00 29 10 0 2019-03-16 23:00:00 12188 50.25 250
The first 2 rows had same values for the columns commodity, measuring_type, tariff, timestamp, type, so the time values in the created column have to be compared for those 2 rows and the greatest one (2019-04-08 15:13:00) has to be selected. Similarly for the last 2 rows.
Since the third row had a different value, it shouldn't be dropped and this must be added to the output.
How can this be done?
Thanks
Let us try sort_values followed by drop_duplicates:
df=df.sort_values('created').drop_duplicates(['commodity', 'measuring_type', 'tariff', 'timestamp', 'type'], keep='last')
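For illustration, a minimal sketch of that one-liner on a toy frame shaped like the sample above (the values are made up and only the relevant columns are included):
import pandas as pd

df = pd.DataFrame({
    'commodity':      [1, 1, 7, 1, 1],
    'measuring_type': [29, 29, 29, 29, 29],
    'tariff':         [0, 0, 0, 10, 10],
    'timestamp':      ['2019-03-16 23:00:00'] * 5,
    'type':           [220, 220, 220, 250, 250],
    'created':        pd.to_datetime(['2019-03-25 12:40:00', '2019-04-08 15:13:00',
                                      '2019-04-08 15:13:00', '2019-03-25 12:40:00',
                                      '2019-03-25 12:41:00']),
})
keys = ['commodity', 'measuring_type', 'tariff', 'timestamp', 'type']
latest = df.sort_values('created').drop_duplicates(keys, keep='last')
print(latest)  # three rows remain, each carrying the greatest 'created' in its group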

How can I group dates in pandas

Datos
2015-01-01 58
2015-01-02 42
2015-01-03 41
2015-01-04 13
2015-01-05 6
... ...
2020-06-18 49
2020-06-19 41
2020-06-20 23
2020-06-21 39
2020-06-22 22
2000 rows × 1 columns
I have this df, which is made up of a column whose data represent the average temperature of each day over an interval of years. I would like to know how to get the maximum for each day of the year across all the years (taking into account that the year has 365 days), and obtain a df similar to this:
Datos
1 40
2 50
3 46
4 8
5 26
... ...
361 39
362 23
363 23
364 37
365 25
365 rows × 1 columns
Forgive my ignorance and thank you very much for the help.
You can do this:
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(by=pd.Grouper(key='Date', freq='D')).max().reset_index()
df['Day'] = df['Date'].dt.dayofyear
print(df)
Date Temp Day
0 2015-01-01 58.0 1
1 2015-01-02 42.0 2
2 2015-01-03 41.0 3
3 2015-01-04 13.0 4
4 2015-01-05 6.0 5
... ... ... ...
1995 2020-06-18 49.0 170
1996 2020-06-19 41.0 171
1997 2020-06-20 23.0 172
1998 2020-06-21 39.0 173
1999 2020-06-22 22.0 174
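To collapse this to one row per day of the year (the 365-row frame asked for), you can then group on the new Day column. A minimal follow-up sketch, assuming the Temp and Day column names from the output above:
daily_max = df.groupby('Day')['Temp'].max()
print(daily_max)  # up to 366 rows, since leap years add day 366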
Make a new column:
df["day of year"] = df.Datos.dayofyear
Then
df.groupby("day of year").max()

Datetime fails when setting astype, date mangled

I am importing a csv of 20 variables and 1500 records. There are 5 date columns in UK date format (dd/mm/yyyy), and they import as str.
I need to be able to subtract one date from another. They are hospital admissions; I need to subtract the admission date from the discharge date to get the length of stay.
I have had a number of problems.
To illustrate I have used 2 columns.
import pandas as pd
import numpy as np
from datetime import datetime
# import the .csv
df = pd.read_csv("/Users........csv", usecols = ['ADMIDATE', 'DISDATE'])
df
ADMIDATE DISDATE
0 04/02/2018 07/02/2018
1 25/07/2017 1801-01-01
2 28/06/2017 01/07/2017
3 22/06/2017 1801-01-01
4 11/12/2017 15/12/2017
... ... ...
1503 25/01/2019 27/01/2019
1504 31/08/2018 1801-01-01
1505 20/09/2018 05/11/2018
1506 28/09/2018 1801-01-01
1507 21/02/2019 24/02/2019
1508 rows × 2 columns
I removed about 100 records with a DISDATE of 1801-01-01; these are likely bad data from the patient still being in hospital when the data was collected.
To convert the dates to datetime, I have used .astype('datetime64[ns]')
This is because I didn't know how to use pd.to_datetime on multiple columns.
df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null int64
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB
So, the conversion appears to have worked.
However, on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
Unnamed: 0 ADMIDATE DISDATE
0 0 2018-04-02 2018-07-02
1 2 2017-06-28 2017-01-07
2 4 2017-11-12 2017-12-15
3 5 2017-09-04 2017-12-04
4 6 2017-05-30 2017-01-06
5 7 2017-02-08 2017-07-08
6 8 2017-11-17 2017-11-18
7 9 2018-03-14 2018-03-20
8 10 2017-04-26 2017-03-05
9 11 2017-05-16 2017-05-17
10 12 2018-01-17 2018-01-19
11 13 2017-12-18 2017-12-20
12 14 2017-02-10 2017-04-10
13 16 2017-03-30 2017-07-04
14 17 2017-01-12 2017-12-18
15 18 2017-12-07 2017-07-14
16 19 2017-05-04 2017-08-04
17 20 2017-10-30 2017-01-11
18 21 2017-06-19 2017-06-22
19 22 2017-04-05 2017-08-05
So when I subtract ADMIDATE from DISDATE I am getting negative values.
df['DISDATE'] - df['ADMIDATE']
0 91 days
1 -172 days
2 33 days
3 91 days
4 -144 days
...
1394 188 days
1395 -291 days
1396 2 days
1397 -132 days
1398 3 days
Length: 1399, dtype: timedelta64[ns]
I would like a method that works on all my date columns, keeps the UK format and allows me to do basic operations on the date fields.
After the suggestion below from @code-different, which seems very sensible:
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
The format is unchanged despite dayfirst=True.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null datetime64[ns]
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB
df.head()
Unnamed: 0 ADMIDATE DISDATE
0 1970-01-01 00:00:00.000000000 2018-04-02 2018-07-02
1 1970-01-01 00:00:00.000000002 2017-06-28 2017-01-07
2 1970-01-01 00:00:00.000000004 2017-11-12 2017-12-15
3 1970-01-01 00:00:00.000000005 2017-09-04 2017-12-04
4 1970-01-01 00:00:00.000000006 2017-05-30 2017-01-06
I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
just tell pandas.to_datetime to use a specific and adequate format, e.g.:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017',
                                '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01',
                               '01/07/2017', '1801-01-01', '15/12/2017']}).replace({'1801-01-01': np.datetime64('NaT')})

for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
# df
# ADMIDATE DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25 NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22 NaT
# 4 2017-12-11 2017-12-15
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ADMIDATE 5 non-null datetime64[ns]
# 1 DISDATE 3 non-null datetime64[ns]
# dtypes: datetime64[ns](2)
Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns so you just loop over the columns one by one. The strings are in UK format (day-first) so you simply tell to_datetime that:
df = pd.read_csv('/path/to/file.csv', usecols = ['ADMIDATE','DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
astype('datetime64[ns]') is too inflexible for what you need.
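Once both columns are real datetimes, the length of stay is just a subtraction. A minimal sketch with made-up dates (the LOS_days column name is illustrative):
import pandas as pd

df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '28/06/2017'],
                   'DISDATE':  ['07/02/2018', '01/07/2017']})
for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)

df['LOS_days'] = (df['DISDATE'] - df['ADMIDATE']).dt.days  # 3 and 3 here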

Customised start and end date of the month

I have a data frame which contains dates and values. I have to compute the sum of the values for each month.
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
But the problem is that in my data set the month starts on the 21st and ends on the 20th. Is there any way to tell pandas to group the month from the 21st to the 20th?
Assume my data frame's starting and ending dates are:
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
So far I tried:
starting_date = df['Date'].min()
ending_date = df['Date'].max()
month_wise_sum = []
while starting_date <= ending_date:
    temp = starting_date + datetime.timedelta(days=31)
    e_y = temp.year
    e_m = temp.month
    e_d = 20
    temp = datetime.datetime(e_y, e_m, e_d)
    month_wise_sum.append(df[df['Date'].between(starting_date, temp)]['Value'].sum())
    starting_date = temp + datetime.timedelta(days=1)
print(month_wise_sum)
My above code does the job, but I am still looking for a more pythonic way to achieve it.
My biggest problem is slicing the data frame month-wise,
for example,
2015-11-21 to 2015-12-20
Is there any pythonic way to achieve this?
Thanks in Advance.
For example, consider this as my dataframe. It contains dates from date_range(datetime.datetime(2017, 1, 21), datetime.datetime(2017, 10, 20)).
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
So that I could calculate each month's sum of the Value series:
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope I explained it thoroughly.
For the purpose of testing my solution, I generated some random data; the frequency is daily, but it should work for any frequency.
import numpy as np
import pandas as pd

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here you see that I passed an array of datetimes as the index. Indexing with dates allows for a lot of added functionality in pandas. With your data you should do (if the Date column already contains only datetime values):
df = df.set_index('Date')
Then I would artificially realign your data by subtracting 20 days from the index:
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample the data to a monthly index, summing all data within the same month:
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month; for me it looks something like:
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
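If you want the labels to reflect the original 21st-to-20th windows, you can shift them back after resampling. A sketch assuming the same random test data as above:
import numpy as np
import pandas as pd
from datetime import timedelta

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})

shifted = df.copy()
shifted.index -= timedelta(days=20)   # realign so each 21st-to-20th window falls in one calendar month
monthly = shifted.resample('M').sum()
monthly.index += timedelta(days=20)   # relabel each row with the 20th that closes its window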
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np
start_date = "2015-11-21"
# As @ALollz mentioned, the month ending at the original end_date='2017-11-20' was missing:
# pd.date_range() only generates dates within the specified range (between start= and end=),
# so '2017-11-30' (using freq='M') exceeds the original end='2017-11-20' and gets cut off.
# A similar situation applies to start_date (using freq="MS"), where the start month might be cut off.
# An easy fix is to extend end_date into the next month, use the end date of its own month
# ('2017-11-30'), or replace end= with periods=25.
end_date = "2017-12-20"
# create a testing dataframe
df = pd.DataFrame({ "date": pd.date_range(start_date, periods=710, freq='D'), "value": np.random.randn(710)})
# set up bins to include all dates to create expected date ranges
bins = [ d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M") ]
# group and summary using the ranges from the above bins
df.groupby(pd.cut(df.date, bins)).sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()

pandas dataframe by index and integer

So I have a pandas dataframe indexed by date.
I need to grab a value from the dataframe by date...and then grab the value from the dataframe that was the day before...except I can't just subtract a day, since weekends and holidays are missing from the data.
It would be great if I could write:
x = dataframe.ix[date]
and
i = dataframe.ix[date].index
date2 = dataframe[i-1]
I'm not married to this solution. If there is a way to get the date or index number exactly one prior to the date I know, I would be happy...(short of looping through the whole dataframe and testing to see if I have a match, and saving the count...)
Use .get_loc to get the integer position of a label value in the index:
In [51]:
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2015, 2, 1)),
                  data={'a': np.arange(32)})
df
Out[51]:
a
2015-01-01 0
2015-01-02 1
2015-01-03 2
2015-01-04 3
2015-01-05 4
2015-01-06 5
2015-01-07 6
2015-01-08 7
2015-01-09 8
2015-01-10 9
2015-01-11 10
2015-01-12 11
2015-01-13 12
2015-01-14 13
2015-01-15 14
2015-01-16 15
2015-01-17 16
2015-01-18 17
2015-01-19 18
2015-01-20 19
2015-01-21 20
2015-01-22 21
2015-01-23 22
2015-01-24 23
2015-01-25 24
2015-01-26 25
2015-01-27 26
2015-01-28 27
2015-01-29 28
2015-01-30 29
2015-01-31 30
2015-02-01 31
Here using .get_loc on the index will return the ordinal position:
In [52]:
df.index.get_loc('2015-01-10')
Out[52]:
9
Pass this value to .iloc to get a row value by ordinal position:
In [53]:
df.iloc[df.index.get_loc('2015-01-10')]
Out[53]:
a 9
Name: 2015-01-10 00:00:00, dtype: int32
You can then subtract 1 from this to get the previous row:
In [54]:
df.iloc[df.index.get_loc('2015-01-10') - 1]
Out[54]:
a 8
Name: 2015-01-09 00:00:00, dtype: int32
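If this lookup is needed repeatedly, it can be wrapped in a small helper. A sketch that assumes the index labels are unique and the requested date actually exists (previous_row is an illustrative name, not a pandas function):
import datetime as dt
import numpy as np
import pandas as pd

def previous_row(df, date):
    """Return the row immediately before `date` in index order."""
    pos = df.index.get_loc(date)  # integer position of an exact, unique label
    if pos == 0:
        raise KeyError(f"{date} is the first entry; there is no previous row")
    return df.iloc[pos - 1]

df = pd.DataFrame(index=pd.date_range(dt.datetime(2015, 1, 1), dt.datetime(2015, 2, 1)),
                  data={'a': np.arange(32)})
print(previous_row(df, '2015-01-10'))  # the 2015-01-09 row, a == 8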
