Subset selected days data in Python

I have some time series data as:
import pandas as pd
index = pd.date_range('06/01/2014',periods=24*30,freq='H')
df1 = pd.DataFrame(range(len(index)),index=index)
Now I want to subset the data for the dates below:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
I tried the following statement, but it is not working:
sub_data = df1.loc[df1.index.isin(pd.to_datetime(selec_dates))]
What am I doing wrong? Is there another approach to subset data for selected days?

You need to compare dates only, not the full timestamps; for the membership test use numpy.in1d:
import numpy as np
sub_data = df1.loc[np.in1d(df1.index.date, pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
...
If you want to use isin, you need to create a Series with the same index:
sub_data = df1.loc[pd.Series(df1.index.date, index=df1.index)
.isin(pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
...
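A shorter variant avoids the helper Series entirely: normalize the index to midnight and test membership directly on the DatetimeIndex. This is a sketch assuming a reasonably recent pandas, where DatetimeIndex.normalize and Index.isin are both available:
# normalize() drops the time-of-day, so whole days compare equal
sub_data = df1[df1.index.normalize().isin(pd.to_datetime(selec_dates))]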

I'm sorry, I misunderstood your question. This:
df1[pd.Series(df1.index.date, index=df1.index).isin(pd.to_datetime(selec_dates).date)]
should perform what is needed.
Original answer:
Please check the pandas documentation on selection.
You can easily do
sub_data = df1.loc[pd.to_datetime(selec_dates)]
Note, however, that this selects only the three midnight timestamps (one row per date), not all 24 hourly rows of each day, which is why the date-level comparison above is needed.

You can use the .query() method; the @ prefix pulls objects in from the surrounding Python scope, and normalize() drops the time-of-day so whole days compare equal:
In [202]: df1.query('@df1.index.normalize() in @selec_dates')
Out[202]:
0
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
... ...
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]

Edit: I have been made aware that this only works when the date range stays within a single month and year, as in the question, because .day compares only the day of the month. For a more general (and better) answer, see @jezrael's solution.
You can use np.in1d and .day on your index if you want to do it the way you tried:
import numpy as np
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
df1.loc[np.in1d(df1.index.day, pd.to_datetime(selec_dates).day)]
This gives you what you require:
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
2014-06-10 12:00:00 228
2014-06-10 13:00:00 229
2014-06-10 14:00:00 230
2014-06-10 15:00:00 231
2014-06-10 16:00:00 232
2014-06-10 17:00:00 233
2014-06-10 18:00:00 234
2014-06-10 19:00:00 235
2014-06-10 20:00:00 236
2014-06-10 21:00:00 237
2014-06-10 22:00:00 238
2014-06-10 23:00:00 239
2014-06-15 00:00:00 336
2014-06-15 01:00:00 337
2014-06-15 02:00:00 338
2014-06-15 03:00:00 339
2014-06-15 04:00:00 340
2014-06-15 05:00:00 341
...
2014-06-15 18:00:00 354
2014-06-15 19:00:00 355
2014-06-15 20:00:00 356
2014-06-15 21:00:00 357
2014-06-15 22:00:00 358
2014-06-15 23:00:00 359
2014-06-20 00:00:00 456
2014-06-20 01:00:00 457
2014-06-20 02:00:00 458
2014-06-20 03:00:00 459
2014-06-20 04:00:00 460
2014-06-20 05:00:00 461
2014-06-20 06:00:00 462
2014-06-20 07:00:00 463
2014-06-20 08:00:00 464
2014-06-20 09:00:00 465
2014-06-20 10:00:00 466
2014-06-20 11:00:00 467
2014-06-20 12:00:00 468
2014-06-20 13:00:00 469
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
I used these sources for this answer:
- Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps
- In Python-Pandas, How can I subset a dataframe by specific datetime index values?
- return pandas DF column with the number of days elapsed between index and today's date
- Get weekday/day-of-week for Datetime column of DataFrame
- https://stackoverflow.com/a/36893416/2254228

Use the string repr of the date, leaving out the time component; partial string indexing then selects every row in that day. (In recent pandas versions, row selection by a date string should go through .loc rather than plain [] indexing, which is why .loc is used below.)
pd.concat([df1.loc['2014-06-10'], df1.loc['2014-06-15'], df1.loc['2014-06-20']])
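If the list of days grows, building the pieces in a loop keeps this readable (the same partial-string indexing, just generalized over selec_dates):
pd.concat([df1.loc[d] for d in selec_dates])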

Related

Merge two pandas dataframes by index and replace column values in Python

I have two pandas dataframes:
DF1
import numpy as np
import pandas as pd

index = np.arange('2020-01-01 00:00', '2020-01-01 00:04', dtype='datetime64[m]')
data = np.random.randint(100, 500, size=(4, 4))
columns = ['Open', 'High', 'Low', 'Close']
df = pd.DataFrame(data, index=index, columns=columns)
df.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 315 298 296 493
2020-01-01 00:03:00 324 411 198 101
DF2
index = np.arange('2020-01-01 00:02', '2020-01-01 00:05', dtype='datetime64[m]')
data2 = np.random.randint(100, 500, size=(3, 4))
columns = ['Open', 'High', 'Low', 'Close']
df2 = pd.DataFrame(data2, index=index, columns=columns)
df2.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
I need to merge both dataframes by the index (Time) and replace the column values of DF1 by the column values of DF2.
This is my expected output:
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475 ->>>> Correspond to DF1
2020-01-01 00:01:00 362 135 456 235 ->>>> Correspond to DF1
2020-01-01 00:02:00 430 394 131 490 ->>>> Correspond to DF2
2020-01-01 00:03:00 190 211 394 359 ->>>> Correspond to DF2
2020-01-01 00:04:00 192 291 143 350 ->>>> Correspond to DF2
I have tried several functions, including merge and concat (pd.concat([df, df2], join="inner")), but with no success. Any help would be much appreciated. Thanks!
Try combine_first, which aligns on the index, takes values from the caller (df2), and fills the remaining gaps from the argument (df):
df2.combine_first(df)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
Because you mentioned pd.concat, here is how you could do it with that.
out = pd.concat([df, df2])
out = out[~out.index.duplicated(keep='last')]
print(out)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
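For completeness: DataFrame.update may look tempting here, but it only overwrites values at index labels that already exist in the caller, so it cannot add the new 00:04 row from df2. A small sketch of the difference (df3 is just an illustrative copy):
df3 = df.copy()
df3.update(df2)  # overwrites the 00:02 and 00:03 rows in place,
                 # but the 00:04 row from df2 is silently dropped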

Slicing pandas DateTimeIndex with steps

I often deal with pandas DataFrames with DatetimeIndexes, where I want to, for example, select only the rows where the hour of the index equals 6. The only way I currently know how to do this is with reindexing:
df.reindex(pd.date_range(*df.index.to_series().agg([min, max]).apply(lambda ts: ts.replace(hour=6)), freq="24H"))
But this is quite unreadable and complex, which gets even worse when there is a MultiIndex with multiple DateTimeIndex levels. I know of methods that use .reset_index() and then either df.where or df.loc with conditional statements, but is there a simpler way to do this with regular IndexSlicing? I tried it as follows
df.loc[df.index.min().replace(hour=6)::pd.Timedelta(24, unit="H")]
but this gives a TypeError:
TypeError: '>=' not supported between instances of 'Timedelta' and 'int'
If your index is a DatetimeIndex, you can use:
>>> df[df.index.hour == 6]
val
2022-03-01 06:00:00 7
2022-03-02 06:00:00 31
2022-03-03 06:00:00 55
2022-03-04 06:00:00 79
2022-03-05 06:00:00 103
2022-03-06 06:00:00 127
2022-03-07 06:00:00 151
2022-03-08 06:00:00 175
2022-03-09 06:00:00 199
2022-03-10 06:00:00 223
2022-03-11 06:00:00 247
2022-03-12 06:00:00 271
2022-03-13 06:00:00 295
2022-03-14 06:00:00 319
2022-03-15 06:00:00 343
2022-03-16 06:00:00 367
2022-03-17 06:00:00 391
2022-03-18 06:00:00 415
2022-03-19 06:00:00 439
2022-03-20 06:00:00 463
2022-03-21 06:00:00 487
Setup:
dti = pd.date_range('2022-3-1', '2022-3-22', freq='1H')
df = pd.DataFrame({'val': range(1, len(dti)+1)}, index=dti)
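If you only ever need a single time of day, between_time gives the same rows without building a boolean mask by hand; start and end are inclusive by default, so passing the same value selects exactly that time:
# selects the rows whose time-of-day is exactly 06:00
df.between_time('06:00', '06:00')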

Why doesn't pandas .last('1W') show the last 7 days?

Trying to extract a max value from a pandas DataFrame with a datetime index, I'm using .last('1W').
My data starts on the first day of the month (2020-09-01 00:00:00). It seems to work properly until I reach today (Monday 07/09/2020). At first I supposed that .last() takes the last calendar week from some starting point (Sunday, I guessed) instead of the last 7 days (as I assumed), but what confuses me is that if I extend the hours, the first sample of the resulting DataFrame shifts too...
I'm trying to simulate this with:
import pandas as pd
i = pd.date_range('2020-09-01', periods=24*6+5, freq='1H')
values = range(0, 24*6+5 )
df = pd.DataFrame({'A': values}, index=i)
print(df)
print(df.last('1W'))
With output:
A
2020-09-01 00:00:00 0
2020-09-01 01:00:00 1
2020-09-01 02:00:00 2
2020-09-01 03:00:00 3
2020-09-01 04:00:00 4
... ...
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
[149 rows x 1 columns]
A
2020-09-06 05:00:00 125
2020-09-06 06:00:00 126
2020-09-06 07:00:00 127
2020-09-06 08:00:00 128
2020-09-06 09:00:00 129
2020-09-06 10:00:00 130
2020-09-06 11:00:00 131
2020-09-06 12:00:00 132
2020-09-06 13:00:00 133
2020-09-06 14:00:00 134
2020-09-06 15:00:00 135
2020-09-06 16:00:00 136
2020-09-06 17:00:00 137
2020-09-06 18:00:00 138
2020-09-06 19:00:00 139
2020-09-06 20:00:00 140
2020-09-06 21:00:00 141
2020-09-06 22:00:00 142
2020-09-06 23:00:00 143
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
The first value in df is 0 at 2020-09-01 00:00:00.
But when I apply last('1W'), the selection goes from 2020-09-06 05:00:00 to the last value, instead of covering the last 7 days as I assumed; nor does it start from 2020-09-06 00:00:00, which is what I would expect if the operator worked Sunday to Sunday.
If you're looking for an offset of 7 days, why not use the Day offset, rather than the Week?
"1W" offset isn't the same as "7D" because "1W" starting on a Monday in a two-week dataset where the last row is Tuesday will have only 2 days. "2W" will include previous week (Monday-Sunday) + (Monday-Tuesday).
You can see the effects of changing the start day of the week by calling the offset class directly, like so:
week_offset = pd.tseries.offsets.Week(n=1, weekday=0) # week starting Monday
day_offset = pd.tseries.offsets.Day(n=7) # or simply "7D"
df.last(day_offset)
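Note that DataFrame.first and DataFrame.last were deprecated in pandas 2.1, so on recent versions an explicit slice is the forward-compatible spelling. A rough equivalent, assuming a sorted DatetimeIndex (unlike .last, the slice includes the boundary instant itself):
df.loc[df.index[-1] - pd.Timedelta(days=7):]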

Python - How to group values with same hour stamp but different dates by hour

I need to show the number of calls made within each hour over an entire month. So far I have resampled the CSV in the following way:
Amount
Date
2017-03-01 00:00:00 5
2017-03-01 01:00:00 1
.
.
2017-03-31 22:00:00 7
2017-03-31 23:00:00 2
The date is a DatetimeIndex and I resampled all values into intervals of one hour.
What I need is to be able to group all rows by hour: that is, to group all the calls made on any day of the month at 21:00, for example, sum the amount, and show it in a single row.
For example:
Amount
Date
2017-03 00:00:00 600
2017-03 01:00:00 200
2017-03 02:00:00 30
.
.
2017-03 22:00:00 500
2017-03 23:00:00 150
Setup
import numpy as np
import pandas as pd
import datetime as dt
idx = pd.date_range('2017-03-01 00:00:00', '2017-03-31 23:00:00', freq='H')
df = pd.DataFrame(index=idx, columns=['Amount'], data=np.random.randint(1, 100, len(idx)))
Solution
# Convert the index to strings without the day component, e.g. '2017-03 00:00:00'
group_date = [dt.datetime.strftime(e, '%Y-%m %H:%M:%S') for e in df.index]
# Group the data by the new labels and sum per group
df.groupby(group_date)['Amount'].sum().to_frame()
Out[592]:
Amount
2017-03 00:00:00 1310
2017-03 01:00:00 1339
2017-03 02:00:00 1438
2017-03 03:00:00 1660
2017-03 04:00:00 1503
2017-03 05:00:00 1466
2017-03 06:00:00 1380
2017-03 07:00:00 1720
2017-03 08:00:00 1399
2017-03 09:00:00 1633
2017-03 10:00:00 1632
2017-03 11:00:00 1706
2017-03 12:00:00 1526
2017-03 13:00:00 1433
2017-03 14:00:00 1577
2017-03 15:00:00 1061
2017-03 16:00:00 1769
2017-03 17:00:00 1449
2017-03 18:00:00 1621
2017-03 19:00:00 1602
2017-03 20:00:00 1541
2017-03 21:00:00 1409
2017-03 22:00:00 1711
2017-03 23:00:00 1313
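If the year-month prefix in the labels is not actually needed (the data here covers a single month anyway), grouping on the hour alone is simpler and skips the string formatting entirely:
# one row per hour of day, summed across all days
df.groupby(df.index.hour)['Amount'].sum()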
You can use DatetimeIndex.strftime with groupby, aggregating with sum:
df1 = df.groupby(df.index.strftime('%Y-%m %H:%M:%S'))[['Amount']].sum()
#with borrowing sample from Allen
print (df1)
Amount
2017-03 00:00:00 1528
2017-03 01:00:00 1505
2017-03 02:00:00 1606
2017-03 03:00:00 1750
2017-03 04:00:00 1493
2017-03 05:00:00 1649
2017-03 06:00:00 1390
2017-03 07:00:00 1147
2017-03 08:00:00 1687
2017-03 09:00:00 1602
2017-03 10:00:00 1755
2017-03 11:00:00 1381
2017-03 12:00:00 1390
2017-03 13:00:00 1565
2017-03 14:00:00 1805
2017-03 15:00:00 1674
2017-03 16:00:00 1375
2017-03 17:00:00 1601
2017-03 18:00:00 1493
2017-03 19:00:00 1422
2017-03 20:00:00 1781
2017-03 21:00:00 1709
2017-03 22:00:00 1578
2017-03 23:00:00 1583
Another solution uses DatetimeIndex.to_period and DatetimeIndex.hour. Group by both, aggregate with sum, and finally build a flat index from the resulting MultiIndex with map:
a = df.index.to_period('M')
b = df.index.hour
df1 = df.groupby([a,b])[['Amount']].sum()
#http://stackoverflow.com/questions/17118071/python-add-leading-zeroes-using-str-format
df1.index = df1.index.map(lambda x: '{0[0]} {0[1]:0>2}:00:00'.format(x))
print (df1)
Amount
2017-03 00:00:00 1739
2017-03 01:00:00 1502
2017-03 02:00:00 1585
2017-03 03:00:00 1710
2017-03 04:00:00 1679
2017-03 05:00:00 1371
2017-03 06:00:00 1489
2017-03 07:00:00 1252
2017-03 08:00:00 1540
2017-03 09:00:00 1443
2017-03 10:00:00 1589
2017-03 11:00:00 1499
2017-03 12:00:00 1837
2017-03 13:00:00 1834
2017-03 14:00:00 1695
2017-03 15:00:00 1616
2017-03 16:00:00 1499
2017-03 17:00:00 1329
2017-03 18:00:00 1727
2017-03 19:00:00 1764
2017-03 20:00:00 1754
2017-03 21:00:00 1621
2017-03 22:00:00 1486
2017-03 23:00:00 1672
Timings:
In [394]: %timeit (jez(df))
1 loop, best of 3: 630 ms per loop
In [395]: %timeit (df.groupby(df.index.strftime('%Y-%m %H:%M:%S'))[['Amount']].sum())
1 loop, best of 3: 792 ms per loop
#Allen's solution
In [396]: %timeit (df.groupby([dt.datetime.strftime(e, '%Y-%m %H:%M:%S') for e in df.index])['Amount'].sum().to_frame())
1 loop, best of 3: 663 ms per loop
Code for timings:
np.random.seed(100)
#[68712 rows x 1 columns]
idx = pd.date_range('2010-03-01 00:00:00', '2017-12-31 23:00:00', freq='H')
df = pd.DataFrame(index=idx, columns=['Amount'], data=np.random.randint(1,100,len(idx)))
print (df.head())
def jez(df):
    a = df.index.to_period('M')
    b = df.index.hour
    df1 = df.groupby([a, b])[['Amount']].sum()
    df1.index = df1.index.map(lambda x: '{0[0]} {0[1]:0>2}:00:00'.format(x))
    return df1

Python pandas. Group By and removing a timestamp

I have the below pandas DataFrame. I need to group by column B, sum column A, and remove the timestamp. So, in the data below, I should end up with one record per value of B, with the A values summed. How do I do this in pandas?
A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 134
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 134
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 134
2013-03-16 12:00:00 985 134
2013-03-17 08:00:00 258 134
This can be done with a straightforward groupby operation:
import io
import pandas as pd
content='''\
date time A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 135
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 135
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 136
2013-03-16 12:00:00 985 136
2013-03-17 08:00:00 258 137'''
df = pd.read_table(io.StringIO(content), sep=r'\s+',
                   parse_dates=[[0, 1]], header=0,
                   index_col=0)
print(df.groupby(['B']).sum())
yields
A
B
134 4406
135 1328
136 2783
137 258
Some of the values in B were changed to show a more interesting groupby operation.
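If B should remain a regular column rather than become the index, pass as_index=False (using the same df as above):
print(df.groupby('B', as_index=False)['A'].sum())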
