I'm trying to group by the respective month over a span of years, but to no avail. For example: group all the Januarys from 2011 - 2013 together, group all the Febs together, and so on.
Partial Dataset:
Date
2011-01-01 161
2011-02-01 117
2011-03-01 239
2012-01-01 289
2012-02-01 294
2012-03-01 378
2013-01-01 383
2013-02-01 361
Expected Output:
Date
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Attempted:
Date is a DatetimeIndex
df = df.groupby([df.index.year],[df.index.month])
Output: TypeError: unhashable type: 'list'
You are passing two lists; pass one list with two elements, for example:
df = df.groupby([df.index.year, df.index.month])
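A minimal sketch of the fixed call on the sample data (the value column name 'Value' is my assumption; the original snippet doesn't name it):

import pandas as pd

idx = pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01',
                      '2012-01-01', '2012-02-01', '2012-03-01',
                      '2013-01-01', '2013-02-01'])
df = pd.DataFrame({'Value': [161, 117, 239, 289, 294, 378, 383, 361]}, index=idx)

# one list, two keys: each group is a (year, month) pair
print(df.groupby([df.index.year, df.index.month])['Value'].sum())

# to pool the same month across years (all Januarys together), group by month alone
print(df.groupby(df.index.month)['Value'].sum())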
We can try argsort
df=df.iloc[df.index.month.argsort()]
df
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Name: Date, dtype: int64
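One caveat worth noting (my addition, not part of the original answer): numpy's default argsort is not stable, so rows within the same month are not guaranteed to keep their chronological order. Passing a stable kind avoids that:

# 'stable' preserves the original (chronological) order within each month
df = df.iloc[df.index.month.argsort(kind='stable')]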
I am importing data from an Excel worksheet where I have a 'Duration' field displayed as [h]:mm (so that the total number of hours is shown). I understand that underneath, this is simply a number of days stored as a float.
I want to work with this as a timedelta column (or similar) in a pandas DataFrame, but no matter what I do, any hours over 24 (i.e. the days portion) get dropped.
Excel data (over 24 hours highlighted): [screenshot not reproduced here]
Pandas import (1d 7h 51m):
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 1900-01-01 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
Running a to_datetime conversion simply drops the day (integer) part of the column:
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
I have tried fixing the dtype on import, but only str or object work: dtype={'Duration': str} is accepted, while float gives the error float() argument must be a string or a number, not 'datetime.time'. Even with str or object, Python still treats the column values as datetime.time.
Ideally I do not want to change the Excel source data or export to .csv as an intermediate step.
If I got it correctly, the imported objects are datetime.time for durations under 24 hours, and full datetime objects anchored to Excel's 1900 date system for longer ones (hence the 1900-01-01 07:51:00 row above).
So you must convert with a custom function:
from datetime import datetime, time, timedelta

def convert(t):
    # durations under 24h arrive as datetime.time; anchor them to a datetime
    # so both cases can be handled by one subtraction
    if isinstance(t, time):
        t = datetime.combine(datetime.min, t)
    delta = t - datetime.min
    if delta.days != 0:
        # longer durations arrive as datetimes in Excel's 1900 system;
        # 693594 is the number of days from datetime.min to 1899-12-31,
        # so 1900-01-01 maps to exactly 1 day
        delta -= timedelta(days=693594)
    return delta

df['Duration'].apply(convert)
Output:
0 0 days 04:36:00
1 0 days 06:35:00
2 0 days 08:05:00
3 0 days 05:54:00
4 0 days 09:10:00
5 0 days 06:15:00
6 0 days 10:23:00
7 0 days 06:09:00
8 0 days 06:46:00
9 0 days 05:27:00
10 0 days 14:15:00
11 1 days 07:51:00 # corrected
12 0 days 07:51:00
13 0 days 09:00:00
14 0 days 05:29:00
15 0 days 09:00:00
...
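If you want the column stored as pandas' native timedelta64 dtype rather than as Python timedelta objects, a small follow-up sketch (my addition, not part of the original answer):

import pandas as pd

# pd.to_timedelta accepts a Series of datetime.timedelta objects
df['Duration'] = pd.to_timedelta(df['Duration'].apply(convert))
print(df['Duration'].dtype)  # timedelta64[ns]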
I have two pandas dataframes:
DF1
import numpy as np
import pandas as pd

index = np.arange('2020-01-01 00:00', '2020-01-01 00:04', dtype='datetime64[m]')
data = np.random.randint(100, 500, size=(4, 4))
columns = ['Open', 'High', 'Low', 'Close']
df = pd.DataFrame(data, index=index, columns=columns)
df.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 315 298 296 493
2020-01-01 00:03:00 324 411 198 101
DF2
index = np.arange('2020-01-01 00:02', '2020-01-01 00:05', dtype='datetime64[m]')
data2 = np.random.randint(100, 500, size=(3, 4))
df2 = pd.DataFrame(data2, index=index, columns=columns)
df2.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
I need to merge both dataframes on the index (Time), replacing DF1's column values with DF2's wherever the timestamps overlap.
This is my expected output:
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475 ->>>> Correspond to DF1
2020-01-01 00:01:00 362 135 456 235 ->>>> Correspond to DF1
2020-01-01 00:02:00 430 394 131 490 ->>>> Correspond to DF2
2020-01-01 00:03:00 190 211 394 359 ->>>> Correspond to DF2
2020-01-01 00:04:00 192 291 143 350 ->>>> Correspond to DF2
I have tried several functions, including merge and concat (pd.concat([df, df2], join="inner")), but with no success. Any help would be much appreciated. Thanks!
Try combine_first, which aligns on the index, keeps df2's values, and only fills the timestamps missing from df2 with df's rows:
df2.combine_first(df)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
Because you mentioned pd.concat, here is how you could do it with that.
out = pd.concat([df, df2])
# where both frames share a timestamp, keep the last occurrence, i.e. df2's row
out = out[~out.index.duplicated(keep='last')]
print(out)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
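An equivalent one-liner (my sketch, same idea as above): stack the frames, group by timestamp, and keep the last row seen per timestamp, which is df2's wherever the frames overlap:

out = pd.concat([df, df2]).groupby(level=0).last()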
I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The dtypes of df are
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in Python:
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. Yet when I do
df[df['time'] < '2016-01-05'] it works. Please help.
The problem here is that the comparison is performed against an exact timestamp; since none of the times are '00:00:00', nothing matches. You'd have to compare just the date components for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
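An alternative sketch (my addition, not from the original answer) that avoids materializing Python date objects element by element: normalize the timestamps to midnight and compare against the day directly:

# dt.normalize() zeroes out the time-of-day component
df[df['time'].dt.normalize() == '2016-01-04']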
IIUC you can use DatetimeIndex Partial String Indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82
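Partial strings also work for slices, which is handy for date ranges (a small extra example on the same data):

# .loc slicing with partial strings includes both whole days
print(df.loc['2016-01-04':'2016-01-05'])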
I have a pandas DataFrame like this:
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids present before 6th Jan 2016 but not after 6th Jan 2016,
so in this example it should return buyer_ids 191 and 251.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6th Jan 2016, but how do I also check that a buyer is not present after 6th Jan? Please help.
IIUC you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)  # make sure 'time' is datetime64, not strings
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]  # buyers seen from 6th Jan onwards
select = df.buyer_id[(df['time'] < '2016-01-06')]   # buyers seen before 6th Jan
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
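If you only need each qualifying id once, append .unique() (my addition):

select[~select.isin(exclude)].unique()  # e.g. array([191, 251])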
You could use:
df.groupby('buyer_id').apply(lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
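To turn that boolean series into the ids themselves, one more step (my addition, not part of the original answer):

mask = df.groupby('buyer_id').apply(
    lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
print(mask[mask].index.tolist())  # [191, 251]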
I have the following dataframe, which I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
import statsmodels.api as sm

decomp = sm.tsa.seasonal_decompose(sales_df.Value)
df = pd.concat([sales_df, decomp.trend], axis=1)
df.columns = ['sales', 'trend']
This is getting me this:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaNs at the start and at the end of the trend series.
So I ask: is that right? Why is it happening?
This is expected: seasonal_decompose uses a symmetric (centered) moving average by default when the filt argument is not specified, as in your case. The frequency is inferred from the time series; with monthly data it is 12, so the centered window needs 6 observations on each side, and the first and last 6 trend values cannot be computed.
https://searchcode.com/codesearch/view/86129185/
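If the edge NaNs are a problem, newer statsmodels releases accept an extrapolate_trend argument (a sketch, assuming your version supports it):

import statsmodels.api as sm

# 'freq' extrapolates the trend at both ends using a linear fit,
# so decomp.trend has no leading/trailing NaNs
decomp = sm.tsa.seasonal_decompose(sales_df.Value, extrapolate_trend='freq')
print(decomp.trend.isna().sum())  # 0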