I'm trying to group by the respective month over a span of years, but to no avail. For example: group all the Januarys from 2011 - 2013 together, group all the Febs together, and so on.
Partial Dataset:
Date
2011-01-01 161
2011-02-01 117
2011-03-01 239
2012-01-01 289
2012-02-01 294
2012-03-01 378
2013-01-01 383
2013-02-01 361
Expected Output:
Date
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Attempted:
Date is a DatetimeIndex
df = df.groupby([df.index.year],[df.index.month])
Output: TypeError: unhashable type: 'list'
You are passing two lists; pass one list with two elements, for example:
df = df.groupby([df.index.year, df.index.month])
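A minimal sketch of the fixed call on the sample data (the value column name 'Value' is my assumption; the original snippet doesn't name it):

import pandas as pd

idx = pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01',
                      '2012-01-01', '2012-02-01', '2012-03-01',
                      '2013-01-01', '2013-02-01'])
df = pd.DataFrame({'Value': [161, 117, 239, 289, 294, 378, 383, 361]}, index=idx)

# one list, two keys: each group is a (year, month) pair
print(df.groupby([df.index.year, df.index.month])['Value'].sum())

# to pool the same month across years (all Januarys together), group by month alone
print(df.groupby(df.index.month)['Value'].sum())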
We can try argsort
df=df.iloc[df.index.month.argsort()]
df
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Name: Date, dtype: int64
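One caveat worth noting (my addition, not part of the original answer): numpy's default argsort is not stable, so rows within the same month are not guaranteed to keep their chronological order. Passing a stable kind avoids that:

# 'stable' preserves the original (chronological) order within each month
df = df.iloc[df.index.month.argsort(kind='stable')]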
I am importing data from an Excel worksheet where I have a 'Duration' field displayed as [h]:mm (so that the total number of hours is shown). I understand that underneath, this is simply a number of days stored as a float.
I want to work with this as a timedelta column (or similar) in a pandas DataFrame, but no matter what I do, any hours over 24 (i.e. the days portion) get dropped.
Excel data (over 24 hours highlighted): [screenshot not reproduced here]
Pandas import (1d 7h 51m):
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 1900-01-01 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
Running a to_datetime conversion simply drops the day (integer) part of the column:
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
I have tried fixing the dtype on import, but only str or object work: dtype={'Duration': str} is accepted, while float gives the error float() argument must be a string or a number, not 'datetime.time'. Even with str or object, Python still treats the column values as datetime.time.
Ideally I do not want to change the Excel source data or export to .csv as an intermediate step.
If I got it correctly, the imported objects are datetime.time for durations under 24 hours, and full datetime objects anchored to Excel's 1900 date system for longer ones (hence the 1900-01-01 07:51:00 row above).
So you must convert with a custom function:
from datetime import datetime, time, timedelta

def convert(t):
    # durations under 24h arrive as datetime.time; anchor them to a datetime
    # so both cases can be handled by one subtraction
    if isinstance(t, time):
        t = datetime.combine(datetime.min, t)
    delta = t - datetime.min
    if delta.days != 0:
        # longer durations arrive as datetimes in Excel's 1900 system;
        # 693594 is the number of days from datetime.min to 1899-12-31,
        # so 1900-01-01 maps to exactly 1 day
        delta -= timedelta(days=693594)
    return delta

df['Duration'].apply(convert)
Output:
0 0 days 04:36:00
1 0 days 06:35:00
2 0 days 08:05:00
3 0 days 05:54:00
4 0 days 09:10:00
5 0 days 06:15:00
6 0 days 10:23:00
7 0 days 06:09:00
8 0 days 06:46:00
9 0 days 05:27:00
10 0 days 14:15:00
11 1 days 07:51:00 # corrected
12 0 days 07:51:00
13 0 days 09:00:00
14 0 days 05:29:00
15 0 days 09:00:00
...
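If you want the column stored as pandas' native timedelta64 dtype rather than as Python timedelta objects, a small follow-up sketch (my addition, not part of the original answer):

import pandas as pd

# pd.to_timedelta accepts a Series of datetime.timedelta objects
df['Duration'] = pd.to_timedelta(df['Duration'].apply(convert))
print(df['Duration'].dtype)  # timedelta64[ns]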
I have two pandas dataframes:
DF1
import numpy as np
import pandas as pd

index = np.arange('2020-01-01 00:00', '2020-01-01 00:04', dtype='datetime64[m]')
data = np.random.randint(100, 500, size=(4, 4))
columns = ['Open', 'High', 'Low', 'Close']
df = pd.DataFrame(data, index=index, columns=columns)
df.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 315 298 296 493
2020-01-01 00:03:00 324 411 198 101
DF2
index = np.arange('2020-01-01 00:02', '2020-01-01 00:05', dtype='datetime64[m]')
data2 = np.random.randint(100, 500, size=(3, 4))
df2 = pd.DataFrame(data2, index=index, columns=columns)
df2.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
I need to merge both dataframes on the index (Time), replacing DF1's column values with DF2's wherever the timestamps overlap.
This is my expected output:
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475 ->>>> Correspond to DF1
2020-01-01 00:01:00 362 135 456 235 ->>>> Correspond to DF1
2020-01-01 00:02:00 430 394 131 490 ->>>> Correspond to DF2
2020-01-01 00:03:00 190 211 394 359 ->>>> Correspond to DF2
2020-01-01 00:04:00 192 291 143 350 ->>>> Correspond to DF2
I have tried several functions, including merge and concat (pd.concat([df, df2], join="inner")), but with no success. Any help would be much appreciated. Thanks!
Try combine_first, which aligns on the index, keeps df2's values, and only fills the timestamps missing from df2 with df's rows:
df2.combine_first(df)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
Because you mentioned pd.concat, here is how you could do it with that.
out = pd.concat([df, df2])
# where both frames share a timestamp, keep the last occurrence, i.e. df2's row
out = out[~out.index.duplicated(keep='last')]
print(out)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
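An equivalent one-liner (my sketch, same idea as above): stack the frames, group by timestamp, and keep the last row seen per timestamp, which is df2's wherever the frames overlap:

out = pd.concat([df, df2]).groupby(level=0).last()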
I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The dtypes of df are
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in Python:
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. Yet when I do
df[df['time'] < '2016-01-05'] it works. Please help.
The problem here is that the comparison is performed against an exact timestamp; since none of the times are '00:00:00', nothing matches. You'd have to compare just the date components for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
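An alternative sketch (my addition, not from the original answer) that avoids materializing Python date objects element by element: normalize the timestamps to midnight and compare against the day directly:

# dt.normalize() zeroes out the time-of-day component
df[df['time'].dt.normalize() == '2016-01-04']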
IIUC you can use DatetimeIndex Partial String Indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82
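Partial strings also work for slices, which is handy for date ranges (a small extra example on the same data):

# .loc slicing with partial strings includes both whole days
print(df.loc['2016-01-04':'2016-01-05'])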
I have a pandas DataFrame like this:
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids present before 6th Jan 2016 but not after 6th Jan 2016,
so in this example it should return buyer_ids 191 and 251.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6th Jan 2016, but how do I also check that a buyer is not present after 6th Jan? Please help.
IIUC you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)  # make sure 'time' is datetime64, not strings
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]  # buyers seen from 6th Jan onwards
select = df.buyer_id[(df['time'] < '2016-01-06')]   # buyers seen before 6th Jan
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
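If you only need each qualifying id once, append .unique() (my addition):

select[~select.isin(exclude)].unique()  # e.g. array([191, 251])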
You could use:
df.groupby('buyer_id').apply(lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
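To turn that boolean series into the ids themselves, one more step (my addition, not part of the original answer):

mask = df.groupby('buyer_id').apply(
    lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
print(mask[mask].index.tolist())  # [191, 251]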
I have the following dataframe, which I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
import statsmodels.api as sm

decomp = sm.tsa.seasonal_decompose(sales_df.Value)
df = pd.concat([sales_df, decomp.trend], axis=1)
df.columns = ['sales', 'trend']
This is getting me this:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaNs at the start and at the end of the trend series.
So I ask: is that right? Why is it happening?
This is expected: seasonal_decompose uses a symmetric (centered) moving average by default when the filt argument is not specified, as in your case. The frequency is inferred from the time series; with monthly data it is 12, so the centered window needs 6 observations on each side, and the first and last 6 trend values cannot be computed.
https://searchcode.com/codesearch/view/86129185/
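If the edge NaNs are a problem, newer statsmodels releases accept an extrapolate_trend argument (a sketch, assuming your version supports it):

import statsmodels.api as sm

# 'freq' extrapolates the trend at both ends using a linear fit,
# so decomp.trend has no leading/trailing NaNs
decomp = sm.tsa.seasonal_decompose(sales_df.Value, extrapolate_trend='freq')
print(decomp.trend.isna().sum())  # 0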