I have the following DataFrame:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I am applying the following to get rolling returns:
periodicity_dict = {1:'daily', 7:'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
for col in df_columns:
df1[col+'_rolling']= np.nan
for i in range(key, len(df1[col][df1[col].first_valid_index():df1[col].last_valid_index()])):
df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i-key])/df[col].iloc[i-key]
But I am getting the following error: KeyError: 'col_A'.
What am I doing wrong? And is there a better way to do this with less loops?
I think you are looking for something like the shift method (no for-loop is needed):
df1['col_A_rolling'] = (df1['col_A'] - df1['col_A'].shift(7)) / df1['col_A'].shift(7)
OUTPUT:
col_A col_B col_C col_A_rolling
2022-01-01 99330 12 122 NaN
2022-01-02 1123 1230 1287 NaN
2022-01-03 123 101 812739 NaN
2022-01-04 1143 1230123 252 NaN
2022-01-05 234 342 4546 NaN
2022-01-06 2445 3453 3457 NaN
2022-01-07 7897 8657 5675 NaN
2022-01-08 46 5675 453 -0.999537
2022-01-09 76 484 3735 -0.932324
2022-01-10 363 93 4568 1.951220
2022-01-11 385 568 367 -0.663167
2022-01-12 458 846 4847 0.957265
2022-01-13 574 45747 658468 -0.765235
2022-01-14 57457 46534 4675 6.275801
I have the following data:
Data:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I want to add a new column called CheckCount counting the values in the Vol and Mx columns IF they are greater than 150. I have written the following code:
Code:
import pandas as pd
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['CheckCount'] = (Observations[['Vol', 'Mx']]>150).count(axis=1)
print(Observations)
However, unfortunately it is counting every value (result is always 2) rather than only where the values are >150 - what is wrong with my code?
Current Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,2
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,2
101,2017-01-04,78.0,130,182,2
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,2
101,2017-01-07,52.0,134,183,2
101,2017-01-08,62.0,148,176,2
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,2
Desired Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,1
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,1
101,2017-01-04,78.0,130,182,1
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,1
101,2017-01-07,52.0,134,183,1
101,2017-01-08,62.0,148,176,1
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,0
Are you looking for:
df['CheckCount'] = df[['Vol','Mx']].gt(150).sum(1)
Output:
ObjectID Date Price Vol Mx CheckCount
0 101 2017-01-01 NaN 145 203 1
1 101 2017-01-02 NaN 155 163 2
2 101 2017-01-03 67.0 140 234 1
3 101 2017-01-04 78.0 130 182 1
4 101 2017-01-05 58.0 178 202 2
5 101 2017-01-06 53.0 134 204 1
6 101 2017-01-07 52.0 134 183 1
7 101 2017-01-08 62.0 148 176 1
8 101 2017-01-09 42.0 152 193 2
9 101 2017-01-10 80.0 137 150 0
I have a pandas DataFrame like this..
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids present before 6th jan 2016 but not after 6th Jan 2016
so, it should return me buyer_id 79
I am doing following in Python.
df.buyer_id[(df['time'] < '2016-01-06')]
This returns me all the buyer ids before 6th jan 2016 but how to check for the condition if its not present after 6th jan ? Please help
IIUC you could use isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
You could use:
df.groupby('buyer_id').apply(lambda x: True if (x.time < '01-06-2016').any() and not (x.time > '01-06-2016').any() else False)
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
I have the following dataframe, that I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
decomp=sm.tsa.seasonal_decompose(sales_df.Value)
df=pd.concat([sales_df,decomp.trend],axis=1)
df.columns=['sales','trend']
This is getting me this:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaN's in the start and in the end of the Trend's series.
So I ask, is that right? Why is that happening?
This is expected as seasonal_decompose uses a symmetric moving average by default if the filt argument is not specified (as you did). The frequency is inferred from the time series.
https://searchcode.com/codesearch/view/86129185/