I have the following DF:
            A Index  B Index  C Index  D Index
            PX_LAST  PX_LAST  PX_LAST  PX_LAST
2021-12-31   1.5101    0.195   -2.101   -0.509
2022-01-03    1.628    0.244   -2.032   -0.468
2022-01-04   1.6473    0.233   -2.074   -0.511
2022-01-05   1.7052    0.229   -2.045   -0.468
2022-01-06   1.7211    0.261   -1.965    -0.37
2022-01-07    1.762    0.285    -1.97   -0.338
2022-01-10   1.7603    0.284   -1.964   -0.361
2022-01-11   1.7357    0.347   -1.961   -0.348
2022-01-12   1.7428    0.321   -1.995   -0.384
2022-01-13   1.7041    0.288   -1.993   -0.394
2022-01-14   1.7841    0.332   -1.959   -0.352
2022-01-17   1.7841    0.355   -1.948   -0.339
2022-01-18   1.8735    0.368   -1.941   -0.311
2022-01-19   1.8646     0.38   -1.924   -0.283
2022-01-20    1.804    0.363   -1.918   -0.306
2022-01-21   1.7581    0.332   -1.925   -0.291
2022-01-24   1.7706    0.305   -1.959    -0.28
2022-01-25   1.7689    0.331   -1.954   -0.294
2022-01-26   1.8637    0.336   -1.951   -0.265
2022-01-27   1.7994    0.344   -1.943    -0.33
2022-01-28   1.7694    0.367    -1.95   -0.365
2022-01-31   1.7767    0.424   -1.969   -0.402
When I try to plot it with:
df.plot(x=df.index, y=["A Index", "B Index", "C Index", "D Index"])
it throws the following error:
KeyError: "Index([2021-12-31, 2022-01-03, 2022-01-04, 2022-01-05, 2022-01-06, 2022-01-07,\n 2022-01-10, 2022-01-11, 2022-01-12, 2022-01-13,\n ...\n dtype='object', length=135) not in index"
What are these '\n'? How can I plot this DF?
Many Thanks
The \n are just newline characters in the string representation of the index that is embedded in the error message; they are shown literally because the KeyError prints the index's repr. The error itself comes from passing x=df.index: the x argument expects a column label, not the index values. Plotting against the index is the default behavior when no x is specified, so df.plot(y=["A Index", "B Index"]) gives the expected line plot.
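For example, a minimal sketch of the fix (assuming every column shares the PX_LAST second header level):
# drop the second PX_LAST header level if it gets in the way
# (assumption: every column shares that level)
df.columns = df.columns.droplevel(1)

df.plot(y=["A Index", "B Index", "C Index", "D Index"])
# or simply plot every column against the index
df.plot()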
I have the following dataframe with daily data:
day value
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 0.894
2017-08-10 2.332
2017-08-11 0.886
2017-08-12 0.973
2017-08-13 0.980
... ...
2022-03-21 0.821
2022-03-22 1.121
2022-03-23 1.064
2022-03-24 1.058
2022-03-25 0.891
2022-03-26 1.010
2022-03-27 1.023
2022-03-28 1.393
2022-03-29 2.013
2022-03-30 3.872
[1700 rows x 1 columns]
I need to generate pooled averages using moving windows. Let me explain it group by group:
The first group must contain the data from 2017-08-04 to 2017-08-08, but also the data from 2018-08-04 to 2018-08-08, and so on until the last year. As shown below:
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
---------- -----
2018-08-04 2.125
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
... ...
2020-08-04 0.965
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
---------- -----
2021-08-04 1.348
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
The second group must shift the time window by one day. That is, data from 2017-08-05 to 2017-08-09, from 2018-08-05 to 2018-08-09, and so on until the last year. As shown below:
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 1.823
---------- -----
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
2018-08-09 2.009
... ...
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
2020-08-09 1.934
---------- -----
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
2021-08-09 2.348
And the following groups must follow the same dynamic. Finally, I need to form a DataFrame whose index is the central date of each window (so the DataFrame will have 365 rows, one per day of the year) and whose values are the averages of the groups described above.
I have been trying to combine groupby and rolling, but any solution based on other pandas methods is completely valid.
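One direct (if not especially fast) sketch, assuming one observation per calendar day and dropping Feb 29 for simplicity (the function name and the 5-day default are my own):
import numpy as np
import pandas as pd

def pooled_window_mean(s, window=5):
    # drop Feb 29 so every year contributes the same calendar days
    s = s[~((s.index.month == 2) & (s.index.day == 29))]
    doy = s.index.dayofyear
    # shift post-February days in leap years so dates line up across years
    doy = doy - (s.index.is_leap_year & (doy > 59)).astype(int)
    half = window // 2
    out = {}
    for d in range(1, 366):
        # modular arithmetic wraps the window around the year boundary
        days = [(d + k - 1) % 365 + 1 for k in range(-half, half + 1)]
        out[d] = s[np.isin(doy, days)].mean()
    # indexed by day-of-year of the central date; mapping back to dates is omitted
    return pd.Series(out)

pooled = pooled_window_mean(df['value'])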
First question here, and a long one: there are a couple of things I am struggling with regarding merging and formatting my DataFrames. I have some half-working solutions, but I am unsure whether they are the best possible for what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2 =
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered() like below to merge my dataframes (3+). This kind of yields what I want and was from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and want the indexes with the same RRT values to be placed on the same row - and if the RRT values are unique for that dataset I want a NaN for missing data from other datasets.
# The for loop I use to generate the list of formatted DataFrames prior to merging
import os
from functools import reduce
import pandas as pd

dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # block of formatting code removed
        dfs.append(entry.round(2))

dfs = [df1ar, df2ar, df3ar]
df_final = reduce(lambda left, right: pd.merge_ordered(left, right, on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a MultiIndex based on the filename of the DataFrame that the data came from and place it above the corresponding columns? Like the suffix option, but tied back to the filename, and for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create the list of tables prior to merging.)
Is this reduced merge_ordered the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine tune the merging based on the similarities between the RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this, and I get the MultiIndex denoting which file the data came from (A, B, C are just placeholders). However, it has obviously not merged on the RRT value like I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
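One hedged sketch for the MultiIndex question: suffix each non-key column with its file label before the reduce (the labels below are hypothetical stand-ins for the real filenames), then lift the suffixes into a column MultiIndex afterwards:
from functools import reduce
import pandas as pd

names = ['file1', 'file2', 'file3']  # hypothetical labels; use the real filenames

# suffix every non-key column so merge_ordered never has to invent _x/_y
renamed = [df.rename(columns={c: f'{c}_{n}' for c in df.columns if c != 'RRT'})
           for df, n in zip(dfs, names)]
df_final = reduce(lambda left, right: pd.merge_ordered(left, right, on='RRT'), renamed)

# lift the suffixes into a (filename, measurement) column MultiIndex
df_final = df_final.set_index('RRT')
df_final.columns = pd.MultiIndex.from_tuples(
    [tuple(reversed(c.rsplit('_', 1))) for c in df_final.columns])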
I have a large CSV file as below:
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-03 00:00:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 00:01:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 00:02:00 3 0 0.380 0.460 1.300 1.610 2.500
2018-11-03 00:03:00 3 0 0.380 0.450 1.310 1.600 2.490
...
2018-11-28 23:56:00 28 23 0.670 0.560 1.100 1.870 2.940
2018-11-28 23:57:00 28 23 0.660 0.570 1.100 1.990 2.950
2018-11-28 23:58:00 28 23 0.660 0.570 1.100 1.990 2.950
2018-11-28 23:59:00 28 23 0.650 0.530 1.130 1.880 2.870
[37440 rows x 7 columns]
I'd like to take the average of 60 minutes to obtain hourly data. The final data would look something like this:
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-03 00:00:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 01:00:00 3 1 0.390 0.490 1.280 1.760 2.560
2018-11-03 02:00:00 3 2 0.380 0.460 1.300 1.610 2.500
2018-11-03 03:00:00 3 3 0.380 0.450 1.310 1.600 2.490
I tried
print (df['v.amm'].resample('60Min').mean())
t
2018-11-03 00:00:00 0.357
2018-11-03 01:00:00 0.354
2018-11-03 02:00:00 0.369
2018-11-03 03:00:00 0.384
but I don't think this is efficient, as it only handles one specific column at a time, and the output loses the column heading as well.
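A minimal sketch of resampling every column at once (assuming the DatetimeIndex t from the question):
# resample all columns in one call instead of one Series at a time
hourly = df.resample('60min').mean()

# dd and hh derive from the timestamp, so rebuild them from the
# resampled index rather than averaging them
hourly['dd'] = hourly.index.day
hourly['hh'] = hourly.index.hour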
I am combining 3 separate columns of year, month, and day into a single date column of my DataFrame, but the year is two-digit, which raises an error.
I have tried to_datetime() in a Jupyter notebook.
Dataframe is in this form:
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75
data.rename(columns={'Yr':'Year','Mo':'Month','Dy':'Day'},inplace=True)
data['Date']=pd.to_datetime(data[['Year','Month','Day']],format='%y%m%d')
The error I am getting is:
cannot assemble the datetimes: time data 610101 does not match format '%Y%m%d' (match)
The problem is that to_datetime with the specified columns ['Year','Month','Day'] needs a four-digit YYYY year, so an alternative solution is needed, because the year here is YY only:
s = data[['Yr','Mo','Dy']].astype(str).apply('-'.join, 1)
data['Date'] = pd.to_datetime(s, format='%y-%m-%d')
print (data)
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL \
0 61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83
1 61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79
2 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50
CLO BEL Date
0 12.58 18.50 2061-01-01
1 9.67 17.54 2061-01-02
2 7.67 12.75 2061-01-03
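One caveat with %y: Python pivots two-digit years 00-68 into the 2000s, which is why the output above shows 2061. If the data is actually from 1961, a hedged fix (the cutoff year below is an assumption, adjust as needed) is:
# roll implausibly-future dates back one century
mask = data['Date'].dt.year > 2021  # assumed cutoff
data.loc[mask, 'Date'] -= pd.DateOffset(years=100)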
For a financial application, I'm trying to create a DataFrame where each row is a session date value for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.
This is easy to do in a for-loop, but it takes quite a long time for large features sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?
import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions.
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=range((-n_sessions + 1), 1))
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features
I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.
Short example of input and output:
>>> eq_data
Open High Low Close Volume Adj Close
Date
2014-01-02 15.42 15.45 15.28 15.44 31528500 14.96
2014-01-03 15.52 15.64 15.30 15.51 46122300 15.02
2014-01-06 15.72 15.76 15.52 15.58 42657600 15.09
2014-01-07 15.73 15.74 15.35 15.38 54476300 14.90
2014-01-08 15.60 15.71 15.51 15.54 48448300 15.05
2014-01-09 15.83 16.02 15.77 15.84 67836500 15.34
2014-01-10 16.01 16.11 15.94 16.07 44984000 15.57
2014-01-13 16.37 16.53 16.08 16.11 57566400 15.61
2014-01-14 16.31 16.43 16.17 16.40 44039200 15.89
2014-01-15 16.37 16.73 16.35 16.70 64118200 16.18
2014-01-16 16.67 16.76 16.56 16.73 38410800 16.21
2014-01-17 16.78 16.78 16.45 16.52 37152100 16.00
2014-01-21 16.64 16.68 16.36 16.41 35597200 15.90
2014-01-22 16.44 16.62 16.37 16.55 28741900 16.03
2014-01-23 16.49 16.53 16.31 16.43 37860800 15.92
2014-01-24 16.19 16.21 15.78 15.83 66023500 15.33
2014-01-27 15.90 15.91 15.52 15.71 51218700 15.22
2014-01-28 15.97 16.01 15.51 15.72 57677500 15.23
2014-01-29 15.48 15.53 15.20 15.26 52241500 14.90
2014-01-30 15.43 15.45 15.18 15.25 32654100 14.89
2014-01-31 15.09 15.10 14.90 14.96 64132600 14.61
>>> features = data.featurize(eq_data, 3)
>>> features
-2 -1 0
Date
2014-01-06 14.96 15.02 15.09
2014-01-07 15.02 15.09 14.9
2014-01-08 15.09 14.9 15.05
2014-01-09 14.9 15.05 15.34
2014-01-10 15.05 15.34 15.57
2014-01-13 15.34 15.57 15.61
2014-01-14 15.57 15.61 15.89
2014-01-15 15.61 15.89 16.18
2014-01-16 15.89 16.18 16.21
2014-01-17 16.18 16.21 16
2014-01-21 16.21 16 15.9
2014-01-22 16 15.9 16.03
2014-01-23 15.9 16.03 15.92
2014-01-24 16.03 15.92 15.33
2014-01-27 15.92 15.33 15.22
2014-01-28 15.33 15.22 15.23
2014-01-29 15.22 15.23 14.9
2014-01-30 15.23 14.9 14.89
2014-01-31 14.9 14.89 14.61
So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the eq_data DataFrame.
====================
Improved version based on Primer's answer below:
def featurize(equity_data, n_sessions, column='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at column on the given date is taken
    as a feature, and each row contains values for n_sessions.

    >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1),
            dt.date(2014, 12, 31)), 256)', setup=s, number=1)
    1.6771750450134277
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=map(str, range((-n_sessions + 1), 1)),
                            dtype='float64')
    values = equity_data[column].values
    for i in range(n_sessions - 1):
        features.iloc[:, i] = values[i:(-n_sessions + i + 1)]
    features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):]
    return features
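For completeness, a fully vectorized sketch that drops the Python-level loop entirely, using NumPy sliding windows (assumption: numpy >= 1.20 for sliding_window_view; the function name is mine):
import numpy as np
import pandas as pd

def featurize_strided(equity_data, n_sessions, col_label='Adj Close'):
    values = equity_data[col_label].to_numpy()
    # one row per window of n_sessions consecutive values, built as a view
    windows = np.lib.stride_tricks.sliding_window_view(values, n_sessions)
    return pd.DataFrame(windows,
                        index=equity_data.index[(n_sessions - 1):],
                        columns=[str(c) for c in range((-n_sessions + 1), 1)])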
It looks like shift is your friend here and something like this will do:
import numpy as np
import pandas as pd

df = pd.DataFrame({'adj close': np.random.random(10) + 15},
                  index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df
adj close
date
2014-01-02 15.650
2014-01-03 15.775
2014-01-06 15.750
2014-01-07 15.464
2014-01-08 15.966
2014-01-09 15.475
2014-01-10 15.164
2014-01-13 15.281
2014-01-14 15.568
2014-01-15 15.648
features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()
features['-2'] = df['adj close'].shift(2)
features.dropna(inplace=True)
features
0 -1 -2
date
2014-01-06 15.750 15.775 15.650
2014-01-07 15.464 15.750 15.775
2014-01-08 15.966 15.464 15.750
2014-01-09 15.475 15.966 15.464
2014-01-10 15.164 15.475 15.966
2014-01-13 15.281 15.164 15.475
2014-01-14 15.568 15.281 15.164
2014-01-15 15.648 15.568 15.281
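For what it's worth, the same shift idea generalizes to any n_sessions with a dict comprehension; a sketch (the helper name is mine):
def featurize_shift(equity_data, n_sessions, col_label='adj close'):
    # column '0' is the current session, '-1' one session back, and so on
    cols = {str(-i): equity_data[col_label].shift(i) for i in range(n_sessions)}
    # dropna removes the first n_sessions - 1 rows, which lack full history
    return pd.DataFrame(cols).dropna()

featurize_shift(df, 3)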