I have the following data set in a pandas DataFrame:
print(data)
Result:
Open High Low Close Adj Close Volume
Date
2018-05-25 12.70 12.73 12.48 12.61 12.610000 1469800
2018-05-24 12.99 13.08 12.93 12.98 12.980000 814800
2018-05-23 13.19 13.30 13.06 13.12 13.120000 1417500
2018-05-22 13.46 13.57 13.25 13.27 13.270000 1189000
2018-05-18 13.41 13.44 13.36 13.38 13.380000 986300
2018-05-17 13.19 13.42 13.19 13.40 13.400000 1056200
2018-05-16 13.01 13.14 13.01 13.12 13.120000 481300
If I just want to print a single column, e.g. Low, it still shows the date index:
print(data.Low)
Result:
Date
2018-05-25 12.48
2018-05-24 12.93
2018-05-23 13.06
2018-05-22 13.25
2018-05-18 13.36
2018-05-17 13.19
2018-05-16 13.01
Is there a way to slice/print just the prices, so the output will be like:
12.48
12.93
13.06
13.25
13.36
13.19
13.01
In pandas, a Series or DataFrame always needs some index values.
A default RangeIndex can be created by:
print(data.reset_index(drop=True).Low)
But if you need to write only the values to a file, as a column with no index and no header:
data.Low.to_csv(file, index=False, header=False)
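If you only need to print the values without the index, rather than write them to a file, Series.to_string takes an index flag as well; a minimal sketch:
print(data.Low.to_string(index=False))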
If you need to convert the column to a list:
print(data.Low.tolist())
[12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]
And for a 1d numpy array:
print(data.Low.values)
[12.48 12.93 13.06 13.25 13.36 13.19 13.01]
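In newer pandas versions, Series.to_numpy() is the recommended spelling for the same 1d array:
print(data.Low.to_numpy())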
If you want an Mx1 (column) array:
print(data[['Low']].values)
[[12.48]
[12.93]
[13.06]
[13.25]
[13.36]
[13.19]
[13.01]]
I have the following dataframe, named 'ORDdataM', with a DateTimeIndex column 'date' and a price point column 'ORDprice'. The date column has no timezone associated with it (it is naive), but it is actually in 'Australia/ACT'. I want to convert it into 'America/New_York' time.
ORDprice
date
2021-02-23 18:09:00 24.01
2021-02-23 18:14:00 23.91
2021-02-23 18:19:00 23.98
2021-02-23 18:24:00 24.00
2021-02-23 18:29:00 24.04
... ...
2021-02-25 23:44:00 23.92
2021-02-25 23:49:00 23.88
2021-02-25 23:54:00 23.92
2021-02-25 23:59:00 23.91
2021-02-26 00:09:00 23.82
The line below is one that I have played around with quite a bit, but I cannot figure out what is wrong. The only error message is:
KeyError: 'date'
ORDdataM['date'] = ORDdataM['date'].dt.tz_localize('Australia/ACT').dt.tz_convert('America/New_York')
I have also tried
ORDdataM.date = ORDdataM.date.dt.tz_localize('Australia/ACT').dt.tz_convert('America/New_York')
What is the issue here?
Your date is the index, not a column; try:
df.index = df.index.tz_localize('Australia/ACT').tz_convert('America/New_York')
df
# ORDprice
#date
#2021-02-23 02:09:00-05:00 24.01
#2021-02-23 02:14:00-05:00 23.91
#2021-02-23 02:19:00-05:00 23.98
#2021-02-23 02:24:00-05:00 24.00
#2021-02-23 02:29:00-05:00 24.04
#2021-02-25 07:44:00-05:00 23.92
#2021-02-25 07:49:00-05:00 23.88
#2021-02-25 07:54:00-05:00 23.92
#2021-02-25 07:59:00-05:00 23.91
#2021-02-25 08:09:00-05:00 23.82
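If you would rather keep working with 'date' as a real column, a minimal sketch (assuming the index is named date, as in the printout above) is to move it out of the index first:
ORDdataM = ORDdataM.reset_index()
ORDdataM['date'] = (ORDdataM['date']
                    .dt.tz_localize('Australia/ACT')
                    .dt.tz_convert('America/New_York'))
After that, the column-based lines from the question work, since 'date' now exists as a column.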
I have the following dataframe:
data
Out[120]:
High Low Open Close Volume Adj Close
Date
2018-01-02 12.66 12.50 12.52 12.66 20773300.0 10.842077
2018-01-03 12.80 12.67 12.68 12.76 29765600.0 10.927719
2018-01-04 13.04 12.77 12.78 12.98 37478200.0 11.116128
2018-01-05 13.22 13.04 13.06 13.20 46121900.0 11.304538
2018-01-08 13.22 13.11 13.21 13.15 33828300.0 11.261715
... ... ... ... ... ...
2020-06-25 6.05 5.80 5.86 6.03 73612700.0 6.030000
2020-06-26 6.07 5.81 6.04 5.91 118435400.0 5.910000
2020-06-29 6.07 5.81 5.91 6.01 58208400.0 6.010000
2020-06-30 6.10 5.90 5.98 6.08 61909300.0 6.080000
2020-07-01 6.18 5.95 6.10 5.98 62333600.0 5.980000
[629 rows x 6 columns]
Some of the dates are missing in the Date index. I know I can do this to get all the dates:
pd.date_range(start, end, freq ='D')
Out[121]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26',
'2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30',
'2020-07-01', '2020-07-02'],
dtype='datetime64[ns]', length=914, freq='D')
How can I compare all the dates with the index and add just the dates which are missing?
Use DataFrame.reindex, which also works if you need custom start and end datetimes:
df = df.reindex(pd.date_range(start, end, freq ='D'))
Or DataFrame.asfreq to add the missing datetimes between the existing data:
df = df.asfreq('d')
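A minimal sketch of the difference, on an illustrative three-row frame (not the data above): asfreq only fills between the first and last existing rows, while reindex lets you widen to any custom range:
import pandas as pd
df = pd.DataFrame({'Close': [12.66, 12.76, 13.20]},
                  index=pd.to_datetime(['2018-01-04', '2018-01-05', '2018-01-08']))
df.asfreq('d')                                  # inserts 2018-01-06/07 as NaN rows
df.reindex(pd.date_range('2018-01-01', '2018-01-10', freq='D'))  # wider range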
First off, here is my dataframe:
Date 2012-09-04 00:00:00 2012-09-05 00:00:00 2012-09-06 00:00:00 2012-09-07 00:00:00 2012-09-10 00:00:00 2012-09-11 00:00:00 2012-09-12 00:00:00 2012-09-13 00:00:00 2012-09-14 00:00:00 2012-09-17 00:00:00 ... 2017-08-22 00:00:00 2017-08-23 00:00:00 2017-08-24 00:00:00 2017-08-25 00:00:00 2017-08-28 00:00:00 2017-08-29 00:00:00 2017-08-30 00:00:00 2017-08-31 00:00:00 2017-09-01 00:00:00 Type
AABTX 9.73 9.73 9.83 9.86 9.83 9.86 9.86 9.96 9.98 9.96 ... 11.44 11.45 11.44 11.46 11.46 11.47 11.47 11.51 11.52 Hybrid
AACTX 9.66 9.65 9.77 9.81 9.78 9.81 9.82 9.92 9.95 9.93 ... 12.32 12.32 12.31 12.33 12.34 12.34 12.35 12.40 12.41 Hybrid
AADTX 9.71 9.70 9.85 9.90 9.86 9.89 9.91 10.02 10.07 10.05 ... 13.05 13.04 13.03 13.05 13.06 13.06 13.08 13.14 13.15 Hybrid
AAETX 9.92 9.91 10.07 10.13 10.08 10.12 10.14 10.26 10.32 10.29 ... 13.84 13.84 13.82 13.85 13.86 13.86 13.89 13.96 13.98 Hybrid
AAFTX 9.85 9.84 10.01 10.06 10.01 10.05 10.07 10.20 10.26 10.23 ... 14.09 14.08 14.07 14.09 14.11 14.11 14.15 14.24 14.26 Hybrid
That is a bit hard to read, but essentially these are just closing prices for several mutual funds (638), with the Type label in the last column. I'd like to plot all of these on a single plot and have a legend labeling each fund's type.
I'd like to see how many potential clusters I may need. This was my first thought for visualizing the data, but if you have any other recommendations, feel free to suggest them.
Also, in my first attempt, I tried:
parallel_coordinates(closing_data, 'Type', alpha=0.2, colormap=dark2_cmap)
plt.show()
It just shows up as a black blob, and after some research I found that parallel_coordinates doesn't handle large numbers of features that well.
My suggestion is to transpose the dataframe, as a timestamp comes more naturally as an index, and you will then be able to address individual time series as df.AABTX or df['AABTX'].
With a smaller number of time series you could have tried df.plot(), but when it is rather large you should not be surprised to see some mess initially.
Try plotting a subset of your data, but please make sure the time is in the index, not the column names.
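A minimal sketch of that suggestion, assuming closing_data is the wide frame shown above (fund tickers as the index, dates plus Type as the columns):
import pandas as pd
import matplotlib.pyplot as plt

prices = closing_data.drop(columns='Type').T   # transpose: funds become columns
prices.index = pd.to_datetime(prices.index)    # timestamps as a proper index
prices[['AABTX', 'AACTX', 'AADTX']].plot()     # start with a small subset
plt.show()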
You may be looking for something like silhouette analysis, which is implemented in the scikit-learn machine learning library. It should allow you to find an optimal number of clusters to consider for your data.
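A hedged sketch of that analysis, assuming X is an (n_funds, n_dates) array of the closing prices (for example closing_data.drop(columns='Type').values):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher score = better-separated clusters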
EDIT: Just when I gave up, I found the answer here:
import numpy as np
rmlag = lambda xs: np.argmax(xs[::-1])  # rows from the window's end back to its max
df['Open'].rolling(window=5).apply(func=rmlag)
I'm wrestling with the following issue: how can I add a column to a DataFrame that, for each row, calculates the number of days (periods) since an n-period high was reached?
Below is a sample DataFrame I'm working with. I've calculated the rolling 5-day high as
df['Rolling 5 Day High'] = df['Open'].rolling(5).max()
How can I calculate, for each row, the number of days since the respective 5-day high was reached? For example, the "Number of Days Since" for the row indexed at 2012-03-16 should be 4 since this row's corresponding rolling 5-day high of 14.88 was reached on 2012-03-12. For the next row at index 2012-03-19, the value should be 3 given this row's rolling 5-day high of 14.79 was reached on 2012-03-14.
Open Rolling 5 Day High
Date
2012-03-12 14.88 NaN
2012-03-13 14.65 NaN
2012-03-14 14.79 NaN
2012-03-15 14.41 NaN
2012-03-16 14.59 14.88
2012-03-19 14.68 14.79
2012-03-20 14.56 14.79
2012-03-21 14.40 14.68
2012-03-22 14.35 14.68
2012-03-23 14.40 14.68
2012-03-26 14.69 14.69
2012-03-27 14.78 14.78
2012-03-28 15.01 15.01
2012-03-29 15.14 15.14
2012-03-30 15.36 15.36
2012-04-02 15.36 15.36
2012-04-03 15.44 15.44
2012-04-04 14.85 15.44
2012-04-05 14.67 15.44
2012-04-09 14.40 15.44
2012-04-10 14.38 15.44
2012-04-11 14.35 14.85
2012-04-12 14.36 14.67
2012-04-13 14.55 14.55
2012-04-16 14.26 14.55
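For reference, a sketch of attaching the result from the EDIT above as a new column; raw=True passes plain numpy arrays to the lambda, which is what np.argmax expects:
df['Days Since 5 Day High'] = df['Open'].rolling(window=5).apply(rmlag, raw=True)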
I have this code:
close1 = 'Close'; start = '12/18/2015 00:00:00'
end = '3/1/2016 00:00:00'; freq = '1d0h00min'
datefilter = pd.date_range(start=start, end=end, freq=freq).values
close[close['Datetime'].isin(datefilter)]  # only dates in the range
But, strangely, some columns come back with NaN:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
2015-12-18 31.73 63.38 16.34 56.88 12.24 NaN NaN 38.72
2015-12-21 32.04 63.60 16.26 56.75 12.18 NaN NaN 42.52
Just wondering the reason, and how can we remedy it?
Original:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
0 2013-03-21 17.18 29.0 20.75 30.1 11.52 11.52 38.72
1 2013-03-22 16.81 30.53 21.25 30.0 11.64 11.52 39.42
2 2013-03-25 16.83 32.15 20.8 27.59 11.7 11.52 42.52
3 2013-03-26 17.09 29.55 20.6 27.5 11.76 11.52 11.52
EDIT: It seems related to the datetime hh:mm:ss filtering.
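If that is indeed the cause, a hedged workaround is to compare calendar dates only, so hh:mm:ss differences in 'Datetime' cannot knock out matching rows:
datefilter = pd.date_range(start=start, end=end, freq='D')
close[close['Datetime'].dt.normalize().isin(datefilter)]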