How to convert an OHLCV named data array into a numpy dataframe? - python

My data consist of a particular OHLCV object that is a bit weird in that it can only be accessed by the name, like this:
# rA = [<MtApi.MqlRates object at 0x000000A37A32B308>,...]
type(rA)
# <class 'list'>
ccnt = len(rA) # 100
for i in range(ccnt):
    print('{} {} {} {} {} {} {}'.format(i, rA[i].MtTime, rA[i].Open, rA[i].High, rA[i].Low, rA[i].Close, rA[i].TickVolume))
#0 1607507400 0.90654 0.90656 0.90654 0.90656 7
#1 1607507340 0.90654 0.9066 0.90653 0.90653 20
#2 1607507280 0.90665 0.90665 0.90643 0.90653 37
#3 1607507220 0.90679 0.90679 0.90666 0.90666 22
#4 1607507160 0.90699 0.90699 0.90678 0.90678 29
With some additional formatting I have:
Time Open High Low Close Volume
-----------------------------------------------------------------
1607507400 0.90654 0.90656 0.90654 0.90656 7
1607507340 0.90654 0.90660 0.90653 0.90653 20
1607507280 0.90665 0.90665 0.90643 0.90653 37
1607507220 0.90679 0.90679 0.90666 0.90666 22
I have tried things like this:
df = pd.DataFrame(data = rA, index = range(100), columns = ['MtTime', 'Open', 'High','Low', 'Close', 'TickVolume'])
# Resulting in:
# TypeError: iteration over non-sequence
How can I convert this into a pandas DataFrame,
so that I can plot it using the original names?
Plotting using matplotlib should then be possible with something like this:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
...
df = pd.DataFrame(rA) # not working
df['time'] = pd.to_datetime(df['MtTime'], unit='s')
plt.plot(df['MtTime'], df['Open'], 'r-', label='Open')
plt.plot(df['MtTime'], df['Close'], 'b-', label='Close')
plt.legend(loc='upper left')
plt.title('EURAUD candles')
plt.show()
Possibly related questions (but were not helpful to me):
Numpy / Matplotlib - Transform tick data into OHLCV
OHLC aggregator doesn't work with dataframe on pandas?
How to convert a pandas dataframe into a numpy array with the column names
Converting Numpy Structured Array to Pandas Dataframes
Pandas OHLC aggregation on OHLC data
Getting Open, High, Low, Close for 5 min stock data python
Converting OHLC stock data into a different timeframe with python and pandas

One idea is to use a list comprehension to extract the values into a list of tuples:
L = [(rA[i].MtTime, rA[i].Open, rA[i].High, rA[i].Low, rA[i].Close, rA[i].TickVolume)
     for i in range(len(rA))]
df = pd.DataFrame(L, columns = ['MtTime', 'Open', 'High','Low', 'Close', 'TickVolume'])
Or, if rA exposes each field as a sequence (a plain Python list of objects does not):
df = pd.DataFrame({'MtTime':list(rA.MtTime), 'Open':list(rA.Open),
                   'High':list(rA.High),'Low':list(rA.Low),
                   'Close':list(rA.Close), 'TickVolume':list(rA.TickVolume)})
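A minimal sketch of the attribute-based approach, assuming each element of rA exposes its fields as plain attributes. The `Rate` class is a hypothetical stand-in for `MtApi.MqlRates`, used only so the example runs without MtApi. The sample values are the first three rows printed in the question.

```python
from operator import attrgetter

import pandas as pd


class Rate:
    """Hypothetical stand-in for MtApi.MqlRates (assumption: plain attributes)."""

    def __init__(self, mt_time, open_, high, low, close, tick_volume):
        self.MtTime = mt_time
        self.Open = open_
        self.High = high
        self.Low = low
        self.Close = close
        self.TickVolume = tick_volume


rA = [
    Rate(1607507400, 0.90654, 0.90656, 0.90654, 0.90656, 7),
    Rate(1607507340, 0.90654, 0.90660, 0.90653, 0.90653, 20),
    Rate(1607507280, 0.90665, 0.90665, 0.90643, 0.90653, 37),
]

columns = ["MtTime", "Open", "High", "Low", "Close", "TickVolume"]

# attrgetter with several names returns a tuple of those attributes per object.
get_fields = attrgetter(*columns)
df = pd.DataFrame([get_fields(rate) for rate in rA], columns=columns)

# Convert the epoch seconds to timestamps so the x-axis can show dates.
df["time"] = pd.to_datetime(df["MtTime"], unit="s")
print(df)
```

With the `time` column in place, the plotting calls from the question can use `df['time']` on the x-axis instead of the raw `MtTime` integers.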

Related

Fourier Result on Time Series explained python

I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transformation, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The related code is this:
import pandas as pd
import numpy as np
file='test.xlsx'
df=pd.read_excel(file,header=0)
#df=pd.read_csv(file,header=0)
df.head()
df.tail()
# drop ID
df=df[['JSON_TIMESTAMP','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# extract year month
df["year"] = df["JSON_TIMESTAMP"].str[:4]
df["month"] = df["JSON_TIMESTAMP"].str[5:7]
df["day"] = df["JSON_TIMESTAMP"].str[8:10]
df= df.sort_values( ['year', 'month','day'],
ascending = [True, True,True])
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
# sort_values returns a new DataFrame, so the result must be assigned
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)
df1=df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[["ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB"]]
import matplotlib.pyplot as plt
#plt.figure(figsize=(15,7))
plt.rcParams["figure.figsize"] = (25,8)
df1.plot()
#df.plot(style='k. ')
plt.show()
df1.hist(bins=20)
from scipy.fft import rfft,rfftfreq
## https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply fast fourier transform and take absolute values
f=abs(np.fft.fft(df1))
# get the list of frequencies
num=np.size(x)
freq = [i / num for i in list(range(num))]
# get the list of spectrums
spectrum=f.real*f.real+f.imag*f.imag
nspectrum=spectrum/spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq,nspectrum)
nspectrum
type(freq)
freq= np.array(freq)
freq
type(nspectrum)
nspectrum = nspectrum.flatten()
# improve the plot by adding periods in number of days rather than frequency
import pandas as pd
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1/365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by converting the data into groups per day to avoid peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
#plt.xticks([1, 13, 26, 39, 52])
My end result is this:
[Figure: Result of Fourier Transformation for Data]
My question is: what does this eventually show for our data, and intuitively what does the spike in the last section mean? What can I do with such a result?
Thanks in advance all!
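One thing worth checking in the code above: np.fft.fft transforms along the last axis by default, so passing the single-column DataFrame df1 (shape (n, 1)) transforms each row of length 1 on its own and leaves every value unchanged. The sketch below is not the original pipeline. It uses a synthetic signal with one sample per day (both are assumptions, since the spreadsheet is not reproduced here) and passes a 1-D array, so the peak lands at the known period.

```python
import numpy as np

# Synthetic stand-in for the sensor series (assumption): one sample per day,
# a weekly cycle plus a constant offset and a little noise.
n = 728  # 104 whole weeks, so the weekly cycle falls exactly on one frequency bin
t = np.arange(n)
rng = np.random.default_rng(0)
y = 5.0 + np.sin(2 * np.pi * t / 7) + 0.1 * rng.standard_normal(n)

# A column-shaped (n, 1) input is transformed along its last axis of length 1,
# so the "spectrum" is just the absolute value of the data.
column = y.reshape(-1, 1)
column_fft_is_identity = np.allclose(np.abs(np.fft.fft(column)), np.abs(column))

# Transform the 1-D series instead; subtracting the mean removes the large
# zero-frequency term that would otherwise dominate the plot.
power = np.abs(np.fft.rfft(y - y.mean())) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per day, from 0 to 0.5

peak = np.argmax(power[1:]) + 1  # skip the zero-frequency bin
peak_period_days = 1.0 / freqs[peak]
print(column_fft_is_identity, peak_period_days)
```

A period is the reciprocal of a frequency, so with a sampling interval of one day the period in days is 1 / freq. For the real data, `df1.iloc[:, 0].to_numpy()` would give the 1-D input.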

Python KeyError when creating a Series for Matplotlib

I am graphing data that is stored in a csv. I pull 2 columns of data into a dataframe, then convert to a series and graph with matplotlib.
from pandas import Series
from matplotlib import pyplot
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('Proxy/Proxy_Analytics/API_Statistics.csv')
df
Date Distinct_FLD Not_On_MM API_Call_Count Cost CACHE_Count
0 2018-11-12 35711 18468 18468 8.31060 35711
1 2018-11-13 36118 18741 11004 4.95180 46715
2 2018-11-14 34073 17629 8668 3.90060 55383
3 2018-11-15 34126 17522 7817 3.51765 63200
#Cost
cost_df = df[['Date','Cost']]
cost_series = cost_df.set_index('Date')['Cost']
plt.style.use('dark_background')
plt.title('Domain Rank API Cost Over Time')
plt.ylabel('Cost in Dollars')
cost_series.plot(c = 'red')
plt.show()
And this works totally fine. I would like to do the same and graph multiple columns, but when I try to convert the df to a series I get an error:
#Not Cost
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')['Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']
Error:
KeyError: ('Distinct_FLD', 'Not_On_MM', 'API_Call_Count', 'CACHE_Count')
What can I do to fix this?
It seems that you are trying to convert the columns of a DataFrame into multiple Series, indexed by the 'Date' column of your DataFrame.
Maybe you can try:
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')
Distinct_FLD = not_cost_series['Distinct_FLD']
Not_On_MM = not_cost_series['Not_On_MM']
.
.
.
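The KeyError arises because several names separated by commas inside single brackets are treated as one tuple key. A minimal sketch using the four rows shown in the question: passing a list of names (double brackets) selects several columns at once and returns a DataFrame indexed by Date.

```python
import pandas as pd

# The four rows printed in the question.
df = pd.DataFrame({
    "Date": ["2018-11-12", "2018-11-13", "2018-11-14", "2018-11-15"],
    "Distinct_FLD": [35711, 36118, 34073, 34126],
    "Not_On_MM": [18468, 18741, 17629, 17522],
    "API_Call_Count": [18468, 11004, 8668, 7817],
    "Cost": [8.31060, 4.95180, 3.90060, 3.51765],
    "CACHE_Count": [35711, 46715, 55383, 63200],
})

count_columns = ["Distinct_FLD", "Not_On_MM", "API_Call_Count", "CACHE_Count"]

# A list inside the brackets selects several columns; the result is a DataFrame.
not_cost = df.set_index("Date")[count_columns]
print(not_cost)
```

Calling `not_cost.plot()` would then draw one line per column on a shared Date axis.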

DataFrame Pandas shows NAN

I have an hdf5 file that I loaded into a DataFrame, but when I try to plot it, nothing shows on the graph. When I checked the new dataframe, I saw that it contained no values.
This is my DF (I am not allowed to post pictures, so please follow the link)
df1 = pd.DataFrame(df.Price, index = df.Timestamp)
plt.figure()
df1.plot()
plt.show()
The second DF shows NaN in the Price column. What's wrong?
I think you need to call set_index with the Timestamp column, select the Price column and plot:
#convert column to floats
df['Price'] = df['Price'].astype(float)
df.set_index('Timestamp')['Price'].plot()
#if some non numeric data, convert them to NaNs
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df.set_index('Timestamp')['Price'].plot()
You get NaNs when using the DataFrame constructor because the data are not aligned: the index values of df are not the same as the values in the Timestamp column.
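A minimal sketch of that alignment behaviour, using three made-up rows (the values are assumptions for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
    "Price": [10.0, 11.5, 9.75],
})

# Passing a Series as data together with a new index makes pandas look up each
# new label in the Series' own index (0, 1, 2). No timestamp matches those
# labels, so every value becomes NaN.
misaligned = pd.DataFrame(df["Price"], index=df["Timestamp"])

# set_index keeps each price paired with the timestamp from the same row.
aligned = df.set_index("Timestamp")["Price"]

print(misaligned)
print(aligned)
```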
You can do this by adding .values, which drops the original index so nothing is realigned. How about creating a Series instead?
#df1 = pd.DataFrame(df.Price.values, df.Timestamp)
serie = pd.Series(df.Price.values, df.Timestamp)
Saw it was answered here: pandas.Series() Creation using DataFrame Columns returns NaN Data entries
Full example:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=["Price","Timestamp","Random"])
df.Price = np.random.randint(100, size = 10)
df.Timestamp = [datetime.datetime(2000,1,1) + \
datetime.timedelta(days=int(i)) for i in np.random.randint(100, size = 10)]
df.Random = np.random.randint(10, size= 10)
serie = pd.Series(df.Price.values, df.Timestamp)
serie.plot()
plt.show()
Difference
print("{}\n{}".format(type(df.Price), type(df.Price.values)))
<class 'pandas.core.series.Series'> # does not work
<class 'numpy.ndarray'> # works

Using python to take a 32x32 matrices append many of these matrices to a single array then adding a timestamp index to each matrix

I am very new to coding python and I am working with a .CSV file that gives me a 32x32 matrix in a 1024 column row with a time stamp. I reshaped the data to give me 32x32 arrays and looped through each row appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1,1025)], (32,32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1,1025)], (32,32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis = 0)
    i = i + 1
print("job done")
What I would like to do is take the time stamps from the original data file and attach them to each of the matrices, so that I can resample the data to a 5 minute average. I would also like to plot the bins to get a plot similar to this: Drop size distribution
As a reference I am reading in the data .CSV with pandas and here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The ;'s after the SPECTRUM is the 32x32 matrix.
Thanks in advance for any help!
Python and associated packages can do many things without loops
From my understanding of your data you have a (8640 x 32 x 32) Data Structure (time x size x velocity).
Pandas works very well with 2D data structures, however for higher dimensional data I would recommend you get familiar with xarray. With this package along with pandas you can create and manipulate your data without having to resort to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline
#create random data
data = (np.random.binomial(n =5, p =0.2, size =(8640,32,32))*1000).astype(int)
#create labels for data
sizes= np.linspace(1,5,32)
velocities = np.linspace(1,1000, num = 32)
#make time range of 24 hours with 10sec intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
#convert data to xarray 3D data structure
df = xr.DataArray(data, coords = [ind, sizes, velocities],
dims = ['time', 'size', 'speed'])
#make a 5 min average of the data
# (older xarray versions used df.resample('300s', dim = 'time', how = 'mean'))
min_average = df.resample(time = '300s').mean()
#plot sample of data and 5 min average
my1d = min_average.isel(size = 5, speed= 10)
my1d.plot(label = '5 min avg')
plt.gca()
df.isel(size = 5, speed =10).plot(alpha = 0.3, c = 'r', label = 'raw_data')
plt.legend()
Making a distribution plot like the one you linked is a bit trickier, but it is possible:
#transform your data to only have mean speed for each time and size
#and convert to pandas dataframe
mean_speed =min_average.mean(dim = ['speed'])
#for some reason xarray make you name the new column when you convert
#to a pandas dataframe. I then get rid of the extra empty variable with
#a list comprehension
df= mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
#make a contourplot of your new data
plt.contourf(df.columns, df.index, df.values, cmap ='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()
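If xarray is not available, the loop from the question can also be replaced with a single NumPy reshape. This is a minimal sketch under stated assumptions: the data are random, the 10-second spacing and the start time are made up, and the 1024 spectrum values are assumed to be already separated from the other columns. Unlike the loop, which concatenates everything into one tall 2-D array, it stacks the matrices along a new first axis.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV contents (assumption): 90 records, 10 s apart,
# each with 1024 spectrum values.
n_records = 90
timestamps = pd.date_range("2017-06-01 18:00:00", periods=n_records, freq="10s")
rng = np.random.default_rng(0)
flat = rng.integers(0, 5, size=(n_records, 1024)).astype(float)

# One reshape replaces the while loop: result has shape (time, 32, 32).
spectra = flat.reshape(n_records, 32, 32)

# 5-minute mean: resample the flat table on its time index, then reshape back.
frame = pd.DataFrame(flat, index=timestamps)
five_min = frame.resample("5min").mean()
five_min_spectra = five_min.to_numpy().reshape(-1, 32, 32)

print(spectra.shape, five_min_spectra.shape)
```

Each entry of `five_min.index` is the timestamp that labels the matching 32x32 average in `five_min_spectra`.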

Separating out pandas series for pyplot

I currently have a set of series in pandas and each series is composed of two data sets. I need to separate out the two data sets into lists while retaining the series information, ie. the time and intensity data for 58V.
My current code looks like:
import numpy as numpy
import pandas as pd
xl = pd.ExcelFile("TEST_ATD.xlsx")
df = xl.parse("Sheet1")
series = xl.parse("Sheet1")
voltages = []
for item in df:
    if "V" in item:
        voltages.append(item)
data_list = []
for value in voltages:
    print(df[value])
How do I select a particular data set from the series and extract it into a list? Calling print(df[value]) returns my data sets, an example of which looks like:
Name: 58V, dtype: int64
0.000 0
0.180 1
0.360 1.2
0.540 1.5
0.720 1.2
..
35.277 0
35.457 0
35.637 0
NaN 0
Ultimately I plan to plot these data sets into a line graph with pyplot.
~~~ UPDATE ~~~
using
for value in voltages:
    intensity=[]
    for row in series[value].tolist():
        intensity.append(row)
    time=range(0,len(intensity))
    pc_intensity = []
    for item in intensity:
        pc_intensity.append((100/max(intensity)*item))
    plt.plot(time, pc_intensity)
    axes = plt.gca()
    axes.set_ylim([0,100])
    plt.title(value)
    plt.ylabel('Intensity')
    plt.xlabel('Time')
    plt.savefig(value +'.png')
    plt.clf()
    print(value)
I am able to get the plots of the first 8 data series (using an arbitrary x axis); however, for anything past the 8th series my plots are empty. I have experimented and found this to be due to some of the series having different lengths. I'm confused as to why this would affect the plots, as the x-axis is directly related to the length of the data set it is being plotted against.
I am not sure what you are trying to achieve, but I'll take a guess:
df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10), 'D': range(1, 10), 'E': [1,1,1,2,2,2,2,3,4]})
for col in df.columns:
    print(df[col].values.tolist())
This prints every column of your dataframe as a list.
If you are just trying to plot something, why not just use
df.plot()
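One possible explanation for the empty plots, offered as an assumption since the spreadsheet is not shown in full: when columns have different lengths, pandas pads the shorter ones with NaN, and Python's built-in max can return NaN for such a list, which turns every scaled value into NaN. A minimal sketch with made-up numbers that drops the NaN entries before scaling:

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): the second column is shorter and padded with NaN.
df = pd.DataFrame({
    "58V": [0.0, 1.0, 1.2, 1.5, 1.2, 0.0],
    "60V": [np.nan, 2.0, 4.0, 1.0, np.nan, np.nan],
})

# Built-in max keeps the first element when every comparison with it is False,
# which is the case for NaN, so the result here is NaN.
builtin_max = max(df["60V"].tolist())

percent = {}
for name in df.columns:
    intensity = df[name].dropna()  # keep only the measured values
    percent[name] = 100 * intensity / intensity.max()  # scale to 0..100

print(builtin_max)
print(percent["60V"].tolist())
```

Each entry of `percent` keeps its original row labels, so `percent[name].plot()` would draw only the measured part of that series.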
