Plotting and calculating mid-price and weighted mid-price - python

I have a problem with my code. Somehow it keeps giving me a KeyError: "None of [Float]...".
I need to calculate the mid-price: P_mid = (P_offer + P_bid) / 2
and
the volume-weighted mid-price: VWMP = (P_bid * Size_offer + P_offer * Size_bid) / (Size_offer + Size_bid)
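(As a quick sanity check against the first sample row below: P_mid = (32.76 + 32.55) / 2 = 32.655, and since both sizes are 8.0, VWMP = (32.55 * 8 + 32.76 * 8) / (8 + 8) = 32.655 as well.)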
So far my code looks like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
nasdaq_1 = pd.read_csv(r'Path to csv')
np.array(nasdaq_1)
#print(nasdaq_1)
mid_price = (np.array(nasdaq_1.Offer_Price) + np.array(nasdaq_1.Bid_Price))/2
#print(mid_price)
weightet_mid_price = (np.array(nasdaq_1.Offer_Price)*np.array(nasdaq_1.Bid_Size) + np.array(nasdaq_1.Bid_Price)*np.array(nasdaq_1.Offer_Size))/(np.array(nasdaq_1.Offer_Size)+np.array(nasdaq_1.Bid_Size))
print(weightet_mid_price)
nasdaq_1[mid_price].plot()
plt.figure(figsize=(10,10))
plt.plot(nasdaq_1.index, nasdaq_1[mid_price])
plt.xlabel("Datetime")
plt.ylabel("$ price")
plt.title("Mid-price between bid and offer prices")
All help is highly appreciated!!
CSV data sample:
DateTime,Time,Exchange,Symbol,Bid_Price,Bid_Size,Offer_Price,Offer_Size
2017-01-03 09:30:00,93000766290000.0,T,PFE,32.55,8.0,32.76,8.0
2017-01-03 09:30:01,93001992610000.0,T,PFE,32.67,8.0,32.7,31.0
2017-01-03 09:30:02,93002933311000.0,T,PFE,32.67,7.0,32.7,2.0
2017-01-03 09:30:03,93003882764000.0,T,PFE,32.7,1.0,32.76,17.0
2017-01-03 09:30:04,93004943608000.0,T,PFE,32.7,1.0,32.73,13.0
2017-01-03 09:30:05,93005991747000.0,T,PFE,32.69,2.0,32.74,41.0
2017-01-03 09:30:06,93006506218000.0,T,PFE,32.67,5.0,32.74,41.0

You do not need to cast the data frame columns into numpy arrays for your calculations.
The error you see is due to the line nasdaq_1[mid_price].plot().
df[x] expects x to be either a column name or a list/array of column names. You are passing a numpy array of float values, none of which exist as columns.
Try the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
s = io.StringIO("""DateTime,Time,Exchange,Symbol,Bid_Price,Bid_Size,Offer_Price,Offer_Size
2017-01-03 09:30:00,93000766290000.0,T,PFE,32.55,8.0,32.76,8.0
2017-01-03 09:30:01,93001992610000.0,T,PFE,32.67,8.0,32.7,31.0
2017-01-03 09:30:02,93002933311000.0,T,PFE,32.67,7.0,32.7,2.0
2017-01-03 09:30:03,93003882764000.0,T,PFE,32.7,1.0,32.76,17.0
2017-01-03 09:30:04,93004943608000.0,T,PFE,32.7,1.0,32.73,13.0
2017-01-03 09:30:05,93005991747000.0,T,PFE,32.69,2.0,32.74,41.0
2017-01-03 09:30:06,93006506218000.0,T,PFE,32.67,5.0,32.74,41.0
""")
nasdaq_1 = pd.read_csv(s, parse_dates=['DateTime'])
mid_price = (nasdaq_1["Offer_Price"] + nasdaq_1["Bid_Price"])/2
weighted_mid_price = (
(nasdaq_1["Offer_Price"]*nasdaq_1["Bid_Size"] + nasdaq_1["Bid_Price"]*nasdaq_1["Offer_Size"])
/ (nasdaq_1["Offer_Size"] + nasdaq_1["Bid_Size"])
)
fig, ax = plt.subplots(figsize=(10,10))
ax.plot(nasdaq_1["DateTime"], mid_price)
ax.set_xlabel("Datetime")
ax.set_ylabel("$ price")
ax.set_title("Mid-price between bid and offer prices")
fig.autofmt_xdate()
Edit:
Note the parse_dates=['DateTime'] argument in read_csv; it makes the DateTime column datetime values instead of strings.
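If you prefer to keep the computed series inside the DataFrame, a minimal variant (same data; Mid_Price and Weighted_Mid_Price are column names of my own choosing) is:
nasdaq_1["Mid_Price"] = mid_price
nasdaq_1["Weighted_Mid_Price"] = weighted_mid_price
nasdaq_1.plot(x="DateTime", y=["Mid_Price", "Weighted_Mid_Price"], figsize=(10, 10))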

Related

find correlation between runoff station and the corresponding SPI1, SPI3, SPI6 for the station

I need some help finding the correlation between my plots. In the code below, I first calculated the runoff for a station in Norway. Then I calculated SPI1, SPI3 and SPI6 for the same station.
I now want to find the correlation between the runoff series and the corresponding SPI1, SPI3 and SPI6.
Is it also possible to get one plot that shows runoff together with SPI1, SPI3 and SPI6?
from pandas import read_csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cartopy
from datetime import date,datetime
import datetime as dt
Time series of July water flow at station 200011:
dir1 = "/Users/mad/Desktop/mystations/"
files = os.listdir(dir1)
files = np.sort(files)
files_txt = [i for i in files if i.endswith('.txt_')]
df = pd.read_csv(dir1+files_txt[0], skiprows=6, header=None, index_col=0,sep=" ",names = ["Year","Runoff"], na_values=-9999)
list(df.columns)
df.index = pd.to_datetime(df.index,format="%Y%m%d/%H%M")
df.index.min()
df.index.max()
myperiod = df["1985":"2018"]
df = myperiod.resample("m").mean()
july = df[df.index.month == 7]
plt.figure(figsize=(16,4))
plt.plot(july)
plt.title("Time series")
plt.ylabel("runoff [mm/day]")
plt.xlabel("year")
plt.show()
Time series of July SPI1 for the same station:
spi1 = pd.read_csv('SPI1_and_rr_for_200011.0.csv', header=0, na_values=-9999)
spi1.index = pd.to_datetime(df.index,format='%Y-%m-%d')
a = spi1[spi1.index.month == 7]
a.plot(y='spi',figsize=(16,4))
Time series of July SPI3 for the same station:
spi3 = pd.read_csv('SPI3_and_rr_for_200011.0.csv',header=0,parse_dates=True)
spi3.index = pd.to_datetime(df.index, format='%Y-%m-%d')
b = spi3[spi3.index.month == 7]
b.plot(y='spi',figsize=(16,4))
Time series of July SPI6 for the same station:
spi6 = pd.read_csv('SPI6_and_rr_for_200011.0.csv',header=0,parse_dates=True)
spi6.index = pd.to_datetime(df.index, format='%Y-%m-%d')
c = spi6[spi6.index.month == 7]
c.plot(y='spi',figsize=(16,4))
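One way to get both the correlations and a single combined plot (a sketch, assuming the July frames above keep a 'Runoff' column in july and an 'spi' column in a, b and c, and that their indexes align) is:
# Align the four July series in one frame; the column labels are my own
combined = pd.concat(
    {'Runoff': july['Runoff'], 'SPI1': a['spi'], 'SPI3': b['spi'], 'SPI6': c['spi']},
    axis=1)
# Correlation of runoff with each SPI series
print(combined.corr()['Runoff'])
# One plot with all four series; SPI values on a secondary y-axis since the units differ
ax = combined['Runoff'].plot(figsize=(16, 4))
combined[['SPI1', 'SPI3', 'SPI6']].plot(ax=ax, secondary_y=True)
ax.set_ylabel('runoff [mm/day]')
plt.show()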

np.log returns a dataframe full of NaNs

I have made 2 functions, one for the cumulative logarithmic returns and the other for the total relative return.
Cumulative logarithmic returns:
# Cumulative logarithmic returns function:
def tlog_r(data, start, end):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x / y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    return tlog_return
Total relative returns:
# Total relative returns function:
def tr_rel(data):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x / y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    tr_relative = copy.deepcopy(tlog_return)
    for t in range(0, len(tr_relative)):
        tr_relative[t] = 100 * (np.exp(tr_relative[t]) - 1)
    print(tr_relative)
    return tr_relative
I want to calculate them from a dataframe of a stock's prices between two dates.
It doesn't raise any error, but if the dates don't start in 2000, 2005 or 2011 it returns a result full of NaNs except for the value at index [0].
Why is this happening? How can I solve it?
In case you need it, this is the part of the code where I call the functions:
from relative_returns_functions import tlog_r, tr_rel
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import copy
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
# Program
panel_data = data.DataReader(ticker, 'yahoo', start_date, end_date)[price]
title = '{} {} price'.format(ticker, price) #Plot title
panel_data.plot(title=title)
# Data processing
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,6))
comp_title = '{} returns comparison'.format(ticker)
fig.suptitle(comp_title)
sum_log_returns = tlog_r(panel_data, start_date, end_date)
ax1.plot(sum_log_returns.index, sum_log_returns, label=ticker)
ax1.set_ylabel('Cumulative log returns')
ax1.legend(loc='best')
tot_logreturns = tr_rel(panel_data)
ax2.plot(tot_logreturns.index, tot_logreturns, label=ticker)
ax2.set_ylabel('Total relative returns (%)')
ax2.legend(loc='best')
plt.show()
Here is a minimal reproducible example; you will have to import the functions, pandas_datareader, pandas, numpy and copy.
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
panel_data = data.DataReader(ticker, 'yahoo', start_date, end_date)[price]
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
sum_log_returns = tlog_r(panel_data, start_date, end_date)
print(sum_log_returns)
tot_logreturns = tr_rel(panel_data)
print(tot_logreturns)
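As an aside, both loops can be replaced by vectorized one-liners (a sketch; .iloc[0] indexes by position rather than by label, which sidesteps a common source of NaNs with date-indexed series):
cum_log_returns = np.log(panel_data / panel_data.iloc[0])
total_rel_returns = 100 * (panel_data / panel_data.iloc[0] - 1)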

Pandas rolling standard deviation

Is anyone else having trouble with the new rolling.std() in pandas? The deprecated method was rolling_std(). The new method runs fine but produces a constant number that does not roll with the time series.
Sample code is below. If you trade stocks, you may recognize the formula for Bollinger bands. The output I get from rolling.std() tracks the stock day by day and is obviously not rolling.
This is in pandas 0.19.1. Any help would be appreciated.
import datetime
import pandas as pd
import pandas_datareader.data as web
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2012,12,31)
g = web.DataReader(['AAPL'], 'yahoo', start, end)
stocks = g['Close']
stocks['Date'] = pd.to_datetime(stocks.index)
stocks['AAPL_LO'] = stocks['AAPL'] - stocks['AAPL'].rolling(20).std() * 2
stocks['AAPL_HI'] = stocks['AAPL'] + stocks['AAPL'].rolling(20).std() * 2
stocks.dropna(axis=0, how='any', inplace=True)
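A function-based version of the same band calculation (assuming a data frame with a Close column) keeps the rolling calls in one place: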
import pandas as pd
from pandas_datareader import data as pdr
import numpy as np
import datetime
end = datetime.date.today()
begin=end-pd.DateOffset(365*10)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
def bollinger_strat(data, window, no_of_std):
    rolling_mean = data['Close'].rolling(window).mean()
    rolling_std = data['Close'].rolling(window).std()
    data['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
    data['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
bollinger_strat(data, 20, 2)
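To verify that the bands really roll with the series, plot them against the close (this assumes the two columns filled in by the function above):
import matplotlib.pyplot as plt
data[['Close', 'Bollinger High', 'Bollinger Low']].plot(figsize=(12, 6))
plt.show()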

How to adjust dataframe rows to columns

import pandas as pd
import pandas.io.data as web
from pandas import Series, DataFrame
import matplotlib
import matplotlib.pyplot as plt
from numpy.random import randn
import numpy as np
matplotlib.style.use('ggplot')
stocks = {'xom': '2014-01-01', 'dvn': '2013-01-01', 'aapl': '2013-01-01'}
L = dict()
for stock, date in stocks.items():
    price = web.get_data_yahoo(stock, date)['Adj Close']
    change = price.diff().cumsum()
    perChange = change / price.iloc[0]
    L[stock] = perChange
df = pd.concat(L, axis=1)
df2 = df.describe()
How do I format df2 so that the columns are min, max, std, etc...and the rows are the stock symbol?
Use the transpose of the dataframe: DataFrame.T
df2 = df.describe().T # this is the equivalent of df.describe().transpose()
print df2
count mean std min 25% 50% 75% max
aapl 665 0.195720 0.331271 -0.284546 -0.089219 0.110605 0.501857 0.783157
dvn 665 0.202538 0.143291 -0.246586 0.104409 0.175463 0.286709 0.548577
xom 413 -0.049164 0.062285 -0.273573 -0.096234 -0.045035 -0.001124 0.060982
You want to add:
df2 = df2.transpose()

Visually separating bar chart clusters in pandas

This is more of a hack that almost works.
#!/usr/bin/env python
from pandas import *
import matplotlib.pyplot as plt
import numpy as np
# Create original dataframe
df = DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append dummy row with zeros and then average
row = DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.append(average)
print df
df.plot(kind='bar')
plt.show()
and gives:
pol1 pol2 pol3 pol4
art 0.247309 0.139797 0.673009 0.265708
mcf 0.951582 0.319486 0.447658 0.259821
mesa 0.888686 0.177007 0.845190 0.946728
perl 0.902977 0.863369 0.194451 0.698102
gcc 0.836407 0.700306 0.739659 0.265613
0 0.000000 0.000000 0.000000 0.000000
average 0.765392 0.439993 0.579993 0.487194
and a bar chart (not shown here) that gives the visual separation between benchmarks and the average.
Is there a way to get rid of the 0 on the x-axis?
It turns out that DataFrame does not allow me to have multiple dummy rows this way.
My solution was to change
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
into
row = pd.Series({p: 0.0 for p in df.columns})
row.name = ""
A Series can be named with an empty string.
Still pretty hacky, but it works:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create original dataframe
df = pd.DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append a dummy row with zeros (its index label is 0), then replace that 0 label with an empty string
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.reindex(np.where(df.index, df.index, ''))
df = df.append(average)
print df
df.plot(kind='bar')
plt.show()
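A slightly less cryptic variant of the same trick (a sketch under the same setup) gives the dummy row an empty-string index up front, so the reindex is unnecessary:
row = pd.DataFrame([{p: 0.0 for p in df.columns}], index=[''])
df = df.append(row)
df = df.append(average)
df.plot(kind='bar')
plt.show()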
