I am trying to plot a histogram of the percentage change in a stock. My code looks like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("M:/Trading/1.JOZO/ALXN.csv")
dataframe = (data['Adj Close'])
zmena1 = (dataframe.pct_change(periods = 1)*100)
data["Zmena"] = zmena1
plt.hist(zmena1, bins = "auto", range = "auto" )
plt.show
but I get an error:
mn, mx = [mi + 0.0 for mi in range]
TypeError: Can't convert 'float' object to str implicitly
I tried str(zmena1) but couldn't get it to work...
I don't know how to get past this one...
From the name of the csv file, I can guess that your data can be retrieved from Yahoo Finance, so I'm using the pandas-datareader Remote Data Access API to download all of the 2016 data to play with:
import datetime
import pandas_datareader.data as web

data = web.DataReader('ALXN', data_source='yahoo',
                      start=datetime.datetime(2016, 1, 1))
Now I can calculate the percent change, scaled to percentage points:
data['Zmena'] = data['Adj Close'].pct_change(periods=1)*100
From there, I would definitely use the built-in DataFrame.hist function:
data['Zmena'].hist()
Using plt.hist
In case you do want to use plt.hist instead, you need to filter out the NaN (not a number) values; in particular, the first entry will always be NaN:
print(data[['Adj Close','Zmena']].head())
Adj Close Zmena
Date
2016-01-04 184.679993 NaN
2016-01-05 184.899994 0.119126
2016-01-06 184.070007 -0.448884
2016-01-07 174.369995 -5.269741
2016-01-08 168.130005 -3.578592
So, in order to use plt.hist:
plt.hist(data.Zmena.dropna())
Another problem is that you're specifying bins = "auto", range = "auto", when you should simply not pass them if you want the defaults. range, if given at all, must be a (min, max) tuple of numbers, never a string — iterating over the string "auto" is what raises the TypeError above. See the documentation for both parameters at pyplot.hist.
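If you do want to control those parameters, a minimal sketch might look like the following (the random numbers here are made up to stand in for the Zmena column):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up percent-change values standing in for the Zmena column
rng = np.random.default_rng(0)
zmena = rng.normal(0.0, 2.0, 250)

# bins="auto" is valid; range, if given, must be a (min, max) tuple of numbers
counts, edges, patches = plt.hist(zmena, bins="auto", range=(-10, 10))
plt.show()
```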
Related
I have a simple issue. I want to download data from yfinance and store it in a DataFrame. That works.
Now, how can I additionally extract the X and Y values that are stored in that DataFrame?
I mean, just from the fact that the data is plottable, I conclude that there are x and y values for every data point on the plot.
Here is the simple code:
import yfinance as yf
import matplotlib.pyplot as plt
stock = 'TSLA'
start = '2020-11-01'
df = yf.download(stock , start=start)
What I finally want to achieve would be to use the X and Y values to feed them into a polyfit function.
In that way I am trying to do a regression on the price chart data of a stock, so that I can eventually take derivatives and apply some analysis to that function.
Does anybody have a good idea?
I appreciate it, thanks a lot,
Benjamin
You can save the date and close price like this:
X=df.index
Y=df.Close
And if you want to plot the close price according to the date:
df.reset_index().plot(x='Date', y='Close')
If you want to use all the data except the Close column to predict the close price, you can keep the rest with:
X=df.drop(columns='Close')
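Since the end goal is a polyfit regression, note that np.polyfit needs numeric x values, so the DatetimeIndex has to be converted first. A minimal sketch with a made-up DataFrame standing in for the yfinance download:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the yfinance DataFrame: a DatetimeIndex and a Close column
idx = pd.date_range("2020-11-01", periods=30, freq="D")
df = pd.DataFrame({"Close": np.linspace(400, 430, 30)}, index=idx)

# polyfit needs numeric x values, so convert dates to day offsets
x = (df.index - df.index[0]).days.to_numpy()
y = df["Close"].to_numpy()

coeffs = np.polyfit(x, y, deg=2)  # fit a quadratic to the price series
poly = np.poly1d(coeffs)          # callable polynomial
dpoly = poly.deriv()              # its derivative, for further analysis
```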
I’m working on a Jupyter notebook script using Python and Matplotlib which is supposed to fetch historical stock prices for specified stocks via the yfinance package and plot each stock’s volatility vs. potential return.
The expected and actual results can be found here.
As you can see in the second image, the annotations beside each point for the stock symbols are completely missing. I’m very new to Matplotlib, so I’m at a bit of a loss. The code being used is as follows:
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from google.colab import files
sns.set()
directory = '/datasets/stocks/'
stocks = ['AAPL', 'MSFT', 'AMD', 'TWTR', 'TSLA']
#Download each stock's 6-month historical daily stock price and save to a .csv
df_list = list()
for ticker in stocks:
    data = yf.download(ticker, group_by="Ticker", period='6mo')
    df = pd.concat([data])
    csv = df.to_csv()
    with open(directory+ticker+'.csv', 'w') as f:
        f.write(csv)
#Get the .csv filename as well as the full path to each file
ori_name = []
for stock in stocks:
    ori_name.append(stock + '.csv')
stocks = [directory + s for s in ori_name]
dfs = [pd.read_csv(s)[['Date', 'Close']] for s in stocks]
data = reduce(lambda left,right: pd.merge(left,right,on='Date'), dfs).iloc[:, 1:]
returns = data.pct_change()
mean_daily_returns = returns.mean()
volatilities = returns.std()
combine = pd.DataFrame({'returns': mean_daily_returns * 252,
                        'volatility': volatilities * 252})
g = sns.jointplot("volatility", "returns", data=combine, kind="reg",height=7)
#Apply Annotations
for i in range(combine.shape[0]):
    name = ori_name[i].replace(',csv', '')
    x = combine.iloc[i, 1]
    y = combine.iloc[i, 0]
    print(name)
    print(x, y)
    print('\n')
    plt.annotate(name, xy=(x,y))
plt.show()
Printing out the stock name and the respective x,y position I am trying to place the annotation at shows the following:
AAPL.csv
4.285630458382526 0.24836925418906455
MSFT.csv
3.3916453932738966 0.5159276490876817
AMD.csv
6.040090684498841 -0.002179408770566866
TWTR.csv
7.911518867192316 0.8556785016280568
TSLA.csv
9.154424353004579 -0.40596099327336554
Unless I am mistaken, these are the exact points that are being plotted on the graph. As such, I am confused as to why the text isn’t being correctly annotated. I would assume it has something to do with the xycoords argument for plt.annotate(), but I don’t know enough about the different coordinate systems to know which one to use or whether that’s even the root cause of the issue.
Any help would be greatly appreciated. Thank you!
As @JodyKlymak stated in his comment above, the issue with my code stems from jointplot containing several subplots, which prevents annotate() from knowing which axes to base the text placement on. This was easily fixed by simply replacing plt.annotate() with g.ax_joint.annotate().
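The same pitfall exists for any multi-axes figure, not just jointplot: plt.annotate() draws on whichever axes happens to be current. A minimal matplotlib-only sketch of targeting a specific axes explicitly (the label and coordinates are made up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)
ax2.plot([0, 1], [0, 1])

# plt.annotate() always draws on the *current* axes, which in a multi-axes
# figure (like seaborn's jointplot) may not be the one you expect.
# Calling annotate on the axes object itself removes the ambiguity.
ann = ax2.annotate("TSLA", xy=(0.5, 0.5))
```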
I have converted a continuous dataset to categorical. I am getting NaN values whenever the value of the continuous data is 0.0 after conversion. Below is my code:
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots for a better understanding of how the actual data looks and how the converted data looks. This is the main dataset. This is what it becomes after using bins and pandas.cut(). How can those "0.00" values stay like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This will make the first interval left-inclusive (it will include the 0 value, as your first interval starts at 0).
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
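The effect is easy to see on a toy Series (the numbers here are made up):

```python
import pandas as pd

data = pd.Series([0.0, 0.03, 0.48, 1.0])
bins = [0.0, 0.05, 0.5, 1.0]

# Default: intervals are half-open (left, right], so 0.0 falls
# outside (0.0, 0.05] and becomes NaN
default_cut = pd.cut(data, bins)

# include_lowest=True widens the first interval to include its left edge
fixed_cut = pd.cut(data, bins, include_lowest=True)
```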
I am graphing data that is stored in a csv. I pull 2 columns of data into a DataFrame, then convert to a Series and graph with matplotlib.
from pandas import Series
from matplotlib import pyplot
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('Proxy/Proxy_Analytics/API_Statistics.csv')
df
Date Distinct_FLD Not_On_MM API_Call_Count Cost CACHE_Count
0 2018-11-12 35711 18468 18468 8.31060 35711
1 2018-11-13 36118 18741 11004 4.95180 46715
2 2018-11-14 34073 17629 8668 3.90060 55383
3 2018-11-15 34126 17522 7817 3.51765 63200
#Cost
cost_df = df[['Date','Cost']]
cost_series = cost_df.set_index('Date')['Cost']
plt.style.use('dark_background')
plt.title('Domain Rank API Cost Over Time')
plt.ylabel('Cost in Dollars')
cost_series.plot(c = 'red')
plt.show()
And this works totally fine. I would like to do the same and graph multiple columns, but when I try to convert the df to a Series I get an error:
#Not Cost
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')['Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']
Error:
KeyError: ('Distinct_FLD', 'Not_On_MM', 'API_Call_Count', 'CACHE_Count')
What can I do to fix this?
It seems that you are trying to convert the columns of a DataFrame into multiple Series, indexed by the 'Date' column of your DataFrame.
Maybe you can try:
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')
Distinct_FLD = not_cost_series['Distinct_FLD']
Not_On_MM = not_cost_series['Not_On_MM']
...
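If the goal is one chart with all of the metrics on it, there is no need for separate Series at all: selecting with a double-bracketed column list keeps a DataFrame, and DataFrame.plot draws one line per column. A minimal sketch with made-up numbers and only two of the columns:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up stand-in for the CSV contents
df = pd.DataFrame({
    'Date': ['2018-11-12', '2018-11-13', '2018-11-14'],
    'Distinct_FLD': [35711, 36118, 34073],
    'API_Call_Count': [18468, 11004, 8668],
})

# A double-bracketed list of column names selects a DataFrame, not a Series
not_cost = df.set_index('Date')[['Distinct_FLD', 'API_Call_Count']]

# DataFrame.plot draws one line per column on a shared Date axis
ax = not_cost.plot()
```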
I'm using matplotlib to plot some data imported from CSV files. These files have the following format:
Date,Time,A,B
25/07/2016,13:04:31,5,25550
25/07/2016,13:05:01,0,25568
....
01/08/2016,19:06:43,0,68425
The dates are formatted as they would be in the UK, i.e. %d/%m/%Y. The end result is to have two plots: one of how A changes with time, and one of how B changes with time. I'm importing the data from the CSV like so:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as plt
from datetime import datetime
import sys
...
def analyze_log(file, y):
    data = csv2rec(open(file, 'rb'))
    fig = plt.figure()
    date_vec = [datetime.strptime(str(x), '%Y-%m-%d').date() for x in data['date']]
    print date_vec[0]
    print date_vec[len(date_vec)-1]
    time_vec = [datetime.strptime(str(x), '%Y-%m-%d %X').time() for x in data['time']]
    print time_vec[0]
    print time_vec[len(time_vec)-1]
    datetime_vec = [datetime.combine(d, t) for d, t in zip(date_vec, time_vec)]
    print datetime_vec[0]
    print datetime_vec[len(datetime_vec)-1]
    y_vec = data[y]
    plt.plot(datetime_vec, y_vec)
    ...
    # formatters, axis headers, etc.
    ...
    return plt
And all was working fine before 01 August. However, since then, matplotlib is trying to plot my 01/08/2016 data points as 2016-01-08 (08 Jan)!
I get a plotting error because it tries to plot from January to July:
RuntimeError: RRuleLocator estimated to generate 4879 ticks from 2016-01-08 09:11:00+00:00 to 2016-07-29 16:22:34+00:00:
exceeds Locator.MAXTICKS * 2 (2000)
What am I doing wrong here? The results of the print statements in the code above are:
2016-07-25
2016-01-08 #!!!!
13:04:31
19:06:43
2016-07-25 13:04:31
2016-01-08 19:06:43 #!!!!
Matplotlib's csv2rec function already parses your dates and tries to be intelligent about it. The function has two options to influence the parsing; dayfirst should help here:
dayfirst: default is False so that MM-DD-YY has precedence over DD-MM-YY.
yearfirst: default is False so that MM-DD-YY has precedence over YY-MM-DD.
See http://labix.org/python-dateutil#head-b95ce2094d189a89f80f5ae52a05b4ab7b41af47 for further information.
Note that the strings in your file are in %d/%m/%Y format, but your strptime call sees str() of dates that csv2rec has already parsed — which is why the %Y-%m-%d specifier there works at all. The day/month swap happens inside csv2rec, before your code runs, so that is where it has to be fixed (see the dayfirst option above).
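csv2rec has since been removed from matplotlib, so on a modern stack the safest route is to let pandas parse the dates with an explicit format, which makes an ambiguous string like 01/08/2016 impossible to misread. A minimal sketch:

```python
import pandas as pd

# UK-style day-first strings like the ones in the CSV
raw = pd.Series(['25/07/2016', '01/08/2016'])

# An explicit format leaves no room for day-first/month-first ambiguity
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
```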