I’m working on a Jupyter notebook script using Python and Matplotlib which is supposed to fetch historical stock prices for specified stocks via the yfinance package and plot each stock’s volatility vs. potential return.
The expected and actual results can be found here.
As you can see in the second image, the annotations beside each point for the stock symbols are completely missing. I’m very new to Matplotlib, so I’m at a bit of a loss. The code being used is as follows:
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from google.colab import files
sns.set()
directory = '/datasets/stocks/'
stocks = ['AAPL', 'MSFT', 'AMD', 'TWTR', 'TSLA']
#Download each stock's 6-month historical daily stock price and save to a .csv
df_list = list()
for ticker in stocks:
    data = yf.download(ticker, group_by="Ticker", period='6mo')
    df = pd.concat([data])
    csv = df.to_csv()
    with open(directory+ticker+'.csv', 'w') as f:
        f.write(csv)
#Get the .csv filename as well as the full path to each file
ori_name = []
for stock in stocks:
    ori_name.append(stock + '.csv')
stocks = [directory + s for s in ori_name]
dfs = [pd.read_csv(s)[['Date', 'Close']] for s in stocks]
data = reduce(lambda left,right: pd.merge(left,right,on='Date'), dfs).iloc[:, 1:]
returns = data.pct_change()
mean_daily_returns = returns.mean()
volatilities = returns.std()
combine = pd.DataFrame({'returns': mean_daily_returns * 252,
                        'volatility': volatilities * 252})
g = sns.jointplot(x="volatility", y="returns", data=combine, kind="reg", height=7)
#Apply Annotations
for i in range(combine.shape[0]):
    name = ori_name[i].replace(',csv', '')
    x = combine.iloc[i, 1]
    y = combine.iloc[i, 0]
    print(name)
    print(x, y)
    print('\n')
    plt.annotate(name, xy=(x,y))
plt.show()
Printing out the stock name and the respective x,y position I am trying to place the annotation at shows the following:
AAPL.csv
4.285630458382526 0.24836925418906455
MSFT.csv
3.3916453932738966 0.5159276490876817
AMD.csv
6.040090684498841 -0.002179408770566866
TWTR.csv
7.911518867192316 0.8556785016280568
TSLA.csv
9.154424353004579 -0.40596099327336554
Unless I am mistaken, these are the exact points that are being plotted on the graph. As such, I am confused as to why the text isn’t being correctly annotated. I would assume it has something to do with the xycoords argument for plt.annotate(), but I don’t know enough about the different coordinate systems to know which one to use or whether that’s even the root cause of the issue.
Any help would be greatly appreciated. Thank you!
As @JodyKlymak stated in his comment above, the issue with my code stems from jointplot creating several subplots (axes). plt.annotate() draws on the current axes, which is not necessarily the main joint axes, so the text ended up in the wrong place. This was fixed by replacing plt.annotate() with g.ax_joint.annotate().
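The mechanism can be reproduced without seaborn. The following is only a minimal sketch: it builds a figure with two axes by hand (the names ax_joint and ax_marg are stand-ins chosen for illustration, not the objects seaborn creates) and uses rounded versions of the coordinates printed in the question. It shows that the pyplot "current axes" is the one added last, and that calling annotate() on a specific axes object puts the labels where the points are.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# (volatility, returns) pairs, rounded from the output printed in the question
points = {
    "AAPL": (4.2856, 0.2484),
    "MSFT": (3.3916, 0.5159),
    "AMD": (6.0401, -0.0022),
    "TWTR": (7.9115, 0.8557),
    "TSLA": (9.1544, -0.4060),
}

fig = plt.figure(figsize=(7, 7))
ax_joint = fig.add_subplot(2, 2, 3)  # stand-in for the main scatter axes
ax_marg = fig.add_subplot(2, 2, 4)   # stand-in for a marginal axes, added last

ax_joint.scatter([x for x, _ in points.values()],
                 [y for _, y in points.values()])

# plt.annotate() would target plt.gca(), which is the axes added last,
# not the axes holding the scatter points.
current_is_joint = plt.gca() is ax_joint

# Annotating through the axes object removes the ambiguity.
for name, (x, y) in points.items():
    ax_joint.annotate(name, xy=(x, y))

fig.savefig("annotated.png")
```

With seaborn, the equivalent of ax_joint here is the ax_joint attribute of the object returned by jointplot, as described in the answer above.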
Just to be upfront, I am a mechanical engineer with limited coding experience, though I have some programming classes under my belt (Java, C++, and Lisp).
I have inherited this code from my predecessor and am just trying to make it work for what I'm doing with it. I need to iterate through an Excel file whose column A holds the values 0, 1, 2, and 3 (in the code below this corresponds to "Revs"). I need to pick out all the rows with value 0 and put them into a separate folder, then do the same for value 1, value 2, and so on. Thank you for bearing with me; I appreciate any help I can get.
import pandas as pd
import numpy as np
import os
import os.path
import xlsxwriter
import matplotlib.pyplot as plt
import six
import matplotlib.backends.backend_pdf
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import AutoMinorLocator, MultipleLocator
def CamAnalyzer(entryName):
    #Enter excel data from file as a dataframe
    df = pd.read_excel (str(file_loc) + str(entryName), header = 1) #header 1 to get correct header row
    print (df)
    #setup grid for plots
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(17,22))
    gs = GridSpec(3,2, figure=fig)
    props = dict(boxstyle='round', facecolor='w', alpha=1)
    #create a list of 4 smaller dataframes by splitting df when the rev count changes and name them
    dfSplit = list(df.groupby("Revs"))
    names = ["Air Vent","Inlet","Diaphram","Outlet"]
    for x, y in enumerate(dfSplit):
        #for each smaller dataframe #x,(df-y), create a polar plot and assign it to a space in the grid
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in") #radius measurement column has units. ditch em
        r = r.apply(pd.to_numeric) + zero/2 #convert all values in the frame to a float
        theta = dfs["Rads"]
        if x<2:
            ax = fig.add_subplot(gs[1,x],polar = True)
        else:
            ax = fig.add_subplot(gs[2,x-2],polar = True)
        ax.set_rlim(0,0.1) #set limits to radial axis
        ax.plot(theta, r)
        ax.grid(True)
        ax.set_title(names[x]) #nametag
    #create another subplot in the grid that overlays all 4 smaller dataframes on one plot
    ax2 = fig.add_subplot(gs[0,:],polar = True)
    ax2.set_rlim(0,0.1)
    for x, y in enumerate(dfSplit):
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in")
        r = r.apply(pd.to_numeric) + zero/2
        theta = dfs["Rads"]
        ax2.plot(theta, r)
    ax2.set_title("Sample " + str(entryName).strip(".xlsx") + " Overlayed")
    ax2.legend(names,bbox_to_anchor=(1.1, 1.05)) #place legend outside of plot area
    plt.savefig(str(file_loc) + "/Results/" + str(entryName).strip(".xlsx") + ".png")
    print("Results Saved")
I'm on my phone, so I can't check exact code examples, but this should get you started.
First, most of the code you posted is about graphing, and therefore not useful for your needs. The basic approach: use pandas (a library), to read in the Excel sheet, use the pandas function 'groupby' to split that sheet by 'Revs', then iterate through each Rev, and use pandas again to write back to a file. Copying the relevant sections from above:
#this brings in the necessary library
import pandas as pd
#Read excel data from file as a dataframe
#header should point to the row that describes your columns. The first row is row 0.
df = pd.read_excel("filename.xlsx", header = 1)
#create a list of 4 smaller dataframes using GroupBy.
#This returns a 'GroupBy' object.
dfSplit = df.groupby("Revs")
#iterate through the groupby object, saving each
#iterating over key (name) and value (dataframes)
#use the name to build a filename
for name, frame in dfSplit:
    frame.to_excel("Rev "+str(name)+".xlsx")
Edit: I had a chance to test this code, and it should now work. This will depend a little on your actual file (eg, which row is your header row).
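Since the question asks for each value to end up in its own folder, here is a rough sketch of that variation. The helper name split_by_column, the sample data, and the "Rev <value>" folder naming are all made up for illustration. It writes CSV files so that it runs without an Excel engine installed; swapping to_csv for to_excel (with a .xlsx filename) should give Excel output if your pandas installation can write Excel files.

```python
import os
import tempfile

import pandas as pd


def split_by_column(df, column, out_dir):
    """Write one CSV per distinct value of `column`, each in its own folder.

    Returns a dict mapping str(value) to the path that was written.
    """
    written = {}
    for value, frame in df.groupby(column):
        folder = os.path.join(out_dir, f"Rev {value}")
        os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
        path = os.path.join(folder, f"Rev {value}.csv")
        frame.to_csv(path, index=False)
        written[str(value)] = path
    return written


# Hypothetical data standing in for the Excel sheet from the question.
sample = pd.DataFrame({
    "Revs": [0, 0, 1, 2, 3, 3],
    "Measurement": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

out_dir = tempfile.mkdtemp()
paths = split_by_column(sample, "Revs", out_dir)
```

Here paths["0"] points at the file holding only the rows where Revs is 0, and likewise for the other values.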
I want to add the label name to the respective hover labels. For example, the hover label in the image that reads %{label} should instead read Workplace Closing, and so on.
There is no custom_data or text property for parallel category plots. I tried using meta by passing it a list of all the labels (meta=[dim[x]['label'] for x in range(len(dim))]), but it displays the entire list on every hoverlabel rather than one element per hoverlabel. I also tried using %{label}, %{labels}, %{dimension} and some more to find any built in functionality, like you would use %{x} or %{y} in a plot with x and y arguments.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
df = {}
dim = []
for idx,var in enumerate(['country','School closing','Workplace closing','Cancel public events',
                          'Restrictions on gatherings','Close public transport','Stay at home requirements',
                          'Restrictions on internal movement','International travel controls',
                          'Public information campaigns']):
    df[var] = np.random.randint(4, size=4)
    dim.append(go.parcats.Dimension(values=df[var], label=var.title(), categoryorder='category ascending'))
df = pd.DataFrame(df)
fig = go.Figure(data = [go.Parcats(dimensions=[x for x in dim],
line={'color': df.country})])
fig.update_traces(hovertemplate='%{label}')
Help would be appreciated!
My name is Luis Francisco Gomez and I am taking the course Intermediate Python > 1 Matplotlib > Sizes, which belongs to Data Scientist with Python on DataCamp. I am reproducing the exercises of the course; in this part you have to make a scatter plot in which the size of each point reflects the population of the country. I tried to reproduce the DataCamp results with this code:
# load subpackage
import matplotlib.pyplot as plt
## load other libraries
import pandas as pd
import numpy as np
## import data
gapminder = pd.read_csv("https://assets.datacamp.com/production/repositories/287/datasets/5b1e4356f9fa5b5ce32e9bd2b75c777284819cca/gapminder.csv")
gdp_cap = gapminder["gdp_cap"].tolist()
life_exp = gapminder["life_exp"].tolist()
# create an np array that contains the population
pop = gapminder["population"].tolist()
pop_np = np.array(pop)
plt.scatter(gdp_cap, life_exp, s = pop_np*2)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
# Display the plot
plt.show()
However, I get this:
But in theory you should get this:
I don't understand what the problem is with the argument s in plt.scatter.
You need to scale your s:
plt.scatter(gdp_cap, life_exp, s = pop_np*2/1000000)
Per the docs, s is "The marker size in points**2."
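To make the scaling concrete, here is a small sketch with made-up numbers (the three countries below are hypothetical, not rows from the gapminder file). Because s is an area in points squared, a raw population of hundreds of millions produces markers far larger than the figure, while dividing by one million first gives sizes in a plottable range.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values for three countries
gdp_cap = np.array([1000.0, 10000.0, 40000.0])
life_exp = np.array([55.0, 70.0, 80.0])
population = np.array([30e6, 200e6, 1300e6])

# Convert to millions of people, then double, as in the answer above.
# Each entry is a marker area in points**2.
sizes = population / 1e6 * 2

fig, ax = plt.subplots()
ax.scatter(gdp_cap, life_exp, s=sizes)
ax.set_xscale("log")
ax.set_xlabel("GDP per Capita [in USD]")
ax.set_ylabel("Life Expectancy [in years]")
fig.savefig("scaled_sizes.png")
```

The factor of one million is a choice that suits population counts; other data would need a different divisor.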
This is because your sizes are too large; scale them down. Also, there's no need to create all the intermediate arrays:
plt.scatter(gapminder.gdp_cap,
gapminder.life_exp,
s=gapminder.population/1e6)
Output:
I think you should use
plt.scatter(gdp_cap, life_exp, s = np.array(gdp_cap)*2)
or maybe reduce or scale pop_np
General Overview:
I am creating a graph of a large data set; however, I have created a sample text document so that the problem is easier to work through.
The data comes from an Excel document that will be saved as a CSV.
Problem:
I am able to compile the data and it will graph (see below). However, the way I pull the data will not work for all of the different Excel sheets I am going to pull from.
More Detail of problem:
The Y-values (labeled 'Value' and 'Value1') are being pulled from the Excel sheet using the numbers 26 and 31 (see picture and code).
This is a problem because the values 26 and 31 will not be the same for each graph.
Let's take a look so that this makes more sense.
Here is my code
import pandas as pd
import matplotlib.pyplot as plt
pd.read_csv('CSV_GM_NB_Test.csv').T.to_csv('GM_NB_Transpose_Test.csv', header=False)
df = pd.read_csv('GM_NB_Transpose_Test.csv', skiprows = 2)
DID = df['SN']
Value = df['26']
Value1 = df['31']
x= (DID[16:25])
y= (Value[16:25])
y1= (Value1[16:25])
"""
print(x,y)
print(x,y1)
"""
plt.plot(x.astype(int), y.astype(int))
plt.plot(x.astype(int), y1.astype(int))
plt.show()
Output:
Data Set:
Below in the comments you will find the 0bin link to my data set; this is because I do not have enough reputation to post two links.
As you can see from the Data Set
X- DID = Blue
Y-Value = Green
Y-Value1 = Grey
Troublesome Values = Red
The problem again is that the data for the Y-values is pulled from rows 10 and 11 using the values 26 and 31 under SN.
Let me know if more information is needed.
Thank you
Not sure why you are creating the transposed CSV version. It is also possible to work directly from your original data. For example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('CSV_GM_NB_Test.csv', skiprows=8)
data = df.iloc[:,19:].T
data.columns = df['SN']
data.plot()
plt.show()
This would give you:
You can use pandas.DataFrame.iloc to get a slice of your data by integer position (the .ix indexer used in older pandas versions has since been removed). The [:,19:] selects columns 19 onwards, and the final .T transposes the result. You can then use the values of the SN column as column headings by assigning them to .columns.
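The slice, transpose, and relabel steps can be seen on a toy frame. This is only an illustration: the column names and numbers below are invented, and the slice starts at column 2 rather than 19 because the toy frame is much narrower than the real file.

```python
import pandas as pd

# Invented stand-in for the CSV: one row per serial number (SN),
# one metadata column, then the measurement columns.
df = pd.DataFrame({
    "SN": [26, 31],
    "meta": ["a", "b"],
    "1": [10, 20],
    "2": [11, 21],
    "3": [12, 22],
})

# Keep the measurement columns (position 2 onwards) and transpose,
# so that each original row becomes a column.
data = df.iloc[:, 2:].T

# Label the new columns with the SN values instead of 0, 1, ...
data.columns = df["SN"]
```

After this, data[26] holds the measurements for SN 26, whatever row that serial number happened to sit in, which is the point of not hard-coding row positions.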
I am trying to create a python script that reads a CSV file that contains data arranged with sample names across the first row and data below each name, as such:
sample1,sample2,sample3
343.323,234.123,312.544
From the dataset I am trying to draw cumulative distribution functions for each sample onto the same axis. Using the code below:
import matplotlib.pyplot as plt
import numpy as np
import csv
def isfloat(value):
    '''make sure sample values are floats
    (problem with different number of values per sample)'''
    try:
        float(value)
        return True
    except ValueError:
        return False
def createCDFs (dataset):
    '''create a dictionary with sample name as key and data for each
    sample as one list per key'''
    dataset = dataset
    num_headers = len(list(dataset))
    dict_CDF = {}
    for a in dataset.keys():
        dict_CDF["{}".format(a)]= 1. * np.arange(len(dataset[a])) / (len(dataset[a]) - 1)
    return dict_CDF
def getdata ():
    '''retrieve data from a CSV file - file must have sample names in first row
    and data below'''
    with open('file.csv') as csvfile:
        reader = csv.DictReader(csvfile, delimiter = ',' )
        #create a dict that has sample names as key and associated ages as lists
        dataset = {}
        for row in reader:
            for column, value in row.iteritems():
                if isfloat(value):
                    dataset.setdefault(column, []).append(value)
                else:
                    break
    return dataset
x = getdata()
y = createCDFs(x)
#plot data
for i in x.keys():
    ax1 = plt.subplot(1,1,1)
    ax1.plot(x[i],y[i],label=str(i))
plt.legend(loc='upper left')
plt.show()
This gives the output below, which only properly displays one of the samples (Sample1 in Figure 1A).
Figure 1A. Only one CDF is displaying correctly (Sample1). B. Expected output
The number of values per sample differ and I think this is where my problem lies.
This has been really bugging me as I think the solution should be rather simple. Any help/suggestions would be helpful. I simply want to know how I display the data correctly. Data can be found here. The expected output is shown in Figure 1B.
Here is a simpler approach, provided you are willing to use pandas. I used it to calculate the cumulative distribution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data_req = pd.read_table("yourfilepath", sep=",")
#sort each column on its own; sorting inside apply() would be undone
#when pandas realigns the results on the original index
#note that you have to drop the NaNs per column to have appropriate
#dimensions per variable.
for col in data_req.columns:
    x = data_req[col].dropna().sort_values()
    y = np.linspace(0., 1., len(x))
    plt.plot(x, y, label=col)
plt.legend(loc='upper left')
plt.show()
In the end, I got the figure you were looking for:
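For reference, the per-sample calculation can also be written as a small standalone function using only NumPy. This is a sketch: the function name ecdf and the sample values are made up, and it follows the convention used in the answer above, where the cumulative fraction runs evenly from 0 to 1. It converts the values to floats and sorts them first, which the code in the question did not do.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and matching cumulative fractions from 0 to 1."""
    x = np.asarray(values, dtype=float)  # strings such as "343.323" become floats
    x = x[~np.isnan(x)]                  # drop missing entries (short samples)
    x = np.sort(x)                       # a CDF needs the values in ascending order
    y = np.linspace(0.0, 1.0, len(x))
    return x, y


# Made-up sample with string values and one missing entry
sample_x, sample_y = ecdf(["3.0", "1.0", "nan", "2.0"])
```

Plotting sample_x against sample_y for each sample on the same axes gives one curve per sample, regardless of how many values each one has.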