Problem with the argument s in matplotlib.pyplot.scatter - python

My name is Luis Francisco Gomez and I am taking the course Intermediate Python > 1 Matplotlib > Sizes, which belongs to Data Scientist with Python on DataCamp. I am reproducing the exercises of the course; in this part you have to make a scatter plot in which the size of each point corresponds to the population of the country. I tried to reproduce the DataCamp result with this code:
# load subpackage
import matplotlib.pyplot as plt
## load other libraries
import pandas as pd
import numpy as np
## import data
gapminder = pd.read_csv("https://assets.datacamp.com/production/repositories/287/datasets/5b1e4356f9fa5b5ce32e9bd2b75c777284819cca/gapminder.csv")
gdp_cap = gapminder["gdp_cap"].tolist()
life_exp = gapminder["life_exp"].tolist()
# create an np array that contains the population
pop = gapminder["population"].tolist()
pop_np = np.array(pop)
plt.scatter(gdp_cap, life_exp, s = pop_np*2)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
# Display the plot
plt.show()
However, I get this:
But in theory you should get this:
I don't understand what the problem is with the argument s in plt.scatter.

You need to scale your s:
plt.scatter(gdp_cap, life_exp, s = pop_np*2/1000000)
Per the docs, s is the marker size in points**2.
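To make the units concrete: s is an area in points squared, so passing raw populations in the hundreds of millions asks for markers far larger than the whole figure. Here is a minimal sketch of the scaling, assuming made-up values (the three countries and the 1e6 divisor are illustrative, not the Gapminder data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values, only for illustration
gdp_cap = np.array([1000.0, 10000.0, 40000.0])
life_exp = np.array([55.0, 70.0, 80.0])
pop = np.array([30e6, 200e6, 1300e6])

# s is an area in points**2, so bring it down to the tens-to-thousands range
sizes = pop / 1e6

fig, ax = plt.subplots()
points = ax.scatter(gdp_cap, life_exp, s=sizes)
ax.set_xscale("log")
```

Because s is an area, the visible diameter of a marker grows with the square root of s, not with s itself.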

This is because your sizes are too large; scale them down. Also, there's no need to create all the intermediate arrays:
plt.scatter(gapminder.gdp_cap,
            gapminder.life_exp,
            s=gapminder.population/1e6)
Output:

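I think you should use
plt.scatter(gdp_cap, life_exp, s = np.array(gdp_cap)*2)
(gdp_cap is a plain list, so gdp_cap*2 would repeat the list instead of doubling the values), or maybe reduce or scale pop_np.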

Related

Barplot with significant differences and interactions in python?

I started to use Python 6 months ago, so maybe my question is a naive one. I would like to visualize my data and ANOVA statistics. It is common to do this using a barplot with added lines indicating significant differences and interactions. How do you make a plot like this using Python?
Here is a simple dataframe, with 3 columns (A,B and the p_values already calculated with a t-test)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ar = np.array([ [565.0, 81.0, 1.630947e-02],
[1006.0, 311.0, 1.222740e-27],
[2929.0, 1292.0, 5.559912e-12],
[3365.0, 1979.0, 2.507474e-22],
[2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar,columns = ['A', 'B', 'p_value'])
ax = plt.subplot()
# I calculate the percentage
(df.iloc[:,0:2]/df.iloc[:,0:2].sum()*100).plot.bar(ax=ax)
for container, p_val in zip(ax.containers,df['p_value']):
    labels = [f"{round(v,1)}%" if (p_val > 0.05) else f"(**)\n{round(v,1)}%" for v in container.datavalues]
    ax.bar_label(container,labels=labels, fontsize=10,padding=8)
plt.show()
Initially I just wanted to add a "**" each time a significant difference is observed between the two columns A and B, but the code above does not really work.
Now I would prefer to add lines indicating significant differences and interactions between the A and B columns, but I have no idea how to make that happen.
Regards
JYK
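One thing worth noting about the code in the question: ax.containers holds one container per plotted column (A and B), while p_value has one entry per row, so zipping the two pairs each column with a single p-value. A possible approach, sketched here under assumptions (the 0.05 threshold, the bracket padding and the "**" label are arbitrary choices, not an established convention of any library), is to loop over rows and draw a bracket between the A and B bars of each significant row:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=['A', 'B', 'p_value'])
pct = df[['A', 'B']] / df[['A', 'B']].sum() * 100

fig, ax = plt.subplots()
pct.plot.bar(ax=ax)

# One container per column; bars_a[i] and bars_b[i] are the pair tested by p_value[i]
bars_a, bars_b = ax.containers
pad = 0.02 * pct.to_numpy().max()  # vertical spacing, an arbitrary choice
starred = []
for i, p_val in enumerate(df['p_value']):
    if p_val >= 0.05:  # assumed significance threshold
        continue
    a, b = bars_a[i], bars_b[i]
    x1 = a.get_x() + a.get_width() / 2
    x2 = b.get_x() + b.get_width() / 2
    y = max(a.get_height(), b.get_height()) + pad
    # Bracket: up from bar A, across, and down to bar B
    ax.plot([x1, x1, x2, x2], [y, y + pad, y + pad, y], color='black', lw=1)
    ax.text((x1 + x2) / 2, y + pad, '**', ha='center', va='bottom')
    starred.append(i)

# Leave headroom so the tallest bracket and its label stay inside the axes
ax.set_ylim(top=pct.to_numpy().max() + 6 * pad)
```

The same loop could draw brackets between bars of different rows for interaction effects, given the pairs to compare and their p-values.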

How can I plot arrows on an existing mplsoccer pitch?

I followed McKay Johns's tutorial on YouTube (see the Jupyter notebook for the data: https://github.com/mckayjohns/passmap/blob/main/Pass%20map%20tutorial.ipynb).
I understood everything, but I wanted to make a small change. I wanted to replace plt.plot(...) with:
plt.arrow(df['x'][x],df['y'][x], df['endX'][x] - df['x'][x], df['endY'][x]-df['y'][x],
          shape='full', color='green')
The problem is that I still can't see the arrows. I tried several changes without success, so I'd like to ask the group.
Below you can see the code.
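import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer.pitch import Pitch
## Read in the data (raw string so the backslashes are not treated as escapes)
df = pd.read_csv(r'...\Codes\Plotting_Passes\messibetis.csv')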
#convert the data to match the mplsoccer statsbomb pitch
#to see how to create the pitch, watch the video here: https://www.youtube.com/watch?v=55k1mCRyd2k
df['x'] = df['x']*1.2
df['y'] = df['y']*.8
df['endX'] = df['endX']*1.2
df['endY'] = df['endY']*.8
# Set Base
fig ,ax = plt.subplots(figsize=(13.5,8))
# Change background color of base
fig.set_facecolor('#22312b')
# Change color of base inside
ax.patch.set_facecolor('#22312b')
#this is how we create the pitch
pitch = Pitch(pitch_type='statsbomb',
pitch_color='#22312b', line_color='#c7d5cc')
# Set the axes to our Base
pitch.draw(ax=ax)
# x axis => 0 to 120
# y axis => 80 to 0
# Solution: invert the y axis:
plt.gca().invert_yaxis()
#use a for loop to plot each pass
for x in range(len(df['x'])):
    if df['outcome'][x] == 'Successful':
        #plt.plot((df['x'][x],df['endX'][x]),(df['y'][x],df['endY'][x]),color='green')
        plt.scatter(df['x'][x],df['y'][x],color='green')
        plt.arrow(df['x'][x],df['y'][x], df['endX'][x] - df['x'][x], df['endY'][x]-df['y'][x],
                  shape='full', color='green') # Here is the problem!
    if df['outcome'][x] == 'Unsuccessful':
        plt.plot((df['x'][x],df['endX'][x]),(df['y'][x],df['endY'][x]),color='red')
        plt.scatter(df['x'][x],df['y'][x],color='red')
plt.title('Messi Pass Map vs Real Betis',color='white',size=20)
It always shows:
The problem is that the default values plt.arrow uses for head_width and head_length are too small for your figure. It is drawing the arrows; the arrow heads are just far too tiny to see (even if you zoom out). Try something like the following:
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer.pitch import Pitch
df = pd.read_csv('https://raw.githubusercontent.com/mckayjohns/passmap/main/messibetis.csv')
...
# create a dict for the colors to avoid repetitive code
colors = {'Successful':'green', 'Unsuccessful':'red'}
for x in range(len(df['x'])):
    plt.scatter(df['x'][x],df['y'][x],color=colors[df.outcome[x]], marker=".")
    plt.arrow(df['x'][x],df['y'][x], df['endX'][x] - df['x'][x],
              df['endY'][x]-df['y'][x], color=colors[df.outcome[x]],
              head_width=1, head_length=1, length_includes_head=True)
# setting `length_includes_head` to `True` ensures that the arrow head is
# *part* of the line, not added on top
plt.title('Messi Pass Map vs Real Betis',color='white',size=20)
Result:
Note that you can also use plt.annotate for this, passing specific props to the parameter arrowprops. E.g.:
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer.pitch import Pitch
df = pd.read_csv('https://raw.githubusercontent.com/mckayjohns/passmap/main/messibetis.csv')
...
# create a dict for the colors to avoid repetitive code
colors = {'Successful':'green', 'Unsuccessful':'red'}
for x in range(len(df['x'])):
    plt.scatter(df['x'][x],df['y'][x],color=colors[df.outcome[x]], marker=".")
    props= {'arrowstyle': '-|>,head_width=0.25,head_length=0.5',
            'color': colors[df.outcome[x]]}
    plt.annotate("", xy=(df['endX'][x],df['endY'][x]),
                 xytext=(df['x'][x],df['y'][x]), arrowprops=props)
plt.title('Messi Pass Map vs Real Betis',color='white',size=20)
Result (a bit sharper, if you ask me, but maybe some tweaking with params in plt.arrow can also achieve that):
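A further option, not used in either snippet above, is to draw all passes in one call with Axes.quiver instead of looping. This is only a sketch under assumptions: the three passes below are hypothetical stand-ins for messibetis.csv, already scaled to pitch coordinates, and the pitch itself is left out so that the example needs nothing beyond matplotlib and pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical passes in pitch coordinates (stand-in for the real CSV)
df = pd.DataFrame({
    'x': [10.0, 40.0, 60.0],
    'y': [20.0, 30.0, 50.0],
    'endX': [30.0, 55.0, 90.0],
    'endY': [25.0, 10.0, 40.0],
    'outcome': ['Successful', 'Unsuccessful', 'Successful'],
})
colors = {'Successful': 'green', 'Unsuccessful': 'red'}

fig, ax = plt.subplots()
# angles='xy', scale_units='xy' and scale=1 make every arrow run from
# (x, y) to (endX, endY) in data units
q = ax.quiver(df['x'].to_numpy(), df['y'].to_numpy(),
              (df['endX'] - df['x']).to_numpy(),
              (df['endY'] - df['y']).to_numpy(),
              color=df['outcome'].map(colors).tolist(),
              angles='xy', scale_units='xy', scale=1, width=0.004)
ax.set_xlim(0, 120)
ax.set_ylim(80, 0)  # inverted y axis, as on the statsbomb pitch
```

With quiver the head size is tied to the shaft width (headwidth, headlength), so the heads stay visible without tuning them in data units.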

How to iterate distance calculation for different vehicles from coordinates

I am new to coding and need help developing a Time Space Diagram (TSD) from a CSV file which I got from a VISSIM simulation as a result.
A general TSD looks like this: TSD. I have a CSV which looks like this: CSV.
I want to use "VEHICLE:SIMSEC", the simulation time, as the x axis of the TSD. "NO" is the vehicle number; there are 185 different vehicles and I want to plot each of them as its own line on the TSD. "COORDFRONTX" and "COORDFRONTY" are the x and y coordinates in the simulation, and they give the positions that should go on the y axis of the TSD.
I have tried the following code but did not get the result I want.
import pandas as pd
import matplotlib.pyplot as mp
# take data
data = pd.read_csv(r"C:\Users\hk385\Desktop\VISSIM_DATA_CSV.csv")
df = pd.DataFrame(data, columns=["VEHICLE:SIMSEC", "NO", "DISTTRAVTOT"])
# plot the dataframe
df.plot(x="NO", y=["DISTTRAVTOT"], kind="scatter")
# show the plot
mp.show()
The plot came out to be uninterpretable as there were too many dots. The diagram looks like this: Time Space Diagram. So would you be able to help me or guide me to get a TSD from the CSV I have?
As suggested by mitoRibo, here are the top 20 rows of the CSV:
VEHICLE:SIMSEC,NO,LANE\LINK\NO,LANE\INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,8.42
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,93.0
7.0,1,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,101.49
7.1,1,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,109.99
7.2,1,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,118.49
7.3,1,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,127.0
7.4,1,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,135.51
7.5,1,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,144.03
7.6,1,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,152.56
7.7,1,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,161.09
Thank you.
You can use groupby and iterate through the vehicles, adding each one to your plot. I changed your example data so that there are 2 different vehicles.
import pandas as pd
import io
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO("""
VEHICLE:SIMSEC,NO,LANE_LINK_NO,LANE_INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,0
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,90
6.0,2,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,0
6.1,2,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,30
6.2,2,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,40
6.3,2,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,50
6.4,2,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,60
6.5,2,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,70
6.6,2,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,80
6.7,2,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,90
"""),sep=',')
fig = plt.figure()
#Iterate through each vehicle, adding it to the plot
for vehicle_no,vehicle_df in df.groupby('NO'):
    plt.plot(vehicle_df['VEHICLE:SIMSEC'],vehicle_df['DISTTRAVTOT'], label=vehicle_no)
plt.legend() #comment this out if you don't want a legend
plt.show()
plt.close()
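If the position should come from the coordinates rather than from DISTTRAVTOT, the same groupby idea can compute a cumulative distance per vehicle. A minimal sketch, assuming toy values in place of the real CSV (STEP and DIST_FROM_COORDS are hypothetical column names introduced here):

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the CSV (values are made up)
df = pd.DataFrame({
    'VEHICLE:SIMSEC': [0.0, 0.1, 0.2, 0.0, 0.1],
    'NO':             [1, 1, 1, 2, 2],
    'COORDFRONTX':    [0.0, 3.0, 6.0, 10.0, 10.0],
    'COORDFRONTY':    [0.0, 4.0, 8.0, 0.0, 2.0],
})

# Rows must be in time order within each vehicle before differencing
df = df.sort_values(['NO', 'VEHICLE:SIMSEC'])

# Step between consecutive rows of the same vehicle; the first row of each
# vehicle has no predecessor, so its step is set to 0
dx = df.groupby('NO')['COORDFRONTX'].diff().fillna(0.0)
dy = df.groupby('NO')['COORDFRONTY'].diff().fillna(0.0)
df['STEP'] = np.hypot(dx, dy)

# Running total of the steps, restarted for every vehicle
df['DIST_FROM_COORDS'] = df.groupby('NO')['STEP'].cumsum()
```

DIST_FROM_COORDS can then replace DISTTRAVTOT as the y value in the plotting loop above.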
If you don't mind, could you please try this:
mp.scatter(x="NO", y="DISTTRAVTOT", data=df)
If it still does not work, please attach your data so that I can test it on my side.

Issue Annotating Points With Matplotlib

I’m working on a Jupyter notebook script using Python and Matplotlib which is supposed to fetch historical stock prices for specified stocks via the yfinance package and plot each stock’s volatility vs. potential return.
The expected and actual results can be found here.
As you can see in the second image, the annotations beside each point for the stock symbols are completely missing. I’m very new to Matplotlib, so I’m at a bit of a loss. The code being used is as follows:
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from google.colab import files
sns.set()
directory = '/datasets/stocks/'
stocks = ['AAPL', 'MSFT', 'AMD', 'TWTR', 'TSLA']
#Download each stock's 6-month historical daily stock price and save to a .csv
df_list = list()
for ticker in stocks:
    data = yf.download(ticker, group_by="Ticker", period='6mo')
    df = pd.concat([data])
    csv = df.to_csv()
    with open(directory+ticker+'.csv', 'w') as f:
        f.write(csv)
#Get the .csv filename as well as the full path to each file
ori_name = []
for stock in stocks:
    ori_name.append(stock + '.csv')
stocks = [directory + s for s in ori_name]
dfs = [pd.read_csv(s)[['Date', 'Close']] for s in stocks]
data = reduce(lambda left,right: pd.merge(left,right,on='Date'), dfs).iloc[:, 1:]
returns = data.pct_change()
mean_daily_returns = returns.mean()
volatilities = returns.std()
combine = pd.DataFrame({'returns': mean_daily_returns * 252,
                        'volatility': volatilities * 252})
g = sns.jointplot("volatility", "returns", data=combine, kind="reg",height=7)
#Apply Annotations
for i in range(combine.shape[0]):
    name = ori_name[i].replace(',csv', '')
    x = combine.iloc[i, 1]
    y = combine.iloc[i, 0]
    print(name)
    print(x, y)
    print('\n')
    plt.annotate(name, xy=(x,y))
plt.show()
Printing out the stock name and the respective x,y position I am trying to place the annotation at shows the following:
AAPL.csv
4.285630458382526 0.24836925418906455
MSFT.csv
3.3916453932738966 0.5159276490876817
AMD.csv
6.040090684498841 -0.002179408770566866
TWTR.csv
7.911518867192316 0.8556785016280568
TSLA.csv
9.154424353004579 -0.40596099327336554
Unless I am mistaken, these are the exact points that are being plotted on the graph. As such, I am confused as to why the text isn’t being correctly annotated. I would assume it has something to do with the xycoords argument for plt.annotate(), but I don’t know enough about the different coordinate systems to know which one to use or whether that’s even the root cause of the issue.
Any help would be greatly appreciated. Thank you!
As @JodyKlymak stated in his comment above, the issue with my code stems from jointplot creating several subplots, so plt.annotate() does not know which axes to place the text on. This was easily fixed by replacing plt.annotate() with g.ax_joint.annotate().
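The mechanism can be reproduced without seaborn: plt.annotate() always acts on the current axes, which is not necessarily the axes holding the points. In the sketch below the two-panel figure is only a stand-in for the joint and marginal axes of a jointplot (ax_main and ax_side are hypothetical names), and the current axes is set explicitly to make the effect deterministic:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, (ax_main, ax_side) = plt.subplots(1, 2)
ax_main.scatter([1, 2, 3], [1, 4, 9])

# Make the side axes current, standing in for a marginal axes of a jointplot
plt.sca(ax_side)

plt.annotate("lost", xy=(2, 4))        # lands on the current axes (ax_side)
ax_main.annotate("found", xy=(2, 4))   # lands on the axes that holds the points
```

Calling annotate on the axes object, as with g.ax_joint.annotate(), removes the dependence on whichever axes happens to be current.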

Plotting data with matplotlib takes forever & plot crashes with higher number of samples

I've got an issue with plotting x,y data. x is the time series, y the value as y(x). Data3.txt is a text file that simply contains all the data, without headers, as a matrix with 20 columns and 65534 rows.
Here is the code:
import csv
import numpy as np
import matplotlib.pyplot as plt
dates = []
with open('Data3.txt') as csvDataFile:
    csvReader = csv.reader(csvDataFile,quoting=csv.QUOTE_NONNUMERIC)
    for row in csvReader:
        dates.append(row)
np.array(map(float, dates))
time=[]
value=[]
samples=8000
for row in dates:
    time.append(row[0])
    value.append(row[1])
print(len(time))
print(len(time[:samples]))
plt.plot(time[:samples], value[:samples])
plt.ylim(0,40)
plt.xlim(0,1200)
plt.show()
The plot displays fine as long as samples is at most 7000 (see attached Figure_1). As soon as I set samples to 8000 it takes a lot longer to plot, and the outcome is Figure_2.
print(len(time))
gives 65543
print(len(time[:samples]))
gives 8000
I am really confused by that. Can anyone explain where this error might come from? Any hint about a smarter way of plotting is much appreciated as well. My current plan would be to assign every column to its own list and make the plots.
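On the request for a smarter way of plotting: one option is to load the whole matrix into a NumPy array and slice columns, instead of building one list per column. The sketch below does not diagnose the slowdown; it assumes a comma-separated file without headers and writes a small random stand-in for Data3.txt so that it runs on its own.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Small stand-in for Data3.txt: 100 rows, 20 columns, comma-separated (assumed layout)
rng = np.random.default_rng(0)
fake = np.column_stack([np.arange(100, dtype=float), rng.random((100, 19))])
path = os.path.join(tempfile.mkdtemp(), "Data3.txt")
np.savetxt(path, fake, delimiter=",")

# One call gives a 2-D float array: one row per line, one column per field
data = np.loadtxt(path, delimiter=",")
time = data[:, 0]
value = data[:, 1]

samples = 8000  # slicing past the end simply returns all available rows
fig, ax = plt.subplots()
ax.plot(time[:samples], value[:samples])
```

Any other column can be plotted against time in the same way, for example data[:, 5].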
