Scatter Matrix and linear regression


I have some data that I want to show in a scatter matrix.
Creating the DataFrame and plotting it works fine:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(100)
xlin = np.linspace(0, 10, 20)
a = xlin*0.2 + np.random.randn(20)*3
b = a*2 + 2 + np.random.randn(20)*3
c = b*0.1 - a*2 + np.random.randn(20)*3
d = a*a*0.1 + np.random.randn(20)*3
df = pd.DataFrame(np.array([a,b,c,d]).T, columns=list('ABCD'))
df.plot()
plt.show()
Next I drew a scatter matrix of the DataFrame, which also worked:
from pandas.plotting import scatter_matrix
from scipy.stats import linregress
scatter_matrix(df, figsize=(12,12))
plt.show()
Now I want to show the correlations between the columns using linregress.
I managed to do this, but only in a separate plot below the scatter matrix. How can I draw the regression results inside the scatter matrix itself?
In the end it should look like this:
[image: scatter matrix]
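
One possible approach, sketched here under the assumption that scatter_matrix returns a 2-D array of Axes in which axes[i, j] shows column j on the x-axis and column i on the y-axis: loop over the off-diagonal cells, fit each pair with linregress, and draw the fitted line into the same cell. The annotation position and the red line style are arbitrary choices, not part of the question.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pandas.plotting import scatter_matrix
from scipy.stats import linregress

# Same synthetic data as in the question
np.random.seed(100)
xlin = np.linspace(0, 10, 20)
a = xlin * 0.2 + np.random.randn(20) * 3
b = a * 2 + 2 + np.random.randn(20) * 3
c = b * 0.1 - a * 2 + np.random.randn(20) * 3
d = a * a * 0.1 + np.random.randn(20) * 3
df = pd.DataFrame(np.array([a, b, c, d]).T, columns=list("ABCD"))

# axes[i, j] plots column j (x) against column i (y)
axes = scatter_matrix(df, figsize=(12, 12))

for i, ycol in enumerate(df.columns):
    for j, xcol in enumerate(df.columns):
        if i == j:
            continue  # diagonal cells hold histograms, not scatter plots
        fit = linregress(df[xcol], df[ycol])
        xs = np.array([df[xcol].min(), df[xcol].max()])
        # regression line across the data range of this cell
        axes[i, j].plot(xs, fit.intercept + fit.slope * xs, "r-")
        # correlation coefficient in the top-left corner of the cell
        axes[i, j].annotate(f"r = {fit.rvalue:.2f}", xy=(0.05, 0.88),
                            xycoords="axes fraction", fontsize=8)
```

Call plt.show() afterwards to display the figure.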

Related

Can I take a table from excel and plot a histogram in python?

I have two tables, one 10 by 110 and one 35 by 110. Both contain random numbers drawn from an exponential distribution function provided by my professor. The assignment is to demonstrate the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing data
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
# plotting n10 histogram
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
# plotting n30 histogram
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that OK, or do I need to reshape the tables into a single column? If so, what is the most efficient way to do that?

I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The five differently coloured bars per bin show a separate histogram for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a NumPy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
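
For the central limit theorem part of the question, a minimal sketch follows. It uses randomly generated stand-ins for the Excel tables (the real files are not available here) and assumes that each of the 110 columns is one sample of size n, so the sample means are taken down the columns. The sizes 10 and 30 and the bin counts are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the Excel tables: n rows, 110 columns,
# where each column is assumed to be one sample of size n
tables = {n: rng.exponential(1.0, size=(n, 110)) for n in (10, 30)}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sample_means = {}
for row, (n, table) in enumerate(tables.items()):
    raw = table.flatten()                   # all values in one 1-D array
    sample_means[n] = table.mean(axis=0)    # one mean per column
    axes[row, 0].hist(raw, bins=30, density=True)
    axes[row, 0].set_title(f"n = {n}: raw values")
    axes[row, 1].hist(sample_means[n], bins=11, density=True)
    axes[row, 1].set_title(f"n = {n}: sample means")
fig.tight_layout()
```

The left-hand panels should keep the skewed exponential shape, while the right-hand panels should look more symmetric and narrower as n grows.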

Displaying Colormap/legend with x,y,z plot and fourth variable

I'm using pandas and am very new to programming. I'm plotting energy deposited (eDep) as a function of its x, y and z positions. So far I have managed to get it to plot, but I can't get the colormap (colorbar) to show beside my scatter plot. Any help is much appreciated.
%matplotlib inline
import pandas as pd
import numpy as np
IncubatorBelow = "./Analysis.Test.csv"
# on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
df = pd.read_csv(IncubatorBelow, sep=',', names=['Name','TrackID','ParentID','xPos','yPos','zPos','eDep','DeltaE','Einit','EventID'], low_memory=False, on_bad_lines='skip')
# strip literal parentheses (regex=False treats them as plain characters)
df["xPos"] = df["xPos"].str.replace("(", "", regex=False)
df["zPos"] = df["zPos"].str.replace(")", "", regex=False)
df = df.sort_values(by='Name', ascending=False)
df.dropna(how='any',axis=0,subset=['Name','TrackID','ParentID','xPos','yPos','zPos','eDep','DeltaE','Einit','EventID'], inplace=True)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df['xPos'] = df['xPos'].astype(float)
df['yPos'] = df['yPos'].astype(float)
df['zPos'] = df['zPos'].astype(float)
#df10[df10['Name'].str.contains("e-")]
threedee = plt.figure().add_subplot(projection='3d')  # recent Matplotlib no longer accepts gca(projection=...)
threedee.scatter(df["xPos"], df["yPos"], df["zPos"], c=df["eDep"], cmap=plt.cm.coolwarm)
threedee.set_xlabel("x(mm)")
threedee.set_ylabel("y(mm)")
threedee.set_zlabel("z(mm)")
plt.show()
Here's what the plot looks like:
It's from a particle physics simulation using GEANT4. The actual files are extremely large (3.7 GB, which I've split into chunks of roughly 40 MB), and this plot represents only a small fraction of the data.
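
A possible way to get the colorbar, sketched with random stand-in data rather than the GEANT4 file: keep the object returned by scatter and pass it to fig.colorbar. The variable names and the pad value are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Random stand-ins for xPos, yPos, zPos and eDep
x, y, z = rng.normal(size=(3, 200))
e_dep = rng.exponential(1.0, size=200)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# scatter returns the collection that carries the colour mapping
sc = ax.scatter(x, y, z, c=e_dep, cmap="coolwarm")
ax.set_xlabel("x(mm)")
ax.set_ylabel("y(mm)")
ax.set_zlabel("z(mm)")

# the colorbar needs that collection to know the value range and colormap
cbar = fig.colorbar(sc, ax=ax, pad=0.1)
cbar.set_label("eDep")
```

With the question's DataFrame, the same pattern would apply to the result of threedee.scatter(...), provided the eDep column is numeric.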

matplotlib overlay a normal distribution with stddev axis onto another plot

I have a series of data that I'm reading in from a tutorial site.
I've managed to plot the distribution of the TV column in that data. However, I also want to overlay a normal distribution curve with standard-deviation ticks on a second x-axis, so that I can compare the two curves. I'm struggling to work out how to do it.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# draw distribution curve
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
plt.plot(h, pdf)
Here is a diagram close to what I'm after, where the x-axis is in standard deviations. All this example needs is a second x-axis to show the values of data.TV.

I'm not sure exactly what you want, but you could probably use a second axis like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('Advertising.csv', index_col=0)
fig, ax1 = plt.subplots()
# draw distribution curve
h = sorted(data.TV)
ax1.plot(h,'b-')
ax1.set_xlabel('TV')
ax1.set_ylabel('Count', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
ax2 = ax1.twinx()
ax2.plot(h, pdf, 'r.')
ax2.set_ylabel('pdf', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
plt.show()

OK, assuming that you want to plot the distribution of your data together with the fitted normal distribution, using two x-axes, one way to achieve this is as follows.
Plot the normalized data together with the standard normal distribution. Then use matplotlib's twiny() to add a second x-axis to the plot. Use the same tick positions as the original x-axis on the second axis, but scale the labels so that they show the corresponding original TV values. The result looks like this:
Code
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
h_n = (h - hmean) / hstd
pdf = stats.norm.pdf( h_n )
# plot data
f,ax1 = plt.subplots()
ax1.hist( h_n, 20, density=True )  # density replaces the removed normed argument
ax1.plot( h_n , pdf, lw=3, c='r')
ax1.set_xlim( [h_n.min(), h_n.max()] )
ax1.set_xlabel( r'TV $[\sigma]$' )
ax1.set_ylabel( r'Relative Frequency')
ax2 = ax1.twiny()
ax2.grid( False )
ax2.set_xlim( ax1.get_xlim() )
ax2.set_ylim( ax1.get_ylim() )
ax2.set_xlabel( r'TV' )
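ticklocs = ax2.get_xticks()
ax2.set_xticks( ticklocs )  # fix the tick positions before relabelling them
ax2.set_xticklabels( [ str( round( t*hstd + hmean, 2) ) for t in ticklocs ] )
ax2.set_xlim( ax1.get_xlim() )  # restore the limits in case set_xticks widened them
plt.show()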
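
As an alternative to relabelling the ticks by hand, Matplotlib 3.1 and later provide Axes.secondary_xaxis, which takes a pair of conversion functions and keeps the second axis in sync automatically. The sketch below uses uniformly distributed random numbers as a stand-in for data.TV, and the helper names to_tv and to_sigma are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)   # stand-in for data.TV
mean, std = tv.mean(), tv.std()
z = np.sort((tv - mean) / std)       # data in units of standard deviations

fig, ax = plt.subplots()
ax.hist(z, bins=20, density=True)
ax.plot(z, stats.norm.pdf(z), lw=3, c="r")
ax.set_xlabel(r"TV $[\sigma]$")
ax.set_ylabel("Relative Frequency")

def to_tv(s):
    """Convert standard deviations back to original TV values."""
    return s * std + mean

def to_sigma(v):
    """Convert original TV values to standard deviations."""
    return (v - mean) / std

# second x-axis on top, linked to the bottom axis through the two functions
secax = ax.secondary_xaxis("top", functions=(to_tv, to_sigma))
secax.set_xlabel("TV")
```

Because the top axis is derived from the conversion functions, its ticks stay correct when the plot is zoomed or the limits change.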

How to locate the median in a (seaborn) KDE plot?

I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, fill=True)  # fill replaces the deprecated shade argument (seaborn >= 0.11)
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
plt.show()
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.

You need to:
1. Extract the data of the KDE line.
2. Integrate it to calculate the cumulative distribution function (CDF).
3. Find the value at which the CDF equals 1/2; that is the median.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.integrate import cumulative_trapezoid  # called cumtrapz in older SciPy releases
sns.set_palette("hls", 1)
data = np.random.randn(30)
# plot without fill: newer seaborn versions add no Line2D when the curve is filled
p = sns.kdeplot(data)
x, y = p.get_lines()[0].get_data()
plt.fill_between(x, y, alpha=0.25)  # shade under the curve by hand
# note the argument order: y first, then x
# initial=0 prepends a 0 so that the result has the same length as x
cdf = cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf - 0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median)
plt.show()
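
The same three steps can also be sketched without reading data back from the plot, by evaluating the density with scipy.stats.gaussian_kde. This is an assumption-laden variant: gaussian_kde uses its own default bandwidth, so the curve may differ slightly from the one seaborn draws, and the grid limits (three standard deviations beyond the data) are an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import cumulative_trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=30)

# Step 1: evaluate the KDE on a grid wide enough to cover the tails
kde = gaussian_kde(data)
pad = 3 * data.std()
x = np.linspace(data.min() - pad, data.max() + pad, 1000)
y = kde(x)

# Step 2: integrate the density to get the CDF
cdf = cumulative_trapezoid(y, x, initial=0)

# Step 3: the grid point where the CDF is closest to 1/2 is the median
idx = np.abs(cdf - 0.5).argmin()
x_median, y_median = x[idx], y[idx]

fig, ax = plt.subplots()
ax.plot(x, y)
ax.fill_between(x, y, alpha=0.25)
ax.vlines(x_median, 0, y_median, color="k")
```

Note that this is the median of the estimated density, which need not coincide exactly with np.median(data).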

Python scatter-plot: Conditions for marker styles?

I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...

What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
category = np.random.randint(0, 4, 100)  # integers 0..3 (upper bound is exclusive)
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
    i = (category == k)
    plt.scatter(x[i], y[i], marker=m)
plt.show()
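
A small variation on the loop above, sketched with an explicit label-to-marker dictionary and a legend so that each marker can be identified in the plot. The dictionary name marker_for and the "class N" legend texts are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)
category = rng.integers(0, 4, size=100)  # labels 0..3

# explicit mapping from label to marker style
marker_for = {0: "x", 1: "o", 2: "s", 3: "^"}

fig, ax = plt.subplots()
for label, marker in marker_for.items():
    mask = category == label             # boolean mask for this label
    ax.scatter(x[mask], y[mask], marker=marker, label=f"class {label}")
ax.legend()
```

Each call to scatter creates one collection, so the legend gets one entry per label.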

Matplotlib's scatter does not accept different markers within a single call.
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output:
Additionally, you can use the pandas.cut function to plot bins. (It's something I regularly need when producing graphs where a third continuous value serves as a parameter.) It is used as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following:
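
To see what pandas.cut actually assigns before plotting, a minimal sketch using the z values from the example above may help. With bins=3 the range of z is split into three equal-width intervals; labels=False returns the integer bin index instead of the interval.

```python
import pandas as pd

z = [136.993, 133.128, 143.710, 129.088, 139.860]

# Categorical whose categories are the three equal-width intervals
bins = pd.cut(z, bins=3)

# Same binning, but returned as integer bin indices (0, 1 or 2)
codes = pd.cut(z, bins=3, labels=False)

print(bins.categories)
print(codes)
```

The interval categories are what seaborn shows in the legend when the bins column is passed as hue and style.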
