I have some data that I want to show in a scatter matrix.
Generating the data is fine; I did it with:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
xlin = np.linspace(0, 10, 20)
a = xlin*0.2 + np.random.randn(20)*3
b = a*2 + 2 + np.random.randn(20)*3
c = b*0.1 - a*2 + np.random.randn(20)*3
d = a*a*0.1 + np.random.randn(20)*3
df = pd.DataFrame(np.array([a,b,c,d]).T, columns=list('ABCD'))
Now I want to draw a scatter matrix of my DataFrame. So far, so good;
I did that with:
from pandas.plotting import scatter_matrix
from scipy.stats import linregress
scatter_matrix(df, figsize=(12,12))
Now I want to show the correlations between the columns using 'linregress'.
I managed to do this, but only in a separate plot below the scatter matrix. How can I make it part of the scatter matrix itself?
In the end it should look like this:
[Image: scatter matrix]
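One possible approach, sketched here under assumptions rather than taken from the original post: scatter_matrix returns a 2-D array of Axes, so each off-diagonal panel can be drawn on directly. The seeded random generator, the red line, and the placement of the r label are illustrative choices.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
from scipy.stats import linregress

# Synthetic data in the spirit of the question (seeded for reproducibility)
rng = np.random.default_rng(0)
a = np.linspace(0, 10, 20) * 0.2 + rng.normal(size=20) * 3
b = a * 2 + 2 + rng.normal(size=20) * 3
c = b * 0.1 - a * 2 + rng.normal(size=20) * 3
d = a * a * 0.1 + rng.normal(size=20) * 3
df = pd.DataFrame({"A": a, "B": b, "C": c, "D": d})

# scatter_matrix returns a 2-D array of Axes: row i is the y column,
# column j is the x column
axes = scatter_matrix(df, figsize=(12, 12))

fits = {}
for i, ycol in enumerate(df.columns):
    for j, xcol in enumerate(df.columns):
        if i == j:
            continue  # the diagonal holds histograms, not scatter plots
        ax = axes[i, j]
        fit = linregress(df[xcol], df[ycol])
        fits[(xcol, ycol)] = fit
        # draw the fitted line across the observed x range
        xs = np.array([df[xcol].min(), df[xcol].max()])
        ax.plot(xs, fit.intercept + fit.slope * xs, color="red")
        # label the panel with the correlation coefficient
        ax.annotate(f"r = {fit.rvalue:.2f}", xy=(0.05, 0.9),
                    xycoords="axes fraction")
```

The fits dictionary keeps every regression result, so slopes or p-values can be added to the labels in the same way.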
I have two tables, one 10 by 110 and one 35 by 110, both containing random numbers from an exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import the data
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
# plot the n10 histograms
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
# plot the n30 histograms
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that OK, or do I need to reshape the tables into a single column? If so, what is the most efficient way to do that?
I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.hist(df, bins = bin_edges, density = True)
plt.ylabel('Probability density')
The five differently coloured bars per bin show a separate histogram for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
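For the central limit theorem part of the question, a minimal sketch is to histogram the sample means rather than the raw values. It uses synthetic exponential data instead of the Excel files and assumes each table holds 110 samples of size n; the names n_draws and means are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_draws = 110  # assumed number of samples per table

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
means = {}
for ax, n in zip(axes, (10, 30)):
    # each row is one sample of size n from an exponential distribution
    samples = rng.exponential(scale=1.0, size=(n_draws, n))
    means[n] = samples.mean(axis=1)  # one mean per sample
    ax.hist(means[n], bins=15, density=True)
    ax.set_title(f"means of n = {n}")
    ax.set_xlabel("sample mean")
axes[0].set_ylabel("Probability density")
```

With larger n the histogram of the means should look narrower and more symmetric, which is the behaviour the theorem describes.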
I'm using Pandas and am very new to programming. I'm plotting Energy Deposited (eDep) as a function of its x, y and z positions. So far I have been able to get it to plot, but I can't get the colorbar for the colormap to show beside my scatter plot! Any help is much appreciated.
%matplotlib inline
import pandas as pd
import numpy as np
IncubatorBelow = "./Analysis.Test.csv"
df = pd.read_csv(IncubatorBelow, sep = ',', names=['Name','TrackID','ParentID','xPos','yPos','zPos','eDep','DeltaE','Einit','EventID'],low_memory=False,error_bad_lines=False)
df["xPos"] = df["xPos"].str.replace("(","")
df["zPos"] = df["zPos"].str.replace(")","")
df = df.sort_values(by='Name', ascending=False)  # sort_values returns a new frame, so keep the result
df.dropna(how='any',axis=0,subset=['Name','TrackID','ParentID','xPos','yPos','zPos','eDep','DeltaE','Einit','EventID'], inplace=True)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df['xPos'] = df['xPos'].astype(float)
df['yPos'] = df['yPos'].astype(float)
df['zPos'] = df['zPos'].astype(float)
threedee = plt.figure().add_subplot(projection='3d')  # gca(projection=...) is no longer accepted by recent matplotlib
threedee.scatter(df["xPos"], df["yPos"], df["zPos"], c=df["eDep"], cmap=plt.cm.coolwarm)
Here's what the plot looks like!
It's from a particle physics simulation using GEANT4. The actual files are extremely large (3.7 GB, which I've split into chunks of roughly 40 MB), and this plot represents only a small fraction of the data.
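One common way to get a colorbar next to a 3D scatter plot is to keep the object returned by scatter and pass it to the figure's colorbar method. This is a sketch with synthetic data standing in for the GEANT4 file; the variable names and the pad value are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for xPos, yPos, zPos and eDep
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 200))
e_dep = rng.exponential(size=200)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# scatter returns the mappable that the colorbar needs
sc = ax.scatter(x, y, z, c=e_dep, cmap=plt.cm.coolwarm)

# pad moves the colorbar away from the 3D axes so the labels do not overlap
cbar = fig.colorbar(sc, ax=ax, pad=0.1)
cbar.set_label("eDep")
```

The same pattern should apply to the DataFrame columns in the question once they have been converted to float.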
I have a series of data that I'm reading in from a tutorial site.
I've managed to plot the distribution of the TV column in that data. However, I also want to overlay a normal distribution curve with standard-deviation ticks on a second x-axis, so that I can compare the two curves. I'm struggling to work out how to do it.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# draw distribution curve
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
plt.plot(h, pdf)
Here is a diagram close to what I'm after, where x is in standard deviations. All this example needs is a second x-axis to show the values of data.TV.
I'm not sure what you really want, but you could probably use a second axis like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('Advertising.csv', index_col=0)
fig, ax1 = plt.subplots()
# draw distribution curve
h = sorted(data.TV)
ax1.set_ylabel('Count', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')  # match the tick labels to the axis label colour
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
ax2 = ax1.twinx()
ax2.plot(h, pdf, 'r.')
ax2.set_ylabel('pdf', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
OK, assuming that you want to plot the distribution of your data together with the fitted normal distribution, using two x-axes, one way to achieve this is as follows.
Plot the normalized data together with the standard normal distribution. Then use matplotlib's twiny() to add a second x-axis to the plot. Use the same tick positions as the original x-axis on the second axis, but scale the labels so that you get the corresponding original TV values. The result looks like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
h_n = (h - hmean) / hstd
pdf = stats.norm.pdf( h_n )
# plot data
f,ax1 = plt.subplots()
ax1.hist( h_n, 20, density=True )  # 'normed' was replaced by 'density' in newer matplotlib
ax1.plot( h_n , pdf, lw=3, c='r')
ax1.set_xlim( [h_n.min(), h_n.max()] )
ax1.set_xlabel( r'TV $[\sigma]$' )
ax1.set_ylabel( r'Relative Frequency')
ax2 = ax1.twiny()
ax2.grid( False )
ax2.set_xlim( ax1.get_xlim() )
ax2.set_ylim( ax1.get_ylim() )
ax2.set_xlabel( r'TV' )
ticklocs = ax2.xaxis.get_ticklocs()
ticklocs = [ round( t*hstd + hmean, 2) for t in ticklocs ]
ax2.xaxis.set_ticklabels( [ str(t) for t in ticklocs ] )  # a list, since map() is lazy in Python 3
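As an alternative to relabelling the ticks by hand, newer matplotlib versions (3.1 and later, as far as I know) offer secondary_xaxis, which takes a forward and an inverse conversion function. The sketch below uses synthetic data in place of data.TV, and to_tv and to_sigma are hypothetical helper names.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for data.TV
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
mu, sigma = tv.mean(), tv.std()
z = np.sort((tv - mu) / sigma)  # data in units of standard deviations

fig, ax = plt.subplots()
ax.hist(z, bins=20, density=True)
ax.plot(z, stats.norm.pdf(z), lw=3, c="r")
ax.set_xlabel(r"TV $[\sigma]$")
ax.set_ylabel("Relative Frequency")

def to_tv(s):
    """Convert standard deviations back to original TV units."""
    return s * sigma + mu

def to_sigma(v):
    """Convert original TV units to standard deviations."""
    return (v - mu) / sigma

# The secondary axis stays in sync with the primary one on zoom and pan
top = ax.secondary_xaxis("top", functions=(to_tv, to_sigma))
top.set_xlabel("TV")
```

The difference from the twiny() approach is that the top axis picks its own round tick values in TV units instead of reusing the tick positions of the bottom axis.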
I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, fill=True)  # 'shade' is deprecated in newer seaborn in favour of 'fill'
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.
You need to:
1. Extract the data of the KDE line.
2. Integrate it to calculate the cumulative distribution function (CDF).
3. Find the x value at which the CDF equals 1/2; that is the median.
import numpy as np
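from scipy.integrate import cumulative_trapezoid
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
# Draw the KDE as a plain line so its data can be read back from the axes;
# recent seaborn versions draw a filled KDE without a Line2D object.
p = sns.kdeplot(data)
x, y = p.get_lines()[0].get_data()
p.fill_between(x, y, alpha=0.25)  # shade the area under the curve manually
# Mind the argument order: y comes first.
# initial=0 prepends a 0 so that the result has the same length as x.
cdf = cumulative_trapezoid(y, x, initial=0)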
nearest_05 = np.abs(cdf-0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median)
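If reading the curve back from the plot feels fragile, another option is to build the density directly with scipy.stats.gaussian_kde and mark the sample median on it. This is a sketch, not what seaborn does internally: the curve may differ slightly from seaborn's depending on bandwidth settings, and the sample median is not exactly the median of the smoothed density.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=30)

# Evaluate the kernel density estimate on a grid slightly wider than the data
kde = gaussian_kde(data)
x = np.linspace(data.min() - 1, data.max() + 1, 200)
y = kde(x)

# Median of the sample, and the height of the density curve at that point
x_median = np.median(data)
y_median = kde(x_median)[0]

fig, ax = plt.subplots()
ax.plot(x, y)
ax.fill_between(x, y, alpha=0.25)
ax.vlines(x_median, 0, y_median)
```

Because the line height comes from the same kde object that drew the curve, the bar ends exactly on the curve.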
I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...
What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
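category = np.random.randint(0, 4, 100)  # integers 0..3; random_integers is gone from recent NumPy
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
    i = (category == k)  # boolean mask selecting the points of category k
    plt.scatter(x[i], y[i], marker=m)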
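A small variation on the loop above, sketched with an explicit label-to-marker mapping and a legend; the marker choices and the "class k" legend text are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)
category = rng.integers(0, 4, size=100)  # labels 0..3

# Explicit mapping from label to marker style
marker_for = {0: "x", 1: "o", 2: "s", 3: "^"}

fig, ax = plt.subplots()
for label, marker in marker_for.items():
    mask = category == label  # points belonging to this label
    ax.scatter(x[mask], y[mask], marker=marker, label=f"class {label}")
ax.legend()
```

Each scatter call gets the next colour from the default cycle, so the classes differ in both marker and colour.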
Matplotlib does not accept different markers within a single scatter call.
However, a less verbose solution that also scales better to large datasets is to use the pandas and seaborn libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output
Additionally, you can use the pandas.cut function to plot bins (it's something I regularly need when producing graphs where a third, continuous value is used as a parameter). It is used like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following: