Python chart of natural distribution - python

I want to plot my data as a distribution curve (something like a normal distribution), but I am not sure how to do that.
I tried using plt.hist, but it did not work: I only got one column!
Here is my code:
import pymssql
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
conn = pymssql.connect(server='MyServer', database='MyDB')
df = pd.read_sql('EXEC [Stat_EDFlow] [2018-03-01], [2019-02-28]', conn)
conn.close()
plt.hist(df['MyColumn'])
plt.show()

The reason for this is the way bins are calculated.
You have some outliers in your data, which is causing the plot to "zoom out" in an effort to show all of them.
One way you can resolve this issue is to remove the outliers (say, everything past the 95th percentile) and specify the number of bins:
df.loc[df['MyColumn'] < df['MyColumn'].quantile(0.95), 'MyColumn'].plot.hist(bins=25)
If this doesn't work, decrease the threshold from 0.95.
Another way is to specify the bins directly:
df['MyColumn'].plot.hist(bins=np.linspace(0, 100, 25))
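To see why a few outliers collapse the histogram into one column, here is a minimal sketch on synthetic data. The numbers are made up to stand in for df['MyColumn'], and the Agg backend line is only an assumption so the sketch runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt

# Synthetic stand-in for df['MyColumn']: values near 50 plus three huge outliers.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 10, 1000), [5000.0, 8000.0, 12000.0]])
df = pd.DataFrame({"MyColumn": values})

# Default binning: 10 equal-width bins from min to max, so the outliers
# stretch the range and almost everything lands in the first bin.
counts_default, _ = np.histogram(df["MyColumn"], bins=10)

# Keep only values below the 95th percentile, then bin again.
cutoff = df["MyColumn"].quantile(0.95)
trimmed = df.loc[df["MyColumn"] < cutoff, "MyColumn"]
counts_trimmed, _ = np.histogram(trimmed, bins=25)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(df["MyColumn"], bins=10)
ax1.set_title("all data")
ax2.hist(trimmed, bins=25)
ax2.set_title("below 95th percentile")
fig.tight_layout()
```

The left panel shows the single tall column from the question; the right panel shows the shape of the bulk of the data once the tail is cut off.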

I think you're looking for the bins= keyword.
You can provide either an integer giving the number of bins you want, or an array of bin edges such as np.arange(min, max, dist).
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
EDIT:
To have a line plot, you can use something like:
import matplotlib.pyplot as plt
import numpy as np
synthetic=np.random.normal(size=100)
fig=plt.figure(figsize=(5,5))
y,binEdges=np.histogram(synthetic,bins=20) #we want 20 bins
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
plt.plot(bincenters,y,c='k')
plt.show()
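If the goal is a histogram with a normal curve drawn over it, something like the following may help. It is only a sketch: the data is synthetic, the curve simply uses the sample mean and standard deviation, and the Agg backend line is an assumption so it runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=100.0, scale=15.0, size=2000)  # synthetic stand-in for df['MyColumn']

# density=True rescales the bars so their total area is 1, which puts the
# histogram on the same scale as a probability density.
density, edges = np.histogram(data, bins=30, density=True)

# Normal curve using the sample mean and standard deviation.
mu, sigma = data.mean(), data.std()
x = np.linspace(edges[0], edges[-1], 400)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

fig, ax = plt.subplots(figsize=(5, 4))
ax.hist(data, bins=30, density=True, alpha=0.5, label="data")
ax.plot(x, pdf, "k", label="normal curve")
ax.legend()
```

Without density=True the bars are raw counts and the curve would sit flat along the bottom of the plot.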

Related

Plot heatmap from pandas Dataframe

I have the following pandas Dataframe. alfa_value and beta_value are random, ndcg shall be the parameter deciding the color.
The question is: how do I do a heatmap of the pandas Dataframe?
You can use the code below to generate a heatmap. You have to adjust the bins to group your data (analyze the mean, the std, ...)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(2022)
df = pd.DataFrame({'alfa_value': rng.integers(1000, 10000, 1000),
                   'beta_value': rng.random(1000),
                   'ndcg': rng.random(1000)})
out = df.pivot_table('ndcg', pd.cut(df['alfa_value'], bins=10),
                     pd.cut(df['beta_value'], bins=10), aggfunc='mean')
sns.heatmap(out)
plt.tight_layout()
plt.show()
In general, Seaborn's heatmap function is a nice way to color pandas' DataFrames based on their values. Good examples and descriptions can be found here.
Since you seem to want to color the row based on a different column, you are probably looking for something more like these answers.

Plot Correlation Table imported from excel with Python

I am trying to plot a correlation matrix (already calculated) in Python. The table looks like this:
And I would like it to look like this:
I am using the following code:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
df = pd.DataFrame(data)
print (df)
corrMatrix = data.corr()
print (corrMatrix)
sn.heatmap(corrMatrix, annot=True)
plt.show()
Note that the matrix is already computed and I don't want to calculate the correlation again, but I have not managed to plot it directly. Any suggestions?
You are recalculating the correlation with the following line:
corrMatrix = data.corr()
You then go on to utilize this recalculated variable in the heatmap here:
sn.heatmap(corrMatrix, annot=True)
plt.show()
To resolve this, pass the raw Excel data (data, or df, since df is just a copy of data) instead of the recalculated corrMatrix. All the code you need is then:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
sn.heatmap(data, annot=True)
plt.show()
Note, however, that this assumes your data is already in the right shape for the heatmap, as you suggest. Since we do not have access to your data, we cannot confirm that.
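One common shape problem is a first column of row names, which the heatmap cannot treat as numbers. A minimal sketch of handling that is shown here. The small table is a hypothetical stand-in for the Excel sheet and its values are invented. The Agg backend line is an assumption so the sketch runs without a display.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
import seaborn as sn

# Hypothetical stand-in for the sheet: a 'name' column plus one numeric
# column per model. The correlation values are invented for illustration.
raw = pd.DataFrame({
    "name": ["CLC", "GIEMS", "GLWD"],
    "CLC": [1.00, 0.62, 0.48],
    "GIEMS": [0.62, 1.00, 0.55],
    "GLWD": [0.48, 0.55, 1.00],
})

# Move the names into the index so only numeric cells remain;
# the index then provides the y tick labels automatically.
corr = raw.set_index("name")

ax = sn.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

When reading the real file, pd.read_excel(path, index_col=0) should have the same effect at load time, assuming the names really are in the first column.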
I deleted the first column (the names) and added the names back as tick labels, so the code is as below:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Users/yousefalbuhaisi/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
fig, ax = plt.subplots(dpi=150)
y_axis_labels = ['CLC','GIEMS','GLWD','LPX_BERN','LPJ_WSL','LPJ_WHyME','SDGVM','DLEM','ORCHIDEE','CLM4ME']
sn.heatmap(data,yticklabels=y_axis_labels, annot=True)
plt.show()
and the results are:

Unable to draw KDE on python

I've created a Brownian motion and then I have taken the last values of 1000 entries repeated 10000 times. I was able to plot the histogram using the following code as follows:
import seaborn as sns
import matplotlib.pyplot as plt
# BM represents the list of values generated by the Brownian motion
fig, (ax1,ax2) = plt.subplots(2)
ax1.hist(BM[:,-1],12)
I've been able to draw the KDE as follows, but I am unable to merge the two plots into one. Can someone please help me?
sns.kdeplot(data=BM[:,-1])
Try with sns.kdeplot(BM['col1']) where 'col1' is the name of the column you want to plot.
I'll give you a reproducible example that works for me.
import seaborn as sns
import pandas as pd
import numpy as np
BM = pd.DataFrame(np.array([-0.00871515, -0.0001227 , -0.01449098, 0.01808527, 0.00074193, 0.01145541]),
                  columns=['col1'])
print(BM.head(2))
#        col1
# 0 -0.008715
# 1 -0.000123
sns.kdeplot(BM['col1'])
Edit based on your additional question:
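To get the histogram and the KDE in one plot, use:
sns.distplot(BM['col1'])
Note that distplot has been deprecated since seaborn 0.11; sns.histplot(BM['col1'], kde=True, stat='density') gives a similar plot.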

Seaborn violin plot over time given numpy ndarray

I have a distribution that changes over time for which I would like to plot a violin plot for each time step side-by-side using seaborn. My initial attempt failed as violinplot cannot handle a np.ndarray for the y argument:
import numpy as np
import seaborn as sns
time = np.arange(0, 10)
samples = np.random.randn(10, 200)
ax = sns.violinplot(x=time, y=samples) # Exception: Data must be 1-dimensional
The seaborn documentation has an example for a vertical violinplot grouped by a categorical variable. However, it uses a DataFrame in long format.
Do I need to convert my time series into a DataFrame as well? If so, how do I achieve this?
A closer look at the documentation made me realize that omitting the x and y argument altogether leads to the data argument being interpreted in wide-form:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
samples = np.random.randn(20, 10)
ax = sns.violinplot(data=samples)
plt.show()
The violin plot documentation says that the x and y inputs do not have to come from a DataFrame, but they must have the same length. In addition, the y you created is a 2-D array with 10 rows and 200 columns, which causes the dimension error when plotting.
I tested the code below and it runs without errors. It passes one row of samples at a time, with a time vector of the same length.
import numpy as np
import seaborn as sns
import pandas as pd
time = np.arange(0, 200)
samples = np.random.randn(10, 200)
for sample in samples:
    ax = sns.violinplot(x=time, y=sample)
You can then group the resulting graphs using this link:
https://python-graph-gallery.com/199-matplotlib-style-sheets/
If you want to convert your data into a DataFrame, that is also possible; you just need pandas. For example:
import pandas as pd
x = [1,2,3,4]
df = pd.DataFrame(x)
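For the original (10, 200) array, a long-format DataFrame with one violin per time step could be built roughly like this. It is a sketch; the column names time and value are my own choice, and the Agg backend line is an assumption so it runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
import seaborn as sns

time = np.arange(0, 10)
samples = np.random.default_rng(0).normal(size=(10, 200))  # one row per time step

# Transpose so each time step becomes a column, then melt to long format:
# one row per (time, value) pair, 10 * 200 = 2000 rows in total.
wide = pd.DataFrame(samples.T, columns=time)
long_df = wide.melt(var_name="time", value_name="value")

# One violin per distinct value in the 'time' column.
ax = sns.violinplot(data=long_df, x="time", y="value")
```

Passing the transposed array directly as sns.violinplot(data=samples.T) should give the same picture in wide form; the long format is mainly useful when you want to add more grouping columns later.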

Adding shaded areas onto a normal distribution for standard deviation and mean with matplotlib [duplicate]

I would like to use fill_between to shade a subsection of a normal distribution, say the left 5th percentile.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as stats
plt.style.use('ggplot')
mean=1000
std=250
x=np.linspace(mean-3*std, mean+3*std,1000)
iq=stats.norm(mean,std)
plt.plot(x,iq.pdf(x),'b')
Great so far.
Then I set px to fill the area between x=0 to 500
px=np.arange(0,500,10)
plt_fill_between(px,iq.pdf(px),color='r')
The problem is that the above will only show the pdf from 0 to 500 in red.
I want to show the full pdf from 0 to 2000 where the 0 to 500 is shaded?
Any idea how to create this?
As commented, you need to use plt.fill_between instead of plt_fill_between. When doing so the output looks like this which seems to be exactly what you're looking for.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as stats
plt.style.use('ggplot')
mean=1000
std=250
x=np.linspace(mean-3*std, mean+3*std,1000)
iq=stats.norm(mean,std)
plt.plot(x,iq.pdf(x),'b')
px=np.arange(0,500,10)
plt.fill_between(px,iq.pdf(px),color='r')
plt.show()
You only use the x values from 0 to 500 in your np.arange. If you want to go up to 2000, write:
px=np.arange(0,2000,10)
plt.fill_between(px,iq.pdf(px),color='r')
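If the aim is to shade exactly the left 5% of the distribution rather than a hand-picked range, the cutoff can be taken from the distribution itself. The following is a sketch using the same mean and std as above; the where= mask and the alpha value are my own choices, and the Agg backend line is an assumption so it runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
from scipy import stats

mean, std = 1000, 250
iq = stats.norm(mean, std)
x = np.linspace(mean - 3 * std, mean + 3 * std, 1000)
pdf = iq.pdf(x)

# ppf is the inverse CDF: the x value with 5% of the probability to its left.
cutoff = iq.ppf(0.05)

fig, ax = plt.subplots()
ax.plot(x, pdf, "b")  # full curve
# where= restricts the shading to the part of the curve left of the cutoff.
ax.fill_between(x, pdf, where=(x <= cutoff), color="r", alpha=0.5)
```

The full curve stays visible because it is plotted separately; only the region selected by the mask is filled.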
