Histogram Plotting Python

I need histograms of the number of attributes and classes, a description of the attributes and classes, and the number of instances and classes. I'm new to programming; this is what I've tried so far:
import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
data=pd.read_csv('mushroom')
column = data['Class']
num_bins = 5
n, bins, patches = plt.hist(column, num_bins, facecolor='blue', alpha=0.5)
plt.show()
This is what my data looks like:
cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,Class
,f,g,f,n,f,c,n,p,e,s,s,w,w,p,w,o,p,k,v,u,p
,f,g,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,k,y,u,p
x,f,g,f,n,f,w,b,k,t,s,f,w,w,p,w,o,e,n,s,g,e
,f,g,f,n,f,c,n,g,e,s,s,w,w,p,w,o,p,n,y,u,e
x,f,w,f,n,f,w,b,p,t,f,s,w,w,p,w,o,e,n,a,g,e
s,f,n,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,k,v,u,e
f,f,n,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,n,v,u,e
x,f,g,f,n,f,c,n,p,e,s,s,w,w,p,w,o,p,n,y,u,e
f,s,g,f,n,f,w,b,n,t,s,f,w,w,p,w,o,e,n,s,g,e
x,f,w,f,n,f,w,b,n,t,f,f,w,w,p,w,o,e,n,a,g,e
x,s,n,f,n,f,w,b,p,t,f,f,w,w,p,w,o,e,k,s,g,e
x,s,w,f,n,f,w,b,h,t,f,s,w,w,p,w,o,e,n,s,g,p
f,f,w,f,n,f,w,b,p,t,f,s,w,w,p,w,o,e,k,s,g,p
x,f,g,f,n,f,w,b,p,t,f,f,w,w,p,w,o,e,n,s,g,e

Class is a categorical variable (or a factor; see http://www.statisticshowto.com/what-is-a-categorical-variable/). Binning and histograms are meaningful only when you have a continuous variable (http://www.statisticshowto.com/continuous-variable/).
I assume what you actually need is a frequency plot: a bar chart that shows the frequency of each outcome in your categorical data.
If my assumption is correct, the following code will solve your problem.
import pandas
import matplotlib.pyplot as plt
data = pandas.read_csv('mushroom.csv')
fig, ax = plt.subplots()
data['Class'].value_counts().plot(kind='bar', ax=ax)
plt.show()
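The counts asked for in the question (instances, attributes, classes) can also be read straight from the DataFrame before plotting anything. The following is only a sketch: the small `df` is a hand-made stand-in with the same layout as the file, and it assumes the label column is called `Class` as in the header shown above.

```python
import pandas as pd

# Hand-made sample in the same layout as the question's file (assumption:
# the real data would come from pandas.read_csv instead).
df = pd.DataFrame({
    "cap-shape": ["x", "f", "x", "s"],
    "odor": ["n", "n", "n", "n"],
    "Class": ["p", "e", "e", "e"],
})

n_instances, n_columns = df.shape
n_attributes = n_columns - 1                  # every column except the label
n_classes = df["Class"].nunique()             # number of distinct class labels
class_counts = df["Class"].value_counts()     # instances per class
distinct_per_attribute = df.drop(columns="Class").nunique()

print(n_instances, n_attributes, n_classes)
print(class_counts)
print(distinct_per_attribute)
```

Any of these Series can be passed to `.plot(kind='bar')` in the same way as `value_counts()` above.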

Related

Adding KDE and Normal distribution to a Histogram

How can I add a KDE and a normal distribution to a DataFrame histogram?
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
data = pd.DataFrame(norm.rvs(0,1,size=1000))
data.plot.hist()
I am familiar with the data.plot.kde() function, but I want the scales to be the same, and I also want to add a normal distribution plot.
I am also aware of seaborn and its distplot function, but I need this in matplotlib.
To demonstrate what I meant in the comment:
fig, ax = plt.subplots()
data.plot.hist(ax=ax, alpha=0.5)
ax2 = ax.twinx()
data.plot.kde(ax=ax2)
Output: (figure not shown)
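If the goal is a single shared scale rather than a twin axis, one option is to normalise the histogram with `density=True` so that it is on the same scale as a probability density. The sketch below is not from the original answer: it assumes `scipy.stats.gaussian_kde` for the KDE and a normal curve fitted with `norm.fit`, and it uses a plain NumPy array instead of the DataFrame.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
values = rng.normal(0, 1, size=1000)

fig, ax = plt.subplots()

# density=True rescales the bars so that their total area is 1,
# which puts them on the same y scale as a probability density.
heights, edges, _ = ax.hist(values, bins=30, density=True, alpha=0.5,
                            label="histogram")

xs = np.linspace(values.min(), values.max(), 200)
ax.plot(xs, gaussian_kde(values)(xs), label="KDE")

# Normal curve with mean and standard deviation estimated from the data.
mu, sigma = norm.fit(values)
ax.plot(xs, norm.pdf(xs, mu, sigma), label="normal fit")

ax.legend()
```

Call `plt.show()` afterwards to display the figure.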

How to scale the x and y axis equally by log in Seaborn?

I want to create a regplot with a linear regression in Seaborn and scale both axes equally by log, such that the regression stays a straight line.
An example:
import matplotlib.pyplot as plt
import seaborn as sns
some_x=[0,1,2,3,4,5,6,7]
some_y=[3,5,4,7,7,9,9,10]
ax = sns.regplot(x=some_x, y=some_y, order=1)
plt.ylim(0, 12)
plt.xlim(0, 12)
plt.show()
What I get:
If I scale the x and y axis by log, I would expect the regression to stay a straight line. What I tried:
import matplotlib.pyplot as plt
import seaborn as sns
some_x=[0,1,2,3,4,5,6,7]
some_y=[3,5,4,7,7,9,9,10]
ax = sns.regplot(x=some_x, y=some_y, order=1)
ax.set_yscale('log')
ax.set_xscale('log')
plt.ylim(0, 12)
plt.xlim(0, 12)
plt.show()
How it looks:
The problem is that you are fitting to your data on a regular scale but later you are transforming the axes to log scale. So linear fit will no longer be linear on a log scale.
What you need instead is to transform your data to log scale (base 10) and then perform a linear regression. Your data is currently a list. Converting the list to a NumPy array makes the transformation easy, because you can then use vectorised operations.
Caution: one of your x entries is 0, for which the log is not defined. Taking its log yields -inf with a warning, so the code below drops that point before fitting.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
some_x=np.array([0,1,2,3,4,5,6,7])
some_y=np.array([3,5,4,7,7,9,9,10])
keep = some_x > 0  # skip x=0, whose log is undefined
ax = sns.regplot(x=np.log10(some_x[keep]), y=np.log10(some_y[keep]), order=1)
plt.show()
Solution using NumPy polyfit, where the x=0 data point is excluded from the fit:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([0,1,2,3,4,5,6,7])
y = np.array([3,5,4,7,7,9,9,10])
# Skip the first point (x=0) before taking logs, since log10(0) is undefined
some_x = np.log10(x[1:])
some_y = np.log10(y[1:])
fit = np.poly1d(np.polyfit(some_x, some_y, 1))
plt.plot(some_x, some_y, 'ko')
plt.plot(some_x, fit(some_x), '-k')
plt.show()
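A variation on the same idea, in case the axes should still show the original values instead of their logarithms: fit in log space, then draw the fitted power law on log-scaled axes, where it appears as a straight line. This is a sketch under the same assumption as above (the x=0 point is left out); `line_x` and `line_y` are names introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same data as in the question, without the x=0 point (its log is undefined).
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([5, 4, 7, 7, 9, 9, 10], dtype=float)

# Linear fit in log space: log10(y) = slope * log10(x) + intercept
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)

# Back-transform to the original scale: y = 10**intercept * x**slope
line_x = np.logspace(np.log10(x.min()), np.log10(x.max()), 50)
line_y = 10 ** intercept * line_x ** slope

fig, ax = plt.subplots()
ax.loglog(x, y, "ko")            # data on log-log axes
ax.loglog(line_x, line_y, "-k")  # the fit is straight on these axes
```

Call `plt.show()` afterwards to display the figure.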

Python - Pandas histogram width

I am doing a histogram plot of a bunch of data that goes from 0 to 1. When I plot I get this
As you can see, the histogram 'blocks' do not align with the axis ticks.
Is there a way to give the histogram bins a constant width of 0.1? Or should I try a different package?
My code is quite simple:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
np.set_printoptions(precision=10,
                    threshold=10000,
                    linewidth=150, suppress=True)
E=pd.read_csv("FQCoherentSeparableBons5.csv")
# .ix was removed from pandas; .iloc selects all rows and every column after the first
E = E.iloc[:, 1:]
E=np.array(E,float)
P0=E[:,0]
P0=pd.DataFrame(P0,columns=['P0'])
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist',color="red")
plt.suptitle('Distribucio p0')
plt.ylabel('Frequencia p0')
plt.show()
P.S. If you are wondering about the data, it is just a random distribution from 0 to 1.
You can pass additional arguments to the pandas histogram using the hist_kwds argument of the scatter_matrix function. If you want ten bins of width 0.1, then your scatter_matrix call should look like
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist', color="red",
               hist_kwds={'bins':[i*0.1 for i in range(11)]})
Additional arguments for the pandas histogram can be found in the documentation.
Here is a simple example. I've added a grid to the plot so that you can see the bins align correctly.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
x = np.random.uniform(0,1,100)
scatter_matrix(pd.DataFrame(x), diagonal='hist',
               hist_kwds={'bins':[i*0.1 for i in range(11)]})
plt.xlabel('x')
plt.ylabel('frequency')
plt.grid()
plt.show()
By default, the number of bins in the histogram is 10, but just because your data is distributed between 0 and 1 doesn't mean the bins will be evenly spaced over the range. For example, if you do not actually have a data point equal to 1, you will get a result similar to the one in your question.
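The difference can be checked without plotting by looking at the bin edges directly. The sketch below uses `np.histogram`, which matplotlib's `hist` relies on for binning; the variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)

# Default: 10 equal-width bins spanning from x.min() to x.max(),
# so the edges generally do not fall on 0.0, 0.1, ..., 1.0.
_, default_edges = np.histogram(x)

# Explicit edges: exactly 0.0, 0.1, ..., 1.0 regardless of the data.
_, fixed_edges = np.histogram(x, bins=np.linspace(0, 1, 11))

print(default_edges)
print(fixed_edges)
```

Passing the same explicit edges through `hist_kwds`, as in the call above, is what makes the bars line up with the tick marks.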

custom scaling of a wind rose plot

I am trying to compare wind roses in Python, but it is difficult because I cannot figure out how to use the same scale across all of the plots. Someone else asked the same question here (Custom percentage scale used by windrose.py), but it was not answered.
Example code:
from windrose import WindroseAxes
import numpy as np
import matplotlib.pyplot as plt
wind_dir = np.array([30,45,90,43,180])
wind_sd = np.arange(1,wind_dir.shape[0]+1)
bins_range = np.arange(1,6,1) # this sets the legend scale
fig,ax = plt.subplots()
ax = WindroseAxes.from_ax()
# bins_range below sets the scale of the bars, but I need to change the y-axis frequency scale so it can be compared to other wind roses with different data.
ax.bar(wind_dir,wind_sd,normed=True,bins=bins_range)
# this set_ylim does seem to work, but the y-axis ticks do not change
ax.set_ylim(0,50)
# this set_ticks line below does not do anything and I do not know why
ax.yaxis.set_ticks(np.arange(0,50,10))
ax.set_legend()
plt.show()
Set both the tick positions and the tick labels explicitly:
from windrose import WindroseAxes
import numpy as np
import matplotlib.pyplot as plt
wind_dir = np.array([30,45,90,43,180])
wind_sd = np.arange(1,wind_dir.shape[0]+1)
bins_range = np.arange(1,6,1) # this sets the legend scale
ax = WindroseAxes.from_ax()
ax.bar(wind_dir,wind_sd,normed=True,bins=bins_range)
ax.set_yticks(np.arange(10, 60, step=10))
ax.set_yticklabels(np.arange(10, 60, step=10))
plt.show()
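To make several roses comparable, the same limits and ticks can be applied to every axes. The sketch below illustrates the idea with plain matplotlib polar axes rather than windrose itself, so that it has no extra dependency; the directions and frequencies are made-up numbers, and whether each call behaves identically on a `WindroseAxes` is an assumption to verify.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up directions (degrees) and frequencies (percent) for two sites.
theta_a = np.deg2rad([30, 45, 90, 180])
freq_a = np.array([12, 30, 8, 20])
theta_b = np.deg2rad([0, 90, 135, 270])
freq_b = np.array([5, 45, 15, 10])

ticks = np.arange(10, 60, 10)  # shared radial ticks: 10, 20, ..., 50

fig, axes = plt.subplots(1, 2, subplot_kw={"projection": "polar"})
for ax, theta, freq in zip(axes, (theta_a, theta_b), (freq_a, freq_b)):
    ax.bar(theta, freq, width=np.deg2rad(20), alpha=0.6)
    # Identical radial limits, tick positions and labels on every plot.
    ax.set_ylim(0, 50)
    ax.set_yticks(ticks)
    ax.set_yticklabels([f"{t}%" for t in ticks])
```

Call `plt.show()` afterwards to display the figure.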

How to color individual points on scatter plots based on their type using matplotlib

I'm working on the Iris data and trying to make a scatter plot. I was able to get the output, but I'd like to know how to color the points based on their species using matplotlib.
I'm using the following syntax:
iris.plot.scatter(x='petal_length', y='petal_width')
iris.plot(kind='scatter', x='sepal_length', y='sepal_width')
Also, is there any way to use a single line of code to create two scatter plots, one for sepal_length/width and one for petal_length/width, while coloring by species?
Getting the colors correct in a single call to the plotting function is a bit tedious.
import seaborn as sns
iris = sns.load_dataset("iris")
import numpy as np
import matplotlib.pyplot as plt
u, inv = np.unique(iris.species.values, return_inverse=True)
ax = iris.plot.scatter(x='petal_length', y='petal_width',
                       c=inv, cmap="brg", colorbar=False)
plt.show()
I would hence recommend looping over the species, which has the additional advantage that a legend is easy to add to the plot.
import seaborn as sns
iris = sns.load_dataset("iris")
import matplotlib.pyplot as plt
for n, grp in iris.groupby("species"):
    plt.scatter(grp.petal_length, grp.petal_width, label=n)
plt.legend()
plt.show()
An easy solution is also to use seaborn.
import seaborn as sns
iris = sns.load_dataset("iris")
import matplotlib.pyplot as plt
g = sns.FacetGrid(iris, hue="species")
g.map(plt.scatter, 'petal_length','petal_width').add_legend()
plt.show()
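For the second part of the question (both the sepal and the petal plot at once), the groupby loop extends naturally to two subplots. This is a sketch only: the small `df` is hard-coded in the shape of the iris table so the example runs without downloading anything, and `pairs` is a name introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few hard-coded rows in the shape of the iris table (illustration only;
# in practice use the full dataset, e.g. sns.load_dataset("iris")).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

pairs = [("sepal_length", "sepal_width"), ("petal_length", "petal_width")]

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (xcol, ycol) in zip(axes, pairs):
    # One scatter call per species, so each gets its own color and label.
    for name, grp in df.groupby("species"):
        ax.scatter(grp[xcol], grp[ycol], label=name)
    ax.set_xlabel(xcol)
    ax.set_ylabel(ycol)
axes[0].legend()
```

Because the species are iterated in the same order for both subplots, each species gets the same color in both panels from matplotlib's default color cycle.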
