Python - Pandas histogram width - python

I am plotting a histogram of a set of data that ranges from 0 to 1. When I plot it, I get this (plot not shown):
As you can see, the histogram 'blocks' do not align with the axis ticks.
Is there a way to give the histogram bins a constant width of 0.1? Or should I try a different package?
My code is quite simple:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
np.set_printoptions(precision=10,
threshold=10000,
linewidth=150,suppress=True)
E = pd.read_csv("FQCoherentSeparableBons5.csv")
E = E.iloc[:, 1:]  # drop the first column (.ix was removed in pandas 1.0)
E = np.array(E, float)
P0=E[:,0]
P0=pd.DataFrame(P0,columns=['P0'])
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist',color="red")
plt.suptitle('Distribucio p0')
plt.ylabel('Frequencia p0')
plt.show()
PS: If you are wondering about the data, it is just a random distribution from 0 to 1.

You can pass additional arguments to the pandas histogram using the hist_kwds argument of the scatter_matrix function. If you want ten bins of width 0.1, then your scatter_matrix call should look like
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist', color="red",
hist_kwds={'bins':[i*0.1 for i in range(11)]})
Additional arguments for the pandas histogram can be found in the documentation.
Here is a simple example. I've added a grid to the plot so that you can see the bins align correctly.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
x = np.random.uniform(0,1,100)
scatter_matrix(pd.DataFrame(x), diagonal='hist',
hist_kwds={'bins':[i*0.1 for i in range(11)]})
plt.xlabel('x')
plt.ylabel('frequency')
plt.grid()
plt.show()
By default the histogram uses 10 bins, but those bins span the range from the minimum to the maximum of your data, not from 0 to 1. So even though your data is distributed between 0 and 1, the bin edges only land on multiples of 0.1 if the data contains both an exact 0 and an exact 1. For example, if you do not actually have a data point equal to 1, you will get a result similar to the one in your question.
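To make the point about default bin edges concrete, here is a minimal sketch using NumPy only. It assumes the plotting stack ultimately delegates binning to NumPy's histogram routines, and the sample data and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample: uniform values in [0, 1) that never reach exactly 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)

# Default behaviour: 10 equal-width bins between x.min() and x.max().
default_edges = np.histogram_bin_edges(x, bins=10)

# Explicit edges 0.0, 0.1, ..., 1.0, independent of the data.
fixed_edges = np.linspace(0, 1, 11)

print(default_edges[0], default_edges[-1])  # the data minimum and maximum
print(fixed_edges)
```

Passing `fixed_edges` as `bins` (for instance through `hist_kwds`) should give bars that line up with the 0.1 grid regardless of where the data happens to start and end.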

Related

Frequency in seaborn histograms

I have a dataset of used cars. I have made a histogram plot for the count of cars by their age (in months).
sns.distplot(df['Age'],kde=False,bins=6)
The plot looks something like this (image not shown):
Is there any way I can show the frequency value of each bin in the plot itself?
PS: I know I can fetch the values using the numpy histogram function:
np.histogram(df['Age'],bins=6)
Basically, I want the plot to look somewhat like this (image not shown):
You can iterate over the patches belonging to the ax, get their position and height and use these to create annotations.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_style()
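# np.int was removed in NumPy 1.24; the built-in int does the same job
df = pd.DataFrame({'Age': np.random.triangular(1, 80, 80, 1000).astype(int)})
# distplot is deprecated since seaborn 0.11; sns.histplot is its replacement
ax = sns.distplot(df['Age'], kde=False, bins=6)
for p in ax.patches:
    # one annotation per bar, centred horizontally at the top of the bar
    ax.annotate(f'{p.get_height():.0f}\n',
                (p.get_x() + p.get_width() / 2, p.get_height()), ha='center', va='center', color='crimson')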
plt.show()
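The same patch-annotation idea also works with plain matplotlib, which may be handy if the seaborn call is unavailable. The following is only a sketch under that assumption: the data is randomly generated, and the non-interactive `Agg` backend is chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ages = rng.triangular(1, 80, 80, 1000).astype(int)  # made-up ages in months

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(ages, bins=6)

# Put each bin's count just above the centre of its bar.
for count, patch in zip(counts, patches):
    ax.annotate(f"{count:.0f}",
                (patch.get_x() + patch.get_width() / 2, patch.get_height()),
                ha="center", va="bottom", color="crimson")
```

Here `counts` holds the same numbers that `np.histogram(ages, bins=6)` would return, so the labels and the bar heights come from a single call.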

Why matplotlib is not displaying the chart with values generated using numpy random array?

I have written the following code:
import numpy as np
import matplotlib.pyplot as plt
x=np.random.randint(0,10,[1,5])
y=np.random.randint(0,10,[1,5])
x.sort(),y.sort()
fig, ax=plt.subplots(figsize=(10,10))
ax.plot(x,y)
ax.set( title="random data plot", xlabel="x",ylabel="y")
I am getting a blank figure.
The same code draws the chart if I manually assign the values below to x and y instead of using the random function.
x=[1,2,3,4]
y=[11,22,33,44]
Am I missing something or doing something wrong?
x=np.random.randint(0,10,[1,5]) returns a 2-D array of shape (1, 5), because you specified the shape as [1,5]. You probably want either x=np.random.randint(0,10,[1,5])[0] or x=np.random.randint(0,10,size=5), both of which give a 1-D array of 5 values. See: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html
Matplotlib doesn't plot markers by default, only a line. As @Can's comment points out, matplotlib interprets your (1, 5) array as 5 different datasets with 1 point each, so no line is drawn because no dataset has a second point.
If you add a marker to your plot call, you can see that the data is actually being plotted, just probably not as you intended:
import matplotlib.pyplot as plt
import numpy as np
x=np.random.randint(0,10,[1,5])
y=np.random.randint(0,10,[1,5])
x.sort(),y.sort()
fig, ax=plt.subplots(figsize=(10,10))
ax.plot(x,y, marker='.') # <<< marker for each point added here
ax.set( title="random data plot", xlabel="x",ylabel="y")
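A quick way to check this interpretation is to count the line objects that `ax.plot` returns for each array shape. This is a sketch with made-up variable names; it uses `numpy.random.default_rng` rather than `np.random.randint`, which is a substitution on my part and not what the question used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x2d = np.sort(rng.integers(0, 10, size=(1, 5)))  # shape (1, 5)
y2d = np.sort(rng.integers(0, 10, size=(1, 5)))
x1d = np.sort(rng.integers(0, 10, size=5))       # shape (5,)
y1d = np.sort(rng.integers(0, 10, size=5))

fig, (ax_a, ax_b) = plt.subplots(1, 2)
lines_2d = ax_a.plot(x2d, y2d)  # each column is treated as its own dataset
lines_1d = ax_b.plot(x1d, y1d)  # a single dataset with five points

print(len(lines_2d), len(lines_1d))  # 5 1
```

Five one-point lines have nothing to connect, which is why the first panel looks empty unless a marker is set.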

How to plot only one half of a scatter matrix using pandas

I am using pandas scatter_matrix (I couldn't get PairGrid in seaborn to work) to plot all combinations of a set of columns in a pandas DataFrame. Each column has 1000 data points and there are nine columns.
I am using the following code:
pandas.plotting.scatter_matrix(df, alpha=0.2, figsize=(8,8))
I get the figure shown below (image not shown):
This is nice. However, you'll notice that the plots are mirrored across the main diagonal. Is it possible to plot only the lower portion, as in the following mock-up I made in Paint (image not shown)?
This is probably not the cleanest way to do it, but it works:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
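# iris is assumed to be a DataFrame that has already been loaded
axes = pd.plotting.scatter_matrix(iris, alpha=0.2, figsize=(8,8))
# hide every panel above the main diagonal (row index < column index)
for i in range(np.shape(axes)[0]):
    for j in range(np.shape(axes)[1]):
        if i < j:
            axes[i,j].set_visible(False)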
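As a possible variation on the nested loops, the upper-triangle indices can be obtained in one call with `np.triu_indices_from`. The sketch below is self-contained and uses a randomly generated four-column DataFrame as a stand-in for the real data; the column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))  # stand-in data

axes = pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(8, 8))

# Row and column indices of every cell strictly above the main diagonal.
rows, cols = np.triu_indices_from(axes, k=1)
for ax in axes[rows, cols]:
    ax.set_visible(False)
```

Using `k=1` leaves the diagonal panels visible; `np.tril_indices_from(axes, k=-1)` would hide the lower half instead.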

Histogram Plotting Python

I need histograms of the number of attributes and classes, the descriptions of the attributes and classes, and the number of instances and classes. I am new to programming; this is what I've tried so far:
import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
data = pd.read_csv('mushroom')
column = data['Class']  # df.'Class' is a syntax error, and the frame is named data
num_bins = 5
n, bins, patches = plt.hist(column, num_bins, facecolor='blue', alpha=0.5)
plt.show()
This is what my data looks like:
cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,Class
,f,g,f,n,f,c,n,p,e,s,s,w,w,p,w,o,p,k,v,u,p
,f,g,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,k,y,u,p
x,f,g,f,n,f,w,b,k,t,s,f,w,w,p,w,o,e,n,s,g,e
,f,g,f,n,f,c,n,g,e,s,s,w,w,p,w,o,p,n,y,u,e
x,f,w,f,n,f,w,b,p,t,f,s,w,w,p,w,o,e,n,a,g,e
s,f,n,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,k,v,u,e
f,f,n,f,n,f,c,n,n,e,s,s,w,w,p,w,o,p,n,v,u,e
x,f,g,f,n,f,c,n,p,e,s,s,w,w,p,w,o,p,n,y,u,e
f,s,g,f,n,f,w,b,n,t,s,f,w,w,p,w,o,e,n,s,g,e
x,f,w,f,n,f,w,b,n,t,f,f,w,w,p,w,o,e,n,a,g,e
x,s,n,f,n,f,w,b,p,t,f,f,w,w,p,w,o,e,k,s,g,e
x,s,w,f,n,f,w,b,h,t,f,s,w,w,p,w,o,e,n,s,g,p
f,f,w,f,n,f,w,b,p,t,f,s,w,w,p,w,o,e,k,s,g,p
x,f,g,f,n,f,w,b,p,t,f,f,w,w,p,w,o,e,n,s,g,e
Class is a categorical variable, also called a factor (http://www.statisticshowto.com/what-is-a-categorical-variable/). Binning and histograms are meaningful only when you have a continuous variable (http://www.statisticshowto.com/continuous-variable/).
I assume what you actually need is a frequency plot, a bar chart that shows the frequency of each outcome in your categorical data.
If my assumption is correct, the following code will solve your problem.
import pandas
import matplotlib.pyplot as plt
data = pandas.read_csv('mushroom.csv')
fig, ax = plt.subplots()
data['Class'].value_counts().plot(kind='bar', ax=ax)
plt.show()
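To see what the bar chart is built from, it can help to look at the output of `value_counts()` on its own. This sketch uses a handful of made-up labels in place of the real 'Class' column, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for the 'Class' column of the mushroom file.
data = pd.DataFrame({"Class": ["p", "p", "e", "e", "e", "e", "p", "e"]})

# One row per distinct label, sorted by frequency (most common first).
counts = data["Class"].value_counts()
print(counts)
```

Each entry of `counts` becomes one bar when `.plot(kind='bar')` is called on it, so no binning is involved.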

Extending the range of bins in seaborn histogram

I'm trying to create a histogram with seaborn where the bins start at 0 and go to 1. However, there is only data in the range from 0.22 to 0.34. I want the empty space mainly as a visual effect, to present the data better.
I create my sheet with
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')
df = pd.read_excel('test.xlsx', sheet_name='IvT')  # older pandas versions spelled this sheetname
Here I create a variable for my list and one that I think should define the range of the bins of the histogram.
st = pd.Series(df['Short total'])
a = np.arange(0, 1, 15, dtype=None)
And the histogram itself looks like this
sns.set_style("white")
plt.figure(figsize=(12,10))
plt.xlabel('Ration short/total', fontsize=18)
plt.title ('CO3 In vitro transcription, Na+', fontsize=22)
ax = sns.distplot(st, bins=a, kde=False)
plt.savefig("hist.svg", format="svg")
plt.show()
It creates a graph, but the x range goes from 0 to 0.2050 and the y range from -0.04 to 0.04, which is completely different from what I expect. I searched for quite some time but can't seem to find an answer to my specific problem.
Thanks in advance for your help.
There are a few approaches to achieve the desired result here. For example, you can change the x-axis limits after you have plotted the histogram, or adjust the range over which the bins are created.
import seaborn as sns
# Load sample data and create a column with values in the suitable range
iris = sns.load_dataset('iris')
iris['norm_sep_len'] = iris['sepal_length'] / (iris['sepal_length'].max()*2)
sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
Change the x-axis limits (the bins are still created over the range of your data):
ax = sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
ax.set_xlim(0,1)
Create the bins over the range 0 to 1:
sns.distplot(iris['norm_sep_len'], bins=10, kde=False, hist_kws={'range':(0,1)})
Since the range for the bins is larger, you now need to use more bins if you want to have the same bin width as when adjusting the xlim:
sns.distplot(iris['norm_sep_len'], bins=45, kde=False, hist_kws={'range':(0,1)})
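A third option is to pass explicit bin edges. It is worth noting that `np.arange(0, 1, 15)` from the question uses 15 as the step, so it yields a single value rather than a set of edges. The sketch below assumes the intent was 15 equal bins on [0, 1]; that intent, the sample data, and the variable names are assumptions.

```python
import numpy as np

bad_edges = np.arange(0, 1, 15)     # start=0, stop=1, step=15 -> only one value
good_edges = np.linspace(0, 1, 16)  # 16 edges -> 15 equal-width bins on [0, 1]

# Hypothetical data confined to 0.22-0.34, as described in the question.
rng = np.random.default_rng(0)
values = rng.uniform(0.22, 0.34, 200)

counts, _ = np.histogram(values, bins=good_edges)
print(bad_edges)  # [0]
print(counts)
```

An array like `good_edges` can be given as the `bins` argument of the plotting call, which fixes both the bin width and the range covered by the bars.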
