Plotting a column with millions of rows - python

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to look at the distribution of 'price_per_mile' to check whether the data is normal and whether it needs cleaning. Then I want to group by the five categories in the 'Borough' column and run a t-test for each possible pair of boroughs.
I tried plotting the distribution with sns.distplot(), but the plot is not clear: the values on the y-axis appear to be scaled. The range of values in 'price_per_mile' is also large.
Then I tried plotting only a subset of the values, but the plot is still not clear or informative, and the scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to get a clearer plot that shows the true count in each bin rather than a normalized value?
I read the documentation for sns.distplot() but didn't find anything helpful.

As per the documentation for distplot (emphasis mine):
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
This means that if you want the non-normalized histogram, you have to tell seaborn not to plot the KDE at the same time:
# KDE is drawn, so the y-axis shows a density even though norm_hist=False
sns.distplot(a, kde=True, norm_hist=False)
# no KDE, so the y-axis shows the raw count per bin
sns.distplot(a, kde=False, norm_hist=False)
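For a column with millions of rows, one option is to compute the bin counts once with NumPy and draw them as bars, which avoids the density scaling altogether. The following is a minimal sketch under assumptions, not the asker's actual pipeline: the data is synthetic, and the 1 to 200 range is borrowed from the filter in the question. Note also that distplot has been deprecated since seaborn 0.11 in favour of histplot and displot.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for the 'price_per_mile' column (skewed, wide range).
price_per_mile = rng.lognormal(mean=1.5, sigma=1.0, size=1_000_000)

# Restrict the histogram to the range of interest instead of dropping rows.
lo, hi = 1, 200
counts, edges = np.histogram(price_per_mile, bins=100, range=(lo, hi))

# Draw the raw counts: each bar starts at its left bin edge.
fig, ax = plt.subplots()
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
ax.set_xlabel("price_per_mile")
ax.set_ylabel("count")
```

Because the counts are computed once, the same arrays can be reused to try a log-scaled y-axis (ax.set_yscale("log")) when a few bins dominate the plot.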

Related

My x-axis is messed up for huge datasets?

I am trying to plot a countplot using the Seaborn library. The dataset is large: more than 100,000 entries and 67 columns. When I plot it, the x-axis labels get messed up. I have tried increasing the figure size, but that does not help. My code and the resulting figure are as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(figsize=(25,25))
sns.barplot(y=na[0],x=na.index)
plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=20)
Any suggestion for plotting this so that the x-axis is clearer and easier to read would be helpful. Thank you for your help.
I found a solution to this question: swap the x-axis and the y-axis. I also amended my code to set the dpi. The code that gives me a proper visualisation is as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(num=None, figsize=(20,18), dpi=80, facecolor='w', edgecolor='r')
sns.barplot(y=na.index,x=na[0])
#plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=10)
The visualisation is as follows:

Matplotlib - Plot histogram truncate bar

I am plotting a histogram of observed values from a population against a normal distribution (derived from the mean and std of the sample). The sample has an unusual number of observations of value 0 (not to be confused with "NAN"). As a result, the graph of the two does not show clearly.
How can I best truncate the one bar in the histogram to allow the rest of the plot to fill the frame?
Why don't you set the y-limit to 0.00004? Then you can analyze the plot better.
axes = plt.gca()
axes.set_xlim([xmin,xmax])
axes.set_ylim([ymin,ymax])
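Rather than hard-coding the limit, the cap can be derived from the histogram itself. This is a rough sketch under assumptions: the sample is made up (a spike of exact zeros plus roughly normal values), it plots counts rather than a normalized histogram, and the choice of "10% above the second-tallest bar" is only an illustrative rule.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical sample: many exact zeros plus roughly normal observations.
sample = np.concatenate([np.zeros(3000), rng.normal(50, 10, 2000)])

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(sample, bins=50)

# Cap the y-axis just above the second-tallest bar so the zero spike is clipped.
second_tallest = np.sort(counts)[-2]
ax.set_ylim(0, second_tallest * 1.1)

# Label the clipped bar with its true height so the information is not lost.
tallest = counts.argmax()
ax.annotate(f"{int(counts[tallest])}", xy=(edges[tallest], second_tallest * 1.05))
```

The annotation keeps the plot honest: the bar is visibly cut off, but its real value is still readable.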

Seaborn : How to get the count in y axis for distplot using PairGrid

I'm using PairGrid, but I don't understand what the y-axis means for distplot. I thought it represented a count, but in the PairGrid it starts from negative values. If I draw only the distplot, I get the count.
I don't know if that is clear, so here are some plots:
My PairGrid:
My distplot :
The distplot is the same as the plot in the top left corner of the PairGrid.
The code corresponding to this is :
sns.distplot(pd.DataFrame(mySerie), kde=False)
and for the PairGrid :
g = sns.PairGrid(myDataFrame)
g = g.map_diag(sns.distplot, kde=False)
g = g.map_offdiag(plt.scatter)
Thank you in advance
You can use both methods to see different trends in the data with respect to the range of values and the total count.
See below for what I was working on when I came across your question (sorry for not sharing the data itself, it is too big).
With kde=False I can see that there are twice as many Yes as No in the total count.
With kde=True, instead, I can see that at lower ranges of values No is predominant, and even higher in percentage terms than the Yes category.
Hope it helps.
[Figures: two plots with kde=False and two plots with kde=True]
It was my understanding (although I could be mistaken) that the y-axis in your histograms is the fraction of the total counts. For example, in my distplot, roughly 0.08 or 8% of rows are in the 0-5 bin.
My Distplot
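If the y-axis is a density (which is what a normalized histogram shows), the bar height equals the fraction of rows only when the bin width is 1; in general it is the bar's area that gives the fraction. A small NumPy check on made-up data, with bins of width 5 chosen to mirror the 0-5 bin mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=10.0, size=10_000)  # made-up data

edges = np.arange(0, 105, 5)  # 20 bins of width 5, covering 0 to 100
counts, _ = np.histogram(values, bins=edges)
density, _ = np.histogram(values, bins=edges, density=True)

widths = np.diff(edges)
fraction = counts / counts.sum()  # share of the binned values in each bin

# Height times width (the bar's area) recovers the fraction, not the height alone.
areas_match = np.allclose(density * widths, fraction)
print(areas_match)
print(density[0], fraction[0])
```

Under this reading, a height of 0.08 on a bin of width 5 would correspond to about 40% of the rows, not 8%.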

Plotting histogram with matplotlib

I am trying to plot data as a histogram or bar chart in Python. The array holds between 0 and 10000 entries. Each entry depends on the input and ranges between 0 and e+20 (most of the data lies in the same range). I want to use a matplotlib histogram to show how often the data falls into each interval (to illustrate the mean and deviation). Sometimes it works, like this:
hist1.
But sometimes there is a problem with the interval size, like this:
hist2.
In this plot I need more bars around 0-100, etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot with an x range that covers the full range of your data.
If you have one outlier at a very high x compared with the other values, you will get this kind of 'compressed' figure.
If you always want the same view, you can fix the limits with xlim.
Alternatively, if you want the distribution to always be centered and shown as clearly as possible, you can calculate the mean and the standard deviation of your data and fix the x range accordingly (e.g. mean +/- 5 stdev).
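The mean +/- 5 stdev suggestion can be sketched as follows. The data here is made up (a tight cluster plus a handful of far-away values), and passing the limits through the range argument of hist is one way to do it, so that all 100 bins are spent on the region of interest rather than on the empty stretch up to the outliers.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up data: a tight cluster plus a few far-away values.
numbers = np.concatenate([rng.normal(100, 10, 5000), np.full(5, 400.0)])

# Window of k standard deviations around the mean.
k = 5
center, spread = numbers.mean(), numbers.std()
lo, hi = center - k * spread, center + k * spread

fig, ax = plt.subplots()
# range= makes hist place all bins inside [lo, hi]; values outside are ignored.
counts, edges, _ = ax.hist(numbers, bins=100, range=(lo, hi))
ax.set_xlim(lo, hi)
```

One caveat: a single value near e+20 inflates both the mean and the standard deviation, so in that case the window may still be far too wide. Limits taken from percentiles (for example np.percentile(numbers, [1, 99])) are a more robust alternative.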

Xaxis Labels not matching Data Points - Pandas/Matplotlib

I have a TimeSeries in Pandas that I want to plot. I have 336 records in the TimeSeries. I only want to show the date/time (index of the TimeSeries) on the x-axis once per every 20 or so data points.
Here is how I am trying to do this:
stats.plot()
ax.set_xticklabels(stats.index, rotation=45 )
ax.xaxis.set_major_locator(MultipleLocator(20))
ax.xaxis.set_minor_locator(NullLocator())
ax.yaxis.set_major_locator(MultipleLocator(.075))
draw()
My x-axis shows the correct number of labels (18), but these are the first 18 in the series; they do not correspond to the data points in the plot.
The problem is you are using set_xticklabels which sets the value of the tick labels independent of the data. The ticks are labeled sequentially from the list you pass in.
From this I can't really tell what you are trying to do, but the behavior you are seeing is the 'correct' behavior for the library (it's doing exactly what you told it to, but that isn't what you want it to do).
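One way to keep labels and data in step is to choose the tick positions explicitly and take each label from the same position in the index. This is only a sketch of that idea, assuming an hourly index and plotting against integer positions; the series, the date format, and the step of 20 are illustrative choices, not the asker's data.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up stand-in for the 336-record time series in the question.
index = pd.date_range("2013-01-01", periods=336, freq=pd.Timedelta(hours=1))
stats = pd.Series(np.random.default_rng(0).normal(size=336), index=index)

fig, ax = plt.subplots()
positions = np.arange(len(stats))
ax.plot(positions, stats.to_numpy())

# Pick every 20th position and take its label from the same position in the index.
step = 20
tick_positions = positions[::step]
tick_labels = list(stats.index[::step].strftime("%Y-%m-%d %H:%M"))
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=45, ha="right")
```

Setting the tick positions and the labels from the same slice is what ties each label to its data point; calling set_xticklabels on its own only renames whatever ticks happen to exist.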
