My x-axis is messed up for huge datasets? - python

I am trying to draw a count plot with the Seaborn library. The dataset is large, with more than 100,000 entries and 67 columns. When I plot it, the x-axis labels overlap and become unreadable. I have tried increasing the figure size of the plot, but that did not help. My code and the figure of the plot are as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(figsize=(25,25))
sns.barplot(y=na[0],x=na.index)
plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=20)
Any suggestion on how to plot this so that the x-axis is clearer and easier to read would be helpful. Thank you for your help.

I have found a solution to this question: swap the x-axis and the y-axis, so that the column names run down the y-axis. I have also amended my code to set the dpi. The code that gives me a proper visualisation is as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(num=None, figsize=(20,18), dpi=80, facecolor='w', edgecolor='r')
sns.barplot(y=na.index,x=na[0])
#plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=10)
The visualisation is as follows:
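A variation on the same idea, as a minimal sketch rather than the exact code used above: it builds a made-up 67-column frame (the real `df` is not shown in the question), keeps only the columns that actually contain nulls, sorts them, and draws horizontal bars with matplotlib's `barh` instead of `sns.barplot`. The filtering and sorting steps are assumptions added for readability, not part of the original solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the asker's df: 1000 rows, 67 numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 67)),
                  columns=[f"col_{i}" for i in range(67)])
df.iloc[:300, 5] = np.nan    # inject nulls into two columns
df.iloc[:50, 20] = np.nan

null_counts = df.isnull().sum()                            # column name -> null count
null_counts = null_counts[null_counts > 0].sort_values()   # drop columns without nulls

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(null_counts.index.tolist(), null_counts.values)    # names on the y-axis
ax.set_xlabel("Number of null values")
ax.set_title("Columns with Null Values Distribution", size=10)
fig.tight_layout()
```

With the names on the y-axis, each label gets its own row, so long column names no longer collide.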

Related

Trying to remove names from pyplot bar graph

So I am working with some data for a science fair project, and I am extremely new to pandas and matplotlib/pyplot. I am currently trying to make a graph of some data (a bar graph) and have been able to do so fine. I split my DataFrame into two parts: the name and the values themselves:
data = pd.read_csv('results.csv')
data = data.sort_values(by=['Accuracy'], ascending=False)
accuracy = data['Accuracy']
names = data['Name']
This works fine. And when I go to make my graph it also works fine:
plt.bar(names, accuracy)
plt.title('Accuracy Below 97%')
plt.ylabel('Accuracy in Percent')
plt.show()
But the only problem is that when I do this, my names are too long so it ends up as a sort of blur:
I also have around 40 data points, which I understand is probably too many to be able to see the names anyway, but the names are around 30 characters long, so even if I reduced the number of data points in a graph it probably still would not work.
So I then assumed that I could just remove names from plt.bar(names, accuracy), but this throws the error:
TypeError: bar() missing 1 required positional argument: 'height'
So I realized that I need a width value, and since the number of data point was 42 I then tried:
plt.bar(42, accuracy)
But this creates a weird graph that I am not looking for:
So my question is: how do I remove the names from the graph while keeping the actual graph the same?
Any help is greatly appreciated. Thanks!
If you want to remove the x-tick labels from the graph, you can do:
plt.xticks([])
Also, once the labels are removed, you can set the x-axis limits so that the bars fill the width of the axes:
plt.xticks([])
plt.xlim(-0.5, len(accuracy)-0.5)
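The failed attempt plt.bar(42, accuracy) places every bar at the single x position 42, because the first argument of bar() is the x position(s), not the number of bars. A minimal sketch of the alternative, assuming made-up accuracy values in place of the asker's CSV: pass one integer position per bar and hide the ticks.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in for the asker's sorted 'Accuracy' column (42 values)
accuracy = np.linspace(96.9, 80.0, 42)

positions = np.arange(len(accuracy))   # one x position per bar: 0, 1, ..., 41

fig, ax = plt.subplots()
bars = ax.bar(positions, accuracy)     # the names are simply never passed in
ax.set_xticks([])                      # hide tick marks and labels on the x-axis
ax.set_title("Accuracy Below 97%")
ax.set_ylabel("Accuracy in Percent")
```

The bar heights and order are unchanged; only the x labels disappear.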
Here is what you asked for, but rather than hiding the labels you could fix the overlap itself using the approach in this linked question:
datetime x-axis matplotlib labels causing uncontrolled overlap
ax = data[['Accuracy','Name']].plot(title='Accuracy Below 97%')
ax.get_xaxis().set_visible(False)
plt.show()

How do I plot ordered categorical data?

I have categorical data in a dataframe that corresponds to the x-axis and the y-axis of a plot, and the categories are ordered (e.g. a<b<c<d). After failing to find a similar example on Stack Overflow showing how to plot such data, I assigned ordered numbers to the categories (e.g. a,b,c,d,e were indexed by 4,5,6,7,8) and plotted the following graph
using the following lines of code:
fig, ax = plt.subplots()  # ax is needed by the get_xlim/set_xlim calls below
ax.scatter(df3["LCST"], df3["LCST.1"])
lims = [
    np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
    np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
]
plt.plot(lims, lims, 'k-', alpha=0.75, zorder=0)
plt.xlabel('Actual', fontsize=18)
plt.ylabel('Estimated', fontsize=18)
ax.set_aspect('equal')
ax.set_xlim(lims)
ax.set_ylim(lims)
The issue is that this plot is not particularly informative: I would prefer to have the actual categories "a, b, c, d, ..." on the axes rather than mere numbers (besides, the half-integer ticks are misleading, as they do not correspond to any category). I tried the same exercise without first converting the categories to numbers, but then the ordering was all off. I have come across many posts about ordering categorical data, but none of them helped in my particular case. How can this be done? My data is of the following format (the example below is only for illustration and does not correspond to the graph plotted above):
where, say, I want to produce the same graph as above, only with the categories as the axis, such that CCC+ < B < A < A+. Any help would very much be appreciated. Thanks
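One possible approach, offered as a sketch and not taken from the original thread: keep plotting integer codes, but derive them from an ordered pandas Categorical and relabel the ticks with the category names. The ratings below are made up; only the column names `LCST` and `LCST.1` and the ordering CCC+ < B < A < A+ come from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

order = ["CCC+", "B", "A", "A+"]   # from the question: CCC+ < B < A < A+

# Made-up ratings; column names mirror the question's df3
df3 = pd.DataFrame({
    "LCST":   ["A", "B", "CCC+", "A+", "B"],
    "LCST.1": ["A+", "B", "B", "A+", "CCC+"],
})

# An ordered Categorical assigns each label an integer code that follows `order`
x = pd.Categorical(df3["LCST"], categories=order, ordered=True).codes
y = pd.Categorical(df3["LCST.1"], categories=order, ordered=True).codes

ticks = list(range(len(order)))
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(ticks, ticks, "k-", alpha=0.75, zorder=0)   # diagonal: estimated == actual
ax.set_xticks(ticks)
ax.set_xticklabels(order)                           # show names instead of codes
ax.set_yticks(ticks)
ax.set_yticklabels(order)
ax.set_xlabel("Actual", fontsize=18)
ax.set_ylabel("Estimated", fontsize=18)
ax.set_aspect("equal")
```

Because ticks are placed only at the integer codes, the misleading half-number ticks disappear as well.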

Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the fill of the plot (i.e. the stacked bars) is grouped logarithmically, in powers of 10.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot divides the dataset into groups based on frequency: the 10 most frequent clones are grouped in red, followed by the next 100, the next 1000, and so on. The y-axis runs from 0.00 to 1.00; a 0-100% scale would mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they are in the top 10 most frequent, the next 100, and so on).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
This code starts from the biggest group (10001+) and goes smaller and smaller, to include overlapping samples that might fall into the next big group.
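The same binning can probably be written without the loop, by ranking clones within each sample and cutting the ranks into bins. This is a sketch on made-up data, not the code I actually ran: it assumes that ranking by `cloneFraction` gives the same order as ranking by `cloneCount`, and it uses the `Sample` column name from the plotting code.

```python
import numpy as np
import pandas as pd

# Made-up data: 2 samples with 150 clones each
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Sample": np.repeat(["S1", "S2"], 150),
    "cloneFraction": rng.random(300),
})

# Rank clones within each sample: 1 = most frequent clone of that sample
rank = df.groupby("Sample")["cloneFraction"].rank(method="first", ascending=False)

# Cut the ranks into the same groups as the loop: 1-10, 11-100, 101-1000, ...
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)
```

Using `method="first"` breaks ties by order of appearance, so each sample gets exactly 10 clones in `top10`.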
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!

Plotting a column with millions of rows

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to see the distribution of data in 'price_per_mile', to see if data is normal and if I need to do some data cleaning. Then group-by based on five categories in 'borough' column and then do the t-test for each possible pair of boroughs.
I have tried to plot the distribution with sns.distplot() but it doesn't give me a clear plot as it seems there's a scaling of the values on the y-axis. Also, the range of values contained in 'price_per_mile' is big.
Then I tried to plot a section of values, again the plot doesn't look clear and informative enough. Scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to have a better-looking plot which gives me the true value of each bin and not just a normalized value?
I read the documentation for sns.distplot() but didn't find something helpful.
As per the documentation for distplot (emphasis mine):
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
Which means that if you want the non-normalized histogram, you have to instruct seaborn not to plot the KDE at the same time:
sns.distplot(a, kde=True, norm_hist=False)   # still shows a density, because the KDE is drawn
sns.distplot(a, kde=False, norm_hist=False)  # shows the raw count in each bin
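To make the difference between the two bar heights concrete, here is a small NumPy-only sketch on made-up, right-skewed data standing in for `price_per_mile`. It does not call seaborn; it only shows how a density relates to raw counts for the same bins.

```python
import numpy as np

# Made-up, right-skewed stand-in for 'price_per_mile'
rng = np.random.default_rng(42)
a = rng.lognormal(mean=1.5, sigma=0.8, size=100_000)
a = a[(a >= 1) & (a <= 200)]           # same range as the filter in the question

edges = np.linspace(1, 200, 51)        # 50 equal-width bins
counts, _ = np.histogram(a, bins=edges)                  # raw count per bin
density, _ = np.histogram(a, bins=edges, density=True)   # normalized heights

widths = np.diff(edges)
# A density is the count divided by (number of points * bin width),
# so the bar areas sum to 1 instead of the heights summing to len(a).
```

This is why the y-axis values look "scaled" whenever a KDE is drawn: the bars are densities, not counts.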

Plotting data with matplot and python to graph

I'm currently trying to plot 7 days with varying small to large numbers.
The first set of data may look like this
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [107.660514, 107.550403, 107.435041, 107.435003, 107.574965, 107.449961, 107.650052, 107.649974]
whereas another set of data may have the same dates but much smaller values that change only in small increments
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [0.849215, 0.849655, 0.849655, 0.851095, 0.850885, 0.850135, 0.851203, 0.851865]
When I use this
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot_date(x=dates, y=values, fmt="r--")
plt.ylabel(c)
plt.grid(True)
plt.savefig('static/%s.png' % c)
The resulting image for the 1st set of values comes out as a dashed line connecting the dots for each day. But the 2nd set of data produces an image of 7 parallel lines stacked on top of each other.
Should I be plotting this differently?
I assume you would like a comparison between the two sets of data you provided.
However, with such a gap between the two sets of values, the result could be fairly unclear if you show both in the same axes.
You could use plt.subplots() to give each set its own axes within one figure.
Or, better still, show the two plots separately, and you'll get a much clearer plot.
If you want to just show two plots, you can do something like this.
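A minimal sketch, using the two data sets from the question: the date strings are parsed into datetime objects (an assumption about how you want the x-axis handled) and each series is drawn on its own axes with its own y-scale. Plain `plot` is used here instead of `plot_date`.

```python
from datetime import datetime
import matplotlib.pyplot as plt

dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23',
         '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values_a = [107.660514, 107.550403, 107.435041, 107.435003,
            107.574965, 107.449961, 107.650052, 107.649974]
values_b = [0.849215, 0.849655, 0.849655, 0.851095,
            0.850885, 0.850135, 0.851203, 0.851865]

# Parse the strings so matplotlib treats the x values as real dates
x = [datetime.strptime(d, "%Y-%m-%d") for d in dates]

# Two stacked axes sharing the date axis, each with an independent y-scale
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax1.plot(x, values_a, "r--")
ax2.plot(x, values_b, "r--")
for ax in (ax1, ax2):
    ax.grid(True)
fig.autofmt_xdate()   # rotate the date labels so they do not overlap
```

If you save one image per series in a loop instead, creating a fresh figure for each series keeps earlier lines from carrying over into later images.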
