Y-axis values cuts off using seaborn scatter plot

I have an issue plotting a big CSV file whose Y-axis values range from 1 up to 20+ million. I am facing two problems right now:
1. The Y-axis does not show all the values it is supposed to. With the original data, it shows values up to 6 million instead of up to 20 million. With the sample data (smaller data) below, it only shows the first Y-axis value and no others.
2. In the legend, since I am using hue and style = name, "name" appears both as the legend title and as an item inside it.
Questions:
1. How can I fix the plot so that all the Y-axis values show up?
2. How can I get rid of "name" in the legend without getting rid of the shapes and colors of the scatter points?
(Please let me know if there are any sources, or if this question was answered in another post, without marking it as a duplicate. Please also let me know if I have any grammar/spelling issues that I need to fix. Thank you!)
Below you can find the function I am using to plot the graph and the sample data.
def test_graph(file_name):
    data_file = pd.read_csv(file_name, header=None, error_bad_lines=False, delimiter="|", index_col = False, dtype='unicode')
    data_file.rename(columns={0: 'name',
                              1: 'date',
                              2: 'name3',
                              3: 'name4',
                              4: 'name5',
                              5: 'ID',
                              6: 'counter'}, inplace=True)
    data_file.date = pd.to_datetime(data_file['date'], unit='s')
    norm = plt.Normalize(1,4)
    cmap = plt.cm.tab10
    df = pd.DataFrame(data_file)
    # Below creates and returns a dictionary of category-point combinations,
    # by cycling over the marker points specified.
    points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
    mult = len(df['name']) // len(points) + (len(df['name']) % len(points) > 0)
    markers = {key:value for (key, value)
               in zip(df['name'], points * mult)} ; markers
    sc = sns.scatterplot(data = df, x=df['date'], y=df['counter'], hue = df['name'], style = df['name'], markers = markers, s=50)
    ax.set_autoscaley_on(True)
    ax.set_title("TEST", size = 12, zorder=0)
    plt.legend(title="Names", loc='center left', shadow=True, edgecolor = 'grey', handletextpad = 0.1, bbox_to_anchor=(1, 0.5))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(100))
    plt.xlabel("Dates", fontsize = 12, labelpad = 7)
    plt.ylabel("Counter", fontsize = 12)
    plt.grid(axis='y', color='0.95')
    fig.autofmt_xdate(rotation = 30)
fig = plt.figure(figsize=(20,15),dpi=100)
ax = fig.add_subplot(1,1,1)
test_graph(file_name)
plt.savefig(graph_results + "/Test.png", dpi=100)
# Prevents to cut-off the bottom labels (manually) => makes the bottom part bigger
plt.gcf().subplots_adjust(bottom=0.15)
plt.show()
Sample data
namet1|1582334815|ai1|ai1||150|101
namet1|1582392415|ai2|ai2||142|105
namet2|1582882105|pc1|pc1||1|106
namet2|1582594106|pc1|pc1||1|123
namet2|1580592505|pc1|pc1||1|141
namet2|1580909305|pc1|pc1||1|144
namet3|1581974872|ai3|ai3||140|169
namet1|1581211616|ai4|ai4||134|173
namet2|1582550907|pc1|pc1||1|179
namet2|1582608505|pc1|pc1||1|185
namet4|1581355640|ai5|ai5|bcu|180|298466
namet4|1582651641|pc2|pc2||233|298670
namet5|1582406860|ai6|ai6|bcu|179|298977
namet5|1580563661|pc2|pc2||233|299406
namet6|1581283626|qe1|q0/1|Link to btse1/3|51|299990
namet7|1581643672|ai5|ai5|bcu|180|300046
namet4|1581758842|ai6|ai6|bcu|179|300061
namet6|1581298027|qe2|q0/2|Link to btse|52|300064
namet1|1582680415|pc2|pc2||233|300461
namet6|1581744427|pc3|p90|Link to btsi3a4|55|6215663
namet6|1581730026|pc3|p90|Link to btsi3a4|55|6573348
namet6|1582190826|qe2|q0/2|Link to btse|52|6706378
namet6|1582190826|qe1|q0/1|Link to btse1/3|51|6788568
namet1|1581974815|pc2|pc2||233|6895836
namet4|1581974841|pc2|pc2||233|7874504
namet6|1582176427|qe1|q0/1|Link to btse1/3|51|9497687
namet6|1582176427|qe2|q0/2|Link to btse|52|9529133
namet7|1581974872|pc2|pc2||233|9573450
namet6|1582162027|pc3|p90|Link to btsi3a4|55|9819491
namet6|1582190826|pc3|p90|Link to btsi3a4|55|13494946
namet6|1582176427|pc3|p90|Link to btsi3a4|55|19026820
Results I am getting:
Big data:
Small data:
Updated Graph

First of all, some improvements on your post: you are missing the import statements
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
The line
df = pd.DataFrame(data_file)
is not necessary, since data_file already is a DataFrame. The lines
points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
mult = len(df['name']) // len(points) + (len(df['name']) % len(points) > 0)
markers = {key:value for (key, value)
in zip(df['name'], points * mult)}
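do not cycle through points as you might expect: zip pairs the markers with every row of df['name'] rather than with each unique name, so a name simply keeps the marker of its last row and different names can end up sharing one. Consider using itertools.cycle instead. Also, setting yticks like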
ax.yaxis.set_major_locator(ticker.MultipleLocator(100))
for every 100 is far too dense if your data spans values from 0 to 20 million; consider replacing 100 with, say, 1000000.
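The marker-cycling point can be illustrated with a minimal sketch. It assumes the goal is one marker per unique name; the `names` list is a hypothetical stand-in for `df['name']`, and the resulting dictionary is the kind of mapping that the `markers` argument of `sns.scatterplot` accepts.

```python
import itertools

points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']

# Hypothetical stand-in for df['name']; repeated names are expected.
names = ["namet1", "namet1", "namet2", "namet3", "namet2", "namet4"]

# Unique names in first-seen order, so the mapping is reproducible.
unique_names = list(dict.fromkeys(names))

# itertools.cycle restarts the marker list if there are more names than markers.
markers = dict(zip(unique_names, itertools.cycle(points)))
print(markers)
```

Building the mapping from the unique names, rather than from every row, guarantees that the first 13 names get distinct markers and that markers only repeat once the list is exhausted.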
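I was able to reproduce your first problem. Using df.dtypes I found that the column counter was stored as type object: read_csv is called with dtype='unicode', so the counter values are strings, which get plotted as categories rather than numbers. Adding the line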
df['counter']=df['counter'].astype(int)
resolved your first problem for me. I couldn't reproduce your second issue, though. Here is what the resulting plot looks like for me:
Have you tried updating all your packages to the latest version?
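To make the dtype diagnosis concrete, here is a small sketch that reads two rows in the same pipe-delimited layout as the sample data and converts the counter column to numbers. It uses dtype=str as a stand-in for the question's dtype='unicode' and pd.to_numeric instead of astype(int); both are assumptions made for this illustration, not what the original code does.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two rows copied from the sample data.
raw = (
    "namet1|1582334815|ai1|ai1||150|101\n"
    "namet6|1582176427|pc3|p90|Link to btsi3a4|55|19026820\n"
)

# Reading everything as strings reproduces the situation in the question.
df = pd.read_csv(io.StringIO(raw), header=None, delimiter="|", dtype=str)
df = df.rename(columns={0: "name", 1: "date", 6: "counter"})

numeric_before = is_numeric_dtype(df["counter"])

# Convert the string column to numbers so the y-axis becomes continuous.
df["counter"] = pd.to_numeric(df["counter"])
numeric_after = is_numeric_dtype(df["counter"])

print(numeric_before, numeric_after, df["counter"].max())
```

Once the column is numeric, the axis limits are computed from the values themselves instead of from the order in which the strings appear.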
EDIT: As a follow-up to your comment, you can also adjust the number of xticks in your plot by replacing 1 in
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
by a higher number, say 10. Incorporating all my suggestions and deleting the seemingly unnecessary function definition, my version of your code looks as follows:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
import itertools
fig = plt.figure()
ax = fig.add_subplot()
df = pd.read_csv(
    'data.csv',
    header = None,
    # error_bad_lines was replaced by on_bad_lines='skip' in pandas 1.3 and removed in 2.0
    error_bad_lines = False,
    delimiter = "|",
    index_col = False,
    dtype = 'unicode')
df.rename(columns={0: 'name',
                   1: 'date',
                   2: 'name3',
                   3: 'name4',
                   4: 'name5',
                   5: 'ID',
                   6: 'counter'}, inplace=True)
# every column was read as a string, so cast to int before converting
df['date'] = pd.to_datetime(df['date'].astype(int), unit='s')
df['counter'] = df['counter'].astype(int)
points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
# one marker per unique name, restarting the list if it runs out
markers = itertools.cycle(points)
markers = list(itertools.islice(markers, len(df['name'].unique())))
sc = sns.scatterplot(
    data = df,
    x = 'date',
    y = 'counter',
    hue = 'name',
    style = 'name',
    markers = markers,
    s = 50)
ax.set_title("TEST", size = 12, zorder=0)
ax.legend(
    title = "Names",
    loc = 'center left',
    shadow = True,
    edgecolor = 'grey',
    handletextpad = 0.1,
    bbox_to_anchor = (1, 0.5))
ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1000000))
ax.minorticks_off()
ax.set_xlabel("Dates", fontsize = 12, labelpad = 7)
ax.set_ylabel("Counter", fontsize = 12)
ax.grid(axis='y', color='0.95')
fig.autofmt_xdate(rotation = 30)
plt.gcf().subplots_adjust(bottom=0.15)
plt.show()
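For the second problem (the stray "name" entry in the legend), which could not be reproduced above, one possible workaround is to filter the legend entries before drawing the legend. The sketch below is an assumption about the cause: some seaborn versions add the hue variable name as an extra legend entry, and here that entry is simulated with an invisible, empty scatter labelled "name". The data points are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Simulated subtitle entry: an empty, invisible artist labelled "name".
ax.scatter([], [], alpha=0, label="name")
ax.scatter([1, 2], [3, 4], marker="o", label="namet1")
ax.scatter([1, 2], [5, 6], marker="v", label="namet2")

# Drop the unwanted entry but keep every other handle (marker and color).
handles, labels = ax.get_legend_handles_labels()
kept = [(h, l) for h, l in zip(handles, labels) if l != "name"]
handles, labels = zip(*kept)

legend = ax.legend(handles, labels, title="Names",
                   loc="center left", bbox_to_anchor=(1, 0.5))
legend_texts = [t.get_text() for t in legend.get_texts()]
print(legend_texts)
```

Because only the label list is filtered, the remaining handles keep the shapes and colors that the plotting call assigned to them.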

Related

Change axis line range in mpl_toolkits new_fixed_axis

I am struggling to modify my code to define a specific range of the secondary x-axis. Below is a snippet of the relevant code for creating 2 x-axes, and the output it generates:
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
from mpl_toolkits.axes_grid.parasite_axes import SubplotHost
...
x = np.arange(1, len(metric1)+1) # the label locations
width = 0.3 # the width of the bars
fig1 = plt.figure()
ax1 = SubplotHost(fig1, 111)
fig1.add_subplot(ax1)
ax1.axis((0, 14, 0, 20))
ax1.bar(x, [t[2] for t in metric1], width, label='metric1')
ax1.bar(x + width, [t[2] for t in metric2], width, label='metric2')
ax1.bar(x + 2*width, [t[2] for t in metric3], width, label='metric3')
ax1.set_xticks(x+width)
ax1.set_xticklabels(['BN', 'B', 'DO', 'N', 'BN', 'B', 'DO', 'N', 'BN', 'B', 'DO', 'N', 'BN', 'B', 'DO', 'N'])
ax1.axis["bottom"].major_ticks.set_ticksize(0)
ax2 = ax1.twiny()
offset = 0, -25 # Position of the second axis
new_axisline = ax2.get_grid_helper().new_fixed_axis
ax2.axis["bottom"] = new_axisline(loc="bottom", axes=ax2, offset=offset)
ax2.axis["top"].set_visible(False)
ax2.axis["bottom"].minor_ticks.set_ticksize(0)
ax2.axis["bottom"].major_ticks.set_ticksize(15)
ax2.set_xticks([0.058, 0.3434, 0.63, 0.915])
ax2.xaxis.set_major_formatter(ticker.NullFormatter())
ax2.xaxis.set_minor_locator(ticker.FixedLocator([0.20125, 0.48825, 0.776]))
ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(['foo', 'bar', 'foo2']))
...
This is the current output:
What I would like is for the secondary x-axis line (foo, bar, foo2) not to extend beyond the first and last x-ticks, as follows (I edited this in MS Paint 😅):
Any help appreciated.

As there have been no other answers, I can suggest a non-elegant way of doing what you need.
You can hide the axis line and "manually" create one line yourself:
import matplotlib.lines as lines
ax2.axis["bottom"].line.set_visible(False)
p1 = ax2.axis["bottom"].line.get_extents().get_points()
x1 = 0.058 * (p1[1][0]-p1[0][0]) / (1) + p1[0][0]
x2 = 0.915 * (p1[1][0]-p1[0][0]) / (1) + p1[0][0]
newL = lines.Line2D([x1,x2], [p1[0][1],p1[1][1]], transform=None, axes=ax2,color="k",linewidth=0.5)
ax2.lines.extend([newL,])
Which gives, in a simple example, something like this:
As opposed to:
Alternative
One alternative for creating multiple axes is to use spines (no parasite axes):
https://matplotlib.org/stable/gallery/ticks_and_spines/multiple_yaxis_with_spines.html
In this case, it is possible to do what you need simply by changing the bounds of the spines. For instance, by adding the following line to the code in the link
par2.spines["right"].set_bounds(10,30)
we get this:
Obviously, this does not strictly reply to the title of your question, and unfortunately, I do not know a proper way of doing it for new_fixed_axis as it can be done for the spines. I hope the "manually" created line solves your issue, in case nobody else comes with a better solution.
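As a rough illustration of the spine-based alternative applied to a second x-axis, the sketch below uses a plain twiny() axes instead of a parasite axes. The bar data and the 25-point outward offset are placeholder assumptions; only the tick positions are taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax1.bar([1, 2, 3], [5, 8, 3])  # placeholder data

# Second x-axis sharing the y-axis, moved below the first one.
ax2 = ax1.twiny()
ax2.xaxis.set_ticks_position("bottom")
ax2.xaxis.set_label_position("bottom")
ax2.spines["bottom"].set_position(("outward", 25))

ax2.set_xlim(0, 1)
ticks = [0.058, 0.3434, 0.63, 0.915]
ax2.set_xticks(ticks)

# Draw the axis line only between the first and last tick.
ax2.spines["bottom"].set_bounds(ticks[0], ticks[-1])
bounds = ax2.spines["bottom"].get_bounds()
print(bounds)
```

With this approach the axis line stops at the outer ticks without any manual Line2D, at the cost of giving up the new_fixed_axis machinery.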

Different Markers in Scatterplot based on Label

I want to scatterplot the validation results of different models and methods. Both the "Train" and "Validation" datapoints are to be plotted in different colors (already done). In addition to that, I'd like to use different markers for the different models, like in the following:
Model 1, train set: marker "triangle_down" in color "red"
Model 1, validation set: marker "triangle_down" in color "blue"
Model 2, train set: marker "octagon" in color "red"
Model 2, validation set: marker "octagon" in color "blue"
The dataframe looks like this
I have the following body of a function
k=1
label=['Train', 'Validation']
drop_learners=[]
drop_cols=[]
train_summary = self_summary_train.drop(drop_learners).drop(drop_cols, axis=1)
validation_summary = self_summary_validation.drop(drop_learners).drop(drop_cols, axis=1)
plot_data = pd.concat([self_summary_train, self_summary_validation])
plot_data['label'] = [i.replace('Train', '') for i in plot_data.index]
plot_data['label'] = [i.replace('Validation', '') for i in plot_data.label]
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
xs = plot_data['Abs % Error of ATE']
ys = plot_data['MSE']
group = np.array([label[0]] * self_summary_train.shape[0] + [label[1]] * self_summary_validation.shape[0])
cdict = {label[0]: 'red', label[1]: 'blue'}
for g in np.unique(group):
    ix = np.where(group == g)[0].tolist()
    ax.scatter(xs[ix], ys[ix], c=cdict[g], label=g, s=100)
for i, txt in enumerate(plot_data.label[:]):
    ax.annotate(txt, (xs[i] + 0.005, ys[i]))
ax.set_xlabel('Abs % Error of ATE')
ax.set_ylabel('MSE')
ax.set_title('Learner Performance (averaged over k={} simulations)'.format(k))
ax.legend(loc='center left', bbox_to_anchor=(1.1, 0.5))
plt.show()
I already tried in the for-loop to set marker in the scatter-function to
markerdict = {learners[0]: ".", learners[1]: 'v', learners[2]: "^", learners[3]: "1", learners[4]: "2", learners[5]: "8", learners[6]: "p",learners[7]:"*", learners[8]:"d"}
markers=['^', 's', 'p', 'h', '8']
but it didn't work out.
Maybe someone can help me here, thanks in advance!

See if this will work:
markers=['^', 's', 'p', 'h', '8']
for idx, g in enumerate(np.unique(group)):
    ix = np.where(group == g)[0].tolist()
    ax.scatter(xs[ix], ys[ix], c=cdict[g], label=g, s=100, marker = markers[idx])
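Note that the loop above varies the marker per group (Train/Validation). If the marker should instead vary per model while the color encodes the set, as the question describes, one way to sketch it is to group by both columns. The small DataFrame and its "set" column are hypothetical stand-ins for plot_data; only the column names 'label', 'Abs % Error of ATE' and 'MSE' come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for plot_data, with an explicit "set" column.
plot_data = pd.DataFrame({
    "label": ["Model 1", "Model 2", "Model 1", "Model 2"],
    "set": ["Train", "Train", "Validation", "Validation"],
    "Abs % Error of ATE": [0.10, 0.20, 0.15, 0.25],
    "MSE": [1.0, 2.0, 1.5, 2.5],
})

cdict = {"Train": "red", "Validation": "blue"}
markers = ["v", "8", "^", "s", "p"]  # "v" = triangle_down, "8" = octagon
markerdict = dict(zip(plot_data["label"].unique(), markers))

fig, ax = plt.subplots()
# One scatter call per (model, set) pair: marker from the model, color from the set.
for (model, subset), grp in plot_data.groupby(["label", "set"]):
    ax.scatter(grp["Abs % Error of ATE"], grp["MSE"],
               c=cdict[subset], marker=markerdict[model], s=100,
               label=f"{model} ({subset})")
ax.set_xlabel("Abs % Error of ATE")
ax.set_ylabel("MSE")
ax.legend(loc="center left", bbox_to_anchor=(1.1, 0.5))
n_groups = len(ax.collections)
```

A single scatter call accepts only one marker, which is why the plot is built from one call per combination.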

Independent axis for each subplot in pandas boxplot

The below code helps in obtaining subplots with unique colored boxes. But all subplots share a common set of x and y axis. I was looking forward to having independent axis for each sub-plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
df = pd.DataFrame(np.random.rand(140, 4), columns=['A', 'B', 'C', 'D'])
df['models'] = pd.Series(np.repeat(['model1','model2', 'model3', 'model4', 'model5', 'model6', 'model7'], 20))
bp_dict = df.boxplot(
    by="models",layout=(2,2),figsize=(6,4),
    return_type='both',
    patch_artist = True,
)
colors = ['b', 'y', 'm', 'c', 'g', 'b', 'r', 'k', ]
for row_key, (ax,row) in bp_dict.items():
    ax.set_xlabel('')
    for i,box in enumerate(row['boxes']):
        box.set_facecolor(colors[i])
plt.show()
Here is an output of the above code:
I am trying to have separate x and y axis for each subplot...

You need to create the figure and subplots beforehand and pass the axes in as an argument to df.boxplot(). This also means you can remove the argument layout=(2,2):
fig, axes = plt.subplots(2,2,sharex=False,sharey=False)
Then use:
bp_dict = df.boxplot(
    by="models", ax=axes, figsize=(6,4),
    return_type='both',
    patch_artist = True,
)

You may set the ticklabels visible again, e.g. via
plt.setp(ax.get_xticklabels(), visible=True)
This does not make the axes independent, though; they are still bound to each other. But it seems you are asking about the visibility of the tick labels rather than the shared behaviour.

If you really think it is necessary to un-share the axes after creating the boxplot array, you can do this, but you have to do everything 'by hand'. After searching through Stack Overflow and the Matplotlib documentation for a while, I came up with the following solution to un-share the y-axes of the Axes instances; for the x-axes, you would proceed analogously:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
from matplotlib.ticker import AutoLocator, AutoMinorLocator
##using differently scaled data for the different random series:
df = pd.DataFrame(
    np.asarray([
        np.random.rand(140),
        2*np.random.rand(140),
        4*np.random.rand(140),
        8*np.random.rand(140),
    ]).T,
    columns=['A', 'B', 'C', 'D']
)
df['models'] = pd.Series(np.repeat([
    'model1','model2', 'model3', 'model4', 'model5', 'model6', 'model7'
], 20))
##creating the boxplot array:
bp_dict = df.boxplot(
    by="models",layout = (2,2),figsize=(6,8),
    return_type='both',
    patch_artist = True,
    rot = 45,
)
colors = ['b', 'y', 'm', 'c', 'g', 'b', 'r', 'k', ]
##adjusting the Axes instances to your needs
for row_key, (ax,row) in bp_dict.items():
    ax.set_xlabel('')
    ##removing shared axes
    ##(note: Matplotlib 3.6 deprecated modifying this Grouper, so this step needs an older release):
    grouper = ax.get_shared_y_axes()
    shared_ys = [a for a in grouper]
    for ax_list in shared_ys:
        for ax2 in ax_list:
            grouper.remove(ax2)
    ##setting limits:
    ax.axis('auto')
    ax.relim() #<-- maybe not necessary
    ##adjusting tick positions:
    ax.yaxis.set_major_locator(AutoLocator())
    ax.yaxis.set_minor_locator(AutoMinorLocator())
    ##making tick labels visible:
    plt.setp(ax.get_yticklabels(), visible=True)
    for i,box in enumerate(row['boxes']):
        box.set_facecolor(colors[i])
plt.show()
The resulting plot looks like this:
Explanation:
You first need to tell each Axes instance that it shouldn't share its y-axis with any other Axes instance. This post pointed me in the right direction: Axes.get_shared_y_axes() returns a Grouper object that holds references to all other Axes instances with which the current Axes shares its y-axis. Looping through those instances and calling Grouper.remove does the actual un-sharing.
Once the yaxis is un-shared, the y limits and the y ticks need to be adjusted. The former can be achieved with ax.axis('auto') and ax.relim() (not sure if the second command is necessary). The ticks can be adjusted by using ax.yaxis.set_major_locator() and ax.yaxis.set_minor_locator() with the appropriate Locators. Finally, the tick labels can be made visible using plt.setp(ax.get_yticklabels(), visible=True) (see here).
Considering all this, @DavidG's answer is, in my opinion, the better approach.

Non overlapping error bars in line plot

I am using Pandas and Matplotlib to create some plots. I want line plots with error bars on them. The code I am using currently looks like this
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5)
ax.set_xscale("log")
plt.show()
With this code, I get 6 lines on a single plot (which is what I want). However, the error bars completely overlap, making the plot difficult to read.
Is there a way I could slightly shift the position of each point on the x-axis so that the error bars no longer overlap?
Here is a screenshot:

One way to achieve what you want is to plot the error bars 'by hand', but it is neither straightforward nor much better looking than your original. Basically, you let pandas produce the line plot, then iterate through the data frame columns and draw a pyplot errorbar plot for each of them with the index shifted slightly sideways (in your case, with the logarithmic scale on the x-axis, the shift is a multiplication by a factor). In the error bar plots, the marker size is set to zero:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
colors = ['red','blue','green','yellow','purple','black']
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(ax=ax, marker="o",color=colors)
index = df.index
rows = len(index)
columns = len(df.columns)
factor = 0.95
for column,color in zip(range(columns),colors):
    y = df.values[:,column]
    yerr = df_yerr.values[:,column]
    ax.errorbar(
        df.index*factor, y, yerr=yerr, markersize=0, capsize=5,color=color,
        zorder = 10,
    )
    factor *= 1.02
ax.set_xscale("log")
plt.show()
As I said, the result is not pretty:
UPDATE
In my opinion a bar plot would be much more informative:
fig2,ax2 = plt.subplots()
df.plot(kind='bar',yerr=df_yerr, ax=ax2)
plt.show()
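The sideways shift used in the errorbar loop works because, on a logarithmic axis, multiplying x by a constant moves a point by the same visual distance everywhere. A small sketch of how such factors could be generated so that they are centred on the true x position is shown below; the helper name dodge_factors and the default spread of 1.1 are assumptions made for this illustration.

```python
import numpy as np

def dodge_factors(n_series, spread=1.1):
    """Return n_series multiplicative offsets, evenly spaced in log space
    and centred on 1, ranging from 1/spread to spread."""
    if n_series == 1:
        return np.array([1.0])
    return np.geomspace(1.0 / spread, spread, n_series)

index = np.array([10, 100, 1000, 10000])
factors = dodge_factors(6)

# One shifted copy of the x positions per data series.
shifted = [index * f for f in factors]
print(factors)
```

Each entry of shifted could then be passed as the x argument of ax.errorbar for the corresponding column, in place of df.index*factor.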

You can mitigate this with alpha, for example:
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5,alpha=0.5)
You can also check this link for reference

Remove empty bars from grouped barplot

I have a grouped barplot. It works well, but I am trying to remove the empty bars; they take up too much space.
I have already tried:
%matplotlib inline
import matplotlib as mpl
from matplotlib.gridspec import GridSpec
import matplotlib.pyplot as plt
import sys
import os
import glob
import seaborn as sns
import pandas as pd
import ggplot
from ggplot import aes
sns.set(style= "whitegrid", palette="pastel", color_codes=True )
tab_folder = 'myData'
out_folder ='myData/plots'
tab = glob.glob('%s/R*.tab'%(tab_folder))
#is reading all my data
for i, tab_file in enumerate(tab):
    folder,file_name=os.path.split(tab_file)
    s_id=file_name[:-4].replace('DD','')
    df=pd.DataFrame.from_csv(tab_file, sep='\t')
    df_2 = df.groupby(['name','ab']).size().reset_index(name='count')
    df_2 = df_2[df_2['count'] != 0]
    table = pd.pivot_table(df_2, index='name',columns='ab', values='count' )
    table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'], ax = ax)
    for label in (ax.get_xticklabels() + ax.get_yticklabels()):
        label.set_fontsize(4)
    ax.set_title(s_id).update({'color':'black', 'size':5, 'family':'monospace'})
    ax.set_xlabel('')
    ax.set_ylabel('')
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles[::-1], labels[::-1], bbox_to_anchor=(1, 1.05),prop= {'size': 4} )
    png_t = '%s/%s.b.png'%(out_folder,s_id)
    plt.savefig(png_t, dpi = 500)
But it's not working. The bars are still the same.
Is there any other method to remove empty bars?

Your question is not complete. I don't know exactly what you're trying to accomplish, but from what you've said I'd guess that you are trying not to display empty pivot pairs.
This is not possible with the standard plotting tools in pandas. A grouped plot has to display every group, including the NaNs, which are plotted as "empty bars".
Furthermore, after a groupby every group has a size of at least one, so the condition df_2['count'] != 0 is always true.
For example
df = pd.DataFrame([['nameA', 'abA'], ['nameB', 'abA'],['nameA','abB'],['nameD', 'abD']], columns=['names', 'ab'])
df_2 = df.groupby(['names', 'ab']).size().reset_index(name='count')
df_2 = df_2[df_2['count'] != 0] # this line has no effect
table = pd.pivot_table(df_2, index='names',columns='ab', values='count' )
table
gives
ab      abA   abB   abD
names
nameA  1.00  1.00   NaN
nameB  1.00   NaN   NaN
nameD   NaN   NaN  1.00
and
table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'])
shows
And that's the way it is: the plot needs to show all groups after the pivot.
EDIT
You can also use a stacked plot to get rid of the empty space:
table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'], stacked=True)
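If the bars must stay side by side rather than stacked, another option is to skip the pivot and draw only the existing (name, ab) pairs directly with Matplotlib. The following is a minimal sketch under that assumption, reusing the small example DataFrame from above; the layout choices (bar height 0.9, a gap of 0.5 between groups) are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([['nameA', 'abA'], ['nameB', 'abA'], ['nameA', 'abB'], ['nameD', 'abD']],
                  columns=['names', 'ab'])
counts = df.groupby(['names', 'ab']).size().reset_index(name='count')
colors = dict(zip(sorted(counts['ab'].unique()), ['b', 'g', 'r']))

fig, ax = plt.subplots()
pos = 0.0
tick_pos, tick_labels, seen = [], [], set()
for name, grp in counts.groupby('names'):
    start = pos
    for _, row in grp.iterrows():
        # Label each 'ab' value only once so the legend has no duplicates.
        label = row['ab'] if row['ab'] not in seen else '_nolegend_'
        seen.add(row['ab'])
        ax.barh(pos, row['count'], height=0.9, color=colors[row['ab']], label=label)
        pos += 1
    # Centre the group label on the bars that were actually drawn.
    tick_pos.append((start + pos - 1) / 2)
    tick_labels.append(name)
    pos += 0.5  # gap between groups
ax.set_yticks(tick_pos)
ax.set_yticklabels(tick_labels)
ax.legend(title='ab')
n_bars = len(ax.patches)
```

Each group then takes only as much vertical space as it has bars, so no slot is reserved for combinations that never occur.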
