Set the axis range in a boxplot - python

I'm doing exploratory data analysis (EDA) on this Kaggle dataset.
I'm drawing some boxplots in pandas with this code:
coupon_list[["CATALOG_PRICE","VALIDEND_MONTH"]].boxplot(by='VALIDEND_MONTH')
The problem I'm having is that the y axis has a large scale, which makes the plot hard to read. Is there any way to limit the range of this axis? Something similar to ylim?
EDIT:
The dataset has outliers. Adding the argument:
showfliers=False
seems to solve the issue.

That's odd, since the y axis is autoscaled by default; see the example below. Maybe you have some outliers in your data. Could you share more code?
import pandas as pd
import numpy as np
np.random.seed(4)  # call the function; assigning to it would overwrite it without seeding
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ax = df.boxplot()
Here is the same plot with outliers
# Generating some outliers
df.loc[0] = df.loc[0] * 10
ax = df.boxplot()
Could you try the showfliers option to plot the box without outliers? In this example the Y scale is back to [0-100].
ax = df.boxplot(showfliers=False)
showfliers : bool, optional (True)
Show the outliers beyond the caps.
matplotlib.axes.Axes.boxplot
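If you would rather keep the outliers and only zoom in, one option is to set the limits on the Axes the boxplot is drawn on. This is a minimal sketch with made-up data: the `coupons` frame, its values, and the 0 to 200 range are placeholders, not the Kaggle data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for coupon_list: prices by month, plus a few extreme outliers.
rng = np.random.default_rng(0)
coupons = pd.DataFrame({
    "CATALOG_PRICE": rng.integers(0, 100, size=120),
    "VALIDEND_MONTH": np.tile(np.arange(1, 13), 10),
})
coupons.loc[:4, "CATALOG_PRICE"] = 5000  # inject outliers into the first five rows

fig, ax = plt.subplots()
coupons.boxplot(column="CATALOG_PRICE", by="VALIDEND_MONTH", ax=ax)
ax.set_ylim(0, 200)  # outliers above 200 still exist, they are just outside the view
```

Call plt.show() afterwards to display the figure. Unlike showfliers=False, this only changes the visible range; the boxes and whiskers are computed from the same data either way.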

Related

Contour plot of multiple lineplots in matplotlib

I have a set of 125 x and y data (X-ray absorption spectroscopy data, i.e. energy vs. intensity) and I would like to reproduce a plot similar to this one: [contour plot of XANES spectra](https://i.stack.imgur.com/0Kymp.png)
The spectra were taken as a function of time, and my goal is to plot them in a 2D contour plot with the energy as x and the time (or maybe just the index of the spectrum) as y. I would like the z axis to represent the intensity of the spectra with different colors so that changes in time are easily seen.
My data currently look like this when I plot them all in the same graph with a viridis color map: [image: line plot of the spectra]
I have tried to work with the contour function of matplotlib and got this result: [image: attempt at a contour plot]
I used the following code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_excel('data.xlsx')
energy = df['energy']
df.index = energy
df = df.iloc[:,2:]
df = df.transpose()
X = energy
Y = range(len(df.index))
fig, ax = plt.subplots()
ax.contourf(X,Y,df)
plt.show()
If you have any ideas, I would be grateful. I am in fact not sure that the contour function is the most appropriate for what I want, and I am open to any suggestions.
Thanks,
Yoloco
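One possible direction, sketched here on synthetic data rather than the real spectra, is to treat the stack of spectra as a 2D array (one row per spectrum, one column per energy point) and draw it as a colour map with pcolormesh. The energy grid, the sigmoid-shaped "edge" and its drift are all invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 125 spectra sampled on a shared energy grid.
energy = np.linspace(7100, 7200, 300)
n_spectra = 125
times = np.arange(n_spectra)  # spectrum index used as a stand-in for time

# Synthetic absorption edge whose position drifts slowly with time.
edge = 7140 + 0.05 * times[:, None]
intensity = 1.0 / (1.0 + np.exp(-(energy[None, :] - edge) / 2.0))  # shape (125, 300)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(energy, times, intensity, shading="auto", cmap="viridis")
fig.colorbar(mesh, ax=ax, label="intensity")
ax.set_xlabel("energy")
ax.set_ylabel("spectrum index")
```

The important part is the shape of the array: rows must follow y and columns must follow x. ax.contourf(energy, times, intensity, levels=50) accepts the same arguments if filled contours are preferred.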

How to create a scatter plot for each dataframe column

I am trying to write some code in order to create an animation of scatter plot data through time. In order to do this I have a dataset with multiple columns where each column represents a numbered timestep.
I would like the code to cycle through each timestep column for the y axis and use a constant x axis, so that a separate scatter plot is generated for each timestep. I tried to do this by coding a for loop that specifies an incrementing column number for the y axis.
My current code generates three out of seven scatter plots in my sample data but returns the following error:
IndexError: index 9 is out of bounds for axis 0 with size 9
I have tried other similar solutions on Stack Overflow but they didn't correct my problem.
The data file is here if anyone wants to use what I am using: https://www.dropbox.com/s/7vwa0lud44td2ak/test_splot_anim_noTS.csv?dl=0
Any help or advice would be much appreciated.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
data=pd.read_csv("test_splot_anim_noTS.csv")
for n in range (6, 13):
    data.plot(kind='scatter', x='metres', y=n)
    plt.ylim(-4,4)
    plt.savefig('n.jpeg')
data=pd.read_csv("test_splot_anim_noTS.csv")
for column in data.columns[1:]:
    data.plot(kind='scatter', x='metres',y=column)
    plt.ylim(-4,4)
    plt.savefig('{}.jpeg'.format(column))
I may have done it!
pandas.DataFrame.plot, single line plot:
data=pd.read_csv("test_splot_anim_noTS.csv")
data.set_index('metres', drop=True, inplace=True)
data.plot()
With matplotlib, single plot with all columns:
import matplotlib.pyplot as plt
plt.plot(data)
plt.show()
Separate scatter plots, files saved:
for col in data.columns:
    plt.scatter(data.index, data[col])
    plt.ylim(-4, 4)
    plt.savefig(f'{col}.jpeg')
    plt.show()
With Seaborn:
import seaborn as sns
for col in data.columns:
    sns.scatterplot(x=data.index, y=data[col])  # recent seaborn versions require keyword x/y
    plt.ylim(-4, 4)
    plt.savefig(f'{col}.jpeg')
    plt.show()
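When the loop runs in a script without plt.show(), points from earlier columns can pile up on the same axes. A small sketch of one way around that is to create a fresh figure per column and close it after saving. The frame below is synthetic, and the column names t0 to t2, the temporary folder and the PNG format are placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data shaped like the CSV above: 'metres' plus one column per timestep.
metres = np.linspace(0, 10, 20)
data = pd.DataFrame({"metres": metres})
for step in range(3):
    data[f"t{step}"] = np.sin(metres + step)

out_dir = Path(tempfile.mkdtemp())  # placeholder output folder
saved = []
for col in data.columns.drop("metres"):
    fig, ax = plt.subplots()          # new figure for every timestep
    ax.scatter(data["metres"], data[col])
    ax.set_ylim(-4, 4)
    ax.set_title(col)
    path = out_dir / f"{col}.png"
    fig.savefig(path)
    plt.close(fig)                    # free the figure so plots do not accumulate
    saved.append(path)
```

Iterating over column names instead of integer positions also avoids the IndexError from the question, since the loop can never run past the last column.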

Matplotlib: Identify bars in bar plot based on criteria

The code below:
import pandas as pd
import matplotlib.pyplot as plt
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
df.plot.bar(color='gray',rot=0)
plt.show()
gives the following output:
I would like to plot bars in red color for the top two quantity fruits i.e., Orange and Banana. How can I do that? Instead of giving a fixed threshold value to change color, I would prefer if my plot is robust enough to identify top two bars.
There might be a straightforward and simpler way but I was able to come up with the following solution which would work in principle for any number of top n values. The idea is:
First get the top n elements (n=2 in the example below) from the DataFrame using nlargest
Then, loop over the x-tick labels and change the color of the patches (bars) for those values which are the largest using an if statement to get their index. Here we created an axis instance ax to be able to extract the patches for setting the colors.
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
df.plot.bar(color='gray',rot=0, ax=ax)
top = df['Quantity'].nlargest(2).keys() # Top 2 values here
for i, tick in enumerate(ax.get_xticklabels()):
    if tick.get_text() in top:
        ax.patches[i].set_color('r')
plt.show()
Plotting a colored bar plot
The problem is that pandas bar plots take the color argument to apply column-wise. Here you have a single column. Hence something like the canonical attempt to color a bar plot does not work
pd.DataFrame([12,14]).plot.bar(color=["red", "green"])
A workaround is to create a diagonal matrix instead of a single column and plot it with the stacked=True option.
df = pd.DataFrame([12,14])
df = pd.DataFrame(np.diag(df[0].values), index=df.index, columns=df.index)
df.plot.bar(color=["red", "green"], stacked=True)
Another option is to use matplotlib instead.
df = pd.DataFrame([12,14])
plt.bar(df.index, df[0].values, color=color)
Choosing the colors according to values
Now the question remains on how to create a list of the colors to use in either of the two solutions above. Given a dataframe df you can create an array of equal length to the frame and fill it with the default color, then you can set those entries of the two highest values to another color:
color = np.array(["gray"]*len(df))
color[np.argsort(df["Quantity"])[-2:]] = "red"
Solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
color = np.array(["gray"]*len(df))
color[np.argsort(df["Quantity"])[-2:]] = "red"
plt.bar(df.index, df["Quantity"], color=color)  # pass the 1-D column; df.values is 2-D
plt.show()
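The two ideas above (nlargest to find the top entries, a per-bar colour list for matplotlib) can also be combined without argsort. This is only a sketch of that variant, using the same fruit data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = [['Apple', 10], ['Banana', 15], ['Kiwi', 11], ['Orange', 17]]
df = pd.DataFrame(data, columns=['Fruit', 'Quantity']).set_index('Fruit')

# Labels of the two largest quantities, then one colour per bar in index order.
top = df['Quantity'].nlargest(2).index
colors = np.where(df.index.isin(top), 'red', 'gray').tolist()

fig, ax = plt.subplots()
ax.bar(df.index, df['Quantity'], color=colors)
```

Changing the argument of nlargest is enough to highlight a different number of bars.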

Plotting shaded uncertainty region in line plot in matplotlib when data has NaNs

I would like a plot which looks like this:
I am trying to do this with matplotlib:
fig, ax = plt.subplots()
with sns.axes_style("darkgrid"):
    for i in range(5):
        ax.plot(means.ix[i][list(range(3,104))], label=means.ix[i]["label"])
        ax.fill_between(means.ix[i][list(range(3,104))]-stds.ix[i][list(range(3,104))], means.ix[i][list(range(3,104))]+stds.ix[i][list(range(3,104))])
    ax.legend()
I want the shaded region to be the same colour as the line in the centre. But right now, my problem is that means has some NaNs and fill_between does not accept that. I get the error
TypeError: ufunc 'isfinite' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
Any ideas on how I could achieve what I want? The solution doesn't need to use matplotlib as long as it can plot my series of points with their uncertainties for multiple series.
OK. So one of the problems was that the dtype of my data was object and not float, and this caused fill_between to fail when it checked whether the numbers were finite. I finally managed to do it by (a) converting to float and (b) using a colour palette, so that the uncertainty band and the line get matching colours. So I have:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
clrs = sns.color_palette("husl", 5)
# .ix was removed in pandas 1.0; .iloc assumes the rows are selected by position
with sns.axes_style("darkgrid"):
    epochs = list(range(101))
    for i in range(5):
        meanst = np.array(means.iloc[i].values[3:-1], dtype=np.float64)
        sdt = np.array(stds.iloc[i].values[3:-1], dtype=np.float64)
        ax.plot(epochs, meanst, label=means.iloc[i]["label"], c=clrs[i])
        ax.fill_between(epochs, meanst-sdt, meanst+sdt ,alpha=0.3, facecolor=clrs[i])
    ax.legend()
    ax.set_yscale('log')
which gave me the following result:
You could simply drop the NaNs from your means DataFrame and plot that resulting dataframe instead?
In the example below, I tried to get close to your structure, I have a means DataFrame with some NaN sprinkled around. I suppose the stds DataFrame probably has NaN at the same locations, but in this case it doesn't really matter, I drop the NaN from means to get temp_means and I use the indices left in temp_means to extract the std values from stds.
The plots show the results before (top) and after (bottom) dropping the NaNs
x = np.linspace(0, 30, 100)
y = np.sin(x/6*np.pi)
error = 0.2
means = pd.DataFrame(np.array([x,y]).T,columns=['time','mean'])
stds = pd.DataFrame(np.zeros(y.shape)+error)
#sprinkle some NaN in the mean
sprinkles = means.sample(10).index
means.loc[sprinkles] = np.nan
fig, axs = plt.subplots(2,1)
# .ix was removed in pandas 1.0; .iloc selects the same columns by position
axs[0].plot(means.iloc[:,0], means.iloc[:,1])
axs[0].fill_between(means.iloc[:,0], means.iloc[:,1]-stds.iloc[:,0], means.iloc[:,1]+stds.iloc[:,0])
temp_means = means.dropna()
axs[1].plot(temp_means.iloc[:,0], temp_means.iloc[:,1])
axs[1].fill_between(temp_means.iloc[:,0], temp_means.iloc[:,1]-stds.loc[temp_means.index,0], temp_means.iloc[:,1]+stds.loc[temp_means.index,0])
plt.show()
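The dtype issue described in the first answer can also be handled with pd.to_numeric, and the band can reuse the colour of the line it belongs to. The following is a rough sketch on invented data; the series values and the constant standard deviation are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.arange(10)
# Hypothetical object-dtype data with missing values, as in the question.
means = pd.Series([1, 2, None, 4, 5, 6, None, 8, 9, 10], dtype=object)
stds = pd.Series([0.5] * 10, dtype=object)

# Convert to float; missing or unparseable entries become NaN.
mean_f = pd.to_numeric(means, errors="coerce")
std_f = pd.to_numeric(stds, errors="coerce")

fig, ax = plt.subplots()
(line,) = ax.plot(x, mean_f, label="series")
# Reuse the line colour so the band matches; NaNs leave gaps in both.
ax.fill_between(x, mean_f - std_f, mean_f + std_f,
                color=line.get_color(), alpha=0.3)
ax.legend()
```

With float NaNs, both the line and the band are simply interrupted at the missing points. Dropping the NaN rows first, as in the answer above, connects the neighbouring points instead.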

y-axis scaling in seaborn vs pandas

I'm plotting a scatter plot using a pandas dataframe. This works correctly, but I wanted to use seaborn themes and special functions. When I plot the same data points with seaborn, the variation along the y-axis is almost invisible. The x-axis values range from 5000 to 15000, while the y-axis values are in [-6:6]*10^-7.
If I multiply the y-axis values by 10^6, they display correctly, but the actual values remain invisible/indistinguishable in a seaborn-generated plot.
How can I configure seaborn so that the y-axis scales automatically in the resulting plot?
Also, some rows may contain NaN (not in this case). How can I disregard them while plotting, short of manually weeding out the rows that contain NaN?
Below is the code I've used to plot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("datascale.csv")
subdf = df.loc[(df.types == "easy") & (df.weight > 1300), ]
subdf = subdf.iloc[1:61, ]
subdf.drop(subdf.index[[25]], inplace=True) #row containing NaN
subdf.plot(x='length', y='speed', style='s') #scales y-axis correctly
sns.lmplot(x="length", y="speed", data=subdf, fit_reg=True, lowess=True) #doesn't scale y-axis properly
# multiplying by 10^6 displays the plot correctly, in matplotlib
plt.scatter(subdf['length'], 10**6*subdf['speed'])
Strange that seaborn does not scale the axis correctly. Nonetheless, you can correct this behaviour. First, get a reference to the axis object of the plot:
lm = sns.lmplot(x="length", y="speed", data=subdf, fit_reg=True)
After that you can manually set the y-axis limits:
lm.axes[0,0].set_ylim(min(subdf.speed), max(subdf.speed))
The result should look something like this:
Example Jupyter notebook here.
Seaborn and matplotlib should just ignore NaN values when plotting. You should be able to leave them as is.
As for the y scaling: there might be a bug in seaborn.
The most basic workaround is still to scale the data before plotting.
Scale to microspeed in the dataframe before plotting and plot microspeed instead.
subdf['microspeed']=subdf['speed']*10**6
Or transform to log y before plotting, i.e.
import math
df = pd.DataFrame({'speed':[1, 100, 10**-6]})
df['logspeed'] = df['speed'].map(lambda x: math.log(x,10))
then plot logspeed instead of speed.
Another approach would be to use seaborn regplot instead.
Matplotlib scales and plots the data correctly for me, as follows:
plt.plot(subdf['length'], subdf['speed'], 'o')
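The manual-limits workaround from the first answer can be sketched in plain matplotlib as well. The data below is randomly generated to mimic the ranges in the question (lengths of 5000 to 15000, speeds around 1e-7, one NaN); none of it comes from datascale.csv.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data in the same ranges as the question.
rng = np.random.default_rng(0)
length = rng.uniform(5000, 15000, 60)
speed = rng.uniform(-6e-7, 6e-7, 60)
speed[25] = np.nan  # a missing value; scatter skips it

fig, ax = plt.subplots()
ax.scatter(length, speed)

# Limits from the data itself, ignoring NaN, with a 5% margin.
lo, hi = np.nanmin(speed), np.nanmax(speed)
pad = 0.05 * (hi - lo)
ax.set_ylim(lo - pad, hi + pad)

# Show tick labels with a shared power-of-ten offset instead of long decimals.
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
```

The same set_ylim call should work on the Axes of a seaborn figure, for example lm.axes[0,0] as used in the answer above.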
