pandas boxplot, groupby different ylim in each subplot - python

I have a dataframe and I would like to plot it as:
>>> X = pd.DataFrame(np.random.normal(0, 1, (100, 3)))
>>> X['NCP'] = np.random.randint(0, 5, 100)
>>> X[X['NCP'] == 0] += 100
>>> X.groupby('NCP').boxplot()
The result is what I want, but all the subplots have the same ylim. This makes it impossible to visualize the result properly. How can I set a different ylim for each subplot?

What you asked for was to set the y-axis limits separately for each axes. I believe that should be ax.set_ylim([a, b]), but every time I ran it on one axes it updated all of them (most likely because the subplots share their y axis).
Because I couldn't figure out how to answer your question directly, I'm providing a workaround.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

X = pd.DataFrame(np.random.normal(0, 1, (100, 3)))
X['NCP'] = np.random.randint(0, 5, 100)
X[X['NCP'] == 0] += 100
groups = X.groupby('NCP')
print(groups.groups.keys())
# This creates one subplot per group, stacked in a single column.
# You can adjust the layout yourself if you need to.
fig, axes = plt.subplots(len(groups.groups), 1, figsize=[10, 12])
# Loop through each group and draw its boxplot on the matching axes
for i, k in enumerate(groups.groups.keys()):
    group = groups.get_group(k)
    group.boxplot(ax=axes[i], return_type='axes')
See the matplotlib subplots documentation.
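For what it's worth, the groupby boxplot method in recent pandas versions takes a sharey argument that defaults to True, which would explain why changing one axes changed them all. The following is a minimal sketch, assuming a pandas version that supports sharey; the column names a, b, c and the deterministic group labels are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(0, 1, (100, 3)), columns=["a", "b", "c"])
X["NCP"] = np.arange(100) % 5                  # five groups, deterministic for the example
X.loc[X["NCP"] == 0, ["a", "b", "c"]] += 100   # shift one group far away from the others

# sharey=False lets every subplot autoscale its own y axis
X.groupby("NCP").boxplot(sharey=False, figsize=(10, 8))

# Collect the (bottom, top) y limits of each group's subplot
fig = plt.gcf()
ylims = [ax.get_ylim() for ax in fig.axes if ax.get_visible() and ax.has_data()]
```

With independent y axes, calling set_ylim on a single axes should then affect only that subplot.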

Related

plot a point within ridgeplots

I have the following dataframe:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joypy
sample1 = np.random.normal(5, 10, size = (200, 5))
sample2 = np.random.normal(40, 5, size = (200, 5))
sample3 = np.random.normal(10, 5, size = (200, 5))
b = []
for i in range(0, 3):
    a = "Sample" + "{}".format(i)
    lst = np.repeat(a, 200)
    b.append(lst)
b = np.asarray(b).reshape(600,1)
data_arr = np.vstack((sample1,sample2, sample3))
df1 = pd.DataFrame(data = data_arr, columns = ["foo", "bar", "qux", "corge", "grault"])
df1.insert(0, column="sampleNo", value = b)
I am able to produce the following ridgeplot:
fig, axes = joypy.joyplot(df1, column = ['foo'], by = 'sampleNo',
                          alpha=0.6,
                          linewidth=.5,
                          linecolor='w',
                          fade=True)
Now, let's say I have the following vector:
vectors = np.asarray([10, 40, 50])
How do I plot each one of those points into the density plots? E.g., on the distribution plot of sample 1, I'd like to have a single point (or line) on 10; sample 2 on 40, etc..
I've tried to use axvline, and I sort of expected this to work, but no luck:
for ax in axes:
    ax.axvline(vectors(ax))
I am not sure if what I want is possible at all...
You almost had the correct approach.
axes holds 4 axes objects, in order: the three stacked plots from top to bottom, followed by the big one that contains the other three. So,
for ax, v in zip(axes, vectors):
    ax.axvline(v)
zip() stops at the end of the shorter iterable, which is vectors. So it pairs each point from vectors with one axes of the stacked plots and never reaches the big one.
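The truncating behaviour of zip() can be checked without joypy. In this sketch, four plain matplotlib axes stand in for the list that joyplot returns (an assumption for illustration only; the last one plays the role of the enclosing axes).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

vectors = np.asarray([10, 40, 50])

# Stand-in for joypy's return value: four axes, the last one representing
# the enclosing "big" axes (hypothetical setup, not joypy itself).
fig, axes = plt.subplots(4, 1)
axes = list(axes)

for ax, v in zip(axes, vectors):   # stops after len(vectors) == 3 pairs
    ax.axvline(v, color="k")

# Number of lines drawn on each axes
lines_per_axes = [len(ax.lines) for ax in axes]   # [1, 1, 1, 0]
```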

Set different markersizes for plotting pandas dataframe with matplotlib

I want to decrease the markersize for every line I plot with my dataframe. I can set a unique markersize like that:
df = pd.read_csv(file_string, index_col=0)
df.plot(style=['^-','v-','^-','v-','^-','v-'], markersize=8)
I set a different style for every line (I knew that there are 6). Now I want to do the same with the sizes, but this doesn't work:
df = pd.read_csv(file_string, index_col=0)
df.plot(style=['^-','v-','^-','v-','^-','v-'], markersize=[16,14,12,10,8,6])
How can I achieve something like this?
The above earlier answer works fine for a small number of columns. If you don't want to repeat the same code many times, you can also write a loop that alternates between the markers, and reduces the marker size at each iteration. Here I reduced it by 4 each time, but the starting size and amount you want to reduce each marker size is obviously up to you.
df = pd.DataFrame({'y1':np.random.normal(loc = 5, scale = 10, size = 20),
                   'y2':np.random.normal(loc = 5, scale = 10, size = 20),
                   'y3':np.random.normal(loc = 5, scale = 10, size = 20)})
size = 18
for y in df.columns:
    col_index = df.columns.get_loc(y)
    if col_index % 2 == 0:
        plt.plot(df[y], marker = '^', markersize = size)
    else:
        plt.plot(df[y], marker = 'v', markersize = size)
    size -= 4
plt.legend(ncol = col_index+1, loc = 'lower right')
According to the documentation, markersize accepts only a float value, not a list.
You can use matplotlib directly instead, and plot each line independently:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(file_string, index_col=0)
plt.plot(df[x], df[y1], '^-', markersize=16)  # the format string must come before keyword arguments
plt.plot(df[x], df[y2], 'v-', markersize=14)
#and so on...
plt.show()
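Another option, sketched here under the assumption that df.plot returns a matplotlib Axes whose lines appear in column order, is to keep the single df.plot call and adjust each line afterwards with set_markersize. The data and the sizes list are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(5, 10, size=(20, 3)), columns=["y1", "y2", "y3"])  # made-up data

styles = ["^-", "v-", "^-"]
sizes = [16, 12, 8]

ax = df.plot(style=styles)
# Each column becomes one Line2D on the axes; resize its markers individually
for line, size in zip(ax.lines, sizes):
    line.set_markersize(size)
ax.legend()  # recreate the legend so its markers pick up the new sizes

applied = [line.get_markersize() for line in ax.lines]
```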

print multiple separate histograms in one loop

I want this to print out two histograms (of the first two columns), but this instead stacks the histograms within the same plot. How do I get it to output two separate histograms?
dataobj = pd.DataFrame([[1,2,3],[3,4,5],[6,7,8]])
for i in [0,1]:
    a = np.array(dataobj.iloc[:,i])
    plt.hist(a,bins = np.linspace(0,10,11))
Even better would be a solution where I can save the plots into an array which I could later call to display them.
Working in Jupyter
dataobj = pd.DataFrame([[1, 2, 3], [3, 4, 5], [6, 7, 8]])
# Pass figsize here; changing rcParams after the figure exists does not resize it
fig, axes = plt.subplots(3, 1, figsize=(12, 12))
for i in range(3):
    a = np.array(dataobj.iloc[:, i])
    axes[i].hist(a, bins=np.linspace(0, 10, 11))
plt.show()
You need to use axes.
Just add plt.show() inside the for loop; there is no need for subplots and axes. Like this:
dataobj = pd.DataFrame([[1,2,3],[3,4,5],[6,7,8]])
for i in [0,1]:
    a = np.array(dataobj.iloc[:,i])
    plt.hist(a,bins = np.linspace(0,10,11))
    plt.show()
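For the second part of the question (keeping the plots in an array to display later), one possible approach is to create a separate Figure per histogram and store the Figure objects in a list. This is a minimal sketch; the names figures and bins are chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dataobj = pd.DataFrame([[1, 2, 3], [3, 4, 5], [6, 7, 8]])
bins = np.linspace(0, 10, 11)

figures = []
for i in [0, 1]:
    fig, ax = plt.subplots()                      # a fresh figure per histogram
    ax.hist(dataobj.iloc[:, i].to_numpy(), bins=bins)
    ax.set_title(f"column {i}")
    figures.append(fig)                           # keep the figure for later
```

In Jupyter, evaluating figures[0] in a later cell should display the stored plot again, and figures[0].savefig("hist0.png") writes it to disk.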

bubble chart with the bubble size equal to group size in python

Would like a scatter plot (or heatmap) with the bubble size (or color) to show the size of each group in pandas.
For example, data in pandas DataFrame:
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=['first_col', 'second_col'])
df.groupby(['first_col', 'second_col']).size()
In the scatter plot (or heatmap), x axis is the first_col, and y axis is the second_col, and the bubble size equal to the result from .size().
It would be better if the answer could handle continuous values as well as discrete ones. In that case, the plot may need a bin size to be set.
Alright, figured it out myself.
df = pd.DataFrame(np.random.randint(10, size=(1000, 2)), columns=['first_col', 'second_col'])
index = df.groupby(['first_col', 'second_col']).size().index
x = index.map(lambda t: t[0])
y = index.map(lambda t: t[1])
areas = df.groupby(['first_col', 'second_col']).size()
plt.scatter(x, y, s=areas * 3, alpha=0.5)
I don't know how to extract the x and y coordinates in a more NumPy-like way.
You can further simplify the code as follows:
first_col, second_col = "first_col", "second_col"
df = pd.DataFrame(
    np.random.randint(10, size=(1000, 2)), columns=[first_col, second_col]
)
df_plot = df.groupby([first_col, second_col]).size()
plt.scatter(
    df_plot.index.get_level_values(first_col),
    df_plot.index.get_level_values(second_col),
    s=df_plot * 3,
    alpha=0.5,
)
Please note that the above code will work only if you have integer values in both columns. If that is not the case, you first need to use the pd.cut() function to bin your data and then plot the data using the following script. Please remember to change the np.arange() argument to match your data.
first_col, second_col = "first_col", "second_col"
df = pd.DataFrame(np.random.rand(1000, 2), columns=[first_col, second_col])
df["x_binned"] = pd.cut(
    df[first_col],
    np.arange(0, 1.1, 0.1),
)
df["y_binned"] = pd.cut(df[second_col], np.arange(0, 1.1, 0.1))
df_plot = df.groupby(["x_binned", "y_binned"]).size()
plt.scatter(
    pd.IntervalIndex(df_plot.index.get_level_values("x_binned")).mid,
    pd.IntervalIndex(df_plot.index.get_level_values("y_binned")).mid,
    s=df_plot * 3,
    alpha=0.5,
)
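For continuous data, the binning and counting can also be done in NumPy alone. The sketch below swaps pd.cut and groupby for np.histogram2d (a different technique from the answer above); the bin count of 10 and the factor of 3 for the bubble size are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

# counts[i, j] is the number of points in x bin i and y bin j
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)
x_mid = (x_edges[:-1] + x_edges[1:]) / 2
y_mid = (y_edges[:-1] + y_edges[1:]) / 2

# indexing="ij" so that xx[i, j], yy[i, j] line up with counts[i, j]
xx, yy = np.meshgrid(x_mid, y_mid, indexing="ij")

fig, ax = plt.subplots()
ax.scatter(xx.ravel(), yy.ravel(), s=counts.ravel() * 3, alpha=0.5)
```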

matplotlib: drawing lines between points ignoring missing data

I have a set of data which I want plotted as a line-graph. For each series, some data is missing (but different for each series). Currently matplotlib does not draw lines which skip missing data: for example
import matplotlib.pyplot as plt
xs = range(8)
series1 = [1, 3, 3, None, None, 5, 8, 9]
series2 = [2, None, 5, None, 4, None, 3, 2]
plt.plot(xs, series1, linestyle='-', marker='o')
plt.plot(xs, series2, linestyle='-', marker='o')
plt.show()
results in a plot with gaps in the lines. How can I tell matplotlib to draw lines through the gaps? (I'd rather not have to interpolate the data).
You can mask the NaN values this way:
import numpy as np
import matplotlib.pyplot as plt
xs = np.arange(8)
series1 = np.array([1, 3, 3, None, None, 5, 8, 9]).astype(np.double)
s1mask = np.isfinite(series1)
series2 = np.array([2, None, 5, None, 4, None, 3, 2]).astype(np.double)
s2mask = np.isfinite(series2)
plt.plot(xs[s1mask], series1[s1mask], linestyle='-', marker='o')
plt.plot(xs[s2mask], series2[s2mask], linestyle='-', marker='o')
plt.show()
This leads to lines that connect the remaining points across the gaps.
Quoting @Rutger Kassies:
Matplotlib only draws a line between consecutive (valid) data points,
and leaves a gap at NaN values.
A solution if you are using pandas:
# pd.Series
s.dropna().plot()  # masking (as @Thorsten Kranz suggested)
#pd.DataFrame
df['a_col_ffill'] = df['a_col'].ffill()
df['b_col_ffill'] = df['b_col'].ffill() # changed from a to b
df[['a_col_ffill','b_col_ffill']].plot()
A solution with pandas:
import matplotlib.pyplot as plt
import pandas as pd
def splitSerToArr(ser):
    # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
    return [ser.index, ser.to_numpy()]
xs = range(8)
series1 = [1, 3, 3, None, None, 5, 8, 9]
series2 = [2, None, 5, None, 4, None, 3, 2]
s1 = pd.Series(series1, index=xs)
s2 = pd.Series(series2, index=xs)
plt.plot( *splitSerToArr(s1.dropna()), linestyle='-', marker='o')
plt.plot( *splitSerToArr(s2.dropna()), linestyle='-', marker='o')
plt.show()
The splitSerToArr function is very handy when plotting pandas Series with matplotlib.
Without interpolation you'll need to remove the None values from the data. This also means you'll need to remove the x values corresponding to the None values in the series. Here's an (ugly) one-liner for doing that:
x1Clean,series1Clean = zip(* filter( lambda x: x[1] is not None , zip(xs,series1) ))
The lambda function returns False for None values, so filter drops those (x, series) pairs from the list. The outer zip then unpacks the remaining pairs back into separate x and y sequences.
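The same filtering can be written with a list comprehension, which some may find easier to read. This is a small sketch using the data from the question; the names pairs, x1_clean and series1_clean are chosen for the example.

```python
xs = range(8)
series1 = [1, 3, 3, None, None, 5, 8, 9]

# Keep only the (x, y) pairs whose y value is present
pairs = [(x, y) for x, y in zip(xs, series1) if y is not None]

# Unzip the pairs back into separate x and y tuples
x1_clean, series1_clean = zip(*pairs)
# x1_clean      -> (0, 1, 2, 5, 6, 7)
# series1_clean -> (1, 3, 3, 5, 8, 9)
```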
For what it may be worth, after some trial and error I would like to add one clarification to Thorsten's solution, hopefully saving time for users who looked elsewhere after trying this approach.
I could not get it to work for an identical problem while using
from pyplot import *
and attempting to plot with
plot(abscissa[mask],ordinate[mask])
It seemed it was required to use import matplotlib.pyplot as plt to get the proper NaNs handling, though I cannot say why.
Another solution for pandas DataFrames:
plot = df.plot(style='o-') # draw the lines so they appear in the legend
colors = [line.get_color() for line in plot.lines] # get the colors of the markers
df = df.interpolate(limit_area='inside') # interpolate
lines = plot.plot(df.index, df.values) # add more lines (with a new set of colors)
for color, line in zip(colors, lines):
    line.set_color(color) # overwrite the new lines' colors with the colors of the old lines
I had the same problem, but masking removed the points in between and the line was still cut (the pink segments visible in the plot were the only consecutive non-NaN data, which is why they were drawn). Here is the code that masks the data (the result still had gaps):
xs = df['time'].to_numpy()
series1 = np.array(df['zz'].to_numpy()).astype(np.double)
s1mask = np.isfinite(series1)
fplt.plot(xs[s1mask], series1[s1mask], ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ')
This may be because I was using finplot (to plot a candlestick chart), so I decided to compute the missing y-axis points with the linear formula y2 - y1 = m(x2 - x1) and wrote a function that generates the y values between the missing points.
def fillYLine(y):
    # Line formula
    fi = 0
    first = None
    next = None
    for i in range(0, len(y), 1):
        ne = not np.isnan(y[i])
        next = y[i] if ne else next
        if next is not None:
            if first is not None:
                m = (first - next) / (i - fi)  # m = (y1 - y2) / (x1 - x2)
                cant_points = np.abs(i - fi) - 1
                if cant_points > 0:
                    # Create the line between the two known values to generate the missing points
                    points = createLine(next, first, i, fi, cant_points)
                    x = 1
                    for p in points:
                        y[fi + x] = p
                        x = x + 1
            first = next
            fi = i
        next = None
    return y

def createLine(y2, y1, x2, x1, cant_points):
    m = (y2 - y1) / (x2 - x1)  # slope
    points = []
    x = x1 + 1  # first point to assign
    for i in range(0, cant_points, 1):
        y = ((m * (x2 - x)) - y2) * -1
        points.append(y)
        # The x values are plain positions, not times, but they are assigned in the same order
        x = x + 1
    return points
Then I simply call the function to fill the gaps, as in y = fillYLine(y), and my finplot code looked like this:
x = df['time'].to_numpy()
y = df['zz'].to_numpy()
y = fillYLine(y)
fplt.plot(x, y, ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ')
Keep in mind that the data in the y variable is only for the plot. I need the NaN values in between for other operations (or to remove them from the list), which is why I created a separate y variable from the pandas column df['zz'].
Note: I noticed that the data gets eliminated in my case because, if I don't mask x (xs), the values slide left in the graph. They then become consecutive non-NaN values and a continuous line is drawn, but shrunk to the left:
fplt.plot(xs, series1[s1mask], ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ') #No xs masking (xs[masking])
This made me think that masking works for some people because they are only plotting that one line, or because there is little difference between the masked and unmasked data (few gaps, unlike my data, which has many).
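The same interior gap filling can be sketched more compactly with np.interp. This is a swapped-in technique, not the code above; the function name fill_gaps_linear is made up for the example, and it assumes evenly spaced positions, just as fillYLine does. Leading and trailing NaN values are left untouched.

```python
import numpy as np

def fill_gaps_linear(y):
    """Linearly fill NaN runs that lie between two valid values."""
    y = np.asarray(y, dtype=float).copy()
    idx = np.arange(len(y))
    valid = np.isfinite(y)
    if valid.sum() < 2:
        return y  # nothing to interpolate between
    first, last = idx[valid][0], idx[valid][-1]
    # Only positions strictly inside the valid range get filled
    inside = (idx >= first) & (idx <= last) & ~valid
    y[inside] = np.interp(idx[inside], idx[valid], y[valid])
    return y

filled = fill_gaps_linear([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
# filled -> [nan, 1., 2., 3., 4., nan]
```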
Perhaps I missed the point, but I believe Pandas now does this automatically. The example below is a little involved, and requires internet access, but the line for China has lots of gaps in the early years, hence the straight line segments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read data from Maddison project
url = 'http://www.ggdc.net/maddison/maddison-project/data/mpd_2013-01.xlsx'
mpd = pd.read_excel(url, skiprows=2, index_col=0, na_values=[' '])
mpd.columns = list(map(str.rstrip, mpd.columns))  # list() so Python 3 assigns actual names, not a map object
# select countries
countries = ['England/GB/UK', 'USA', 'Japan', 'China', 'India', 'Argentina']
mpd = mpd[countries].dropna()
mpd = mpd.rename(columns={'England/GB/UK': 'UK'})
mpd = np.log(mpd)/np.log(2) # convert to log2
# plots
ax = mpd.plot(lw=2)
ax.set_title('GDP per person', fontsize=14, loc='left')
ax.set_ylabel('GDP Per Capita (1990 USD, log2 scale)')
ax.legend(loc='upper left', fontsize=10, handlelength=2, labelspacing=0.15)
fig = ax.get_figure()
fig.show()
