How to create n subplots (box plots) automatically? - python

I need to show n (e.g. 5) box plots. How can I do it?
df =
col1 col2 col3 col4 col5 result
1 3 1 1 4 0
1 2 2 4 9 1
1 2 1 3 7 1
This is my current code. But it does not display the data inside plots. Also, plots are very thin if n is for example 10 (is it possible to go to a new line automatically?).
n=5
columns = df.columns
i = 0
fig, axes = plt.subplots(1, n, figsize=(20,5))
for ax in axes:
    df.boxplot(by="result", column = [columns[i]], vert=False, grid=True)
    i = i + 1
display(fig)
This example is for Azure Databricks, but I appreciate just a matplotlib solution as well if it's applicable.

I am not sure I understood what you are trying to do, but the following code will show you the plots. You can control the figure size by changing the figsize values (10, 10).
Code:
df.boxplot(by="result",figsize=(10,10));
Result:
To draw the boxes horizontally (vert=False) and show the grid:
df.boxplot(by="result",figsize=(10,10),vert=False, grid=True);

I solved it myself as follows:
df.boxplot(by="result", column = columns[0:4], vert=False, grid=True, figsize=(30,10), layout = (3, 5))

If you want additional rows to be generated while keeping the number of columns fixed, adjust the layout as follows:
In [41]: ncol = 2
In [42]: df
Out[42]:
v0 v1 v2 v3 v4 v5 v6
0 0 3 6 9 12 15 18
1 1 4 7 10 13 16 19
2 2 5 8 11 14 17 20
In [43]: df.boxplot(by='v6', layout=(df.shape[1] // ncol + 1, ncol)) # floor division plus one gives enough rows for all the columns
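For the plain matplotlib route asked about in the question, here is a minimal sketch of the same idea. It assumes the missing piece in the question's loop is passing ax=ax to df.boxplot, and it computes the row count with a ceiling instead of floor division plus one, so no empty row is added when the plots fill the grid exactly. The names grid_shape and boxplot_grid are made up for this example and are not part of pandas or matplotlib.

```python
import math


def grid_shape(n_plots, ncol):
    """Return (nrow, ncol) with just enough rows to hold n_plots axes."""
    return math.ceil(n_plots / ncol), ncol


def boxplot_grid(df, by, ncol=3):
    """Draw one box plot per column of df (except `by`) on a wrapped grid.

    Hypothetical helper for illustration; not a pandas function.
    """
    import matplotlib.pyplot as plt  # imported here so grid_shape works without it

    cols = [c for c in df.columns if c != by]
    nrow, ncol = grid_shape(len(cols), ncol)
    # squeeze=False keeps `axes` two-dimensional even for a single row or column
    fig, axes = plt.subplots(nrow, ncol, figsize=(5 * ncol, 3 * nrow), squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, cols):
        # ax=ax tells pandas to draw into the prepared subplot
        df.boxplot(column=col, by=by, ax=ax)
    for ax in flat[len(cols):]:
        ax.set_visible(False)  # hide unused cells in the last row
    fig.tight_layout()
    return fig
```

With the question's frame, boxplot_grid(df, by="result", ncol=3) would put the five columns on a 2 x 3 grid and hide the sixth cell.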

Related

How do the subplot indices work in Python?

I'm trying to make sense of how the subplot indices work, but they don't seem intuitive at all. I particularly have an issue with the third index. I know that there are other ways to create subplots in Python, but I am trying to understand how subplots written in this manner work because they are used extensively.
I am trying to use a trivial example to see if I understand what I'm doing. So, here's what I want to do:
Row 1 has 3 columns
Row 2 has 2 columns
Row 3 has 3 columns
Rows 4 and 5 have 2 columns. However, I want to have the left subplot span rows 4 and 5.
This is the code for the first 3 rows. I don't understand why the third index of ax4 is 3 instead of 4.
ax1 = plt.subplot(5,3,1)
ax2 = plt.subplot(5,3,2)
ax3 = plt.subplot(5,3,3)
ax4 = plt.subplot(5,2,3)
ax5 = plt.subplot(5,2,4)
ax6 = plt.subplot(5,3,7)
ax7 = plt.subplot(5,3,8)
ax8 = plt.subplot(5,3,9)
For the three subplots that sit in rows 4 and 5, I can't seem to be able to do that. Here's my wrong attempt:
ax9 = plt.subplot(4,2,10)
ax10 = plt.subplot(5,2,12)
ax11 = plt.subplot(5,2,15)
The indices are from left to right, and then wrap at the end of the row. So subplot(2, 3, x):
1 2 3
4 5 6
For your example, with subplot(5, 3, x), the subplots are indexed:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
For the ax4=subplot(5, 2, x) they are indexed:
1 2
3 4
5 6
7 8
9 10
To span subplots, pass the start and stop indices as a tuple:
ax9 = plt.subplot(5, 2, (7, 9))
ax10 = plt.subplot(5, 2, (8, 10))
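As a rough sketch of the indexing rule described in this answer, the helper below maps a 1-based subplot index to its row and column. It shows why ax4 uses index 3: in a 5 x 2 grid the first row holds indices 1 and 2, so the second row starts at 3. The second function is one possible way to build the whole layout from the question. It assumes the right-hand side of rows 4 and 5 should hold two separate subplots, and index_to_cell and build_layout are names invented for this example.

```python
def index_to_cell(index, ncols):
    """Map a 1-based subplot index to a 0-based (row, col) grid cell."""
    return divmod(index - 1, ncols)


def build_layout():
    """Create the mixed 3-column / 2-column layout described in the question."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig = plt.figure(figsize=(8, 10))
    axes = [
        plt.subplot(5, 3, 1), plt.subplot(5, 3, 2), plt.subplot(5, 3, 3),  # row 1
        plt.subplot(5, 2, 3), plt.subplot(5, 2, 4),                        # row 2
        plt.subplot(5, 3, 7), plt.subplot(5, 3, 8), plt.subplot(5, 3, 9),  # row 3
        plt.subplot(5, 2, (7, 9)),  # left cell spanning rows 4 and 5
        plt.subplot(5, 2, 8),       # right cell, row 4
        plt.subplot(5, 2, 10),      # right cell, row 5
    ]
    fig.tight_layout()
    return fig, axes
```

Each call picks the grid (5 x 3 or 5 x 2) that matches the number of columns wanted in that row; the row count stays at 5 throughout so the rows line up.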

Making a bar chart to represent the number of occurrences in a Pandas Series

I was wondering if anyone could help me with how to make a bar chart to show the frequencies of values in a Pandas Series.
I start with a Pandas DataFrame of shape (2000, 7), and from there I extract the last column. The column is shape (2000,).
The entries in the Series that I mentioned vary from 0 to 17, each with different frequencies, and I tried to plot them using a bar chart but faced some difficulties. Here is my code:
# First, I counted the number of occurrences.
count = np.zeros(max(data_val))
for i in range(count.shape[0]):
    for j in range(data_val.shape[0]):
        if (i == data_val[j]):
            count[i] = count[i] + 1
'''
This gives us
count = array([192., 105., ... 19.])
'''
temp = np.arange(0, 18, 1) # Array for the x-axis.
plt.bar(temp, count)
I am getting an error on the last line of code, saying that the objects cannot be broadcast to a single shape.
What I ultimately want is a bar chart where each bar corresponds to an integer value from 0 to 17, and the height of each bar (i.e. the y-axis) represents the frequencies.
Thank you.
UPDATE
I decided to post the fixed code using the suggestions that people were kind enough to give below, just in case anybody facing similar issues will be able to see my revised code in the future.
data = pd.read_csv("./data/train.csv") # Original data is a (2000, 7) DataFrame
# data contains 6 feature columns and 1 target column.
# Separate the design matrix from the target labels.
X = data.iloc[:, :-1]
y = data['target']
'''
The next line of code uses pandas.Series.value_counts() on y in order to count
the number of occurrences for each label, and then proceeds to sort these according to
index (i.e. label).
You can also use pandas.Series.sort_values() instead if you're interested in sorting
according to the counts rather than the labels.
'''
y.value_counts().sort_index().plot.bar(x='Target Value', y='Number of Occurrences')
There was no need to use for loops if we use the methods that are built into the Pandas library.
The specific methods that were mentioned in the answers are pandas.Series.value_counts(), pandas.Series.sort_index(), and pandas.Series.plot.bar().
I believe you need value_counts with Series.plot.bar:
df = pd.DataFrame({
'a':[4,5,4,5,5,4],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[1,1,6,1,6,5],
})
print (df)
a b c d
0 4 7 1 1
1 5 8 3 1
2 4 9 5 6
3 5 4 7 1
4 5 2 1 6
5 4 3 0 5
df['d'].value_counts(sort=False).plot.bar()
If some values may be missing and you need to set their counts to 0, add reindex:
df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0).plot.bar()
Detail:
print (df['d'].value_counts(sort=False))
1 3
5 1
6 2
Name: d, dtype: int64
print (df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0))
0 0
1 3
2 0
3 0
4 0
5 1
6 2
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
Name: d, dtype: int64
Here's an approach using Seaborn
import numpy as np
import pandas as pd
import seaborn as sns
s = pd.Series(np.random.choice(17, 10))
s
# 0 10
# 1 13
# 2 12
# 3 0
# 4 0
# 5 5
# 6 13
# 7 9
# 8 11
# 9 0
# dtype: int64
val, cnt = np.unique(s, return_counts=True)
val, cnt
# (array([ 0, 5, 9, 10, 11, 12, 13]), array([3, 1, 1, 1, 1, 1, 2]))
sns.barplot(x=val, y=cnt)
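On the broadcast error itself: a likely cause is that np.zeros(max(data_val)) creates 17 slots when the largest label is 17, while the x-axis array has 18 entries. The sketch below sizes the counts by the number of labels instead, which also keeps a zero for any label that never occurs. It uses only the standard library for the counting; count_occurrences and plot_counts are names made up for this example.

```python
from collections import Counter


def count_occurrences(values, n_labels):
    """Return a list of length n_labels where entry k is how often k occurs."""
    seen = Counter(values)
    return [seen.get(k, 0) for k in range(n_labels)]


def plot_counts(values, n_labels=18):
    """Bar chart with one bar per label 0..n_labels-1 (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    counts = count_occurrences(values, n_labels)
    fig, ax = plt.subplots()
    # x positions and heights have the same length, so nothing needs broadcasting
    ax.bar(range(n_labels), counts)
    ax.set_xlabel("Target Value")
    ax.set_ylabel("Number of Occurrences")
    return fig
```

For example, count_occurrences(df['d'], 18) on the small frame above gives the same counts as the reindex version: 3 for label 1, 1 for label 5, 2 for label 6, and 0 elsewhere.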

matplotlib color line by "value" [duplicate]

Various versions of this question have been asked before, and I'm not sure if I'm supposed to ask my question on one of the threads or start a new thread. Here goes:
I have a pandas dataframe where there is a column (eg: speed) that I'm trying to plot, and then another column (eg: active) which is, for now, true/false. Depending on the value of active, I'd like to color the line plot.
This thread seems to be the "right" solution, but I'm having an issue:
"seaborn or matplotlib line chart, line color depending on variable". The OP and I are trying to achieve the same thing.
Here's a broken plot/reproducer:
Values=[3,4,6, 6,5,4, 3,2,3, 4,5,6]
Colors=['red','red', 'red', 'blue','blue','blue', 'red', 'red', 'red', 'blue', 'blue', 'blue']
myf = pd.DataFrame({'speed': Values, 'colors': Colors})
grouped = myf.groupby('colors')
fig, ax = plt.subplots(1)
for key, group in grouped:
    group.plot(ax=ax, y="speed", label=key, color=key)
The resultant plot has two issues: not only are the changed color lines not "connected", but the colors themselves connect "across" the end points:
What I want to see is the change from red to blue and back look like it's all one contiguous line.
"Color line by third variable - Python" seems to do the right thing, but I am not dealing with "linear" color data. I am basically assigning a set of line colors in a column. I could easily set the values of the color column to numericals:
Colors=['1','1', '1', '2','2'...]
if that makes generating the desired plot easier.
There is a comment in the first thread:
You could do it if you'll duplicate points when color changed, I've
modified answer for that
But I basically copied and pasted the answer, so I'm not sure that comment is entirely accurate.
Setup
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Values=[3,4,6, 6,5,4, 3,2,3, 4,5,6]
Colors=['red','red', 'red', 'blue','blue','blue', 'red', 'red', 'red', 'blue', 'blue', 'blue']
myf = pd.DataFrame({'speed': Values, 'colors': Colors})
Solution
1. Detect color change-points and label subgroups of contiguous colors, based on Pandas "diff()" with string
myf['change'] = myf.colors.ne(myf.colors.shift().bfill()).astype(int)
myf['subgroup'] = myf['change'].cumsum()
myf
colors speed change subgroup
0 red 3 0 0
1 red 4 0 0
2 red 6 0 0
3 blue 6 1 1
4 blue 5 0 1
5 blue 4 0 1
6 red 3 1 2
7 red 2 0 2
8 red 3 0 2
9 blue 4 1 3
10 blue 5 0 3
11 blue 6 0 3
2. Create gaps in the index in which to fit duplicated rows between color subgroups
myf.index += myf['subgroup'].values
myf
colors speed change subgroup
0 red 3 0 0
1 red 4 0 0
2 red 6 0 0
4 blue 6 1 1 # index is now 4; 3 is missing
5 blue 5 0 1
6 blue 4 0 1
8 red 3 1 2 # index is now 8; 7 is missing
9 red 2 0 2
10 red 3 0 2
12 blue 4 1 3 # index is now 12; 11 is missing
13 blue 5 0 3
14 blue 6 0 3
3. Save the indexes of each subgroup's first row
first_i_of_each_group = myf[myf['change'] == 1].index
first_i_of_each_group
Int64Index([4, 8, 12], dtype='int64')
4. Copy each group's first row to the previous group's last row
for i in first_i_of_each_group:
    # Copy next group's first row to current group's last row
    myf.loc[i-1] = myf.loc[i]
    # But make this new row part of the current group
    myf.loc[i-1, 'subgroup'] = myf.loc[i-2, 'subgroup']
# Don't need the change col anymore
myf.drop('change', axis=1, inplace=True)
myf.sort_index(inplace=True)
# Create duplicate indexes at each subgroup border to ensure the plot is continuous.
myf.index -= myf['subgroup'].values
myf
colors speed subgroup
0 red 3 0
1 red 4 0
2 red 6 0
3 blue 6 0 # this and next row both have index = 3
3 blue 6 1 # subgroup 1 picks up where subgroup 0 left off
4 blue 5 1
5 blue 4 1
6 red 3 1
6 red 3 2
7 red 2 2
8 red 3 2
9 blue 4 2
9 blue 4 3
10 blue 5 3
11 blue 6 3
5. Plot
fig, ax = plt.subplots()
for k, g in myf.groupby('subgroup'):
    g.plot(ax=ax, y='speed', color=g['colors'].values[0], marker='o')
ax.legend_.remove()
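The same duplicate-the-border-point idea can be sketched more compactly without touching the index. This is only an illustration of the technique, not the code from the answer above: each contiguous run of one color is extended by the first point of the following run, so neighbouring runs share an endpoint. The names color_runs and plot_color_runs are invented for this example.

```python
from itertools import groupby


def color_runs(values, colors):
    """Split a series into contiguous same-color runs.

    Each run is extended by the first point of the next run, so adjacent
    runs share an endpoint and the plotted line has no gaps.
    Returns a list of (color, xs, ys) tuples.
    """
    runs = []
    start = 0
    for color, group in groupby(colors):
        length = len(list(group))
        # +1 borrows the next run's first point; the last run has none to borrow
        stop = min(start + length + 1, len(values))
        xs = list(range(start, stop))
        runs.append((color, xs, [values[i] for i in xs]))
        start += length
    return runs


def plot_color_runs(values, colors):
    """Plot every run in its own color on a shared axis (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    for color, xs, ys in color_runs(values, colors):
        ax.plot(xs, ys, color=color, marker="o")
    return fig
```

As in the answer, the connecting segment between two runs takes the color of the run it leaves.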
I took a crack at it. Following the comments in the other question that you linked led me to this. I did have to get down to matplotlib and couldn't do it in pandas itself. Once I converted the dataframe into lists, it's pretty much the same code as the one from the mpl page.
I create the dataframe similar to yours:
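import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from matplotlib.colors import ListedColormap, BoundaryNorm
vals=[3,4,6, 6,5,4, 3,2,3, 4,5,6]
colors=['red' if x < 5 else 'blue' for x in vals]
df = pd.DataFrame({'speed': vals, 'danger': colors})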
Converting the vals and index into lists
x = df.index.tolist()
y = df['speed'].tolist()
z = np.array(list(y))
Break down the vals and index into points and then create line segments
out of them.
points = np.array([x, y]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
Create the colormap based on the criteria used while creating the dataframe. In my case, speed less than 5 is red and rest is blue.
cmap = ListedColormap(['r', 'b'])
norm = BoundaryNorm([0, 5, 10], cmap.N)  # boundary at 5 so that speeds below 5 map to red
Create the line segments and assign the colors accordingly
lc = LineCollection(segments, cmap=cmap, norm=norm)
lc.set_array(z)
Plot !
fig = plt.figure()
plt.gca().add_collection(lc)
plt.xlim(min(x), max(x))
plt.ylim(0, 10)
Here is the output:
Note: In the current code, the color of the line segment is dependent on the starting point. But hopefully, this gives you an idea.
I'm still new to answering questions here. Let me know if I need to add/remove some details. Thanks!
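Since the question's colors come from a column rather than from a numeric threshold, here is a hedged variation on the LineCollection approach that skips the colormap and passes one explicit color per segment. It is a sketch, not the answerer's code: build_segments and plot_colored_line are names made up here, and each segment is assumed to take the color of its starting point, as noted above.

```python
def build_segments(x, y):
    """Pair each point with its successor: n points give n - 1 segments."""
    points = list(zip(x, y))
    return [[points[i], points[i + 1]] for i in range(len(points) - 1)]


def plot_colored_line(y, colors):
    """Draw y against its position, coloring segment i with colors[i]."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing
    from matplotlib.collections import LineCollection

    x = list(range(len(y)))
    segments = build_segments(x, y)
    # The last color is unused because the last point starts no segment
    lc = LineCollection(segments, colors=list(colors[:-1]))
    fig, ax = plt.subplots()
    ax.add_collection(lc)
    # Axis limits are set by hand, as in the answer above
    ax.set_xlim(min(x), max(x))
    ax.set_ylim(min(y) - 1, max(y) + 1)
    return fig
```

With the question's data this would be called as plot_colored_line(Values, Colors).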

Python Pandas error bars are not plotted and how to customize the index

I need your help on plotting error bars using Pandas in Python. I have read the Pandas documentation, and did some trials and errors, but got no satisfying result.
Here is my code:
'''
usage : (python) rc-joint-plot-error-bar.py
'''
from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
filename = 'rc-plot-error-bar.csv'
df = pd.read_csv(filename, low_memory = False)
headers = ['Specimen', 'CA_Native_Mean', 'CA_Implant_Mean', 'CP_Native_Mean',
           'CP_Implant_Mean', 'CA_Native_Error', 'CA_Implant_Error', 'CP_Native_Error',
           'CP_Implant_Error']
for header in headers :
    df[header] = pd.to_numeric(df[header], errors = 'coerce')
CA_means = df[['CA_Native_Mean','CA_Implant_Mean']]
CA_errors = df[['CA_Native_Error','CA_Implant_Error']]
CP_means = df[['CP_Native_Mean', 'CP_Implant_Mean']]
CP_errors = df[['CP_Native_Error', 'CP_Implant_Error']]
CA_means.plot.bar(yerr=CA_errors)
CP_means.plot.bar(yerr=CP_errors)
plt.show()
Here is what my dataframe looks like:
Specimen CA_Native_Mean CA_Implant_Mean CP_Native_Mean CP_Implant_Mean \
0 1 1 0.738366 1 1.087530
1 2 1 0.750548 1 1.208398
2 3 1 0.700343 1 1.394535
3 4 1 0.912814 1 1.324024
4 5 1 1.782425 1 1.296495
5 6 1 0.415147 1 0.479259
6 7 1 0.934014 1 1.084714
7 8 1 0.526591 1 0.873022
8 9 1 1.409730 1 2.051518
9 10 1 1.745822 1 2.134407
CA_Native_Error CA_Implant_Error CP_Native_Error CP_Implant_Error
0 0 0.096543 0 0.283576
1 0 0.076927 0 0.281199
2 0 0.362881 0 0.481450
3 0 0.400091 0 0.512375
4 0 2.732206 0 1.240796
5 0 0.169731 0 0.130892
6 0 0.355951 0 0.272396
7 0 0.258266 0 0.396502
8 0 0.360461 0 0.451923
9 0 0.667345 0 0.404856
When I run the code, I get the following figures:
My questions are:
Could you please let me know how to make the error bars appear in the figures?
How to change the index (the values of x-axis) from 0-9 into 1-10?
Big thanks!
Regards,
Arnold A.
You're almost there!
For your error bars to show up in the plot, the column names in yerr should match those of the data in the bar plot. Try renaming CA_errors.
For changing x-labels, try ax.set_xticklabels(df.Specimen);
_, ax= plt.subplots()
CA_means = df[['CA_Native_Mean','CA_Implant_Mean']]
CA_errors = df[['CA_Native_Error','CA_Implant_Error']].\
    rename(columns={'CA_Native_Error':'CA_Native_Mean',
                    'CA_Implant_Error':'CA_Implant_Mean'})
CA_means.plot.bar(yerr=CA_errors, ax=ax)
ax.set_xticklabels(df.Specimen);
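The renaming step can be generalised so that it covers both the CA and CP plots. The sketch below assumes the column naming pattern from the question (..._Mean and ..._Error) and uses Specimen as the index, so the x-axis shows 1 to 10 without setting tick labels by hand. error_to_mean_names and plot_with_errors are hypothetical helper names.

```python
def error_to_mean_names(error_columns):
    """Build a rename mapping from each '*_Error' column to its '*_Mean' twin."""
    return {name: name.replace("_Error", "_Mean") for name in error_columns}


def plot_with_errors(df, prefix="CA"):
    """Bar plot of the two mean columns for `prefix`, with matching error bars.

    Expects a pandas DataFrame shaped like the one in the question.
    """
    indexed = df.set_index("Specimen")  # x-axis labels become the specimen numbers
    mean_cols = [f"{prefix}_Native_Mean", f"{prefix}_Implant_Mean"]
    err_cols = [f"{prefix}_Native_Error", f"{prefix}_Implant_Error"]
    # yerr is matched to the data by column name, so the errors borrow the mean names
    errors = indexed[err_cols].rename(columns=error_to_mean_names(err_cols))
    return indexed[mean_cols].plot.bar(yerr=errors, capsize=3)
```

Calling plot_with_errors(df, "CA") and plot_with_errors(df, "CP") followed by plt.show() would then replace the two plot.bar lines in the question.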

How to draw bar in python

I want to draw a bar chart for the data below:
4 1406575305 4
4 -220936570 2
4 2127249516 2
5 -1047108451 4
5 767099153 2
5 1980251728 2
5 -2015783241 2
6 -402215764 2
7 927697904 2
7 -631487113 2
7 329714360 2
7 1905727440 2
8 1417432814 2
8 1906874956 2
8 -1959144411 2
9 859830686 2
9 -1575740934 2
9 -1492701645 2
9 -539934491 2
9 -756482330 2
10 1273377106 2
10 -540812264 2
10 318171673 2
The 1st column is the x-axis and the 3rd column is for y-axis. Multiple data exist for same x-axis value. For example,
4 1406575305 4
4 -220936570 2
4 2127249516 2
This means three bars for the x-axis value 4, and each bar is labelled with its tag (the value in the middle column). The sample bar chart looks like this:
http://matplotlib.org/examples/pylab_examples/barchart_demo.html
I am using matplotlib.pyplot and np. Thanks..
I followed the tutorial you linked to, but it's a bit tricky to shift them by a nonuniform amount:
import numpy as np
import matplotlib.pyplot as plt
x, label, y = np.genfromtxt('tmp.txt', dtype=int, unpack=True)
ux, uidx, uinv = np.unique(x, return_index=True, return_inverse=True)
max_width = np.bincount(x).max()
bar_width = 1/(max_width + 0.5)
locs = x.astype(float)
shifted = []
for i in range(max_width):
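    where = np.setdiff1d(uidx + i, shifted)
    locs[where[where<len(locs)]] += i*bar_width
    shifted = np.concatenate([shifted, where])
# align='edge' keeps each bar's left edge at locs; newer matplotlib centers bars by default
plt.bar(locs, y, bar_width, align='edge')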
If you want you can label them with the second column instead of x:
plt.xticks(locs + bar_width/2, label, rotation=-90)
I'll leave doing both of them as an exercise to the reader (mainly because I have no idea how you want them to show up).
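The shifting logic above may be easier to follow written as a plain loop: each row is moved right by one bar width for every earlier row that has the same x value. This is a sketch of that idea rather than a drop-in replacement, and unlike the setdiff1d version it does not need the rows to be sorted by x. The names grouped_bar_positions and plot_grouped_bars are invented for this example.

```python
from collections import Counter


def grouped_bar_positions(x):
    """Return (locs, bar_width) so that rows sharing an x value sit side by side."""
    max_group = max(Counter(x).values())
    bar_width = 1 / (max_group + 0.5)  # same width rule as in the answer above
    seen = Counter()
    locs = []
    for value in x:
        # shift by one bar width per earlier row with the same x value
        locs.append(value + seen[value] * bar_width)
        seen[value] += 1
    return locs, bar_width


def plot_grouped_bars(x, labels, y):
    """Draw the bars and label each one with its tag (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    locs, bar_width = grouped_bar_positions(x)
    fig, ax = plt.subplots()
    ax.bar(locs, y, bar_width)  # bars are centered on locs by default
    ax.set_xticks(locs)
    ax.set_xticklabels(labels, rotation=90)
    return fig
```

The three columns read by np.genfromtxt in the answer can be passed straight in as plot_grouped_bars(x, label, y).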
