Seaborn, violin plot with one data per column - python

I would like to combine this violin plot http://seaborn.pydata.org/generated/seaborn.violinplot.html (fourth example, with split=True) with this one http://seaborn.pydata.org/examples/elaborate_violinplot.html.
I have a DataFrame with a Success column (Yes or No) and several data columns. For example:
df = pd.DataFrame(
{"Success": 50 * ["Yes"] + 50 * ["No"],
"A": np.random.randint(1, 7, 100),
"B": np.random.randint(1, 7, 100)}
)
A B Success
0 6 4 Yes
1 6 2 Yes
2 1 1 Yes
3 1 2 Yes
.. .. .. ...
95 4 4 No
96 2 1 No
97 2 6 No
98 2 3 No
99 2 1 No
I would like to draw a violin plot for each data column. This works:
import seaborn as sns
sns.violinplot(data=df[["A", "B"]], inner="quartile", bw=.15)
Now I would like to split each violin according to the Success column. However, using hue="Success" raises the error Cannot use 'hue' without 'x' or 'y'. How can I plot the violins split by the "Success" column?

If I understand your question correctly, you need to reshape your dataframe into long format:
df = pd.melt(df, value_vars=['A', 'B'], id_vars='Success')
sns.violinplot(x='variable', y='value', hue='Success', data=df)
plt.show()
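A self-contained sketch of the same idea, for readers who want to run it end to end. It assumes a recent seaborn and uses a seeded NumPy generator in place of the question's np.random.randint calls; the names rng and long_df are illustrative only. Adding split=True gives the half-violins the question asks for, which works because Success has exactly two levels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Same layout as the question's frame; the seed is only there for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Success": 50 * ["Yes"] + 50 * ["No"],
    "A": rng.integers(1, 7, 100),
    "B": rng.integers(1, 7, 100),
})

# Long format: one row per (Success, variable, value) observation.
long_df = pd.melt(df, id_vars="Success", value_vars=["A", "B"])

# One violin per original column, each split into a Yes half and a No half.
ax = sns.violinplot(data=long_df, x="variable", y="value",
                    hue="Success", split=True, inner="quartile")
```

The melted frame has 200 rows (100 per original column) and the columns Success, variable and value.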

I was able to adapt an example of a violin plot over a DataFrame like so:
df = pd.DataFrame({"Success": 50 * ["Yes"] + 50 * ["No"],
"A": np.random.randint(1, 7, 100),
"B": np.random.randint(1, 7, 100)})
sns.violinplot(x=df.A, y=df.B, hue=df.Success, inner="quartile", split=True)
plt.show()
Clearly, it still needs some work: the A scale should be sized to fit a single half-violin, for example.

Related

Plotting dataframe where headers are 24h timestamps

I have a CSV which looks something like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Red 2151.0 1966.0 1889.0 2148.0 2112.0 2351.0 1813.0 2008.0 1841.0 1901.0 2373.0 2643.0 2322.0 1901.0 1849.0 2132.0 1877.0 1963.0 1861.0 1973.0 2468.0 3434.0 3159.0 3413.0
Blue 2122.0 2059.0 2274.0 2647.0 2136.0 2271.0 2107.0 2192.0 2403.0 2148.0 2008.0 2111.0 2061.0 2196.0 2165.0 2354.0 1931.0 2195.0 1985.0 2025.0 2463.0 2943.0 3302.0 3424.0
I need to plot this data as a scatter plot, where Red/Blue is on the X-axis, and time on the Y-axis. Here 1->24 is timestamps per hour. I am confused as usually we have a timestamp per row, but in my situation, I have to plot each row, for each timestamp.
I then have multiple such CSVs, and will need to plot all of them to one graph.
So my question is, what's the best way to plot all values of Red/Blue for each given timestamp?
When I try to do:
x = sorted_df.index
y = list(sorted_df.columns)
plt.scatter(x, y)
plt.show()
I get a ValueError: x and y must be the same size - and this is my main source of confusion because x and y will never be the same. In the above example, x will always be 2 and y will always be 24!
Any help is much appreciated!
Maybe you could transpose (.T, short for .transpose) your dataframe, iterate its (now) columns and add the scatter plot of each column ("Red", "Blue", etc.) in the same plot:
# Assuming that your index is ["Red", "Blue", ...]
# Assuming that your columns are [1, 2, ...]
sorted_df = sorted_df.T
# Now your index is [1, 2, ...] and columns are ["Red", "Blue", ...]
for column in sorted_df.columns:
    # x: [1, 2, ...] (always the same)
    # y: Values for each column (first "Red", then "Blue", and so on)
    plt.scatter(sorted_df.index, sorted_df[column])
# Display plot
plt.show()
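A runnable sketch of the transpose approach, assuming the row labels are "Red" and "Blue" and the hour columns are integers. Only the first three values of each row from the question are used, and the column labels 1, 2, 3 and the name by_hour are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Truncated sample: first three hourly values of each row in the question.
sorted_df = pd.DataFrame(
    [[2151.0, 1966.0, 1889.0],
     [2122.0, 2059.0, 2274.0]],
    index=["Red", "Blue"],
    columns=[1, 2, 3],  # assumed hour labels
)

# After transposing, each hour is a row and each series is a column,
# so x (hours) and y (values) have the same length for every series.
by_hour = sorted_df.T

fig, ax = plt.subplots()
for name in by_hour.columns:
    ax.scatter(by_hour.index, by_hour[name], label=name, color=name.lower())
ax.set_xlabel("hour")
ax.set_ylabel("value")
ax.legend()
```

For several CSV files, the same loop could be repeated on one shared ax so that all series end up in a single figure.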

Multiple datasets and fit line

I have different datasets:
Df1
X Y
1 1
2 5
3 14
4 36
5 90
Df2
X Y
1 1
2 5
3 21
4 38
5 67
Df3
X Y
1 1
2 5
3 10
4 50
5 78
I would like to determine a line which fits this data and plot all data in one chart (like a regression).
On the x axis I have the time; on the y axis I have the frequency of an event that occurs.
Any help on the approach on how to determine the line and plot the results keeping the different legends (would be ok with seaborn or matplotlib) would be helpful.
What I have done so far is plotting the three lines as follows:
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
columns =['Dataset', 'X', 'Y']).set_index('Dataset', inplace=False)
plot_df= plot_df.apply(pd.Series.explode).reset_index() # this step should transpose the resulting df and explode the values
# plot
fig, ax = plt.subplots(figsize=(10,8))
for name, group in plot_df.groupby('Dataset'):
    group.plot(x = "X", y= "Y", ax=ax, label=name)
Please note that the three lists at the beginning contain information on the three different df.
I recommend using linregress from scipy.stats, as it gives very readable code. You just need to add the fit to your loop:
from scipy.stats import linregress
for name, group in plot_df.groupby('Dataset'):
    group.plot(x = "X", y= "Y", ax=ax, label=name)
    # fit a line to the data; cast to float because explode leaves object-dtype columns
    fit = linregress(group.X.astype(float), group.Y.astype(float))
    ax.plot(group.X, group.X * fit.slope + fit.intercept, label=f'{name} fit')
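If a single line for all three datasets together is wanted instead of one fit per dataset, a pooled least-squares fit is one option. The sketch below swaps in numpy.polyfit rather than linregress and rebuilds plot_df directly from the tables in the question; a straight line is itself an assumption, since the values grow faster than linearly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Data copied from Df1, Df2 and Df3 in the question.
x = [1, 2, 3, 4, 5]
datasets = {
    "Df1": [1, 5, 14, 36, 90],
    "Df2": [1, 5, 21, 38, 67],
    "Df3": [1, 5, 10, 50, 78],
}
plot_df = pd.concat(
    [pd.DataFrame({"Dataset": name, "X": x, "Y": y}) for name, y in datasets.items()],
    ignore_index=True,
)

# Degree-1 least-squares fit over all 15 points at once.
slope, intercept = np.polyfit(plot_df["X"], plot_df["Y"], deg=1)

fig, ax = plt.subplots(figsize=(10, 8))
for name, group in plot_df.groupby("Dataset"):
    ax.scatter(group["X"], group["Y"], label=name)  # one legend entry per dataset

xs = np.array([plot_df["X"].min(), plot_df["X"].max()])
ax.plot(xs, slope * xs + intercept, color="black", label="pooled linear fit")
ax.legend()
```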

Forcing x-axis of pyplot histogram (python, pandas)

Hi, I have a DataFrame called vc that looks like this.
It is a count of scores. The score range is from 0 to 40.
However, as shown below, only a few scores have actual counts, and I can't get the histogram to have the x axis that I want.
d = {'count': [9, 30, 6, 2,3,1,1,4,1,1,2,2,6,3]}
vc = pd.DataFrame(data=d, index=[12,13,14,15,17,18,19,20,21,22,23,24,25,26])
vc
index count
12 9
13 30
14 6
15 2
17 3
18 1
19 1
20 4
...and so on
I want to make a histogram with an x axis from 0 to 40, like this: (image: the histogram I want)
However, my histogram doesn't show the scores with zero counts:
vc=vc.sort_index()
ax = vc.plot(kind='bar', legend=False)
ax.set_xlabel("score")
ax.set_ylabel("count")
ax.set_xticks(range(0,40,5))
The resulting histogram: (image omitted)
How can I produce the wanted histogram in the first image?
I've tried for hours but have sadly failed. Thank you.
Maybe not a very clever solution, but you can plot zero-height bars over the range you need, then plot your count table over them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d = {'count': [9, 30, 6, 2,3,1,1,4,1,1,2,2,6,3]}
vc = pd.DataFrame(data=d, index=[12,13,14,15,17,18,19,20,21,22,23,24,25,26])
# zero-height bars force the x axis to span the full 0..40 score range
plt.bar(x=np.arange(0, 41), height=0)
plt.bar(vc.index.to_list(), vc['count'])
plt.show()
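Another way to get the same picture, sketched here as an alternative rather than as the answer's method, is to reindex the counts over the full score range so that missing scores become explicit zeros. The name full is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

d = {'count': [9, 30, 6, 2, 3, 1, 1, 4, 1, 1, 2, 2, 6, 3]}
vc = pd.DataFrame(data=d, index=[12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26])

# Give every score from 0 to 40 a row; scores absent from vc get a count of 0.
full = vc.reindex(range(0, 41), fill_value=0)

fig, ax = plt.subplots()
ax.bar(full.index, full["count"])  # numeric x positions, unlike DataFrame.plot(kind='bar')
ax.set_xlabel("score")
ax.set_ylabel("count")
ax.set_xticks(range(0, 41, 5))
```

Using ax.bar with the numeric index matters here: DataFrame.plot(kind='bar') places bars at positions 0, 1, 2, ... regardless of the index values, which is why set_xticks in the question did not line up with the scores.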

scatter plot with multiple X features and single Y in Python

Data in form:
x1 x2
data= 2104, 3
1600, 3
2400, 3
1416, 2
3000, 4
1985, 4
y= 399900
329900
369000
232000
539900
299900
I want to draw a scatter plot that has two X features (x1 and x2) and a single Y,
but when I try
y=data.loc[:'y']
px=data.loc[:,['x1','x2']]
plt.scatter(px,y)
I get:
'ValueError: x and y must be the same size'.
So I tried this:
data=pd.read_csv('ex1data2.txt',names=['x1','x2','y'])
px=data.loc[:,['x1','x2']]
x1=px['x1']
x2=px['x2']
y=data.loc[:'y']
plt.scatter(x1,x2,y)
This time I got an empty-looking graph filled entirely with blue.
I would be grateful for any guidance.
You can only plot with one x and several y's. You could plot the different x's in a twiny axis:
fig, ax = plt.subplots()
ay = ax.twiny()
ax.scatter(df['x1'], df['y'])
ay.scatter(df['x2'], df['y'], color='r')
plt.show()
You can check the pandas functions for plotting DataFrame content; they are very powerful.
But if you want to use matplotlib, check the documentation (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html), which says that x and y must be array-like and of the same size. You are instead passing a two-column DataFrame as x.
So working code looks like this:
data = pd.read_csv("test.txt", header=None)
data
0 1 2
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
data.columns = ["x1", "x2", "y"]
data
x1 x2 y
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
# If you call scatter many times and then plt.show() a single image is created
plt.scatter(data["x1"], data["y"])
plt.scatter(data["x2"], data["y"])
plt.show()
Note that if you want to have data in an array format you can do data["x1"].values and it will return an ndarray.
You could use seaborn with a melted dataframe. seaborn.scatterplot has a hue argument, which lets you include multiple data series.
import seaborn as sns
ax = sns.scatterplot(x='value', hue='series', y='y',
data=data.melt(value_vars=['x1', 'x2'],
id_vars='y',
var_name='series'))
However, if your x values are that different, you might want to use twin axes, as in @Quang Hoang's answer.
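A note on the second attempt in the question: the third positional argument of plt.scatter is s, the marker size, so plt.scatter(x1, x2, y) uses y as marker sizes, and sizes in the hundreds of thousands would explain the solid blue figure. If the goal is to show both features in one scatter plot, one option, sketched below as an alternative to the answers above, is to put x1 on the horizontal axis and encode x2 as colour. The six rows are the sample values from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "x1": [2104, 1600, 2400, 1416, 3000, 1985],
    "x2": [3, 3, 3, 2, 4, 4],
    "y": [399900, 329900, 369000, 232000, 539900, 299900],
})

fig, ax = plt.subplots()
# Keyword arguments avoid accidentally passing a column as the marker size `s`.
points = ax.scatter(x=data["x1"], y=data["y"], c=data["x2"], cmap="viridis")
fig.colorbar(points, ax=ax, label="x2")
ax.set_xlabel("x1")
ax.set_ylabel("y")
```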

Grouping boxplots in seaborn when input is a DataFrame

I intend to plot multiple columns of a pandas DataFrame, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer for a similar problem in matplotlib (matplotlib: Group boxplots), but given that seaborn.boxplot comes with a groupby option, I thought it could be much easier to do this in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
[10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores the groupby option:
Whereas if I do this with one column, it works, thanks to another SO question (Seaborn groupby pandas Series):
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.
As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable that you want to use to bin the observations into each box.
However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". But you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.catplot(x="a", hue="b", y="c", data=df_long, kind="box")
You can use boxplot directly (I imagine that was not possible when the question was asked, but with seaborn version > 0.6 it is).
As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form", where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
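To see what each box in that plot summarises, it can help to group the long-form frame by the same two variables. This pandas-only sketch uses the question's df; the name medians is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Long format: 'b' is kept as the identifier, a1..a4 are stacked into 'a'/'c'.
df_long = pd.melt(df, "b", var_name="a", value_name="c")

# One group per (column, b) pair, i.e. one group per box in the grouped boxplot.
medians = df_long.groupby(["a", "b"])["c"].median()
print(medians)
```

Each of the eight (a, b) groups holds three observations, and the median printed for a group is the centre line of the corresponding box.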
Seaborn's groupby option takes a Series, not a DataFrame; that's why it's not working.
As a workaround, you can do this:
fig, ax = plt.subplots(1,2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    # grp is a (group key, sub-DataFrame) tuple; draw each group on its own axis
    sns.boxplot(data=grp[1], ax=ax[i])
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps
It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# reassign the result: set_axis no longer accepts inplace (removed in pandas 2.0)
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=['cluster'],
# value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
# value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):
index  cluster  psychometric_test  standard deviations from the mean
0      0.0      outcome_var_1      -1.276182
1      0.0      outcome_var_1      -1.118813
2      0.0      outcome_var_1      -1.276182
9754   0.0      outcome_var_6      0.892548
9755   0.0      outcome_var_6      1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=cluster_data_df.columns[-1],
# value_vars=cluster_data_df.columns[:-1],
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
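A small self-contained sketch of the positional variant, assuming a 2-D array of outcome values and a 1-D array of cluster labels. The arrays values and labels are made-up stand-ins for dbscan_output.data and dbscan_output.labels, and last_col is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 3 samples x 2 outcome variables, plus one label per sample.
values = np.array([[0.1, -0.2],
                   [1.3, 0.4],
                   [-0.5, 0.9]])
labels = np.array([0, 1, 0])

# reshape(-1, 1) turns the 1-D labels into a column so they can be appended on axis=1.
combined = np.concatenate([values, labels.reshape(-1, 1)], axis=1)
cluster_data_df = pd.DataFrame(combined)  # columns are the integers 0, 1, 2

# Melt by position: the last column holds the cluster label.
last_col = cluster_data_df.columns[-1]
graph_data = pd.melt(
    frame=cluster_data_df,
    id_vars=[last_col],
    var_name='psychometric_test',
    value_name='standard deviations from the mean',
)
graph_data = graph_data.rename(columns={last_col: 'cluster'})
```

The result has sample_n x variable_n rows (here 3 x 2 = 6), matching the row count formula given above.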
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
sns.set_theme(style="ticks")
fig = plt.figure(figsize=(10, 10))
sns.set(font_scale=1.2)  # font_scale is a seaborn setting, not a Figure property
sns.set_style("white")
# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
data=graph_data)
# set box alpha:
for patch in fig.ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.