I would like to plot a violin plot using Python for a multivariate regression problem, where I attempt to obtain a scalar prediction for each time series input. The libraries of choice are probably matplotlib and/or seaborn, but I'm open to alternative suggestions as well.
This is what I have:
A list [g_1,g_2,...g_n] of n ground truth values for each of my n subjects.
k time series inputs (i.e. lists) consisting of j elements for each of my n subjects. Please note that k and j don't have to be equal for each subject.
k predictions for each of my n subjects.
Example input:
Ground truth: [14,67,342,5]
Time series input: [[19,2434,23432,-123,-54],[99,23,4,-6],[1,2,3,4,5,6,7,8],[-1,-2,-3]]
Example output after performing a regression:
Predictions: [17,54,312,-2]
What I would like to obtain is a nice violin plot as shown in this tutorial. This is what my pandas data frame looks like:
dataframe = pd.DataFrame(
    {'Predictions': predictions,     # This is a list of k elements
     'Subject IDs': subjectIDs,      # This is a list of n strings
     'Ground truths': groundtruths   # This is a list of n float values
     })
Attempting to draw a plot with
sns.violinplot( ax = ax, y = dataframe["Predictions"] )
only results in:
TypeError: No loop matching the specified signature and casting was found for ufunc add
Additionally, I also already tried to follow the official seaborn documentation, using the command
ax = sns.violinplot(x="Subject IDs", y="Predictions", data=dataframe)
instead. However, this only results in
TypeError: unhashable type: 'list'
Update: If I treat the "Predictions" list as a tuple, I manage to create a plot without errors, but unfortunately it's completely messed up, as it puts all prediction values on the y-axis (see below for a snippet).
Thus, my question is: How can I draw a plot with all subject IDs on the x-axis, the ground truths on the y-axis and the probability distribution of my predictions, the corresponding mean values and a confidence interval as violin plot?
OK, I solved my problem. The problem was with my input pandas dataframe: I had to make sure that each observation was assigned exactly one single prediction, not a complete list.
This is what my data frame should have looked like:
data = pd.DataFrame(
    {'groundtruths': groundtruthsList,
     'predictions': predictionsList,
     'subjectIDs': subjectIDsList
     })
print(data.head())
groundtruths predictions subjectIDs
0 70 75.864983 01
1 70 50.814903 01
2 70 80.715569 01
3 70 70.627260 01
4 70 49.516285 01
. . . .
. . . .
. . . .
Now, as the data frame has the right format, I can easily draw nice violin plots with
sns.violinplot(x="subjectIDs", y="predictions", data=data)
A simple seaborn scatterplot can be used to nicely put the ground truth for each subject in this plot as well.
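For anyone reconstructing that long-format frame from the nested inputs, here is a minimal sketch; the subject IDs, ground truths and per-subject prediction lists below are made-up placeholders:

```python
import pandas as pd
import seaborn as sns

# Placeholder data: n = 4 subjects, each with its own list of predictions
subject_ids = ["01", "02", "03", "04"]
ground_truths = [14, 67, 342, 5]
predictions = [[17, 12, 19], [54, 60, 58], [312, 330, 305], [-2, 3, 1]]

# One row per single prediction, never a whole list per row
rows = []
for sid, gt, preds in zip(subject_ids, ground_truths, predictions):
    for p in preds:
        rows.append({"subjectIDs": sid, "groundtruths": gt, "predictions": p})
data = pd.DataFrame(rows)

ax = sns.violinplot(x="subjectIDs", y="predictions", data=data)
# Overlay each subject's ground truth on top of its violin
sns.scatterplot(x="subjectIDs", y="groundtruths", data=data, color="red", ax=ax)
```

The scatterplot overlay puts each subject's ground truth directly on top of its violin, so prediction spread and target can be compared at a glance.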
I have been looking at this tutorial here on fitting the digits dataset to a k-means cluster in Python, and some of the code is just confusing me.
I do understand this part, where we train our model using 10 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
digits.data.shape

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
The following shows us the 10 cluster centroids as images. I understand that it first creates a figure and axes with two rows, each row containing 5 subplots, and that figsize=(8, 3) sets the size of the displayed figure.
But after that I just do not understand how the commands in the for loop display the cluster centroids.
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Also, this part checks how accurate the clustering was at finding similar digits within the data. I know that we need to create a labels array of the same size as clusters, filled with zeros, so we can place our predicted labels in it.
But again, I just do not understand how they implement it inside the for loop.
import numpy as np
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
Can someone please explain what each line of the commands do? Thank you.
Question 1: How does the code plot the centroids?
It's important to see that each centroid is a point in the feature space. In other words, a centroid looks like one of the training samples. In this case, each training sample is an 8 × 8 image, although they've been flattened into rows of 64 elements (because sklearn always wants the input X to be a two-dimensional array). So each centroid also represents an 8 × 8 image.
The loop steps over the axes (a 2 × 5 matrix) and the centroids (kmeans.cluster_centers_) together. The purpose of zip is to ensure that for each Axes object there is a corresponding center (this is a common way to plot a bunch of n things into a bunch of n subplots). The centroids have been reshaped into a 10 × 8 × 8 array, so that each of the 10 centroids is the 8 × 8 image we're expecting.
Since each centroid is now a 2D array, you can use imshow to plot it.
Question 2: How does the code assign labels?
The easiest thing might be to take the code apart and run bits of it on their own. For example, take a look at clusters == 0. This is a Boolean array. You can use Boolean arrays to index other arrays of the same shape. The first line of code in the loop assigns this array to mask so we can use it.
Then we index into labels using the Boolean array (try it!) to say, "Change these values to the mode average of the corresponding elements of the label vector, i.e. digits.target." The index [0] is just needed because of what the scipy.stats.mode() function returns (again, try it out).
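To see the masking in isolation, here is a toy run with made-up clusters and target arrays (the real ones come from fit_predict and load_digits):

```python
import numpy as np
from scipy.stats import mode

# Made-up stand-ins for `clusters` and `digits.target`
clusters = np.array([0, 0, 1, 1, 0, 1])
target = np.array([7, 7, 3, 3, 9, 3])

labels = np.zeros_like(clusters)
mask = (clusters == 0)          # [True, True, False, False, True, False]
# Most common true label among the points assigned to cluster 0
most_common = np.atleast_1d(mode(target[mask]).mode)[0]
labels[mask] = most_common      # every point in cluster 0 gets label 7
```

(np.atleast_1d is used here only so the snippet works across scipy versions, where mode() has returned either an array or a scalar.)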
I have a dataframe containing confidence intervals of means for the parameters 'likes', 'retweets', 'followers' and 'pics' across 4 samples: ikke-aktant, laser, umbrella, mask. Each value is a list containing a confidence interval, e.g. [8.339078253365264, 9.023388831788864], which is the confidence interval for likes in the laser sample. A picture of the dataframe can be seen here: https://imgur.com/a/NkDckII
I want to plot it in a seaborn pointplot, where y represents the four samples, and x is likes.
So far I have:
ax = sns.pointplot(x="likes", data=df_boot, hue='sample', join=False)
Which returns error:
TypeError: Horizontal orientation requires numeric `x` variable.
I guess this is because x is a list. Is there a way to plot my confidence intervals using pointplot?
I think the problem is that you are using data that are already confidence intervals. Pointplot expects 'raw' data like the example dataset found here: github.com/mwaskom/seaborn-data/blob/master/tips.csv. So why not use the data that you used to calculate those confidence intervals?
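If the raw data is no longer available, one workaround (plain matplotlib rather than pointplot) is to draw the precomputed intervals directly with errorbar; apart from the laser interval quoted in the question, the values below are placeholders:

```python
import matplotlib.pyplot as plt

# Confidence intervals for 'likes' per sample (only the laser values are real)
intervals = {
    "ikke-aktant": [5.1, 6.4],
    "laser": [8.339078253365264, 9.023388831788864],
    "umbrella": [7.0, 8.2],
    "mask": [6.3, 7.5],
}

samples = list(intervals)
means = [(lo + hi) / 2 for lo, hi in intervals.values()]
half_widths = [(hi - lo) / 2 for lo, hi in intervals.values()]

fig, ax = plt.subplots()
# Horizontal error bars: samples on the y-axis, likes on the x-axis
ax.errorbar(means, samples, xerr=half_widths, fmt="o", capsize=4)
ax.set_xlabel("likes")
```

Each interval is drawn as its midpoint plus a symmetric half-width, which reproduces the original [low, high] bounds exactly.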
I'm having some trouble with error bars in python. I'm plotting the columns on a pandas dataframe grouped, so on this example dataframe:
unfiltered = [0.975,0.964,0.689,0.974]
filtered = [0.954,0.932,0.570,0.960]
index_df = ["Accuracy", "Recall", "Precision", "Specificity"]
column_names = ["Unfiltered", "With overhang filter"]
df = pd.DataFrame(list(zip(unfiltered,filtered)),index=index_df,columns=column_names)
So my dataframe looks like this:
Unfiltered With overhang filter
Accuracy 0.975 0.954
Recall 0.964 0.932
Precision 0.689 0.570
Specificity 0.974 0.960
And I plot it with the following lines:
plt.style.use('ggplot')
ax = df.plot.bar(rot=0)
plt.show()
I get a figure like this:
Now I want to add error bars, but I can't seem to figure out how to get a different error value for each bar. I want to use the standard deviation, and the values I have are different for each bar (for example, the std for the two recalls shown is different). My problem is that if I add:
ax = df.plot.bar(rot=0, yerr=data_errors)
where data_errors is a list with the 8 standard deviations I get:
ValueError: The lengths of the data (4) and the error 8 do not match
It does work when data_errors has only 4 elements, but then it plots the same error bars for both accuracies, recalls, etc.
Can anyone help me to keep the data grouped by index like it is, but with different error bars for each value of the dataframe?
SOLUTION
Thanks to the user Quang Hoang, I looked into sns.barplot. The solution to my problem was to create a dataframe (which I named data_df) like this:
Indicator Data Class
0 Accuracy 0.966279 Unfiltered
1 Accuracy 0.981395 Unfiltered
2 Accuracy 0.989535 Unfiltered
3 Accuracy 0.975553 Unfiltered
4 Accuracy 0.961583 Unfiltered
5 Recall 0.954545 Unfiltered
...
35 Specificity 0.941176 Filtered
36 Specificity 0.953431 Filtered
37 Specificity 0.993865 Filtered
38 Specificity 0.946012 Filtered
39 Specificity 0.953374 Filtered
Followed by:
ax = sns.barplot(x="Indicator", y= "Data",hue="Class", data=data_df, ci="sd")
This allowed me to create this figure:
where, as you can see, the error bars are different for each value and are calculated automatically.
This might not be exactly what you're looking for:
data_df.stack().plot.bar(yerr=data_errors)
I was wondering if there was an option in matplotlib to have different colors along one graph.
So far I have managed to draw a graph in a specific color, as well as multiple graphs in different colors.
However, all graphs I created so far have a single color. I was wondering if I could use column c (see below) to color different parts of one graph.
In the example, I want to use the value "0.1" in column c with index 1 to color the graph from the first to the second data point, the value "0.2" in column c with index 2 to color the graph from the second to the third data point and so on.
data for one graph:
index x y z c
1 1 2 1 0.1
2 1 2 2 0.2
3 1 3 1 0.1
I found that I could color data points dependent on a fourth column in a 3D scatter plot and was wondering if that somehow works with line plots as well.
The only "workaround" I can think of is splitting my graph data into sub-graphs (each sub-graph having only two data points, the start and end point) and coloring each according to column c of its first data point. This would result in n-1 separate graphs for n data points, however.
The solution, for anyone still looking, was to split my graph data into sub-graphs, each with only two data points (the nth and (n+1)th point), and to color each according to column c of its first data point. This results in n-1 separate graphs for n data points.
Further explanation here Line colour of 3D parametric curve in python's matplotlib.pyplot
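As an alternative to n-1 separate plot calls, matplotlib's LineCollection can color every segment individually inside a single artist; the points and c values below are made up, with one c value per segment:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# Made-up 2D points; c[i] colors the segment from point i to point i+1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.5, 2.0, 1.5])
c = np.array([0.1, 0.2, 0.1, 0.3])  # n-1 values for n points

# Build an (n-1, 2, 2) array of line segments
points = np.column_stack([x, y]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

lc = LineCollection(segments, cmap="viridis")
lc.set_array(c)  # map the c values through the colormap

fig, ax = plt.subplots()
ax.add_collection(lc)
ax.set_xlim(x.min(), x.max())
ax.set_ylim(y.min() - 0.5, y.max() + 0.5)
fig.colorbar(lc, ax=ax, label="c")
```

Note that add_collection does not autoscale the axes, hence the explicit set_xlim/set_ylim calls.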
My data is 2250 x 100. I would like to plot the output, like in http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html. However, the problem is that all the examples use only a small number of clusters, usually 2 or 3. How would you plot the output of kmeans in scipy if you wanted more clusters, like 100?
Here's what I got:
from scipy.cluster.vq import kmeans, vq

# get the centroids
centroids, _ = kmeans(data, 100)
idx, _ = vq(data, centroids)
# do some plotting here...
Maybe with 10 colors and 10 point types?
Or you could plot each cluster in a 10 x 10 grid of subplots. The first would show the relationships between clusters better; the second would allow easier inspection of an arbitrary cluster.
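A sketch of the first option, coloring a single scatter plot by cluster index; the data here is a random placeholder of the stated 2250 x 100 shape, projected onto its first two columns (PCA would be a better projection):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)
data = whiten(rng.normal(size=(2250, 100)))  # placeholder for the real data

# get the centroids (iter=1 keeps this sketch fast; scipy may drop empty clusters)
centroids, _ = kmeans(data, 100, iter=1)
idx, _ = vq(data, centroids)

# One scatter call, colored by cluster index
fig, ax = plt.subplots()
sc = ax.scatter(data[:, 0], data[:, 1], c=idx, s=5, cmap="tab20")
fig.colorbar(sc, ax=ax, label="cluster index")
```

With 100 clusters a colormap repeats hues, so this conveys the overall grouping rather than letting you pick out any single cluster; for that, the grid-of-subplots option is better.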