Plot pre-calculated confidence intervals in seaborn pointplot - python

I have a dataframe containing confidence intervals of the means of the parameters 'likes', 'retweets', 'followers', 'pics' for 4 samples: ikke-aktant, laser, umbrella, mask. Each value is a list holding the interval bounds, e.g. [8.339078253365264, 9.023388831788864], which is the confidence interval for likes in the laser sample. A picture of the dataframe can be seen here: https://imgur.com/a/NkDckII
I want to plot it in a seaborn pointplot, where y represents the four samples, and x is likes.
So far I have:
ax = sns.pointplot(x="likes", data=df_boot, hue='sample', join=False)
which returns the error:
TypeError: Horizontal orientation requires numeric `x` variable.
I guess this is because x is a list. Is there a way to plot my confidence intervals using pointplot?

I think the problem is that you are using data that are already confidence intervals. Pointplot expects 'raw' data like the example dataset found here: github.com/mwaskom/seaborn-data/blob/master/tips.csv. So why not use the data that you used to calculate those confidence intervals?
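If the raw data is not available, one possible workaround (not part of the answer above) is to skip pointplot and draw the stored intervals with matplotlib's errorbar. The sketch below assumes a frame shaped like the one in the question, with a 'sample' column and a 'likes' column of [lower, upper] lists. Only the laser interval comes from the question; the other three are made-up placeholders, and the helper names are hypothetical. The marker is placed at the midpoint of each interval, which is not necessarily the sample mean.

```python
import pandas as pd


def interval_to_errorbar(df, column):
    """Split a column of [lower, upper] lists into numeric columns."""
    bounds = pd.DataFrame(df[column].tolist(),
                          columns=["lower", "upper"], index=df.index)
    # Midpoint of the interval; this is NOT necessarily the sample mean.
    bounds["center"] = (bounds["lower"] + bounds["upper"]) / 2
    bounds["half_width"] = (bounds["upper"] - bounds["lower"]) / 2
    return bounds


def plot_intervals(df, column, label_column="sample"):
    """Draw one horizontal interval per row, samples on the y-axis."""
    # Imported here so the numeric helper works without a plotting backend.
    import matplotlib.pyplot as plt

    bounds = interval_to_errorbar(df, column)
    positions = list(range(len(df)))
    fig, ax = plt.subplots()
    ax.errorbar(bounds["center"], positions, xerr=bounds["half_width"],
                fmt="o", capsize=4)
    ax.set_yticks(positions)
    ax.set_yticklabels(df[label_column])
    ax.set_xlabel(column)
    return ax


# Placeholder data: only the 'laser' interval is taken from the question.
df_boot = pd.DataFrame({
    "sample": ["ikke-aktant", "laser", "umbrella", "mask"],
    "likes": [[5.1, 5.9],
              [8.339078253365264, 9.023388831788864],
              [6.4, 7.2],
              [7.0, 8.1]],
})
bounds = interval_to_errorbar(df_boot, "likes")
```

Calling plot_intervals(df_boot, "likes") then draws the four intervals. If the means were stored alongside the intervals, they could replace the midpoint, with asymmetric errors passed to xerr as a pair of arrays.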

Related

plot mean and confidence interval - matplotlib

I want to make a plot that splits a dataset and shows the number of observations per category on the left axis and, on the right axis, the mean of a certain observed value together with a confidence interval (e.g. 90%).
It should look like this:
I know how to use ax.hist() or ax.bar() for the first job. A second axis is easily made using ax.twinx(). However, after trying both ax.boxplot() and ax.violinplot(), I believe neither could do the job (plotting the confidence interval + mean) correctly. Any suggestions?
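One way this could be approached, as a rough sketch: compute the count, mean and interval per category by hand, draw the counts with ax.bar(), and put the mean and interval on the twinx() axis with ax.errorbar(). The function names are hypothetical, and the interval uses a normal approximation (z = 1.645 for roughly 90%), which is an assumption; a t-based interval may be more appropriate for small groups.

```python
import numpy as np


def summarize_by_category(categories, values, z=1.645):
    """Return labels, counts, means and interval half-widths per category."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    labels = np.unique(categories)
    counts, means, half_widths = [], [], []
    for label in labels:
        group = values[categories == label]
        counts.append(group.size)
        means.append(group.mean())
        # Normal-approximation half-width: z * standard error of the mean.
        # ddof=1 needs at least two observations in every category.
        half_widths.append(z * group.std(ddof=1) / np.sqrt(group.size))
    return labels, np.array(counts), np.array(means), np.array(half_widths)


def plot_counts_and_intervals(categories, values, z=1.645):
    """Bars for counts on the left axis, mean and interval on the right."""
    # Imported here so the numeric helper works without a plotting backend.
    import matplotlib.pyplot as plt

    labels, counts, means, half_widths = summarize_by_category(
        categories, values, z)
    positions = np.arange(len(labels))

    fig, ax_counts = plt.subplots()
    ax_counts.bar(positions, counts, color="lightgray")
    ax_counts.set_xticks(positions)
    ax_counts.set_xticklabels(labels)
    ax_counts.set_ylabel("observations")

    ax_values = ax_counts.twinx()
    ax_values.errorbar(positions, means, yerr=half_widths,
                       fmt="o", capsize=5)
    ax_values.set_ylabel("mean with interval")
    return ax_counts, ax_values
```

errorbar only draws what it is given, so any interval definition (bootstrap, t-based, percentile) can be substituted for the half-widths computed here.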

Plotting a violin plot with lists

I would like to draw a violin plot using Python for a multivariate regression problem. I am trying to predict a scalar value from time series input. The libraries of choice are probably matplotlib and/or seaborn, but I'm open to alternative suggestions as well.
This is what I have:
A list [g_1,g_2,...g_n] of n ground truth values for each of my n subjects.
k time series inputs (i.e. lists) consisting of j elements for each of my n subjects. Please note that k and j don't have to be equal for each subject.
k predictions for each of my n subjects.
Example input:
Ground truth: [14,67,342,5]
Time series input: [[19,2434,23432,-123,-54],[99,23,4,-6],[1,2,3,4,5,6,7,8],[-1,-2,-3]]
Example output after performing a regression:
Predictions: [17,54,312,-2]
What I would like to obtain is a nice violin plot as shown in this tutorial. This is what my pandas data frame looks like:
dataframe = pd.DataFrame(
    {'Predictions': predictions,    # This is a list of k elements
     'Subject IDs': subjectIDs,     # This is a list of n strings
     'Ground truths': groundtruths  # This is a list of n float values
    })
Attempting to draw a plot with
sns.violinplot( ax = ax, y = dataframe["Predictions"] )
only results in:
TypeError: No loop matching the specified signature and casting was found for ufunc add
Additionally, I also already tried to follow the official seaborn documentation, using the command
ax = sns.violinplot(x="Subject IDs", y="Predictions", data=dataframe)
instead. However, this only results in
TypeError: unhashable type: 'list'
Update: If I treat the "Predictions" list as a tuple, I manage to create a plot without errors but unfortunately it's completely messed up as it puts all prediction values on the y-axis (see below for a snippet).
Thus, my question is: How can I draw a plot with all subject IDs on the x-axis, the ground truths on the y-axis and the probability distribution of my predictions, the corresponding mean values and a confidence interval as violin plot?
OK, I solved my problem. The problem was with my input pandas dataframe. I had to make sure that each observation (row) was assigned exactly one prediction and not a complete list.
This is what my data frame should have looked like:
data = pd.DataFrame(
{'groundtruths': groundtruthsList,
'predictions': predictionsList,
'subjectIDs': subjectIDsList
})
print(data.head())
   groundtruths  predictions subjectIDs
0            70    75.864983         01
1            70    50.814903         01
2            70    80.715569         01
3            70    70.627260         01
4            70    49.516285         01
.             .            .          .
.             .            .          .
.             .            .          .
Now, as the data frame has the right format, I can easily draw nice violin plots with
sns.violinplot(x="subjectIDs", y="predictions", data=data)
A simple seaborn scatterplot can be used to nicely put the ground truth for each subject in this plot as well.
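For reference, a frame that stores one list of predictions per subject can be reshaped into this one-row-per-prediction layout with pandas' DataFrame.explode. This is only a sketch under assumptions: the `wide` frame and its numbers are invented for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical wide-format frame: one row per subject, predictions as a list.
wide = pd.DataFrame({
    "subjectIDs": ["01", "02"],
    "groundtruths": [70, 55],
    "predictions": [[75.9, 50.8, 80.7], [52.1, 58.4]],
})

# explode() emits one row per list element and repeats the other columns.
long = wide.explode("predictions").reset_index(drop=True)

# The exploded column keeps dtype object, so convert it before plotting.
long["predictions"] = long["predictions"].astype(float)

print(long)
```

The resulting `long` frame has the same shape as the one printed above, so it can be passed to sns.violinplot(x="subjectIDs", y="predictions", data=long).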

What do the numbers on the axis mean when visualizing clusters in 2-dimensions?

I followed the code in this link.
What do the numbers on the x-axis and y-axis mean in this plot? Why are they discrete numbers?
When I used my own data, it gave me this kind of plot, and I can't understand what the plot is trying to say.
As they are working with more than two dimensions (features), they are using PCA to project the data into two dimensions (which do not need to correspond to any of the dimensions of the original data) so it can be plotted.
So each data point is projected onto the dimensions PCA1 and PCA2, which are real-valued (not discrete). The numbers on the axes are just tick labels on those continuous coordinates.
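To make the projection concrete, here is a minimal sketch of what a two-component PCA does, written with plain NumPy rather than whatever library the linked code uses (that code is not shown here, so this is an assumption). The random `features` array is a stand-in for real data, and `project_to_2d` is a hypothetical helper name.

```python
import numpy as np


def project_to_2d(features):
    """Project rows of `features` onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Stand-in data: 50 observations with 5 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 5))

coords = project_to_2d(features)  # column 0 is PCA1, column 1 is PCA2
print(coords[:3])
```

Each row of `coords` is a pair of real numbers, each a weighted combination of all the centered original features, which is why the axes carry no direct meaning in terms of any single feature.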

How to coarsen ordered 1D data into irregular bins with Python

I have a high-frequency, ordered 1D data set of observations of a property with respect to depth: a continuous float observation versus monotonically increasing depth.
I'd like to find a way to coarsen this data set into a user-defined number of contiguous bins (or zones), each of which is described by a single mean value and a lower depth limit (the top depth limit being defined by the end of the zone above it). The criteria for splitting the zones should be k-means-like, in that (within the bounds of the number of zones specified) there is minimum property variance within each zone and maximum variation between adjacent zones.
As an example, suppose I have a small high-frequency dataset as follows:
depth = [2920.530612, 2920.653061, 2920.734694, 2920.857143, 2920.938776, 2921.102041, 2921.22449, 2921.346939, 2921.469388, 2921.510204, 2921.55, 2921.632653, 2921.795918, 2922, 2922.081633, 2922.122449, 2922.244898, 2922.326531, 2922.489796, 2922.612245, 2922.857143, 2922.979592, 2923.020408, 2923.142857, 2923.265306]
value = [0.0098299, 0.009827939, 0.009826632, 1004.042327, 3696.000306, 3943.831644, 3038.254723, 3693.543377, 3692.806616, 50.04989348, 15.0127, 2665.2111, 3690.842641, 3238.749497, 429.4979635, 18.81228993, 1800.889643, 2662.199897, 3454.082382, 3934.140146, 3030.184014, 0.556587319, 8.593768956, 11.90163067, 26.01012696]
If I were to request a split into 7 zones, it would return something like the following:
depth_7zone =[2920.530612, 2920.857143, 2920.857143, 2921.510204, 2921.510204, 2921.632653, 2921.632653, 2922.081633, 2922.081633, 2922.244898, 2922.244898, 2922.979592, 2922.979592, 2923.265306]
value_7zone = [0.009828157, 0.009828157, 3178.079832, 3178.079832, 32.53129674, 32.53129674, 3198.267746, 3198.267746, 224.1551267, 224.1551267, 2976.299216, 2976.299216, 11.76552848, 11.76552848]
which can be visualized as (blue = original data, red = data split into 7 zones);
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
plt.plot(value, depth, '-o')
plt.plot(value_7zone, depth_7zone, '-', color='red')
plt.gca().invert_yaxis()
plt.xlabel('Values')
plt.ylabel('Depth')
plt.show()
I've tried standard k-means clustering, and it doesn't appear suited to this ordered 1D problem. I was thinking of methods used for digital signal processing, but all I could find discretizes into constant bin sizes. Image compression methods also came to mind, but they may be overkill and likely expect 2D data.
Can anyone suggest an avenue to explore further? (I'm fairly new to Python so apologies in advance)
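One avenue that matches the stated criterion is optimal contiguous segmentation by dynamic programming: among all ways to cut the ordered values into a fixed number of contiguous runs, pick the one with the smallest total within-run sum of squared deviations. The sketch below is an illustration under that assumption, not an answer from this thread, and it is not claimed to reproduce the 7-zone example above. `segment_1d` is a hypothetical name, and the triple loop is O(zones * n^2), so it suits modest data sizes.

```python
import numpy as np


def segment_1d(values, n_zones):
    """Cut `values` into n_zones contiguous runs with minimal within-run SSE.

    Returns (bounds, means): run r covers values[bounds[r]:bounds[r + 1]].
    """
    y = np.asarray(values, dtype=float)
    n = y.size
    if not 1 <= n_zones <= n:
        raise ValueError("n_zones must be between 1 and len(values)")

    # Prefix sums give the squared error of any run in constant time.
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):
        # Sum of squared deviations from the mean of y[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[k, j]: minimal cost of covering the first j values with k runs.
    best = np.full((n_zones + 1, n + 1), np.inf)
    prev = np.zeros((n_zones + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_zones + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1, i] + cost(i, j)
                if candidate < best[k, j]:
                    best[k, j] = candidate
                    prev[k, j] = i

    # Walk back through the stored split points to recover the boundaries.
    bounds = [n]
    for k in range(n_zones, 0, -1):
        bounds.append(int(prev[k, bounds[-1]]))
    bounds.reverse()
    means = [float(y[a:b].mean()) for a, b in zip(bounds[:-1], bounds[1:])]
    return bounds, means


bounds, means = segment_1d([0, 0, 0, 10, 10, 10, 0, 0], 3)
print(bounds, means)
```

With the depth and value lists above, segment_1d(value, 7) would return index boundaries; the base depth of each zone could then be read as depth[b - 1] for each boundary b after the first.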

Marker color showing the frequency of occurrences

I have 100 data points and a time series at each data point. I calculated the distance (dist in the code) between every pair of points and the correlation coefficient (corr in the code) between the corresponding time series. Now I need a scatter plot of distance (on the x-axis) vs. correlation coefficient (on the y-axis), where the marker color gives the number of occurrences of the correlation coefficient at each distance value. I tried the following code using matplotlib:
colors=np.random.randint(len(dist))
cmap=plt.cm.viridis
plt.scatter(dist,corr,c=colors,cmap=cmap)
plt.colorbar()
plt.show()
The result was incorrect.
Is it possible to get the desired result using scatter plot? Or, is there any other way of getting it?
You are passing a single number as the color: np.random.randint(len(dist)) without a size argument returns one integer. scatter needs one color value per point, i.e. len(colors) = len(dist).
Try:
colors=np.random.randint(len(dist), size=len(dist))
Not sure what you want to achieve. Perhaps this would work instead:
plt.scatter(dist,corr,c=dist,cmap=cmap)
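Neither suggestion above colors the markers by how often a (distance, correlation) pair occurs. One possible reading of the question, sketched here as an assumption, is to bin the pairs on a 2D grid with np.histogram2d and color every point by the number of points that share its bin. The helper names and the default of 20 bins are hypothetical choices.

```python
import numpy as np


def occurrence_counts(dist, corr, bins=20):
    """For every point, return how many points fall into the same 2D bin."""
    dist = np.asarray(dist, dtype=float)
    corr = np.asarray(corr, dtype=float)
    hist, x_edges, y_edges = np.histogram2d(dist, corr, bins=bins)
    # Digitizing against the interior edges yields a bin index in
    # 0..bins-1 for every point, including those on the outer edges.
    ix = np.digitize(dist, x_edges[1:-1])
    iy = np.digitize(corr, y_edges[1:-1])
    return hist[ix, iy]


def plot_by_occurrence(dist, corr, bins=20):
    """Scatter plot whose marker color is the per-bin point count."""
    # Imported here so the numeric helper works without a plotting backend.
    import matplotlib.pyplot as plt

    counts = occurrence_counts(dist, corr, bins)
    fig, ax = plt.subplots()
    points = ax.scatter(dist, corr, c=counts, cmap="viridis")
    fig.colorbar(points, ax=ax, label="points per bin")
    ax.set_xlabel("distance")
    ax.set_ylabel("correlation coefficient")
    return ax
```

The bin count controls how coarse "the same value" is. If a density picture without individual markers is acceptable, matplotlib's hexbin or hist2d are alternatives that do the binning and coloring in one call.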
