I have a dataset with three columns and thousands of rows as shown below.
There are 4 classes (clusters), as shown in column three (R, I, C, F).
row id  VALUE  CLASS
1       284    R
2       254    I
3       184    C
4       177    F
..........
I am trying to get the cluster plot from the above data based on the 4 classes. The expected output is shown in the picture below.
What I tried:
Scatter plot in seaborn
from pandas import read_csv
import seaborn as sns
df2 = read_csv(r'C:\Users\jo\Downloads\Clusters.csv')
sns.scatterplot(data=df2, x="VALUE", y= "rowid",hue="CLASS")
Well, I have to say that the clustering algorithm is almost certainly doing exactly what it is supposed to do. Clustering is unsupervised, of course, so there is no training/testing and you don't know in advance what the outcome will be. You can feed in different features and see what the outcome is. Also, you don't really share any code, so it's impossible to say for sure what is going on here. I would suggest taking a look at the links below and doing some more Googling on this subject.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb
https://www.askpython.com/python/examples/plot-k-means-clusters-python
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
I'm having some trouble with error bars in python. I'm plotting the columns on a pandas dataframe grouped, so on this example dataframe:
import pandas as pd
unfiltered = [0.975,0.964,0.689,0.974]
filtered = [0.954,0.932,0.570,0.960]
index_df = ["Accuracy", "Recall", "Precision", "Specificity"]
column_names = ["Unfiltered", "With overhang filter"]
# named data_df because that is the name the plotting code uses
data_df = pd.DataFrame(list(zip(unfiltered,filtered)),index=index_df,columns=column_names)
So my dataframe looks like this:
             Unfiltered  With overhang filter
Accuracy          0.975                 0.954
Recall            0.964                 0.932
Precision         0.689                 0.570
Specificity       0.974                 0.960
And I plot it with the following lines:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
ax = data_df.plot.bar(rot=0)
plt.show()
I get a figure like this:
Now I want to add error bars, but my problem is that I don't seem to be able to figure out how to get a different error value for each bar. I want to use the standard deviation and the values I have are different for each one of them (example: the std for both recalls shown is different). My problem is that if I add:
ax = data_df.plot.bar(rot=0, yerr=data_errors)
where data_errors is a list with the 8 standard deviations I get:
ValueError: The lengths of the data (4) and the error 8 do not match
It does work when data_errors has only 4 elements, but then it plots the same error bars for both accuracies, recalls, etc.
Can anyone help me to keep the data grouped by index like it is, but with different error bars for each value of the dataframe?
SOLUTION
Thanks to the user Quang Hoang I looked into sns.barplot. The solution to my problem was to create a long-form dataframe (which I named data_df) with one row per measurement, like this:
      Indicator      Data       Class
0      Accuracy  0.966279  Unfiltered
1      Accuracy  0.981395  Unfiltered
2      Accuracy  0.989535  Unfiltered
3      Accuracy  0.975553  Unfiltered
4      Accuracy  0.961583  Unfiltered
5        Recall  0.954545  Unfiltered
...
35  Specificity  0.941176    Filtered
36  Specificity  0.953431    Filtered
37  Specificity  0.993865    Filtered
38  Specificity  0.946012    Filtered
39  Specificity  0.953374    Filtered
Followed by:
# ci="sd" draws one standard deviation per bar; seaborn >= 0.12 spells this errorbar="sd"
ax = sns.barplot(x="Indicator", y="Data", hue="Class", data=data_df, ci="sd")
This allowed me to create this figure:
where, as you can see, the error bars are different for each value and are calculated automatically.
This might not be exactly what you're looking for:
data_df.stack().plot.bar(yerr=data_errors)
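To keep the bars grouped by index, another option is to pass the errors as a DataFrame whose index and columns match the plotted frame, which pandas accepts for yerr. The following is only a sketch: the standard deviations in errors_df are made-up values for illustration, and the Agg backend is chosen just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

index = ["Accuracy", "Recall", "Precision", "Specificity"]

data_df = pd.DataFrame(
    {"Unfiltered": [0.975, 0.964, 0.689, 0.974],
     "With overhang filter": [0.954, 0.932, 0.570, 0.960]},
    index=index,
)

# Hypothetical standard deviations, one per bar (made up for illustration)
errors_df = pd.DataFrame(
    {"Unfiltered": [0.010, 0.020, 0.050, 0.010],
     "With overhang filter": [0.015, 0.030, 0.060, 0.020]},
    index=index,
)

# errors_df has the same labels as data_df, so each bar gets its own error
ax = data_df.plot.bar(rot=0, yerr=errors_df)
```

With a flat list of 8 values pandas cannot tell which error belongs to which column, which is why the labelled DataFrame form is used here.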
I would like to plot a violin plot using Python for a multivariate regression problem. I am trying to predict a scalar value from time series input. The libraries of choice are probably matplotlib and / or seaborn, but I'm open to alternative suggestions as well.
This is what I have:
A list [g_1,g_2,...g_n] of n ground truth values for each of my n subjects.
k time series inputs (i.e. lists) consisting of j elements for each of my n subjects. Please note that k and j don't have to be equal for each subject.
k predictions for each of my n subjects.
Example input:
Ground truth: [14,67,342,5]
Time series input: [[19,2434,23432,-123,-54],[99,23,4,-6],[1,2,3,4,5,6,7,8],[-1,-2,-3]]
Example output after performing a regression:
Predictions: [17,54,312,-2]
What I would like to obtain is a nice violin plot as shown in this tutorial. This is what my pandas data frame looks like:
dataframe = pd.DataFrame(
    {'Predictions': predictions,      # This is a list of k elements
     'Subject IDs': subjectIDs,       # This is a list of n strings
     'Ground truths': groundtruths    # This is a list of n float values
    })
Attempting to draw a plot with
sns.violinplot( ax = ax, y = dataframe["Predictions"] )
only results in:
TypeError: No loop matching the specified signature and casting was found for ufunc add
Additionally, I also already tried to follow the official seaborn documentation, using the command
ax = sns.violinplot(x="Subject IDs", y="Predictions", data=dataframe)
instead. However, this only results in
TypeError: unhashable type: 'list'
Update: If I treat the "Predictions" list as a tuple, I manage to create a plot without errors but unfortunately it's completely messed up as it puts all prediction values on the y-axis (see below for a snippet).
Thus, my question is: How can I draw a plot with all subject IDs on the x-axis, the ground truths on the y-axis and the probability distribution of my predictions, the corresponding mean values and a confidence interval as violin plot?
OK, I solved my problem. The problem was with my input pandas dataframe. I had to make sure that each observation was assigned exactly one prediction and not a complete list.
This is what my data frame should have looked like:
data = pd.DataFrame(
    {'groundtruths': groundtruthsList,
     'predictions': predictionsList,
     'subjectIDs': subjectIDsList
    })
print(data.head())
   groundtruths  predictions subjectIDs
0            70    75.864983         01
1            70    50.814903         01
2            70    80.715569         01
3            70    70.627260         01
4            70    49.516285         01
.             .            .          .
.             .            .          .
.             .            .          .
Now, as the data frame has the right format, I can easily draw nice violin plots with
sns.violinplot(x="subjectIDs", y="predictions", data=data)
A simple seaborn scatterplot can be used to nicely put the ground truth for each subject in this plot as well.
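If the predictions start out as one list per subject, the one-prediction-per-row layout can be produced with DataFrame.explode. This is a minimal sketch under that assumption; the frame called wide and its values are hypothetical and not the data from the question.

```python
import pandas as pd

# Hypothetical input: one row per subject, each holding a list of predictions
wide = pd.DataFrame({
    "subjectIDs": ["01", "02"],
    "groundtruths": [70.0, 55.0],
    "predictions": [[75.9, 50.8, 80.7], [52.1, 58.3]],
})

# explode() repeats the other columns once per list element
data = wide.explode("predictions").reset_index(drop=True)
# exploded values come back as dtype object, so convert them to floats
data["predictions"] = data["predictions"].astype(float)

print(data)
```

The resulting frame has the long format shown above and can be passed to sns.violinplot(x="subjectIDs", y="predictions", data=data).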
Thanks to this kind and helpful community I solved the first problem I had in my work, which you can see here: Basic Problem - necessary for understanding what follows
After I used this, I wanted to visualize the distribution of the classes and the NaN values in the features, so I plotted it as a bar chart. With a few classes this is pretty handy.
The problem is that I have about 120 different classes and 50000 data objects in total - the plots are not readable with this amount of data.
Therefore I wanted to split the visualization.
For each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS  FEATURE1  FEATURE2  FEATURE3
X      1         1         2
B      0         0         0
C      2         3         1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried to solve it through df.groupby('Class').plot(kind="barh", subplots=True) - this completely destroyed the layout and plotted per feature, not per class.
I tried this approach, but if I write my groupby result into the variable 'grouped' I can print it in a perfect format with all the information, yet I cannot access it the way it is done in the solution. I always get the error: 'string indices must be integers'
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The data I use is completely categorical - no numerical values - and I want to display the number of NaN values in the different features for each class (label) of my dataset.
A solution using seaborn
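import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# x and y are passed by keyword, as current seaborn versions require
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()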
Grouping is the way to go, just set the labels
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
With the solution provided by @meW I was able to achieve a result that is close to my goal.
I had to do two steps to actually use his solution:
Cast the GroupBy object to a DataFrame object via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
The groupby query turned the Class column into the index, so I had to transform it back via grouped['class'] = grouped.index
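The two steps above can be sketched on a small frame as follows. Note that this sketch swaps in a different route to the same numbers (isna() followed by a grouped sum, rather than count().rsub(size)), and that the data is hypothetical, invented only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing entries
df = pd.DataFrame({
    "CLASS":    ["X", "X", "B", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan],
    "FEATURE2": [np.nan, "c", "d", "e", np.nan],
})

features = df.drop(columns="CLASS")
# isna() gives a boolean frame; summing it per class counts the NaNs
nan_counts = features.isna().groupby(df["CLASS"]).sum()
# CLASS is now the index; reset_index() turns it back into a column
nan_counts = nan_counts.reset_index()

print(nan_counts)
```

After reset_index() the frame again has CLASS as an ordinary column, so it can be melted and plotted per class as in the seaborn answer.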
Two questions arise from this solution. First: is it possible to fit the ticks to the different amounts of NaN? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).
Second: the feature names are only shown in the last plot - is there a way to add them to the x-axis of every plot?
Does anyone know what is generally the best practice to visualize data that shows growth for different categories over time?
In my example think of "Category" as a product, the "Type" as the model and the values as a performance metric. I want to visualize the data in a way that would let me prioritize which "Category" and corresponding "Type" have had the most increase in the mean value.
The challenge I'm having is that after I've summarized tabular data to show change over time, the best thing that I can come up with to compare and visualize the summarized data is to show the changing mean for each separate category in its own Excel tab.
There has to be a better way to do this!
I've done a 3d column chart in matplotlib - one row for each category, but it's not effective enough.
It's possible that someone knows the best solution from experience.
Right now the mean values are being shown over time, grouped by "Category" and "Type" in my example.
Maybe I shouldn't be looking at this as a pandas table or matplotlib bar chart at all.
If my goal was to identify and prioritize the 'Category' and its respective 'Type' where the mean growth has been the most promising, how should I do that?
I really appreciate any help or advice.
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import warnings;warnings.filterwarnings("ignore")
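def calc_slope(row):
    # skip quarters that have no data for this Category/Type
    mask = row.notnull()
    # regress the values (y) on the quarter position (x) so the slope is change per quarter
    a = scipy.stats.linregress(axisvalues[mask.values], y=row[mask.values])
    return pd.Series(a._asdict())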
table=pd.DataFrame({'Category':['A','A','A','B','C','C','C','B','B','A','A','A','B','B','D','A','B','B','A','C','B','B','C','A','A','C','B','B','A','A','A','B','B','B','B'],
'Type':['I','I','I','III','II','II','II','III','III','I','I','I','III','III','II','I','III','III','I','II','III','I','II','III','I','II','III','I','II','II','II','II','II','II','II'],
'Quarter':['2016-Q1','2017-Q2','2017-Q3','2017-Q4','2017-Q2','2016-Q2','2017-Q2','2016-Q3','2016-Q4','2016-Q2','2016-Q3','2017-Q4','2016-Q1',\
'2016-Q2','2016-Q4','2016-Q4','2017-Q2','2017-Q3','2016-Q3','2016-Q4','2016-Q2','2017-Q2','2016-Q1','2017-Q4','2016-Q4','2017-Q2',\
'2016-Q1','2017-Q2','2016-Q1','2017-Q2','2016-Q4','2016-Q1','2017-Q2','2017-Q3','2017-Q4'],
'Value':np.random.randint(100,1000,size=35)})
db=(table.groupby(['Category','Type','Quarter']).filter(lambda group: len(group) >= 1)).groupby(['Category','Type','Quarter'])["Value"].mean()
db=db.unstack()
axisvalues= np.arange(1,len(db.columns)+1) #used in calc_slope function
db = db.join(db.apply(calc_slope,axis=1))
print(db)
For this type of problem you should really consider seaborn.
import seaborn as sns
# reshape the data into 'tidy form' for seaborn
melted = pd.melt(db.reset_index(),
                 value_vars=[c for c in db.columns if '-Q' in c],
                 value_name='Mean',
                 var_name='Quarter',
                 id_vars=['Type', 'Category'])
# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(data=melted, x='Quarter', y='Mean',
                col='Type', hue='Category', kind='point')
You can change what type of plot you have and explore really fast and easily, for example by changing the 'kind' keyword.
[edited because it's 2:30 am] Maybe fit a trend to the means?
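One way to act on the trend idea is to fit a straight line to each row of quarterly means and rank the (Category, Type) pairs by slope. This is a rough sketch, not a definitive method: the frame called means and its numbers are hypothetical, trend_slope is a helper name introduced here, and a linear fit is only one of several reasonable definitions of "growth".

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly means per (Category, Type); NaN means no data
means = pd.DataFrame(
    {"2016-Q1": [100.0, 300.0, 200.0],
     "2016-Q2": [150.0, np.nan, 190.0],
     "2016-Q3": [200.0, 320.0, 180.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "I"), ("B", "III"), ("C", "II")],
        names=["Category", "Type"]),
)

def trend_slope(row):
    """Least-squares slope of the row's values against quarter position."""
    y = row.to_numpy(dtype=float)
    x = np.arange(len(y))
    ok = ~np.isnan(y)
    if ok.sum() < 2:  # a slope needs at least two points
        return np.nan
    return np.polyfit(x[ok], y[ok], 1)[0]

# Largest positive slope first
ranking = means.apply(trend_slope, axis=1).sort_values(ascending=False)
print(ranking)
```

Sorting the slopes gives a single ordered list to prioritize from, which may be easier to read than comparing one chart per category.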
I'm making a simple pairplot with Seaborn in Python that shows different levels of a categorical variable by the color of plot elements across variables in a Pandas DataFrame. Although the plot comes out exactly as I want it, the categorical variable is binary, which makes the legend quite meaningless to an audience not familiar with the data (categories are naturally labeled as 0 & 1).
An example of my code:
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
Is there a way to change legend label text with pairplot? Or should I use PairGrid, and if so how would I approach this?
Found it! It was answered here: Edit seaborn legend
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
g._legend.set_title(new_title)
Since you don't provide a full code example or mock data, I will use my own code to answer.
First solution
The easiest must be to keep your binary labels for analysis and to create a column with proper names for plotting. Here is a sample of my code; you should get the idea:
def transconum(morph):
    if (morph == 'S'):
        return 1.0
    else:
        return 0.0
CompactGroups['MorphNum'] = CompactGroups['MorphGal'].apply(transconum)
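Applied to the question's case, the same idea runs in the other direction: keep the 0/1 column and add a second column with readable names to use as hue. A minimal sketch, assuming a binary column named categorical_var as in the question; the frame contents, the label names and the category_label column are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a binary hue column
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "categorical_var": [0, 1, 1, 0]})

# Hypothetical readable names for the two levels
label_names = {0: "Control", 1: "Treated"}
df["category_label"] = df["categorical_var"].map(label_names)

# A plot would then use hue="category_label", while
# categorical_var stays available for the analysis itself.
print(df)
```

Because the legend entries are taken from the values of the hue column, the readable names show up in the legend without touching the legend object.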
Second solution
Another way would be to overwrite labels on the fly. Here is a sample of my code, which works perfectly:
grid = sns.jointplot(x="MorphNum", y="PropS", data=CompactGroups, kind="reg")
grid.set_axis_labels("Central type", "Spiral proportion among satellites")
grid.ax_joint.set_xticks([0, 1, 1])
plt.xticks(range(2), ('$Red$', '$S$'))