How do I plot observations each with multiple values in Python?

I have data for each individual participant from a survey. Each individual has a vector of data, for example:
#[a,b,c]
[1,2,5] # 1 participant
...
...
...
[1,3,4]
Instead of having the data in that shape, I have it column-wise. For example:
a = [1...1] # has n values equal to participants
b = [2...3] # has n values equal to participants
c = [5...4] # has n values equal to participants
I need to plot this data somehow so that it is clearly represented as a figure. Does anybody have ideas on how to plot it all together? I have plotted the columns individually as bar plots with frequencies, but I would like them plotted together as a 3D plot so that all three dimensions' values can be read from the figure.
I have around 200 participants.
Any suggestions are welcome.

Use each list as the x-axis, y-axis, and z-axis data. This works because the lists are the same length and each index represents one object: for example, (a[0], b[0], c[0]) are traits of the same participant. The a, b, and c lists map to the x-, y-, and z-axis fields, respectively.
If you're trying to do a scatter plot, for example:
import plotly.graph_objs as go
# stuff here, i.e. your code that builds a, b and c

my_scatter = go.Scatter3d(   # note: the class is Scatter3d
    x=a,
    y=b,
    z=c,
    # some more stuff; here's what I tend to add:
    # mode='markers',
    # marker=dict(color='#DC6D37'),
    # name='Your_Legend_Name_Here',
    # legendgroup='Group_Multiple_Traces_Here',
)
# (a plot title such as 'Random_Plot_Title' belongs on the figure's layout, not on the trace)
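To actually display the trace, a minimal sketch (not from the original answer) of wrapping it in a Figure and rendering it with plotly's offline mode; the data values here are hypothetical, standing in for the ~200 participants:

import plotly.graph_objs as go
import plotly.offline as pyo

# hypothetical example data standing in for the real columns a, b, c
a = [1, 2, 3, 1]
b = [2, 3, 2, 3]
c = [5, 4, 5, 4]

trace = go.Scatter3d(x=a, y=b, z=c, mode='markers')
fig = go.Figure(data=[trace], layout=go.Layout(title='Survey responses'))

# writes an HTML file and opens it in the browser
pyo.plot(fig, filename='survey_3d.html')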

Pandas stacked bar plotting with different shapes

I'm currently experimenting with pandas and matplotlib.
I have created a Pandas dataframe which stores data like this:
cmc|coloridentity
1 | G
1 | R
2 | G
3 | G
3 | B
4 | B
What I now want to do is make a stacked bar plot showing how many entries exist per cmc, for each coloridentity, stacked on top of each other.
My thoughts so far:
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()

# Create two dictionaries: one for the number of entries per cost and one
# to store the different costs for each color
color_dict_values = {}
color_dict_index = {}
for u in unique_values:
    temp_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.array(temp_df)
    color_dict_index[u] = temp_df.index.to_numpy()

width = 0.4
p1 = plt.bar(color_dict_index['G'], color_dict_values['G'], width, color='g')
p2 = plt.bar(color_dict_index['R'], color_dict_values['R'], width,
             bottom=color_dict_values['G'], color='r')
plt.show()
But this gives me an error, because in the line where I set the bottom of the second plot to the values of the first plot, the two arrays have different numpy shapes.
Does anyone know a solution? I thought of adding 0 values so that the shapes match, but I don't know whether that is the best solution, and if it is, what the best way to implement it would be.
Working with a fixed index (the range of cmc values) makes things easier. That way color_dict_values for a color id gives a count for each possible cmc value (and stays zero when there are none).
The color_dict_index isn't needed any more. To fill in color_dict_values, we iterate through the temporary dataframe with the value counts.
To plot the bars, the x-axis is now the range of possible cmc values. I added [1:] to each array to skip the zero at the beginning, which would look ugly in the plot.
The bottom starts at zero and gets incremented by the color_dict_values of the color that has just been plotted. (Thanks to numpy, adding the constant 0 to an array just gives back that array.)
In the code I generated some random numbers similar to the format in the question.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N),
                   'coloridentity': np.random.choice(['R', 'G'], N)})

# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()

# find the range of all cmc indices
max_cmc = df['cmc'].max()
cmc_range = range(max_cmc + 1)

# dictionary for each coloridentity: array of counts for each possible cmc
color_dict_values = {}
for u in unique_values:
    value_counts_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.zeros(max_cmc + 1, dtype=int)
    for ind, cnt in value_counts_df.items():   # .iteritems() in older pandas
        color_dict_values[u][ind] = cnt

width = 0.4
bottom = 0
for col_id, col in zip(['G', 'R'], ['limegreen', 'crimson']):
    plt.bar(cmc_range[1:], color_dict_values[col_id][1:], bottom=bottom, width=width, color=col)
    bottom += color_dict_values[col_id][1:]
plt.xticks(cmc_range[1:])            # make sure every cmc gets a tick label
plt.tick_params(axis='x', length=0)  # hide the tick marks
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()
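As a side note (not part of the answer above), pandas can also build the counts and do the stacking itself; a minimal, self-contained sketch using the same kind of randomly generated df:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N),
                   'coloridentity': np.random.choice(['R', 'G'], N)})

# rows: cmc values, columns: color identities, cells: counts
counts = pd.crosstab(df['cmc'], df['coloridentity'])

# pandas stacks the columns on top of each other
counts.plot(kind='bar', stacked=True, color=['limegreen', 'crimson'], width=0.4)
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()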

Placing Labels in nested categorical stacked bar with Bokeh and Pandas

I am trying to replicate a chart like the following using a pandas dataframe and Bokeh vbar:
Objective
So far I've managed to place the labels at their corresponding heights, but now I can't find a way to access the numeric value where the category (2016, 2017, 2018) is located on the x-axis. This is my result:
My nested categorical stacked bars chart
This is my code. It's messy, but it's what I've managed so far. Is there a way to access the numeric value on the x-axis of the bars?
def make_nested_stacked_bars(source, measurement, dimension_attr):
    # dimension_attr is a list with the names of the columns in source used as categories
    # measurement contains the name of the column with numeric data
    data = source.copy()

    # Creates list of values of the highest index
    list_attr = source[dimension_attr[0]].unique()
    list_stackers = list(source[dimension_attr[-1]].unique())
    list_stackers.sort()

    # Trims labels that are too wide to fit in the graph
    for column in data.columns:
        if data[column].dtype.name == 'object':
            data[column] = np.where(data[column].apply(len) > 30,
                                    data[column].str[:30] + '...', data[column])

    # Creates a list of dataframes, each grouping a specific value
    list_groups = []
    for item in list_attr:
        list_groups.append(data[data[dimension_attr[0]] == item])

    # Groups data by dimension attrs, aggregates measurement to count
    # Drops highest index from dimension attr
    dropped_attr = dimension_attr[0]
    dimension_attr.remove(dropped_attr)

    # Creates groupby by the last 2 parameters, aggregates to count and calculates percentages
    for index, value in enumerate(list_groups):
        list_groups[index] = list_groups[index].groupby(by=dimension_attr).agg({measurement: ['count']})
        list_groups[index] = list_groups[index].groupby(level=0).apply(lambda x: round(100 * x / float(x.sum()), 1))
        # Resets indexes
        list_groups[index] = list_groups[index].reset_index()
        list_groups[index] = list_groups[index].pivot(index=dimension_attr[0], columns=dimension_attr[1])
        list_groups[index].index = [(x, list_attr[index]) for x in list_groups[index].index]
        # Drops dimension attr as top level column
        list_groups[index].columns = list_groups[index].columns.droplevel(0)
        list_groups[index].columns = list_groups[index].columns.droplevel(0)
    df = pd.concat(list_groups)

    # Get the number of colors needed for the plot
    colors = brewer["Spectral"][len(list_stackers)]
    colors.reverse()

    p = figure(plot_width=800, plot_height=500, x_range=FactorRange(*df.index))
    renderers = p.vbar_stack(list_stackers, x='index', width=0.3, fill_color=colors,
                             legend=[get_item_value(x) for x in list_stackers],
                             line_color=None, source=df, name=list_stackers)

    # Adds a different hovertool to each stacked bar
    # Empty dictionary with initial values set to zero
    list_previous_y = {}
    for item in df.index:
        list_previous_y[item] = 0

    # Loops through the bar renderers
    for r in renderers:
        stack = r.name
        hover = HoverTool(tooltips=[
            ("%s" % stack, "@%s" % stack),
        ], renderers=[r])
        # Initial value for placing the label on the x-axis
        previous_x = 0.5
        # Loops through dataset rows
        for index, row in df.iterrows():
            # Adds value of df column to list
            list_previous_y[index] = list_previous_y[index] + df[stack][index]
            # Adds label if value is not nan and at least 10
            if not math.isnan(df[stack][index]) and df[stack][index] >= 10:
                p.add_layout(Label(x=previous_x, y=list_previous_y[index] - df[stack][index] / 2,
                                   text='% ' + str(df[stack][index]), render_mode='css',
                                   border_line_color='black', border_line_alpha=1.0,
                                   background_fill_color='white', background_fill_alpha=1.0))
            # Increases the position on the x-axis
            # (this should really be done by adding the x position of the next bar)
            previous_x = previous_x + 0.8
        p.add_tools(hover)

    p.legend.location = "top_left"
    p.x_range.range_padding = 0.2
    p.xgrid.grid_line_color = None
    return p
Or is there an easier way to get all this done?
Thank you for your time!
UPDATE:
Added an additional image of a three-level nested chart where the label placement on the x-axis should also be handled:
Three level nested chart
I can't find a way to access the numeric value where the category (2016,2017,2018) is located in the x axis.
There is no way to access this information on the Python side in standalone Bokeh output. The coordinates are only computed inside the browser, on the JavaScript side, i.e. only after your Python code has finished running and is out of the picture entirely. Even in a Bokeh server app context there is no direct way, as there are no synchronized properties that record the values.
As of Bokeh 1.3.4, support for placing labels with categorical coordinates is a known open issue.
In the meantime, the only workarounds I can suggest are:
Use the text glyph method with coordinates in a ColumnDataSource, instead of Label. That should work to position text at actual categorical coordinates (LabelSet might also work, though I have not tried it); a minimal sketch of this approach follows after the second workaround below. You can see an example of text with categorical coordinates here:
https://github.com/bokeh/bokeh/blob/master/examples/plotting/file/periodic.py
Use numerical coordinates to position the Label. But you will have to experiment or best-guess to find numerical coordinates that work for you. A rule of thumb is that categories have a width of 1.0 in synthetic (numeric) coordinate space.
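A minimal sketch of the first workaround (simplified to a single-level categorical axis rather than the asker's nested case; the data here is made up):

from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

years = ['2016', '2017', '2018']
values = [35, 48, 27]
source = ColumnDataSource(data=dict(x=years, top=values,
                                    label=['% 35', '% 48', '% 27']))

p = figure(x_range=years, plot_height=300)
p.vbar(x='x', top='top', width=0.6, source=source)

# the text glyph is positioned with the same categorical x coordinates as the bars
p.text(x='x', y='top', text='label', source=source,
       text_align='center', text_baseline='bottom')

show(p)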
My solution was:
Creating a copy of the dataframe used to make the chart. This dataframe (labeling_data) contains the y-axis coordinates, calculated so that each label is positioned at the middle of its corresponding stacked bar.
Then I added additional columns to be used as the actual labels, where the values to be displayed are concatenated with a percentage symbol.
labeling_data = df.copy()

# Cumulative sum of columns
labeling_data = labeling_data.cumsum(axis=1)

# New names for columns
y_position = []
for item in labeling_data.columns:
    y_position.append(item + '_offset')
labeling_data.columns = y_position

# Copies original columns
for item in df:
    # Adding original columns
    labeling_data[item] = df[item]
    # Modifying offset columns to place the label in the middle of the bar
    labeling_data[item + '_offset'] = labeling_data[item + '_offset'] - labeling_data[item] / 2
    # Concatenating values with a percentage symbol if at least 10
    labeling_data[item + '_label'] = np.where(df[item] >= 10, '% ' + df[item].astype(str), "")
Finally, by looping through the renderers of the plot, a LabelSet is added to each stack group using labeling_data as the data source. By doing this, the index of the dataframe can be used to set the x coordinate of the label, and the corresponding columns are used for the y coordinate and text parameters.
info = ColumnDataSource(labeling_data)

# Loops through the bar renderers
for r in renderers:
    stack = r.name
    # Loops through dataset rows
    for index, row in df.iterrows():
        # Creates a LabelSet and uses the index, y_offset and label columns
        # as the x, y and text parameters
        labels = LabelSet(x='index', y=stack + '_offset', text=stack + '_label', level='overlay',
                          x_offset=-25, y_offset=-5, source=info)
        p.add_layout(labels)
Final result:
Nested categorical stacked bar chart with labels

How do I label a specific point in a scatter plot with a unique ID?

I am creating an interactive graph for a layout that looks a lot like this:
Each point has a unique ID and is usually part of a group. Each group has its own color, so I use multiple scatter plots to create the entire layout. I need the following to occur when I click on a single point:
On mouse click, retrieve the ID of the selected point.
Plug the ID into a black box function that returns a list of nearby* IDs.
Highlight the points of the IDs in the returned list.
*It is possible for some of the IDs to be from different groups/plots.
How do I:
Associate each point with an ID and return the ID when the point is clicked?
Highlight other points in the layout when all I know is their IDs?
Re-position individual points while maintaining their respective groups i.e. swapping positions with points that belong to different groups/plots.
I used pyqtgraph before switching over to matplotlib so I first thought of creating a dictionary of IDs and their point objects. After experimenting with pick_event, it seems to me that the concept of point objects does not exist in matplotlib. From what I've learned so far, each point is represented by an index and only its PathCollection can return information about itself e.g. coordinates. I also learned that color modification of a specific point is done through its PathCollection whereas in pyqtgraph I can do it through a point object e.g. point.setBrush('#000000').
I am still convinced that using a single scatter plot would be the much better option. There is nothing in the question that would contradict that.
You can merge all your data into a single DataFrame with columns group, id, x, y, color. The part of the code below labelled "create some dataset" creates such a DataFrame:
   group    id  x  y       color
0      1  AEBB  0  0   palegreen
1      3  DCEB  1  0        plum
2      0  EBCC  2  0  sandybrown
3      0  BEBE  3  0  sandybrown
4      3  BEBB  4  0        plum
Note that each group has its own color. One can then create a scatter from it, using the colors from the color column.
A pick event is registered as in this previous question. Once a point that is not already black is clicked, the id from the DataFrame corresponding to the selected point is obtained. From that id, other ids are generated via the "blackbox function", and for each id obtained this way the respective index of the point in the dataframe is determined. Because we have a single scatter, this index is directly the index of the point in the scatter (PathCollection), and we can paint it black.
import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors

### create some dataset
x, y = np.meshgrid(np.arange(20), np.arange(20))
group = np.random.randint(0, 4, size=20*20)
l = np.array(np.meshgrid(list("ABCDE"), list("ABCDE"),
                         list("ABCDE"), list("ABCDE"))).T.reshape(-1, 4)
ide = np.random.choice(list(map("".join, l)), size=20*20, replace=False)
df = pd.DataFrame({"id": ide, "group": group,
                   "x": x.flatten(), "y": y.flatten()})
colors = ["sandybrown", "palegreen", "paleturquoise", "plum"]
df["color"] = df["group"]
df["color"].update(df["color"].map(dict(zip(range(4), colors))))
print(df.head())

### plot a single scatter plot from the table above
fig, ax = plt.subplots()
scatter = ax.scatter(df.x, df.y, facecolors=df.color, s=64, picker=4)

def getOtherIDsfromID(ID):
    """ blackbox function: create a list of other IDs from one ID """
    l = [np.random.permutation(list(ID)) for i in range(5)]
    return list(set(map("".join, l)))

def select_point(event):
    if event.mouseevent.button == 1:
        facecolor = scatter._facecolors[event.ind, :]
        if (facecolor == np.array([[0, 0, 0, 1]])).all():
            # the point is already black: restore its original color
            c = df.color.values[event.ind][0]
            c = matplotlib.colors.to_rgba(c)
            scatter._facecolors[event.ind, :] = c
        else:
            ID = df.id.values[event.ind][0]
            oIDs = getOtherIDsfromID(ID)
            # for each ID obtained, make the respective point black
            rows = df.loc[df.id.isin([ID] + oIDs)]
            for i, row in rows.iterrows():
                scatter._facecolors[i, :] = (0, 0, 0, 1)
            tx = "You selected id {}.\n".format(ID)
            tx += "Points with other ids {} will be affected as well"
            tx = tx.format(oIDs)
            print(tx)
        fig.canvas.draw_idle()

fig.canvas.mpl_connect('pick_event', select_point)
plt.show()
In the image below, the point with id DAEE has been clicked on, and other points with ids ['EDEA', 'DEEA', 'EDAE', 'DEAE'] have been chosen by the blackbox function. Not all of those ids exist in the data, so only the two other points with an existing id are colorized as well.

Pyplot: changing color depending on class

I have an array of values with their associated class labels (0 or 1). I'd like to change the colour of the plotted values based on their class labels.
I'm using the matplotlib.pyplot plot function to plot the values:
plt.plot(data[0])
For each value, the associated class label is stored in a separate array of the same length as the data array.
The current plot looks like this:
The areas in between the red lines should be coloured differently.
You could split it into two different data sets:
import numpy as np
import matplotlib.pyplot as plt

# assuming class_labels and data[0] are numpy arrays of the same length
xx0 = class_labels == 0
xx1 = class_labels == 1

# blank out the samples of the other class with NaN, so matplotlib
# leaves gaps instead of drawing through them
data_class_0 = data[0].copy()
data_class_0[xx1] = np.nan
data_class_1 = data[0].copy()
data_class_1[xx0] = np.nan

plt.plot(data_class_0, 'b')
plt.plot(data_class_1, 'r')
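A self-contained demo of this masking idea with made-up data (all names and values here are hypothetical, not from the question):

import numpy as np
import matplotlib.pyplot as plt

# made-up signal and alternating class labels in blocks of 25 samples
rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=200))
class_labels = (np.arange(200) // 25) % 2

data_class_0 = np.where(class_labels == 0, data, np.nan)
data_class_1 = np.where(class_labels == 1, data, np.nan)

plt.plot(data_class_0, 'b', label='class 0')
plt.plot(data_class_1, 'r', label='class 1')
plt.legend()
plt.show()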

Plotting high-precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this:
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)

normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data so that the two quantities are on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all values by the max, but it didn't work!
The only way I could make it work (which is incorrect) is to normalize my data in two different ways: one based on each column (factor 1 constant, factor 2 changing), and the other based on each row (factor 2 constant, factor 1 changing). But that doesn't really make sense, because I need a single plot to show the tradeoff between the two quantities in the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to the eigenvalues:
maxsEigBasedTrain= np.amax(costListTrain.T,1)[:,np.newaxis]
maxsEigBasedTest= np.amax(costListTest.T,1)[:,np.newaxis]
normEigCostTrain=costListTrain.T/maxsEigBasedTrain
normEigCostTest=costListTest.T/maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to the alphas:
maxsAlphaBasedTrain= np.amax(costListTrain,1)[:,np.newaxis]
maxsAlphaBasedTest= np.amax(costListTest,1)[:,np.newaxis]
normAlphaCostTrain=costListTrain/maxsAlphaBasedTrain
normAlphaCostTest=costListTest/maxsAlphaBasedTest
plot 1:
Plot 2, where no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
Plot 3, where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify things further, this is how I read my data:
from sklearn import datasets

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes = np.c_[np.ones(len(X_diabetes)), X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross = math.ceil(0.7 * len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val = ind[cross:]
X_val, y_val = X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contains the original values before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "No. of eigenvalues" to correspond to rows, but in my for loop they fill the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues, I got this answer:
"I would assume that imshow and the other plotting commands you might want to use do, by default, equally sized intervals on the values you are plotting. If you can set that to logarithmic you should be fine. Ideally, equally 'populated bins' would prove most effective, I guess."
For plotting I just subtract the min value from the errors, then add a small number, and at the end take the log:
temp = costListTrain - costListTrain.min()
temp += 0.00000001

extent = [0, 20, alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp), interpolation="nearest",
           cmap=plt.get_cmap('Spectral'),  # 'spectral' (lowercase) in older matplotlib
           extent=extent, origin="lower")
plt.colorbar()
And the result is:
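As a related aside (not from the original answer), matplotlib can also apply the logarithmic scaling through a colormap normalization instead of transforming the data by hand; a minimal sketch with synthetic data standing in for costListTrain:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# synthetic error surface with very small relative differences
rng = np.random.RandomState(0)
errors = 1.45 + 1e-6 * rng.rand(20, 9).cumsum(axis=0)

# shift so all values are strictly positive, as LogNorm requires
temp = errors - errors.min() + 1e-8

plt.imshow(temp, interpolation="nearest", cmap='Spectral', origin="lower",
           aspect='auto', norm=LogNorm(vmin=temp.min(), vmax=temp.max()))
plt.colorbar()
plt.show()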
