apply custom plotting function to list of features

I have created a function that creates a specified plot for a given feature:
import matplotlib.pyplot as plt
import seaborn as sns

def barplotter(dataset, feature):
    # Overall distribution of the feature
    ax1 = sns.displot(dataset, x=feature, stat='density', discrete=True, color='black')
    ax1.set(title=feature, xlabel="")
    # Same distribution, one panel per status_group
    ax2 = sns.displot(dataset, x=feature, col='status_group', stat='density', discrete=True)
    ax2.set(xlabel="")
    plt.show()
Result:
barplotter(raw, "quality_group")
I would like to apply this function to a list of features instead of calling it manually for each feature.
I was thinking of using a for loop.
First I created a list of features:
categorical_columns = raw[categorical].columns.tolist()
for item in categorical_columns:
    barplotter(raw, item)
Unfortunately this results in an error
What am I doing wrong here?

Ok, in case anyone else runs into this error:
Some of my features consisted of mixed datatypes. I converted all features to strings and now the for loop works.
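A minimal sketch of that fix, assuming the features live in a pandas dataframe. The helper names (find_mixed_type_columns, stringify_columns) and the raw_demo data are hypothetical, not from the original post. Note that casting to str also changes how missing values are represented, so those may need separate handling.

```python
import pandas as pd

def find_mixed_type_columns(df, columns):
    """Return the columns whose non-null values have more than one Python type."""
    mixed = []
    for col in columns:
        kinds = {type(v) for v in df[col].dropna()}
        if len(kinds) > 1:
            mixed.append(col)
    return mixed

def stringify_columns(df, columns):
    """Return a copy of df with the given columns cast to str."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(str)
    return out

# Hypothetical stand-in for the `raw` dataframe from the question
raw_demo = pd.DataFrame({
    "quality_group": ["good", 1, "salty", 2],
    "status_group": ["functional", "functional", "non functional", "functional"],
})
mixed_columns = find_mixed_type_columns(raw_demo, ["quality_group", "status_group"])
clean_demo = stringify_columns(raw_demo, mixed_columns)
```

After this, the loop over categorical_columns can be run against the cleaned dataframe instead of the original one.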

Related

Altair shortened form of x-axis label issues

I'm trying to use a slice of the longspecies string for the x axis while showing the full longspecies in the legend. I've tried a couple of different ways. Without the domain the code below works, but once the domain is added the x axis at least doubles in length, probably because the correlated values from species and longspecies are both added.
I'm not sure how to use only longspecies and slice the tick marks either (value and label don't seem to be what I'm looking for). Any help would be appreciated.
The data setup:
testspecies = ['oak', 'elm', 'willow']
testmean = [np.random.rand()*25+75 for _ in range(len(testspecies))]
testlongspecies = [testspecies[i] + ' long form name' for i in range(len(testspecies))]
testset = zip(testmean, testspecies, testlongspecies, strict=True)
testdf = pd.DataFrame(testset, columns = ['average', 'species', 'long species'])
The chart:
alt.Chart(testdf).mark_bar().encode(
    alt.X('species', title=None),
    alt.Y('average', title='mean', scale=alt.Scale(domain=(50, 100))),
    color='long species'
)

If I add clip=True inside the mark, I get the following graph which sounds like it is what you are looking for (the x-axis labels are shorter than those in the legend):
alt.Chart(testdf).mark_bar(clip=True).encode(
    alt.X('species', title=None),
    alt.Y('average', title='mean', scale=alt.Scale(domain=(50, 100))),
    color='long species'
)
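If only the long names were available, the short x-axis labels could probably be derived in pandas before charting. This is just a sketch: the long_only dataframe is a hypothetical stand-in, and taking the first word is an assumption about how the long names are built.

```python
import pandas as pd

# Hypothetical frame that only has the long names
long_only = pd.DataFrame({
    "average": [80.0, 90.0, 85.0],
    "long species": ["oak long form name", "elm long form name", "willow long form name"],
})

# Keep the first word of each long name as the short x-axis label
long_only["species"] = long_only["long species"].str.split().str[0]
```

The resulting species column can then be used for the x encoding while 'long species' stays on the color channel.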

How do I reduce the number of ticks on an Altair graph?

I am using Altair to create a graph, but for some reason it seems to be generating a tick for each of the points, creating a graph like this: (image: Altair graph)
If I filter the dataframe, it produces weird axis values: (image: Altair graph)
Is there a way to reduce the number of ticks? I tried tickCount in the y-axis parameter and it didn't work, since it seems to require integers. I also tried setting the axis value parameter to the list [0, 0.2, 0.4, 0.6, 0.8, 1] and that didn't work either. Here is my code (sorry it's so lengthy!). Thank you in advance!
a = alt.Chart(df_filtered).mark_point().encode(
    x=alt.X('Process_Time_(mins)', axis=alt.Axis(title='Process Time (mins)')),
    y=alt.Y('Heavy_Phase_%SS', axis=alt.Axis(title='Heavy Phase %SS', tickCount=10), sort='descending'),
    color=alt.Color('DSP_Lot', legend=alt.Legend(title='DSP_Lot')),
    shape=alt.Shape('Strain', scale=alt.Scale(range=["circle", "square", "cross", "diamond", "triangle-up", "triangle-down", "triangle-right", "triangle-left"])),
    tooltip=[alt.Tooltip('DSP_Lot', title='Lot'), alt.Tooltip('Heavy_Phase_%SS', title='Heavy Phase %SS'),
             alt.Tooltip('Process_Time_(mins)', title='Process Time (mins)'), alt.Tooltip('Purpose', title='Purpose'), alt.Tooltip('Strain', title='Strain'),
             alt.Tooltip('Trial', title='Trial')]
).properties(width=1000, height=500)

It's hard to tell without a reproducible example but I suspect the issue is that your y axis is defaulting to a nominal encoding type, in which case you get one tick mark per unique value. If you specify a quantitative type in the Y encoding, it may improve things:
y = alt.Y('Heavy_Phase_%SS:Q', ...)
The reason it defaults to nominal is probably because the associated column in the pandas dataframe has a string type rather than a numerical type.
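If the column really is stored as strings, converting it in pandas before charting is another option. A minimal sketch, assuming a small stand-in dataframe (df_demo is hypothetical); errors="coerce" turns unparseable entries into NaN rather than raising.

```python
import pandas as pd

# Hypothetical stand-in for df_filtered, with the y column stored as strings
df_demo = pd.DataFrame({
    "Process_Time_(mins)": [0, 30, 60],
    "Heavy_Phase_%SS": ["0.2", "0.45", "n/a"],
})
was_numeric = pd.api.types.is_numeric_dtype(df_demo["Heavy_Phase_%SS"])

# Convert to a numeric dtype; values that cannot be parsed become NaN
df_demo["Heavy_Phase_%SS"] = pd.to_numeric(df_demo["Heavy_Phase_%SS"], errors="coerce")
is_numeric = pd.api.types.is_numeric_dtype(df_demo["Heavy_Phase_%SS"])
```

With a numeric column, Altair should infer a quantitative type on its own, though spelling out :Q does no harm.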

How to annotate certain data points on a python scatterplot based on column value

I am almost done with my first real deal python data science project. However, there is one last thing I can't seem to figure out. I have the following code to create a plot for my PCA and K Means clustering algorithm:
y_axis = passers_pca_kmeans['Component 1']
x_axis = passers_pca_kmeans['Component 2']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=passers_pca_kmeans['Segment'], palette=['g','r','c','m'])
plt.title('Clusters by PCA Components')
plt.grid(zorder=0,alpha=.4)
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name in zip(
    passers_pca_kmeans['Component 2'], passers_pca_kmeans['Component 1'], passers_pca_kmeans.name)]
adjust_text(texts)
plt.show()
I finally got the correct code to annotate the points using adjustText, but my plot has too many points to label them all; it looks like a mess with text everywhere.
I would like to annotate the scatterplot based on the value in the column 'Segment'.
The values in this column are the names of my four clusters 'first', 'second', 'third', 'fourth'.
How do I alter my adjustText code to only annotate points where 'Segment'='first'?
Would this be an np.where situation?

You could slice the inputs to the text call with a boolean mask, something like:
mask = (passers_pca_kmeans["Segment"] == "first")
x = passers_pca_kmeans["Component 2"][mask]
y = passers_pca_kmeans["Component 1"][mask]
names = passers_pca_kmeans.name[mask]
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name in zip(x,y,names)]
You could also make an unruly list comprehension by adding an if condition:
x = passers_pca_kmeans["Component 2"]
y = passers_pca_kmeans["Component 1"]
names = passers_pca_kmeans.name
segments = passers_pca_kmeans["Segment"]
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name,segment in zip(x,y,names,segments) if segment == "first"]
I bet there is an answer with np.where as well.
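For what it's worth, an np.where version might look like the sketch below. The demo dataframe is a hypothetical stand-in for the asker's data. With a single condition, np.where returns the positions where it holds, which can be fed to iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the PCA/k-means dataframe
demo = pd.DataFrame({
    "Component 1": [0.1, 0.4, -0.3, 0.8],
    "Component 2": [1.2, -0.5, 0.7, 0.0],
    "Segment": ["first", "second", "first", "third"],
    "name": ["A", "B", "C", "D"],
})

# Row positions where the segment is 'first'
positions = np.where(demo["Segment"] == "first")[0]
selected = demo.iloc[positions]

# (x, y, text) triples that could be passed to plt.text one by one
labels = list(zip(selected["Component 2"], selected["Component 1"], selected["name"]))
```

Each triple in labels would then go through plt.text and adjust_text as before; the boolean mask is arguably simpler, since it skips the detour through positions.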

Placing Labels in nested categorical stacked bar with Bokeh and Pandas

I am trying to replicate a chart like the following using a pandas dataframe and Bokeh vbar:
(Image: Objective)
So far, I've managed to place the labels at their corresponding heights, but now I can't find a way to access the numeric value where the category (2016, 2017, 2018) is located in the x axis. This is my result:
(Image: My nested categorical stacked bars chart)
This is my code. It's messy, but it's what I've managed so far. Is there a way to access the numeric x-axis value of the bars?
def make_nested_stacked_bars(source, measurement, dimension_attr):
    # dimension_attr is a list that contains the names of columns in source that will be used as categories
    # measurement contains the name of the column with numeric data.
    data = source.copy()
    # Creates list of values of highest index
    list_attr = source[dimension_attr[0]].unique()
    list_stackers = list(source[dimension_attr[-1]].unique())
    list_stackers.sort()
    # Trims labels that are too wide to fit in graph
    for column in data.columns:
        if data[column].dtype.name == 'object':
            data[column] = np.where(data[column].apply(len) > 30, data[column].str[:30]+'...', data[column])
    # Creates a list of dataframes, each grouping a specific value
    list_groups = []
    for item in list_attr:
        list_groups.append(data[data[dimension_attr[0]] == item])
    # Groups data by dimension attrs, aggregates measurement to count
    # Drops highest index from dimension attr
    dropped_attr = dimension_attr[0]
    dimension_attr.remove(dropped_attr)
    # Creates groupby by the last 2 parameters, and aggregates to count
    # Calculates percentage
    for index, value in enumerate(list_groups):
        list_groups[index] = list_groups[index].groupby(by=dimension_attr).agg({measurement: ['count']})
        list_groups[index] = list_groups[index].groupby(level=0).apply(lambda x: round(100 * x / float(x.sum()), 1))
        # Resets indexes
        list_groups[index] = list_groups[index].reset_index()
        list_groups[index] = list_groups[index].pivot(index=dimension_attr[0], columns=dimension_attr[1])
        list_groups[index].index = [(x, list_attr[index]) for x in list_groups[index].index]
        # Drops dimension attr as top level column
        list_groups[index].columns = list_groups[index].columns.droplevel(0)
        list_groups[index].columns = list_groups[index].columns.droplevel(0)
    df = pd.concat(list_groups)
    # Get the number of colors needed for the plot.
    colors = brewer["Spectral"][len(list_stackers)]
    colors.reverse()
    p = figure(plot_width=800, plot_height=500, x_range=FactorRange(*df.index))
    renderers = p.vbar_stack(list_stackers, x='index', width=0.3, fill_color=colors,
                             legend=[get_item_value(x) for x in list_stackers],
                             line_color=None, source=df, name=list_stackers)
    # Adds a different hovertool to each stacked bar
    # Empty dictionary with initial values set to zero
    list_previous_y = {}
    for item in df.index:
        list_previous_y[item] = 0
    # Loops through bar graphs
    for r in renderers:
        stack = r.name
        hover = HoverTool(tooltips=[
            ("%s" % stack, "#%s" % stack),
        ], renderers=[r])
        # Initial value for placing label in x_axis
        previous_x = 0.5
        # Loops through dataset rows
        for index, row in df.iterrows():
            # Adds value of df column to the running total
            list_previous_y[index] = list_previous_y[index] + df[stack][index]
            # Adds label if value is not nan and at least 10
            if not math.isnan(df[stack][index]) and df[stack][index] >= 10:
                p.add_layout(Label(x=previous_x, y=list_previous_y[index] - df[stack][index]/2,
                                   text='% '+str(df[stack][index]), render_mode='css',
                                   border_line_color='black', border_line_alpha=1.0,
                                   background_fill_color='white', background_fill_alpha=1.0))
            # Increases position in x_axis
            # This should be done by adding the value of next bar in x_axis
            previous_x = previous_x + 0.8
        # Registers the hover tool once per stacked renderer
        p.add_tools(hover)
    p.legend.location = "top_left"
    p.x_range.range_padding = 0.2
    p.xgrid.grid_line_color = None
    return p
Or is there an easier way to get all this done?
Thank you for your time!
UPDATE:
Added an additional image of a three level nested chart where the label placement in x_axis should be accomplished too
Three level nested chart

"I can't find a way to access the numeric value where the category (2016, 2017, 2018) is located in the x axis."
There is no way to access this information on the Python side in standalone Bokeh output. The coordinates are only computed inside the browser on the JavaScript side, i.e. only after your Python code has finished running and is out of the picture entirely. Even in a Bokeh server app context there is no direct way, as there are no synchronized properties that record the values.
As of Bokeh 1.3.4, support for placing labels with categorical coordinates is a known open issue.
In the meantime, the only workarounds I can suggest are:
Use the text glyph method with coordinates in a ColumnDataSource, instead of Label. That should work to position with actual categorical coordinates. (LabelSet might also work, though I have not tried.) You can see an example of text with categorical coordinates here:
https://github.com/bokeh/bokeh/blob/master/examples/plotting/file/periodic.py
Use numerical coordinates to position the Label. But you will have to experiment or make a best guess to find numerical coordinates that work for you. A rule of thumb is that categories have a width of 1.0 in synthetic (numeric) coordinate space.
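As a rough illustration of that rule of thumb, the sketch below estimates synthetic x coordinates for a flat (non-nested) list of factors. It assumes the first category is centered at 0.5 and that each category takes up 1.0 plus any factor padding. The helper name flat_category_centers is hypothetical and not part of Bokeh. Nested factors add group padding on top of this, so treat the numbers only as a starting point for experimenting.

```python
def flat_category_centers(factors, factor_padding=0.0):
    """Estimate the synthetic x coordinate of each flat (non-nested) factor.

    Assumption: the first factor is centered at 0.5 and each factor
    occupies a width of 1.0 plus the padding between factors.
    """
    return {factor: 0.5 + i * (1.0 + factor_padding) for i, factor in enumerate(factors)}

centers = flat_category_centers(["2016", "2017", "2018"])
```

The estimated value for a category could then be passed as the x of a Label and adjusted by eye.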

My solution was:
I created a copy of the dataframe used for making the chart. This dataframe (labeling_data) contains the y-axis coordinates, calculated so that each label is positioned at the middle of the corresponding stacked bar.
Then I added additional columns to be used as the actual labels, in which the values to be displayed are concatenated with the percentage symbol.
labeling_data = df.copy()
# Cumulative sum of columns
labeling_data = labeling_data.cumsum(axis=1)
# New names for columns
y_position = []
for item in labeling_data.columns:
    y_position.append(item+'_offset')
labeling_data.columns = y_position
# Copies original columns
for item in df:
    # Adding original columns
    labeling_data[item] = df[item]
    # Modifying offset columns to place label in the middle of the bar
    labeling_data[item+'_offset'] = labeling_data[item+'_offset']-labeling_data[item]/2
    # Concatenating values with percentage symbol if at least 10
    labeling_data[item+'_label'] = np.where(df[item] >= 10, '% '+df[item].astype(str), "")
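To see why the offset lands in the middle of each segment, here is a small worked example on made-up numbers (df_demo is hypothetical). The cumulative sum gives the top edge of each stacked segment, and subtracting half the segment's own height gives its vertical midpoint.

```python
import pandas as pd

# Hypothetical percentages for two stacked series over two categories
df_demo = pd.DataFrame({"a": [30.0, 60.0], "b": [70.0, 40.0]}, index=["x", "y"])

# Top edge of each stacked segment
tops = df_demo.cumsum(axis=1)

# Vertical midpoint of each segment: top edge minus half the segment height
offsets = tops - df_demo / 2
```

For category "x", segment "a" spans 0 to 30 (midpoint 15) and segment "b" spans 30 to 100 (midpoint 65).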
Finally, by looping through the renderers of the plot, a LabelSet was added to each stack group using labeling_data as the data source. By doing this, the index of the dataframe can be used to set the x coordinate of the label, and the corresponding columns supply the y coordinate and text parameters.
info = ColumnDataSource(labeling_data)
# Loops through the stacked bar renderers
for r in renderers:
    stack = r.name
    # Creates a LabelSet that uses the index, offset and label columns
    # as x, y and text parameters. One LabelSet covers every row of the
    # source, so no per-row loop is needed.
    labels = LabelSet(x='index', y=stack+'_offset', text=stack+'_label', level='overlay',
                      x_offset=-25, y_offset=-5, source=info)
    p.add_layout(labels)
Final result:
Nested categorical stacked bar chart with labels

Plotting a new image without using the old one in matplotlib?

I'm new to Python and matplotlib.
I have implemented the k-means algorithm in order to compress an image into
clusters and then plot the changed image.
My question is: I was not able to plot the new image without using
the old one as a base. I tried a few things but could not quite get the result I want, and it's bad programming to pass the old image as an argument when I don't really need it.
Can someone please help?
I tried to create a new ndarray but it did not work.
Here is my function:
def changePic(newPixelList, oldPixel, image_size):
    index = 0
    new_pixels = []
    for pixel in newPixelList:
        oldPixel[index] = pixel.classification
        index += 1
    l = oldPixel.reshape(image_size)
    plt.imshow(l)
    plt.grid(False)
    plt.show()
As you can see, I don't really use the oldPixel values, just its structure.
Now I'll show you the type of oldPixel:
Here is my loadPic method, where X.copy() is the argument oldPixel:
def loadPic():
    """
    Load pic to array
    :return: copy of original X, new list of pixels, image size
    """
    # data preparation (loading, normalizing, reshaping)
    path = 'dog.jpeg'
    A = imread(path)
    A = A.astype(float) / 255.
    img_size = A.shape
    X = A.reshape(img_size[0] * img_size[1], img_size[2])
    listOfPixel = []
    for pixel in X:
        listOfPixel.append(Pixel(pixel))
    return X.copy(), listOfPixel, img_size

Try this:
def changePic(newPixelList, oldPixel, image_size, picture_num):
    index = 0
    new_pixels = []
    for pixel in newPixelList:
        oldPixel[index] = pixel.classification
        index += 1
    l = oldPixel.reshape(image_size)
    plt.figure(picture_num)
    plt.imshow(l)
    plt.grid(False)
    plt.show()
Every plot that you generate should have a different picture_num in order to have separate plots.
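As for not passing the old array at all, one possible sketch is to build a fresh array straight from the classifications. This assumes each pixel.classification holds the colour assigned to that pixel (for example an RGB triple). The Pixel class and build_image helper below are hypothetical stand-ins, not the asker's actual code.

```python
import numpy as np

class Pixel:
    """Hypothetical stand-in for the asker's Pixel class."""
    def __init__(self, rgb):
        # Assumed to hold the colour of the cluster the pixel was assigned to
        self.classification = rgb

def build_image(new_pixel_list, image_size):
    """Stack every pixel's classification into a new array shaped like the image."""
    flat = np.array([pixel.classification for pixel in new_pixel_list], dtype=float)
    return flat.reshape(image_size)

# Six pixels arranged as a 2 x 3 image with 3 colour channels
pixels = [Pixel([0.0, 0.5, 1.0]) for _ in range(6)]
image = build_image(pixels, (2, 3, 3))
```

The returned array could be handed to plt.imshow in place of the reshaped oldPixel, so the old image never needs to be passed in.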
