seaborn.boxplot and wide-format dataframe

seaborn.boxplot and wide-format dataframe - python

I have a DataFrame like this:
I tried these two instructions one after another:
sns.boxplot([dataFrame.mean_qscore_template,dataFrame.mean_qscore_complement,dataFrame.mean_qscore_2d])
sns.boxplot(x = "mean_qscore_template", y= "mean_qscore_complement", hue = "mean_qscore_2d" data = tips)
I want to get mean_qscore_template, mean_qscore_complement and mean_qscore_2d on the x-axis with the measure on y-axis but it doesn't work.
In the documentation they give an example with tips but my dataFrame is not organized f the same way.

sns.boxplot(data = dataFrame) will make boxplots for each numeric column of your dataframe.

Related

Stacked Area Chart in Python

I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id
Year
Period
Value
LNS140000
1948
M01
3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
#Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels = d['Period'])
plt.legend(d['Period'], loc = 'upper left')
plt.show()enter code here`
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?

You need to preprocess your data a little before passing them to the stackplot function. I took a look at this link to work on an example that could be suitable for your case.
Since I've seen one row of your data, I add some random values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd=[[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
[1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d=pd.DataFrame(dd,columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
#Now group them per period, but in year sequence
d.sort_values(by='Year',inplace=True) # to ensure entire dataset is ordered
pds=[]
for p in periods:
pds.append(d[d.Period==p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?

So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index = 'Year',
columns = 'Period',
values = 'Value')
df
Then I set up seaborn
plt.style.use('seaborn')
sns.set_style("white")
sns.set_theme(style = "ticks")
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)

It seems to me that the problem you are having relates to the formatting of the data. Look how the values are formatted in this matplotlib example. I would try to groupby the data by period, or pivot it in the correct format, and then graphing again.

Unusual bar plot in matplotlib

I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.
I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that I have one group per cluster which consists of 5 individual bars. The second problem is that the labels of these individual bars are important, not the same across groups and not unique either.
I have a makeshift prototype that looks like this:
I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.
Edit: Here is some requested toy data as well as the code used to produce the plot I already have.
Toy data:
The following two pandas dataframes are included in an array. The two code blocks include the results from df_list[i].to_csv(). I hope this helps, but for the context of this problem the actual data does not really matter, so you can also just create your own dataframes.
,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127
and
,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198
Code:
The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.
def plot_all_clusters_words(dfs):
# target structure: word as non unique column, value as other non unique column
df_dict_list = []
for df in dfs:
for index, row in df.iterrows():
df_dict_list.append({"word": row.features, "value": row.score})
df_dict_list.append({"word": "", "value": 0})
df_dict_list = df_dict_list[:-1]
new_df = pd.DataFrame(df_dict_list)
new_df.plot.bar(x="word")
plt.show()
return new_df
Note:
I just need a way to easily identify the groups, if you know a different approach than the ones I suggested above, feel free to do so.

Calling plt.bar for each of the dataframes, each with an own label and color, would create the following plot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))
dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
bar_width = 0.8 # width of individual bars
cluster_gap = 0.2 # extra distance between clusters
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()

How to make a scatterplot in seaborn from 2 numerical columns and 1 categorical column?

I want to make a scatterplot in seaborn (but I'm open to other ways to execute this) from two numerical columns of data and one categorical column of data, with the two titles of the numerical columns on the x axis, the values of the numerical columns on the y axis, and the cat column represented by hue.
this is kind of like what I want, with the names, firstgame and lastgame on the x axis instead of 1 minute and 15 minute
There are 50 basketball teams in my dataset, each with their own row (so there are 50 rows). Each team has a label, "good" or "bad". The label is the categorical column that I want in my plot. The first numerical column I want has the number of attendees for the first game of the season and the second numerical column has the number of attendees for the last game of the season. I figured I could plot this using seaborn, but I'm not sure how to designate x and y. I tried add the two num columns together in a list and then going from there but that didn't really work out. Any suggestions...? Thank you so much in advance.

try the following
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = [[8.98, 1.56, 'fail'],
[8.91, 5.22, 'success'],
[5.39, 2.13, 'fail'],
[5.06, 1.61, 'fail'],
[5.84, 2.86, 'fail']]
df=pd.DataFrame(data=data, columns=['firstgame','lastgame','label'])
ax=sns.scatterplot(x='firstgame',y='lastgame',hue='label',data=df)
plt.show()
This will produce:

You can try the following:
## sample data, ignore this
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100, (50,2)),
columns=['firstgame','lastgame'])
df['label'] = np.random.choice(['good','bad'], 50)
## replace 'index' with your index name if any
sns.lineplot(data=df.reset_index().melt(id_vars=['index','label']),
hue='label',
style='variable',
x='index',
y='value')
Output:

Reorder Catagorical X axis ticks Matplotlib [duplicate]

EDIT: this question arose back in 2013 with pandas ~0.13 and was obsoleted by direct support for boxplot somewhere between version 0.15-0.18 (as per #Cireo's late answer; also pandas greatly improved support for categorical since this was asked.)
I can get a boxplot of a salary column in a pandas DataFrame...
train.boxplot(column='Salary', by='Category', sym='')
...however I can't figure out how to define the index-order used on column 'Category' - I want to supply my own custom order, according to another criterion:
category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()
How can I apply my custom column order to the boxplot columns? (other than ugly kludging the column names with a prefix to force ordering)
'Category' is a string (really, should be a categorical, but this was back in 0.13, where categorical was a third-class citizen) column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. So it can be easily factorized with pd.Categorical.from_array()
On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing ordering:
pandas.core.frame.py.boxplot() is a passthrough to
pandas.tools.plotting.py:boxplot()
which instantiates ...
matplotlib.pyplot.py:boxplot() which instantiates ...
matplotlib.axes.py:boxplot()
I suppose I could either hack up a custom version of pandas boxplot(), or reach into the internals of the object. And also file an enhance request.

Hard to say how to do this without a working example. My first guess would be to just add an integer column with the orders that you want.
A simple, brute-force way would be to add each boxplot one at a time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
for position, column in enumerate(columns_my_order):
ax.boxplot(df[column], positions=[position])
ax.set_xticks(range(position+1))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(xmin=-0.5)
plt.show()

EDIT: this is the right answer after direct support was added somewhere between version 0.15-0.18
tl;dr: for recent pandas - use positions argument to boxplot.
Adding a separate answer, which perhaps could be another question - feedback appreciated.
I wanted to add a custom column order within a groupby, which posed many problems for me. In the end, I had to avoid trying to use boxplot from a groupby object, and instead go through each subplot myself to provide explicit positions.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame()
df['GroupBy'] = ['g1', 'g2', 'g3', 'g4'] * 6
df['PlotBy'] = [chr(ord('A') + i) for i in xrange(24)]
df['SortBy'] = list(reversed(range(24)))
df['Data'] = [i * 10 for i in xrange(24)]
# Note that this has no effect on the boxplot
df = df.sort_values(['GroupBy', 'SortBy'])
for group, info in df.groupby('GroupBy'):
print 'Group: %r\n%s\n' % (group, info)
# With the below, cannot use
# - sort data beforehand (not preserved, can't access in groupby)
# - categorical (not all present in every chart)
# - positional (different lengths and sort orders per group)
# df.groupby('GroupBy').boxplot(layout=(1, 5), column=['Data'], by=['PlotBy'])
fig, axes = plt.subplots(1, df.GroupBy.nunique(), sharey=True)
for ax, (g, d) in zip(axes, df.groupby('GroupBy')):
d.boxplot(column=['Data'], by=['PlotBy'], ax=ax, positions=d.index.values)
plt.show()
Within my final code, it was even slightly more involved to determine positions because I had multiple data points for each sortby value, and I ended up having to do the below:
to_plot = data.sort_values([sort_col]).groupby(group_col)
for ax, (group, group_data) in zip(axes, to_plot):
# Use existing sorting
ordering = enumerate(group_data[sort_col].unique())
positions = [ind for val, ind in sorted((v, i) for (i, v) in ordering)]
ax = group_data.boxplot(column=[col], by=[plot_by], ax=ax, positions=positions)

Actually I got stuck with the same question. And I solved it by making a map and reset the xticklabels, with code as follows:
df = pd.DataFrame({"A":["d","c","d","c",'d','c','a','c','a','c','a','c']})
df['val']=(np.random.rand(12))
df['B']=df['A'].replace({'d':'0','c':'1','a':'2'})
ax=df.boxplot(column='val',by='B')
ax.set_xticklabels(list('dca'))

Note that pandas can now create categorical columns. If you don't mind having all the columns present in your graph, or trimming them appropriately, you can do something like the below:
http://pandas.pydata.org/pandas-docs/stable/categorical.html
df['Category'] = df['Category'].astype('category', ordered=True)
Recent pandas also appears to allow positions to pass all the way through from frame to axes.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/pyplot.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_axes.py

It might sound kind of silly, but many of the plot allow you to determine the order. For example:
Library & dataset
import seaborn as sns
df = sns.load_dataset('iris')
Specific order
p1=sns.boxplot(x='species', y='sepal_length', data=df, order=["virginica", "versicolor", "setosa"])
sns.plt.show()

If you're not happy with the default column order in your boxplot, you can change it to a specific order by setting the column parameter in the boxplot function.
check the two examples below:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
##
plt.figure()
df.boxplot()
plt.title("default column order")
##
plt.figure()
df.boxplot(column=['C','A', 'D', 'B'])
plt.title("Specified column order")

Use the new positions= attribute:
df.boxplot(column=['Data'], by=['PlotBy'], positions=df.index.values)

This can be resolved by applying a categorical order. You can decide on the ranking yourself. I'll give an example with days of week.
Provide categorical order to weekday
#List categorical variables in correct order
weekday = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
#Assign the above list to category ranking
wDays = pd.api.types.CategoricalDtype(ordered= True, categories=Weekday)
#Apply this to the specific column in DataFrame
df['Weekday'] = df['Weekday'].astype(wDays)
# Then generate your plot
plt.figure(figsize = [15, 10])
sns.boxplot(data = flights_samp, x = 'Weekday', y = 'Y Axis Variable', color = colour)

Plotting a Pandas DataSeries.GroupBy

I am new to python and pandas, and have the following DataFrame.
How can I plot the DataFrame where each ModelID is a separate plot, saledate is the x-axis and MeanToDate is the y-axis?
Attempt
data[40:76].groupby('ModelID').plot()
DataFrame

You can make the plots by looping over the groups from groupby:
import matplotlib.pyplot as plt
for title, group in df.groupby('ModelID'):
group.plot(x='saleDate', y='MeanToDate', title=title)
See for more information on plotting with pandas dataframes:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
and for looping over a groupby-object:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups

Example with aggregation:
I wanted to do something like the following, if pandas had a colour aesthetic like ggplot:
aggregated = df.groupby(['model', 'training_examples']).aggregate(np.mean)
aggregated.plot(x='training_examples', y='accuracy', label='model')
(columns: model is a string, training_examples is an integer, accuracy is a decimal)
But that just produces a mess.
Thanks to joris's answer, I ended up with:
for index, group in df.groupby(['model']):
group_agg = group.groupby(['training_examples']).aggregate(np.mean)
group_agg.plot(y='accuracy', label=index)
I found that title= was just replacing the single title of the plot on each loop iteration, but label= does what you'd expect -- after running plt.legend(), of course.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.