Joining two Pandas dataframes and producing side-by-side barplot? - python

Suppose I have two Pandas dataframes, df1 and df2, each with two columns, hour and value. Some of the hours are missing in the two dataframes.
import pandas as pd
import matplotlib.pyplot as plt
data1 = [
('hour', [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]),
('value', [12.044324085714285, 8.284134466666668, 9.663580800000002,
18.64010145714286, 15.817029916666664, 13.242198508695651,
10.157177889201877, 9.107153674476985, 10.01193336545455,
16.03340384878049, 16.037368506666674, 16.036160044827593,
15.061596637500001, 15.62831551764706, 16.146087032608694,
16.696574719512192, 16.02603831463415, 17.07469460470588,
14.69635686969697, 16.528905725581396, 12.910250661111112,
13.875522341935481, 12.402971938461539])
]
df1 = pd.DataFrame(dict(data1))  # DataFrame.from_items was removed in pandas 1.0
df1.head()
# hour value
# 0 0 12.044324
# 1 1 8.284134
# 2 2 9.663581
# 3 4 18.640101
# 4 5 15.817030
data2 = [
('hour', [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23]),
('value', [27.2011904, 31.145661266666668, 27.735570511111113,
18.824297487999996, 17.861847334275623, 25.3033003254902,
22.855934450000003, 31.160574200000003, 29.080220000000004,
30.987719745454548, 26.431310216666663, 30.292641480000004,
27.852885586666666, 30.682682472727276, 29.43023531764706,
24.621718962500005, 33.92878745, 26.873105866666666,
34.06412232, 32.696606333333335])
]
df2 = pd.DataFrame(dict(data2))  # DataFrame.from_items was removed in pandas 1.0
df2.head()
# hour value
# 0 0 27.201190
# 1 5 31.145661
# 2 6 27.735571
# 3 7 18.824297
# 4 8 17.861847
I would like to join them together using the key of hour and then produce a side-by-side barplot of the data. The x-axis would be hour, and the y-axis would be value.
I can create a bar plot of one dataframe at a time.
_ = plt.bar(df1.hour.tolist(), df1.value.tolist())
_ = plt.xticks(df1.hour, rotation=0)
_ = plt.grid()
_ = plt.show()
_ = plt.bar(df2.hour.tolist(), df2.value.tolist())
_ = plt.xticks(df2.hour, rotation=0)
_ = plt.grid()
_ = plt.show()
However, what I want is to create a barchart of them side by side, like this:
Thank you for any help.

You can do it all in one line, if you wish, by making use of the pandas plotting wrapper and the fact that plotting a dataframe with several columns groups the bars. Given the definitions of df1 and df2 from the question, you can call
pd.merge(df1,df2, how='outer', on=['hour']).set_index("hour").plot.bar()
plt.show()
resulting in
Note that this leaves out the number 3 in this case, as it is not part of the hour column in either of the two dataframes. To include it, use reindex:
pd.merge(df1,df2, how='outer', on=['hour']).set_index("hour").reindex(range(24)).plot.bar()
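To see what the outer merge and reindex produce before plotting, here is a minimal sketch. The two small frames are hypothetical stand-ins for df1 and df2, and the suffixes argument is an optional addition (not part of the answer above) that names the two value columns instead of the default value_x and value_y.

```python
import pandas as pd

# Hypothetical stand-in frames: hour 1 is missing from df2, hour 3 from both.
df1 = pd.DataFrame({"hour": [0, 1, 2, 4], "value": [12.0, 8.3, 9.7, 18.6]})
df2 = pd.DataFrame({"hour": [0, 2, 4], "value": [27.2, 31.1, 27.7]})

merged = (
    pd.merge(df1, df2, how="outer", on="hour", suffixes=("_df1", "_df2"))
    .set_index("hour")
    .reindex(range(5))  # adds a row of NaN for hour 3
)
print(merged)
```

Calling merged.plot.bar() on this frame draws one group of bars per hour, with a gap wherever a value is NaN.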

First reindex the dataframes and then create two barplots using the data. The positioning of the rectangles is given by (x - width/2, x + width/2, bottom, bottom + height).
import numpy as np
index = np.arange(0, 24)
bar_width = 0.3
df1 = df1.set_index('hour').reindex(index)
df2 = df2.set_index('hour').reindex(index)
plt.figure(figsize=(10, 5))
plt.bar(index - bar_width / 2, df1.value, bar_width, label='df1')
plt.bar(index + bar_width / 2, df2.value, bar_width, label='df2')
plt.xticks(index)
plt.legend()
plt.tight_layout()
plt.show()

Related

pandas boxplot contains content of plot saved before

I'm plotting some columns of a dataframe into a boxplot. So far, no problem. As seen below, I wrote some code and it works. BUT: the second plot contains the contents of the first plot, too. So as you can see I tried it with "= None" or "del value", but it does not work. Putting the plot function outside also doesn't solve the problem.
What's wrong with my code?
Here is an executable example
import pandas as pd
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    def plot(df, output):
        boxplot = df.boxplot(rot=45, fontsize=5)
        fig = boxplot.get_figure()
        fig.savefig(output + ".pdf")
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    plot(df_ot, "bp_opt_time")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    plot(df_op, "bp_count_opt_perm")
evaluate2(df1, df2)
Here is another executable example. I even used other variable names.
import pandas as pd
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    boxplot1 = df_ot.boxplot(rot=45, fontsize=5)
    fig1 = boxplot1.get_figure()
    fig1.savefig("bp_opt_time.pdf")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    boxplot2 = df_op.boxplot(rot=45, fontsize=5)
    fig2 = boxplot2.get_figure()
    fig2.savefig("bp_count_opt_perm.pdf")
evaluate2(df1, df2)
I can see from your code that both boxplots, boxplot1 and boxplot2, are drawn on the same axes. You need to tell matplotlib that there are going to be two separate plots.
This can be achieved in either of two ways:
Create a new figure for each boxplot using pyplot in matplotlib: fig1, ax1 = plt.subplots() creates a fresh figure fig1 with axes ax1, and the boxplot is then drawn on those axes.
Dissolve the evaluate2 function and execute each boxplot in a separate cell of the Jupyter notebook.
Solution 1 : Two subplots using pyplot
import pandas as pd
import matplotlib.pyplot as plt
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    fig1, ax1 = plt.subplots()  # new figure and axes for the first boxplot
    df_ot.boxplot(rot=45, fontsize=5, ax=ax1)  # draw explicitly on ax1
    fig1.savefig("bp_opt_time.pdf")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    fig2, ax2 = plt.subplots()  # second, independent figure
    df_op.boxplot(rot=45, fontsize=5, ax=ax2)
    fig2.savefig("bp_count_opt_perm.pdf")
    plt.show()
evaluate2(df1, df2)
Solution 2: Executing boxplot in different cell
Update based on comments : clearing plots
There are two ways to clear a plot:
Clear the figure itself using clf()
matplotlib.pyplot.clf() clears the current Figure's state without closing it.
Clear the axes using cla()
matplotlib.pyplot.cla() clears the current Axes' state without closing the Axes.
Simply call plt.clf() after calling fig.savefig().
Read this documentation on how to clear a plot in Python using matplotlib
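As a rough sketch of the clf() approach, the helper below saves a boxplot and then clears the current figure, so the next call starts from an empty figure. The helper name save_boxplot, the shortened data, and the in-memory buffers (used here instead of PDF files on disk) are assumptions made for illustration.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd


def save_boxplot(df, target):
    """Draw df as a boxplot on the current figure, save it, then clear the figure."""
    ax = df.boxplot(rot=45, fontsize=5)
    fig = ax.get_figure()
    fig.savefig(target, format="pdf")
    plt.clf()  # wipe the current figure so the next boxplot starts empty
    return fig


df_ot = pd.DataFrame({"opt_time1": [10, 20, 11, 5], "opt_time2": [1, 2, 1, 5]})
df_op = pd.DataFrame({"count_opt1": [30, 40, 45, 29], "count_opt2": [3, 4, 4, 9]})

# In-memory stand-ins for "bp_opt_time.pdf" and "bp_count_opt_perm.pdf".
buf_ot, buf_op = io.BytesIO(), io.BytesIO()
fig_a = save_boxplot(df_ot, buf_ot)
fig_b = save_boxplot(df_op, buf_op)
```

Both calls reuse the same figure object; because it is cleared after each save, the second PDF contains only the second boxplot.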
Just grab the code from Archana David and put it in your plot function: the goal is to call "fig, ax = plt.subplots()" to create a new graph.
import pandas as pd
import matplotlib.pyplot as plt
d1 = {'ff_opt_time': [10, 20, 11, 5, 15, 13, 19, 25],
'ff_count_opt': [30, 40, 45, 29, 35, 38, 32, 41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1, 1, 4, 5],
'ff_count_opt': [3, 4, 4, 9, 5, 3, 2, 4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    def plot(df, output):
        fig, ax = plt.subplots()  # new figure and axes on every call
        df.boxplot(rot=45, fontsize=5, ax=ax)  # draw explicitly on the new axes
        fig.savefig(output + ".pdf")
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    plot(df_ot, "bp_opt_time")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    plot(df_op, "bp_count_opt_perm")
evaluate2(df1, df2)

Plotting averages of box plots as a box plot

I have a set of lists (about 100) of the form [6, 17, 5, 1, 4, 7, 14, 19, 0, 10] and I want to get one box plot which plots the averages of box-plot information (i.e. median, max, min, Q1, Q3, outliers) of all of the lists.
For example, if I have 2 lists
l1 = [6, 17, 5, 1, 4, 7, 14, 19, 0, 10]
l2 = [4, 12, 3, 5, 16, 0, 14, 7, 8, 15]
I can get averages of max, median, and min of the lists as follows:
import numpy as np
maxs = np.array([])
mins = np.array([])
medians = np.array([])
for l in [l1, l2]:
    medians = np.append(medians, np.median(l))
    maxs = np.append(maxs, np.max(l))
    mins = np.append(mins, np.min(l))
averMax = np.mean(maxs)
averMin = np.mean(mins)
averMedian = np.mean(medians)
I should do the same for other info in the box plot such as average Q1, average Q3. I then need to use this information (averMax, averMin, etc.) to plot just one single box plot (not multiple box plots in one graph).
I know from Draw Box-Plot with matplotlib that you don't have to calculate the values for a normal box plot. You just need to specify the data as a variable.
Is it possible to do the same for my case instead of manually calculating the averages of the values of all the lists?
DataFrame.describe() will get the quartiles, so you can make a graph based on them. I customized the calculated numbers with the help of this answer and the example graph from the official reference.
import pandas as pd
import numpy as np
import io
l1 = [6, 17, 5, 1, 4, 7, 14, 19, 0, 10]
l2 = [4, 12, 3, 5, 16, 0, 14, 7, 8, 15]
df = pd.DataFrame({'l1':l1, 'l2':l2}, index=np.arange(len(l1)))
df.describe()
#               l1         l2
# count  10.000000  10.000000
# mean    8.300000   8.400000
# std     6.532823   5.561774
# min     0.000000   0.000000
# 25%     4.250000   4.250000
# 50%     6.500000   7.500000
# 75%    13.000000  13.500000
# max    19.000000  16.000000
import matplotlib.pyplot as plt
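# Use the describe() rows (assumed intent): position 4 is 25%, 5 is 50%, 6 is 75%
s1, s2 = df['l1'].describe(), df['l2'].describe()
# lower fence, Q1, median, median + 1.5 * IQR
x1 = [s1.iloc[4]-1.5*(s1.iloc[6]-s1.iloc[4]), s1.iloc[4], s1.iloc[5], s1.iloc[5]+1.5*(s1.iloc[6]-s1.iloc[4])]
x2 = [s2.iloc[4]-1.5*(s2.iloc[6]-s2.iloc[4]), s2.iloc[4], s2.iloc[5], s2.iloc[5]+1.5*(s2.iloc[6]-s2.iloc[4])]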
fig = plt.figure(figsize=(8,6))
plt.boxplot([x for x in [x1, x2]], 0, 'rs', 1)
plt.xticks([y+1 for y in range(len([x1, x2]))], ['x1', 'x2'])
plt.xlabel('measurement x')
t = plt.title('Box plot')
plt.show()
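Another possible route, sketched here under assumptions rather than taken from the answer above, is to let matplotlib compute the box-plot statistics for each list with cbook.boxplot_stats, average them key by key, and draw one box from the averaged values with Axes.bxp. Outliers are dropped in this sketch because a list of flier points has no obvious average, and the averaged quartiles are not the same as the quartiles of the pooled data.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cbook

l1 = [6, 17, 5, 1, 4, 7, 14, 19, 0, 10]
l2 = [4, 12, 3, 5, 16, 0, 14, 7, 8, 15]

# One stats dict per list, with keys such as med, q1, q3, whislo, whishi, fliers.
per_list = [cbook.boxplot_stats(values)[0] for values in (l1, l2)]

# Average each statistic across the lists.
keys = ("med", "q1", "q3", "whislo", "whishi")
avg = {key: float(np.mean([stats[key] for stats in per_list])) for key in keys}
avg["fliers"] = np.array([])  # outliers are not averaged in this sketch

fig, ax = plt.subplots()
ax.bxp([avg], showfliers=False)  # a single box drawn from the averaged statistics
ax.set_title("Averaged box plot")
print(avg)
```

With about 100 lists, only the tuple (l1, l2) would need to be replaced by the full collection; call plt.show() or fig.savefig(...) to view the result.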

Bar graph df.plot() vs ax.bar() structure matplotlib

I am trying to graph a table as a bar graph.
I get my desired outcome using df.plot(kind='bar') structure. But for certain reasons, I now need to graph it using the ax.bar() structure.
Please refer to the example screenshot. I would like to graph the x axis as categorical labels like the df.plot(kind='bar') structure rather than continuous scale, but need to learn to use ax.bar() structure to do the same.
Make the index categorical by setting the type to 'str'
import pandas as pd
import matplotlib.pyplot as plt
data = {'SA': [11, 12, 13, 16, 17, 159, 209, 216],
'ET': [36, 45, 11, 15, 16, 4, 11, 10],
'UT': [11, 26, 10, 11, 16, 7, 2, 2],
'CT': [5, 0.3, 9, 5, 0.2, 0.2, 3, 4]}
df = pd.DataFrame(data)
df['SA'] = df['SA'].astype('str')
df.set_index('SA', inplace=True)
width = 3
fig, ax = plt.subplots(figsize=(12, 8))
p1 = ax.bar(df.index, df.ET, color='b', label='ET')
p2 = ax.bar(df.index, df.UT, bottom=df.ET, color='g', label='UT')
p3 = ax.bar(df.index, df.CT, bottom=df.ET+df.UT, color='r', label='CT')
plt.legend()
plt.show()
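An alternative to converting the index to strings, shown here only as a sketch, is to place the bars at evenly spaced integer positions and then label the ticks with the SA values. The shortened data set is an assumption made to keep the example small.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Shortened, illustrative subset of the data above.
df = pd.DataFrame({"SA": [11, 12, 159, 216],
                   "ET": [36, 45, 4, 10],
                   "UT": [11, 26, 7, 2]})

pos = np.arange(len(df))  # evenly spaced bar positions, one per row
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(pos, df["ET"], color="b", label="ET")
ax.bar(pos, df["UT"], bottom=df["ET"], color="g", label="UT")  # stacked on ET
ax.set_xticks(pos)
ax.set_xticklabels(df["SA"].astype(str))  # categorical-looking labels
ax.legend()
```

Because the positions are 0, 1, 2, ... rather than the SA values, the bars stay equally spaced even though SA jumps from 12 to 159.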

How to keep the index when using pd.melt and merge to create a DataFrame for Seaborn and matplotlib

I am trying to draw subplots using two DataFrames (predicted and observed) with exactly the same structure; the first column is the index.
The code below creates a new index when they are concatenated using pd.melt and merge.
As you can see in the figure, the index of the orange line is changed from 1-5 to 6-10.
I was wondering if someone could fix the code below to keep the same index for the orange line:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy-dataframe to input under seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2,'category'] = 'actual'
merged.loc[len(actual)*2:,'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();
The orange line ('variable' == 'b') doesn't have an index of 0-5 because of how you used melt. If you look at pd.melt(actual), the index doesn't match what you are expecting, IIUC.
Here is how I would rearrange the dataframe:
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])
Set the ignore_index parameter to False to preserve the index (available in pandas 1.1 and later), e.g.
df = df.melt(var_name='species', value_name='height', ignore_index=False)
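To make the effect of ignore_index concrete, the sketch below melts a shortened, illustrative version of the actual frame both ways and prints the resulting index. It assumes pandas 1.1 or later.

```python
import pandas as pd

# Shortened, illustrative version of the "actual" frame from the question.
actual = pd.DataFrame({"a": [5, 8, 9], "b": [89, 22, 44]})

default = pd.melt(actual)                    # index is renumbered 0..5
kept = pd.melt(actual, ignore_index=False)   # original row labels are repeated

print(default.index.tolist())
print(kept.index.tolist())
```

With the original labels kept, calling reset_index() on the melted frame gives an "index" column that runs over the same values for both variables, which is what the plot in the question needs.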

python - Plotting bar graph side by side on the same graph with seaborn

I need to try to plot 3 bars on the same graph. I have 2 dataframes set up right now. My first dataframe was created off a JSON file seen here.
My second dataframe was created in the code below:
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
def make_bar_graph():
    with open('filelocation.json') as json_file:
        data = json.load(json_file)
    # DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
    frames = [pd.DataFrame.from_dict(kpi for kpi in item["KPI"])
              for item in data["Results"]["Result"]]
    df = pd.concat(frames)
    df.reset_index(level=0, inplace=True)
    df.rename(columns={0: 'id', 1: 'average', 2: 'std. dev', 3: 'min',
                       4: 'median', 5: 'max'}, inplace=True)
    wanted_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    wanted_y = [5, 5, .500, .500, .500, 1, 1, 5, 5, .500, .500, .500, 1, 1]
    kpi = ['kpi1', 'kpi2', 'kpi3', 'kpi4', 'kpi5', 'kpi6', 'kpi7', 'kpi8', 'kpi9', 'kpi10', 'kpi11', 'kpi12',
           'kpi13', 'kpi14']
    df2 = pd.DataFrame(dict(x=wanted_x, y=wanted_y, kpi=kpi))
    sns.set()
    sns.set_context("talk")
    sns.axes_style("darkgrid")
    # .ix was removed in pandas 1.0; .loc keeps the label-based, inclusive slicing
    h = sns.barplot(x='id', y='average', data=df.loc[0:13], label='Test on 4/30/2018', color='b')
    g = sns.barplot(x='id', y='average', data=df.loc[14:27], label='Test on 6/4/2018', color='r')
    k = sns.barplot(x="x", y="y", data=df2, label='Desired Results', color='y')
    plt.legend()
    plt.xlabel('KPI number')
    plt.ylabel('Time(s)')
    plt.show()
This is the graph I get from that: [image: Graph1]
I need the bars to be next to each other, separated by id (or KPI, id number and KPI number are the same things). I'm not sure how to rework my dataframe to do this
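One way to get the bars next to each other, offered as a hedged sketch rather than a tested fix for the code above, is to put all three series into a single long-form frame with a column naming the series, pivot it so that each series becomes a column, and let pandas draw the grouped bars. The numbers and the column name series are hypothetical; the real values would come from the JSON file and df2.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in values for three KPIs and three series.
long_df = pd.DataFrame({
    "id": [1, 2, 3] * 3,
    "series": (["Test on 4/30/2018"] * 3
               + ["Test on 6/4/2018"] * 3
               + ["Desired Results"] * 3),
    "average": [4.1, 5.2, 0.4, 3.9, 4.8, 0.6, 5.0, 5.0, 0.5],
})

# One row per KPI id, one column per series.
wide = long_df.pivot(index="id", columns="series", values="average")

ax = wide.plot.bar(rot=0)  # pandas groups the columns side by side per id
ax.set_xlabel("KPI number")
ax.set_ylabel("Time(s)")
```

The same long-form frame can also be passed to seaborn with sns.barplot(x="id", y="average", hue="series", data=long_df), where the hue argument groups the bars by series.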
