pandas boxplot: swap box placement for comparison - python

tmpdf.boxplot(['original','new'], by = 'by column', ax = ax, sym = '')
gets me a plot like this
I want to compare "original" with "new". How can I arrange the plot so that the two "0" boxes share one panel and the two "1" boxes share another, with the labelling swapped accordingly?
Thanks

Here is a sample dataset to demonstrate.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# simulate some artificial data
# ==========================================
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10,2), columns=['original', 'new'] )
df['by column'] = pd.Series([0,0,0,0,1,1,1,1,1,1])
# your original plot
ax = df.boxplot(['original', 'new'], by='by column', figsize=(12,6))
To get the desired output, call groupby explicitly instead of passing by= to boxplot. Each subgroup then gets its own panel, with one box per column.
ax = df[['original', 'new']].groupby(df['by column']).boxplot(figsize=(12,6))
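For more control over the individual panels, the same grouping can also be done by hand. The following is only a minimal sketch, not the answer's own method: it assumes one subplot per group, reuses the sample data from above, and the panel titles are an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# same artificial data as in the answer above
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 2), columns=['original', 'new'])
df['by column'] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

groups = df.groupby('by column')
fig, axes = plt.subplots(1, groups.ngroups, figsize=(12, 6), sharey=True)

# one panel per group, each holding the 'original' and 'new' boxes side by side
for ax, (key, sub) in zip(axes, groups):
    sub[['original', 'new']].boxplot(ax=ax)
    ax.set_title(f"by column = {key}")
```

Call plt.show() (with an interactive backend) or fig.savefig(...) to see the result.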

Related

Create a box plot from two series

I have two pandas series of numbers (not necessarily in the same size).
Can I create one side by side box plot for both of the series?
I only found a way to create a boxplot from a single series, not from two.
For the test I generated 2 Series, of different size:
np.random.seed(0)
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(14))
The first processing step is to concatenate them into a single DataFrame
and set some meaningful column names (will be included in the picture):
df = pd.concat([s1, s2], axis=1)
df.columns = ['A', 'B']
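Because the two series differ in length, concatenating them column-wise aligns them on the index and pads the shorter one with NaN. A small check, using the same test series as above, makes this visible:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(14))

df = pd.concat([s1, s2], axis=1)
df.columns = ['A', 'B']

print(df.shape)             # (14, 2): as many rows as the longer series
print(df['A'].isna().sum())  # 4: the shorter series is padded with NaN
```

As far as I know, pandas drops these missing values column by column when drawing the boxes, so the padding should not distort the plot.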
And to create the picture, along with a title, you can run:
ax = df.boxplot()
ax.get_figure().suptitle(t='My Boxplot', fontsize=16);
For my source data, the result is:
We can try with an example dataset, two series, unequal length, and defined colors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(100)
S1 = pd.Series(np.random.normal(0,1,10))
S2 = pd.Series(np.random.normal(0,1,14))
colors = ['#aacfcf', '#d291bc']
One option is to build a DataFrame that holds the values of both series in one column and a label identifying the series in another:
fig, ax = plt.subplots(1, 1,figsize=(6,4))
import seaborn as sns
sns.boxplot(x='series',y='values',
data=pd.DataFrame({'values':pd.concat([S1,S2],axis=0),
'series':np.repeat(["S1","S2"],[len(S1),len(S2)])}),
ax = ax,palette=colors,width=0.5
)
The other option is to use matplotlib directly, as the other solutions have suggested. However, there is no need to concatenate the series column-wise, which pads the shorter one with NaN. You can pass the series straight to matplotlib's boxplot, which accepts a list of arrays of different lengths. The downside is that it takes a bit more effort to adjust the colors and so on, as I show below:
fig, ax = plt.subplots(1, 1,figsize=(6,4))
bplot = ax.boxplot([S1,S2],patch_artist=True,widths=0.5,
medianprops=dict(color="black"),labels =['S1','S2'])
plt.setp(bplot['boxes'], color='black')
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
Try this:
import numpy as np
import pandas as pd
ser1 = pd.Series(np.random.randn(10))
ser2 = pd.Series(np.random.randn(10))
## solution
pd.concat([ser1, ser2], axis=1).plot.box()

Bar plot and coloured categorical variable

I have a dataframe with 3 variables:
data= [["2019/oct",10,"Approved"],["2019/oct",20,"Approved"],["2019/oct",30,"Approved"],["2019/oct",40,"Approved"],["2019/nov",20,"Under evaluation"],["2019/dec",30,"Aproved"]]
df = pd.DataFrame(data, columns=['Period', 'Observations', 'Result'])
I want a barplot grouped by the Period column, showing all the values contained in the Observations column and colored by the Result column.
How can I do this?
I tried sns.barplot, but it merged the values in the Observations column into a single bar per group (the mean of the values).
sns.barplot(x='Period',y='Observations',hue='Result',data=df,ci=None)
Plot output
Assuming that you want one bar for each row, you can do as follows:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
result_cat = df["Result"].astype("category")
result_codes = result_cat.cat.codes.values
cmap = plt.cm.Dark2(range(df["Result"].unique().shape[0]))
patches = []
for code in result_cat.cat.codes.unique():
    cat = result_cat.cat.categories[code]
    patches.append(mpatches.Patch(color=cmap[code], label=cat))
df.plot.bar(x='Period',
y='Observations',
color=cmap[result_codes],
legend=False)
plt.ylabel("Observations")
plt.legend(handles=patches)
If you would like the bars grouped by month and then stacked, use the following. Note that I changed your data so that one month has more than one status; I'm not sure I understood your question completely:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
data= [["2019/oct",10,"Approved"],["2019/oct",20,"Approved"],["2019/oct",30,"Approved"],["2019/oct",40,"Under evaluation"],["2019/nov",20,"Under evaluation"],["2019/dec",30,"Aproved"]]
df = pd.DataFrame(data, columns=['Period', 'Observations', 'Result'])
df.groupby(['Period', 'Result'])['Observations'].sum().unstack('Result').plot(kind='bar', stacked=True)
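To see what the stacked bars are built from, it can help to look at the intermediate table before plotting. A minimal sketch using the same data (the variable name table is just for illustration):

```python
import pandas as pd

data = [["2019/oct", 10, "Approved"], ["2019/oct", 20, "Approved"],
        ["2019/oct", 30, "Approved"], ["2019/oct", 40, "Under evaluation"],
        ["2019/nov", 20, "Under evaluation"], ["2019/dec", 30, "Aproved"]]
df = pd.DataFrame(data, columns=['Period', 'Observations', 'Result'])

# rows: one per Period; columns: one per Result; cells: summed Observations
table = df.groupby(['Period', 'Result'])['Observations'].sum().unstack('Result')
print(table)
```

Each row becomes one bar and each column one coloured segment. Note that the spelling "Aproved" in the sample data ends up as its own column, separate from "Approved"; if that spelling is unintentional, normalising it first would merge the two.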

Time Frequency Color Map

I'd like to use a color map to show how often each point occurs (for example, (1,2) occurs 3 times) while keeping my x-axis (i.e. df['A']).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A': [1,1,1,1,2,2,3,4,6,7,7],
'B': [2,2,2,3,3,4,5,6,7,8,8]})
plt.figure()
plt.scatter(df['A'], df['B'])
plt.show()
Here is my current plot
I'd like to keep the same axis I have, while adding the colormap. Hope I was being clear.
You can count how often each value occurs with Counter from the collections module.
freq_dic = collections.Counter(df["B"])
You then need to add this new list to your dataframe and add two new options to the scatter plot. The colormap legend is displayed with plt.colorbar. This code is far from perfect, so any further improvements are very welcome.
import pandas as pd
import matplotlib.pyplot as plt
import collections
df = pd.DataFrame({'A': [1,1,1,1,2,2,3,4,6,7,7],
'B': [2,2,2,3,3,4,5,6,7,8,8]})
freq_dic = collections.Counter(df["B"])
for index, entry in enumerate(df["B"]):
    df.at[index, 'freq'] = (freq_dic[entry])
plt.figure()
plt.scatter(df['A'], df['B'],
c=df['freq'],
cmap='viridis')
plt.colorbar()
plt.show()
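Note that the code above counts how often each value of B occurs, regardless of A. If the frequency of each (A, B) point is what is wanted, as the (1,2) example in the question suggests, one possible variant is to count pairs with groupby. This is only a sketch under that assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 4, 6, 7, 7],
                   'B': [2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8]})

# size of the (A, B) group each row belongs to, i.e. how often that point occurs
df['freq'] = df.groupby(['A', 'B'])['A'].transform('size')

fig, ax = plt.subplots()
points = ax.scatter(df['A'], df['B'], c=df['freq'], cmap='viridis')
fig.colorbar(points, ax=ax, label='frequency')
```

With this data, (1,2) gets a frequency of 3 and (7,8) a frequency of 2, while every other point gets 1.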

Plotting data with categorical x and y axes in python

I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dotplot/scatterplot in which both the x and y axes are categorical and presence/absence is coded by different shapes. Something like the following:
Patient|   x      x      -
Control|   -      x      -
       _____________________
         GeneA  GeneB  GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    df_gene = df[[gene, 'level']]
    cList = ['blue' if x == True else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
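The question asked for presence and absence to be coded by shape rather than by colour. A minimal sketch of that idea, assuming the markers 'x' (present) and '_' (absent) are acceptable stand-ins for the symbols in the question's drawing:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'Patient': [True, True, False],
                   'Control': [False, True, False]}).transpose()
df.columns = ['GeneA', 'GeneB', 'GeneC']

present = df.to_numpy(dtype=bool)
rows, cols = np.nonzero(present)           # positions where the gene is present
rows_abs, cols_abs = np.nonzero(~present)  # positions where it is absent

fig, ax = plt.subplots(figsize=(6, 3))
ax.scatter(cols, rows, marker='x', color='black', s=80, label='Present')
ax.scatter(cols_abs, rows_abs, marker='_', color='black', s=80, label='Absent')

# categorical tick labels on both axes
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(len(df.index)))
ax.set_yticklabels(df.index)
ax.set_ylim(-0.5, len(df.index) - 0.5)
ax.invert_yaxis()  # put the first row (Patient) at the top, as in the drawing
ax.legend()
```

Call plt.show() (with an interactive backend) or fig.savefig(...) to see the result.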
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
See https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work, with something like df.astype(int).

Plot Multiple DataFrames into one single plot

I have two dataFrames that I would like to plot into a single graph. Here's a basic code:
#!/usr/bin/python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
scenarios = ['scen-1', 'scen-2']
for index, item in enumerate(scenarios):
    df = pd.DataFrame({'A' : np.random.randn(4)})
    print(df)
df.plot()
plt.ylabel('y-label')
plt.xlabel('x-label')
plt.title('Title')
plt.show()
However, this only plots the last dataFrame. If I use pd.concat() it plots one line with the combined values.
How can I plot two lines, one for the first dataFrame and one for the second one?
You need to call plot inside the for loop.
To draw the lines on a single plot, pass the same axes to every call through plot's ax keyword. Here I have created a fresh axes with subplots, but it could just as well be an axes that already contains something:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
scenarios = ['scen-1', 'scen-2']
fig, ax = plt.subplots()
for index, item in enumerate(scenarios):
    df = pd.DataFrame({'A' : np.random.randn(4)})
    print(df)
    df.plot(ax=ax)
plt.ylabel('y-label')
plt.xlabel('x-label')
plt.title('Title')
plt.show()
The plot function is only called once, and as you say this is with the last value of df. Put df.plot() inside the loop.
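Since both DataFrames use the column name 'A', the two lines would get the same legend entry. One way to tell them apart, sketched here on the assumption that the scenario name is a suitable label, is to plot the column as a Series with an explicit label:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

scenarios = ['scen-1', 'scen-2']
fig, ax = plt.subplots()

for item in scenarios:
    df = pd.DataFrame({'A': np.random.randn(4)})
    # plot onto the shared axes and label the line with the scenario name
    df['A'].plot(ax=ax, label=item)

ax.set_xlabel('x-label')
ax.set_ylabel('y-label')
ax.set_title('Title')
ax.legend()
```

Call plt.show() (with an interactive backend) or fig.savefig(...) to see the result.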
