Bar plots of data frame columns split by criterion of another column

I would like to create bar plots of specified data frame columns, split by a criterion on another column (here score < 5). The code below works, but surely there is a more DataFrame-like way, something like df.makeCoolBarPlots()?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
df = pd.DataFrame({'score': [1, 6, 2, 3, 1, 9, 5], 'age': [10, 16, 33, 12, 8, 24, 18], 'IQ': [89, 120, 88, 94, 103, 110, 102]})
df_pass = df[df['score'] >= 5]
df_fail = df[df['score'] < 5]
fieldsOfInterest = ['age', 'IQ']
ind = np.arange(2)
for fieldOfInterest in fieldsOfInterest:
    plt.figure()
    plt.bar(ind, [df_pass[fieldOfInterest].mean(), df_fail[fieldOfInterest].mean()], yerr=[df_pass[fieldOfInterest].std(), df_fail[fieldOfInterest].std()])
    stat, p = stats.ttest_ind(df_pass[fieldOfInterest], df_fail[fieldOfInterest])
    plt.title("p={:0.3f}".format(p))
    plt.xticks(ind, ('pass', 'fail'))
    plt.ylabel(fieldOfInterest)
    plt.show()

You can use pandas' built-in plot function together with groupby:
# First label each row using np.select; a string default keeps the result dtype consistent
df["group"] = np.select([df["score"].ge(5), df["score"].lt(5)], ["pass", "fail"], default="unknown")
# Create a groupby
gb = df.groupby('group')
for col in ["age", "IQ"]:
    # Get p value, mean, and std for each column
    _, p = stats.ttest_ind(*[g[col] for n, g in gb])
    means = gb[col].mean()
    errors = gb[col].std()
    # Plot using pandas.plot
    fig, ax = plt.subplots()
    means.plot.bar(yerr=errors, ax=ax)
    ax.set(ylabel=col, title="p={:.3f}".format(p))
Results in:
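As a variation on the groupby approach, the mean and standard deviation of all columns of interest can be computed in a single agg call and drawn as subplots of one figure. This is a minimal sketch reusing the question's data; the names group and summary are made up for illustration, and the t-test is left out to keep it short.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'score': [1, 6, 2, 3, 1, 9, 5],
                   'age': [10, 16, 33, 12, 8, 24, 18],
                   'IQ': [89, 120, 88, 94, 103, 110, 102]})

# Label each row by the criterion on 'score' (same threshold as in the question)
group = np.where(df['score'] >= 5, 'pass', 'fail')

# One call returns mean and std for every column of interest
summary = df.groupby(group)[['age', 'IQ']].agg(['mean', 'std'])

# One subplot per column, with the std as error bars
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, ['age', 'IQ']):
    summary[col]['mean'].plot.bar(yerr=summary[col]['std'], ax=ax, rot=0)
    ax.set_ylabel(col)
fig.tight_layout()
```

Call plt.show() afterwards to display the figure when running this as a script.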

Related

How to normalize coloring of data with seaborn in pandas?

My data is colored as in picture 1, because one value is 0 and the rest are much bigger (values are between 0 and 100). I would like to get the coloring shown in picture 2. How can I solve this problem?
This is a minimal reproducible example.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors
index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Group1', 'Group2', 'Group3'], ['value1', 'value2']],
                                     names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 20
data += 50
rdata = pd.DataFrame(data, index=index, columns=columns)
cc = sns.light_palette("red", as_cmap=True)
cc.set_bad('white')
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(s.replace(np.inf, np.nan))]
styler = rdata.style
red = styler.apply(
    my_gradient,
    cmap=cc,
    subset=rdata.columns.get_loc_level('value1', level=1)[0],
    axis=0)
styler
Picture 1
Picture 2

You need to normalize. Usually, in matplotlib, a norm is used, of which plt.Normalize() is the most standard one.
The updated code could look like:
my_norm = plt.Normalize(0, 100)
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(my_norm(s.replace(np.inf, np.nan)))]
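To make the effect of the norm concrete, here is a small sketch of what Normalize(0, 100) does to raw values before they reach the colormap. The sample values and the "Reds" colormap are arbitrary choices for illustration, not taken from the question.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors

# Map the data range [0, 100] linearly onto [0, 1]
norm = colors.Normalize(vmin=0, vmax=100)
values = np.array([0, 25, 50, 100])
scaled = norm(values)

# A colormap expects floats in [0, 1]; convert each RGBA result to a hex string
cmap = plt.get_cmap("Reds")
hex_colors = [colors.rgb2hex(rgba) for rgba in cmap(scaled)]
print(scaled, hex_colors)
```

Without the norm, floats above 1 are clipped to the top of the colormap, which is why unnormalized values between 0 and 100 end up almost uniformly colored.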

You can normalize your data with the formula (x - min) / (max - min). To apply this to your dataframe you could use something like the following:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for i, row in df.iterrows():
    hold = {}
    for h in df:
        hold[h] = (row[h] - df[h].min()) / (df[h].max() - df[h].min())
    rows.append(hold)
result = pd.DataFrame(rows)
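The same min-max formula can also be written without an explicit loop, since pandas aligns the column-wise minimum and maximum with the frame automatically. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'a': [0, 50, 100], 'b': [10, 20, 40]})

# (x - min) / (max - min), applied per column
result = (df - df.min()) / (df.max() - df.min())
print(result)
```

A column whose values are all identical would divide by zero here and produce NaN, so such columns need separate handling.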

Scatter and curve plot using matplotlib

I am trying to plot how the accuracy of NN models evolves over time. I have an Excel file with data like the following:
and I wrote the following code to extract the data and draw the scatter plot:
import pandas as pd
data = pd.read_excel(r'SOTA DNN.xlsx')
acc1 = pd.DataFrame(data, columns=['Top-1-Acc'])
para = pd.DataFrame(data, columns=['Parameters'])
dates = pd.to_datetime(data['Date'], format='%Y-%m-%d')
import matplotlib.pyplot as plt
plt.grid(True)
plt.ylim(40, 100)
plt.scatter(dates, acc1)
plt.show()
Is there a way to draw a line in the same figure that connects only the models achieving a new maximum, and to print their names at the same time, as in this example:
Is it also possible to stretch the x-axis (for the dates)?

It is still not clear what you mean by "stretch the x-axis" and you did not provide your dataset, but here is a possible general approach:
import string
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#fake data generation, this has to be substituted by your .xls import routine
np.random.seed(1234)
n = 30
acc = np.concatenate([np.random.randint(0, 10, 10), np.random.randint(0, 30, 10), np.random.randint(0, 100, n-20)])
date_range = pd.date_range("20190101", periods=n)
#random five-letter model names; the private helper pandas._testing.rands_array may be missing in recent pandas versions
model = ["".join(np.random.choice(list(string.ascii_letters), 5)) for _ in range(n)]
df = pd.DataFrame({"Model": model, "Date": date_range, "TopAcc": acc})
df = df.sample(frac=1).reset_index(drop=True)
#now to the actual data modification
#first, we extract the dataframe with monotonically increasing values after sorting the date column
df = df.sort_values("Date").reset_index()
df["Max"] = df.TopAcc.cummax().diff()
df.loc[0, "Max"] = 1
dfmax = df[df.Max > 0]
#then, we plot all data, followed by the best performers
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df.Date, df.TopAcc, c="grey")
ax.plot(dfmax.Date, dfmax.TopAcc, marker="x", c="blue")
#finally, we annotate the best performers
for _, xylabel in dfmax.iterrows():
    ax.text(xylabel.Date, xylabel.TopAcc, xylabel.Model, c="blue", horizontalalignment="right", verticalalignment="bottom")
plt.show()
Sample output:
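The key step in the answer is the cummax().diff() combination, which flags the rows where the running maximum increases. A tiny sketch on made-up accuracy values shows the intermediate results; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical accuracies, already sorted by date
acc = pd.Series([3, 1, 5, 5, 2, 8])

running_max = acc.cummax()           # 3, 3, 5, 5, 5, 8
is_record = running_max.diff() > 0   # True where the running maximum increases
is_record.iloc[0] = True             # the first point always counts as a record

records = acc[is_record]
print(records)
```

Note that a tie with the current maximum (the second 5 here) is not flagged, because the difference of the running maximum is zero at that row.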

Create a box plot from two series

I have two pandas Series of numbers (not necessarily of the same size).
Can I create one side-by-side box plot for both series?
I didn't find a way to create a boxplot from a series, nor from two series.

For the test I generated two Series of different sizes:
import numpy as np
import pandas as pd
np.random.seed(0)
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(14))
The first processing step is to concatenate them into a single DataFrame
and set some meaningful column names (will be included in the picture):
df = pd.concat([s1, s2], axis=1)
df.columns = ['A', 'B']
And to create the picture, along with a title, you can run:
ax = df.boxplot()
ax.get_figure().suptitle(t='My Boxplot', fontsize=16);
For my source data, the result is:

We can try with an example dataset, two series, unequal length, and defined colors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(100)
S1 = pd.Series(np.random.normal(0,1,10))
S2 = pd.Series(np.random.normal(0,1,14))
colors = ['#aacfcf', '#d291bc']
One option is to build a DataFrame that holds the values of both series in one column and a label for the series in another:
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
import seaborn as sns
sns.boxplot(x='series', y='values',
            data=pd.DataFrame({'values': pd.concat([S1, S2], axis=0, ignore_index=True),
                               'series': np.repeat(["S1", "S2"], [len(S1), len(S2)])}),
            ax=ax, palette=colors, width=0.5)
The other option is to use matplotlib directly, as the other answers suggest. However, there is no need to concatenate the series column-wise, which pads the shorter one with NaNs. plt.boxplot accepts a list of arrays directly. The downside is that adjusting the colors etc. takes a bit more effort, as shown below:
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
bplot = ax.boxplot([S1, S2], patch_artist=True, widths=0.5,
                   medianprops=dict(color="black"), labels=['S1', 'S2'])
plt.setp(bplot['boxes'], color='black')
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)

Try this:
import numpy as np
import pandas as pd
ser1 = pd.Series(np.random.randn(10))
ser2 = pd.Series(np.random.randn(10))
## solution
pd.concat([ser1, ser2], axis=1).plot.box()
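When the two series differ in length, the column-wise concat used in these answers aligns them on the index and pads the shorter one with NaN. The sketch below, with made-up series and column names A and B, shows the padding; as far as I know the pandas box plot drops missing values per column, so the padding should not distort the boxes.

```python
import numpy as np
import pandas as pd

short = pd.Series(np.arange(3, dtype=float))
longer = pd.Series(np.arange(5, dtype=float))

# keys= names the resulting columns in one step
df = pd.concat([short, longer], axis=1, keys=['A', 'B'])
print(df)

# Number of padded (missing) entries per column
print(df.isna().sum())
```

From here, df.plot.box() or df.boxplot() draws the two boxes side by side.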

Seaborn Plot including different distributions of the same data

I wish to create a seaborn pointplot to display the full data distribution in a column, alongside the distribution of the lowest 25% of values, and the distribution of the highest 25% of values, and all side by side (on the x axis).
My attempt so far gives me the values, but they are all displayed at the same x position instead of being spread out from left to right, and there is no obvious way to label the points via x-ticks (which I would prefer to a legend).
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
df1 = df[(df.total_bill < df.total_bill.quantile(.25))]
df2 = df[(df.total_bill > df.total_bill.quantile(.75))]
sns.pointplot(y=df['total_bill'], data=df, color='red')
sns.pointplot(y=df1['total_bill'], data=df1, color='green')
sns.pointplot(y=df2['total_bill'], data=df2, color='blue')

You could .join() the new distributions to your existing df and then .plot() using wide format:
lower, upper = df.total_bill.quantile([.25, .75]).values.tolist()
df = df.join(df.loc[df.total_bill < lower, 'total_bill'], rsuffix='_lower')
df = df.join(df.loc[df.total_bill > upper, 'total_bill'], rsuffix='_upper')
sns.pointplot(data=df.loc[:, [c for c in df.columns if c.startswith('total')]])
to get:
If you wanted to add groups, you could simply use .unstack() to get to long format:
df = df.loc[:, ['total_bill', 'total_bill_upper', 'total_bill_lower']].unstack().reset_index().drop('level_1', axis=1).dropna()
df.columns = ['grp', 'val']
to get:
sns.pointplot(x='grp', y='val', hue='grp', data=df)
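The wide-to-long step could also be written with DataFrame.melt, which names the group and value columns directly. This is a sketch on a small made-up frame that mimics the joined columns from the answer; it is an alternative, not the answer author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the joined wide frame (NaN where a row is not in the subset)
wide = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'total_bill_lower': [10.0, np.nan, np.nan, np.nan],
    'total_bill_upper': [np.nan, np.nan, np.nan, 40.0],
})

# One row per (group, value) pair; drop the padding rows
long = wide.melt(var_name='grp', value_name='val').dropna()
print(long)
```

The result can be passed to sns.pointplot(x='grp', y='val', data=long) in the same way as the unstacked frame.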

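I would think along the lines of adding a "group" column and then plotting everything as a single DataFrame.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
# Duplicate the data: the first copy holds the L/H groups, the second copy becomes 'all'
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df])
df.loc[(df.total_bill < df.total_bill.quantile(.25)), 'group'] = 'L'
df.loc[(df.total_bill > df.total_bill.quantile(.75)), 'group'] = 'H'
df = df.reset_index(drop=True)
# Integer division: a float is not a valid slice bound for the integer index
df.loc[len(df)//2:, 'group'] = 'all'
sns.pointplot(data=df,
              y='total_bill',
              x='group',
              hue='group',
              linestyles='')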
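A possible alternative to duplicating the frame is to concatenate the three subsets from a dict, so that the dict keys become the group labels. The sketch below uses a made-up total_bill column instead of the tips dataset so that it runs offline; the names bills, parts and long are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'total_bill' column of the tips dataset
bills = pd.DataFrame({'total_bill': np.arange(1.0, 21.0)})

q25, q75 = bills['total_bill'].quantile([0.25, 0.75])

# Dict keys become the outer index level, which is named 'group'
parts = {
    'all': bills,
    'L': bills[bills['total_bill'] < q25],
    'H': bills[bills['total_bill'] > q75],
}
long = pd.concat(parts, names=['group']).reset_index(level='group')
long = long.reset_index(drop=True)
print(long['group'].value_counts())
```

The long frame can then be plotted with sns.pointplot(data=long, x='group', y='total_bill').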

100% area plot of a pandas DataFrame

In pandas' documentation you can find a discussion on area plots, and in particular stacking them. Is there an easy and straightforward way to get a 100% area stack plot like the one from this post?

The method is basically the same as in the other SO answer; divide each row by the sum of the row:
df = df.divide(df.sum(axis=1), axis=0)
Then you can call df.plot(kind='area', stacked=True, ...) as usual.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(2015)
y = np.random.randint(5, 50, (10,3))
x = np.arange(10)
df = pd.DataFrame(y, index=x)
df = df.divide(df.sum(axis=1), axis=0)
ax = df.plot(kind='area', stacked=True, title='100 % stacked area chart')
ax.set_ylabel('Percent (%)')
ax.margins(0, 0) # Set margins to avoid "whitespace"
plt.show()
yields
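After the row-wise division the values are fractions between 0 and 1, so the y-axis ticks read 0.0 to 1.0 rather than percentages. One way to show percent labels without rescaling the data is matplotlib's PercentFormatter. This is a sketch with randomly generated data; the name share is made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(2015)
df = pd.DataFrame(rng.integers(5, 50, (10, 3)))

# Each row now sums to 1
share = df.divide(df.sum(axis=1), axis=0)

ax = share.plot(kind='area', stacked=True, title='100 % stacked area chart')
# xmax=1.0 tells the formatter that a value of 1.0 corresponds to 100 %
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0))
ax.set_ylabel('Percent (%)')
ax.margins(0, 0)
```

Call plt.show() afterwards to display the chart when running this as a script.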
