Bar chart with ticks based on multiple dataframe columns - python

How can I make a bar chart in matplotlib (or pandas) from the bins in my dataframe?
I want something like the chart below, where the x-axis labels come from the low and high columns in my dataframe (so the first tick would read [-1.089, 0)) and the y value is the percent column in my dataframe.
Here is an example dataset. The dataset is already in this format (I don't have an uncut version).
import pandas as pd
df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)
display(df)

Create a new column from the low and high columns.
Convert the float values in the low and high columns to str and combine them in the [<low>, <high>) notation that you want.
From there, you can create a bar plot directly from df using df.plot.bar(), assigning the newly created column as x and percent as y.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
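A minimal sketch of the steps above, assuming the question's df. The column name label is an arbitrary choice, and the :g format specifier is just one way to drop trailing zeros from the bin edges.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)

# Build one "[low, high)" string per row; "label" is an illustrative column name
df["label"] = [f"[{lo:g}, {hi:g})" for lo, hi in zip(df["low"], df["high"])]
print(df[["label", "percent"]])
```

The label column can then be passed as x, for example df.plot.bar(x="label", y="percent").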

Recreate the bins using IntervalArray.from_arrays:
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high)
# low high percent label
# 0 -1.089 0.000 0.509 (-1.089, 0.0]
# 1 0.000 0.300 0.110 (0.0, 0.3]
# 2 0.300 0.500 0.074 (0.3, 0.5]
# 3 0.500 0.600 0.038 (0.5, 0.6]
# 4 0.600 0.800 0.069 (0.6, 0.8]
# 5 0.800 10.089 0.202 (0.8, 10.089]
Then plot with x as these bins:
df.plot.bar(x='label', y='percent')
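Note that from_arrays defaults to closed='right', which is why the labels above read (low, high]. To get the [low, high) notation from the question, the closed parameter can be set to 'left'. A short sketch, reusing the question's df:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)

# closed="left" makes each interval include its lower edge and exclude its upper edge
df["label"] = pd.arrays.IntervalArray.from_arrays(df["low"], df["high"], closed="left")
print(df["label"].astype(str).tolist())
```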

Related

Multiindex data.frame from two data.frames join by column headers

There are dozens of similar-sounding questions here. I think I've searched them all and could not find a solution to my problem:
I have two dataframes. df_c:
CAN-01 CAN-02 CAN-03
CE
ce1 0.84 0.73 0.50
ce2 0.06 0.13 0.05
And df_z:
CAN-01 CAN-02 CAN-03
marker
cell1 0.29 1.5 7
cell2 1.00 3.0 1
I want to join for each 'marker' + 'CE' combination over their column names
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build is a MultiIndex DataFrame with some merge/join/concat magic:
CAN-01 CAN-02 CAN-03
Source
0 CE 0.84 0.73 0.50
Marker 0.29 1.5 7
1 CE ...
Marker ...
Sample Code
import pandas as pd
dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE', inplace=True)
dz = [['cell1', 0.29, 1.5, 7], ['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_z.set_index('marker', inplace=True)
Bad/Slow Solution
pairs = {}
for marker in dat_z.index:  # for each marker in the marker table
    for ci, c_row in dat_c.iterrows():  # for each CE in the CE table
        tmp = []
        for colz in dat_z.columns:
            if colz not in dat_c:
                continue
            entry_c = c_row.loc[colz]
            if len(entry_c.shape) > 0:  # skip duplicated column names
                continue
            tmp.append([dat_z.loc[marker, colz], entry_c])
        pairs[(marker, ci)] = tmp
IIUC:
use pd.concat() + groupby() (DataFrame.append() was removed in pandas 2.0):
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
df = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list)
output of df:
CAN-01 CAN-02 CAN-03
cell1 [0.84, 0.29] [0.73, 1.5] [0.5, 7.0]
cell2 [0.06, 1.0] [0.13, 3.0] [0.05, 1.0]
If you need a list:
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
lst = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
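If every marker + CE combination is needed (not only row-wise pairs), one option that avoids Python loops is NumPy broadcasting. This is only a sketch, assuming both frames share the same column names without duplicates; the names common, z, c and pairs are illustrative.

```python
import numpy as np
import pandas as pd

dat_c = pd.DataFrame(
    [["ce1", 0.84, 0.73, 0.5], ["ce2", 0.06, 0.13, 0.05]],
    columns=["CE", "CAN-01", "CAN-02", "CAN-03"],
).set_index("CE")
dat_z = pd.DataFrame(
    [["cell1", 0.29, 1.5, 7], ["cell2", 1, 3, 1]],
    columns=["marker", "CAN-01", "CAN-02", "CAN-03"],
).set_index("marker")

# Keep only the columns present in both frames
common = dat_z.columns.intersection(dat_c.columns)
z = dat_z[common].to_numpy(dtype=float)  # shape (n_markers, n_cols)
c = dat_c[common].to_numpy(dtype=float)  # shape (n_ce, n_cols)

# Broadcast both to (n_markers, n_ce, n_cols), then stack each pair on a new last axis
pairs = np.stack(np.broadcast_arrays(z[:, None, :], c[None, :, :]), axis=-1)

# pairs[i, j] holds the [marker, CE] value pairs for marker i and CE j
print(pairs[0, 0].tolist())  # cell1 + ce1
```

The result has n_markers x n_ce x n_cols x 2 entries, so on a large data set it may be worth processing the markers in chunks.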

data visualisation - which kind of chart that comfortable with time and multiple data columns?

I have an Excel file that consists of 4 spreadsheets, each representing a period of time. Each spreadsheet has 3 columns of data: 'subject', 'measure', and 'frequency' (the data describes the students' interest rate for every 10 years).
E.G, sheet 1970-1980
frequency score
math 3.4 1
english 2.5 0.95
art 0.4 0.8
sheet 1981-1990
frequency score
math 4.7 0.5
english 2.3 0.48
art -0.4 0.13
sheet 1991-2000
frequency score
math 4.2 0.6
english 2.1 0.77
art -0.2 0.24
sheet 2000-2010
frequency score
math 4.5 0.55
english 1.9 0.66
art -0.23 0.19
I have created a scatter plot for each period of time, but I would like to see the movement of the data over time. For example, the x-axis would represent the time period and the y-axis the frequency and score.
Are there any suggestions?
First of all, I will reproduce the tables that you have here as pandas DataFrames, for three decades:
import pandas as pd
data_80s = {'math': [3.4, 1], 'english': [2.5, 0.95], 'art': [0.4, 0.8]}
df_80s = pd.DataFrame.from_dict(data_80s, orient='index', columns=['frequency', 'score'])
df_80s['decade'] = pd.to_datetime(1980, format='%Y')
df_80s['index'] = df_80s.index
data_90s = {'math': [4.7, 0.5], 'english': [2.3, 0.48], 'art': [-0.4, 0.13]}
df_90s = pd.DataFrame.from_dict(data_90s, orient='index', columns=['frequency', 'score'])
df_90s['decade'] = pd.to_datetime(1990, format='%Y')
df_90s['index'] = df_90s.index
data_20s = {'math': [4.2, 0.6], 'english': [2.1, 0.77], 'art': [-0.2, 0.24]}
df_20s = pd.DataFrame.from_dict(data_20s, orient='index', columns=['frequency', 'score'])
df_20s['decade'] = pd.to_datetime(2000, format='%Y')
df_20s['index'] = df_20s.index
You will probably just need to load your Excel sheets into pandas DataFrames. Just don't forget to add the extra index and decade columns.
Then you can combine the dataframes (DataFrame.append() was removed in pandas 2.0, so use pd.concat()):
result = pd.concat([df_80s, df_90s, df_20s])
And finally plot whatever you want:
import matplotlib.pyplot as plt
import seaborn as sns
f, (ax1, ax2) = plt.subplots(2, figsize=(15,10))
sns.lineplot(x='decade', y='score', hue='index', data=result, ax=ax1)
sns.lineplot(x='decade', y='frequency', hue='index', data=result, ax=ax2)
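The per-decade setup can also be wrapped in a small helper. This is a sketch rather than a definitive loader: combine_sheets is a hypothetical name, and it assumes each sheet name ends with the closing year of its period (as in '1970-1980'). pd.read_excel(path, sheet_name=None, index_col=0) returns such a dict of DataFrames keyed by sheet name; here an in-memory dict with two of the question's tables stands in for the workbook.

```python
import pandas as pd


def combine_sheets(sheets):
    """Stack per-period frames into one long frame with 'decade' and 'index' columns.

    `sheets` maps a sheet name such as '1970-1980' to a DataFrame indexed by subject.
    """
    frames = []
    for name, frame in sheets.items():
        frame = frame.copy()
        # Assumption: the sheet name ends with the closing year of the period
        frame["decade"] = pd.to_datetime(name.split("-")[-1], format="%Y")
        frame["index"] = frame.index
        frames.append(frame)
    return pd.concat(frames)


subjects = ["math", "english", "art"]
# In-memory stand-in for the workbook, using two tables from the question
sheets = {
    "1970-1980": pd.DataFrame(
        {"frequency": [3.4, 2.5, 0.4], "score": [1, 0.95, 0.8]}, index=subjects
    ),
    "1981-1990": pd.DataFrame(
        {"frequency": [4.7, 2.3, -0.4], "score": [0.5, 0.48, 0.13]}, index=subjects
    ),
}
result = combine_sheets(sheets)
print(result)
```

The resulting frame has the same layout as result above, so the same lineplot calls apply.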

Show correlation values in pairplot using seaborn in python

I have the below data:
prop_tenure prop_12m prop_6m
0.00 0.00 0.00
0.00 0.00 0.00
0.06 0.06 0.10
0.38 0.38 0.25
0.61 0.61 0.66
0.01 0.01 0.02
0.10 0.10 0.12
0.04 0.04 0.04
0.22 0.22 0.22
and I am doing a pairplot as below:
sns.pairplot(data)
plt.show()
However I would like to display the correlation coefficient among the variables and if possible the skewness and kurtosis of each variable.
How do you do that in seaborn?
As far as I'm aware, there is no out-of-the-box function to do this; you'll have to create your own:
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
def corrfunc(x, y, ax=None, **kws):
    """Plot the correlation coefficient in the top left hand corner of a plot."""
    r, _ = pearsonr(x, y)
    ax = ax or plt.gca()
    ax.annotate(f'ρ = {r:.2f}', xy=(.1, .9), xycoords=ax.transAxes)
Example using your input:
import seaborn as sns; sns.set(style='white')
import pandas as pd
data = {'prop_tenure': [0.0, 0.0, 0.06, 0.38, 0.61, 0.01, 0.10, 0.04, 0.22],
'prop_12m': [0.0, 0.0, 0.06, 0.38, 0.61, 0.01, 0.10, 0.04, 0.22],
'prop_6m': [0.0, 0.0, 0.10, 0.25, 0.66, 0.02, 0.12, 0.04, 0.22]}
df = pd.DataFrame(data)
g = sns.pairplot(df)
g.map_lower(corrfunc)
plt.show()
Note that for more recent versions of seaborn (>0.11.0) the answer above no longer works as written; you need to add a hue=None parameter to make it work again:
def corrfunc(x, y, hue=None, ax=None, **kws):
    """Plot the correlation coefficient in the top left hand corner of a plot."""
    r, _ = pearsonr(x, y)
    ax = ax or plt.gca()
    ax.annotate(f'ρ = {r:.2f}', xy=(.1, .9), xycoords=ax.transAxes)
Reference this issue https://github.com/mwaskom/seaborn/issues/2307#issuecomment-702980853
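The answers above cover the correlation coefficient but not the skewness and kurtosis the question also asks about. As a rough sketch, these can be computed with pandas alone and then annotated in the same way (for instance from a function passed to the grid's map_diag). The name shape_stats is illustrative; note that DataFrame.kurt() reports Fisher's (excess) kurtosis, where a normal distribution gives 0.

```python
import pandas as pd

data = {
    "prop_tenure": [0.0, 0.0, 0.06, 0.38, 0.61, 0.01, 0.10, 0.04, 0.22],
    "prop_12m": [0.0, 0.0, 0.06, 0.38, 0.61, 0.01, 0.10, 0.04, 0.22],
    "prop_6m": [0.0, 0.0, 0.10, 0.25, 0.66, 0.02, 0.12, 0.04, 0.22],
}
df = pd.DataFrame(data)

# Pairwise Pearson correlation matrix between the variables
corr = df.corr()

# One row per variable with its sample skewness and excess kurtosis
shape_stats = pd.DataFrame({"skewness": df.skew(), "kurtosis": df.kurt()})

print(corr.round(2))
print(shape_stats.round(2))
```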

get means and SEM in one df with pandas groupby

I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and the standard errors of the mean (SEM) of a data frame, preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
With that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a':'var_a_mean', 'var_b':'var_b_mean'},
inplace=True)
grp_sems.rename(columns={'var_a':'var_a_SEM', 'var_b':'var_b_SEM'},
inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
case condition var_a_mean var_b_mean var_a_SEM var_b_SEM
0 1 1.5 0.900 0.18 0.020 0.03
1 2 1.5 0.845 0.13 0.055 0.03
2 3 1.5 0.895 0.20 0.045 0.03
I also recently learned of the .agg() function, and tried df.groupby('grouper column') agg('var':'mean', 'var':sem') but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, then drop the ('condition', 'sem') column and use map to flatten the MultiIndex columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03

Plot with pandas: group and mean

My data from my 'combos' data frame looks like this:
pr = [1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,.....1.0,2.0,3.0,4.0]
lmi = [200, 200, 200, 250, 250,.....780, 780, 780, 800, 800, 800]
pred = [0.16, 0.18, 0.25, 0.43, 0.54......., 0.20, 0.34, 0.45, 0.66]
I plot the results like this:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for pr in [1.0, 2.0, 3.0, 4.0]:
    ax.plot(combos[combos.pr==pr].lmi, combos[combos.pr==pr].pred, label=pr)
ax.set_xlabel('lmi')
ax.set_ylabel('pred')
ax.legend(loc='best')
And I get this plot:
How can I plot means of "pred" for each "lmi" data point when keeping the pairs (lmi, pr) intact?
You can first group your DataFrame by lmi then compute the mean for each group just as your title suggests:
combos.groupby('lmi').pred.mean().plot()
In one line we:
Group the combos DataFrame by the lmi column
Get the pred column for each lmi
Compute the mean across the pred column for each lmi group
Plot the mean for each lmi group
With your updates to the question, it is now clear that you want to calculate the means for each pair (pr, lmi). This can be done by grouping over these columns and then simply calling mean(). With reset_index(), we then restore the DataFrame to its previous flat form.
combos.groupby(['lmi', 'pr']).mean().reset_index()
lmi pr pred
0 200 1.0 0.16
1 200 2.0 0.18
2 200 3.0 0.25
3 250 1.0 0.54
4 250 4.0 0.43
5 780 2.0 0.20
6 780 3.0 0.34
7 780 4.0 0.45
8 800 1.0 0.66
In this new DataFrame pred does contain the means and you can use the same plotting procedure you have been using before.
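As a variation on the grouping above, the grouped means can be reshaped so that each pr value becomes its own column, which makes one line per pr straightforward to draw. This is a sketch on made-up illustrative values (the question's full data is elided), with two pred values per (lmi, pr) pair.

```python
import pandas as pd

# Illustrative values only, not the asker's data
combos = pd.DataFrame({
    "pr": [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0],
    "lmi": [200, 200, 200, 200, 250, 250, 250, 250],
    "pred": [0.10, 0.20, 0.30, 0.50, 0.40, 0.60, 0.70, 0.90],
})

# Mean of pred per (lmi, pr) pair, then pivot pr into columns
means = combos.groupby(["lmi", "pr"])["pred"].mean().unstack("pr")
print(means)
```

Calling means.plot() on the result would then draw one line per pr against lmi.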
