grouping boxplots matplotlib - python

Is there a way to group boxplots in matplotlib WITHOUT the use of seaborn or some other library?
e.g. in the following, I want to have blocks along the x axis, and plot values grouped by condition (so there will be 16 boxes). Like what seaborn's hue argument accomplishes.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
blocks = 4
conditions = 4
ndatapoints = blocks * conditions
# each block number is repeated once per condition; conditions cycle within a block
raw_data = {'blocks': np.repeat(np.arange(1, blocks+1), conditions),
            'condition': list(range(1, conditions+1)) * blocks,
            'trial': np.arange(1, ndatapoints+1),
            'value': np.random.normal(0, 1, ndatapoints)}
df = pd.DataFrame(raw_data)
df
    blocks  condition  trial     value
0        1          1      1  1.306146
1        1          2      2 -0.024201
2        1          3      3 -0.374561
3        1          4      4 -0.093366
4        2          1      5 -0.548427
5        2          2      6 -1.205077
6        2          3      7  0.617165
7        2          4      8 -0.239830
8        3          1      9 -0.876789
9        3          2     10  0.656436
10       3          3     11 -0.471325
11       3          4     12 -1.465787
12       4          1     13 -0.495308
13       4          2     14 -0.266914
14       4          3     15 -0.305884
15       4          4     16  0.546730
I can't seem to find any examples.

I think you just want a factor plot (seaborn 0.9 renamed factorplot to catplot, so use that name on newer versions):
import numpy
import pandas
import seaborn
blocks = 3
conditions = 4
trials = 12
ndatapoints = blocks * conditions * trials
blockcol = list(range(1, blocks + 1)) * (conditions * trials)
concol = list(range(1, conditions + 1)) * (blocks * trials)
trialcol = list(range(1, trials + 1)) * (blocks * conditions)
valcol = numpy.random.normal(0, 1, ndatapoints)
fg = pandas.DataFrame({
    'blocks': blockcol,
    'condition': concol,
    'trial': trialcol,
    'value': valcol
}).pipe(
    (seaborn.factorplot, 'data'),
    x='blocks', y='value', hue='condition',
    kind='box'
)
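Since the question asked for matplotlib only, here is a minimal sketch of one way to do it without seaborn. It assumes the same column names as above ('blocks', 'condition', 'value') and uses 12 trials per cell so that each box has something to summarise. The offsets, widths, and "C0".."C3" colours are my own choices for illustration, not anything matplotlib requires.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: 4 blocks x 4 conditions x 12 trials (assumed layout)
rng = np.random.default_rng(0)
blocks, conditions, trials = 4, 4, 12
df = pd.DataFrame({
    "blocks": np.repeat(np.arange(1, blocks + 1), conditions * trials),
    "condition": np.tile(np.repeat(np.arange(1, conditions + 1), trials), blocks),
    "value": rng.normal(0, 1, blocks * conditions * trials),
})

block_ids = sorted(df["blocks"].unique())
cond_ids = sorted(df["condition"].unique())
group_width = 0.8                          # total width taken by one block
box_width = group_width / len(cond_ids)    # width available to each box

fig, ax = plt.subplots()
handles = []
for j, cond in enumerate(cond_ids):
    # One array of values per block for this condition
    data = [
        df.loc[(df["blocks"] == b) & (df["condition"] == cond), "value"].to_numpy()
        for b in block_ids
    ]
    # Shift this condition's boxes left or right of each block centre
    offset = (j - (len(cond_ids) - 1) / 2) * box_width
    positions = np.arange(len(block_ids)) + offset
    bp = ax.boxplot(data, positions=positions, widths=box_width * 0.9,
                    patch_artist=True, manage_ticks=False)
    for patch in bp["boxes"]:
        patch.set_facecolor(f"C{j}")
    handles.append(bp["boxes"][0])

# One tick per block, centred under its group of boxes
ax.set_xticks(np.arange(len(block_ids)))
ax.set_xticklabels([str(b) for b in block_ids])
ax.set_xlabel("blocks")
ax.set_ylabel("value")
ax.legend(handles, [str(c) for c in cond_ids], title="condition")
```

Call plt.show() or fig.savefig(...) afterwards to see the result. The manage_ticks argument needs matplotlib 3.1 or later; on older versions you would have to reset the ticks after the loop instead.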

Related

Is there a faster way to get values based on a linear regression model and append them to a new column in a DataFrame?

I created this code below to make a new column in my dataframe to compare the actual values and regressed value:
from sklearn.linear_model import LinearRegression
b = dfSemoga.loc[:, ['DoB','AA','logtime']]
# reshape(-1, 1) makes a single column; len(dfSemoga)+1 rows would raise a ValueError
y = dfSemoga.loc[:,'logCO2'].values.reshape(-1, 1)
lr = LinearRegression().fit(b,y)
z = lr.coef_[0,0]
j = lr.coef_[0,1]
k = lr.coef_[0,2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i,'EF CO2 Predict'] = (c + dfSemoga.loc[i,'DoB']*z +
        dfSemoga.loc[i,'logtime']*k + dfSemoga.loc[i, 'AA']*j)
So, I basically regress a column with three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed value in a new column called dfSemoga['EF CO2 Predict'] I assign the coefficient manually, as shown in the for loop.
Is there any fancy one-liner code that I can write to make my work more efficient?
Without sample data I can't confirm but you should just be able to do
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:
In [4]: df
Out[4]:
a b
0 0 0
1 0 8
2 7 6
3 3 1
4 3 8
5 6 6
6 4 8
7 2 7
8 3 8
9 8 1
In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]
In [6]: df
Out[6]:
a b c
0 0 0 3.0
1 0 8 -45.0
2 7 6 -29.5
3 3 1 -1.5
4 3 8 -43.5
5 6 6 -30.0
6 4 8 -43.0
7 2 7 -38.0
8 3 8 -43.5
9 8 1 1.0
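If you would rather not name each coefficient, the same column can be computed as a single matrix product. The sketch below is only an illustration: the data is synthetic, and I fit with np.linalg.lstsq as a stand-in so the example has no scikit-learn dependency. With your fitted model, lr.predict(b).ravel() should give the same values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dfSemoga (made-up values, same column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "DoB": rng.normal(size=50),
    "AA": rng.normal(size=50),
    "logtime": rng.normal(size=50),
})
df["logCO2"] = 1.5 + 2.0 * df["DoB"] - 0.5 * df["AA"] + 0.3 * df["logtime"]

features = ["DoB", "AA", "logtime"]
X = df[features].to_numpy()

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(len(df)), X])
params, *_ = np.linalg.lstsq(A, df["logCO2"].to_numpy(), rcond=None)
intercept, coefs = params[0], params[1:]

# Whole column at once: intercept + X @ coefficients, no Python loop
df["EF CO2 Predict"] = intercept + X @ coefs
```

Because the coefficients are kept in the same order as the feature columns, there is no chance of pairing a coefficient with the wrong variable.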

Multiple multi-line plots group wise in Python

I have a pandas dataframe like this -
(Creating a random dataframe)
import random
import pandas as pd
x = [random.randint(1,20) for i in range(20)]
y1 = [random.random() for i in range(20)]
y2 = [random.random() for i in range(20)]
y3 = [random.random() for i in range(20)]
y4 = [random.random() for i in range(20)]
g = ['a', 'b', 'c']
group = [random.choice(g) for i in range(20)]
data = {'Group': group, 'x': x, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4}
df = pd.DataFrame(data)
df.sort_values('Group')
The dataframe is like this -
>>> df.sort_values('Group')
Group x y1 y2 y3 y4
17 a 9 0.400730 0.242629 0.858307 0.799613
16 a 14 0.644299 0.952255 0.257262 0.376845
5 a 3 0.784374 0.800639 0.753612 0.441645
18 a 3 0.988016 0.739003 0.741000 0.299011
11 a 18 0.672816 0.232951 0.763451 0.762478
0 b 7 0.670889 0.785928 0.604563 0.620951
15 b 3 0.838479 0.286988 0.374546 0.013822
4 b 4 0.495855 0.159839 0.984262 0.882428
13 b 3 0.756058 0.979226 0.423426 0.297381
8 b 13 0.835705 0.374927 0.492676 0.939113
12 b 17 0.643511 0.156267 0.248037 0.316526
14 c 13 0.303215 0.177303 0.980071 0.705428
9 c 16 0.829414 0.173755 0.992532 0.398509
7 c 9 0.774353 0.082118 0.089582 0.587679
6 c 14 0.551595 0.737882 0.127206 0.985017
3 c 4 0.072765 0.497016 0.634819 0.149798
2 c 1 0.971598 0.254215 0.325086 0.588159
1 c 14 0.467277 0.631844 0.927199 0.051251
10 c 13 0.346592 0.384929 0.185384 0.330408
19 c 16 0.790785 0.449498 0.176042 0.036896
Using this dataframe, I intend to plot multiple graphs group-wise (in this case 3 graphs, as there are only 3 groups). Each graph is a multi-line graph with x on the x-axis and [y1, y2, y3, y4] on the y-axis.
How can I achieve this? I can plot a single multi-line graph, but I am unable to plot multiple plots group-wise.
You can use groupby:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(10,3))
for (grp, data), ax in zip(df.groupby('Group'), axes.flat):
    data.plot(x='x', ax=ax, title=grp)
Note: You don't really need to sort by group.

ValueError: Points must be Nx2 array, got 2x5

I'm trying to make an animation and am looking at the code of another stack overflow question. The code is the following
import matplotlib.pyplot as plt
from matplotlib import animation as animation
import numpy as np
import pandas as pd
import io
u = u"""Time M1 M2 M3 M4 M5
1 1 2 3 1 2
2 1 3 3 1 2
3 1 3 2 1 3
4 2 2 3 1 2
5 3 3 3 1 3
6 2 3 4 1 4
7 2 3 4 3 3
8 3 4 4 3 4
9 4 4 5 3 3
10 4 4 5 5 4"""
df_Bubble = pd.read_csv(io.StringIO(u), delim_whitespace=True)
time_count = len(df_Bubble)
colors = np.arange(1, 6)
x = np.arange(1, 6)
max_radius = 25
fig, ax = plt.subplots()
pic = ax.scatter(x, df_Bubble.iloc[0, 1:], s=100, c=colors)
pic.set_offsets([[np.nan]*len(colors)]*2)
ax.axis([0,7,0,7])
def init():
    pic.set_offsets([[np.nan]*len(colors)]*2)
    return pic,
def updateData(i):
    y = df_Bubble.iloc[i, 1:]
    area = np.pi * (max_radius * y / 10.0) ** 2
    pic.set_offsets([x, y.values])
    pic._sizes = area
    i+=1
    return pic,
ani = animation.FuncAnimation(fig, updateData,
    frames=10, interval = 50, blit=True, init_func=init)
plt.show()
When I run this code unchanged I get the error
ValueError: Points must be Nx2 array, got 2x5
I have looked at similar threads on this question and have come to the conclusion that the problem has to do with the line with [[np.nan]*len(colors)]*2. Based on the examples I found, I thought that changing a part of this line to an array might help, but none of my attempts have worked, and now I'm stuck. I would be grateful for any help.
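set_offsets expects an Nx2 array, with one (x, y) row per point. In updateData(i) you pass 2 arrays with 5 elements each, and in init() you pass 2 lists with 5 elements each, so the shape is 2x5 instead of 5x2. Transpose the data before passing it: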
def init():
    # Nx2 array of NaN so that no point is drawn initially
    pic.set_offsets(np.full((len(colors), 2), np.nan))
    return pic,
def updateData(i):
    y = df_Bubble.iloc[i, 1:]
    area = np.pi * (max_radius * y / 10.0) ** 2
    #pic.set_offsets(np.hstack([x[:i,np.newaxis], y.values[:i, np.newaxis]]))
    pic.set_offsets(np.transpose((x, y.values)))  # shape (5, 2)
    pic.set_sizes(area)  # public setter instead of the private _sizes attribute
    return pic,
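To see the shape difference on its own, here is a small numpy-only check. The y values are the first data row (M1..M5) from the question; the variable names wrong and right are just mine.

```python
import numpy as np

x = np.arange(1, 6)
y = np.array([1, 2, 3, 1, 2])  # first row of M1..M5 from the question

wrong = np.array([x, y])          # one row per coordinate: shape (2, 5)
right = np.column_stack((x, y))   # one row per point: shape (5, 2)

print(wrong.shape)  # (2, 5)
print(right.shape)  # (5, 2)
print(right[0])     # [1 1]
```

np.transpose((x, y)) and np.column_stack((x, y)) produce the same array here, so either can be passed to set_offsets.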

Box Plotting top ten values of a column in pandas

I have a large dataframe where I calculate the p-value using a t-test for each row. I now want a boxplot of the ten rows with the lowest p-values.
LeadSNPs = pd.unique(candidate_genes.LeadSNP) #rs3184504 rs531612
gene_counts_per_snp_df = pd.DataFrame()
save_path = "../figures/SM5_gene_counts/"
for LeadSNP_cnt, LeadSNP in enumerate(LeadSNPs):
    print(LeadSNP)
    candidate_genes_per_SNP = candidate_genes.Target[np.where(candidate_genes.LeadSNP==LeadSNP)[0]]
    region = pd.unique(candidate_genes.Region[np.where(candidate_genes.LeadSNP==LeadSNP)[0]])
    first_gene_flag = 1
    for gene_cnt, target_gene in enumerate(candidate_genes_per_SNP):
        gene_indexes = candidate_genes_per_SNP.index
        PRE = candidate_genes['sumOfWeightedWeights (PRE)'][gene_indexes[gene_cnt]]
        print(target_gene)
        ensembl_id = get_ensembl_id(target_gene)
        print(ensembl_id)
        if pd.isnull(ensembl_id):
            pass
        else:
            gene_counts_df = get_gene_counts_df(ensembl_id)
            if gene_counts_df.shape[0]==0:
                print('no ensemble id found in gene counts!')
            else:
                gene_counts_df = gene_counts_df.melt(id_vars=["Gene"], var_name='compartment', value_name='count')
                gene_counts_df = reshape_gene_counts_df(gene_counts_df)
                gene_counts_df['target_gene'] = target_gene
                gene_counts_df['PRE'] = PRE
                gene_counts_df['pval_ftest']= np.nan
                pop3= gene_counts_df.loc[(gene_counts_df['target_gene']==target_gene) & (gene_counts_df['compartment']=='CSF_N')]['count']
                pop4 = gene_counts_df.loc[(gene_counts_df['target_gene']==target_gene) & (gene_counts_df['compartment']=='PB_N')]['count']
                # store the t-test p-value under the name used on the next line
                pval_ftest = stats.ttest_ind(pop3, pop4)[1]
                gene_counts_df.loc[(gene_counts_df['target_gene']==target_gene) & (gene_counts_df['compartment'].isin(['CSF_N','PB_N'])),"pval_ftest"]= pval_ftest
                if first_gene_flag == 1:
                    gene_counts_per_snp_df = gene_counts_df
                    first_gene_flag = 0
                else:
                    gene_counts_per_snp_df = pd.concat([gene_counts_per_snp_df, gene_counts_df])
    gene_counts_per_snp_df['LeadSNP'] = LeadSNP
    if LeadSNP_cnt == 0:
        all_gene_counts = gene_counts_per_snp_df
    else:
        all_gene_counts = pd.concat([all_gene_counts, gene_counts_per_snp_df])
all_gene_counts = pd.DataFrame.reset_index(all_gene_counts)
plot_top_genes_snps(all_gene_counts_per_comp, 'target_gene')
and the plotting code is given here:
def plot_top_genes_snps(all_gene_counts_per_comp, x_label):
    sns.set(style="white")
    sns.set_context("poster")
    palette = sns.color_palette("colorblind", 10)
    fig, ax = plt.subplots(figsize=(25,4))
    g = sns.boxplot(ax=ax, y='count', x=x_label, data=all_gene_counts_per_comp, hue = 'compartment', showfliers=False, palette=palette, hue_order=comp_order)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    handles, _ = ax.get_legend_handles_labels()
    current_legends = []
    for str_ind in range(len(handles)):
        current_legends.append(comp_dict[handles[str_ind].get_label()])
    ax.legend(handles, current_legends, bbox_to_anchor=(1, 1), loc=2)
    ax.yaxis.grid()
    sns.set(font_scale = 2)
    plt.xlabel('')
    plt.ylabel('Gene count')
    # plt.savefig(save_path+str(LeadSNP)+'.pdf', bbox_inches='tight')
    plt.show()
For context, I want the ten target_gene values with the lowest p-values. However, this is the plot I am getting:
[image: allgenesandpvalues]
How do I extract only the ten lowest p-values and boxplot them?
Update: The dataframe looks like this; the table is repeated for different SNPs:
[image: dataframe]
The dataframe in text format:
               Gene compartment  count  patient_id target_gene  PRE  \
1   ENSG00000157870       CSF_N      0           1     FAM213B  7.5
11  ENSG00000157870       CSF_N      0           2     FAM213B  7.5
21  ENSG00000157870       CSF_N      0           3     FAM213B  7.5
31  ENSG00000157870       CSF_N      0           4     FAM213B  7.5
41  ENSG00000157870       CSF_N      0           5     FAM213B  7.5
..              ...         ...    ...         ...         ...  ...
21  ENSG00000182866       CSF_N     18           3         LCK  2.0
31  ENSG00000182866       CSF_N     45           4         LCK  2.0
41  ENSG00000182866       CSF_N      0           5         LCK  2.0
51  ENSG00000182866       CSF_N      9           6         LCK  2.0
61  ENSG00000182866       CSF_N      0           7         LCK  2.0

    pval_ftest    LeadSNP
1     0.222523  rs6670198
11    0.222523  rs6670198
21    0.222523  rs6670198
31    0.222523  rs6670198
41    0.222523  rs6670198
all_gene_counts_per_comp.sort_values(by="pval_ftest").iloc[:10, :]
will give you the 10 rows with the smallest "pval_ftest" values. Use .iloc (or .head(10)) rather than .loc here: .loc slices by index label, so after sorting, .loc[:10, :] returns every row up to the one labelled 10, which is not necessarily ten rows.
Maybe this toy example will make it clearer how to sort and select subsets of a DataFrame.
>>> df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
>>> print(df)
   a  b
0  1  4
1  2  3
2  3  2
3  4  1
>>> df_sorted = df.sort_values(by="b")
>>> print(df_sorted)
   a  b
3  4  1
2  3  2
1  2  3
0  1  4
>>> print(df_sorted.iloc[:2, :])
   a  b
3  4  1
2  3  2
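One caveat: in the question's dataframe the same p-value is repeated on every row of a gene, so the ten lowest rows will usually cover far fewer than ten genes. A rough sketch of selecting whole genes instead is below. The gene names, counts, and p-values are made up; only the column names 'target_gene', 'count', and 'pval_ftest' come from the question.

```python
import numpy as np
import pandas as pd

# Made-up data: 15 hypothetical genes, 3 patients each, one p-value per gene
rng = np.random.default_rng(0)
genes = [f"gene_{i:02d}" for i in range(15)]
pvals = np.linspace(0.01, 0.15, 15)
rows = []
for gene, p in zip(genes, pvals):
    for patient in range(1, 4):
        rows.append({
            "target_gene": gene,
            "patient_id": patient,
            "count": int(rng.integers(0, 50)),
            "pval_ftest": p,
        })
all_gene_counts = pd.DataFrame(rows)

# Keep one row per gene, then take the ten genes with the smallest p-value
top_genes = (
    all_gene_counts.drop_duplicates("target_gene")
    .nsmallest(10, "pval_ftest")["target_gene"]
)

# Keep every row (all patients, all compartments) belonging to those genes
top_rows = all_gene_counts[all_gene_counts["target_gene"].isin(top_genes)]
```

top_rows can then be handed to the plotting function from the question, e.g. plot_top_genes_snps(top_rows, 'target_gene'), so the boxplot only shows those ten genes.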

How to draw bar in python

I want to draw a bar chart for the data below:
4 1406575305 4
4 -220936570 2
4 2127249516 2
5 -1047108451 4
5 767099153 2
5 1980251728 2
5 -2015783241 2
6 -402215764 2
7 927697904 2
7 -631487113 2
7 329714360 2
7 1905727440 2
8 1417432814 2
8 1906874956 2
8 -1959144411 2
9 859830686 2
9 -1575740934 2
9 -1492701645 2
9 -539934491 2
9 -756482330 2
10 1273377106 2
10 -540812264 2
10 318171673 2
The 1st column is the x-axis value and the 3rd column is the y-axis value. Multiple rows can share the same x-axis value. For example,
4 1406575305 4
4 -220936570 2
4 2127249516 2
This means three bars for the x-axis value 4, with each bar labelled by its tag (the value in the middle column). The bar chart should look like this sample:
http://matplotlib.org/examples/pylab_examples/barchart_demo.html
I am using matplotlib.pyplot and numpy. Thanks.
I followed the tutorial you linked to, but it's a bit tricky to shift them by a nonuniform amount:
import numpy as np
import matplotlib.pyplot as plt
x, label, y = np.genfromtxt('tmp.txt', dtype=int, unpack=True)
ux, uidx, uinv = np.unique(x, return_index=True, return_inverse=True)
max_width = np.bincount(x).max()
bar_width = 1/(max_width + 0.5)
locs = x.astype(float)
shifted = []
for i in range(max_width):
    where = np.setdiff1d(uidx + i, shifted)
    locs[where[where<len(locs)]] += i*bar_width
    shifted = np.concatenate([shifted, where])
plt.bar(locs, y, bar_width)
If you want you can label them with the second column instead of x:
plt.xticks(locs + bar_width/2, label, rotation=-90)
I'll leave doing both of them as an exercise to the reader (mainly because I have no idea how you want them to show up).
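If pandas is an option, the shifting loop can be replaced by a per-group running count. This is only a sketch of the idea, using the first eight rows of the question's data typed in by hand; the names rank and locs are mine.

```python
import numpy as np
import pandas as pd

# First eight rows of the question's data
df = pd.DataFrame({
    "x": [4, 4, 4, 5, 5, 5, 5, 6],
    "label": [1406575305, -220936570, 2127249516, -1047108451,
              767099153, 1980251728, -2015783241, -402215764],
    "y": [4, 2, 2, 4, 2, 2, 2, 2],
})

# 0, 1, 2, ... within each x value: which bar of its group a row is
rank = df.groupby("x").cumcount()

# Same width rule as above: the widest group nearly fills one unit of x
max_width = df.groupby("x").size().max()
bar_width = 1 / (max_width + 0.5)

# Left edge of each bar: the x value plus a shift of one bar width per rank
locs = df["x"] + rank * bar_width
```

The bars can then be drawn with plt.bar(locs, df["y"], bar_width, align="edge"). I pass align="edge" explicitly because, as far as I know, matplotlib 2.0 and later centre bars on their x position by default, which would shift everything by half a bar.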
