Consider a CSV file:
z, a, error, b, error
cm, kg, dl , kg, dl
1.0 , 2.0, 3.0, 4.0, 5.0
1.1 , 2.1, 3.1, 4.1, 5.1
1.2 , 2.2, 3.2, 4.2, 5.2
The first line gives the variable names. The second line gives the units of each variable. One option is to ignore the second line, which is what I am currently doing.
Is there a more consistent way of doing this than ignoring the second line?
There is! You can tell pandas that your csv contains more than one header row.
header : int, list of int, None, default ‘infer’
Row number(s) to use as the column names, and the start of the data. [. . . ] The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3] [. . . ] (pandas documentation on read_csv)
Input csv
z,a,error,b,error
cm,kg,dl,kg,dl
1.0,2.0,3.0,4.0,5.0
1.1,2.1,3.1,4.1,5.1
1.2,2.2,3.2,4.2,5.2
Open it
df = pd.read_csv(path_to_csv, header=[0,1])
Your Dataframe
z a error b error
cm kg dl kg dl.1
0 1.0 2.0 3.0 4.0 5.0
1 1.1 2.1 3.1 4.1 5.1
2 1.2 2.2 3.2 4.2 5.2
You can now easily access the columns and rows.
Result of df["z"]
cm
0 1.0
1 1.1
2 1.2
Result of df.loc[1, "z"]
cm 1.1
Name: 1, dtype: float64
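As a minimal, self-contained sketch of the above (the CSV text is inlined through io.StringIO instead of a file path, which is an assumption made only so the snippet runs on its own), the two header rows become the two levels of a column MultiIndex, so names and units can be read back separately:

```python
import io

import pandas as pd

# Same content as the input csv above, inlined so the example is self-contained.
csv_text = """z,a,error,b,error
cm,kg,dl,kg,dl
1.0,2.0,3.0,4.0,5.0
1.1,2.1,3.1,4.1,5.1
1.2,2.2,3.2,4.2,5.2
"""

# header=[0, 1] tells pandas that the first two rows are both header rows.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Level 0 holds the variable names, level 1 holds the units.
names = df.columns.get_level_values(0)
units = df.columns.get_level_values(1)

# A single column is addressed by a (name, unit) tuple.
z_cm = df[("z", "cm")]
print(z_cm.tolist())
```

Selecting with the full tuple returns a Series, whereas df["z"] returns a one-column DataFrame whose remaining column label is the unit.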
I have a pandas DataFrame Bg that is built with samples in rows and r in columns. r is a list of genes that I want to split row-wise for the entire DataFrame.
My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?
import pandas as pd
Bg = pd.DataFrame()
for idx, r in pathway_genes.itertuples():
    for i, p in enumerate(M.index):
        if idx == p:
            for genes, samples in common_mrna.iterrows():
                b = pd.DataFrame({r:samples})
                Bg = Bg.append(b).fillna(0)
M.index
M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']
pathway_genes
geneSymbols
KEGG_ABC_TRANSPORTERS
['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA
['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION
['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY
['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']
common_mrna
common_mrna = pd.DataFrame([[1.2, 1.3, 1.4, 1.5], [1.6,1.7,1.8,1.9], [2.0,2.1,2.2,2.3], [2.4,2.5,2.6,2.7], [2.8,2.9,3.0,3.1],[3.2,3.3,3.4,3.5],[3.6,3.7,3.8,3.9],[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ABCA1','ABCA10','ABCA12','AKT1','AKT2','AKT3','ARAF','ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])
Desired output:
Bg = pd.DataFrame([[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])
First of all, you can use a list comprehension to match M_index (the list of pathway names held in M.index) against pathway_genes:
pathway_genes = {'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'], 'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'], 'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'], 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'], 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}
matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]
After that, you can use loc to match all the symbols.
flatten_list = [j for sub in matched_index_symbols for j in sub]
Bg = common_mrna.loc[flatten_list]
Out[26]:
TCGA-02-0033-01 TCGA-02-2470-01 TCGA-02-2483-01 TCGA-06-0124-01
ABCA1 1.2 1.3 1.4 1.5
ABCA10 1.6 1.7 1.8 1.9
ABCA12 2.0 2.1 2.2 2.3
ACP1 4.0 4.1 4.2 4.3
ACTB 4.4 4.5 4.6 4.7
ACTG1 4.8 4.9 5.0 5.1
ACTN1 5.2 5.3 5.4 5.5
ACTN2 5.6 5.7 5.8 5.9
ABAT 6.0 6.1 6.2 6.3
ACY3 6.4 6.5 6.6 6.7
ADSL 6.8 6.9 7.0 7.1
ADSS1 7.2 7.3 7.4 7.5
ADSS2 7.6 7.7 7.8 7.9
Update
It seems that your pathway_genes is not originally a dictionary but a DataFrame. If that's the case, you can pull the gene lists out of its geneSymbols column by index label.
pathway_genes
Out[46]:
geneSymbols
KEGG_ABC_TRANSPORTERS [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM [ABAT, ACY3, ADSL, ADSS1, ADSS2]
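# keep plain Python lists: np.array(...).ravel() does not flatten gene lists of unequal length
matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]
flatten_list = [gene for sub in matched_index_symbols for gene in sub]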
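Putting the DataFrame variant together, here is a hedged sketch on a cut-down version of the question's data (the shortened gene lists, the two sample columns and the variable names selected, wanted and present are illustrative assumptions). It uses Series.explode to flatten the gene lists and skips genes that are absent from common_mrna, so .loc does not raise a KeyError:

```python
import pandas as pd

# Cut-down stand-ins for the question's objects (illustrative values only).
pathway_genes = pd.DataFrame(
    {"geneSymbols": [["ABCA1", "ABCA10"], ["ACP1", "ACTB"], ["ABAT", "ACY3"]]},
    index=[
        "KEGG_ABC_TRANSPORTERS",
        "KEGG_ADHERENS_JUNCTION",
        "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM",
    ],
)
M_index = [
    "KEGG_PEROXISOME",
    "KEGG_ADHERENS_JUNCTION",
    "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM",
]
common_mrna = pd.DataFrame(
    [[1.2, 1.3], [1.6, 1.7], [4.0, 4.1], [4.4, 4.5], [6.0, 6.1]],
    columns=["TCGA-02-0033-01", "TCGA-02-2470-01"],
    index=["ABCA1", "ABCA10", "ACP1", "ACTB", "ABAT"],
)

# Keep only the pathways that also appear in M_index.
selected = pathway_genes.loc[pathway_genes.index.isin(M_index), "geneSymbols"]

# One gene per row instead of one list per pathway.
wanted = selected.explode()

# Drop genes that have no expression row (ACY3 in this toy data).
present = wanted[wanted.isin(common_mrna.index)]

# A single label-based selection replaces the nested loops and repeated append.
Bg = common_mrna.loc[present.tolist()]
print(Bg)
```

The point of the design is that the row selection happens once, instead of growing Bg one small DataFrame at a time inside a triple loop.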
I have to perform an interpolation in 3 or 4 dimensions, starting from tabular data stored as a pandas DataFrame.
I have the following data stored in the variable df : DataFrame:
xm xA xl z
2.3 4.6 10.0 1.905
2.3 4.6 11.0 1.907
2.3 4.8 10.0 1.908
2.3 4.8 11.0 1.909
2.4 4.6 10.0 1.811
2.4 4.6 11.0 1.812
2.4 4.8 10.0 1.813
2.4 4.8 11.0 1.814
xm, xA, xl are the axes from which the grid should be built. The column z contains the values to interpolate. The regular grid I came up with is calculated as:
grid = np.meshgrid(*(df.xm,df.xA,df.xl))
Now my problem is how to turn the Z-series data from the DataFrame into a np.array to be passed to the Scipy function:
from scipy import interpolate
p0 = (xm0,xA0,xl0)
z0 = interpolate.interpn(grid, myarray, p0)
Thanks to SCKU for the hint on the z-column reshape. I was using
grid = np.meshgrid(*(df.xm,df.xA,df.xl))
following the example from scipy doc.
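It was actually enough to pass the tuple of base axis arrays (the unique, sorted values along each axis) rather than the full columns:
xm, xA, xl = (np.unique(df[c]) for c in ("xm", "xA", "xl"))  # base axis arrays
grid = np.meshgrid(xm, xA, xl, indexing="ij")  # "ij" keeps the axis order xm, xA, xl
z = df.z.values.reshape(grid[0].shape)  # assumes rows are sorted by xm, then xA, then xl
xt = (xm, xA, xl)
p0 = (xm0, xA0, xl0)
val = interpolate.interpn(xt, z, p0)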
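For reference, a runnable sketch of this on the eight rows from the question. It assumes the table covers a complete regular grid and sorts the rows first, so that the reshape lines up with the axis order. The query point at the centre of the cell is a made-up example:

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interpn

# The table from the question.
df = pd.DataFrame({
    "xm": [2.3] * 4 + [2.4] * 4,
    "xA": [4.6, 4.6, 4.8, 4.8] * 2,
    "xl": [10.0, 11.0] * 4,
    "z": [1.905, 1.907, 1.908, 1.909, 1.811, 1.812, 1.813, 1.814],
})

# Sort so that xm varies slowest and xl fastest (C order for the reshape).
df = df.sort_values(["xm", "xA", "xl"])

# One 1-D array of unique, ascending coordinates per axis.
axes = tuple(np.unique(df[c]) for c in ("xm", "xA", "xl"))

# Values on the grid, shaped (len(xm), len(xA), len(xl)).
z = df["z"].to_numpy().reshape([len(a) for a in axes])

# Hypothetical query point: the centre of the single grid cell.
p0 = np.array([[2.35, 4.7, 10.5]])
val = interpn(axes, z, p0)  # linear interpolation by default
print(val[0])
```

With the default linear method, the value at the centre of a cell is the average of its eight corner values, which gives a quick sanity check on the axis ordering.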
I have imported a large csv file. Below is the output, where Flavor_Score and Overall_Score are the results of applying df.groupby('Beer_name').mean() across a multitude of testers. I would like to add a standard deviation column for each of Flavor_Score and Overall_Score, to the right of the corresponding mean column. Computing it is clear, but how do I add the column for display? Of course, I could generate an array and append it (right?), but that seems cumbersome.
Beer_name Beer_Style Flavor_Score Overall_Score
Coors Light 2.0 3.0
Sam Adams Dark 4.0 4.5
Becks Light 3.5 3.5
Guinness Dark 2.0 2.2
Heineken Light 3.5 3.7
You could use
df.groupby('Beer_name').agg(['mean','std'])
This computes the mean and the std for each group.
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 100
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
style = ['Light', 'Dark', 'Light', 'Dark', 'Light']
df = pd.DataFrame({'Beer_name': np.random.choice(beers, N),
                   'Flavor_Score': np.random.uniform(0, 10, N),
                   'Overall_Score': np.random.uniform(0, 10, N)})
df['Beer_Style'] = df['Beer_name'].map(dict(zip(beers, style)))
# select the numeric columns; recent pandas versions raise on the string column Beer_Style
print(df.groupby('Beer_name')[['Flavor_Score', 'Overall_Score']].agg(['mean','std']))
yields
Flavor_Score Overall_Score
mean std mean std
Beer_name
Becks 5.779266 3.033939 6.995177 2.697787
Coors 6.521966 2.008911 4.066374 3.070217
Guinness 4.836690 2.644291 5.577085 2.466997
Heineken 4.622213 3.108812 6.372361 2.904932
Sam Adams 5.443279 3.311825 4.697961 3.164757
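If flat column names are preferred over the two-level header, named aggregation is one option. This is a sketch on a tiny made-up table, and the output names Flavor_Mean, Flavor_Std, Overall_Mean and Overall_Std are illustrative choices, not anything from the question. It places each std column directly to the right of its mean:

```python
import pandas as pd

# Tiny made-up ratings table: one row per tester and beer.
df = pd.DataFrame({
    "Beer_name": ["Coors", "Coors", "Becks", "Becks"],
    "Flavor_Score": [2.0, 4.0, 3.0, 4.0],
    "Overall_Score": [3.0, 3.0, 3.5, 3.9],
})

# Named aggregation: new_column=(source_column, function).
# Output columns appear in the order the keywords are written.
summary = df.groupby("Beer_name").agg(
    Flavor_Mean=("Flavor_Score", "mean"),
    Flavor_Std=("Flavor_Score", "std"),
    Overall_Mean=("Overall_Score", "mean"),
    Overall_Std=("Overall_Score", "std"),
)
print(summary)
```

The std here is the sample standard deviation (ddof=1), the same as in the agg(['mean','std']) call above.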
groupby.agg([fun1, fun2]) computes any number of functions in one step:
from random import choice, random
import pandas as pd
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
styles = ['Light', 'Dark']
def generate():
    # 100 random ratings, one dict per row
    for i in range(100):
        yield dict(beer=choice(beers), style=choice(styles),
                   flavor_score=random()*10.0,
                   overall_score=random()*10.0)
pd.options.display.float_format = ' {:,.1f} '.format
df = pd.DataFrame(generate())
print(df.groupby(['beer', 'style']).agg(['mean', 'std']))
=>
flavor_score overall_score
mean std mean std
beer style
Becks Dark 7.1 3.6 1.9 1.6
Light 4.7 2.4 2.0 1.0
Coors Dark 5.5 3.2 2.6 1.1
Light 5.3 2.5 1.9 1.1
Guinness Dark 3.3 1.4 2.1 1.1
Light 4.7 3.6 2.2 1.1
Heineken Dark 4.4 3.0 2.7 1.0
Light 6.0 2.3 2.1 1.3
Sam Adams Dark 3.4 3.0 1.7 1.2
Light 5.2 3.6 1.6 1.3
What if I need to apply a user-defined function to just the flavor_score column? Let's say I want to subtract 0.5 from the flavor_score column for all rows except Heineken, for which I want to add 0.25:
grouped[grouped.beer != 'Heineken']['flavor_score']['mean'] - 0.5
grouped[grouped.beer == 'Heineken']['flavor_score']['mean'] + 0.25
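One possible way to express that adjustment, offered as a sketch rather than a definitive answer: after the groupby, beer is the index of the aggregated frame and not a column, so the condition can be built from grouped.index. The toy data and the names is_heineken, offset and adjusted are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings; the values are made up.
df = pd.DataFrame({
    "beer": ["Coors", "Coors", "Heineken", "Heineken", "Becks", "Becks"],
    "flavor_score": [2.0, 4.0, 3.0, 4.0, 5.0, 7.0],
    "overall_score": [3.0, 3.0, 3.5, 3.9, 6.0, 8.0],
})

grouped = df.groupby("beer")[["flavor_score", "overall_score"]].agg(["mean", "std"])

# After groupby, "beer" is the index, so test the index rather than a column.
is_heineken = grouped.index == "Heineken"

# +0.25 for Heineken, -0.5 for every other beer.
offset = np.where(is_heineken, 0.25, -0.5)

# Columns of the aggregated frame are (column, statistic) tuples.
adjusted = grouped[("flavor_score", "mean")] + offset
print(adjusted)
```

The result is a Series indexed by beer. It can be assigned back to the frame under a new (column, statistic) tuple if it should be displayed next to the other statistics.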