How can I manage units in pandas data?

I'm trying to figure out if there is a good way to manage units in my pandas data. For example, I have a DataFrame that looks like this:
   length (m)  width (m)  thickness (cm)
0         1.2        3.4             5.6
1         7.8        9.0             1.2
2         3.4        5.6             7.8
Currently, the measurement units are encoded in column names. Downsides include:
column selection is awkward -- df['width (m)'] vs. df['width']
things will likely break if the units of my source data change
If I wanted to strip the units out of the column names, is there somewhere else that the information could be stored?

There isn't any great way to do this right now; see this GitHub issue for some discussion.
As a quick hack, you could do something like the following, maintaining a separate dict with the units.
In [3]: units = {}

In [5]: newcols = []
   ...: for col in df:
   ...:     name, unit = col.split(' ')
   ...:     units[name] = unit
   ...:     newcols.append(name)

In [6]: df.columns = newcols

In [7]: df
Out[7]:
   length  width  thickness
0     1.2    3.4        5.6
1     7.8    9.0        1.2
2     3.4    5.6        7.8

In [8]: units['length']
Out[8]: '(m)'
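
A related option I'll add here (not part of the original answer): pandas 1.0+ has an experimental attrs dict on each frame, intended for exactly this kind of metadata. A minimal sketch, with the caveat that attrs is not guaranteed to survive all operations:

df.attrs['units'] = units       # stash the units dict on the frame itself
df.attrs['units']['length']     # '(m)'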

I was searching for this too. Here is what pint and the (experimental) pint_pandas are capable of today:
import pandas as pd
import pint
import pint_pandas

ureg = pint.UnitRegistry()
ureg.Unit.default_format = "~P"
pint_pandas.PintType.ureg.default_format = "~P"

df = pd.DataFrame({
    "length": pd.Series([1.2, 7.8, 3.4], dtype="pint[m]"),
    "width": pd.Series([3.4, 9.0, 5.6], dtype="pint[m]"),
    "thickness": pd.Series([5.6, 1.2, 7.8], dtype="pint[cm]"),
})

print(df.pint.dequantify())

     length width thickness
unit      m     m        cm
0       1.2   3.4       5.6
1       7.8   9.0       1.2
2       3.4   5.6       7.8

df['width'] = df['width'].pint.to("inch")
print(df.pint.dequantify())

     length       width thickness
unit      m          in        cm
0       1.2  133.858268       5.6
1       7.8  354.330709       1.2
2       3.4  220.472441       7.8
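
Worth noting (my addition, based on my reading of the pint-pandas docs): dequantify() has an inverse, quantify(), which can ingest a frame whose units sit in a second column-header level, like the csv in the question further below:

df2 = pd.read_csv(path_to_csv, header=[0, 1])  # name row + unit row
df2 = df2.pint.quantify(level=-1)              # unit level -> pint dtypes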

Some options:
pandas-units-extension: janpipek/pandas-units-extension: Units extension array for pandas based on astropy
pint-pandas: hgrecco/pint-pandas: Pandas support for pint
You can also extend pandas yourself, following the "Extending pandas" page of the pandas documentation; a minimal sketch of that route follows below.
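
Here is what the do-it-yourself route can look like, using the documented register_dataframe_accessor API; the accessor name and its methods are illustrative, not from any library:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("units")
class UnitsAccessor:
    # Keeps a units dict alongside the frame. The accessor instance is
    # cached per DataFrame object, so the dict survives repeated access,
    # but it does not propagate to new frames returned by operations.
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        self._units = {}

    def set(self, mapping):
        self._units.update(mapping)

    def get(self, col):
        return self._units.get(col)

df = pd.DataFrame({'length': [1.2, 7.8, 3.4]})
df.units.set({'length': 'm'})
df.units.get('length')   # 'm'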

Related

What is the proper way to handle units in a csv file using python csv or pandas?

consider a csv file:
z, a, error, b, error
cm, kg, dl , kg, dl
1.0 , 2.0, 3.0, 4.0, 5.0
1.1 , 2.1, 3.1, 4.1, 5.1
1.2 , 2.2, 3.2, 4.2, 5.2
The first line tells us what the variables are, and the second line gives the units of each variable. One way would be to ignore the second line, which is what I am currently doing.
Is there a more consistent way of handling this than ignoring the second line?
There is! You can tell pandas that your csv contains more than one header row.
header : int, list of int, None, default 'infer'
Row number(s) to use as the column names, and the start of the data. [...] The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3] [...] (pandas documentation on read_csv)
Input csv
z,a,error,b,error
cm,kg,dl,kg,dl
1.0,2.0,3.0,4.0,5.0
1.1,2.1,3.1,4.1,5.1
1.2,2.2,3.2,4.2,5.2
Open it
df = pd.read_csv(path_to_csv, header=[0,1])
Your Dataframe
     z    a error    b error
    cm   kg    dl   kg  dl.1
0  1.0  2.0   3.0  4.0   5.0
1  1.1  2.1   3.1  4.1   5.1
2  1.2  2.2   3.2  4.2   5.2
You can now easily access the columns and rows.
Result of df["z"]

    cm
0  1.0
1  1.1
2  1.2

Result of df.loc[1, "z"]

cm    1.1
Name: 1, dtype: float64
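
If you later want plain single-level column names, a possible follow-up (my addition, not from the original answer) is to record the unit row and then drop that level:

# Record units, then flatten the columns. Note the duplicated 'error'
# name: a plain dict keeps only the last unit seen for it.
units = dict(zip(df.columns.get_level_values(0),
                 df.columns.get_level_values(1)))
df.columns = df.columns.droplevel(1)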

Efficient way to iterate over rows and columns in pandas

I have a pandas DataFrame Bg that was created by taking samples as rows and r as columns, where r is a list of genes that I want to split in a row-wise manner for the entire DataFrame.
My code below takes a long time to run and repeatedly crashes. I would like to know if there is a more efficient way to achieve the aim.
import pandas as pd

Bg = pd.DataFrame()
for idx, r in pathway_genes.itertuples():
    for i, p in enumerate(M.index):
        if idx == p:
            for genes, samples in common_mrna.iterrows():
                b = pd.DataFrame({r: samples})
                Bg = Bg.append(b).fillna(0)

M.index

M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
           'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
           'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION',
           'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']
pathway_genes

                                                                     geneSymbols
KEGG_ABC_TRANSPORTERS                                    [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA                             [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION                         [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY         [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM  [ABAT, ACY3, ADSL, ADSS1, ADSS2]
common_mrna
common_mrna = pd.DataFrame(
    [[1.2, 1.3, 1.4, 1.5], [1.6, 1.7, 1.8, 1.9], [2.0, 2.1, 2.2, 2.3],
     [2.4, 2.5, 2.6, 2.7], [2.8, 2.9, 3.0, 3.1], [3.2, 3.3, 3.4, 3.5],
     [3.6, 3.7, 3.8, 3.9], [4.0, 4.1, 4.2, 4.3], [4.4, 4.5, 4.6, 4.7],
     [4.8, 4.9, 5.0, 5.1], [5.2, 5.3, 5.4, 5.5], [5.6, 5.7, 5.8, 5.9],
     [6.0, 6.1, 6.2, 6.3], [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1],
     [7.2, 7.3, 7.4, 7.5], [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ABCA1', 'ABCA10', 'ABCA12', 'AKT1', 'AKT2', 'AKT3', 'ARAF',
           'ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2', 'ABAT', 'ACY3',
           'ADSL', 'ADSS1', 'ADSS2'])
Desired output:
Bg = pd.DataFrame(
    [[4.0, 4.1, 4.2, 4.3], [4.4, 4.5, 4.6, 4.7], [4.8, 4.9, 5.0, 5.1],
     [5.2, 5.3, 5.4, 5.5], [5.6, 5.7, 5.8, 5.9], [6.0, 6.1, 6.2, 6.3],
     [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1], [7.2, 7.3, 7.4, 7.5],
     [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2', 'ABAT', 'ACY3',
           'ADSL', 'ADSS1', 'ADSS2'])
First of all, you can use a list comprehension to match M_index with pathway_genes:
pathway_genes = {
    'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'],
    'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'],
    'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'],
    'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'],
    'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'],
}
matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]
After that, you can use loc to match all the symbols.
flatten_list = [j for sub in matched_index_symbols for j in sub]
Bg = common_mrna.loc[flatten_list]
Out[26]:
        TCGA-02-0033-01  TCGA-02-2470-01  TCGA-02-2483-01  TCGA-06-0124-01
ABCA1               1.2              1.3              1.4              1.5
ABCA10              1.6              1.7              1.8              1.9
ABCA12              2.0              2.1              2.2              2.3
ACP1                4.0              4.1              4.2              4.3
ACTB                4.4              4.5              4.6              4.7
ACTG1               4.8              4.9              5.0              5.1
ACTN1               5.2              5.3              5.4              5.5
ACTN2               5.6              5.7              5.8              5.9
ABAT                6.0              6.1              6.2              6.3
ACY3                6.4              6.5              6.6              6.7
ADSL                6.8              6.9              7.0              7.1
ADSS1               7.2              7.3              7.4              7.5
ADSS2               7.6              7.7              7.8              7.9
Update
It seems that your pathway_genes is not originally a dictionary but a DataFrame. If that's the case, you can pull the gene lists out of its geneSymbols column via the index.
pathway_genes
Out[46]:
                                                                     geneSymbols
KEGG_ABC_TRANSPORTERS                                    [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA                             [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION                         [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY         [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM  [ABAT, ACY3, ADSL, ADSS1, ADSS2]
matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]
flatten_list = [j for sub in matched_index_symbols for j in sub]
(The original np.array(...).ravel() is unreliable here: the gene lists have different lengths, so they would form a ragged object array that ravel() cannot flatten.)
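
For reference, the same flattening with the standard library, reusing the variables above (my sketch, not from the original answer):

from itertools import chain

matched = (pathway_genes['geneSymbols'].loc[i]
           for i in pathway_genes.index if i in M_index)
flatten_list = list(chain.from_iterable(matched))
Bg = common_mrna.loc[flatten_list]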

Arrange Pandas DataFrame to be used for scipy interpn

I have to perform an interpolation in 3 or 4 dimensions moving from a tabular data stored as Pandas DataFrame.
I have the following data stored in the variable df : DataFrame:
 xm   xA    xl      z
2.3  4.6  10.0  1.905
2.3  4.6  11.0  1.907
2.3  4.8  10.0  1.908
2.3  4.8  11.0  1.909
2.4  4.6  10.0  1.811
2.4  4.6  11.0  1.812
2.4  4.8  10.0  1.813
2.4  4.8  11.0  1.814
xm, xA, xl are the axes from which the grid should be drawn. The column z contains the values from which the interpolation is to be performed. The regular grid I came up with is calculated as:
grid = np.meshgrid(*(df.xm,df.xA,df.xl))
Now my problem is how to turn the z-column data from the DataFrame into an np.array to be passed to the scipy function:
from scipy import interpolate
p0 = (xm0,xA0,xl0)
z0 = interpolate.interpn(grid, myarray, p0)
Thanks to SCKU for the hint on the z-column reshape. I was using

grid = np.meshgrid(*(df.xm, df.xA, df.xl))

following the example from the scipy docs, but it was actually enough to pass the tuple of the base axis arrays, i.e. the 1-D unique values along each dimension, with z reshaped to the grid's shape:

xm = np.sort(df.xm.unique())
xA = np.sort(df.xA.unique())
xl = np.sort(df.xl.unique())

z = df.z.values.reshape(len(xm), len(xA), len(xl))  # row order must match the axes
xt = (xm, xA, xl)
p0 = (xm0, xA0, xl0)
val = interpolate.interpn(xt, z, p0)
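
Putting it together as a self-contained example with the data from the question (the query point p0 is made up for illustration):

import numpy as np
import pandas as pd
from scipy import interpolate

df = pd.DataFrame({
    'xm': [2.3, 2.3, 2.3, 2.3, 2.4, 2.4, 2.4, 2.4],
    'xA': [4.6, 4.6, 4.8, 4.8, 4.6, 4.6, 4.8, 4.8],
    'xl': [10.0, 11.0, 10.0, 11.0, 10.0, 11.0, 10.0, 11.0],
    'z':  [1.905, 1.907, 1.908, 1.909, 1.811, 1.812, 1.813, 1.814],
})

# 1-D axes: sorted unique values along each dimension
xm = np.sort(df.xm.unique())
xA = np.sort(df.xA.unique())
xl = np.sort(df.xl.unique())

# Reshape z onto the (xm, xA, xl) grid; the rows must be sorted by
# xm, then xA, then xl, as in the table above.
z = df.sort_values(['xm', 'xA', 'xl']).z.values.reshape(len(xm), len(xA), len(xl))

p0 = (2.35, 4.7, 10.5)                         # hypothetical query point
val = interpolate.interpn((xm, xA, xl), z, p0)
print(val)                                     # ~1.86, the trilinear blend of the 8 neighbours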

How can an array be added as a separate column to a CSV file using numpy (preferable) or python 3 without overwriting the CSV file or data?

The new data to be appended has a shorter length!
Here is an example. Existing csv (two columns):

1.1 2.1
1.2 2.2
1.3 2.3
1.4 2.4
1.5 2.5
1.6 2.6

Numpy array to append:

ema = [3.3 3.4 3.5 3.6]

Desired csv (three columns of equal length, ema padded at the top):

1.1 2.1 0
1.2 2.2 0
1.3 2.3 3.3
1.4 2.4 3.4
1.5 2.5 3.5
1.6 2.6 3.6
#kinshukdua's comment suggestion as code:

Best way would be to read the old data as a pandas dataframe, then
append the column to it, fill the empty cells with 0s, and finally
write it back to csv.

import numpy as np
import pandas as pd

my_csv_path = r"path/to/csv.csv"
df = pd.read_csv(my_csv_path)

ema = np.array([3.3, 3.4, 3.5, 3.6])
# Left-pad with zeros so the new column matches the frame's length.
ema_padded = np.concatenate([np.zeros(len(df) - ema.shape[0]), ema])
df['ema'] = pd.Series(ema_padded, index=df.index)

df.to_csv(my_csv_path, index=False)
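
Since the question prefers numpy, here is a numpy-only sketch as well (my addition; it assumes a headerless, comma-separated file):

import numpy as np

data = np.genfromtxt(my_csv_path, delimiter=',')   # existing columns
ema = np.array([3.3, 3.4, 3.5, 3.6])
pad = np.zeros(data.shape[0] - ema.shape[0])       # zeros for the missing rows
out = np.column_stack([data, np.concatenate([pad, ema])])
np.savetxt(my_csv_path, out, delimiter=',', fmt='%g')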

Adding calculated columns to the Dataframe in pandas

There is a large csv file imported. Below is the output, where Flavor_Score and Overall_Score are the results of applying df.groupby('Beer_name').mean() across a multitude of testers. I would like to add a Std Deviation column for each of Flavor_Score and Overall_Score, to the right of the corresponding mean column. The function to apply is clear, but how do I add the columns for display? Of course, I could generate an array and append it, but that seems a cumbersome way to do it.
  Beer_name Beer_Style  Flavor_Score  Overall_Score
      Coors      Light           2.0            3.0
  Sam Adams       Dark           4.0            4.5
      Becks      Light           3.5            3.5
   Guinness       Dark           2.0            2.2
   Heineken      Light           3.5            3.7
You could use

df.groupby('Beer_name').agg(['mean', 'std'])

This computes the mean and the std for each group. For example,

import numpy as np
import pandas as pd

np.random.seed(2015)

N = 100
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
style = ['Light', 'Dark', 'Light', 'Dark', 'Light']
df = pd.DataFrame({'Beer_name': np.random.choice(beers, N),
                   'Flavor_Score': np.random.uniform(0, 10, N),
                   'Overall_Score': np.random.uniform(0, 10, N)})
df['Beer_Style'] = df['Beer_name'].map(dict(zip(beers, style)))

print(df.groupby('Beer_name').agg(['mean', 'std']))
yields
          Flavor_Score           Overall_Score
                  mean       std          mean       std
Beer_name
Becks         5.779266  3.033939      6.995177  2.697787
Coors         6.521966  2.008911      4.066374  3.070217
Guinness      4.836690  2.644291      5.577085  2.466997
Heineken      4.622213  3.108812      6.372361  2.904932
Sam Adams     5.443279  3.311825      4.697961  3.164757
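
If you want the std literally as a flat column next to each mean (single-level names), the MultiIndex that agg() produces can be flattened; a small sketch (selecting the numeric columns explicitly, since newer pandas will not aggregate the string column):

stats = df.groupby('Beer_name')[['Flavor_Score', 'Overall_Score']].agg(['mean', 'std'])
stats.columns = ['_'.join(col) for col in stats.columns]
# Columns are now: Flavor_Score_mean, Flavor_Score_std,
#                  Overall_Score_mean, Overall_Score_std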
groupby.agg([fun1, fun2]) computes any number of functions in one step:

from random import choice, random
import pandas as pd
import numpy as np

beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
styles = ['Light', 'Dark']

def generate():
    for i in range(100):
        yield dict(beer=choice(beers), style=choice(styles),
                   flavor_score=random() * 10.0,
                   overall_score=random() * 10.0)

pd.options.display.float_format = ' {:,.1f} '.format
df = pd.DataFrame(generate())
print(df.groupby(['beer', 'style']).agg([np.mean, np.std]))
=>
                flavor_score      overall_score
                        mean  std          mean  std
beer      style
Becks     Dark           7.1  3.6           1.9  1.6
          Light          4.7  2.4           2.0  1.0
Coors     Dark           5.5  3.2           2.6  1.1
          Light          5.3  2.5           1.9  1.1
Guinness  Dark           3.3  1.4           2.1  1.1
          Light          4.7  3.6           2.2  1.1
Heineken  Dark           4.4  3.0           2.7  1.0
          Light          6.0  2.3           2.1  1.3
Sam Adams Dark           3.4  3.0           1.7  1.2
          Light          5.2  3.6           1.6  1.3
What if I need to apply a user-defined function to just the flavor_score column? Say I want to subtract 0.5 from flavor_score for all rows except Heineken, for which I want to add 0.25.

grouped[grouped.beer != 'Heineken']['flavor_score']['mean'] - 0.5
grouped[grouped.beer == 'Heineken']['flavor_score']['mean'] + 0.25
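
Note that chained indexing like the above returns adjusted copies at best; to modify the aggregated frame itself, something along these lines should work (my sketch, accounting for beer living in the row index after the groupby):

grouped = df.groupby(['beer', 'style']).agg(['mean', 'std'])
mask = grouped.index.get_level_values('beer') == 'Heineken'
grouped.loc[~mask, ('flavor_score', 'mean')] -= 0.5
grouped.loc[mask, ('flavor_score', 'mean')] += 0.25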
