Evaluating an Expression using data frames - python

I have a df:
Patient  ID
A        72
A        SD75
A        74
A        74
B        71
C        72
And I have an expression:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
Now I need to evaluate this expression for each of the three patients A, B, C, substituting 1 for each ID that appears for that patient in the df and 0 otherwise. A has matches for IDs 72, SD75 and 74, so the expressions should be
A - '(((0+1)*((0+0)+1))*((1+0)+0))'
B - '(((1+0)*((0+0)+0))*((0+0)+0))'
C - '(((0+1)*((0+0)+0))*((0+0)+0))'
And my final df_output should look like this:
Patient  FinalVal
A        1
B        0
C        0
The FinalVal can be obtained by eval(exp) after replacing the IDs with 1s and 0s.
So far this is where I have reached. When I replace the ID 75 with 0, the 75 inside SD75 also gets replaced, so SD75 becomes SD0, and that's where I am stuck:
import pandas as pd
import re

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()

distinct_pt = df.Patient.drop_duplicates().dropna()
df_output = pd.DataFrame(distinct_pt)
df_output['Exp'] = exp

for index, row in df_output.iterrows():
    new_df = df[df.Patient == row['Patient']]
    new_dfl = new_df['ID'].tolist()
    #print(new_dfl)
    for j in mylist:
        if j in new_dfl:
            #print(j)
            row['Exp'] = row['Exp'].replace(j, '1')
        else:
            row['Exp'] = row['Exp'].replace(j, '0')

We can create an indicator DataFrame using Series.str.get_dummies, which creates an indicator column for each value in the ID column, then reduce to a single row per Patient via groupby max:
# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
indicator_df:
         71  72  74  SD75
Patient
A         0   1   1     1
B         1   0   0     0
C         0   1   0     0
Now we can reindex from the expression terms to create missing columns. np.unique is used to ensure that duplicate terms in the expression do not result in duplicate columns in indicator_df (this can be omitted if it is guaranteed there are no duplicate terms):
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()

# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure all expression terms are present
indicator_df = indicator_df.reindex(
    columns=np.unique(cols),  # prevent duplicate cols
    fill_value=0              # added cols are filled with 0
)
indicator_df:
         71  72  73  74  75  76  SD75  SD76
Patient
A         0   1   0   1   0   0     1     0
B         1   0   0   0   0   0     0     0
C         0   1   0   0   0   0     0     0
Now if we alter the exp slightly by surrounding these new column names with backticks (`), we can use DataFrame.eval to compute the expression:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()

# Create indicator_df (chained)
indicator_df = (
    df.set_index('Patient')['ID']
      .str.get_dummies()
      .groupby(level=0).max()
      .reindex(columns=np.unique(cols), fill_value=0)
)

# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
result:
  Patient  FinalVal
0       A         1
1       B         0
2       C         0
The backticks are necessary to indicate these values represent column names, and not numeric values:
re.sub(r'(\w+)', r'`\1`', exp)
# (((`71`+`72`)*((`73`+`75`)+`SD75`))*((`74`+`76`)+`SD76`))
Notice the difference between 71 with backticks vs without:
# Column '71' + the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71` + 71')

   71   B
0   1  72
1   2  73
2   3  74
Alternatively, the indicator_df can be created with a crosstab and clip:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()

indicator_df = (
    pd.crosstab(df['Patient'], df['ID'])
      .clip(upper=1)  # restrict upper bound to 1
      .reindex(columns=np.unique(cols), fill_value=0)
)

# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
Setup and imports used:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})

I would not try to parse that expression and evaluate it. Instead, I would create dummy or indicator variables for the ID column. (Indicator variables are also called one-hot encoded variables.) With these indicators, you can then calculate your expression using a standard function.
Here's how to do it with Pandas and scikit-learn. I am using scikit-learn's OneHotEncoder. An alternative might be Pandas' get_dummies(), but the OneHotEncoder allows you to specify the categories.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

variables = [71, 72, 73, 74, 75, 76, "SD75", "SD76"]
enc = OneHotEncoder(categories=[variables], sparse=False)  # on scikit-learn >= 1.2 use sparse_output=False

df = pd.DataFrame({
    "Patient": ["A"] * 4 + ["B", "C"],
    "ID": [72, "SD75", 74, 74, 71, 72]
})

# Create one-hot encoded variables, also called dummy or indicator variables
df_one_hot = pd.DataFrame(
    enc.fit_transform(df[["ID"]]),
    columns=variables,
    index=df.Patient
)

# Aggregate the one-hot variables so there's one row per patient.
# You may need to alter the aggregation function. I chose max because it
# matched your example, but perhaps sum might be better (e.g. patient A has
# two entries for 74 - should that be a value of 2 for variable 74?)
one_hot_patient = df_one_hot.groupby(level="Patient").agg(max)

# Finally, evaluate your expression.
# Create a function to calculate the output given a data frame
def my_expr(DF):
    out = (DF[71] + DF[72]) \
        * (DF[73] + DF[75] + DF["SD75"]) \
        * (DF[74] + DF[76] + DF["SD76"])
    return out

output = one_hot_patient.assign(FinalVal=my_expr)
Result
          71   72   73   74   75   76  SD75  SD76  FinalVal
Patient
A        0.0  1.0  0.0  1.0  0.0  0.0   1.0   0.0       1.0
B        1.0  0.0  0.0  0.0  0.0  0.0   0.0   0.0       0.0
C        0.0  1.0  0.0  0.0  0.0  0.0   0.0   0.0       0.0
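If you only need the two-column Patient / FinalVal frame from the question, one small follow-up (a sketch reusing the output frame above, not part of the original answer) could be:
# Sketch: reduce the one-hot result to the Patient / FinalVal frame
df_output = output.reset_index()[["Patient", "FinalVal"]]
print(df_output)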

Using sub instead of replace should work, since the \b word boundary stops a bare 75 from matching the 75 inside SD75:
for j in mylist:
    if j in new_dfl:
        exp = re.sub(r'\b{}'.format(j), '1', exp)
    else:
        exp = re.sub(r'\b{}'.format(j), '0', exp)
Another way that would work for this exact scenario is to sort mylist in descending order so the items preceded by SD are iterated before the others.
mylist = re.sub(r'[^\w]', ' ', exp).split()
mylist.sort(reverse=True)
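For completeness, a minimal sketch combining the descending sort with plain str.replace and eval (it assumes the small df from the question; the helper names are illustrative, not from the original code):
import re
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'

# Sort terms descending so 'SD75'/'SD76' are replaced before '75'/'76'
terms = sorted(set(re.sub(r'[^\w]', ' ', exp).split()), reverse=True)

rows = []
for patient, ids in df.groupby('Patient')['ID']:
    ids = set(ids)
    patient_exp = exp
    for term in terms:
        patient_exp = patient_exp.replace(term, '1' if term in ids else '0')
    rows.append({'Patient': patient, 'FinalVal': eval(patient_exp)})

df_output = pd.DataFrame(rows)
# Expected:
#   Patient  FinalVal
# 0       A         1
# 1       B         0
# 2       C         0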

Related

Comparing pandas DataFrames where column values are lists

I have some chemical data that I'm trying to process using Pandas. I have two dataframes:
C_atoms_all.head()
   id_all  index_all label_all species_all                    position
0     217          1         C           C    [6.609, 6.6024, 19.3301]
1     218          2         C           C  [4.8792, 11.9845, 14.6312]
2     219          3         C           C  [4.8373, 10.7563, 13.9466]
3     220          4         C           C  [4.7366, 10.9327, 12.5408]
4    6573          5         C           C  [1.9482, -3.8747, 19.6319]
C_atoms_a.head()
   id_a  index_a label_a species_a                    position
0    55        1       C         C    [6.609, 6.6024, 19.3302]
1    56        2       C         C  [4.8792, 11.9844, 14.6313]
2    57        3       C         C  [4.8372, 10.7565, 13.9467]
3    58        4       C         C  [4.7367, 10.9326, 12.5409]
4    59        5       C         C  [5.1528, 15.5976, 14.1249]
What I want to do is get a mapping of all of the id_all values to the id_a values where their position matches. You can see that for C_atoms_all.iloc[0] (id_all 217) and C_atoms_a.iloc[0] (id_a 55), the position values match (within a small fudge factor), which I should also include in the query.
The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.
I'd ideally like to return a dataframe that looks like so:
id_all id_a position
217 55 [6.609, 6.6024, 19.3301]
... ... ...
for every row where the position values match.
You can do it as below.
I named your C_atoms_all as df_all and C_atoms_a as df_a:
# First we extract the individual values of the "position" column for both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]

df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]

# Then, because the position values only match within a small fudge factor,
# we round them to three decimals.
# We use loc to modify the original dataframe, instead of a copy of it.
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]] = df_a[["val0", "val1", "val2"]].round(3)

# Then we merge on the three values extracted from the position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False,
                  right_index=False, suffixes=(None, "_y"))

# Finally we keep only the desired columns
df = df[["id_all", "id_a", "position"]]
print(df)
   id_all  id_a                    position
0     217    55    [6.609, 6.6024, 19.3301]
1     218    56  [4.8792, 11.9845, 14.6312]
2     219    57  [4.8373, 10.7563, 13.9466]
3     220    58  [4.7366, 10.9327, 12.5408]
This isn't pretty, but it might work for you:
import numpy as np

def do(x, df_a):
    try:
        return next((df_a.iloc[i]['id_a'] for i in df_a.index if df_a.iloc[i]['position'] == x))
    except StopIteration:
        return np.nan

match = pd.DataFrame(C_atoms_all[['id_all', 'position']])
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))
You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.
# Custom hash function
def hash_position(position):
    return hash(tuple(position))

# Create the hash column "hashed_position"
C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)

# Merge datasets
C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')
# ... keep the columns you need
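Note that hashing the raw floats only matches positions that are exactly equal. To respect the small fudge factor mentioned in the question, one option (a sketch building on this answer, with an assumed tolerance of three decimals) is to round before hashing:
# Sketch: round coordinates before hashing so near-equal positions collide.
# The three-decimal tolerance is an assumption; adjust as needed.
def hash_position_rounded(position, decimals=3):
    return hash(tuple(round(v, decimals) for v in position))

C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position_rounded)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position_rounded)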
Your question is not clear. It seems to me an interesting question though. For that reason I have reproduced your data in a more useful format, just in case there is someone who can help more than I can.
Data
C_atoms_all = pd.DataFrame({
    'id_all': [217, 218, 219, 220, 6573],
    'index_all': [1, 2, 3, 4, 5],
    'label_all': ['C', 'C', 'C', 'C', 'C'],
    'species_all': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312],
                 [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]]
})

C_atoms_a = pd.DataFrame({
    'id_a': [55, 56, 57, 58, 59],
    'index_a': [1, 2, 3, 4, 5],
    'label_a': ['C', 'C', 'C', 'C', 'C'],
    'species_a': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313],
                 [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]]
})
Solution
# New dataframe bringing together the position columns
df3 = C_atoms_all.set_index('index_all').join(
    C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r'
).reset_index()

# Create a temp column that gives you the comparison tolerances
df3['temp'] = df3.filter(regex='^position').apply(
    lambda x: np.round(np.array(x[0]) - np.array(x[1]), 4), axis=1
)

# Assume tolerance is where only one of the values is over 0.0
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]

   id_all  index_all label_all species_all                  position
0     217          1         C           C  [6.609, 6.6024, 19.3301]

Remove opening and closing parentheses with word in pandas

Given a data frame:
df =
                          multi
0  MULTIPOLYGON(((3 11, 2 33)))
1  MULTIPOLYGON(((4 22, 5 66)))
I was trying to remove the word 'MULTIPOLYGON' and the parentheses '(((', ')))'.
My try:
df['multi'] = df['multi'].str.replace(r"\(.*\)","")
df['multi'] = df['multi'].map(lambda x: x.lstrip('MULTIPOLYGON()').rstrip('aAbBcC'))
df.values =
array([[''],
       [''],
       ...
       [''],
       [''],
       [''],
       ['7.5857754821 44.9628409423']
Desired output:
df =
multi
3 11, 2 33
4 22, 5 6
Try this:
import pandas as pd
import re

def f(x):
    x = ' '.join(re.findall(r'[0-9, ]+', x))
    return x

def f2(x):
    x = re.findall(r'[0-9, ]+', x)
    return pd.Series(x[0].split(','))

df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
df['a'] = df['a'].apply(f)
print(df)

# or for different columns you can do
df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
#df['multi'] = df.a.str.replace('[^0-9. ]', '', regex=True)
#print(df)
list_of_cols = ['c1', 'c2']
df[list_of_cols] = df['a'].apply(f2)
del df['a']
print(df)
output:
            a
0  3 11, 2 33
1   4 22, 5 6

     c1    c2
0  3 11  2 33
1  4 22   5 6
You can also use str.replace with a regex:
# keep only digits, dots, commas and spaces
df['multi'] = df.multi.str.replace(r'[^0-9., ]', '', regex=True)
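For the sample frame from the question, a quick check of that regex (a sketch; the commented frame is what it should print) would be:
import pandas as pd

df = pd.DataFrame({'multi': ['MULTIPOLYGON(((3 11, 2 33)))',
                             'MULTIPOLYGON(((4 22, 5 66)))']})
# keep only digits, dots, commas and spaces
df['multi'] = df.multi.str.replace(r'[^0-9., ]', '', regex=True)
print(df)
#         multi
# 0  3 11, 2 33
# 1  4 22, 5 66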
You can use df.column.str in the following way.
df['a'] = df['a'].str.findall(r'[0-9.]+')
df = pd.DataFrame(df['a'].tolist())
print(df)
output:
      0      1
0  3.49  11.10
1  4.49  22.12
This will work for any number of columns. But in the end you have to name those columns.
df.columns = ['a'+str(i) for i in range(df.shape[1])]
This method will work even when some rows have a different number of numerical values, like:
df = pd.DataFrame({'a': ['MULTIPOLYGON(((3.49)))', 'MULTIPOLYGON(((4.49 22.12)))']})
                              a
0        MULTIPOLYGON(((3.49)))
1  MULTIPOLYGON(((4.49 22.12)))
So the expected output is
0 1
0 3.49 None
1 4.49 22.12
After naming the columns using,
df.columns = ['a'+str(i) for i in range(df.shape[1])]
You get,
a0 a1
0 3.49 None
1 4.49 22.12
Apply is a rather slow method in pandas since it's basically a loop that iterates over each row and applies your function. Pandas has vectorized methods; we can use str.extract here to extract your pattern:
df['multi'] = df['multi'].str.extract(r'(\d\.\d+\s\d+\.\d+)')

        multi
0  3.49 11.10
1  4.49 22.12

While loop for iterating all combinations between two values

I want to create a loop that loads all the iterations of two variables into a dataframe in separate columns. I want variable "a" to hold values between 0 and 1 in 0.1 increments, and the same for variable "b". In other words, there should be 121 iterations when complete, starting with 0 & 0 and ending with 1 & 1.
I've tried the following code
data = [['Decile 1', 10], ['Decile_2', 15], ['Decile_3', 14]]
staging_table = pd.DataFrame(data, columns=['Decile', 'Volume'])
profile_table = pd.DataFrame(columns=['Decile', 'Volume'])

a = 0
b = 0
finished = False
while not finished:
    if b != 1:
        if a != 1:
            a = a + 0.1
            staging_table['CAM1_Modifier'] = a
            staging_table['CAM2_Modifier'] = b
            profile_table = profile_table.append(staging_table)
        else:
            b = b + 0.1
    else:
        finished = True
profile_table
You can use itertools.product to get all the combinations:
import itertools
import pandas as pd

x = [i / 10 for i in range(11)]
df = pd.DataFrame(
    list(itertools.product(x, x)),
    columns=["a", "b"]
)
#        a    b
# 0    0.0  0.0
# 1    0.0  0.1
# 2    0.0  0.2
# ..   ...  ...
# 118  1.0  0.8
# 119  1.0  0.9
# 120  1.0  1.0
#
# [121 rows x 2 columns]
itertools is your friend.
from itertools import product

for a, b in product(map(lambda x: x / 10, range(11)),
                    map(lambda x: x / 10, range(11))):
    ...
range(11) gives us the integers from 0 to 10 (regrettably, range fails on floats). Then we divide those values by 10 to get your range from 0 to 1. Then we take the Cartesian product of that iterable with itself to get every combination.
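As a rough sketch of what could go in that loop body to rebuild the table from the question (the CAM1_Modifier/CAM2_Modifier names come from the question's code; pd.concat is used because DataFrame.append was removed in recent pandas):
import pandas as pd
from itertools import product

data = [['Decile 1', 10], ['Decile_2', 15], ['Decile_3', 14]]
staging_table = pd.DataFrame(data, columns=['Decile', 'Volume'])

steps = [x / 10 for x in range(11)]  # 0.0, 0.1, ..., 1.0
chunks = []
for a, b in product(steps, steps):
    chunk = staging_table.copy()
    chunk['CAM1_Modifier'] = a  # column names taken from the question
    chunk['CAM2_Modifier'] = b
    chunks.append(chunk)

profile_table = pd.concat(chunks, ignore_index=True)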

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets.
If you don't mind a NumPy-based vectorized solution, based on this post on Computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
Sample run -
In [621]: a
Out[621]:
    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  23  12
2  78  35   0  72  31
3  84  25  14  56   0
4  26  82  13  14  34

In [622]: b
Out[622]:
    A   B   C    D   E
0  45  45  98    0  24
1  24  87  52   23  12
2  65  65  32    1  65
3  65  52  32  365   3
4  65  12  12   53  65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
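corr2_coeff itself is not reproduced here; a common way to write such a row-wise Pearson correlation helper (a sketch that may differ in detail from the linked post) is:
import numpy as np

def corr2_coeff(A, B):
    # Pearson correlation between every row of A and every row of B
    A_m = A - A.mean(axis=1, keepdims=True)
    B_m = B - B.mean(axis=1, keepdims=True)
    ssA = (A_m ** 2).sum(axis=1)
    ssB = (B_m ** 2).sum(axis=1)
    return A_m.dot(B_m.T) / np.sqrt(np.outer(ssA, ssB))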
This achieves exactly what you want:
from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate through
# the columns in a (a more robust way would be to iterate through the
# intersection of the two sets of columns, in case your actual dataframes'
# columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # grab the Pearson R value from the tuple
    c.loc[col, col] = correl                  # assign it to the diagonal for that column
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
           A          B         C         D           E
A   0.713185  -0.592371 -0.970444  0.487752  -0.0740101
B  0.0306753 -0.0705457  0.488012   0.34686   -0.339427
C  -0.266264 -0.0198347  0.661107  -0.50872    0.683504
D   0.580956  -0.552312 -0.320539  0.384165   -0.624039
E  0.0165272   0.140005 -0.582389   0.12936    0.286023
I use this function, which breaks it down with NumPy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use Pandas? This seems like it can be done with NumPy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82], "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14], "E": [78, 12, 31, 0, 34]}
Y = {"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12], "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53], "E": [24, 12, 65, 3, 65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))

pandas dataframe groupby like mysql, yet into new column

df = pd.DataFrame({'A': [11, 11, 22, 22], 'mask': [0, 0, 0, 1], 'values': np.arange(10, 30, 5)})
df
    A  mask  values
0  11     0      10
1  11     0      15
2  22     0      20
3  22     1      25
Now how can I group by A, keep the column names intact, and yet put a custom function's result into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats)  # does not work
and generate:
    A  mask  values     Z
0  11     0      10  12.5
1  22     0      20    25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function that uses two columns and returns the result in a new column?
Thanks!
Edit:
To clarify more: let's say I have such a table in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID  plate_ID  coordinate_x  coordinate_y  res_value  wavelength  ignore  avg(action_value)
193           1         0             0             2.1783     560         NULL    2.090027083333334
481           2         0             0             1.7544     560         NULL    1.4695583333333333
769           3         0             0             2.0161     560         NULL    1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (but you will have to aggregate your original columns in some way; I took the first occurrence as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
    mask  values
A
11     0      10
22     0      20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
    mask  values     Z
A
11     0      10  12.5
22     0      20  20.0
Inside the function you can always use additional columns (just refer to them by name) to compute the result, as sketched below.
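For example, a minimal sketch of a function that combines two columns (the mask-weighted sum here is made up purely for illustration):
# Sketch: a custom aggregation that uses two columns per group.
# The formula (sum of 'values' where 'mask' is 0) is illustrative only.
def masked_sum(dfs):
    return ((1 - dfs['mask']) * dfs['values']).sum()

result['Z2'] = grouped.apply(masked_sum)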
