Need to do eval on pandas dataframe at row level - python

I have a scenario where my pandas DataFrame has a condition stored as a string, which I need to execute and store the result in a different column. The example below will help you understand better.
Existing DataFrame:
ID Val Cond
1 5 >10
1 15 >10
Expected DataFrame:
ID Val Cond Result
1 5 >10 False
1 15 >10 True
As you can see, I need to concatenate Val and Cond and do the eval at row level.

If your conditions are formed from the basic comparisons (<, <=, ==, !=, >, >=), we can do this more efficiently using getattr. We use .str.extract to parse each condition, separating the comparator from the value. A dictionary maps each comparator to the name of the corresponding Series method, which we then call once per unique comparator in a simple groupby.
import pandas as pd

# Construct the sample frame (the original post only shows the printout).
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1],
                   'Val': [5, 15, 20, 25, 26, 10],
                   'Cond': ['>10', '>10', '==20', '<=25', '<=25', '!=10']})
print(df)
#   ID  Val  Cond
#0   1    5   >10
#1   1   15   >10
#2   1   20  ==20
#3   1   25  <=25
#4   1   26  <=25
#5   1   10  !=10

# Map each comparison operator to the corresponding Series method name.
d = {'>': 'gt', '<': 'lt', '>=': 'ge', '<=': 'le', '==': 'eq', '!=': 'ne'}

# Create a DataFrame with the LHS value, the comparator ('cond'), and the RHS value ('comp').
tmp = pd.concat([df['Val'],
                 df['Cond'].str.extract(r'(.*?)(\d+)').rename(columns={0: 'cond', 1: 'comp'})],
                axis=1)
tmp[['Val', 'comp']] = tmp[['Val', 'comp']].apply(pd.to_numeric)
# Val cond comp
#0 5 > 10
#1 15 > 10
#2 20 == 20
#3 25 <= 25
#4 26 <= 25
#5 10 != 10
# Call the mapped method once per comparator; pd.concat aligns on the row index.
df['Result'] = pd.concat([getattr(gp['Val'], d[idx])(gp['comp'])
                          for idx, gp in tmp.groupby('cond')])
# ID Val Cond Result
#0 1 5 >10 False
#1 1 15 >10 True
#2 1 20 ==20 True
#3 1 25 <=25 True
#4 1 26 <=25 False
#5 1 10 !=10 False
A simple but inefficient and dangerous alternative is to eval each row, building a string of your condition. eval is dangerous because it can execute arbitrary code, so only use it if you truly trust and know the data.
df['Result'] = df.apply(lambda x: eval(str(x.Val) + x.Cond), axis=1)
# ID Val Cond Result
#0 1 5 >10 False
#1 1 15 >10 True
#2 1 20 ==20 True
#3 1 25 <=25 True
#4 1 26 <=25 False
#5 1 10 !=10 False
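If you want to avoid eval entirely without the groupby machinery above, here is a minimal sketch using the standard operator module; the helper check is hypothetical and assumes only the six comparators with non-negative numeric thresholds:
import operator
import pandas as pd

df = pd.DataFrame({'ID': [1, 1], 'Val': [5, 15], 'Cond': ['>10', '>10']})

# Map comparator strings to the corresponding operator functions.
ops = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def check(val, cond):
    # Hypothetical helper: split '>10' into the comparator '>' and the threshold 10.
    comp = cond.rstrip('0123456789.')
    return ops[comp](val, float(cond[len(comp):]))

df['Result'] = [check(v, c) for v, c in zip(df['Val'], df['Cond'])]
print(df)
#    ID  Val Cond  Result
# 0   1    5  >10   False
# 1   1   15  >10    True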

You can also do something like this:
df["Result"] = [eval(x + y) for x, y in zip(df["Val"].astype(str), df["Cond"]]
Make the "Result" column by concatenating the strings df["Val"] and df["Cond"], then applying eval to that.

Create a column counting the consecutive True values on a multi-index

Let df be a DataFrame of boolean values with a two-level index. For every id, I want to count consecutive True values, resetting the count at each False. For example, this is how it would look in this specific case.
         value  consecutive
id Week
1  1      True            1
1  2      True            2
1  3     False            0
1  4      True            1
1  5      True            2
2  1     False            0
2  2     False            0
2  3      True            1
This is my solution:
def func(id, week):
    M = df.loc[id][:week+1]
    consecutive_list = []
    S = 0
    for index, row in M.iterrows():
        if row['value']:
            S += 1
        else:
            S = 0
        consecutive_list.append(S)
    return consecutive_list[-1]
Then we generate the "consecutive" column as a list in the following way:
Consecutive_list = []
for k in df.index:
    id = k[0]
    week = k[1]
    Consecutive_list.append(func(id, week))
df['consecutive'] = Consecutive_list
I would like to know if there is a more Pythonic way to do this.
EDIT: I wrote the "consecutive" column in order to show what I expect this to be.
If you are trying to add the consecutive column to the df, this should work:
df.assign(consecutive=df['value'].groupby(df['value'].ne(df['value'].shift()).cumsum()).cumsum())
Output:
value consecutive
1 a True 1
b True 2
2 a False 0
b True 1
3 a True 2
b False 0
4 a False 0
b True 1
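Note the one-liner above counts runs globally; if the count must also reset at each id, as in the question's expected output, one variant is to include the id level in the grouping key. A minimal sketch, assuming the index levels are named id and Week as in the question:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
    names=['id', 'Week'])
df = pd.DataFrame({'value': [True, True, False, True, True,
                             False, False, True]}, index=idx)

# A new run starts whenever the value changes; grouping by (id, run)
# additionally restarts the count at each id.
runs = df['value'].ne(df['value'].shift()).cumsum()
ids = df.index.get_level_values('id')
df['consecutive'] = df['value'].astype(int).groupby([ids, runs]).cumsum()
print(df)
#          value  consecutive
# id Week
# 1  1      True            1
#    2      True            2
#    3     False            0
#    4      True            1
#    5      True            2
# 2  1     False            0
#    2     False            0
#    3      True            1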

pandas - remove specific sequence from column

I want to remove specific sequences from my column, because they appear a lot and don't give me much extra information. The database consists of edges between nodes; in this case, there will be an edge between node 1 and node 1, node 1 and node 2, node 2 and node 3, and so on.
However, the edge 1-5 occurs around 80,000 times in the real database. I want to filter those out, keeping only the 'not so common' interactions.
Let's say my dataframe looks like this:
>>> datatry
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 1 23
5 5 22
6 3 144
7 5 33
What I have so far removes a value that only repeats itself (here, consecutive 1s), keeping the first row:
c1 = datatry['num'].eq(1)
c2 = datatry['num'].eq(datatry['num'].shift(1))
datatry2 = datatry[(c1 & ~c2) | ~c1]
How could I alter the code above, which removes all rows that repeat the integer 1 and keeps only the first row with the value 1, into code that removes all rows matching a specific sequence, for example a 1 followed by a 5? In that case, I want to remove both the row with value 1 and the row with value 5 that appear in that sequence. My end result would ideally be:
>>> datatry
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 3 144
5 5 33
Here is one way:
import numpy as np
import pandas as pd

def find_drops(seq, df):
    if seq:
        # True where the sequence starts: row i matches seq[0], row i+1 matches seq[1], ...
        m = np.logical_and.reduce([df.num.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            # Extend each True flag over the remaining rows of the matched sequence.
            return (pd.Series(m, index=df.index)
                      .replace({False: np.nan})
                      .ffill(limit=len(seq) - 1)
                      .fillna(False)
                      .astype(bool))
    else:
        return pd.Series(False, index=df.index)
find_drops([1], df)
#0 True
#1 True
#2 False
#3 False
#4 True
#5 False
#6 False
#7 False
#dtype: bool
find_drops([1,1,2,3], df)
#0 True
#1 True
#2 True
#3 True
#4 False
#5 False
#6 False
#7 False
#dtype: bool
Then just use those Series to slice: df[~find_drops([1, 5], df)]
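For the example frame above, the end-to-end result (index reset to match the expected output):
datatry = pd.DataFrame({'num': [1, 1, 2, 3, 1, 5, 3, 5],
                        'line': [56, 90, 66, 4, 23, 22, 144, 33]})

# Drop every 1-followed-by-5 pair, keep everything else.
result = datatry[~find_drops([1, 5], datatry)].reset_index(drop=True)
print(result)
#    num  line
# 0    1    56
# 1    1    90
# 2    2    66
# 3    3     4
# 4    3   144
# 5    5    33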
Did you look at duplicated? It defaults to keep='first', so to keep only the first occurrence of each value you can simply negate it:
datatry.loc[~datatry['num'].duplicated(), :]

How to get boolean index array for pandas DataFrame where obs != last_obs?

I have a DataFrame with a column we can call 'X'.
Something like print(df.X) would yield integers somewhere between -10,000 and 10,000. For example:
ID X
1 0
2 0
3 1
4 1
...
7334 -19
7335 -19
7336 -20
7337 -20
>>>
For the example above, I'd like a boolean index I can use to subset the DataFrame, where row 3 is True and row 7336 is True, because the value in X changes from the previous observation. All others should be False.
You can check equality of your series with a shifted version of itself via pd.Series.shift.
Note the first item in the series must be manually set to False, if this is a requirement.
df['change'] = df['X'] != df['X'].shift()
df.loc[0, 'change'] = False  # avoids chained assignment
print(df)
ID X change
0 1 0 False
1 2 0 False
2 3 1 True
3 4 1 False
4 7334 -19 True
5 7335 -19 False
6 7336 -20 True
7 7337 -20 False
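Continuing from the frame above, the boolean column subsets the changed rows directly:
print(df[df['change']])
#      ID   X  change
# 2     3   1    True
# 4  7334 -19    True
# 6  7336 -20    True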

Pandas DataFrame use previous row value for complicated 'if' conditions to determine current value

I want to know if there is a faster way to do the following loop, perhaps using apply or a rolling apply.
Basically, I need to access the previous row's value to determine the current cell value.
# Note: `So` is a threshold defined elsewhere; `.ix` is from older pandas versions.
df.ix[0] = (np.abs(df.ix[0]) >= So) * np.sign(df.ix[0])
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if (df[col].ix[i] > 1.25) & (df[col].ix[i-1] == 0):
            df[col].ix[i] = 1
        elif (df[col].ix[i] < -1.25) & (df[col].ix[i-1] == 0):
            df[col].ix[i] = -1
        elif ((df[col].ix[i] <= -0.75) & (df[col].ix[i-1] < 0)) | ((df[col].ix[i] >= 0.5) & (df[col].ix[i-1] > 0)):
            df[col].ix[i] = df[col].ix[i-1]
        else:
            df[col].ix[i] = 0
As you can see, the function updates the DataFrame as it goes, and I need to access the most recently updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values.
Previous value of the col column:
df['col'].shift()
Next value of the col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    # Closure holding the previous row's values between apply calls.
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift; I think you can even compare DataFrames directly, something like this:
df_prev = df.shift()
df_out = pd.DataFrame(index=df.index, columns=df.columns)
df_out[(df > 1.25) & (df_prev == 0)] = 1
df_out[(df < -1.25) & (df_prev == 0)] = -1
df_out[(df < -0.75) & (df_prev < 0)] = df_prev
df_out[(df > 0.5) & (df_prev > 0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DataFrame itself. You're better off going column by column, dumping the values to a list, doing the updates there, and then writing the list back. Something like this:
df.ix[0] = (np.abs(df.ix[0]) >= 1.25) * np.sign(df.ix[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1, len(currData)):
        if currData[currRow] > 1.25 and currData[currRow-1] == 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1] == 0:
            currData[currRow] = -1
        elif currData[currRow] <= -0.75 and currData[currRow-1] < 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow] >= 0.5 and currData[currRow-1] > 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData

Making new column in pandas DataFrame based on filter

Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter", that has the value "pass" if the values in columns a and b are both greater than some threshold x, and "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all along axis 1 of this DataFrame gives the values for the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
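If you want the 'pass'/'fail' labels in a single step, here is a minimal sketch using numpy.where, equivalent to the mask-and-replace approach above:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 10, 20, 3, 10],
                   "b": [50, 60, 55, 0, 0],
                   "c": [1, 30, 1, 0, 0]})

# np.where picks 'pass' where the mask holds and 'fail' elsewhere.
df["filter"] = np.where((df["a"] >= 20) & (df["b"] >= 20), "pass", "fail")
print(df["filter"].tolist())
# ['fail', 'fail', 'pass', 'fail', 'fail']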
