I want to apply a lambda function to a DataFrame column using if...elif...else within the lambda function.
The df and the code are something like:
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10]})
df["one"].apply(lambda x: x*10 if x<2 elif x<4 x**2 else x+10)
Obviously, this doesn't work. Is there a way to use if...elif...else in a lambda? How can I get the same result with a list comprehension?
Nest the if...else expressions:
lambda x: x*10 if x<2 else (x**2 if x<4 else x+10)
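For example, applied to the frame from the question (a quick sketch to confirm it reproduces the intended if/elif/else):

import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5], "two": [6, 7, 8, 9, 10]})
# The parenthesised part plays the role of the elif/else branches
df["three"] = df["one"].apply(lambda x: x * 10 if x < 2 else (x ** 2 if x < 4 else x + 10))
print(df["three"].tolist())  # [10, 4, 9, 14, 15]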
I do not recommend the use of apply here: it should be avoided if there are better alternatives.
For example, if you are performing the following operation on a Series:
if cond1:
    exp1
elif cond2:
    exp2
else:
    exp3
This is usually a good use case for np.where or np.select.
numpy.where
The if else chain above can be written using
np.where(cond1, exp1, np.where(cond2, exp2, ...))
np.where allows nesting. With one level of nesting, your problem can be solved with,
df['three'] = np.where(
    df['one'] < 2,
    df['one'] * 10,
    np.where(df['one'] < 4, df['one'] ** 2, df['one'] + 10))
df
   one  two  three
0    1    6     10
1    2    7      4
2    3    8      9
3    4    9     14
4    5   10     15
numpy.select
Allows for flexible syntax and is easily extensible. It follows the form,
np.select([cond1, cond2, ...], [exp1, exp2, ...])
Or, in this case,
np.select([cond1, cond2], [exp1, exp2], default=exp3)
df['three'] = np.select(
    condlist=[df['one'] < 2, df['one'] < 4],
    choicelist=[df['one'] * 10, df['one'] ** 2],
    default=df['one'] + 10)
df
   one  two  three
0    1    6     10
1    2    7      4
2    3    8      9
3    4    9     14
4    5   10     15
and/or (similar to the if/else)
This still requires apply with a lambda:
df['three'] = df['one'].apply(
    lambda x: (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10)
df
   one  two  three
0    1    6     10
1    2    7      4
2    3    8      9
3    4    9     14
4    5   10     15
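One caveat worth adding (my note, not part of the original answer): the and/or trick silently breaks whenever a chosen branch evaluates to something falsy, such as 0:

f = lambda x: (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10
f(0)  # returns 10, but the if/elif/else version returns 0 * 10 == 0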
List Comprehension
Loopy solution that is still faster than apply.
df['three'] = [x*10 if x<2 else (x**2 if x<4 else x+10) for x in df['one']]
# df['three'] = [
#     (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10 for x in df['one']
# ]
df
   one  two  three
0    1    6     10
1    2    7      4
2    3    8      9
3    4    9     14
4    5   10     15
For readability I prefer to write a function, especially if you are dealing with many conditions. For the original question:
def parse_values(x):
    if x < 2:
        return x * 10
    elif x < 4:
        return x ** 2
    else:
        return x + 10
df['one'].apply(parse_values)
You can do it using multiple .loc assignments. Here a newly created column labelled 'new' gets the conditional calculation; note that the conditions must not overlap, otherwise later assignments overwrite earlier ones:
df.loc[df['one'] < 2, 'new'] = df['one'] * 10
df.loc[(df['one'] >= 2) & (df['one'] < 4), 'new'] = df['one'] ** 2
df.loc[df['one'] >= 4, 'new'] = df['one'] + 10
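A quick check of the result (values derived from the sample frame above):

df[['one', 'new']]
#    one   new
# 0    1  10.0
# 1    2   4.0
# 2    3   9.0
# 3    4  14.0
# 4    5  15.0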
Related
I'm trying to make a kind of block average: out of 90 rows, every 3 values in column A should be averaged, and that average should be written to those same rows in column B.
For example:
From this:
   A  B
   2  0
   3  0
   4  0
   7  0
   9  0
   8  0
to this:
   A  B
   2  3
   3  3
   4  3
   7  8
   9  8
   8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but column B never changes. I know there is a better way to do this, and if anyone knows one it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
   A  B
0  2  3
1  3  3
2  4  3
3  7  8
4  9  8
5  8  8
Note that this relies on the index being of the form 0, 1, 2, 3, 4, .... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead, i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
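As for why the original loop prints y but never changes B: df['B'][x] = y is chained indexing, which may assign to a temporary copy of the column rather than to the DataFrame itself (pandas flags this with SettingWithCopyWarning). Writing through a single .loc indexer fixes the loop; a minimal sketch:

x = 0
while x + 2 < len(df):
    y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
    df.loc[x:x + 2, 'B'] = y  # one indexing step, so it writes to df itself
    x += 3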
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})

i = 0
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
In the corner case where len(df) % batch_size != 0, this takes the average of just the leftover rows.
I have huge data with a lot of duplicates, so I want to remove every value that appears fewer than 5 times according to value_counts().
If you want to remove values from the counts Series, use boolean indexing:
y = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'] * 3 + ['d'] * 7)
s = y.value_counts()
out = s[s > 4]
print(out)
d 7
a 5
dtype: int64
If you want to remove values from the original Series, use Series.isin:
y1 = y[y.isin(out.index)]
print(y1)
0 a
1 a
2 a
3 a
4 a
10 d
11 d
12 d
13 d
14 d
15 d
16 d
dtype: object
Thank you Mr. jezrael, your answer is very helpful. I will add a small tip: after gathering the kept values, this is how you can replace the rest:
s = y.value_counts()
x = s[s > 4]
for z in y.unique():
    if z not in x.index:
        y = y.replace([z], 'Other')
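For large Series, a vectorized version of the same tip avoids the Python loop entirely; a sketch using Series.where, which keeps values where the mask is True and fills the rest:

y = y.where(y.isin(x.index), 'Other')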
Let's say I have a dataframe with several columns, and I want to select rows that match all 4 conditions. I would write:
condition = (df['A'] < 10) & (df['B'] < 10) & (df['C'] < 10) & (df['D'] < 10)
df.loc[condition]
Conversely, if I want to select rows that match any of the 4 conditions, I would write:
condition = (df['A'] < 10) | (df['B'] < 10) | (df['C'] < 10) | (df['D'] < 10)
df.loc[condition]
Now, what if I want to select rows that match any two of those 4 conditions? That would be rows matching any combination of (A and B), (A and C), (A and D), (B and C), (B and D) or (C and D). Obviously, I can write out a complex condition with all those combinations:
condition = ((df['A'] < 10) & (df['B'] < 10)) | \
            ((df['A'] < 10) & (df['C'] < 10)) | \
            ((df['A'] < 10) & (df['D'] < 10)) | \
            ((df['B'] < 10) & (df['C'] < 10)) | \
            ((df['B'] < 10) & (df['D'] < 10)) | \
            ((df['C'] < 10) & (df['D'] < 10))
df.loc[condition]
But if there are 50 columns and I want to match any 20 of those 50 conditions, it becomes impossible to list every combination in the condition. Is there a better way to do this?
Since True == 1 and False == 0, you can find rows that satisfy at least N conditions by checking the sum. Series have most of the basic comparisons as methods, so you can build a single condition list with a variety of checks and then use getattr to keep it tidy.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, (5,4)), columns=list('ABCD'))
# can check `eq`, `lt`, `le`, `gt`, `ge`, `isin`
cond_list = [('A', 'lt', 10), ('B', 'ge', 10), ('D', 'eq', 4),
             ('C', 'isin', [2, 4, 6])]

df_c = pd.concat([getattr(df[col], attr)(val).astype(int)
                  for col, attr, val in cond_list], axis=1)
# A B D C
#0 0 0 0 1
#1 0 1 0 0
#2 1 1 0 0
#3 1 1 0 0
#4 0 1 0 1
# Each row satisfies this many conditions
df_c.sum(axis=1)
#0 1
#1 1
#2 2
#3 2
#4 2
#dtype: int64
# Select those that satisfy at least 2
df[df_c.sum(axis=1).ge(2)]
# A B C D
#2 0 17 15 9
#3 0 14 0 15
#4 19 14 4 0
If you need some more complicated comparisons that aren't expressible with getattr, you can write them out yourself and concat that list of Series:
df_c = pd.concat([df['A'].lt(10), df['B'].ge(10), df['D'].eq(4),
                  df['C'].isin([2, 4, 6])],
                 axis=1).astype(int)
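Either way, this generalizes into a small reusable helper (at_least is a hypothetical name, my sketch):

def at_least(df, conditions, n):
    # Rows of df where at least n of the boolean Series in `conditions` are True
    return df[pd.concat(conditions, axis=1).sum(axis=1).ge(n)]

# e.g. rows meeting at least 2 of the question's four conditions
at_least(df, [df[c] < 10 for c in 'ABCD'], 2)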
Here's a method using itertools.combinations to generate every pair of conditions. A row satisfies at least two of the conditions exactly when at least one pair condition1 & condition2 is entirely True, so we build a mask from the pairs:
# test dataframe
np.random.seed(10)
df = pd.DataFrame(np.random.randint(20, size=(10,5)), columns=list('ABCDE'))
print(df)
    A   B   C   D   E
0   9   4  15   0  17
1  16  17   8   9   0
2  10   8   4  19  16
3   4  15  11  11   1
4   8   4  14  17  19
5  13   5  13  19  13
6  12   1   4  18  13
7  11  10   9  15  18
8  16   7  11  17  14
9   7  11   1   0  12
from itertools import combinations

conditions = [(df['A'] < 10), (df['B'] < 15), (df['C'] >= 5), (df['D'] <= 9)]
mask = pd.concat([x & y for x, y in combinations(conditions, 2)], axis=1).sum(axis=1).ge(1)
df[mask]
    A   B   C   D   E
0   9   4  15   0  17
1  16  17   8   9   0
3   4  15  11  11   1
4   8   4  14  17  19
5  13   5  13  19  13
7  11  10   9  15  18
8  16   7  11  17  14
9   7  11   1   0  12
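Note (my addition): at least one pair is entirely True exactly when at least two individual conditions hold, so the same mask can also be computed without building the pairs at all:

mask = pd.concat(conditions, axis=1).sum(axis=1).ge(2)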
I have a dataframe; here is the minimum code to run it:
df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)
I am currently performing some row by row calculations by doing the following:
def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length
    if row_num >= length:
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out
    df.at[row_num, 'out'] = out

df.apply(transform, axis=1)
This yields the correct result:
   A  B    out
0  1  5    0.5
1  2  2    1.0
2  3  3   11.0
3  4  1  -39.0
4  5  5  181.0
The breakdown of the correct calculation is as follows:

   A  B    out
0  1  5    0.5    out = A / length = 1 / 2
1  2  2    1.0    out = A / length = 2 / 2

and once row_num >= length:

2  3  3   11.0    out = (B - previous_out) * weight + previous_out = (3 - 1) * 5 + 1 = 11
3  4  1  -39.0    out = (1 - 11) * 5 + 11 = -39
4  5  5  181.0    out = (5 - (-39)) * 5 + (-39) = 181
Executing this across many columns and looping is slow so I would like to optimize taking advantage of some kind of vectorization if possible.
My current attempt looks like this:
df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()
This is not working and I'm not quite sure where to go from here.
You won't be able to do better than this:
df['out'] = df.A / length
for i in range(len(df)):
    if i >= length:
        df.loc[i, 'out'] = (df.loc[i, 'B'] - df.loc[i - 1, 'out']) * weight \
                           + df.loc[i - 1, 'out']
The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). You can't do a calculation where every result depends on the previous ones in a matrix - there will always be some kind of loop going on behind the scenes.
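That said, if the Python-level loop itself becomes the bottleneck, one common workaround (my addition, assuming numba is available) is to compile the same sequential logic rather than vectorize it:

import numpy as np
from numba import njit

@njit
def recurse(a, b, length, weight):
    # Same recurrence as the loop above, compiled to machine code
    out = a / length
    for i in range(length, len(a)):
        out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]
    return out

df['out'] = recurse(df['A'].to_numpy(np.float64), df['B'].to_numpy(np.float64), 2, 5)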
I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
    val
0     4
1     5
2     7
3     9
4     1
5     1
6     8
7     7
8     3
9     9
10    7
11    2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i - 1] + v.val * decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
   val  loop_decay
0    4         4.0
1    5         6.5
2    7        10.0
3    9        14.5
4    1        15.0
Consider a vectorized version with cumsum(), where you cumulatively sum val * decay on top of the very first val. You then need to subtract the very first val * decay, since cumsum() includes it (.loc replaces the deprecated .ix indexer here):

test['loop_decay'] = test.loc[0, 'val'] + (test['val'] * decay).cumsum() - test.loc[0, 'val'] * decay
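A quick sanity check against the loop, using the first five sample values printed above (my sketch, algebraically equivalent to the line above):

decay = 0.5
t = pd.DataFrame({'val': [4, 5, 7, 9, 1]})
t['vec_decay'] = t.loc[0, 'val'] * (1 - decay) + (t['val'] * decay).cumsum()
print(t['vec_decay'].tolist())  # [4.0, 6.5, 10.0, 14.5, 15.0], matching loop_decay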
You can utilize pd.Series.shift() to create a column holding the previous row's val and then apply your function across a single axis (axis=1 in this case):

# Create a column that shifts the rows by 1
test['val2'] = test['val'].shift()
# Set the first row of the shifted column to 0 (.loc replaces the deprecated .ix)
test.loc[0, 'val2'] = 0
# Apply the one-step decay formula using the previous row's val
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
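Be aware, though (my note): shift() exposes the previous row's raw val, not the previously computed loop_decay, so on the sample values this yields 4.0, 7.0, 9.5, 12.5, 5.5 and agrees with the loop only at row 0. For the full recurrence, use the cumsum identity from the previous answer.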