if else function in pandas dataframe [duplicate] - python

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 4 years ago.
I'm trying to apply an if condition over a dataframe, but I'm missing something (error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().)
raw_data = {'age1': [23,45,21],'age2': [10,20,50]}
df = pd.DataFrame(raw_data, columns = ['age1','age2'])
def my_fun(var1, var2, var3):
    if (df[var1] - df[var2]) > 0:
        df[var3] = df[var1] - df[var2]
    else:
        df[var3] = 0
    print(df[var3])

my_fun('age1', 'age2', 'diff')

You can use numpy.where:
def my_fun(var1, var2, var3):
    df[var3] = np.where((df[var1] - df[var2]) > 0, df[var1] - df[var2], 0)
    return df

df1 = my_fun('age1', 'age2', 'diff')
print (df1)
age1 age2 diff
0 23 10 13
1 45 20 25
2 21 50 0
The error is explained in more detail here.
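For reference, the ambiguity comes from the comparison producing a whole boolean Series rather than a single True/False; a minimal sketch (not from either answer) of what Python sees:

```python
import pandas as pd

s = pd.Series([23, 45, 21]) - pd.Series([10, 20, 50])  # element-wise: [13, 25, -29]
mask = s > 0  # boolean Series: [True, True, False]

# `if mask:` would raise "The truth value of a Series is ambiguous"
# because there is no single truth value; it has to be reduced first:
print(mask.any())  # True  - at least one element is positive
print(mask.all())  # False - not every element is positive
```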
A slower solution uses apply, which needs axis=1 to process the data by rows:
def my_fun(x, var1, var2, var3):
    print (x)
    if (x[var1] - x[var2]) > 0:
        x[var3] = x[var1] - x[var2]
    else:
        x[var3] = 0
    return x

print (df.apply(lambda x: my_fun(x, 'age1', 'age2', 'diff'), axis=1))
age1 age2 diff
0 23 10 13
1 45 20 25
2 21 50 0
It is also possible to use loc, but the data can sometimes be overwritten:
def my_fun(x, var1, var2, var3):
    print (x)
    mask = (x[var1] - x[var2]) > 0
    x.loc[mask, var3] = x[var1] - x[var2]
    x.loc[~mask, var3] = 0
    return x
print (my_fun(df, 'age1', 'age2','diff'))
age1 age2 diff
0 23 10 13.0
1 45 20 25.0
2 21 50 0.0

There is another way that uses neither np.where nor pd.Series.where. I am not saying it is better, but after trying to adapt this solution to a challenging problem today, I found the syntax for where not so intuitive. In the end I am not sure whether it would have been possible with where, but the following method lets you look at the subset before you modify it, and for me it led to a solution more quickly. It works for the OP here as well, of course.
You deliberately set a value on a slice of a dataframe, something Pandas so often warns you not to do. This answer shows the correct way to do it.
The following gives you a slice:
df.loc[df['age1'] - df['age2'] > 0]
..which looks like:
age1 age2
0 23 10
1 45 20
Add an extra column to the original dataframe for the values you want to remain after modifying the slice:
df['diff'] = 0
Now modify the slice:
df.loc[df['age1'] - df['age2'] > 0, 'diff'] = df['age1'] - df['age2']
..and the result:
age1 age2 diff
0 23 10 13
1 45 20 25
2 21 50 0

You can use pandas.Series.where
df.assign(age3=(df.age1 - df.age2).where(df.age1 > df.age2, 0))
age1 age2 age3
0 23 10 13
1 45 20 25
2 21 50 0
You can wrap this in a function
def my_fun(v1, v2):
    return v1.sub(v2).where(v1 > v2, 0)
df.assign(age3=my_fun(df.age1, df.age2))
age1 age2 age3
0 23 10 13
1 45 20 25
2 21 50 0

Related

Setting values in pandas df using location

I have two dataframes as such:
df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want to have a new dataframe df_final where df_pos notates the position and df_value the corresponding value.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, I have very large dataframes that would be way too slow. Therefore I want to see whether there is any smarter way to do it.
We can also try np.put_along_axis to place df_value into df_final based on the df_pos:
df_final = df_value.copy()
np.put_along_axis(
    arr=df_final.values,     # destination array
    indices=df_pos.values,   # indices
    values=df_value.values,  # source values
    axis=1                   # along axis
)
The arguments do not need to be kwargs; they can be positional:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
df_final.values[np.arange(len(df_pos))[:,None], df_pos.values] = df_value.values
df_final
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
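A variant of the same advanced-indexing idea (a sketch, not from the answer) that scatters into a fresh array instead of writing through .values, which sidesteps view-versus-copy behavior in newer pandas versions:

```python
import numpy as np
import pandas as pd

df_pos = pd.DataFrame([[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]])
df_value = pd.DataFrame([np.arange(10 + i, 50 + i, 5) for i in range(2)])

# scatter each row's values to the column positions given by df_pos
out = np.empty_like(df_value.to_numpy())
out[np.arange(len(df_pos))[:, None], df_pos.to_numpy()] = df_value.to_numpy()
df_final = pd.DataFrame(out, index=df_value.index, columns=df_value.columns)
print(df_final)
```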

replicating data in same dataFrame

I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows while looping through the dataframe whenever there is a difference greater than 4 in row.hour.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows when, iterating through all the rows, there is a difference greater than 4 in row.hour.
row.hour[0] = 1 and row.hour[1] = 2: here the difference is 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the whole dataframe with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it inside an if statement.
I tried the code below but nothing happens:
for i in range(0, len(df) - 1):
    if (df.iloc[i, 0] - df.iloc[i+1, 0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d) - 2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4-10 instead of 10-4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you had used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would have seen it.
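Both answers loop over the rows; for completeness, the same duplication rule can be written as a single vectorized mask with diff and shift. This is a sketch that reproduces the expected output for this sample, not code from either answer:

```python
import pandas as pd

df = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                   'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})

# a row is duplicated when the gap between the two FOLLOWING hours exceeds 4,
# matching the accepted answer's abs(Hour[x+1] - Hour[x+2]) > 4 rule
mask = df['Hour'].diff().shift(-2).abs() > 4
out = pd.concat([df, df[mask]], ignore_index=True)
print(out)
```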

How to multiply a specific row in pandas dataframe by a condition

I have a column of 10th marks, but some rows are not scaled properly, i.e. they are out of 10. I want to create a function that detects which values are <= 10 and then multiplies them to scale them to 100. I tried creating a function but it failed.
Following is the Column:
data['10th']
0 0
1 0
2 0
3 10.00
4 0
...
2163 0
2164 0
2165 0
2166 76.50
2167 64.60
Name: 10th, Length: 2168, dtype: object
I am not sure what you mean by "multiply to 100", but you should be able to use apply with a lambda, similar to this:
df = pd.DataFrame({"a": [1, 3, 5, 23, 76, 43 ,12, 3 ,5]})
df['a'] = df['a'].apply(lambda x: x*100 if x < 10 else x)
print(df)
     a
0  100
1  300
2  500
3   23
4   76
5   43
6   12
7  300
8  500
If I did not understand you correctly, you can replace the action and condition in the lambda function to suit your purpose.
It looks like you need to change the data type first: data["10th"] = pd.to_numeric(data["10th"])
I assume you want to multiply by 10, not 100, to scale it with the other out-of-100 scores. You can try this: np.where(data["10th"]<10, data["10th"]*10, data["10th"])
and assign it back to the dataframe using data["10th"] = np.where(data["10th"]<10, data["10th"]*10, data["10th"])
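Putting the two steps together, to_numeric followed by np.where, as one runnable sketch on made-up sample values (the question's full column is not reproduced here):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'10th': ['0', '7.6', '76.50', '64.60']})  # object dtype, as in the question
data['10th'] = pd.to_numeric(data['10th'])                     # strings -> numbers
data['10th'] = np.where(data['10th'] < 10, data['10th'] * 10, data['10th'])
print(data['10th'].tolist())  # [0.0, 76.0, 76.5, 64.6]
```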

Alternatives to Dataframe.iterrows() or Dataframe.itertuples()?

My understanding of a Pandas dataframe vectorization (through Pandas vectorization itself or through Numpy) is applying a function to an array, similar to .apply() (Please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'yellow', 'orange', 'green',
                             'white', 'black', 'brown', 'orange-red', 'teal',
                             'beige', 'mauve', 'cyan', 'goldenrod', 'auburn',
                             'azure', 'celadon', 'lavender', 'oak', 'chocolate'],
                   'group': [1, 1, 1, 1, 1,
                             1, 1, 1, 1, 1,
                             1, 2, 2, 2, 2,
                             4, 4, 5, 6, 7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter for each unique value in 'group'. Here's my current implementation:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]
    for color, rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter += 1
            df.loc[color, 'C'] = (46 + special_counters[initialize_counter])
        else:
            spacing_counter += 1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color, 'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in the C column is very irregular, I'm not sure how I could implement this through apply or even through vectorization.
What you can do is first create the column 'C' with groupby on the column 'group' and cumcount, which would almost represent spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7 or not.
df['C'] = df.groupby('group').cumcount()
Now you need to select the appropriate rows to do the if or the else part of your code. One way is to use groupby again with transform to build a series holding the size of the group for each row. Then use loc on your df with this series: if the value is smaller than 7, map your values with special_counters; else just use modulo % 6.
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7,'C'] = df.loc[ser_size < 7,'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7,'C'] %= 6
at the end, you get as expected:
print (df)
group C
color
red 1 0
blue 1 1
yellow 1 2
orange 1 3
green 1 4
white 1 5
black 1 0
brown 1 1
orange-red 1 2
teal 1 3
beige 1 4
mauve 2 46
cyan 2 47
goldenrod 2 45
auburn 2 48
azure 4 46
celadon 4 47
lavender 5 46
oak 6 46
chocolate 7 46
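The pieces above, assembled into one runnable sketch (special_counters reused from the question's code):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'yellow', 'orange', 'green',
                             'white', 'black', 'brown', 'orange-red', 'teal',
                             'beige', 'mauve', 'cyan', 'goldenrod', 'auburn',
                             'azure', 'celadon', 'lavender', 'oak', 'chocolate'],
                   'group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                             1, 2, 2, 2, 2, 4, 4, 5, 6, 7]}).set_index('color')
special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]

df['C'] = df.groupby('group').cumcount()
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7, 'C'] = df.loc[ser_size < 7, 'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7, 'C'] %= 6
print(df)
```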

broadcast groupby with boolean filter scalar in pandas

I have a data frame according to below.
df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': [i for i in range(9)],
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]
                   })
End result that I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, according to code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']]
                  .groupby('var1', as_index=False)
                  .sum()
                  .rename(columns={'var3': 'var3_lt_50'}),
              how='left',
              left_on='var1',
              right_on='var1')
Can someone show me a way of doing this kind of boolean-filtered, per-group scalar broadcast without the groupby + merge step I'm doing today? I want a smoother line of code.
Thanks in advance for input,
/Swepab
You can use groupby.transform which keeps the shape of the transformed variable as well as the index so you can just assign the result back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
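Checked against the sample frame, transform broadcasts each group's filtered sum back to every row of that group; a quick end-to-end sketch:

```python
import pandas as pd

df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': list(range(9)),
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]})

# for each group, sum only the var3 values below 50, then broadcast
# that scalar back to every row of the group
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
print(df)
```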
