A For loop with embedded if statement to update a dataframe


I am a newcomer to python. I want to implement a "For" loop on the elements of a dataframe, with an embedded "if" statement.
Code:
import numpy as np
import pandas as pd
#Dataframes
x = pd.DataFrame([1,-2,3])
y = pd.DataFrame()
for i in x.iterrows():
    for j in x.iteritems():
        if x>0:
            y = x*2
        else:
            y = 0
With the previous loop, I want to go through each item in the x dataframe and generate a new dataframe y based on the condition in the "if" statement. When I run the code, I get the following error message.
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any help would be much appreciated.

In pandas it is best to avoid loops when a vectorized solution exists:
x = pd.DataFrame([1,-2,3], columns=['a'])
y = pd.DataFrame(np.where(x['a'] > 0, x['a'] * 2, 0), columns=['b'])
print (y)
   b
0  2
1  0
2  6
Explanation:
First, compare the column with the value to get a boolean mask:
print (x['a'] > 0)
0     True
1    False
2     True
Name: a, dtype: bool
Then use numpy.where to set values according to the condition:
print (np.where(x['a'] > 0, x['a'] * 2, 0))
[2 0 6]
Finally, pass the result to the DataFrame constructor or assign it to a new column:
x['new'] = np.where(x['a'] > 0, x['a'] * 2, 0)
print (x)
   a  new
0  1    2
1 -2    0
2  3    6
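For completeness, here is a minimal sketch of why the loop in the question fails. It assumes the same three-value frame; the names raised and doubled are only illustrative. The expression x > 0 produces a whole DataFrame of booleans, and an if statement needs a single True or False.

```python
import pandas as pd

x = pd.DataFrame([1, -2, 3], columns=['a'])

# x > 0 is a DataFrame of booleans, not a single True/False,
# so using it directly in `if` raises ValueError.
try:
    if x > 0:
        pass
    raised = False
except ValueError:
    raised = True

# An explicit loop works once each comparison is made on a scalar.
doubled = [value * 2 if value > 0 else 0 for value in x['a']]
y = pd.DataFrame({'b': doubled})
print(raised)
print(y)
```

The scalar loop is shown only to illustrate the difference; the vectorized np.where version is still the better choice for real data.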

You can try this:
y = (x[(x > 0)]*2).fillna(0)
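A rough step-by-step breakdown of that one-liner, assuming the frame from the question (the intermediate names positive_only and doubled are made up for illustration). Note that the NaN placeholders turn the result into floats.

```python
import pandas as pd

x = pd.DataFrame([1, -2, 3])

# Step 1: boolean indexing keeps positive cells and puts NaN elsewhere.
positive_only = x[x > 0]

# Step 2: doubling leaves the NaN cells untouched.
doubled = positive_only * 2

# Step 3: the remaining NaN cells become 0.
y = doubled.fillna(0)

print(y)
```

If integer output matters, y.astype(int) at the end should restore it.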

Related

pandas fill the column values with min function

I have a dataframe with 2 columns and I need to add 3rd column 'start'. However my code for some reason doesn't work and I am not sure why. Here is my code
df.loc[df.type=='C', 'start']= min(-1+df['dq']-3,4)
df.loc[df.type=='A', 'start']= min(-3+df['dq']-3,4)
df.loc[df.type=='B', 'start']= min(-3+df['dq']-5,4)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
and the dataset looks like this:
type  dq
A      3
A      4
B      8
C      3

The error is raised because the first argument you pass to min() is a Series and the second (4) is an int. The built-in min() has to reduce their comparison to a single True/False, which is ambiguous for a Series.
Since you're using min to replace values greater than 4 with 4, you can just replace them once at the end using where:
df.loc[df.type=='C', 'start'] = -1+df['dq']-3
df.loc[df.type=='A', 'start'] = -3+df['dq']-3
df.loc[df.type=='B', 'start'] = -3+df['dq']-5
df["start"] = df["start"].where(df["start"]<4,other=4)
>>> df
  type  dq  start
0    A   3     -3
1    A   4     -2
2    B   8      0
3    C   3     -1
Another (perhaps cleaner) way of getting your column would be to use numpy.select, like so:
import numpy as np
df["start"] = np.select([df["type"]=="A", df["type"]=="B", df["type"]=="C"],
                        [df['dq']-6, df["dq"]-8, df["dq"]-4])
df["start"] = df["start"].where(df["start"]<4, 4)

You cannot use the built-in min to compare a Series with a scalar. Instead, you can do:
s = (df['dq'] - df['type'].map({'A': 6, 'B': 8, 'C': 4}))
df['start'] = s.where(s<4, 4)
output:
  type  dq  start
0    A   3     -3
1    A   4     -2
2    B   8      0
3    C   3     -1
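As a possible alternative to where, Series.clip(upper=4) caps values elementwise, which is what min(value, 4) was meant to do. This is a sketch that reuses the sample data and the offsets from the map-based answer; the variable name offsets is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'B', 'C'], 'dq': [3, 4, 8, 3]})

# Per-type offsets, as in the map-based answer above.
offsets = df['type'].map({'A': 6, 'B': 8, 'C': 4})

# clip(upper=4) caps every value at 4, the elementwise analogue of min(value, 4).
df['start'] = (df['dq'] - offsets).clip(upper=4)

# A value above the cap is reduced to it, smaller values pass through.
capped = pd.Series([2, 9]).clip(upper=4)
print(df)
```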

Pandas apply custom function to DF

I would like to create a brand new data frame by replacing values of a DF using a custom function.
I keep getting the following error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
I tried some suggestions (Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()) but it didn't work.
I would appreciate if somebody could shed light on this issue and help me out.
Thank you in advance for your time.
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif (x > -0.5 & x <= 0.5):
        return 0
    elif x > 0.5:
        return 1
df = pd.DataFrame({'A':[1,0,-0.6,-1,0.7],
                   'B':[-1,1,-0.3,0.5,1]})
df.apply(convert2integer)

A few options:
The slower option, but the one closest to your code, uses applymap:
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif x <= 0.5:
        return 0
    else:
        return 1
df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = df.applymap(convert2integer)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
applymap applies the function to each cell in the DataFrame. For this reason, x is a float, and should be treated as such.
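To make the difference concrete, the sketch below records what each method hands to the function. It assumes the sample frame from the question. In recent pandas releases (2.1 and later, as far as I know) applymap is deprecated in favour of DataFrame.map, so the code picks whichever one is available; the names elementwise, arg_types and cells_are_scalars are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})

# apply calls the function once per column and passes the whole column.
arg_types = df.apply(lambda col: type(col).__name__)

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# The elementwise method calls the function once per cell with a scalar.
cells_are_scalars = elementwise(lambda cell: isinstance(cell, float))

print(arg_types)
print(cells_are_scalars)
```

This is why x <= -0.5 inside df.apply(convert2integer) is a boolean Series, which an if statement cannot evaluate.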
The faster option via np.select:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = pd.DataFrame(
    np.select([df.le(-0.5), df.le(0.5)],
              [-1, 0],
              default=1),
    index=df.index,
    columns=df.columns
)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
np.select takes a list of conditions and a list of choices. Where a condition is True, it uses the value at the corresponding position in the choice list; conditions are checked in order and the first match wins.
The last condition does not need to be checked: a value that matched neither of the first two conditions must be greater than 0.5. Likewise, the second condition does not need to check that the value is greater than -0.5, because otherwise the first condition would already have matched.

compare value of the current index to the value of next index in Pandas df

I'm trying to compare the value at the current index with the value at the next index in my pandas data frame. I'm able to access the value with iloc, but when I write an if condition to validate the value, it gives me an error.
Code I tried:
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
for k in range(len(df)):
    if df.iloc[k+1] > df.iloc[k]:
        trend.append('up')
    if df.iloc[k+1] < df.iloc[k]:
        trend.append('down')
    if df.iloc[k+1] == df.iloc[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I tried assigning the iloc[k] value to a variable "current" with astype=int. Still I am unable to use the variable "current" in my if condition validation. Appreciate if somebody can help with info on how to resolve it.

You are getting the error because df.iloc[k] gives you a pd.Series (the whole row).
You can use, say, df.iloc[k,0] to get the Col1 value as a scalar.
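A small sketch of that fix, assuming the frame from the question (the names row, value, current and following are illustrative). It also stops the loop one step early, since range(len(df)) would make k + 1 run past the last row.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})

row = df.iloc[0]        # one row as a Series (one entry per column)
value = df.iloc[0, 0]   # a single scalar from row 0, column 0

trend = []
# Stop one short of the end so that k + 1 stays in range.
for k in range(len(df) - 1):
    current, following = df.iloc[k, 0], df.iloc[k + 1, 0]
    if following > current:
        trend.append('up')
    elif following < current:
        trend.append('down')
    else:
        trend.append('nochange')

print(trend)
```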

So, what I have done is I have converted that particular column into a list. Instead of working directly from the Series object returned by the dataframe, I prefer converting it to a list or numpy array first and then performing basic functions on it.
I have also given the correct working code below.
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
temp=df['Col1'].tolist()
print(temp)
for k in range(len(df)-1):
    if temp[k+1] > temp[k]:
        trend.append('up')
    if temp[k+1] < temp[k]:
        trend.append('down')
    if temp[k+1] == temp[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
print(trend)

Here is a more pandas-like approach. We can get the difference of two consecutive elements of a series easily via pandas.DataFrame.diff:
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
df_diff = df.diff()
   Col1
0   NaN
1  -1.0
2   1.5
3   0.0
4   1.8
5  -0.8
Now you can apply a function elementwise that only distinguishes >0, <0 and ==0, using pandas.DataFrame.applymap:
def direction(x):
    if x > 0:
        return 'up'
    elif x < 0:
        return 'down'
    elif x == 0:
        return 'nochange'
df_diff.applymap(direction)
       Col1
0      None
1      down
2        up
3  nochange
4        up
5      down
Finally, it's a design decision what should happen to the first value of the series. Here the NaN value doesn't fit any case. You can also treat it separately in direction, or omit it from your result by slicing.
Edit: The same as a one-liner:
df.diff().applymap(lambda x: 'up' if x > 0 else ('down' if x < 0 else ('nochange' if x == 0 else None)))
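A variation on the same idea that avoids calling a Python function per cell: take the sign of each difference with np.sign and map it to a label. This is a sketch rather than part of the original answer, and the column name trend is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})

# np.sign reduces each difference to -1.0, 0.0 or 1.0 (NaN stays NaN).
signs = np.sign(df['Col1'].diff())

# Map the sign to a label; the first row has no predecessor and stays NaN.
df['trend'] = signs.map({1.0: 'up', -1.0: 'down', 0.0: 'nochange'})

print(df)
```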

You can use this:
df['trend'] = np.where(df.Col1.shift().isnull(), "N/A", np.where(df.Col1 == df.Col1.shift(), "nochange", np.where(df.Col1 < df.Col1.shift(), "down", "up")))
   Col1     trend
0   2.5       N/A
1   1.5      down
2   3.0        up
3   3.0  nochange
4   4.8        up
5   4.0      down

Pandas apply ValueError: The truth value of a Series is ambigous

I'm trying to create a new feature using
df_transactions['emome'] = df_transactions['emome'].apply(lambda x: 1 if df_transactions['plan_list_price'] ==0 & df_transactions['actual_amount_paid'] > 0 else 0).astype(int)
But it raises error
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
How can I create a new column that returns 1 when plan_list_price is 0 and actual_amount_paid is >0 else 0?
I would like to still use pandas apply.

You are really close, but a vectorized solution without apply is much better - build a boolean mask and convert it to int:
mask = (df_transactions['plan_list_price'] == 0) & \
       (df_transactions['actual_amount_paid'] > 0)
df_transactions['emome'] = mask.astype(int)
If you really want the slower apply:
f = lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0
df_transactions['emome'] = df_transactions.apply(f, axis=1)
Sample:
df_transactions = pd.DataFrame({'A':list('abcdef'),
                                'plan_list_price':[0,0,0,5,5,0],
                                'actual_amount_paid':[-1,0,9,4,2,3]})
mask = (df_transactions['plan_list_price'] == 0) & \
       (df_transactions['actual_amount_paid'] > 0)
df_transactions['emome1'] = mask.astype(int)
f = lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0
df_transactions['emome2'] = df_transactions.apply(f, axis=1)
print (df_transactions)
   A  actual_amount_paid  plan_list_price  emome1  emome2
0  a                  -1                0       0       0
1  b                   0                0       0       0
2  c                   9                0       1       1
3  d                   4                5       0       0
4  e                   2                5       0       0
5  f                   3                0       1       1
Timings:
#[60000 rows]
df_transactions = pd.concat([df_transactions] * 10000, ignore_index=True)
In [201]: %timeit df_transactions['emome1'] = ((df_transactions['plan_list_price'] == 0) & (df_transactions['actual_amount_paid'] > 0)).astype(int)
1000 loops, best of 3: 971 µs per loop
In [202]: %timeit df_transactions['emome2'] = df_transactions.apply(lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0, axis=1)
1 loop, best of 3: 1.15 s per loop
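The parentheses in the mask are not optional. In Python, & binds more tightly than == and >, so the unparenthesized condition from the question is parsed as a chained comparison. A small sketch with a three-row stand-in frame (the names price, paid and raised are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'plan_list_price': [0, 0, 5],
                   'actual_amount_paid': [-1, 9, 4]})
price, paid = df['plan_list_price'], df['actual_amount_paid']

# Without parentheses this parses as the chained comparison
#   price == (0 & paid) > 0
# and chaining inserts an implicit `and`, which needs a single bool.
try:
    price == 0 & paid > 0
    raised = False
except ValueError:
    raised = True

# With parentheses, each comparison is evaluated first and & combines them.
mask = (price == 0) & (paid > 0)

print(raised)
print(mask.astype(int))
```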

A few issues:
On the right side of the assignment, the new field (emome) has not been created yet.
The lambda receives x, but its body refers to df_transactions, so it compares whole columns instead of the values of the current row.
You need to specify axis=1 since you are applying the function to each row (the default is to each column).
From the docs:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

Pandas. Check that rows are partly equals

I have a table:
   tel  size      1      2     3     4
0  123     1   Baby   Baby  None  None
1  234     1  Shave  Shave  None  None
2  222     1   Baby   Baby  None  None
3  333     1  Shave  Shave  None  None
I want to check whether the values in columns 1, 2, 3, 4 ... are equal between rows, using 2 loops:
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
so df_map.iloc[0,2:] should be equal to df_map.iloc[2,2:],
and df_map.iloc[1,2:], is equal to df_map.iloc[3,2:],
I tried:
x == y
and
y.eq(x)
but it returns error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I use (x==y).all() or (x==y).any(), it returns the wrong result.
I need something like:
if x == y:
    counter += 1
Update:
The problem was the None values. I used fillna('') and (x == y).all()

fillna('') because pandas treats None as a missing value, so an elementwise None == None comparison is False
use numpy broadcasting to evaluate == for every pair of rows
all(-1) to make sure the whole row matches
np.fill_diagonal because we don't need self matches
np.where to find where the matches are
v = df.fillna('').values[:, 2:]
match = ((v[None, :] == v[:, None]).all(-1))
np.fill_diagonal(match, False)
i, j = np.where(match)
pd.Series(i, j)
2    0
3    1
0    2
1    3
dtype: int64
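The same approach as a self-contained sketch with the intermediate shapes spelled out. The frame is rebuilt from the table in the question, and the names cellwise and pairs are illustrative. Filtering on i < j is an added step so that each matching pair is listed once.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tel': [123, 234, 222, 333],
                   'size': [1, 1, 1, 1],
                   '1': ['Baby', 'Shave', 'Baby', 'Shave'],
                   '2': ['Baby', 'Shave', 'Baby', 'Shave'],
                   '3': [None] * 4,
                   '4': [None] * 4})

v = df.iloc[:, 2:].fillna('').to_numpy()      # shape (4, 4)

# v[None, :] has shape (1, 4, 4) and v[:, None] has shape (4, 1, 4);
# comparing them broadcasts to (4, 4, 4): every row against every row.
cellwise = v[None, :] == v[:, None]

# A pair of rows matches only if all of its cells match.
match = cellwise.all(-1)
np.fill_diagonal(match, False)                # ignore self matches

# Each pair appears twice, as (i, j) and (j, i), so keep i < j only.
pairs = [(int(i), int(j)) for i, j in zip(*np.where(match)) if i < j]
print(pairs)
```

len(pairs) then gives the counter the question was after.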

The values underlying a pandas Series are a NumPy array, so a comparison is made element by element and returns another array; to get a single result you have to reduce that array again.
import numpy as np
x = df_map.iloc[i, 2:]
y = df_map.iloc[j, 2:]
np.equal(y.values, x.values)
