change value in a column with apply lambda - python

Do you have a trick to avoid having the following code change the cells (so the existing data is kept) when the condition in the lambda function is not met?
df['test'] = df['Q8_3'].apply(lambda x: 'serial' if x >= 3 else 'toto')
This code runs once I add 'toto' after the else, but I would like to bypass the else statement entirely and leave those cells untouched.
Thanks for your help.

Does this help you?
df = pd.DataFrame({
    'Q8_3': [1, 2, 3, 4, 5],
    'test': ['old', 'old', 'old', 'old', 'old']
})
df.loc[df.Q8_3 >= 3, 'test'] = "serial"
print(df)
Output:
   Q8_3    test
0     1     old
1     2     old
2     3  serial
3     4  serial
4     5  serial

Or even better with np.where, falling back to the existing test values when the condition is not met:
df['test'] = np.where(df['Q8_3'] >= 3, 'serial', df['test'])
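Another pandas-native option is Series.where, which keeps a value wherever the condition holds and substitutes elsewhere. A minimal sketch, assuming df['test'] already holds the data you want to preserve:
# Keep the existing 'test' value where Q8_3 < 3, write 'serial' everywhere else
df['test'] = df['test'].where(df['Q8_3'] < 3, 'serial')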

Related

How to resolve Pandas performance warning "highly fragmented" after using many custom np.where statements?

I have a project where I am converting code from SQL to Pandas. I have 80 custom elements in my dataset / dataframe - each requires custom logic. In SQL, I use multiple case statements within a single Select like this:
Select x, y, z,
  (case when statement1 then 0
        when statement2 then 0
        else 1 end) as custom_element1,
  next case statement... as custom_element2,
  next case statement... as custom_element3,
  etc...
Now in Pandas, I am hoping for some advice on the most efficient way to accomplish the same goal. To make it easier to reproduce, here is an example that does the same thing that I want to do. I need to create 80 custom output variables. In this example, I am just adding one custom element at a time using different np.where statements.
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])
df['custom1'] = np.where(df['num_legs'].values > 2, 1, 0)
df['custom2'] = np.where(df['num_wings'] == df['num_legs'], 1, 0)
df['custom3'] = np.where((df['num_wings'].values == 0) | (df['num_legs'].values == 0), 1, 0)
I can get the output from consecutive np.where statements to match my output from the original SQL exactly, so no problems there.
BUT I saw this warning:
DataFrame is highly fragmented...poor performance...Consider using pd.concat
instead...or use copy().
So my question is, for my example, how do I improve performance? How would I use pd.concat here? What is a better way to structure the code than what I am showing above? I have tried searching for an answer in this forum but did not find anything. I appreciate your time in responding.
So, np.where is totally unnecessary here. For example, you can just use:
In [6]: df.num_legs > 2
Out[6]:
falcon    False
dog        True
spider     True
fish      False
Name: num_legs, dtype: bool
Instead of:
In [9]: np.where(df.num_legs > 2, 1, 0)
Out[9]: array([0, 1, 1, 0])
These probably should be bool dtype columns, but if you insist on using int, just add an .astype(int).
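For instance, a minimal sketch of the int-typed variant:
# Same result as np.where(df.num_legs > 2, 1, 0), but stays in pandas
df['custom1'] = (df['num_legs'] > 2).astype(int)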
In any case, here is how you might use pd.concat:
df = pd.concat(
    [
        df,
        (df["num_legs"] > 2).rename("custom1"),
        (df["num_wings"] == df["num_legs"]).rename("custom2"),
        ((df["num_wings"] == 0) | (df["num_legs"] == 0)).rename("custom3"),
    ],
    axis=1,
)
Example:
In [10]: df
Out[10]:
        num_legs  num_wings
falcon         2          2
dog            4          0
spider         8          0
fish           0          0
In [11]: pd.concat(
    ...:     [
    ...:         df,
    ...:         (df["num_legs"] > 2).rename("custom1"),
    ...:         (df["num_wings"] == df["num_legs"]).rename("custom2"),
    ...:         ((df["num_wings"] == 0) | (df["num_legs"] == 0)).rename("custom3"),
    ...:     ],
    ...:     axis=1,
    ...: )
Out[11]:
        num_legs  num_wings  custom1  custom2  custom3
falcon         2          2    False     True    False
dog            4          0     True    False     True
spider         8          0     True    False     True
fish           0          0    False     True     True
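To scale this to all 80 elements without fragmentation, one approach is to collect every derived column in a plain dict and concatenate once at the end. A minimal sketch, with the element logic abbreviated:
# Build every derived column first, then attach them in a single concat
new_cols = {
    "custom1": df["num_legs"] > 2,
    "custom2": df["num_wings"] == df["num_legs"],
    "custom3": (df["num_wings"] == 0) | (df["num_legs"] == 0),
    # ... one entry per custom element, up to custom80
}
df = pd.concat([df, pd.DataFrame(new_cols)], axis=1)
This way the original frame is never grown column by column, which is what triggers the fragmentation warning.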

pandas returns DataError when seaborn plots a dataframe made from lists

When I tried plotting a pandas dataframe in seaborn, I got a DataError. I fixed the problem by recreating the dataframe from a dictionary instead of using lists and a for loop. However, I still don't understand why I got the error in the first case. The two data frames look identical to me. Can somebody explain what happens here?
# When I create two seemingly identical data frames...
x = [0, 1, 2]
y = [3, 5, 7]
line_df1 = pd.DataFrame(columns=['x', 'y'])
for i in range(3):
    line_df1.loc[i] = [x[i], y[i]]
line_dict = {'x': [0, 1, 2], 'y': [3, 5, 7]}
line_df2 = pd.DataFrame(line_dict)
# ...they look identical when printed
print(line_df1)
print(line_df2)
>>    x  y
>> 0  0  3
>> 1  1  5
>> 2  2  7
>>    x  y
>> 0  0  3
>> 1  1  5
>> 2  2  7
# This first one throws a DataError...
sns.lineplot('x', 'y', data=line_df1)
# ...but this one does not.
sns.lineplot('x', 'y', data=line_df2)
The problem is that the first DataFrame's values are objects, which you can verify with DataFrame.dtypes:
print(line_df1.dtypes)
x    object
y    object
dtype: object

print(line_df2.dtypes)
x    int64
y    int64
dtype: object
To make the first approach work correctly, set the dtype when creating the empty DataFrame:
line_df1 = pd.DataFrame(columns=['x','y'], dtype=int)
But if performance is important, the second solution is better, because updating an empty DataFrame row by row is one of the slowest ways to build one; it sits at the bottom of the well-known ranking of construction methods:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
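If the frame has already been built with object columns, another option is to convert them afterwards. A minimal sketch:
# Cast the object columns to a numeric dtype after construction
line_df1 = line_df1.astype({'x': int, 'y': int})
# or, more generally, let pandas infer sensible dtypes
line_df1 = line_df1.infer_objects()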

Pandas replacing one value with another for specified columns

I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[df[c] = df[c].apply(lambda x: 99 if x == 2 else x) for c in arb_cols]
But this is invalid syntax, since an assignment is not allowed inside a list comprehension. Is it possible to accomplish such a task without a for loop?
With mask:
pdf.mask(pdf.loc[:, arb_cols] == 2, 99).assign(c=pdf.c)
Out[1190]:
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7
Or with assign:
pdf.assign(**pdf.loc[:, arb_cols].mask(pdf.loc[:, arb_cols] == 2, 99))
Out[1193]:
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
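replace also accepts a dict, which is convenient if several values need different substitutions; a minimal sketch:
# Map each unwanted value to its replacement in one call
pdf[arb_cols] = pdf[arb_cols].replace({2: 99})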
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case, if you really do need a custom function applied elementwise, it would probably be better to use applymap:
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x: 99 if x == 2 else x)
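One more vectorised variant that stays entirely in pandas is DataFrame.where, which keeps values where the condition holds and substitutes elsewhere. A minimal sketch:
# Keep every value that is not 2; replace the rest with 99
pdf[arb_cols] = pdf[arb_cols].where(pdf[arb_cols] != 2, 99)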

From a dataframe using the apply() method, how to return a new column with lists of elements from the dataframe?

There's an operation that is a little counterintuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2], ['test']], 'two': [[5], [10]]})
      one   two
0     [2]   [5]
1  [test]  [10]
and I want to concatenate the lists in the two columns, row by row, producing a column of lists whose length equals the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
    one  two
0     2    5
1  test   10
Which isn't quite what we wanted. What we want is:
       result
0      [2, 5]
1  [test, 10]
EDIT
I know there are simpler solutions to this example, but this is an abstraction from a much more complex operation. Here's an example of a more complex one:
df_one:
   org_id        date  status  id
0       2  2015/02/01    True   3
1      10  2015/05/01    True  27
2      10  2015/06/01    True  18
3      10  2015/04/01   False  27
4      10  2015/03/01    True  40
df_two:
   org_id        date
0      12  2015/04/01
1      10  2015/02/01
2       2  2015/08/01
3      10  2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
    operation,
    axis=1,
    args=(df_one,)
)
This would be impossible with simpler solutions. Hence my proposal below is to rewrite operation as:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
        id_list
0            []
1            []
2           [3]
3  [27, 18, 40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
       result
0      [2, 5]
1  [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how pandas.apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function returns a list for each row that gets passed in. This is a problem when used with the .apply() method, because pandas interprets the resulting list as a Series whose elements correspond to the columns of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
       result
0      [2, 5]
1  [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
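On pandas 0.23 and later, DataFrame.apply also takes a result_type parameter, which offers an alternative to wrapping the value in a Series. A minimal sketch, assuming the original combine that returns a plain list:
# 'reduce' asks pandas to keep each returned list as a single cell
test['result'] = test.apply(combine, axis=1, result_type='reduce')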

Index values of specific rows in python

I am trying to find the index of the row just before each row where "None" occurs.
pId = ["a", "b", "c", "None", "d", "e", "None"]
df = pd.DataFrame(pId, columns=['pId'])
    pId
0     a
1     b
2     c
3  None
4     d
5     e
6  None
df.index[df.pId.eq('None') & df.pId.ne(df.pId.shift(-1))]
I am expecting the output of the above code to be
Index([2, 5])
but it gives me
Index([3, 6])
Please correct me.
I am not sure about the specific example you showed. Anyway, you could do it in a simpler way:
indexes = [i - 1 for i, x in enumerate(pId) if x == 'None']
The problem is that you're returning the index of the "None" itself. You compare it against the previous item, but you still report the index of the "None". Note that your accepted answer doesn't make this check either.
In short, you still need to subtract 1 from the result of your check.
Just subtract 1 from df[df["pId"] == "None"].index:
import pandas as pd

pId = ["a", "b", "c", "None", "d", "e", "None"]
df = pd.DataFrame(pId, columns=['pId'])
print(df[df["pId"] == "None"].index - 1)
Which gives you:
Int64Index([2, 5], dtype='int64')
Or if you just want a list of values:
(df[df["pId"] == "None"].index - 1).tolist()
You should be aware that for a list like:
pId = ["None", "None", "b", "c", "None", "d", "e", "None"]
You get a df like:
    pId
0  None
1  None
2     b
3     c
4  None
5     d
6     e
7  None
And output like:
[-1, 0, 3, 6]
Which does not make a great deal of sense.
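If that edge case matters, one way to guard against it is to drop positions that fall outside the frame or land on another "None". A minimal sketch:
idx = df[df["pId"] == "None"].index - 1
# Keep only positions that exist and are not themselves "None"
idx = [i for i in idx if i >= 0 and df.loc[i, "pId"] != "None"]
For the list above this yields [3, 6], skipping the leading and consecutive "None" entries.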
