How to deal with “SettingWithCopyWarning” in a for loop if statement - python

Let's say column A is time-based, column B is salary.
I am using an if statement within a for loop, trying to find "all salaries that are less than the previous one BUT ALSO greater than the following one." Then I assign a new value ('YES') to another column (column C) for the rows that fulfill the condition. Finally, I want to grab the first column A value that fulfills the above conditions.
The dataframe looks like this:
In [1]:
df = pd.DataFrame({'A': ['2007q3', '2007q4', '2008q1', '2008q2', '2008q3',
                         '2008q4', '2009q1', '2009q2', '2009q3'],
                   'B': [14938, 14991, 14899, 14963, 14891, 14577, 14375, 14355, 14402]})
df['C'] = pd.Series()
df
Out [1]:
A B C
0 2007q3 14938 NaN
1 2007q4 14991 NaN
2 2008q1 14899 NaN
3 2008q2 14963 NaN
4 2008q3 14891 NaN
5 2008q4 14577 NaN
6 2009q1 14375 NaN
7 2009q2 14355 NaN
8 2009q3 14402 NaN
The following code does the work but shows the "SettingWithCopyWarning" warning; I am not sure which part of the code is causing the problem...
In [2]:
for i in range(1, len(df)-1):
    if (df['B'][i] < df['B'][i-1]) & (df['B'][i] > df['B'][i+1]):
        df['C'][i] = 'YES'
df
Out [2]:
A B C
0 2007q3 14938 NaN
1 2007q4 14991 NaN
2 2008q1 14899 NaN
3 2008q2 14963 NaN
4 2008q3 14891 YES
5 2008q4 14577 YES
6 2009q1 14375 YES
7 2009q2 14355 NaN
8 2009q3 14402 NaN
In [3]: df['A'][df['C'] == 'YES'].iloc[0]
Out [3]:'2008q3'
Or maybe there's a better way to get the job done. Thank you!!!

For more details on why you got SettingWithCopyWarning, I would suggest you read this answer. It is mostly because selecting the column df['C'] and then selecting the row with [i] is a "chained assignment", which is flagged this way when you do df['C'][i] = 'YES'.
For what you are trying to do, you can use np.where and shift on column B, such as:
import numpy as np
df['C'] = np.where((df.B < df.B.shift()) & (df.B > df.B.shift(-1)), 'YES', np.nan)
and you get the same output.
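If you want to keep the explicit loop, doing the assignment through a single .loc call removes the chained indexing, and the warning with it. A minimal sketch of that variant (same data and logic as in the question):
import pandas as pd

df = pd.DataFrame({'A': ['2007q3', '2007q4', '2008q1', '2008q2', '2008q3',
                         '2008q4', '2009q1', '2009q2', '2009q3'],
                   'B': [14938, 14991, 14899, 14963, 14891, 14577, 14375, 14355, 14402]})
df['C'] = pd.Series(dtype=object)

for i in range(1, len(df) - 1):
    if (df['B'].iloc[i] < df['B'].iloc[i - 1]) and (df['B'].iloc[i] > df['B'].iloc[i + 1]):
        # single .loc assignment instead of the chained df['C'][i] = 'YES'
        df.loc[i, 'C'] = 'YES'

first_quarter = df.loc[df['C'] == 'YES', 'A'].iloc[0]   # '2008q3'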

Related

assign one column value to another column based on condition in pandas

I want to know how we can assign one column's value to another column if the latter has a null or 0 value.
I have a dataframe like this:
id column1 column2
5263 5400 5400
4354 6567 Null
5656 5456 5456
5565 6768 3489
4500 3490 Null
The Expected Output is
id column1 column2
5263 5400 5400
4354 6567 6567
5656 5456 5456
5565 6768 3489
4500 3490 3490
that is,
if df['column2'] is Null/0 then it should take df['column1']'s value.
Can someone explain, how can I achieve my desired output?
Based on the answers to this similar question, you can do the following:
Using np.where:
df['column2'] = np.where((df['column2'] == 'Null') | (df['column2'] == 0), df['column1'], df['column2'])
Instead, using only pandas and Python:
df['column2'][(df['column2'] == 0) | (df['column2'] == 'Null')] = df['column1']
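Note that this last form indexes the column first and then the rows, which is the same kind of chained assignment discussed in the question above, so it may raise SettingWithCopyWarning itself. A sketch of the same replacement with Series.mask, which avoids that (it assumes, as above, that the missing entries are the literal string 'Null'):
df['column2'] = df['column2'].mask((df['column2'] == 0) | (df['column2'] == 'Null'),
                                   df['column1'])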
Here's my suggestion. Not sure whether it is the fastest, but it should work here ;)
# we start by creating an empty list
column2 = []
# for each row in the dataframe
for i in df.index:
    # if the value in col2 is null or 0, it takes the value of col1
    if df.loc[i, 'column2'] in ['null', 0]:
        column2.append(df.loc[i, 'column1'])
    # else it takes the value of column 2
    else:
        column2.append(df.loc[i, 'column2'])
# we replace the current column 2 with the new one!
df['column2'] = column2
Update using only Native Pandas Functionality
#Creates boolean array conditionCheck, checking conditions for each row in df
#Where() will only update when conditionCheck == False, so inverted boolean values using "~"
conditionCheck = ~((df['column2'].isna()) | (df['column2']==0))
df["column2"].where(conditionCheck,df["column1"],inplace=True)
print(df)
Code to Generate Sample DataFrame
Changed row 3 of column2 to 0 to test all scenarios
import numpy as np
import pandas as pd
data = [
    [5263, 5400, 5400],
    [4354, 6567, None],
    [5656, 5456, 0],
    [5565, 6768, 3489],
    [4500, 3490, None],
]
df = pd.DataFrame(data,columns=["id","column1","column2"],dtype=pd.Int64Dtype())
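For reference, here is the same fill written without inplace=True, together with the result worked out by hand from the sample data above:
# Non-inplace variant of the fill (a sketch)
df["column2"] = df["column2"].where(~((df["column2"].isna()) | (df["column2"] == 0)),
                                    df["column1"])
print(df["column2"].tolist())   # [5400, 6567, 5456, 3489, 3490]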
A similar question was already solved here.
The "Null" keyword does not exist in Python; empty cells in pandas are np.nan. So, assuming you mean np.nan, one good way to achieve your desired output would be:
Create a boolean mask to select rows with np.nan or 0 value and then copy when mask is True.
mask = (df['column2'].isna()) | (df['column2']==0)
df.loc[mask, "column2"] = df.loc[mask, "column1"]
Just use ffill(). Go through the example.
import numpy as np
import pandas as pd
items = [1,2,3,4,5]
place = [6,7,8,9,10]
quality = [11,np.nan,12,13,np.nan]
df = pd.DataFrame({"A":items, "B":place, "C":quality})
print(df)
"""
A B C
0 1 6 11.0
1 2 7 NaN
2 3 8 12.0
3 4 9 13.0
4 5 10 NaN
"""
aa = df.ffill(axis=1).astype(int)
print(aa)
"""
A B C
0 1 6 11
1 2 7 7
2 3 8 12
3 4 9 13
4 5 10 10
"""

How Can I drop a column if the last row is nan

I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason is that I'm using time series data in which data collection doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. I do, however, not want data whose most recent value is NaN, as that means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean Series to select the column to drop
df.drop(df.loc[:,df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in temp_df.columns:
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work if the last entries really are the string 'nan'.
Basically what I'm doing here is looping over all columns and checking if the last entry is 'nan', then dropping that column.
temp_df[col].iloc[-1]
is the last value of that column.
temp_df.drop(col, axis=1)
drops the column by name; axis=1 means you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is pandas' isnull() function, which checks for real NaN values and works like this:
for col in temp_df.columns:
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
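As a quick check, building the question's sample frame (values taken from the example at the top of this question) and running the loop keeps columns A and C:
import numpy as np
import pandas as pd

temp_df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                        "B": ["t", 2, "y", np.nan],
                        "C": ["x", 3, "z", 6]})
for col in temp_df.columns:
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
print(temp_df.columns.tolist())   # ['A', 'C']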

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below.
For each row, I want to get the name of the column if that row contains a 1 in that column.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
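A small illustration of the dot trick on a 0/1 indicator frame (the data here is made up):
import pandas as pd

df = pd.DataFrame({'foo': [0, 1, 0], 'bar': [1, 0, 0], 'baz': [0, 1, 1]})
print(df.dot(df.columns + ';').str.rstrip(';').tolist())
# ['bar', 'foo;baz', 'baz']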
Firstly
Your question is very ambiguous and I recommend reading this link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
    (df == 1)      # mask
    .any(axis=0)   # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this to automate your solution below.
Next:
Automate to get an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and create a dict mapping each index label to its sub-DataFrame
df_dict = dict(
    list(
        df.groupby(df.index)
    )
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: name of index, v: is a df
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting the column name splits into two cases.
If you want the result in a new column, the condition should pick out a single column per row, because this approach only gives one column name for each row.
import numpy as np
import pandas as pd

data = {'foo': [0, 0, 3, 0], 'bar': [0, 5, 0, 0], 'baz': [0, 0, 2, 0], 'spam': [0, 1, 0, 1]}
df = pd.DataFrame(data)
df = df.replace(0, np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the min or max:
max = df.idxmax(1)
min = df.idxmin(1)
out = df.assign(max=max, min=min)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
2nd case: if your condition is satisfied in multiple columns, for example you are looking for columns that contain 1, you get back a list of columns, because the result cannot fit in the same dataframe.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or, if you are looking for a numerical condition, e.g. columns containing a value greater than 1:
num_con = df.apply(lambda x: x > 1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
Happy learning

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
Use cumcount first to count the values within each group; these counts become the columns created by pivot. Then add the missing columns with reindex_axis and change the column names with add_prefix. Finally, reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
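Note that reindex_axis has since been removed from pandas. Under a current pandas version, the same result can be sketched with set_index/unstack and reindex (the sample data below is reconstructed from the expected output in the question):
import pandas as pd

df = pd.DataFrame({'visit_id': [48944282] * 5 + [48944305] * 4,
                   'atc_code': ['A02AG', 'J01CA04', 'J095AX02', 'N02BE01', 'R05X',
                                'A02AG', 'A03AX13', 'N02BE01', 'R05X']})

g = df.groupby('visit_id').cumcount() + 1
out = (df.set_index(['visit_id', g])['atc_code']
         .unstack()
         .reindex(columns=range(1, 8))
         .add_prefix('atc_')
         .reset_index())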

How to get the max/min value in Pandas DataFrame when nan value in it

One column of my pandas dataframe has NaN values, so when I want to get the max value of that column, it just returns an error.
>>> df.iloc[:, 1].max()
'error:512'
How can I skip that nan value and get the max value of that column?
You can use NumPy's np.nanmax and np.nanmin:
In [28]: df
Out[28]:
A B C
0 7 NaN 8
1 3 3 5
2 8 1 7
3 3 0 3
4 8 2 7
In [29]: np.nanmax(df.iloc[:, 1].values)
Out[29]: 3.0
In [30]: np.nanmin(df.iloc[:, 1].values)
Out[30]: 0.0
You can use Series.dropna.
res = df.iloc[:, 1].dropna().max()
If you don't use iloc or loc, it is as simple as:
df['column'].max()
or
df['column'][df.index.min():df.index.max()]
or any kind of range in the second pair of square brackets.
You can set numeric_only = True when calling max:
df.iloc[:, 1].max(numeric_only = True)
Attention:
For everyone trying to use it with a pandas Series: this does not work, even though it is mentioned in the docs.
See the post on GitHub.
The DataFrame aggregate function .agg() will automatically ignore NaN values.
df.agg({'income': 'max'})
Besides, it can also be used together with .groupby:
df.groupby('column').agg({'income':['max','mean']})
When the df contains NaN values, it reports NaN. Using
np.nanmax(df.values) gave the desired answer.
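A small self-contained check of that (the data is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [7, 3, 8], "B": [np.nan, 3, 1]})
print(np.nanmax(df.values))   # 8.0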
