Is this Pandas 'SettingWithCopyWarning' a False Positive? - python

I have a dataframe that I subset to produce a new dataframe:
temp_df = initial_df.loc[initial_df['col'] == val]
And then I add columns to this dataframe, setting all values to np.nan:
temp_df[new_col] = np.nan
This triggers a 'SettingWithCopyWarning', as it should, and tells me:
Try using .loc[row_indexer,col_indexer] = value instead
However, when I do that, like so:
temp_df.loc[:,new_col] = np.nan
I still get the same warning. In fact, I get one instance of the warning using the first method, but two instances of the warning using .loc.
Is this warning incorrect here? I don't care that the new column I am adding doesn't make it back to the initial_df. Is it a false positive? And why are there two warnings?
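For what it's worth, the usual way to make the warning go away entirely (also suggested in the "Proper way to utilize .loc" answer further down) is to take an explicit copy of the subset. A minimal sketch, with hypothetical example data standing in for initial_df and val:

import numpy as np
import pandas as pd

initial_df = pd.DataFrame({'col': [1, 2, 1]})  # hypothetical data
val = 1

# .copy() makes temp_df an independent frame, so adding a column
# no longer modifies (or warns about modifying) a view of initial_df
temp_df = initial_df.loc[initial_df['col'] == val].copy()
temp_df['new_col'] = np.nan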

Related

Figuring out if an entire column in a Pandas dataframe is the same value or not

I have a pandas dataframe that works just fine. I am trying to figure out how to tell if a column with a label that I know is correct does not contain all the same values. The code below errors out for some reason when I want to see if the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get an () takes 1 or 2 arguments but 3 is passed in error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see the docs for .loc/indexing. Specifically, you want to be using [] instead of ().
You can do something like
if not df[column].eq(-1).all():
...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
...
Also, note you don't need to use .eq(); you can just do (df[column] == -1).all() if you prefer.
You could drop duplicates; if you get only one record back, it means all records are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
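For reference, pandas also has a built-in for this; a one-line sketch using nunique, which counts distinct values (ignoring NaN by default):

import pandas as pd

df = pd.DataFrame({'col': [1, 1, 1, 1]})
print(df['col'].nunique() == 1)  # True -- the column holds a single distinct value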
The question is not entirely clear. Let's try the following, though:
Contains only -1 in each cell
df['estimate'].eq(-1).all()
Contains -1 in any cell
df['estimate'].eq(-1).any()
Select the rows where estimate is -1 (keeping all columns)
df.loc[df['estimate'].eq(-1),:]
df['column'].value_counts() gives you a list of all unique values and their counts in a column. As for checking if all the values are a specific number, you can do that by collapsing the column to its unique values and checking that there is exactly one:
len(set(df['column'])) == 1

Modify Pandas dataFrame column values based on a condition

I want to modify only the values that are greater than 750 in a column of a pandas dataframe:
datf.iloc[:,index][datf.iloc[:,index] > 750] = datf.iloc[:,index][datf.iloc[:,index] > 750]/10.7639
I think that the syntax is fine, but I get a pandas warning, so I don't know if it's correct this way:
<ipython-input-24-72eef50951a4>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
What is the correct way to do this without getting this warning?
You can use the apply method to make your modification to your column using a custom function.
N.B. you can also use applymap to apply a function across multiple columns.
def my_func(x):
    if x > 750:
        x = x / 10.7639  # do your modification here (the conversion from the question)
    return x

new_dta = datf['col_name'].apply(my_func)
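Alternatively, a vectorized sketch that avoids both the loop and the warning, assuming the column is addressed by name (datf and 'col_name' are placeholders, as above):

import pandas as pd

datf = pd.DataFrame({'col_name': [500.0, 800.0, 1200.0]})  # hypothetical data

# Build a boolean mask once, then assign through .loc so pandas
# writes into the original frame rather than a temporary copy
mask = datf['col_name'] > 750
datf.loc[mask, 'col_name'] = datf.loc[mask, 'col_name'] / 10.7639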

Get a KeyError in Pandas

I am trying to call a function from a different module as below:
module1 - func1: returns a dataframe
module1 - func2(p_df_in_fromfunc1)
function 2:
for i in range(0, len(p_df_in_fromfunc1)):
# Trying to retrieve row values of individual columns and assign to variables
v_tmp = p_df_in_fromfunc1.loc[i,"Col1"]
When trying to run the above code, I get the error:
KeyError: 0
Could the issue be because I don't have a zero numbered row?
Without knowing much of your code, my guess is: for positional indexing, try using iloc instead of loc if you're interested in going index-wise.
Something like:
v_tmp = p_df_in_fromfunc1["Col1"].iloc[i]  # select the column by label, then index by position
You may have missed closing the quote after Col1 in the loc call?
v_tmp = p_df_in_fromfunc1.loc[i,"Col1"]
For retrieving a row for specific columns do:
columns = ['Col1', 'Col2']
df[columns].iloc[index]
If you only want one column, you can simplify it to: df['Col1'].iloc[index]
As per your comment, you do not need to reset the index; you can iterate over the values of your index directly via df.index.
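A minimal sketch of that approach, assuming the dataframe and column name from the question (the example data is hypothetical): iterating the real index labels means .loc never looks up a label like 0 that may not exist.

import pandas as pd

# Hypothetical frame standing in for the one returned by func1;
# its index does not start at 0, which is what makes .loc[0, ...] fail
p_df_in_fromfunc1 = pd.DataFrame({'Col1': ['a', 'b']}, index=[5, 9])

for i in p_df_in_fromfunc1.index:  # iterate actual labels, not range(len(...))
    v_tmp = p_df_in_fromfunc1.loc[i, "Col1"]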

Proper way to utilize .loc in python's pandas

When trying to change a column of numbers from object to float dtypes using pandas dataframes, I receive the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Now, the code runs just fine, but what would be the proper and intended way to avoid this warning and still achieve the goal of:
df2[col] = df2[col].astype('float')
Let it be noted that df2 is a subset of df1 using a condition similar to:
df2 = df1[df1[some col] == value]
Use the copy method. Instead of:
df2 = df1[df1[some col] == value]
Just write:
df2 = df1[df1[some col] == value].copy()
Initially, df2 is a slice of df1 and not a new dataframe, which is why pandas raises the warning when you try to modify it.
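Putting the two lines from the question together, a minimal sketch with hypothetical data (df1, value, and col stand in for the originals):

import pandas as pd

df1 = pd.DataFrame({'some_col': ['a', 'a', 'b'], 'col': ['1.5', '2.5', '3.0']})
value, col = 'a', 'col'

df2 = df1[df1['some_col'] == value].copy()
df2[col] = df2[col].astype('float')  # no SettingWithCopyWarning on a real copy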

Can't drop NAN with dropna in pandas

I import pandas as pd and run the code below and get the following result
Code:
traindataset = pd.read_csv('/Users/train.csv')
print traindataset.dtypes
print traindataset.shape
print traindataset.iloc[25,3]
traindataset.dropna(how='any')
print traindataset.iloc[25,3]
print traindataset.shape
Output
TripType int64
VisitNumber int64
Weekday object
Upc float64
ScanCount int64
DepartmentDescription object
FinelineNumber float64
dtype: object
(647054, 7)
nan
nan
(647054, 7)
[Finished in 2.2s]
From the result, the dropna line doesn't work: the row count doesn't change and there are still NaN values in the dataframe. How come? I am going crazy right now.
You need to read the documentation (emphasis added):
Return object with labels on given axis omitted
dropna returns a new DataFrame. If you want it to modify the existing DataFrame, all you have to do is read further in the documentation:
inplace : boolean, default False
If True, do operation inplace and return None.
So to modify it in place, do traindataset.dropna(how='any', inplace=True).
pd.DataFrame.dropna uses inplace=False by default. This is the norm with most Pandas operations; exceptions do exist, e.g. update.
Therefore, you must either assign back to your variable, or state explicitly inplace=True:
df = df.dropna(how='any') # assign back
df.dropna(how='any', inplace=True) # set inplace parameter
Stylistically, the former is often preferred, as it supports method chaining, and the latter rarely yields any significant performance benefit.
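A small self-contained check (with made-up data) showing that the returned frame, not the original, has the rows removed:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
cleaned = df.dropna(how='any')

print(df.shape)       # (3, 1) -- the original frame is untouched
print(cleaned.shape)  # (2, 1) -- the returned copy has the NaN row dropped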
Alternatively, you can also use the notnull() method to select the rows which are not null.
For example, if you want to select the non-null values from the columns country and variety of the dataframe reviews:
answer=reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
But this only selects the relevant data; to remove null values you should use the dropna() method.
This is my first post. I just spent a few hours debugging this exact issue and I would like to share how I fixed it.
I was converting my entire dataframe to strings and then placing those values back into the dataframe, using code similar to what is displayed below (please note, the code below only converts the values to strings):
row_counter = 0
for ind, row in dataf.iterrows():
    cell_value = str(row['column_header'])
    dataf.loc[row_counter, 'column_header'] = cell_value
    row_counter += 1
After converting the entire dataframe to a string, I then used the dropna() function. The values that were previously NaN (considered a null value by pandas) were converted to the string 'nan'.
In conclusion, drop blank values FIRST, before you start manipulating data in the CSV and converting its data type.
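A minimal reproduction of that pitfall (made-up data): once NaN has been converted to the string 'nan', dropna no longer treats it as missing.

import numpy as np
import pandas as pd

df = pd.DataFrame({'column_header': [1.0, np.nan]})
df['column_header'] = df['column_header'].astype(str)  # NaN becomes the string 'nan'

print(df.dropna().shape)  # (2, 1) -- nothing is dropped; 'nan' is just text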
