SettingWithCopyWarning occurs when using Pandas extract() within a function - python

In Pandas I'm trying to filter my DataFrame by date and then extract a reportId string (i.e. the 6 alphanumeric characters between dashes) from a longer string; however, when I run the code below I get the warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
import pandas as pd

list_date = [1632309961, 1632310980, 1632311134, 1632411137,
             1632411139, 1632411142, 1632411144, 1632411146,
             1632413166, 1632413427]
list_id = ['se-84c735-hg5675', 'se-5f73s9-hg3465', 'se-1f34g6-hg3455', 'se-09f67s-hg5123',
           'se-5g63g9-hg1235', 'se-47h8h0-hg5555', 'se-h901h3-hg6755', 'se-287n54-hg5321',
           'se-g357a8-hg6675', 'se-56q89r-hg5767']
df = pd.DataFrame([list_date, list_id], index=['date_unix', 'id']).T
def test_extract(df):
    df['date'] = pd.to_datetime(df['date_unix'], unit='s')
    df = df[df['date'] >= pd.to_datetime('2021-09-23')]
    df['reportId'] = df['id'].str.extract(r"([a-zA-Z0-9]{6})")
    return df
test_extract(df)
I've tried a few different fixes, like writing the date filter with .loc[row_indexer, col_indexer] or appending .copy() after everything; however, I get the same warning:
def test_extract(df):
    df['date'] = pd.to_datetime(df['date_unix'], unit='s')
    df = df.loc[df['date'] >= pd.to_datetime('2021-09-23'), :]
    df['reportId'] = df['id'].str.extract(r"([a-zA-Z0-9]{6})")
    return df
Strangely, when I run this same code outside of a function I no longer receive the warning. Can anyone provide me with a solution for avoiding this warning while the code is in the function?
Info:
Pandas 0.23.4 :: Python 3.7.10
OS: Linux (Ubuntu 16.04.7 LTS)

I found a fix to this issue; however, I'm still unsure why this solution works when the others did not. I simply moved the Pandas regex extraction before the date filter. It makes sense that df['reportId'] is no longer being created on a copy of a slice, but I still don't know why writing the date filter with .loc didn't solve this. If anyone has insight, I welcome your comment.
def test_extract(df):
    df['reportId'] = df['id'].str.extract(r"([a-zA-Z0-9]{6})")
    df['date'] = pd.to_datetime(df['date_unix'], unit='s')
    df = df[df['date'] >= pd.to_datetime('2021-09-23')]
    return df
test_extract(df)
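For completeness, the other standard remedy is to take an explicit copy right after the filter, so the later column assignment happens on an independent DataFrame rather than a possible view. A minimal sketch (the asker reports mixed results with .copy() on pandas 0.23.4, so treat this as the usual fix rather than a guaranteed one):
def test_extract(df):
    df['date'] = pd.to_datetime(df['date_unix'], unit='s')
    # .copy() detaches the filtered frame from df, so the column
    # assignment below is no longer a write to a copy of a slice
    df = df[df['date'] >= pd.to_datetime('2021-09-23')].copy()
    df['reportId'] = df['id'].str.extract(r"([a-zA-Z0-9]{6})")
    return df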

Getting errors while running in Jupyter notebook

I'm having trouble running this code in Python. This is my code:
import pandas as pd
import numpy as np
stars_with_planet = pd.read_csv(r'C:\Users\Stars\starswithplanet.csv')
df1 = pd.DataFrame(stars_with_planet)
stars_without_planet = pd.read_csv(r'C:\Users\Stars\starswithoutplanet.csv')
df2 = pd.DataFrame(stars_without_planet)
df3 = df1.loc[(df1['TeffK'] >= 3500) & (df1['TeffK'] <= 5400)]
df4 = df2.loc[(df2['TeffK'] >= 3500) & (df2['TeffK'] <= 5400)]
df3['check'] = df3[['[Fe/H]']].apply(tuple, axis=1)\
.isin(df4[['[Fe/H]']].apply(tuple, axis=1))
It is showing the following error after the last line:
C:\Users\AG\AppData\Local\Temp/ipykernel_5940/3520898032.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/user_guide/…
  df3['check'] = df3[['[Fe/H]']].apply(tuple, axis=1)\
Please help; I am using a Jupyter notebook.
The CSV Files are attached below:
https://drive.google.com/file/d/1eDf2G969tdaxZrM9mQXk3mSKHrjABRUQ/view?usp=sharing
https://drive.google.com/file/d/1t8OZGgxaXbbp5X-9Ms8NJd4AZfYUMOGC/view?usp=sharing
The one you are showing at the line below is not an error:
df3['check'] = df3[['[Fe/H]']].apply(tuple, axis=1).isin(df4[['[Fe/H]']].apply(tuple, axis=1))
It's a warning:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The pandas SettingWithCopyWarning warns you that you may be doing some chained assignments that may not work as expected.
Basically, the issue is that modifications to df3 will not propagate back to your original df1.
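To make the chained-assignment pattern concrete, here is a minimal illustration (the 'flag' column name is made up for the example):
import pandas as pd

df1 = pd.DataFrame({'TeffK': [3000.0, 4000.0, 5000.0]})

# Chained assignment: the boolean filter returns a temporary copy,
# so this write is silently lost and triggers SettingWithCopyWarning
df1[df1['TeffK'] >= 3500]['flag'] = True
print(df1)  # df1 is unchanged; no 'flag' column was added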
If you don't care about keeping df1 updated, but you only care about df3, you could do this:
df3 = df1.loc[(df1['TeffK'] >= 3500) & (df1['TeffK'] <= 5400)].copy()
...
df3['check'] = df3[['[Fe/H]']].apply(tuple, axis=1).isin(df4[['[Fe/H]']].apply(tuple, axis=1))
Otherwise, you can do as the warning suggests. I'm not entirely sure what your expected outcome is, but the code below updates df1 directly:
df1.loc[(df1['TeffK'] >= 3500) & (df1['TeffK'] <= 5400), 'check'] = df1[['[Fe/H]']].apply(tuple, axis=1).isin(df4[['[Fe/H]']].apply(tuple, axis=1))
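Note that with this .loc-based assignment, the rows of df1 that fall outside the temperature mask get NaN in the new 'check' column, since only the selected rows are written.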

Error when filtering DataFrame with a call to loc

I am a complete Python and Pandas novice. I am following a tutorial, and so far have the following code:
import numpy as np
import pandas as pd
import plotly as pyplot
import datetime
df = pd.read_csv("GlobalLandTemperaturesByCountry.csv")
df = df.drop("AverageTemperatureUncertainty", axis=1)
df = df.rename(columns={"dt": "Date"})
df = df.rename(columns={"AverageTemperature": "AvTemp"})
df = df.dropna()
df_countries = df.groupby(["Country", "Date"]).sum().reset_index().sort_values("Date", ascending=False)
start_date = "2001-01-01"
end_date = "2002-01-01"
mask = (df_countries["Date"] > start_date) & (df_countries["Date"] <= end_date)
df_mask = df_countries.loc(mask)
When I try and run the code, I get an error on the last line, i.e. df_mask = df_countries.loc(mask), the error being:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have already found several StackOverflow answers for this error, but none seem to match my scenario enough to help. Why am I getting this error?
In the example above, df_countries is a DataFrame and mask is a boolean condition to apply to it. The error comes from calling .loc with parentheses: .loc is an indexer and must be used with square brackets. With parentheses, pandas tries to treat the Series argument as a hashable value, and a Series is mutable, so its hash could change and it cannot be hashed.
Try:
df_mask = df_countries.loc[mask]
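A minimal self-contained sketch of the bracket form (the column values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Date": ["2000-06-01", "2001-06-01", "2001-12-01"],
})
mask = (df["Date"] > "2001-01-01") & (df["Date"] <= "2002-01-01")

# Square brackets index .loc with the boolean Series;
# parentheses would call .loc like a function and raise the TypeError above
df_mask = df.loc[mask]
print(df_mask)  # keeps only rows B and C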

mask function doesn't get rid of unwanted data

I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just NaN.
I tried to remove them with these lines:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, being analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't strings; strictly speaking, they aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
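A tiny illustrative example (with made-up values) showing the difference between the string comparison and a real missing-value check:
import numpy as np
import pandas as pd

temp_data = pd.DataFrame({"value": [21.5, np.nan, 22.1]})

# isnull() marks real missing values; comparing to the string 'NaN' matches nothing
print(temp_data["value"].isnull())  # False, True, False

# dropna() removes the invalid rows entirely
only_valid = temp_data.dropna()
print(only_valid)  # keeps the rows with 21.5 and 22.1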

Warning - value is trying to be set on a copy of a slice

I get the warning below when I run this code. I have tried every solution I can think of but cannot get rid of it. Kindly help!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
import math
task2_df['price_square'] = None
i = 0
for row in data.iterrows():
    task2_df['price_square'].at[i] = math.pow(task2_df['price'][i], 2)
    i += 1
For starters, I don't see your error on Pandas v0.19.2 (tested with the code at the bottom of this answer), but that's probably irrelevant to solving your issue. You should avoid iterating over rows in Python-level loops; the NumPy arrays that back Pandas are specifically designed for vectorized numerical computation:
df = pd.DataFrame({'price': [54.74, 12.34, 35.45, 51.31]})
df['price_square'] = df['price'].pow(2)
print(df)
   price  price_square
0  54.74     2996.4676
1  12.34      152.2756
2  35.45     1256.7025
3  51.31     2632.7161
Test on Pandas v0.19.2 with no warnings / errors:
import math
import pandas as pd

df = pd.DataFrame({'price': [54.74, 12.34, 35.45, 51.31]})
df['price_square'] = None
i = 0
for row in df.iterrows():
    df['price_square'].at[i] = math.pow(df['price'][i], 2)
    i += 1
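On newer pandas versions, where the loop above does trigger the warning, a minimal warning-free sketch (assuming task2_df may itself have been sliced from a larger DataFrame) is:
# Take an explicit copy so the assignment is not a chained write,
# then compute the column vectorized instead of looping
task2_df = task2_df.copy()
task2_df['price_square'] = task2_df['price'].pow(2)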

Python Pandas filtering dataframe on date

I am trying to manipulate a CSV file on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows of course).
These are the columns, with a sample row of values:
Circuit    Status         Effective Date
XXXX001    Operational    31-DEC-2007
I tried DataFrame.query (which I use for everything else) without success.
I tried DataFrame.loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? And if I have other conditions to filter the DataFrame, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != @status1 and Status != @status2 and Pool == @pool and Region_A == @region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == @status1 and Effective_Date < @date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but I was not able to put it together. This is just one of many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv will be saved, but the dates present are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime and then filter/slice on it:
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
# This returns all the rows with dates before the 31st of December, 2017.
# You can also use query().
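Since the sample data stores dates like 31-DEC-2007, here is a hedged sketch that parses that format explicitly and combines the date filter with the other conditions via query(); the format string and column names are assumptions based on the question:
import pandas as pd

df = pd.read_csv("example.csv", dtype=object)
df.columns = df.columns.map(lambda x: x.replace(' ', '_'))

# '%d-%b-%Y' matches values like 31-DEC-2007 (month-name matching is
# case-insensitive); adjust if the real data differs
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d-%b-%Y')

status1 = 'Suppressed'
date1 = pd.to_datetime('31-DEC-2017', format='%d-%b-%Y')

# @ refers to local Python variables inside a query string
supp_df = df.query('Status == @status1 and Effective_Date < @date1')
supp_df.to_csv('supp.csv', index=False)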
