How to use "apply" with a dataframe and avoid SettingWithCopyWarning? - python

I am using the following function with a DataFrame:
df['error_code'] = df.apply(lambda row: replace_semi_colon(row), axis=1)
The embedded function is:
def replace_semi_colon(row):
    errrcd = str(row['error_code'])
    semi_colon_pat = re.compile(r'.*;.*')
    if pd.notnull(errrcd):
        if semi_colon_pat.match(errrcd):
            mod_error_code = str(errrcd.replace(';', ':'))
            return mod_error_code
    return errrcd
But I am receiving the (in)famous
SettingWithCopyWarning
I have read many posts but still do not know how to prevent it.
The strange thing is that I use other apply calls the same way, but they do not raise this warning.
Can someone explain why I am getting this warning?

Before the apply there was another statement:
df = df.query('error_code != "BM" and error_code != "PM"')
I modified that to:
df.loc[:] = df.query('error_code != "BM" and error_code != "PM"')
That solved it.
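A more conventional way to silence the warning is to take an explicit copy after the filter, so pandas knows the result is an independent frame. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'error_code': ['A;B', 'BM', 'PM', 'C']})

# .copy() after the filter marks the result as an independent frame,
# so the later column assignment no longer triggers the warning.
df = df.query('error_code != "BM" and error_code != "PM"').copy()
df['error_code'] = df['error_code'].str.replace(';', ':', regex=False)
print(df['error_code'].tolist())  # ['A:B', 'C']
```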

Related

Polars dataframe doesn't drop column

I have a function in a script that I am testing and the df.drop() function is not working as expected.
app.py
def area(df, value):
    df["area"] = df['geo'].apply(lambda row: to_area(row))
    df["area"] = df["area"].apply(lambda row: abs(row - mean))
    df = df.filter(pl.col("area") < value)
    df = df.drop("area")
    return df
test.py
def test():
    df = some df
    res = area(df, 2)
    res_2 = area(df, 4)
At res_2, the "area" column keeps coming back in the dataframe, which causes problems with type checking. Any ideas on what might be causing this? I know that using df.clone() works, but I don't understand why this setup causes the issue.
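The likely cause is aliasing: df["area"] = ... mutates the very object that test() passed in, while filter and drop only rebind the local name df to new frames. A minimal plain-Python sketch of the same aliasing, with a dict standing in for the DataFrame:

```python
def area(table, value):
    # Item assignment mutates the object the caller passed in,
    # just like df["area"] = ... does on the shared dataframe.
    table["area"] = [v * 2 for v in table["geo"]]
    # Rebinding to a new object only changes the local name,
    # like df = df.drop("area") does.
    table = {k: v for k, v in table.items() if k != "area"}
    return table

df = {"geo": [1, 2]}
res = area(df, 2)
print("area" in res)  # False: the returned table dropped it
print("area" in df)   # True: the caller's object was mutated in place
```

This is why df.clone() helps: the mutation then lands on a private copy.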

Error occurred when changing a for loop to .apply()

I am currently working on finding the position (row index, column index) of the maximum cell in each column of a dataframe.
There are a lot of similar dataframes like this, so I made a function like the one below.
def FindPosition_max(series_max, txt_name):
    # series_max: only needed to get the number of columns in time_history (index starts from 1).
    time_history = pd.read_csv(txt_name, skiprows=len(series_max) + 7, header=None,
                               usecols=[i for i in range(1, len(series_max) + 1)])[:-2]
    col_index = series_max.index
    row_index = []
    for current_col_index in series_max.index:
        row_index.append(time_history.loc[:, current_col_index].idxmax())
    return row_index, col_index.tolist()
This works well, but takes too much time to run over many dataframes. I read that .apply() is much faster than a for loop, so I tried this:
def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max) + 7, header=None,
                               usecols=[i for i in range(1, len(series_max) + 1)])[:-2]
    col_index = series_max.index
    row_index = pd.Series(series_max.index).apply(lambda x: time_history.loc[:, x].idxmax())
    return row_index, series_max.index.tolist()
The error comes out like this:
File "C:\Users\hwlee\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 844, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
AttributeError: 'builtin_function_or_method' object has no attribute 'get_indexer'
I tried to find what causes this error, but it never goes away. When I test the code inside the function separately, it works well.
Could anyone help me solve this problem? Thank you!
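For what it's worth, both the loop and the .apply() can be replaced by DataFrame.idxmax, which returns the row label of the maximum for every column at once. A small sketch with made-up data:

```python
import pandas as pd

time_history = pd.DataFrame({1: [3, 9, 1], 2: [7, 2, 5]})

# idxmax over axis=0 (the default) gives, per column, the row label
# holding that column's maximum value.
row_index = time_history.idxmax()
print(row_index.tolist())  # [1, 0]
```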

What is the alternative for the deprecated Pandas.Panel

FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method.
I get the above warning whenever I run this code:
difference = pd.Panel(dict(df1=df1,df2=df2))
Can anyone please tell me an alternative way to do what this line does without Panel?
Edit-1:
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

difference = pd.Panel(dict(df1=df1, df2=df2))
res = difference.apply(report_diff, axis=0)
Here df1 and df2 contains both categorical and numerical data.
Just comparing the two dataframes here to get the differences between the two.
As stated in the docs, the recommended replacements for a pandas Panel are a MultiIndex DataFrame or the xarray library.
For your specific use case, this somewhat hacky code gets you the same result:
a = df1.values.reshape(df1.shape[0] * df1.shape[1])
b = df2.values.reshape(df2.shape[0] * df2.shape[1])
res = np.array([v if v == b[idx] else str(v) + '--->' + str(b[idx])
                for idx, v in enumerate(a)]).reshape(df1.shape[0], df1.shape[1])
res = pd.DataFrame(res, columns=df1.columns)
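Assuming df1 and df2 share the same shape, index, and columns, the same diff can also be written without the NumPy reshaping by using DataFrame.where element-wise; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
df2 = pd.DataFrame({'x': [1, 9], 'y': ['a', 'c']})

# Keep df1's value where the cells match; elsewhere build the
# "old ---> new" string, all element-wise.
diff = df1.astype(str).where(df1.eq(df2), df1.astype(str) + ' ---> ' + df2.astype(str))
print(diff.loc[1, 'x'])  # 2 ---> 9
```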

Issue writing a function to filter rows of a data frame

I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df['Xstart'] = [1, 2.5, 3, 4, 5]
df['Xend'] = [6, 8, 9, 10, 12]
df['Ystart'] = [0, 1, 2, 3, 4]
df['Yend'] = [6, 8, 9, 10, 12]
df['GW'] = [1, 1, 2, 3, 4]
def filter(data, Game_week):
    pass_data = data[(data['GW'] == Game_week)]
When I call the filter function as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I filter manually, it works:
pass_data = df[(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter rows with multiple GW values (1, 2, 3, etc.).
I can do that manually as follows:
pass_data = df[(df['GW'] == [1]) | (df['GW'] == [2]) | (df['GW'] == [3])]
If I want the function to take a list like [1, 2, 3] as input, how can I write it so that I can pass a range from 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar; also, filter is a built-in function in Python, so it is better to rename the function:
def filter_vals(data, Game_week):
    return data[data['GW'].isin(Game_week)]

df1 = filter_vals(df, range(1, 4))
Because you don't return anything from the function, it returns None rather than the desired dataframe, so do this (note also that the extra parentheses inside data[...] are unnecessary):
def filter(data, Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use -
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function -
df1 = filter(df, [1, 2])
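Putting the answers together with the sample frame from the question: isin accepts any iterable of values, so both a list and a range work.

```python
import pandas as pd

df = pd.DataFrame({'GW': [1, 1, 2, 3, 4]})

def filter_vals(data, game_weeks):
    # isin accepts any iterable of values: list, range, set, ...
    return data[data['GW'].isin(game_weeks)]

print(len(filter_vals(df, [1, 2])))       # 3
print(len(filter_vals(df, range(1, 4))))  # 4
```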

Running half-life code for a mean-reverting series

I am currently trying to compute half-life results for multiple columns of data. I have tried to adapt the code I got from 'pythonforfinance.com' Link.
However, I seem to have missed a few edits, which results in errors being thrown.
This is how my df looks like: Link
and the code I am running:
import pandas as pd
import numpy as np
import statsmodels.api as sm

df1 = pd.read_excel('C:\\Users\Sai\Desktop\Test\Spreads.xlsx')
Halflife_results = {}
for col in df1.columns.values:
    spread_lag = df1.shift(periods=1, axis=1)
    spread_lag.ix([0]) = spread_lag.ix([1])
    spread_ret = df1.columns - spread_lag
    spread_ret.ix([0]) = spread_ret.ix([1])
    spread_lag2 = sm.add_constant(spread_lag)
    md = sm.OLS(spread_ret, spread_lag2)
    mdf = md.fit()
    half_life = round(-np.log(2) / mdf.params[1], 0)
    print('half life:', half_life)
The error that is being thrown is:
File "C:/Users/Sai/Desktop/Test/Half life test 2.py", line 12
spread_lag.ix([0]) = spread_lag.ix([1])
^
SyntaxError: can't assign to function call
Based on the error message, I seem to have made a very basic mistake, but as a beginner I am not able to fix it. If not a solution, an explanation of these lines of code would be a great help:
spread_lag = df1.shift(periods=1, axis=1)
spread_lag.ix([0]) = spread_lag.ix([1])
spread_ret = df1.columns - spread_lag
spread_ret.ix([0]) = spread_ret.ix([1])
spread_lag2 = sm.add_constant(spread_lag)
As explained by the error message, pd.Series.ix isn't callable: you should change spread_lag.ix([0]) to spread_lag.ix[0].
Also, you shouldn't shift on axis=1 (rows), since you're interested in differences along each column (axis=0, the default).
Defining a get_halflife function then lets you apply it directly to each column, removing the need for a loop.
def get_halflife(s):
    s_lag = s.shift(1)
    s_lag.ix[0] = s_lag.ix[1]
    s_ret = s - s_lag
    s_ret.ix[0] = s_ret.ix[1]
    s_lag2 = sm.add_constant(s_lag)
    model = sm.OLS(s_ret, s_lag2)
    res = model.fit()
    halflife = round(-np.log(2) / res.params[1], 0)
    return halflife

df1.apply(get_halflife)
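Note that .ix has since been removed from pandas entirely; positional access is now .iloc. A sketch of the same function for current pandas, with a plain NumPy least-squares fit standing in for the statsmodels OLS so the example is self-contained:

```python
import numpy as np
import pandas as pd

def get_halflife(s):
    # .iloc replaces the removed .ix for positional access.
    s_lag = s.shift(1)
    s_lag.iloc[0] = s_lag.iloc[1]
    s_ret = s - s_lag
    s_ret.iloc[0] = s_ret.iloc[1]
    # Slope of s_ret on s_lag via least squares (statsmodels OLS in the original).
    beta = np.polyfit(s_lag, s_ret, 1)[0]
    return -np.log(2) / beta

# Toy series that halves each step, so the half-life is about one step.
s = pd.Series(100.0 * 0.5 ** np.arange(10))
print(round(get_halflife(s)))  # 1
```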
