Polars dataframe doesn't drop column - python

I have a function in a script that I am testing and the df.drop() function is not working as expected.
app.py
def area(df,value):
    df["area"] = df['geo'].apply(lambda row:to_area(row))
    df["area"] = df["area"].apply(lambda row: abs(row - mean))
    df = df.filter(pl.col("area") < value)
    df = df.drop("area")
    return df
test.py
def test():
    df = some df
    res = area(df,2)
    res_2 = area(df,4)
At res_2, I keep getting the "area" column back in the dataframe, which is causing me problems with type checking. Any ideas on what might be causing this? I know that using df.clone() works, but I don't understand what is causing this issue with how things are set up.

Related

Ungroup pandas dataframe after bfill

I'm trying to write a function that will backfill columns in a dataframe adhering to a condition. The upfill should only be done within groups. I am, however, having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that raises an AttributeError.
Accessing the original df through result.obj doesn't lead to the updated value because there is no inplace for the groupby bfill.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning the dataframe column in the function doesn't work because a GroupBy object doesn't support item assignment.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:
def test_upfill():
    df = DataFrame({
        "id":[1,2,3,4,5],
        "group":[1,2,2,3,3],
        "x_value": [4,4,None,None,5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4,4,None,5,5]))
You should use 'transform' method on the grouped DataFrame, like this:
import pandas as pd
def test_upfill():
    df = pd.DataFrame({
        "id":[1,2,3,4,5],
        "group":[1,2,2,3,3],
        "x_value": [4,4,None,None,5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4,4,None,5,5]))
test_upfill()
Here you can find more information about the transform method on GroupBy objects.
Based on the accepted answer, this is the full solution I arrived at, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x:x.bfill())
    return df
def test_upfill():
    df = DataFrame({
        "id":[1,2,3,4,5],
        "group":[1,2,2,3,3],
        "x_value": [4,4,None,None,5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4,4,None,5,5]))
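One way to sidestep the obj attribute entirely is to pass the plain DataFrame and do the grouping inside the function. The following is only a sketch under that assumption: the group_col and prefix parameters are names introduced here for illustration and are not part of the original question.

```python
import pandas as pd


def upfill(df: pd.DataFrame, group_col: str = "group", prefix: str = "x") -> pd.DataFrame:
    # Select the columns to backfill by their name prefix.
    columns = [column for column in df.columns if column.startswith(prefix)]
    out = df.copy()  # leave the caller's frame unchanged
    # GroupBy.bfill fills within each group and keeps the original row index,
    # so the result can be assigned straight back to the copied frame.
    out[columns] = df.groupby(group_col)[columns].bfill()
    return out


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": [1, 2, 2, 3, 3],
    "x_value": [4, 4, None, None, 5],
})
result = upfill(df)
```

The result is an ordinary DataFrame, so there is nothing to "ungroup" afterwards.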

pmdarima: Apply .predict method via .groupby and .apply to auto_arima output stored rowwise in a pd.DataFrame

I'm using auto_arima via pmdarima to fit multiple time series via a groupby. This is to say, I have a pd.DataFrame of stacked time-indexed data, grouped by the column variable, and have successfully applied transform(pm.auto_arima) to each. The reproducible example finds boring best ARIMA models, but the idea seems to work. I now want to apply .predict() similarly, but cannot get it to play nice with apply / lambda(x) / their combinations.
The code below works until the # Forecasting - help! section. I'm having trouble catching the correct object (apparently) in the apply. How might I adapt one of test1, test2, or test3 to get what I want? Or, is there some other best-practice construct to consider? Is it better across columns (without a melt)? Or via a loop?
Ultimately, I hope that test1, say, is a stacked pd.DataFrame (or pd.Series at least) with 8 rows: 4 forecasted values for each of the 2 time series in this example, with an identifier column variable (possibly tacked on after the fact).
import pandas as pd
import pmdarima as pm
import itertools
# Get data - this is OK.
url = 'https://raw.githubusercontent.com/nickdcox/learn-airline-delays/main/delays_2018.csv'
keep = ['arr_flights', 'arr_cancelled']
# Setup data - this is OK.
df = pd.read_csv(url, index_col=0)
df.index = pd.to_datetime(df.index, format = "%Y-%m")
df = df[keep]
df = df.sort_index()
df = df.loc['2018']
df = df.groupby(df.index).sum()
df.reset_index(inplace = True)
df = df.melt(id_vars = 'date', value_vars = df.columns.to_list()[1:])
# Fit auto.arima for each time series - this is OK.
fit = df.groupby('variable')['value'].transform(pm.auto_arima).drop_duplicates()
fit = fit.to_frame(name = 'model')
fit['variable'] = keep
fit.reset_index(drop = True, inplace = True)
# Setup forecasts - this is OK.
max_date = df.date.max()
dr = pd.to_datetime(pd.date_range(max_date, periods = 4 + 1, freq = 'MS').tolist()[1:])
yhat = pd.DataFrame(list(itertools.product(keep, dr)), columns = ['variable', 'date'])
yhat.set_index('date', inplace = True)
# Forecasting - help! - Can't get any of these to work.
def predict_fn(obj):
    return(obj.loc[0].predict(4))
predict_fn(fit.loc[fit['variable'] == 'arr_flights']['model']) # Appears to work!
test1 = fit.groupby('variable')['model'].apply(lambda x: x.predict(n_periods = 4)) # Try 1: 'Series' object has no attribute 'predict'.
test2 = fit.groupby('variable')['model'].apply(lambda x: x.loc[0].predict(n_periods = 4)) # Try 2: KeyError
test3 = fit.groupby('variable')['model'].apply(predict_fn) # Try 3: KeyError
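A note on the errors above: groupby(...).apply hands each group to the function as a Series, which is why x.predict fails, and x.loc[0] looks up the index label 0, which only the first row of fit has (x.iloc[0] would select by position instead). Since fit already holds one model per row, a sketch that skips groupby and walks the rows may be simpler. ConstantModel below is a hypothetical stand-in for a fitted pmdarima model, used only so the example runs without fitting anything; the yhat column name is also an assumption.

```python
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for a fitted model: predicts a constant level."""

    def __init__(self, level):
        self.level = level

    def predict(self, n_periods):
        return [self.level] * n_periods


# Same layout as fit in the question: one row per series, model stored rowwise.
fit = pd.DataFrame({
    "variable": ["arr_flights", "arr_cancelled"],
    "model": [ConstantModel(10.0), ConstantModel(2.0)],
})

# Build one small frame per model, then stack them.
frames = [
    pd.DataFrame({"variable": row.variable, "yhat": row.model.predict(n_periods=4)})
    for row in fit.itertuples(index=False)
]
forecasts = pd.concat(frames, ignore_index=True)
```

With two models and four periods each, forecasts is a stacked frame of 8 rows with the identifier column variable already attached.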

issue in writing function to filter rows data frame

I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df ['Xstart'] = [1,2.5,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
df ['GW'] = [1,1,2,3,4]
def filter(data,Game_week):
    pass_data = data [(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I use manual filter, it works.
pass_data = df [(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows for multiple GW values (1, 2, 3, etc.).
I can do that manually as follows:
pass_data = df [(df['GW'] == [1])|(df['GW'] == [2])|(df['GW'] == [3])]
If I want to pass the values to the function as a list [1,2,3],
how can I write the function so that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:
def filter_vals(data,Game_week):
    return data[data['GW'].isin(Game_week)]
df1 = filter_vals(df,range(1,4))
Your function doesn't return anything, so the result is None rather than the desired dataframe. Add a return (note that the parentheses inside data[...] are also unnecessary):
def filter(data,Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use -
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function -
df1 = filter(df,[1,2])
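Pulling the answers above together, here is a small sketch that accepts either a single game week or several. The name filter_game_weeks and the int check are choices made for this example, not something taken from the answers.

```python
import pandas as pd


def filter_game_weeks(data, game_weeks):
    # Wrap a single game week in a list so isin can handle both cases.
    if isinstance(game_weeks, int):
        game_weeks = [game_weeks]
    # isin keeps the rows whose GW value is in the given collection.
    return data[data["GW"].isin(list(game_weeks))]


df = pd.DataFrame({
    "Xstart": [1, 2.5, 3, 4, 5],
    "Xend": [6, 8, 9, 10, 12],
    "GW": [1, 1, 2, 3, 4],
})
one_week = filter_game_weeks(df, 1)
several_weeks = filter_game_weeks(df, range(1, 4))  # game weeks 1, 2 and 3
```

Using a new name also avoids shadowing Python's built-in filter.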

Python Pandas rolling mean DataFrame Constructor not properly called

I am trying to create a simple time series of different rolling types. One specific example is a rolling mean of N periods using the pandas Python package.
I get the following error: ValueError: DataFrame constructor not properly called!
Below is my code:
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType) # ascending/descending flag
    M = pd.Series(df['Close'].rolling(n), name = 'MovingAverage_' + str(n))
    df = df.join(M)
    df = df.sort_index(ascending=True) #need to double-check this
    return df
Would anyone be able to advise?
Kind regards
Found the correction! It was erroring out (with a new error) until I explicitly declared n as an integer. The code below works:
#xw.func
#xw.arg('n', numbers = int, doc = 'this is the rolling window')
#xw.ret(expand='table')
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType) # ascending/descending flag
    M = pd.Series(df['Close'], name = 'Moving Average').rolling(window = n).mean()
    #df = pd.Series(df['Close']).rolling(window = n).mean()
    df = df.join(M)
    df = df.sort_index(ascending=True) #need to double-check this
    return df
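For reference, rolling(n) on its own returns a window object rather than values; it is the aggregation called on it, such as .mean(), that produces a Series. A stripped-down sketch of the same idea without the sorting steps follows. The function name rolling_mean and the int(n) cast are assumptions made for this example.

```python
import pandas as pd


def rolling_mean(values, n):
    df = pd.DataFrame({"Close": values})
    # Cast n defensively, since the window size must be an integer here.
    window = int(n)
    # rolling(...) builds the window object; .mean() turns it into values.
    df[f"MovingAverage_{window}"] = df["Close"].rolling(window=window).mean()
    return df


ma = rolling_mean([1, 2, 3, 4], 2)
```

The first n - 1 rows of the new column are NaN because the window is not yet full.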

what is the source of this error: python pandas

import pandas as pd
census_df = pd.read_csv('census.csv')
#census_df.head()
def answer_seven():
    census_df_1 = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    census_df_1['highest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].max()
    census_df_1['lowest'] =census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].min()
    x = abs(census_df_1['highest'] - census_df_1['lowest']).tolist()
    return x[0]
answer_seven()
This tries to use the data from census.csv to find the counties that have the largest absolute change in population within 2010-2015 (the POPESTIMATE columns). I simply wanted to find the absolute difference between the max and min values across those year columns. The function must return a string. Also, [(census_df['SUMLEV'] == 50)] means only counties are taken, as their SUMLEV is set to 50. But the code gives an error that ends with
KeyError: "['POPESTIAMTE2010' 'POPESTIAMTE2011' 'POPESTIAMTE2012'
'POPESTIAMTE2013'\n 'POPESTIAMTE2014' 'POPESTIAMTE2015'] not in index"
Am I indexing the wrong data structure? I'm really new to data science and coding.
I think the column names in the code have a typo. The pattern is 'POPESTIMATE201?' and not 'POPESTIAMTE201?'
Any help with shortening the code will be appreciated. Here is the code that works -
census_df = pd.read_csv('census.csv')
def answer_seven():
    cdf = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    cdf['big'] = cdf[columns].max(axis =1)
    cdf['sml'] = cdf[columns].min(axis =1)
    cdf['change'] = cdf[['big']].sub(cdf['sml'], axis=0)
    return cdf['change'].idxmax()
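On shortening it: one possible sketch computes the row-wise spread directly instead of storing helper columns. The three-row census_df below is made-up data standing in for census.csv, which is not available here, so only the shape of the approach should be taken from it.

```python
import pandas as pd

# Generate the six column names instead of typing them out.
columns = [f"POPESTIMATE{year}" for year in range(2010, 2016)]

# Hypothetical three-row stand-in for census.csv (values are made up).
census_df = pd.DataFrame(
    [
        [40, "Alabama", 100, 110, 120, 130, 140, 150],
        [50, "Autauga County", 10, 11, 12, 11, 10, 12],
        [50, "Baldwin County", 20, 22, 25, 27, 26, 29],
    ],
    columns=["SUMLEV", "CTYNAME", *columns],
)


def answer_seven():
    counties = census_df[census_df["SUMLEV"] == 50].set_index("CTYNAME")
    # axis=1 takes the max and min across the year columns of each row.
    spread = counties[columns].max(axis=1) - counties[columns].min(axis=1)
    return spread.idxmax()
```

Generating the column names from the years also removes the chance of a typo like the one that caused the KeyError.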
