I'm trying to write a function that will backfill columns in a dataframe adhearing to a condition. The upfill should only be done within groups. I am however having a hard time getting the group object to ungroup. I have tried reset_index as in the example bellow but that gets an AttributeError.
Accessing the original df through result.obj doesn't lead to the updated value because there is no inplace for the groupby bfill.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
for column in df.obj.columns:
if column.startswith("x"):
df[column].bfill(axis="rows", inplace=True)
return df
Assigning the dataframe column in the function doesn't work because groupbyobject doesn't support item assingment.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
for column in df.obj.columns:
if column.startswith("x"):
df[column] = df[column].bfill()
return df
The test I'm trying to get to pass:
def test_upfill():
df = DataFrame({
"id":[1,2,3,4,5],
"group":[1,2,2,3,3],
"x_value": [4,4,None,None,5],
})
grouped_df = df.groupby("group")
result = upfill(grouped_df)
result.reset_index()
assert result["x_value"].equals(Series([4,4,None,5,5]))
You should use 'transform' method on the grouped DataFrame, like this:
import pandas as pd
def test_upfill():
df = pd.DataFrame({
"id":[1,2,3,4,5],
"group":[1,2,2,3,3],
"x_value": [4,4,None,None,5],
})
result = df.groupby("group").transform(lambda x: x.bfill())
assert result["x_value"].equals(pd.Series([4,4,None,5,5]))
test_upfill()
Here you can find find more information about the transform method on Groupby objects
Based on the accepted answer this is the full solution I got to although I have read elsewhere there are issues using the obj attribute.
def upfill(df:DataFrameGroupBy)->DataFrameGroupBy:
columns = [column for column in df.obj.columns if column.startswith("x")]
df.obj[columns] = df[columns].transform(lambda x:x.bfill())
return df
def test_upfill():
df = DataFrame({
"id":[1,2,3,4,5],
"group":[1,2,2,3,3],
"x_value": [4,4,None,None,5],
})
grouped_df = df.groupby("group")
result = upfill(grouped_df)
assert df["x_value"].equals(Series([4,4,None,5,5]))
Related
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
original_columns = x.index.tolist()
squeezed = x.dropna()
squeezed.index = [original_columns[n] for n in range(squeezed.count())]
return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?
Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
original_columns = x.index.tolist()
squeezed = x.dropna()
squeezed.index = [original_columns[n] for n in range(squeezed.count())]
return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
Here's the test code
df1 = pd.DataFrame({'Country':['U.S.A.']})
df2 = df1.copy()
df3 = df1.copy()
def replace1(df, col, mapVals):
df = df.replace({col: mapVals})
def replace2(df, col, mapVals):
return df.replace({col: mapVals})
def replace3(df, col, mapVals):
df.replace({col: mapVals}, inplace=True)
replace1(df1, 'Country', {'U.S.A.':'USA'})
df2 = replace2(df2, 'Country', {'U.S.A.':'USA'})
replace3(df3, 'Country', {'U.S.A.':'USA'})
print(df1)
print(df2)
print(df3)
df1 produces "U.S.A." while df2 and df3 produce "USA"
I don't understand why setting the DataFrame within the replace1() function doesn't work. Isn't replace2() effectively the same as replace1()?
I'm new to DataFrame. Please point out my stupidity.
In the function replace1, you are setting the output of df.replace({col: mapVals}) to a new variable with the same name: df. That is, you are not altering the values of the original object that you provide as input.
Essentially this is what you are doing:
def replace1(df, col, mapVals):
temp = df.replace({col: mapVals})
df = temp # Creating a variable that will overwrite the original input variable
So df is no longer the same object.
This would be another alternative, however:
def replace1(df, col, mapVals):
df.iloc[:, :] = df.replace({col: mapVals})
In replace1, you must return df (similar to replace2), since your change is not done inplace (like you did with replace3).
def replace1(df, col, mapVals):
df = df.replace({col: mapVals})
return df
And when calling it, you need to capture the returned object (see Return Values from here)
df1 = replace1(df1, 'Country', {'U.S.A.':'USA'})
Also Isn't replace2() effectively the same as replace1()?
No. replace2 uses a return to return the modified value. While return 1 simply makes the change (df.replace) but it does not return the changed DataFrame.
I have two dataframes measuring two properties from an instrument, where the depths are offset for a certain dz. Note that the example below is extremely simplified.
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
How do I get the index of df2.depth_2 that gets closest the first element of df1.depth_1 ?
Using reindex with method nearest
df2.reset_index().set_index('depth_2').reindex(df1.depth_1,method = 'nearest')['index'].unique()
Out[265]: array([14], dtype=int64)
You can use pandas merge_asof function (you will need to order your data first if it isn't in real life)
df1 = df1.sort_values(by='depth_1')
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1, df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
if you just wanted that for the first value in df1 you could do the join on the top row:
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1.head(1), df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
Get the absolute difference between all elements of df2 and first element of df1 and then get it's index:
import pandas as pd
import numpy as np
def get_closest(df1, df2, idx):
abs_diff = np.array([abs(df1['depth_1'][idx]-item) for item in df2['depth_2']])
return abs_diff.argmin()
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
get_closest(df1,df2,0)
Output:
14
I am trying to create a simple time-series, of different rolling types. One specific example, is a rolling mean of N periods using the Panda python package.
I get the following error : ValueError: DataFrame constructor not properly called!
Below is my code :
def py_TA_MA(v, n, AscendType):
df = pd.DataFrame(v, columns=['Close'])
df = df.sort_index(ascending=AscendType) # ascending/descending flag
M = pd.Series(df['Close'].rolling(n), name = 'MovingAverage_' + str(n))
df = df.join(M)
df = df.sort_index(ascending=True) #need to double-check this
return df
Would anyone be able to advise?
Kind regards
found the correction! It was erroring out (new error), where I had to explicitly declare n as an integer. Below, the code works
#xw.func
#xw.arg('n', numbers = int, doc = 'this is the rolling window')
#xw.ret(expand='table')
def py_TA_MA(v, n, AscendType):
df = pd.DataFrame(v, columns=['Close'])
df = df.sort_index(ascending=AscendType) # ascending/descending flag
M = pd.Series(df['Close'], name = 'Moving Average').rolling(window = n).mean()
#df = pd.Series(df['Close']).rolling(window = n).mean()
df = df.join(M)
df = df.sort_index(ascending=True) #need to double-check this
return df
I have a function which I apply it on the rows of a dataframe. This function returns a list of variable length depending on a parameter.
For now I use the following example code:
import pandas as pd
df = pd.read_csv("data.csv")
def add_columns(x, amount):
return range(amount)
df["L1"], df["L2"], df["L3"] = zip(*df.apply(lambda x: add_columns(x, 3), axis=1))
Is there a way to add the labels automatically ?
Not sure if I understand your question correctly in what you want to populate your columns with but this should work:
import pandas as pd
import numpy as np
def add_columns(x, *args):
col_names = args[0]
return pd.Series({i: x for i in col_names})
def add_range(x, *args):
col_names = args[1]
return pd.Series({k: v for k,v in zip(args[1],range(args[0]))})
df = pd.DataFrame(np.random.uniform(size=(10,2)),columns=["A","B"])
labels = ["L1","L2","L3"]
# This populates with values from "A" column
df.merge(df["A"].apply(add_columns, args=([labels])),left_index=True, right_index=True)
# This populates with values from range(number_passed) function
df.merge(df["A"].apply(add_range, args=([3,labels])),left_index=True, right_index=True)