Here's the test code:
import pandas as pd

df1 = pd.DataFrame({'Country': ['U.S.A.']})
df2 = df1.copy()
df3 = df1.copy()

def replace1(df, col, mapVals):
    df = df.replace({col: mapVals})

def replace2(df, col, mapVals):
    return df.replace({col: mapVals})

def replace3(df, col, mapVals):
    df.replace({col: mapVals}, inplace=True)

replace1(df1, 'Country', {'U.S.A.': 'USA'})
df2 = replace2(df2, 'Country', {'U.S.A.': 'USA'})
replace3(df3, 'Country', {'U.S.A.': 'USA'})
print(df1)
print(df2)
print(df3)
df1 produces "U.S.A." while df2 and df3 produce "USA".
I don't understand why setting the DataFrame within the replace1() function doesn't work. Isn't replace2() effectively the same as replace1()?
I'm new to DataFrames. Please point out my stupidity.
In the function replace1, you are assigning the output of df.replace({col: mapVals}) to a new local variable with the same name: df. That is, you are not altering the values of the original object that you provide as input.
Essentially this is what you are doing:

def replace1(df, col, mapVals):
    temp = df.replace({col: mapVals})
    df = temp  # binds the local name df to a new object, overwriting the reference to the original input

So inside the function, df is no longer the same object, and the caller's DataFrame is never modified.
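To make the rebinding visible, here is a minimal sketch (the names are illustrative, not from the question) that prints the object identity before and after the assignment:

import pandas as pd

def rebind_demo(df):
    print(id(df))                                    # identity of the caller's object
    df = df.replace({'Country': {'U.S.A.': 'USA'}})  # rebinds the local name to a new DataFrame
    print(id(df))                                    # a different identity

d = pd.DataFrame({'Country': ['U.S.A.']})
rebind_demo(d)
print(d)  # still shows U.S.A. - the caller's object was never touched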
This would be another alternative, however:

def replace1(df, col, mapVals):
    df.iloc[:, :] = df.replace({col: mapVals})

Assigning through .iloc writes into the existing object's data instead of rebinding the name, so the caller's DataFrame is modified.
In replace1, you must return df (as replace2 does), since your change is not done in place (as it is in replace3):

def replace1(df, col, mapVals):
    df = df.replace({col: mapVals})
    return df

And when calling it, you need to capture the returned object:

df1 = replace1(df1, 'Country', {'U.S.A.': 'USA'})
As for "Isn't replace2() effectively the same as replace1()?"
No. replace2 uses return to hand the modified DataFrame back to the caller, while replace1 performs the replacement (df.replace) but never returns the result, so calling it yields None.
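As an illustrative check (assuming the definitions from the question), printing the return values makes the difference concrete:

print(replace1(df1, 'Country', {'U.S.A.': 'USA'}))  # None - replace1 has no return statement
print(replace2(df1, 'Country', {'U.S.A.': 'USA'}))  # the modified DataFrame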
Related
I'm trying to write a function that will backfill columns in a dataframe, adhering to a condition. The backfill should only be done within groups. I am, however, having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that gets an AttributeError.
Accessing the original df through result.obj doesn't show the updated values, because there is no inplace option for the groupby bfill.
from pandas.core.groupby import DataFrameGroupBy

def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning to the dataframe column in the function doesn't work either, because a groupby object doesn't support item assignment.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:

from pandas import DataFrame, Series

def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4, 4, None, 5, 5]))
You should use the transform method on the grouped DataFrame, like this:
import pandas as pd

def test_upfill():
    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4, 4, None, 5, 5]))

test_upfill()
You can find more information about the transform method on GroupBy objects in the pandas documentation.
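One property worth knowing, sketched below: transform returns a result aligned to the original index, so you can assign it straight back onto the source DataFrame (column and group names follow the question):

# backfill x_value within each group, writing the result back onto df
df["x_value"] = df.groupby("group")["x_value"].transform(lambda s: s.bfill())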
Based on the accepted answer, this is the full solution I ended up with, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x: x.bfill())
    return df

def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4, 4, None, 5, 5]))
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np

data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)

df["Animal"] = df['filepath'].str.contains("dog|cat", case=False, regex=True)
df["Fish"] = df['filepath'].str.contains("barracuda", case=False)

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False, np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because there are so many, I get this warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as "method object is not iterable" and "method object is not subscriptable". I then tried df = df.loc[df['filepath'].isin(['cat','dog'])], but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance warning?
Try creating all your new columns in a dict, then convert that dict into a dataframe, and use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
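With 200 conditions you would probably not write the dict out by hand; here is the same idea as a dict comprehension, where the patterns mapping is hypothetical and stands in for your real conditions:

# hypothetical mapping of new column names to search patterns
patterns = {
    'Animal': 'dog|cat',
    'Fish': 'barracuda',
    # ... the rest of your ~200 categories
}
new_columns = {
    name: df['filepath'].str.contains(pattern, case=False, regex=True)
    for name, pattern in patterns.items()
}
df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)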
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np

data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)

##### These are the new lines #####
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False, np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1)
print(df)
I have to use the same function twice: the first time with df as the argument, the second time with df3. How do I do that? The function:
def add(df, df3):
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.groupby(pd.Grouper(key="timestamp", freq="h")).agg("mean")
    price = df["price"]
    amount = df["amount"]
    return (price * amount) // amount
The double use:

out = []
# This loop will use the add(df) function for every csv and append the result to a list
for f in csv_files:
    df = pd.read_csv(f, header=0)
    # Replace empty values with NaN; not sure if useful, maybe pandas can handle this
    df.replace("", np.nan)
    # append aggregated DataFrame with new column to the list of DataFrames
    out.append(add(df))

out2 = []
df3 = pd.Series(dtype=np.float64)
for f in csv_files:
    df2 = pd.read_csv(f, header=0)
    df3 = pd.concat([df3, df2], ignore_index=True)

out2 = pd.DataFrame(add(df=df3))
out2
I got the error:
TypeError: add() missing 1 required positional argument: 'df3'
The parameter names of the add function have nothing to do with the variable names df and df3 in the rest of the script.
As @garagnoth has stated, you only need one parameter in add. You can call it df, foo or myvariablename: it is related to neither df nor df3.
In your case, you can change the add function to the following:

def add(a_dataframe):
    # the argument is named "a_dataframe" to make clear that
    # its name is not linked to any outside variable
    a_dataframe["timestamp"] = pd.to_datetime(a_dataframe["timestamp"])
    a_dataframe = a_dataframe.groupby(pd.Grouper(key="timestamp", freq="h")).agg("mean")
    price = a_dataframe["price"]
    amount = a_dataframe["amount"]
    return (price * amount) // amount
You can now call this function with df or df3 as the rest of the script already does.
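For illustration, both call sites from the question then work; one caveat is that the question's keyword call add(df=df3) must change, since the parameter is now named a_dataframe:

out.append(add(df))            # positional call: the parameter name does not matter
out2 = pd.DataFrame(add(df3))  # add(df=df3) would now raise a TypeError;
                               # use add(df3) or add(a_dataframe=df3)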
I am trying to delete a column called Rank but nothing happens. The remaining code all executes without any issue but the column itself remains in the output file. I've highlighted the part of the code that is not working.
def read_csv():
    file = "\mona" + yday + ".csv"
    #df=[]
    df = pd.read_csv(save_path + file, skiprows=3, encoding="ISO-8859-1", error_bad_lines=False)
    return df

# replace . with / in column EPIC
def tickerchange():
    df = read_csv()
    df['EPIC'] = df['EPIC'].str.replace('.', '/')
    return df

def consolidate_AB_listings():
    df = tickerchange()
    Aline = df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (£m)']
    Bline = df.loc[(df['EPIC'] == 'RDSB'), 'Mkt Cap (£m)']
    df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (£m)'] = float(Aline) + float(Bline)
    df = df.loc[(df.Ind != 'I/E')]
    df = df.loc[(df.Ind != 'FL')]
    df = df.loc[(df.Ind != 'M')]
    df = df.loc[(df.EPIC != 'RDSB')]
    return df

def ranking_mktcap():
    df = consolidate_AB_listings()
    df['Rank'] = df['Mkt Cap (£m)'].rank(ascending=False)
    df = df.loc[(df.Rank != 1)]
    df['Rank1'] = df['Mkt Cap (Em)'].rank(ascending=False)
    ## This doesn't seem to work
    df = df.drop(df['Security'], 1)
    return df

def save_outputfile():
    #df = drop()
    df = ranking_mktcap()
    df.to_csv(r'S:\Index_Analytics\UK\Index Methodology\FTSE\Py_file_download\MonitoredList.csv', index=False)
    print("finished")

if __name__ == "__main__":
    main()
    read_csv()
    tickerchange()
    consolidate_AB_listings()
    ranking_mktcap()
    save_outputfile()
DataFrame.drop() takes the following: DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise').
When you call df = df.drop(df['Security'], 1), the values of df['Security'] are used as the labels to drop, and the 1 is passed through the axis parameter.
If you want to drop the column 'Security', then you'd want to do:

df = df.drop('Security', axis=1)
# this is the same as
df = df.drop(labels='Security', axis=1)
# you can also specify the column name directly, like this
df = df.drop(columns='Security')
Note: the columns= parameter can take a single label (str) as above, or a list of column names.
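For example, to drop several columns in one call (the second name is just illustrative here):

df = df.drop(columns=['Security', 'Rank1'])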
Try replacing

df = df.drop(df['Security'], 1)

with

df.drop(['Security'], axis=1, inplace=True)
I had the same issue and all I did was add inplace=True:

df.drop('Security', axis=1, inplace=True)

Note that you should not assign the result back with df = df.drop(..., inplace=True): with inplace=True the method returns None, so that assignment would replace your DataFrame with None.
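A quick sketch of that pitfall (assuming df still has a 'Security' column):

returned = df.drop(columns='Security', inplace=True)
print(returned)                    # None - inplace operations return None
print('Security' in df.columns)    # False - df itself was modified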
Here is the problem. I use a function to return randomized data:
import numpy as np
import pandas as pd

data1 = [3, 5, 7, 3, 2, 6, 1, 6, 7, 8]
data2 = [1, 5, 2, 1, 6, 4, 3, 2, 7, 8]
df = pd.DataFrame(data1, columns=['c1'])
df['c2'] = data2

def randomize_data(df):
    df['c1_ran'] = df['c1'].apply(lambda x: x + np.random.uniform(0, 1))
    df['c1'] = df['c1_ran']
    # df.drop(['c1_ran'], 1, inplace=True)
    return df

temp_df = randomize_data(df)
display(df)
display(temp_df)
However, the df (source data) and the temp_df (randomized data) come out identical.
How can I make temp_df and df different from each other?
I find I can get rid of the problem by adding df.copy() at the beginning of the function:

def randomize_data(df):
    df = df.copy()
    ...

But I'm not sure if this is the right way to deal with it?
Use DataFrame.assign():

def randomize_data(df):
    return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))
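As a quick check (reusing df from the question), assign builds a new DataFrame and leaves the input untouched. Note this vectorized form draws a different random offset for each row, whereas a scalar np.random.uniform(0, 1) would shift every row by the same amount:

temp_df = randomize_data(df)
print(df.equals(temp_df))  # False - the source DataFrame was not modified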
I think you are right. Also, DataFrame.copy() has an optional argument deep; you can find the details at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
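A short sketch of that argument (deep=True is the default, so the df.copy() in your fix already makes an independent copy):

deep = df.copy()               # deep=True by default: data is copied, independent of df
shallow = df.copy(deep=False)  # shares the underlying data with df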