Update a global data frame through a class method - python

When I instantiate the following class, I would like to append rows to the data frame passed as its first argument, tagged with the name passed as its second argument.
My problem is that I want to update the state of the data frame that is passed in, but after calling the method the original data frame is unchanged.
import pandas as pd

class RecordClass(object):
    def __init__(self, df, name):
        self.name = name
        self.df = df

    def write_method(self, *args):
        keys = ['key1', 'key2', 'key3']
        dictionary = dict()
        dictionary['name'] = self.name
        for idx, key in enumerate(keys):
            dictionary[key] = args[idx]
        self.df = self.df.append(dictionary, ignore_index=True)
        df = self.df[keys]
        return self.df

df1 = pd.DataFrame()
data = [1, 2, 3]
instance1 = RecordClass(df1, 'a')
print(instance1.write_method(*data))
print()
print(df1)
The result I get is:
   key1  key2  key3 name
0   1.0   2.0   3.0    a

Empty DataFrame
Columns: []
Index: []
which means that df1 is not updated. How can I update df1 after calling the write_method method, without the assignment df1 = instance1.write_method(...)?

You can't. append is not an in-place operation: it returns a new DataFrame, and when you reassign self.df with the result of the append you simply bind self.df to that new, completely different object. The original object that self.df (and df1) pointed to is never changed.
If this functionality is important to you, I recommend creating a DataFrame with NaN entries up front and filling it in place using loc or [:] assignment.
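For illustration, here is a minimal sketch of that idea. It assumes you can preallocate the rows in advance; the extra row argument to write_method and the preallocated column list are assumptions made for the example, not part of the original code.
import pandas as pd

class RecordClass(object):
    def __init__(self, df, name):
        self.name = name
        self.df = df

    def write_method(self, row, *args):
        # fill an existing row in place: self.df and df1 remain the same object
        keys = ['key1', 'key2', 'key3']
        self.df.loc[row, 'name'] = self.name
        for idx, key in enumerate(keys):
            self.df.loc[row, key] = args[idx]
        return self.df

# preallocate a frame full of NaN entries, then fill it in place
df1 = pd.DataFrame(index=range(1), columns=['name', 'key1', 'key2', 'key3'])
instance1 = RecordClass(df1, 'a')
print(instance1.write_method(0, 1, 2, 3))
print(df1)  # df1 now shows the filled row as well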

Related

Ungroup pandas dataframe after bfill

I'm trying to write a function that will backfill columns in a dataframe, adhering to a condition. The backfill should only be done within groups. I am, however, having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that raises an AttributeError.
Accessing the original df through result.obj doesn't show the updated values either, because there is no inplace option for the groupby bfill.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning the dataframe column in the function doesn't work either, because a groupby object doesn't support item assignment.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:
def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4, 4, None, 5, 5]))
You should use the transform method on the grouped DataFrame, like this:
import pandas as pd

def test_upfill():
    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4, 4, None, 5, 5]))

test_upfill()
Here you can find more information about the transform method on GroupBy objects.
Based on the accepted answer, this is the full solution I ended up with, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x: x.bfill())
    return df

def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4, 4, None, 5, 5]))

Python assign different variables to a class object

This is a general Python question. Is it possible to assign different variables to a class object and then perform a different set of operations on each of those variables? I'm trying to reduce code, but maybe this isn't how it works. For example, I'm trying to do something like this:
Edit: here is an abstract of the class and methods:
class Class:
    def __init__(self, df):
        self.df = df

    def query(self, query):
        self.df = self.df.query(query)
        return self

    def fill(self, filter):
        self.df.update(self.df.filter(like=filter).mask(lambda x: x == 0).ffill(1))
        return self

    def diff(self, cols=None, axis=1):
        diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
        self.df = diff.join(self.df[self.df.columns.difference(diff.columns)])
        return self

    def melt(self, cols, var=None, value=None):
        return pd.melt(self.df, id_vars=cols, var_name=var, value_name=value)
I'm trying to use it like this:
df = pd.read_csv('data.csv')
df = Class(df)
df = df.query(query).forward_fill(include)
df_1 = df.diff(cols).melt(cols)
df_2 = df.melt(cols)
df_1 and df_2 should have different values; however, df_2 comes out the same as df_1. This issue is resolved if I use the class like this:
df_1 = pd.read_csv('data.csv')
df_2 = pd.read_csv('data.csv')
df_1 = Class(df_1)
df_2 = Class(df_2)
df_1 = df_1.query(query).forward_fill(include)
df_2 = df_2.query(query).forward_fill(include)
df_1 = df_1.diff(cols).melt(cols)
df_2 = df_2.melt(cols)
This results in extra code. Is there a better way to do this, where you can use one object differently through different variables? Or do I have to create separate objects if I want two variables that perform separate operations and return different values?
With the return self statement in the diff method you return a reference to the same object. The same thing happens with the melt method. But in those two methods you have already manipulated the original df.
Here:
df = pd.read_csv('data.csv')

df = Class(df)
df = df.query(query).forward_fill(include)

df_1 = df.diff(cols).melt(cols)
df already has the same values as df_1. I guess the melt method with no arguments other than cols essentially only assigns column names or something like that, so df_2 = df.melt(cols) produces the same result as df_2 = df_1.melt(cols).
If you want to work with one object, you shouldn't use self.df = ... in your class methods, because that changes the instance's df. Instead, just write df = ... and then return Class(df).
For example:
def diff(self, cols=None, axis=1):
    diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
    df = diff.join(self.df[self.df.columns.difference(diff.columns)])
    return Class(df)
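A short usage sketch under the assumption that the other methods (query, fill/forward_fill) are rewritten the same way, returning a fresh Class(df) instead of mutating self.df:
base = Class(pd.read_csv('data.csv')).query(query).forward_fill(include)
df_1 = base.diff(cols).melt(cols)   # diff works on a copy wrapped in a new Class
df_2 = base.melt(cols)              # base.df was never mutated, so df_2 differs from df_1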
Best regards

Pandas - Incrementally add to DataFrame

I'm trying to add rows and columns to a pandas DataFrame incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across these datastores, I'd like to be able to incrementally update a dataframe, where in some cases either names or days will be missing.
def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
Or try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
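For illustration, a minimal sketch of how either aggregation collapses the duplicate name rows; the hard-coded values stand in for the random data produced by foo(), and name is kept as a regular column here:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Bill', 'Bill'],
    '2016_1': [15.0, np.nan],
    '2016_2': [np.nan, 12.0],
    '2016_3': [np.nan, np.nan],
})

# both collapse the two 'Bill' rows into a single row
print(df.groupby('name').sum(min_count=1))
print(df.pivot_table(index='name', aggfunc='sum', dropna=False))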
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) with a groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is at most one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of a column are null except one, x.mean() will return that value (actually, you can use almost any aggregator, since there is only one value, and that is the one returned).
It would be easier to have the names as columns and the date as the index instead. Plus, you can work with lists inside the loop and create the pd.DataFrame afterwards.
e.g.
year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []

for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # column name
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df.set_index(pd.to_datetime(index))
You can append entries whose names do not already exist, and then do an update to update the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    # df.set_index('name', inplace=True, drop=True)
    print(df)

Why does DataFrame.replace() within a function not change the input DataFrame

Here's the test code
import pandas as pd

df1 = pd.DataFrame({'Country': ['U.S.A.']})
df2 = df1.copy()
df3 = df1.copy()

def replace1(df, col, mapVals):
    df = df.replace({col: mapVals})

def replace2(df, col, mapVals):
    return df.replace({col: mapVals})

def replace3(df, col, mapVals):
    df.replace({col: mapVals}, inplace=True)

replace1(df1, 'Country', {'U.S.A.': 'USA'})
df2 = replace2(df2, 'Country', {'U.S.A.': 'USA'})
replace3(df3, 'Country', {'U.S.A.': 'USA'})

print(df1)
print(df2)
print(df3)
df1 produces "U.S.A." while df2 and df3 produce "USA"
I don't understand why setting the DataFrame within the replace1() function doesn't work. Isn't replace2() effectively the same as replace1()?
I'm new to DataFrame. Please point out my stupidity.
In the function replace1, you are binding the output of df.replace({col: mapVals}) to a new local variable that happens to have the same name: df. That is, you are not altering the values of the original object that you provide as input.
Essentially this is what you are doing:
def replace1(df, col, mapVals):
    temp = df.replace({col: mapVals})
    df = temp  # rebinds the local name df; the caller's object is untouched
So df is no longer the same object.
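To make that rebinding visible, here is a minimal stand-alone sketch; the id() calls just show that the local name ends up pointing at a different object while the caller's variable keeps the original one:
import pandas as pd

def rebind(df):
    print('inside, before:', id(df))
    df = df.replace({'Country': {'U.S.A.': 'USA'}})  # rebinds the local name only
    print('inside, after: ', id(df))                 # a different object

frame = pd.DataFrame({'Country': ['U.S.A.']})
print('outside:       ', id(frame))
rebind(frame)
print(frame)  # still shows U.S.A.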
This would be another alternative, however:
def replace1(df, col, mapVals):
    df.iloc[:, :] = df.replace({col: mapVals})
In replace1, you must return df (as in replace2), since your change is not done in place (as it is in replace3).
def replace1(df, col, mapVals):
    df = df.replace({col: mapVals})
    return df
And when calling it, you need to capture the returned object (see Return Values from here)
df1 = replace1(df1, 'Country', {'U.S.A.':'USA'})
As for "Isn't replace2() effectively the same as replace1()?": no. replace2 uses return to hand back the modified value, while replace1 performs the df.replace call but never returns the changed DataFrame.

How to Share Variable Across the in-class def in Python

class Dataframe:  # recommended: instantiate with the path to your csv
    """
    otg_merge = Dataframe("/Users/zachary/Desktop/otg_merge.csv")  # instantiate as a pandas dataframe
    """
    def __init__(self, filepath, filename=None):
        pd = __import__('pandas')  # import pandas when the class is instantiated
        self.filepath = filepath
        self.filename = filename

    def df(self):  # it makes the DataFrame
        df = pd.read_csv(self.filepath, encoding="cp949", index_col=0)  # index col is not included
        return df

    def shape(self):  # it returns the dimensions of the DataFrame
        shape = list(df.shape)
        return shape

    def head(self):  # it returns the head of the DataFrame
        primer = pd.DataFrame.head(df)
        del primer["Unnamed: 0"]
        return primer

    def cust_types(self):  # it returns the list of cust_type values included in the .csv
        cust_type = []
        for i in range(0, shape[0]):
            if df.at[i, "cust_type"] not in cust_type:  # if it's new...
                cust_type.append(df.at[i, "cust_type"])  # append it as a new list element
        return cust_type
I am wrapping some pandas functions for people who don't necessarily need to know pandas.
As you can see in the code, the third def, shape, returns the shape as a list such as [11000, 134], i.e. the x and y dimensions.
Now I'd like to use shape again in the last def, cust_types; however, it says that shape is not defined.
How can I share the variable "shape" across defs in the same class?
Interestingly, I didn't do anything special, but df seems to be shared from the second def (df) to the third (shape) without error.
First, prepend "self." to all your attributes, something you will pick up after trying out some Python OOP tutorials. Another issue which you might miss is:
def df(self):
    df = pd.read_csv(self.filepath, encoding="cp949", index_col=0)
    return df
Here, the method name and the variable name are the same, which is fine as long as the variable is not an instance attribute, and here it is not. But if you prepend "self." and make it an instance attribute, the attribute will be self.df, and after the first call to self.df() it will hold a DataFrame and can no longer be called as a function.
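A minimal sketch of one way to share those values as instance attributes. This reorganises the original class: the frame and its shape are computed once in __init__ and reused by the other methods; the file path, encoding and the cust_type column are taken from the question.
import pandas as pd

class Dataframe:
    """Thin wrapper around a CSV loaded into a pandas DataFrame."""

    def __init__(self, filepath, filename=None):
        self.filepath = filepath
        self.filename = filename
        # load once and keep the frame on the instance so every method can reuse it
        self.df = pd.read_csv(self.filepath, encoding="cp949", index_col=0)
        self.shape = list(self.df.shape)  # [rows, columns]

    def head(self):
        return self.df.head()

    def cust_types(self):
        # unique cust_type values, in order of first appearance
        seen = []
        for value in self.df["cust_type"]:
            if value not in seen:
                seen.append(value)
        return seen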
