I have 8 functions that I would like to run under one main() function. The process starts with importing from a file and creating a df and then doing some cleaning operations on that df under a new function. I have copied in the basic structure including the three starting functions and then a main() function. What I am unsure about is how to 'carry' the result of loader() to clean_data() and then the result of clean_data() to operation_one() in the right way. At the moment I get an error that df is not defined. Thank you for your help!
def loader():
    import pandas as pd
    import numpy as np
    df = pd.read_excel('file_example.xlsx')
    return df

def clean_data():
    del df['column_7']
    return df

def operation_one():
    del df['column_12']
    return df

def main():
    loader()
    clean_data()
    operation_one()
    with pd.ExcelWriter("file.xlsx") as writer:
        df.to_excel(writer, sheet_name='test', index=False)

if __name__ == "__main__":
    main()
Your main function just tells the other functions to run. Each function has its own variables, which exist only inside the function that defines them. When loader() runs, it returns the value of df to the line that called it, inside main(). To keep that value in main(), write df = loader(). When you call the next functions, you need to pass this value in so they can operate on it: clean_data(df). For that to work, the function definition must also accept a parameter, so def clean_data(): becomes def clean_data(df):
This is what I have a bit cleaned up,
import pandas as pd
import numpy as np

def loader():
    df = pd.read_excel('file_example.xlsx')
    return df

def clean_data(df):
    del df['column_7']
    return df

def operation_one(df):
    del df['column_12']
    return df

def main():
    df = loader()
    df = clean_data(df)
    df = operation_one(df)
    with pd.ExcelWriter("file.xlsx") as writer:
        df.to_excel(writer, sheet_name='test', index=False)

if __name__ == "__main__":
    main()
I hope this was somewhat helpful as it is my first question answered here.
You need to make sure to assign variables for the function return values. That is how you "carry" the result. You also need to pass in those variables as function arguments as you proceed. Adding a function parameter for the filename in loader() rather than hardcoding the file in the function is probably something you'll want to think about too.
import pandas as pd
import numpy as np

def loader():
    df = pd.read_excel('file_example.xlsx')
    return df

def clean_data(df):
    del df['column_7']
    return df

def operation_one(df):
    del df['column_12']
    return df

def main():
    df = loader()
    df = clean_data(df)
    df = operation_one(df)
    with pd.ExcelWriter("file.xlsx") as writer:
        df.to_excel(writer, sheet_name='test', index=False)

if __name__ == "__main__":
    main()
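A minimal sketch of the filename-parameter idea mentioned in this answer. The parameter names input_path and output_path and the helper drop_column are made up for illustration; only the column names come from the question.

```python
import pandas as pd


def loader(input_path):
    # The caller decides which file to read; nothing is hardcoded here.
    return pd.read_excel(input_path)


def drop_column(df, column):
    # Hypothetical helper: return a copy without the given column,
    # leaving the caller's DataFrame untouched.
    return df.drop(columns=[column])


def main(input_path, output_path):
    df = loader(input_path)
    df = drop_column(df, "column_7")
    df = drop_column(df, "column_12")
    df.to_excel(output_path, sheet_name="test", index=False)
```

You would then call something like main("file_example.xlsx", "file.xlsx"), which makes it easy to reuse the same pipeline on a different file.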
Related
I have a small problem working with pandas. I created a file containing a class that reads and cleans data from a .csv file. I import this module to load the data, and then I want to use the resulting pandas DataFrame for other operations, but for some reason I can't.
Here is the class I created for loading/reading the file:
import pandas as pd

class Load_Data:
    def __init__(self, filename):
        self.__filename = filename

    def load(self):
        df = pd.read_csv(self.__filename)
        del df["Remarks"]
        df = df.dropna()
        return df
In another file, I import this module for the data processing step and then try to work on the result as a pandas DataFrame.
from Load_Data import Load_Data
import pandas as pd
test_df = Load_Data("Final_file.csv")
test_df.load()
Printing the contents of my file works without problems. But when I try to use test_df as a pandas DataFrame, for example to group by some of the attributes,
test_df.groupby(['width','length'])
it ends up showing:
'Load_Data' object has no attribute 'groupby'
This suggests that to use groupby I would have to write it myself in my own class, which I don't want to do. I just want to convert my class to a pandas DataFrame and use the pandas API directly for more complex operations.
I would really appreciate any help.
You are using the class as if it were a function. Put the return statement inside the load method and use the value that load() returns:
import pandas as pd

class Load_Data:
    def __init__(self, filename):
        self.__filename = filename

    def load(self):
        df = pd.read_csv(self.__filename)
        del df["Remarks"]
        df = df.dropna()
        return df # this change
Usage:
test_df = Load_Data("Final_file.csv").load() #this change
# or
load_data = Load_Data("Final_file.csv")
test_df = load_data.load()
load returns a DataFrame and not a Load_Data instance.
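To see that the value returned by load() really is a plain DataFrame, here is a small self-contained sketch. The sample data and the price column are made up, and the class reads from an in-memory buffer instead of Final_file.csv so it runs without a file on disk.

```python
import io

import pandas as pd


class Load_Data:
    def __init__(self, source):
        # source may be a path or any file-like object accepted by read_csv
        self._source = source

    def load(self):
        df = pd.read_csv(self._source)
        del df["Remarks"]
        return df.dropna()


# Made-up sample data standing in for Final_file.csv
sample = io.StringIO(
    "width,length,price,Remarks\n"
    "1,2,10,a\n"
    "1,2,30,b\n"
    "3,4,5,c\n"
)

test_df = Load_Data(sample).load()  # a DataFrame, not a Load_Data instance
totals = test_df.groupby(["width", "length"])["price"].sum()
print(totals)
```

Because test_df now holds the return value of load(), every DataFrame method, including groupby, is available on it.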
Can you share the next line or two which throw an error?
Are you referencing the returned data, or the class?
I.e.
df2= test_df.load()
df2.groupby()
Or
test_df.groupby()
Are you trying to create a new DataFrame class built on pandas?
If so, you'd need something like this (might work):
class LoadDF(pd.DataFrame):
    @classmethod
    def load(cls, filename):
        # Build the subclass instance from the cleaned frame; assigning
        # to self inside a method would only rebind a local name.
        df = pd.read_csv(filename)
        del df["Remarks"]
        return cls(df.dropna())
Newbie at dealing with classes.
I have some dataframe objects I want to transform, but I'm having trouble manipulating them with classes. Below is an example. The goal is to transpose a dataframe and reassign it to its original variable name. In this case, the dataframe is assets.
import pandas as pd
from requests import get
import numpy as np

html = get("https://www.cbn.gov.ng/rates/Assets.asp").text
table = pd.read_html(html,skiprows=[0,1])[2]
assets = table[1:13]

class Array_Df_Retitle:
    def __init__(self,df):
        self.df = df

    def change(self):
        self.df = self.df.transpose()
        self.df.columns = self.df[0]
        return self.df
However, calling assets = Array_Df_Retitle(assets).change() simply yields an error:
KeyError: 0
I'd like to know where I'm getting things wrong.
I made a few changes to your code. The problem is coming from self.df[0]. This means you are selecting the column named 0. However, after transposing, you will not have any column named 0. You will have a row instead.
import pandas as pd
from requests import get
import numpy as np

html = get("https://www.cbn.gov.ng/rates/Assets.asp").text
table = pd.read_html(html,skiprows=[0,1])[2]
assets = table[1:13]

class Array_Df_Retitle:
    def __init__(self,df):
        self.df = df

    def change(self):
        self.df = self.df.dropna(how='all').transpose()
        self.df.columns = self.df.loc[0,:]
        return self.df.drop(0).reset_index(drop=True)

Array_Df_Retitle(assets).change()
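The difference between selecting a column and selecting a row after transposing can be reproduced without the web page. This is only a sketch on a made-up table; the real table from the site may look different.

```python
import pandas as pd

# Made-up table with default integer labels; the first row is dropped
# by the slice, as in the question.
table = pd.DataFrame(
    [["skip", "skip"], ["cash", 10], ["bonds", 20], ["loans", 30]]
)
assets = table[1:4]      # row labels 1, 2, 3; column labels 0, 1
t = assets.transpose()   # row labels 0, 1; column labels 1, 2, 3

try:
    t[0]                 # looks for a COLUMN labelled 0
    found_column = True
except KeyError:
    found_column = False

names = t.loc[0, :]      # selects the ROW labelled 0 instead
t.columns = names
result = t.drop(0).reset_index(drop=True)

print(found_column)
print(result)
```

After the slice there is no row labelled 0 left, so the transposed frame has no column labelled 0 either, which is why t[0] fails while t.loc[0, :] works.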
I need to create a DataFrame outside a function, like a global DataFrame, and define its header (column names).
import pandas as pd
import datetime
global df = pd.DataFrame(columns = ['Time','Call'])
Now I have 2 functions as below.
def a():
    #Checking whether the df is available
    if df is None:
        df = pd.DataFrame(columns = ['Time','Call'])
    #Appending
    now = datetime.datetime.now()
    timestamp1 = now.strftime("%Y-%m-%d %H:%M:%S")
    call_num = 1
    df = df.append({'Time':'timestamp1','Call': call_num}, ignore_index=True)
below is the main function.
def main():
    a()
    print(1) #Do something
    a()

main()
My main requirement is that no values are passed to a() when it is called from main().
How can I achieve this, so that the DataFrame ends up with two rows of date-time information?
My current code does not work and gives an error.
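One possible approach, offered as a sketch rather than the definitive answer: global is a statement that belongs inside a function, so it cannot be combined with an assignment at module level, and DataFrame.append was removed in pandas 2.0. Adding rows in place with df.loc avoids both problems, because mutating a module-level object does not require global at all. The value call_num = 1 is kept from the question.

```python
import datetime

import pandas as pd

# Module-level DataFrame shared by every call to a()
df = pd.DataFrame(columns=["Time", "Call"])


def a():
    # Reading or mutating a module-level object needs no `global`
    # statement; only rebinding the name (df = ...) would.
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    call_num = 1
    df.loc[len(df)] = [timestamp, call_num]  # add one row in place


def main():
    a()
    print(1)  # do something else
    a()
    print(df)


main()
```

If a() had to rebind the name instead (for example df = pd.concat(...)), it would need a global df line at the top of the function body.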
This question may have been asked for fundamentals of Python, unfortunately, I spent an hour looking for the answer, but couldn't find it. So I am hoping for someone's input. I am used to writing Class where I can give self and get the variable into the def function from another function. How do I capture that variable without writing a Class function? Is there a way? Thanks!
import pandas as pd
file_Name = 'test.xlsx'
def read_file():
    df = pd.read_excel(file_Name)
    return df

read_file()

def clean_data():
    text_data = df['some_column_name'].str.replace(';',',') # How to get df from read_file() function?
    return text_data

clean_data()
You're overthinking it:
df = read_file()
clean_data() # Uses the global variable df capturing the return value of read_file
Of course, clean_data should take an argument rather than using a global variable.
def clean_data(f):
    text_data = f['some_column_name'].str.replace(';', ',')
    return text_data

f = read_file()
clean_data(f)
Call the first function and save the returned dataframe in a variable df. Then call the second function (clean_data) and pass this df inside it as argument.
Use this:
import pandas as pd

file_Name = 'test.xlsx'

def read_file():
    df = pd.read_excel(file_Name)
    return df

df = read_file()

def clean_data(df):
    text_data = df['some_column_name'].str.replace(';', ',')
    return text_data

clean_data(df)
In general... you can use global variables. But with how your method is set up, you should just do
df = read_file()
inside of your clean_data() method. Then use df from there. Notice df is just the local name for the result of calling read_file(), you can call it anything.
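A rough sketch of that suggestion. To keep it runnable without test.xlsx, read_file here returns a small made-up frame as a stand-in for pd.read_excel(file_Name); everything else follows the question.

```python
import pandas as pd


def read_file():
    # Stand-in for pd.read_excel(file_Name): a small made-up frame
    # so the sketch runs without test.xlsx on disk.
    return pd.DataFrame({"some_column_name": ["a;b", "c;d"]})


def clean_data():
    df = read_file()  # df is just a local name for the returned frame
    return df["some_column_name"].str.replace(";", ",", regex=False)


text_data = clean_data()
print(text_data.tolist())
```

The trade-off is that clean_data is now tied to read_file, so passing the frame in as an argument, as the other answers do, is usually easier to test and reuse.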
I am new to python and I am trying to pass an argument (dataframe) to a function and change value of the argument (dataframe) by reading an excel file.
(Assume that I have imported all the necessary files)
I have noticed that python does not pass the argument by reference here and I end up not having the dataframe initialized/changed.
I read that python passes by object-reference and not by value or reference. However, I do not need to change the same dataframe.
The output is: <class 'pandas.core.frame.DataFrame'>
from pandas import DataFrame as df

class Data:
    x = df

    @staticmethod
    def import_File(df_name , file):
        df_name = pd.io.excel.read_excel(file.replace('"',''), sheetname='Sheet1', header=0, skiprows=None, skip_footer=0, index_col=None, parse_cols=None, parse_dates=True, date_parser=True, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None )

def inputdata():
    Data.import_File(Data.x,r"C:\Users\Data\try.xlsx")
    print(Data.x)
You seem to be doing a lot of things the hard way. I'll try to simplify it while conforming to standard patterns of use.
# Whatever imports you need
import pandas as pd

# Static variables and methods should generally be avoided.
# Change class and variable names to whatever is more suitable.
# Names should be meaningful when possible.
class MyData:
    # Load data in constructor. Could easily do this in another method.
    def __init__(self, filename):
        # sheet_name is the keyword in current pandas (older releases used sheetname)
        self.data = pd.read_excel(filename, sheet_name='Sheet1')

def inputData():
    # In my experience, forward slashes work just fine on Windows.
    # Create new MyData object using constructor
    x = MyData('C:/Users/Data/try.xlsx')
    # Access member variable from object
    print(x.data)
Here's the version where it loads in a method rather than the constructor.
import pandas as pd

class MyData:
    # Constructor
    def __init__(self):
        # Whatever setup you need
        self.data = None
        self.loaded = False

    # Method with optional argument
    def loadFile(self, filename, sheetname='Sheet1'):
        self.data = pd.read_excel(filename, sheet_name=sheetname)
        self.loaded = True

def inputData():
    x = MyData()
    x.loadFile('C:/Users/Data/try.xlsx')
    print(x.data)
    # load some other data, using sheetname 'Sheet2' instead of default
    y = MyData()
    y.loadFile('C:/Users/Data/tryagain.xlsx', 'Sheet2')
    # can also pass arguments by name in any order like this:
    # y.loadFile(sheetname='Sheet2', filename='C:/Users/Data/tryagain.xlsx')
    print(y.data)
    # x and y both still exist with different data.
    # calling x.loadFile() again with a different path will overwrite its data.
The reason why it doesn't save in your original code is because assigning values to argument names never changes the original variable in Python. What you can do is something like this:
# Continuing from the last code block
def loadDefault(data):
    data.loadFile('C:/Users/Data/try.xlsx')

def testReference():
    x = MyData()
    loadDefault(x)
    # x.data now has been loaded
    print(x.data)

# Another example
def setIndex0(variable, value):
    variable[0] = value

def testSetIndex0():
    v = ['hello', 'world']
    setIndex0(v, 'Good morning')
    # v[0] now equals 'Good morning'
    print(v[0])
But you can't do this:
def setString(variable, value):
    # The only thing this changes is the value of variable inside this function.
    variable = value

def testSetString():
    v = 'Start'
    setString(v, 'Finish')
    # v is still 'Start'
    print(v)
If you want to be able to specify the location to store a value using a name, you could use a data structure with indexes/keys. Dictionaries let you access and store values using a key.
import pandas as pd

class MyData:
    # Constructor
    def __init__(self):
        # make data a dictionary
        self.data = {}

    # Method with optional argument
    def loadFile(self, storename, filename, sheetname='Sheet1'):
        self.data[storename] = pd.read_excel(filename, sheet_name=sheetname)

    # Access method
    def getData(self, name):
        return self.data[name]

def inputData():
    x = MyData()
    x.loadFile('name1', 'C:/Users/Data/try.xlsx')
    x.loadFile('name2', 'C:/Users/Data/tryagain.xlsx', 'Sheet2')
    # access Sheet1
    print(x.getData('name1'))
    # access Sheet2
    print(x.getData('name2'))
If you really want the function to be static, then you don't need to make a new class at all. The main reason for creating a class is to use it as a reusable structure to hold data with methods specific to that data.
import pandas as pd

# wrap read_excel to make it easier to use
def loadFile(filename, sheetname='Sheet1'):
    return pd.read_excel(filename, sheet_name=sheetname)

def inputData():
    x = loadFile('C:/Users/Data/try.xlsx')
    print(x)
    # the above is exactly the same as
    x = pd.read_excel('C:/Users/Data/try.xlsx', sheet_name='Sheet1')
    print(x)
In your code df is a class object. To create an empty data frame you need to instantiate it. Instantiating classes in Python uses function notation. Also, we don't need to pass the default parameters when we read the excel file. This will help the code look cleaner.
import pandas as pd
from pandas import DataFrame as df

class Data:
    x = df()

    @staticmethod
    def import_File(df_name, file):
        df_name = pd.read_excel(file.replace('"',''), sheet_name='Sheet1')
When you pass Data.x to import_File(), df_name refers to the same object as Data.x, which in this case is an empty DataFrame. However, when you assign the result of read_excel to df_name, the connection between df_name and the empty DataFrame is broken, and df_name now refers to the DataFrame read from Excel. Data.x has undergone no change during this process, so it still refers to the empty DataFrame object.
A simpler way to see this with strings:
x = 'red'
df_name = x
We can break the connection between df_name and the string object 'red' and form a new one with the object 'excel'.
df_name = 'excel'
print(x)
'red'
However, there's a simple fix for Data.x to return the excel dataframe.
import pandas as pd
from pandas import DataFrame as df

class Data:
    x = df()

    @staticmethod
    def import_File(file):
        Data.x = pd.read_excel(file.replace('"',''), sheet_name='Sheet1')

def inputdata():
    Data.import_File(r"C:\Users\Data\try.xlsx")
    print(Data.x)
However, I don't recommend using staticmethods, and you should include a constructor in your class as the other answer has recommended.