My code uses pandas to extract information from an Excel sheet. I created a function to read and extract what I need, and then I set the results as two variables so I can work with the data.
But the start of my code seems a bit messy. Is there a way to rewrite it?
import pandas as pd

file_locations = 'C:/Users/sheet.xlsx'

def process_file(file_locations):
    file_location = file_locations
    df = pd.read_excel(fr'{file_location}')
    wagon_list = df['Wagon'].tolist()
    weight_list = df['Weight'].tolist()
It seems stupid to have a variable with the file destination and then set the file location for pandas inside my function from that same variable.
I'm not sure if I could use file_location as the variable name both outside and inside the function, since inside the function I would effectively be writing file_location = file_location.
Thanks for any input!
You can simply remove the setting of the file location inside the function.
file_location = 'C:/Users/sheet.xlsx'

def process_file():
    df = pd.read_excel(file_location)
    wagon_list = df['Wagon'].tolist()
    weight_list = df['Weight'].tolist()
But it depends on what you are trying to do with the function as well. Are you using the same function with multiple files in different locations, or is it the same file over and over again?
If it's the latter, then this seems fine.
You could instead do something like this and feed the location string directly into the function. This is more of a "proper" way to do things.
def process_file(file_location):
    df = pd.read_excel(file_location)
    wagon_list = df['Wagon'].tolist()
    weight_list = df['Weight'].tolist()
process_file('C:/Users/sheet.xlsx')
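One further tweak, assuming you want to use the two lists outside the function: as written, wagon_list and weight_list are local to process_file and are discarded when it returns, so you would return them:

def process_file(file_location):
    # read the sheet and hand both columns back to the caller
    df = pd.read_excel(file_location)
    return df['Wagon'].tolist(), df['Weight'].tolist()

wagon_list, weight_list = process_file('C:/Users/sheet.xlsx')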
I currently have the following file load.py which contains:
readText1 = "test1"
name1 = "test1"
readText1 = "test2"
name1 = "test2"
Please note that the number will change frequently. Sometimes there might be 2, sometimes 20 etc.
I need to do something with this data and then save it individually.
In my file I import load like so:
from do.load import *
#where do is a directory
I then create a variable to know how many items are in the file (which I know)
values = range(2)
I then attempt to loop and use each "variable by name" like so:
for x in values:
    x = x + 1
    textData = readText + x
    nameSave = name + x
Notice I try to create a textData variable with the readText, but this won't work since readText isn't actually a variable. It errors. This was my attempt, but it's obviously not going to work. What I need to do is loop over each item in that file and then use its individual variable data. How can I accomplish that?
This is a common anti-pattern that you are stepping into. Every time you think "I'll dynamically reference a variable to solve this problem" or "Variable number of variables!" think instead "Dictionary".
load.py can instead contain a dictionary:
load_dict = {'readText1': 'test1', 'name1': 'test1', 'readText2': 'test2', 'name2': 'test2'}
You can make that as big or small as you want.
Then in your other script
from do.load import *
# print out everything in the dictionary
for k, v in load_dict.items():
    print(k, v)

# create a variable and assign a value from the dictionary, dynamically even
for x in range(1, 3):  # the keys are numbered starting at 1
    text_data = load_dict['readText' + str(x)]
    print(text_data)
This should allow you to solve whatever you are trying to solve and won't cause you the pain you will find if you continue down your current path.
If you are trying to access the variables in the module you've imported, you can use dir.
loader.py
import load

module_vars = dir(load)  # all the names defined in load.py

# to get how many there are
num_vars = len([var for var in module_vars if not var.startswith("__")])
print(num_vars)

# to get their names
var_names = [var for var in module_vars if not var.startswith("__")]
print(var_names)

# to get their values
var_values = [getattr(load, var) for var in var_names]
print(var_values)
However, it is unsafe, as it may introduce security vulnerabilities into your code, and it is also slower. You can use data structures instead, as JNevil has said in the other answer.
Because the names are repeated, the file load.py will end up holding only the last values of readText1 and name1.
To do what you are asking for, you would have to open load.py as a text file and then iterate over each line to get the two variables (readText1 and name1) on each iteration.
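A minimal sketch of that approach, assuming load.py keeps the simple name = "value" layout shown above:

pairs = []
with open('do/load.py') as f:
    for line in f:
        line = line.strip()
        if not line or '=' not in line:
            continue  # skip blanks and anything that isn't an assignment
        name, value = line.split('=', 1)
        # strip whitespace and the surrounding quotes from the string literal
        pairs.append((name.strip(), value.strip().strip('"')))

print(pairs)  # [('readText1', 'test1'), ('name1', 'test1'), ...]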
I always import one function.py to do some special calculation. One day I found that this function.py has steps that read files, which means that every time I call the function, function.py opens several Excel files:
file1 = pd.read_excel('para/first.xlsx', sheet_name='Sheet1')
file2 = pd.read_excel('para/second.xlsx', sheet_name='Sheet1')
...
I think it's a waste of time. Is there any method to package all the Excel files as one parameter, so I can read the files in the main script rather than opening them many times in function.py? I want
function.calculate(parameter_group)
to replace
function.calculate(file1, file2, file3, ...)
how to get "parameter_group"?
I'm thinking if make those files parameters to .pkl, maybe can read faster?
You can loop through the names and put the DataFrames into a list, or you could do whatever calculations you want inside the loop and save the results in a list. For example:
pd_group = []
file_names = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
for name in file_names:
    pd_group.append(pd.read_excel(name, sheet_name='Sheet1'))
then access the DataFrames using pd_group[0], pd_group[1] and so on, or else assign individual names as you want, e.g. myname0 = pd_group[0].
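That list can then serve as the "parameter_group" from the question; a minimal sketch, assuming calculate only needs the DataFrames themselves:

def calculate(parameter_group):
    # parameter_group is the list of DataFrames read once in the main script
    for df in parameter_group:
        print(df.shape)  # placeholder for the real calculation

calculate(pd_group)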
Define a class and read the Excel files in its __init__:
class Demo:
    def __init__(self):
        self.excel_info = self.mock_read_excel()
        print("done")

    def mock_read_excel(self):
        # df = pd.read_excel('para/second.xlsx', sheet_name='Sheet1')
        df = "mock data"
        print("reading")
        return df

    def use_df_data_1(self):
        print(self.excel_info)

    def use_df_data_2(self):
        print(self.excel_info)

if __name__ == '__main__':
    dm = Demo()
    dm.use_df_data_1()
    dm.use_df_data_2()
This solves the problem of reading the Excel files every time the function is called: the files are read once in __init__ and the result is cached on the instance.
I am currently working on an application that converts a messy text file full of data into an organized CSV. It currently works well, but when I convert the text file to CSV I want to be able to select a text file with any name. The way my code is currently written, I have to name the file "file.txt". How do I go about this?
Here is my code. I can send the whole script if necessary. Just to note this is a function that is linked to a tkinter button. Thanks in advance.
def convert():
    df = pd.read_csv("file.txt", delimiter=';')
    df.to_csv('Cognex_Data.csv')
Try defining your function as follows:
def convert(input_filename, output_filename='Cognex_Data.csv'):
    df = pd.read_csv(input_filename, delimiter=';')
    df.to_csv(output_filename)
And, for instance, use it as follows:
filename = input("Enter filename: ")
convert(filename, "Cognex_Data.csv")
You can also put "Cognex_Data.csv" as a default value for the output_filename argument in the convert function definition (as done above).
And finally you can use any way you like to get the filename (for instance tkinter.filedialog as suggested by matszwecja).
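A minimal sketch of that tkinter.filedialog route (the callback name here is hypothetical; wire it to your button's command):

from tkinter import filedialog

def convert_button_callback():
    # ask the user for the input file instead of hardcoding "file.txt"
    filename = filedialog.askopenfilename(filetypes=[("Text files", "*.txt")])
    if filename:  # askopenfilename returns '' if the dialog is cancelled
        convert(filename, "Cognex_Data.csv")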
I haven't worked with tkinter, but PySimpleGUI, which to my knowledge is built on tkinter, gives you the possibility to extract the variable that corresponds to the name of the file selected by the user. That's what I'm doing with PySimpleGUI on a similar problem.
Then extract the file name selected by the user through the prompt and pass it as an argument to your function:
def convert(file):
    df = pd.read_csv("{}.txt".format(file), delimiter=';')
    df.to_csv('Cognex_Data.csv')
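For illustration, a rough PySimpleGUI sketch of such a prompt (untested against your script; it assumes a convert function that accepts the full selected path, like the one in the previous answer):

import PySimpleGUI as sg

layout = [
    [sg.Text("Input file"), sg.Input(key="-FILE-"),
     sg.FileBrowse(file_types=(("Text files", "*.txt"),))],
    [sg.Button("Convert"), sg.Button("Exit")],
]
window = sg.Window("Cognex converter", layout)

while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, "Exit"):
        break
    if event == "Convert" and values["-FILE-"]:
        convert(values["-FILE-"])  # pass the selected path to your convert function

window.close()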
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, the data is overwritten as opposed to a new file being created. Is there a way to conveniently "append" to an already existing dataset without having to read in all the data first? I do not need the data to be in one file, I just don't want to delete the old one.
What I currently do, which doesn't work:
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
    coerce_timestamps=None,
    allow_truncated_timestamps=True)

ds.write_dataset(data=data, base_dir='my_path', filesystem=hdfs_filesystem,
                 format=parquet_format, file_options=write_options)
Currently, the write_dataset function uses a fixed file name template (part-{i}.parquet, where i is a counter if you are writing multiple batches; in case of writing a single Table i will always be 0).
This means that when writing multiple times to the same directory, it might indeed overwrite pre-existing files if those are named part-0.parquet.
You can solve this by ensuring that write_dataset uses unique file names for each write, through the basename_template argument, e.g.:
ds.write_dataset(data=data, base_dir='my_path',
                 basename_template='my-unique-name-{i}.parquet', ...)
If you want an automatically unique name each time you write, you could e.g. generate a random string to include in the file name. One option for this is the Python uuid stdlib module: basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet".
Another option could be to include the current time of writing in the filename to make it unique, e.g. with basename_template = "part-{:%Y%m%d}-{{i}}.parquet".format(datetime.datetime.now()).
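Putting that together, a sketch of an append-style write, reusing data, hdfs_filesystem, parquet_format and write_options from the question's snippet:

import uuid
import pyarrow.dataset as ds

# a fresh template per write avoids clobbering earlier part-0.parquet files
basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet"

ds.write_dataset(data=data, base_dir='my_path', filesystem=hdfs_filesystem,
                 format=parquet_format, file_options=write_options,
                 basename_template=basename_template)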
See https://issues.apache.org/jira/browse/ARROW-10695 for some more discussion about this (customizing the template), and I opened a new issue specifically about the issue of silently overwriting data: https://issues.apache.org/jira/browse/ARROW-12358
For those that are here to work out how to use make_write_options() with write_dataset, try this:
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=False,
    coerce_timestamps='us',
    allow_truncated_timestamps=True)
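The resulting write_options object is then passed to write_dataset through its file_options argument, as in the question's snippet:

ds.write_dataset(data=data, base_dir='my_path', format=parquet_format,
                 file_options=write_options)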
So I am pretty new to classes, but I am trying to write one that opens an Excel file in a DataFrame, extracts some information from it, and then moves on to the next Excel file and does the same. The name of each file is the same with a different number on the end, and this cannot be changed; the numbers are not consistent.
I have tried this code:
class Systems:
    def __init__(self, survey_number):
        self.survey_number = survey_number
        self.file_name = 's3://misc/survey' + survey_number + '.xlsx'

    def readfile(self):
        self.df = pd.read_excel(self.file_name, sheet_name='Results')

survey_1 = Systems('026')
survey_1.df
I thought this should bring up the DataFrame for the first input, and then I could do the same with the other file names; however, I am getting this error:
AttributeError: Systems instance has no attribute 'df'
I have not included a sample of that data as I don't think it is needed for this? Let me know if it is.
I will be adding more functions to the class when it is working but think this step needs resolving first and I don't know how to fix it.
Thanks!
EDIT - I believe the problem is trying to read the file using the '...' + variable + '...' method - is there a better way to do this?
class Systems:
    def __init__(self, survey_number):
        self.file_name = 'yourpath' + survey_number + '.xlsx'
        self.df = pd.read_excel(self.file_name, sheet_name='Results')
and then
a = Systems('123')
a then has an attribute called df, which is what you are looking for. (In your original code, self.df was only created inside readfile, which you never called; reading the file in __init__ means df exists as soon as the object is constructed.)
a.df
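As for the EDIT: the '...' + variable + '...' concatenation does work, but an f-string is the more idiomatic way to build the path (a style preference, not the cause of the error):

self.file_name = f's3://misc/survey{survey_number}.xlsx'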