Pandas Error - Appending multiple files into one - python

I have a set of files that do not have any extension. They are currently stored in a folder that is referenced by this variable "allFiles".
allFiles = glob.glob(base2 + "/*")
I am trying to add an extension to each of the files in allFiles by appending .csv to the file name. I do it using the below code:
for file in allFiles:
    os.rename(os.path.join(base2, file), os.path.join(base2, file+'.csv'))
Next I try to append each of these csv files into one as per the below code.
list_ = []
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, header=None, delim_whitespace=True, error_bad_lines=False)
        list_.append(df)
    except pd.errors.EmptyDataError:
        continue
When I run the above code, I get an error stating that one of the files does not exist.
Error : FileNotFoundError: File b'/Users/base2/file1' does not exist
But file1 has now been renamed to file1.csv
Could anyone advise where I am going wrong in the above? Thanks
Update:
allFiles = glob.glob(base2 + "/*")
print(allFiles)
list_ = []
print(list_)
allFiles = [x + '.csv' for x in allFiles]
print(allFiles)
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, header=None)
        list_.append(df)
    except pd.errors.EmptyDataError:
        continue
Error : FileNotFoundError: File b'/Users/base2/file1.csv' does not exist

Before running your loop, do:
EDIT for clarity:
for file in allFiles:
    os.rename(os.path.join(base2, file), os.path.join(base2, file+'.csv'))
###What you're adding###
allFiles = [x+'.csv' for x in allFiles]
########################
for file_ in allFiles:
    try:
Basically, the problem is that you're changing the file names, but you're not changing the strings in your list to reflect the new file names. You can see this if you print allFiles. The above will make the necessary change for you.
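A minimal sketch of the same idea, assuming the paths come straight from glob (so they already include the folder). The helper name add_csv_extension and the temporary folder standing in for base2 are illustrative and not part of the question:

```python
import glob
import os
import tempfile


def add_csv_extension(paths):
    """Rename every path to path + '.csv' and return the new paths."""
    renamed = []
    for old in paths:
        new = old + ".csv"
        os.rename(old, new)
        renamed.append(new)  # keep the list in step with the disk
    return renamed


# Throwaway folder with two extension-less files (made-up sample data)
with tempfile.TemporaryDirectory() as base2:
    for name in ("file1", "file2"):
        with open(os.path.join(base2, name), "w") as fh:
            fh.write("1 2 3\n")

    allFiles = glob.glob(base2 + "/*")
    allFiles = add_csv_extension(allFiles)
    # Every entry in the list now points at a file that really exists
    all_exist = all(os.path.exists(p) for p in allFiles)
```

Because the helper returns the new names, the list handed to pd.read_csv can never drift from what is on disk.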

Related

Pyinstaller exe fails execution and is slow

I used pyinstaller to pack the following script into an exe file with pyinstaller -F.
For clarity: the script concatenates csv files into one file and exports the result to a new csv file.
# import necessary libraries
import pandas as pd
import os
import glob
from datetime import datetime
#Set filename
file_name = 'daglig_LCR_RLI'
# in the folder
path = os.path.dirname(os.path.abspath(__file__))
# Delete CSV file
# first check whether file exists or not
# calling remove method to delete the csv file
# in remove method you need to pass file name and type
del_file = path+"\\" + file_name +'.csv'
## If file exists, delete it ##
if os.path.isfile(del_file):
    os.remove(del_file)
    print("File deleted")
else: ## Show an error ##
    print("File not found: " +del_file)
# use glob to get all the csv files
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_list= list()
#format columns
dict_conv={'line_item': lambda x: str(x),
           'column_item': lambda x: str(x)}
# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=";", converters = dict_conv, encoding='latin1') #test latin1
    df_list.append(df)
    #print the location and filename
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
#add data frames to a list
RLI_combined = pd.concat(df_list, axis=0)
#Write date to approval_text
now = datetime.now()
# dd/mm/YY
print_date = now.strftime("%d/%m/%Y")
RLI_combined.loc[:, 'approval_text'] = print_date
#replace value_text with n/a
RLI_combined.loc[:, 'value_text'] = "n/a"
#Sum columns
m = RLI_combined['column_item'].isin(['0030', '0050', '0080'])
RLI_combined_sum = RLI_combined[~m].copy()
RLI_combined_sum['amount'] = RLI_combined_sum.groupby(['report_name', 'line_item', 'column_item'])['amount'].transform('sum')
RLI_combined_sum = RLI_combined_sum.drop_duplicates(['report_name', 'line_item', 'column_item'])
RLI_combined = pd.concat([RLI_combined_sum, RLI_combined[m]])
#export to csv
RLI_combined.to_csv(path + "//" + file_name + '.csv', index=False, sep=";", encoding='latin1')
#Make log
# Create the directory
directory = "Log"
parent_dir = path
# Path
path_log = os.path.join(parent_dir, directory)
try:
    os.mkdir(path_log)
    print('Log folder created')
except OSError as error:
    print('Log folder already exists')
#export to csv
log_name = now.strftime("%d-%m-%Y_%H-%M-%S")
print(log_name)
RLI_combined.to_csv(path + "//" + 'Log' +"//" + file_name+'_' + log_name + '.csv', index=False, sep=";", encoding='latin1')
Everything works as intended when I don't use pyinstaller. If I run the exe file, then after 10 seconds of nothing it writes the following:
What am I doing wrong that causes the error? And could I improve performance so the exe file isn't that slow?
I hope you can point me in the right direction.
I believe I found the solution to the problem. I use Anaconda, and pyinstaller uses all the installed libraries on the machine.
So using a clean install of Python and only installing the necessary libraries might fix the problem.
The error seems to be a numpy error, and the script isn't importing that library directly.

Pyinstaller one file works in python shell but fails as an exe

I have a script that takes a few csv files and concatenates them. It works as intended when I run it in the python shell, but it fails when I make a one-file exe with pyinstaller.
This is the error I get when I run my script:
The part that seems to fail is this part:
# use glob to get all the csv files
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_list= list()
#format columns
dict_conv={'line_item': lambda x: str(x),
           'column_item': lambda x: str(x)}
# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=";", converters = dict_conv, encoding='latin1') #test latin1
    df_list.append(df)
    #print the location and filename
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
#add data frames to a list
RLI_combined = pd.concat(df_list, axis=0)
This is my whole script for context:
# import necessary libraries
import pandas as pd
import os
import glob
from datetime import datetime
#Set filename
file_name = 'daglig_LCR_RLI'
# in the folder
path = os.path.dirname(os.path.abspath(__file__))
# Delete CSV file
# first check whether file exists or not
# calling remove method to delete the csv file
# in remove method you need to pass file name and type
del_file = path+"\\" + file_name +'.csv'
## If file exists, delete it ##
if os.path.isfile(del_file):
    os.remove(del_file)
    print("File deleted")
else: ## Show an error ##
    print("File not found: " +del_file)
# use glob to get all the csv files
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_list= list()
#format columns
dict_conv={'line_item': lambda x: str(x),
           'column_item': lambda x: str(x)}
# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=";", converters = dict_conv, encoding='latin1') #test latin1
    df_list.append(df)
    #print the location and filename
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
#add data frames to a list
RLI_combined = pd.concat(df_list, axis=0)
#Write date to approval_text
now = datetime.now()
# dd/mm/YY
print_date = now.strftime("%d/%m/%Y")
RLI_combined.loc[:, 'approval_text'] = print_date
#replace value_text with n/a
RLI_combined.loc[:, 'value_text'] = "n/a"
#Sum columns
m = RLI_combined['column_item'].isin(['0030', '0050', '0080'])
RLI_combined_sum = RLI_combined[~m].copy()
RLI_combined_sum['amount'] = RLI_combined_sum.groupby(['report_name', 'line_item', 'column_item'])['amount'].transform('sum')
RLI_combined_sum = RLI_combined_sum.drop_duplicates(['report_name', 'line_item', 'column_item'])
RLI_combined = pd.concat([RLI_combined_sum, RLI_combined[m]])
#export to csv
RLI_combined.to_csv(path + "//" + file_name + '.csv', index=False, sep=";", encoding='latin1')
#Make log
# Create the directory
directory = "Log"
parent_dir = path
# Path
path_log = os.path.join(parent_dir, directory)
try:
    os.mkdir(path_log)
    print('Log folder created')
except OSError as error:
    print('Log folder already exists')
#export to csv
log_name = now.strftime("%d-%m-%Y_%H-%M-%S")
print(log_name)
RLI_combined.to_csv(path + "//" + 'Log' +"//" + file_name+'_' + log_name + '.csv', index=False, sep=";", encoding='latin1')
I hope you can point me in the right direction.
With the pyinstaller one-file executable you will often run into problems like that. When the *.exe starts, it is extracted to a temporary directory, and that directory is, for example, the start location for relative path definitions.
So even if you get your script running and export your *.csv, it will often end up somewhere on your hard drive and not next to the *.exe where you perhaps expect it to be.
I think in your case the variable df_list stays empty because no files are listed in csv_files. This is because there are no *.csv files in the temp dir (its location is written at the top of the output).
To check whether this guess is right, please try printing the content of csv_files when running the one-file *.exe.
If this is the case, start by building a one-dir *.exe; if that works, you know that you have a problem with your path definitions.
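One common way to make the path independent of the temporary directory is to check the sys.frozen attribute that PyInstaller sets on bundled programs and, when it is present, take the folder from sys.executable. The following is only a sketch under that assumption; the helper name application_dir is made up for illustration:

```python
import os
import sys


def application_dir(script_file):
    """Return the folder of the running program (hypothetical helper)."""
    if getattr(sys, "frozen", False):
        # Running from a PyInstaller bundle: sys.executable is the .exe itself
        return os.path.dirname(os.path.abspath(sys.executable))
    # Running as a normal script: fall back to the script's own location
    return os.path.dirname(os.path.abspath(script_file))
```

In the script from the question, path = application_dir(__file__) would take the place of the os.path.dirname(os.path.abspath(__file__)) line, so the glob for "*.csv" looks next to the *.exe instead of inside the extraction folder.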

Change file names inside a directory using python

I need to change file names inside a folder
My file names are
BX-002-001.pdf
DX-001-002.pdf
GH-004-004.pdf
HJ-003-007.pdf
I need to add an additional zero after '-' at the end, like this
BX-002-0001.pdf
DX-001-0002.pdf
GH-004-0004.pdf
HJ-003-0007.pdf
I tried this
all_files = glob.glob("*.pdf")
for i in all_files:
    fname = os.path.splitext(os.path.basename(i))[0]
    fname = fname.replace("-00","-000")
My code is not working, can anyone help?
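fname = fname.replace("-00","-000") only changes the variable fname in your program. It does not change the filename on your disk. It also replaces every "-00" in the name, so BX-002-001 would become BX-0002-0001.
You can use os.rename() to actually apply the changes to your files. Split off the last group so that only it gets the extra zero, and keep the extension:
all_files = glob.glob("*.pdf")
for i in all_files:
    fname, ext = os.path.splitext(os.path.basename(i))
    # pad only the last "-" group: BX-002-001 -> BX-002-0001
    head, last = fname.rsplit("-", 1)
    os.rename(i, os.path.join(os.path.dirname(i), head + "-0" + last + ext))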
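Before renaming anything on disk, it can help to try the naming rule on plain strings. This is a small sketch using the file names from the question; the function name pad_last_group is made up for illustration:

```python
import os


def pad_last_group(filename):
    """Insert one extra zero at the start of the last '-' group."""
    stem, ext = os.path.splitext(filename)
    head, last = stem.rsplit("-", 1)  # split only at the final '-'
    return head + "-0" + last + ext


names = ["BX-002-001.pdf", "DX-001-002.pdf", "GH-004-004.pdf", "HJ-003-007.pdf"]
renamed = [pad_last_group(n) for n in names]
# renamed == ["BX-002-0001.pdf", "DX-001-0002.pdf",
#             "GH-004-0004.pdf", "HJ-003-0007.pdf"]
```

Once the strings look right, each pair of old and new names can be passed to os.rename().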

Create a list of all file names and their file extension in a directory

I am trying to create a dataset using pd.DataFrame to store file name and file extension of all the files in my directory. I eventually want to have two variables named Name and Extension. The name variable will have a list of file names and the extension variable should have a file type such as xlsx, and png.
I am new to python and was only able to get to this. This gives me a list of file names but I don't know how to incorporate the file extension part. Could anyone please help?
List = pd.DataFrame()
path = 'C:/Users/documnets/'
filelist = []
filepath = []
# r=root, d=directories, f = files
for subdir, dirs, files in os.walk(path):
    for file in files:
        filelist.append(file)
        filename, file_extension = os.path.splitext('/path/to/somefile.xlsx')
        filepath.append(file_extension)
List = pd.DataFrame(filelist, filepath)
Also, for this part: os.path.splitext('/path/to/somefile.xlsx'), can I leave what's in the parenthesis as it is or should I replace with my directory path?
Thank you
You can do this:
import os
import pandas as pd

path = 'C:/Users/documnets/'
filename = []
fileext = []
for file in os.listdir(path):
    # splitext copes with names that contain several dots or none at all;
    # pass it the actual file name, not the placeholder path
    name, ext = os.path.splitext(file)
    filename.append(name)
    fileext.append(ext.lstrip('.'))  # 'xlsx' rather than '.xlsx'
df = pd.DataFrame({"Name": filename, "Extension": fileext})
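If subfolders should be included as well, as in the os.walk loop from the question, something along these lines may work. It is a sketch only; the function name and the sample files in the temporary folder are invented for the demonstration:

```python
import os
import tempfile


def list_names_and_extensions(root):
    """Walk root and return sorted (name, extension) pairs for every file."""
    rows = []
    for subdir, dirs, files in os.walk(root):
        for file in files:
            # split the real file name, not a placeholder path
            name, ext = os.path.splitext(file)
            rows.append((name, ext.lstrip(".")))
    return sorted(rows)


# Throwaway folder with a few empty files, one of them in a subfolder
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("report.xlsx", "logo.png", os.path.join("sub", "notes.v2.txt")):
        open(os.path.join(root, rel), "w").close()
    rows = list_names_and_extensions(root)
```

The resulting list of pairs can be handed to pd.DataFrame(rows, columns=["Name", "Extension"]).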

Read CSV starting with string from Zipfile

I'm trying to loop through a folder that has zip files in it, and only extracting the csv files that start with a certain prefix.
Here is the code:
for name in glob.glob(path + '/*.zip'):
    zf = zipfile.ZipFile(name)
    csv_file = pd.read_csv(zf.open('Common_MarketResults*.csv'))
df = pd.concat(csv_file, axis=0).reset_index()
The csv file has some dates after the string I am using, which will be different in every zip file. I am receiving the following error message:
KeyError: "There is no item named 'Common_MarketResults*.csv' in the archive"
Searching for substrings in the filename made this possible.
sub = 'Common_MarketResults'
suf = 'csv'
data = []
for name in glob.glob(path + '*.zip'):
    zf = zipfile.ZipFile(name)
    zf_nfo = zipfile.ZipFile(name).namelist()
    for s in zf_nfo:
        if sub in s and suf in s:
            csv_file_str = s
            csv_file = pd.read_csv(zf.open(csv_file_str))
            csv_file['file_name'] = csv_file_str
            data.append(csv_file)
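zf.open() needs an exact member name, which is why the wildcard raised the KeyError. If wildcard-style matching is preferred over substring checks, the member names can be filtered with fnmatch first. A minimal sketch, with a made-up helper name and an in-memory archive of invented file names standing in for the real zip files:

```python
import fnmatch
import io
import posixpath
import zipfile


def matching_members(zf, pattern):
    """Return archive member names whose base name matches a wildcard pattern."""
    return [
        name for name in zf.namelist()
        if fnmatch.fnmatchcase(posixpath.basename(name), pattern)
    ]


# Build a small in-memory archive (hypothetical file names) to try it out
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Common_MarketResults_20200101.csv", "a,b\n1,2\n")
    zf.writestr("Other_20200101.csv", "a,b\n3,4\n")
    zf.writestr("Common_MarketResults_readme.txt", "not a csv")

with zipfile.ZipFile(buffer) as zf:
    found = matching_members(zf, "Common_MarketResults*.csv")
```

Each name in found can then be passed to zf.open() and read with pd.read_csv, and the collected frames combined with pd.concat.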
