Python - program keeps iterating over deleted files

I've written a program that iterates over all CSV files in a directory and creates a new CSV file based on their contents.
I've written a function ('summary()') that performs these tasks and is called by the following code:
cwd = os.getcwd()
csv_list = []
for root, dirs, filenames in os.walk(cwd):
    for f in filenames:
        if f.endswith('.csv'):
            csv_list.append(f)
#for root, dirs, filenames in os.walk(cwd):
summary(csv_list)
Once the files have been passed to the function, they are added to a pandas DataFrame by the following code:
df = pd.concat((pd.read_csv(f, parse_dates=True, sep=';') for f in files))
The function creates an output CSV file called 'combined.csv'.
I delete this file between runs (as I'm currently testing the program).
However, I keep running into the following peculiar error:
FileNotFoundError: File 'combined.csv' does not exist
Even though I deleted the file, the program still tries to parse it (and crashes when it tries to load it). Why? Since I restart the program after deleting the file, it should not appear in the 'csv_list' variable at all.
Is the information cached somehow?
I've added the full traceback below.
Traceback (most recent call last):
File "summary.py", line 112, in <module>
summary(csv_list)
File "summary.py", line 17, in summary
df = pd.concat((pd.read_csv(f, parse_dates=True, sep=';') for f in files))
File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 206, in concat
copy=copy)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 236, in __init__
objs = list(objs)
File "summary.py", line 17, in <genexpr>
df = pd.concat((pd.read_csv(f, parse_dates=True, sep=';') for f in files))
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 405, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 764, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 985, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1605, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)
File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)
FileNotFoundError: File b'combined.csv' does not exist
Edit: I've simplified the program (the removed code wasn't relevant to the problem) and changed it to the code shown below. This is all of the code that is run.
I am executing the program from a terminal (Ubuntu 16.04), from inside that directory.
$ pwd
returns
/home/jasper/PycharmProjects/AHP_Scanner/PVM/true_run/testsum
$ ls -a /home/jasper/PycharmProjects/AHP_Scanner/PVM/true_run/testsum
returns:
. fixed_10.csv fixed_13.csv fixed_16.csv fixed_19.csv fixed_21.csv fixed_4.csv fixed_7.csv goed
.. fixed_11.csv fixed_14.csv fixed_17.csv fixed_1.csv fixed_2.csv fixed_5.csv fixed_8.csv summary.py
fixed_0.csv fixed_12.csv fixed_15.csv fixed_18.csv fixed_20.csv fixed_3.csv fixed_6.csv fixed_9.csv
As we can see, the file 'combined.csv' does not exist.
Yet when I run the following code (this is all of the code that is run; the rest of summary.py has been commented out):
cwd = os.getcwd()
csv_list = []
for root, dirs, filenames in os.walk(cwd):
    for f in filenames:
        if f.endswith('.csv'):
            print(f)
I get this response:
fixed_8.csv
fixed_10.csv
fixed_4.csv
fixed_11.csv
fixed_9.csv
fixed_7.csv
fixed_0.csv
fixed_12.csv
fixed_2.csv
fixed_5.csv
fixed_20.csv
fixed_18.csv
fixed_14.csv
fixed_6.csv
fixed_15.csv
fixed_3.csv
fixed_1.csv
fixed_17.csv
fixed_13.csv
fixed_19.csv
fixed_16.csv
fixed_21.csv
combined.csv
I am at a loss why this file keeps appearing.
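One thing worth checking (just a guess based on the listing above, which shows a 'goed' subdirectory): os.walk descends into subdirectories, while ls -a and the deletion of 'combined.csv' only covered the top-level directory, and the loop stores bare filenames rather than full paths. A minimal check that prints where each match actually lives:
import os

cwd = os.getcwd()
for root, dirs, filenames in os.walk(cwd):
    for f in filenames:
        if f.endswith('.csv'):
            # print the directory each match comes from, not just the bare name
            print(os.path.join(root, f))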

Related

"OSError: [Errno 9] Bad file descriptor" error when trying to save excel file with openpyxl in python

I have a couple of Excel files I want to merge into one.
I need the second column of each file to be copied into a separate column in a new Microsoft Excel file.
For this, I am using the openpyxl library in a Python script.
This is my code:
import os
from openpyxl import load_workbook

def mergeDataFiles():
    path = "C:\\Users\\ethan\\Desktop\\Benzoyl Chloride\\Benzoyl Chloride"
    # source excel files
    origin_files = list()
    for path, subdirs, files in os.walk(path):
        for file_index in range(len(files)):
            origin_files.append(files[file_index])
    # destination excel file
    destination_file = path + ".xlsx"
    destination_workbook = load_workbook(destination_file)
    destination_sheet = destination_workbook["Sheet1"]
    # copy data from source files to destination file
    for origin_file_index in range(1, len(origin_files)):
        origin_workbook = load_workbook(path + "\\" + origin_files[origin_file_index - 1])
        origin_sheet = origin_workbook['Data']
        destination_sheet.cell(row=1, column=origin_file_index).value = origin_files[origin_file_index - 1]
        for i in range(1, 500):
            # read cell value from source excel file
            data = origin_sheet.cell(row=i, column=2)
            # write the value to destination excel file
            destination_sheet.cell(row=i + 1, column=origin_file_index).value = data.value
    # saving the destination excel file
    destination_workbook.save(destination_file)

if __name__ == "__main__":
    mergeDataFiles()
When I run the code, I get an error on the last line in the function: OSError: [Errno 9] Bad file descriptor.
Full traceback:
C:\Users\ethan\.venv\Scripts\python.exe "C:/Users/ethan/Coding/Python/Copy Excel Data/main.py"
Traceback (most recent call last):
File "C:\Users\ethan\Coding\Python\Copy Excel Data\main.py", line 32, in <module>
mergeDataFiles()
File "C:\Users\ethan\Coding\Python\Copy Excel Data\main.py", line 28, in mergeDataFiles
destination_workbook.save(destination_file)
File "C:\Users\ethan\.venv\Lib\site-packages\openpyxl\workbook\workbook.py", line 407, in save
save_workbook(self, filename)
File "C:\Users\ethan\.venv\Lib\site-packages\openpyxl\writer\excel.py", line 293, in save_workbook
writer.save()
File "C:\Users\ethan\.venv\Lib\site-packages\openpyxl\writer\excel.py", line 275, in save
self.write_data()
File "C:\Users\ethan\.venv\Lib\site-packages\openpyxl\writer\excel.py", line 67, in write_data
archive.writestr(ARC_APP, tostring(props.to_tree()))
File "C:\Program Files\Python311\Lib\zipfile.py", line 1830, in writestr
with self.open(zinfo, mode='w') as dest:
File "C:\Program Files\Python311\Lib\zipfile.py", line 1204, in close
self._fileobj.seek(self._zinfo.header_offset)
OSError: [Errno 9] Bad file descriptor
Exception ignored in: <function ZipFile.__del__ at 0x000001D101443D80>
Traceback (most recent call last):
File "C:\Program Files\Python311\Lib\zipfile.py", line 1870, in __del__
self.close()
File "C:\Program Files\Python311\Lib\zipfile.py", line 1892, in close
self._fpclose(fp)
File "C:\Program Files\Python311\Lib\zipfile.py", line 1992, in _fpclose
fp.close()
OSError: [Errno 9] Bad file descriptor
Process finished with exit code 1
I have tried changing the file names and locations, having the destination file open and closed, and scouring the internet for solutions, and at this point I'm not sure what else I can try.
I am running the code on Windows 10 22H2, with an Intel i5 CPU.
Please assist me with this issue if you know how to solve it.
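I can't pin down the cause of the Errno 9 from the traceback alone, but two details of the posted code are worth ruling out first: the os.walk loop rebinds path, so if the folder contains subdirectories, destination_file and the path + "\\" + ... joins no longer point at the original folder, and the destination workbook itself can end up in origin_files. A sketch that avoids both (paths and sheet names are taken from the question; everything else is an assumption, not a confirmed fix):
import os
from openpyxl import load_workbook

def mergeDataFiles():
    base_path = "C:\\Users\\ethan\\Desktop\\Benzoyl Chloride\\Benzoyl Chloride"
    destination_file = base_path + ".xlsx"
    # collect full paths to the source workbooks, skipping the destination file
    origin_files = []
    for root, subdirs, files in os.walk(base_path):  # do not rebind base_path here
        for name in files:
            full_path = os.path.join(root, name)
            if name.endswith(".xlsx") and full_path != destination_file:
                origin_files.append(full_path)
    destination_workbook = load_workbook(destination_file)
    destination_sheet = destination_workbook["Sheet1"]
    for column_index, origin_file in enumerate(origin_files, start=1):
        origin_workbook = load_workbook(origin_file)
        origin_sheet = origin_workbook["Data"]
        destination_sheet.cell(row=1, column=column_index).value = os.path.basename(origin_file)
        for i in range(1, 500):
            # copy the second column of the source sheet into its own destination column
            data = origin_sheet.cell(row=i, column=2)
            destination_sheet.cell(row=i + 1, column=column_index).value = data.value
    destination_workbook.save(destination_file)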

Python : Saving all CSV files in a directory to a python list and then appending it to a combined excel sheet

I am trying to read all the files in a directory and then combine each of those CSV files into a single Excel sheet. Here is my code so far:
import pandas as pd
import datetime as dt
import os
import glob
#set Path to Test Directory
os.getcwd()
mwd = os.chdir('Test')
#Create a list with all CSV Files
files = os.listdir(mwd)
#Create a blank dataframe
combined = pd.DataFrame()
for file in files:
    df=pd.read_csv(files)
    combined = combined.append(df,ignore_index = TRUE)
combined.to_excel('testing.xlsx' , index = False)
When running the code, I get the following error:
File "C:\Users\x\Documents\automation\Testing.py", line 19, in <module>
df=pd.read_csv(files)
File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1233, in _make_engine
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'list'>
It should be pd.read_csv(file) and not pd.read_csv(files).
Another suggestion for handling the files:
Instead of files = os.listdir(mwd), you could do something like...
Get the file names:
files = [file for file in os.listdir(mwd) if file.endswith('.csv')]
Get the file paths:
files = [os.path.join(mwd, file) for file in os.listdir(mwd) if file.endswith('.csv')]
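Putting the two suggestions together, the read loop could look roughly like this (a sketch: 'Test' stands in for the directory from the question, and pd.concat replaces DataFrame.append, which newer pandas versions no longer provide):
import os
import pandas as pd

mwd = 'Test'  # directory containing the CSV files
files = [os.path.join(mwd, file) for file in os.listdir(mwd) if file.endswith('.csv')]
# read each file individually, then concatenate everything into one dataframe
combined = pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
combined.to_excel('testing.xlsx', index=False)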

Having problem with opening same files openpyxl

I'm writing a program that shuffles the contents of files. All the files are almost the same, but it doesn't work for some of them. I can't understand why.
for file in allFiles:
    print(file)
    items = []
    fileName = file
    fileIndex = 1
    directory = os.path.join(path, fileName[:-5].strip())
    if not os.path.exists(directory):
        os.mkdir(directory)
    theFile = openpyxl.load_workbook(file)
    allSheetNames = theFile.sheetnames
After processing some number of files, it shows me this error:
Traceback (most recent call last):
File "D:\staff\Python\NewProject\glow.py", line 25, in <module>
theFile = openpyxl.load_workbook(file)
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl\reader\excel.py", line 313, in load_workbook
reader = ExcelReader(filename, read_only, keep_vba,
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl\reader\excel.py", line 124, in __init__
self.archive = _validate_archive(fn)
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl\reader\excel.py", line 96, in _validate_archive
archive = ZipFile(filename, 'r')
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\zipfile.py", line 1269, in __init__
self._RealGetContents()
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\zipfile.py", line 1336, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
But before that, everything worked fine and there was no error. Can someone guess why? Thanks, everybody.
I look for the files this way:
path = os.getcwd()
sourcePath = os.getcwd() + '\source'
extension = 'xlsx'
os.chdir(sourcePath)
allFiles = glob.glob('*.{}'.format(extension))
You iterate over all the files without checking that each one really is an xlsx file. Probably you or some process added a file to the directory that is not a valid xlsx file. This is why openpyxl fails to read it.
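One way to guard against that, and against Excel's temporary '~$...' lock files (which also match *.xlsx but are not real workbooks), is to check each candidate before loading it. A sketch:
import glob
import os
import zipfile

import openpyxl

extension = 'xlsx'
allFiles = glob.glob('*.{}'.format(extension))
for file in allFiles:
    # skip Excel lock files and anything that is not a zip-based workbook
    if os.path.basename(file).startswith('~$') or not zipfile.is_zipfile(file):
        print('skipping non-workbook file:', file)
        continue
    theFile = openpyxl.load_workbook(file)
    allSheetNames = theFile.sheetnames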

FileNotFoundError: [Errno 2] No such file or directory - Can't solve a Path problem

I have this problem: I'm trying to run the script to download Springer's free books [https://towardsdatascience.com/springer-has-released-65-machine-learning-and-data-books-for-free-961f8181f189], but many things start to go wrong.
I solved some of the problems, but now I'm stuck.
C:\Windows\system32>python C:\Users\loren\Desktop\springer_free_books-master\main.py
Traceback (most recent call last):
File "C:\Users\loren\Desktop\springer_free_books-master\main.py", line 42, in <module>
books.to_excel(table_path)
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 2175, in to_excel
formatter.write(
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\formats\excel.py", line 738, in write
writer.save()
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_openpyxl.py", line 43, in save
return self.book.save(self.path)
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl\workbook\workbook.py", line 392, in save
save_workbook(self, filename)
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl\writer\excel.py", line 291, in save_workbook
archive = ZipFile(filename, 'w', ZIP_DEFLATED, allowZip64=True)
File "C:\Users\loren\AppData\Local\Programs\Python\Python38-32\lib\zipfile.py", line 1251, in __init__
self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'downloads\\table_v4.xlsx'
This is the part of the code where table_path is introduced:
table_url = 'https://resource-cms.springernature.com/springer-cms/rest/v1/content/17858272/data/v4'
table = 'table_' + table_url.split('/')[-1] + '.xlsx'
table_path = os.path.join(folder, table)
if not os.path.exists(table_path):
    books = pd.read_excel(table_url)
    # Save table
    books.to_excel(table_path)
else:
    books = pd.read_excel(table_path, index_col=0, header=0)
Try to create the destination directory before calling .to_excel() to ensure a valid writable directory exists. Make sure the os module is imported:
import os # add to your imports
and replace
books.to_excel(table_path)
with
os.makedirs(folder, exist_ok=True)
books.to_excel(table_path)
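In the context of the snippet from the question, that would look roughly like this (a sketch; 'downloads' is the folder name taken from the traceback):
import os
import pandas as pd

folder = 'downloads'  # folder name as shown in the traceback
table_url = 'https://resource-cms.springernature.com/springer-cms/rest/v1/content/17858272/data/v4'
table = 'table_' + table_url.split('/')[-1] + '.xlsx'
table_path = os.path.join(folder, table)

os.makedirs(folder, exist_ok=True)  # make sure the download directory exists before writing to it

if not os.path.exists(table_path):
    books = pd.read_excel(table_url)
    # Save table
    books.to_excel(table_path)
else:
    books = pd.read_excel(table_path, index_col=0, header=0)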

Reading in csv file as dataframe from hdfs

I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. if it can read the file as a regular local file.
with open("/home/file.csv") as f:
    print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
You can use the following code to read a CSV from HDFS:
import pandas as pd
import pyarrow as pa

hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user"
}
fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
Use read instead of open; it works:
with hd.read("/home/file.csv") as f:
    df = pd.read_csv(f)
