When reading or writing CSV files, sometimes the file can't be accessed:
The process cannot access the file because another process has locked a portion of the file
I want my code to retry the read/write until it works.
Here is a draft of how I would loop until the file can be read.
But how can I test "READING_DID_WORK"? Is there a way to test whether the task was successful? Or should I just test whether FILE is a list?
timeout = time.time() + 120 #seconds
bool = True
while bool == True:
    time.sleep(0.5) # sleep for 500 milliseconds
    if time.time() > timeout:
        syncresult="timeout"
        break
    with io.open(SlogFilePath,"r", encoding = "utf-16(LE)") as File:
        FILE = File.read().splitlines()
    if READING_DID_WORK:
        bool = False
    else:
        bool = True
OUT = FILE
You don't need the extra boolean (bool is a very bad variable name anyway, since it shadows the built-in type) and you don't need READING_DID_WORK; just rely on the OSError that will be raised when the read fails.
A simple wrapper function:
import time
...
def read_file_with_retry(file_name, encoding="utf-16(LE)"):
    while True:
        try:
            with open(file_name, encoding=encoding) as f:
                file_content = f.readlines()
        except OSError:
            time.sleep(0.5)
        else:
            return file_content
To avoid a case of infinite loop, it is suggested to implement a max-retry mechanism:
import time
...
def read_file_with_retry(file_name, encoding="utf-16(LE)", max_retries=5):
    retry = 0
    while True:
        try:
            with open(file_name, encoding=encoding) as f:
                file_content = f.readlines()
        except OSError:
            time.sleep(0.5)
            retry += 1
            if retry > max_retries:
                raise
        else:
            return file_content
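The question's draft used a 120-second timeout rather than a retry count. A minimal sketch of that variant follows; the function name, the `timeout` and `delay` parameters, and the "utf-16-le" default are assumptions made for illustration, not part of the answer above.

```python
import time


def read_lines_with_deadline(file_name, encoding="utf-16-le", timeout=120.0, delay=0.5):
    """Retry reading until it succeeds or `timeout` seconds have passed."""
    # time.monotonic() is not affected by system clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(file_name, encoding=encoding) as f:
                return f.read().splitlines()
        except OSError:
            # Give up once the deadline has passed and re-raise the last error.
            if time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```

Note that this retries on every OSError, including a missing file, so a caller that wants to fail fast on FileNotFoundError would need to handle that case separately.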
Related
I want to convert this corpus, hu.txt.xz (15 GB, around 60 GB after unpacking), into smaller text files, each less than 1 GB or 100000 lines.
The expected output:
| split_1.txt
| split_2.txt
| split_3.txt
.....
| split_n.txt
I have this script on a local machine, but it doesn't work: it just hangs without processing anything, I think because the data is so big:
import fun
import sys
import os
import shutil
# //-----------------------
# Retrieve and return output file max lines from input
def how_many_lines_per_file():
    try:
        return int(input("Max lines per output file: "))
    except ValueError:
        print("Error: Please use a valid number.")
        sys.exit(1)
# //-----------------------
# Retrieve input filename and return file pointer
def file_dir():
    try:
        filename = input("Input filename: ")
        return open(filename, 'r')
    except FileNotFoundError:
        print("Error: File not found.")
        sys.exit(1)
# //-----------------------
# Create output file
def create_output_file_dir(num, filename):
    return open(f"./data/output_{filename}/split_{num}.txt", "a")
# //-----------------------
# Create output directory
def create_output_directory(filename):
    output_path = f"./data/output_{filename}"
    try:
        if os.path.exists(output_path): # Remove directory if exists
            shutil.rmtree(output_path)
        os.mkdir(output_path)
    except OSError:
        print("Error: Failed to create output directory.")
        sys.exit(1)
def ch_dir():
    # Print the current working directory
    print("Current working directory: {0}".format(os.getcwd()))
    # Change the current working directory
    os.chdir('./data')
    # Print the current working directory
    print("Current working directory: {0}".format(os.getcwd()))
# //-----------------------
def split_file():
    try:
        line_count = 0
        split_count = 1
        max_lines = how_many_lines_per_file()
        # ch_dir()
        input_file = fun.file_dir()
        input_lines = input_file.readlines()
        create_output_directory(input_file.name)
        output_file = create_output_file_dir(split_count, input_file.name)
        for line in input_lines:
            output_file.write(line)
            line_count += 1
            # Create new output file if current output file's line count is greater than max line count
            if line_count > max_lines:
                split_count += 1
                line_count = 0
                output_file.close()
                # Prevent creation of an empty file after splitting is finished
                if not len(input_lines) == max_lines:
                    output_file = create_output_file_dir(split_count, input_file.name)
    # Handle errors
    except Exception as e:
        print(f"An unknown error occurred: {e}")
    # Success message
    else:
        print(f"Successfully split {input_file.name} into {split_count} output files!")
# //-----------------------
if __name__ == "__main__":
    split_file()
Is there a Python script or deep learning tool that can split the file so I can use the pieces for the next task?
By calling readlines() on the input file handle, you are reading (or trying to) the whole file into memory at the same time. You can do this instead to process the file one line at a time, never having more than a single line in memory:
input_file = fun.file_dir()
...
for line in input_file:
    ...
Another issue to be aware of is that this line:
if not len(input_lines) == max_lines:
    output_file = create_output_file_dir(split_count, input_file.name)
is likely not doing what you think it is. Neither input_lines nor max_lines ever changes inside the loop, so this condition will either always create a new file or never create one. Unless you happen to process a file with exactly max_lines lines in it, it will always be true. This is not a big deal, but I think that, as your code is now, you're going to end up with an extra empty file. You need to change the logic anyway, so you'll have to rethink how to make this work.
UPDATE:
Here's how I would modify the logic to do the right thing regarding opening each of the output files:
input_file = fun.file_dir()
# output_file = create_output_file_dir(split_count, input_file.name)
output_file = None
...
for line in input_file:
    # Open a new output file if we don't have one open
    if not output_file:
        output_file = create_output_file_dir(split_count, input_file.name)
    output_file.write(line)
    line_count += 1
    # Close the current output file if the line count has reached its max
    # (>= rather than >, so each file holds exactly max_lines lines)
    if line_count >= max_lines:
        split_count += 1
        line_count = 0
        output_file.close()
        output_file = None
The key idea here is that you can't know if you need a new output file until you have tried to read the next line after closing the current output file. This logic only opens an output file when it has a line to write out and there is no open output file.
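The lazy-open logic described above can be sketched as a self-contained function. The name `split_by_lines`, its parameters, and the UTF-8 encoding are assumptions chosen for illustration; they are not taken from the original script.

```python
import os


def split_by_lines(input_path, output_dir, max_lines):
    """Stream input_path into output_dir/split_N.txt, max_lines lines per file.

    Returns the number of output files that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    split_count = 0
    line_count = 0
    output_file = None
    try:
        with open(input_path, "r", encoding="utf-8") as input_file:
            for line in input_file:  # only one line is held in memory at a time
                if output_file is None:
                    # Open the next output file only when there is a line to write.
                    split_count += 1
                    out_path = os.path.join(output_dir, f"split_{split_count}.txt")
                    output_file = open(out_path, "w", encoding="utf-8")
                    line_count = 0
                output_file.write(line)
                line_count += 1
                if line_count >= max_lines:
                    output_file.close()
                    output_file = None
    finally:
        # Close a partially filled last file, also when an error occurred.
        if output_file is not None:
            output_file.close()
    return split_count
```

Because a file is opened only when a line is waiting to be written, an input whose length is an exact multiple of `max_lines` does not leave an empty trailing file.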
You're trying to load a big file into memory all at once, which is not possible.
Instead of reading all the content at once, read it line by line and process each line as you go.
I've fixed the bug spotted by @CryptoFool:
import fun
import sys
import os
import shutil
# //-----------------------
# Retrieve and return output file max lines from input
def how_many_lines_per_file():
    try:
        return int(input("Max lines per output file: "))
    except ValueError:
        print("Error: Please use a valid number.")
        sys.exit(1)
# //-----------------------
# Retrieve input filename and return file pointer
def file_dir():
    try:
        filename = input("Input filename: ")
        return open(filename, 'r')
    except FileNotFoundError:
        print("Error: File not found.")
        sys.exit(1)
# //-----------------------
# Create output file
def create_output_file_dir(num, filename):
    return open(f"./data/output_{filename}/split_{num}.txt", "a")
# //-----------------------
# Create output directory
def create_output_directory(filename):
    output_path = f"./data/output_{filename}"
    try:
        if os.path.exists(output_path): # Remove directory if exists
            shutil.rmtree(output_path)
        os.mkdir(output_path)
    except OSError:
        print("Error: Failed to create output directory.")
        sys.exit(1)
def ch_dir():
    # Print the current working directory
    print("Current working directory: {0}".format(os.getcwd()))
    # Change the current working directory
    os.chdir('./data')
    # Print the current working directory
    print("Current working directory: {0}".format(os.getcwd()))
# //-----------------------
def split_file():
    try:
        line_count = 0
        split_count = 1
        max_lines = how_many_lines_per_file()
        # ch_dir()
        input_file = fun.file_dir()
        create_output_directory(input_file.name)
        output_file = None # No output file is created at first; one is opened only once the loop has a line to write
        for line in input_file:
            # Open a new output file if we don't have one open
            if not output_file:
                output_file = create_output_file_dir(split_count, input_file.name)
            output_file.write(line)
            line_count += 1
            # Close the current output file if the line count has reached its max
            if line_count >= max_lines:
                split_count += 1
                line_count = 0
                output_file.close()
                output_file = None
        # Close the last, partially filled output file if one is still open
        if output_file:
            output_file.close()
    # Handle errors
    except Exception as e:
        print(f"An unknown error occurred: {e}")
    # Success message
    else:
        print(f"Successfully split {input_file.name} into {split_count} output files!")
# //-----------------------
if __name__ == "__main__":
    split_file()
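Since the corpus is stored as an .xz archive, it may not be necessary to unpack the 60 GB to disk first. The standard library's lzma module can open the archive in text mode and decompress it lazily. A minimal sketch, assuming the corpus is UTF-8 text (the helper name `iter_xz_lines` is hypothetical):

```python
import lzma


def iter_xz_lines(xz_path, encoding="utf-8"):
    """Yield decoded lines from an .xz file without unpacking it to disk."""
    # "rt" opens the compressed stream in text mode; data is decompressed
    # incrementally as the loop consumes it.
    with lzma.open(xz_path, "rt", encoding=encoding) as f:
        for line in f:
            yield line
```

The same line-by-line splitting loop can then iterate over `iter_xz_lines("hu.txt.xz")` instead of an ordinary file object.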
I have this code to read a file
def collect_exp_data(file_name):
    data = dict()
    while True:
        try:
            with open(file_name, 'r') as h:
                break
            for line in h:
                batch, x, y, value = line.split(',')
                try:
                    if not batch in data:
                        data[batch] = []
                    data[batch] += [(float(x), float(y), float(value))]
                except ValueError:
                    print("\nCheck that all your values are integers!")
        except FileNotFoundError:
            print("\nThis file doesn't exist, Try again!")
    return data
I'm trying to add some error handling: I want to ask the user for the file name again if the file doesn't exist, but the code just loops endlessly!
What did I do wrong and how can I fix it?
Edit:
If I move the while loop outside the function, it works when the file doesn't exist, but when the file exists, the code just stops after the loop and doesn't run the next function. Here is the code:
def collect_exp_data(file_name):
    data = dict()
    with open(file_name, 'r') as h:
        for line in h:
            batch, x, y, value = line.split(',')
            try:
                if not batch in data:
                    data[batch] = []
                data[batch] += [(float(x), float(y), float(value))]
            except ValueError:
                print("\nCheck that all your values are integers!")
    return data
while True:
    file_name = input("Choose a file: ")
    try:
        data = collect_exp_data(file_name)
        break
    except FileNotFoundError:
        print('This file does not exist, try again!')
Make a condition to break the loop
finished = False
while not finished:
    file_name = input("Choose a file: ")
    try:
        data = collect_exp_data(file_name)
        # we executed the previous line successfully,
        # so we set finished to true to break the loop
        finished = True
    except FileNotFoundError:
        print('This file does not exist, try again!')
        # an exception has occurred, finished will remain false
        # and the loop will iterate again
Do all your exception handling in the main function.
def collect_exp_data(filename):
    data = dict()
    with open(filename) as infile:
        for line in map(str.strip, infile):
            batch, *v = line.split(',')
            assert batch and len(v) == 3
            # append one (x, y, value) tuple per line, as in the question
            data.setdefault(batch, []).append(tuple(map(float, v)))
    return data
while True:
    filename = input('Choose file: ')
    try:
        print(collect_exp_data(filename))
        break
    except FileNotFoundError:
        print('File not found')
    except (ValueError, AssertionError):
        print('Unhandled file content')
Obviously the assertion won't fire if assertions are disabled (for example, when Python runs with the -O flag), but you get the point.
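If the check has to survive when assertions are disabled, one option is to raise ValueError explicitly, so the caller needs only one exception type for bad content. A minimal sketch under that assumption; the error message wording and the line-number reporting are illustrative additions:

```python
def collect_exp_data(filename):
    """Parse lines of the form 'batch,x,y,value' into {batch: [(x, y, value), ...]}."""
    data = {}
    with open(filename) as infile:
        for line_number, line in enumerate(infile, start=1):
            fields = line.strip().split(',')
            # Explicit check instead of assert: it still runs under `python -O`.
            if len(fields) != 4 or not fields[0]:
                raise ValueError(f"line {line_number}: expected 'batch,x,y,value'")
            batch, *values = fields
            # float() raises ValueError itself for non-numeric fields.
            data.setdefault(batch, []).append(tuple(map(float, values)))
    return data
```

The calling loop can then catch FileNotFoundError to ask for another file name and ValueError to report malformed content.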
I have this code to consolidate thousands of CSV files from a folder (a Box Drive folder), but from time to time I get a permission error on some of the files. When I start the code over, it is fine, but the error pops up more or less randomly, on a random file in the directory.
What I need to do is wait a few seconds and try to open that file again (not skip it).
So far I have this, but it is not working as expected:
with open(OutputTo, "wb") as fout:
    with open(sampleFile, "rb") as f: # first file to get header
        fout.write(f.read())
    for fs in toLoad: # now the rest
        with open(path + fs, "rb") as f:
            #while True: # infinite loop
            try:
                next(f) # skip the header
                fout.write(f.read())
            except PermissionError:
                #second try usually works.
                failed = failed + 1 # counter
                if failed > 10:
                    print('\n Script failed more than 10 time so I stopped it.')
                    break
                else:
                    print('\n Perm error, trying again in 5 sec.')
                    time.sleep(5)
Surround the file opening with a try/except block inside a while True loop, and wait n seconds when a PermissionError shows up. Be careful though: if you don't want to skip the file, you have to be sure the PermissionError is only temporary; otherwise you end up in an endless loop, constantly retrying (I would implement a break to stop after n tries).
import glob, time
filenames = glob.glob('*.csv')
for file in filenames:
    while True:
        try:
            # the with block closes the handle even if an error occurs
            with open(file, "rb") as hndl:
                ...  # Do stuff
            break
        except PermissionError:
            time.sleep(5)
I made a couple of changes according to @Michael Butscher's comment (credit to him).
By the way, the code never failed more than twice.
with open(OutputTo, "wb") as fout:
    with open(sampleFile, "rb") as f: # first file to get header
        fout.write(f.read())
    for fs in toLoad: # now the rest
        while True: # infinite loop
            with open(path + fs, "rb") as f:
                try:
                    next(f) # skip the header
                    fout.write(f.read())
                    #status
                    sys.stdout.write('\r') # overwrite in place
                    expectedTime = round((((time.time() - startTime)/(toLoad.index(fs)+1))*(len(toLoad) - toLoad.index(fs)))/60,2) # based on how it went so far
                    percentage = round((toLoad.index(fs)/(len(toLoad)+1))*100,2) # number of done files relative to all files (after date filter)
                    success = success + 1 # counter
                    sys.stdout.write('completed {} perc. expected time to end: {} minutes, success: {}, failed: {}'.format(percentage, expectedTime, success, failed)) # status
                    sys.stdout.flush()
                    failed = 0 # reset counter if success
                    break
                except PermissionError:
                    #second try always works. Script waits 5 seconds and usually run ok.
                    failed = failed + 1 # counter
                    print('\n Perm error, trying again in 5 sec. ({})'.format(path + fs))
                    time.sleep(5)
                    if failed > 10:
                        print('\n Script failed more than 10 time so I stopped it.')
                        break
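One way to keep the merge loop readable is to move the retry into a small helper that performs both the open and the read inside the try block, so a PermissionError from either step is retried. This is a sketch only; the helper name and its `attempts` and `delay` parameters are assumptions, not part of the code above.

```python
import time


def read_body_with_retry(path, attempts=10, delay=5.0):
    """Return the file's bytes after the first line, retrying on PermissionError."""
    for attempt in range(1, attempts + 1):
        try:
            # open() is inside the try, so a failure to open is retried as well.
            with open(path, "rb") as f:
                f.readline()  # skip the header line
                return f.read()
        except PermissionError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
```

The main loop then reduces to `fout.write(read_body_with_retry(path + fs))` for each file, and nothing is written to the output until a file has been read completely.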
Using this Python code, the lines of the file are printed in UPPERCASE, but the file itself remains unchanged (lowercase).
def open_f():
    while True:
        fname=raw_input("Enter filename:")
        if fname != "done":
            try:
                fhand=open(fname, "r+")
                break
            except:
                print "WRONG!!!"
                continue
        else: exit()
    return fhand
fhand=open_f()
for line in fhand:
    ss=line.upper().strip()
    print ss
    fhand.write(ss)
fhand.close()
Can you please suggest why the file remains unaffected?
Code:
def file_reader(read_from_file):
    with open(read_from_file, 'r') as f:
        return f.read()
def file_writer(read_from_file, write_to_file):
    with open(write_to_file, 'w') as f:
        f.write(file_reader(read_from_file))
Usage:
Create a file named example.txt with the following content:
Hi my name is Dmitrii Gangan.
Create an empty file called file_to_be_written_to.txt.
Add file_writer("example.txt", "file_to_be_written_to.txt") as the last line of your .py file.
Run python <your_python_script.py> from the terminal.
NOTE: All three files must be in the same folder.
Result:
file_to_be_written_to.txt:
Hi my name is Dmitrii Gangan.
This program should do as you requested and allows for modifying the file as it is being read. Each line is read, converted to uppercase, and then written back to the source file. Since it runs on a line-by-line basis, the most extra memory it should need would be related to the length of the longest line.
Example 1
def main():
    with get_file('Enter filename: ') as file:
        while True:
            position = file.tell()          # remember beginning of line
            line = file.readline()          # get the next available line
            if not line:                    # check if at end of the file
                break                       # program is finished at EOF
            file.seek(position)             # go back to the line's start
            file.write(line.upper())        # write the line in uppercase
def get_file(prompt):
    while True:
        try:                                # run and catch any error
            return open(input(prompt), 'r+t')   # r+t = read, write, text
        except EOFError:                    # see if user is finished
            raise SystemExit()              # exit the program if so
        except OSError as error:            # check for file problems
            print(error)                    # report operation errors
if __name__ == '__main__':
    main()
The following is similar to what you see up above but works in binary mode instead of text mode. Instead of operating on lines, it processes the file in chunks based on the given BUFFER_SIZE and can operate more efficiently. The code under the main loop may replace the code in the loop if you wish for the program to check that it is operating correctly. The assert statements check some assumptions.
Example 2
BUFFER_SIZE = 1 << 20
def main():
    with get_file('Enter filename: ') as file:
        while True:
            position = file.tell()
            buffer = file.read(BUFFER_SIZE)
            if not buffer:
                return
            file.seek(position)
            file.write(buffer.upper())
        # The following code will not run but can replace the code in the loop.
        start = file.tell()
        buffer = file.read(BUFFER_SIZE)
        if not buffer:
            return
        stop = file.tell()
        assert file.seek(start) == start
        assert file.write(buffer.upper()) == len(buffer)
        assert file.tell() == stop
def get_file(prompt):
    while True:
        try:
            return open(input(prompt), 'r+b')
        except EOFError:
            raise SystemExit()
        except OSError as error:
            print(error)
if __name__ == '__main__':
    main()
I suggest the following approach:
1) Read/close the file, return the filename and content
2) Create a new file with above filename, and content with UPPERCASE
def open_f():
    while True:
        fname=raw_input("Enter filename:")
        if fname != "done":
            try:
                with open(fname, "r+") as fhand:
                    ss = fhand.read()
                break
            except:
                print "WRONG!!!"
                continue
        else: exit()
    return fname, ss
fname, ss =open_f()
with open(fname, "w+") as fhand:
    fhand.write(ss.upper())
As already alluded to in the comments, you cannot successively read from and write to the same file -- the first write will truncate the file, so you cannot read anything more from the handle at that point.
Fortunately, the fileinput module offers a convenient inplace mode which works exactly like you want.
import fileinput
for line in fileinput.input(somefilename, inplace=True):
    print(line.upper().strip())
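For reference, here is a self-contained Python 3 sketch of the same inplace technique. The function name `uppercase_in_place` is hypothetical, and unlike the snippet above it does not strip whitespace, so indentation and line endings are preserved.

```python
import fileinput


def uppercase_in_place(path):
    """Rewrite the file at `path` so that every line is uppercased."""
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            # While the loop runs, standard output is redirected into the file,
            # so print() writes the replacement line. end="" avoids doubling
            # the newline that `line` already carries.
            print(line.upper(), end="")
```

Anything else printed inside the loop also ends up in the file, so diagnostics should go to sys.stderr.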
try:
    content = open("/tmp/out").read()
except:
    content = ""
Can I make this any shorter or more elegant? I have to do it for more than one file, so I want something shorter.
Is writing a function the only shorter way to do it?
What I actually want is this, but with "" concatenated if there is any exception:
lines = (open("/var/log/log.1").read() + open("/var/log/log").read()).split("\n")
Yes, you'll have to write something like
def get_contents(filename):
    try:
        with open(filename) as f:
            return f.read()
    except EnvironmentError:
        return ''
lines = (get_contents('/var/log/log.1')
         + get_contents('/var/log/log')).split('\n')
NlightNFotis raises a valid point, if the files are big, you don't want to do this. Maybe you'd write a line generator that accepts a list of filenames:
def get_lines(filenames):
    for fname in filenames:
        try:
            with open(fname) as f:
                for line in f:
                    yield line
        except EnvironmentError:
            continue
...
for line in get_lines(["/var/log/log.1", "/var/log/log"]):
    do_stuff(line)
Another way is to use the standard fileinput.FileInput class (thanks, J.F. Sebastian):
import fileinput
import os
def eat_errors(f, mode):
    try:
        return open(f, mode)
    except IOError:
        # fall back to the null device, which reads as an empty file
        return open(os.devnull)
for line in fileinput.FileInput(["/var/log/log.1", "/var/log/log"], openhook=eat_errors):
    do_stuff(line)
This code monkey patches open, replacing it with another open that returns a FakeFile, which always yields an "empty" string, whenever the real open raises an `IOError`.
Whilst it's more code than you'd really want to write for the problem at hand, it does mean that you have a reusable context manager for faking open if the need arises again (probably twice in the next decade)
with monkey_patched_open():
    ...
Actual code.
#!/usr/bin/env python
from contextlib import contextmanager
from StringIO import StringIO
################################################################################
class FakeFile(StringIO):
    def __init__(self):
        StringIO.__init__(self)
        self.count = 0
    def read(self, n=-1):
        return "<empty#1>"
    def readlines(self, sizehint = 0):
        return ["<empty#2>"]
    def next(self):
        if self.count == 0:
            self.count += 1
            return "<empty#3>"
        else:
            raise StopIteration
################################################################################
@contextmanager
def monkey_patched_open():
    global open
    old_open = open
    def new_fake_open(filename, mode="r"):
        try:
            fh = old_open(filename, mode)
        except IOError:
            fh = FakeFile()
        return fh
    open = new_fake_open
    try:
        yield
    finally:
        open = old_open
################################################################################
with monkey_patched_open():
    for line in open("NOSUCHFILE"):
        print "NOSUCHFILE->", line
    print "Other", open("MISSING").read()
    print "OK", open(__file__).read()[:30]
Running the above gives:
NOSUCHFILE-> <empty#3>
Other <empty#1>
OK #!/usr/bin/env python
from co
I left in the "empty" strings just to show what was happening.
StringIO would have sufficed just to read it once but I thought the OP was looking to keep reading from file, hence the need for FakeFile - unless someone knows of a better mechanism.
I know some see monkey patching as the act of a scoundrel.
You could try the following, but it's probably not the best:
import os
def chk_file(filename):
    if os.stat(filename).st_size == 0:
        return ""
    else:
        with open(filename) as f:
            return f.readlines()
if __name__=="__main__":
    print chk_file("foobar.txt") #populated file
    print chk_file("bar.txt") #empty file
    print chk_file("spock.txt") #populated
It works. You can wrap it with your try-except, if you want.
You could define a function to catch errors:
from itertools import chain
def readlines(filename):
    try:
        with open(filename) as file:
            return file.readlines() # or just `file` to return an iterator
    except EnvironmentError:
        return []
files = (readlines(name) for name in ["/var/log/1", "/var/log/2"])
lines = list(chain.from_iterable(files))
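On Python 3, where EnvironmentError and IOError are aliases of OSError, the same idea can be written a little more compactly with pathlib. This is a sketch for small files only, since it reads each file fully into memory; the helper names `read_or_empty` and `read_all_lines` are made up for the example.

```python
from pathlib import Path


def read_or_empty(path):
    """Return the file's text, or "" if it cannot be opened or read."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def read_all_lines(paths):
    """Concatenate the readable files and split the result into lines."""
    return "".join(read_or_empty(p) for p in paths).splitlines()
```

For example, `read_all_lines(["/var/log/log.1", "/var/log/log"])` yields the lines of whichever of the two files are readable.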