Python file.read() Callback

I currently have code that reads raw content from a file chosen by the user:
def choosefile():
    filec = tkFileDialog.askopenfile()
    # Wait a bit to prevent attempting to display the file's contents before the entire file has been read.
    time.sleep(1)
    filecontents = filec.read()
But sometimes people open big files that take more than 2 seconds to open. Is there a callback for FileObject.read([size])? For people who don't know what a callback is, it's an operation executed once another operation has finished.

Slightly modified from the docs:
#!/usr/bin/env python
import signal, sys
def handler(signum, frame):
print "You took too long"
sys.exit(1)
f = open(sys.argv[1])
# Set the signal handler and a 2-second alarm
signal.signal(signal.SIGALRM, handler)
signal.alarm(2)
contents = f.read()
signal.alarm(0) # Disable the alarm
print contents

Answer resolved by asker
Hm, I made a mistake at first: tkFileDialog.askopenfile() does not read the file; it is FileObject.read() that reads the file and blocks the code. I found the solution thanks to kindall. I'm not a complete expert at Python, though.
Your question seems to assume that Python will somehow start reading your file while some other code executes, and therefore you need to wait for the read to catch up. This is not even slightly true; both open() and read() are blocking calls and will not return until the operation has completed. Your sleep() is not necessary and neither is your proposed workaround. Simply open the file and read it. Python won't do anything else while that is happening.
Thanks kindall! Resolved code:
def choosefile():
    filec = tkFileDialog.askopenfile()
    filecontents = filec.read()
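If a non-blocking read with a completion callback is ever genuinely needed (for example, to keep the Tk window responsive while a very large file loads), one possible approach is to read in a worker thread and invoke a callback when the read finishes. This is only a sketch; read_async and on_done are illustrative names, not part of any library:
import threading

def read_async(fileobj, on_done):
    # Read the whole file in a background thread, then hand the contents
    # to the caller-supplied callback.
    def worker():
        contents = fileobj.read()
        # Note: on_done runs in the worker thread, not in Tk's main loop,
        # so it should pass the data back to the GUI (e.g. via a queue).
        on_done(contents)
    threading.Thread(target=worker).start()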

Related

Python read YAML: where does it go wrong: with open() or yaml.load()?

I have a script which loads a YAML file as an object. The related part is very simple:
def run_test_spec(self, file_path):
    try:
        with open(file_path, 'r') as f:
            test_spec = yaml.load(f)
        if test_spec:
            do_test(test_spec)
        else:
            print("empty test_spec")
    except BaseException as err:
        print("error in loading yaml file:", file_path)
The file_path is passed in after finishing some comparisons on the file entries from for entries in os.scandir(some_directory) (there is no break statement within the for loop).
It had been running fine until recently. Now test_spec gets the value None after the first run. I debugged it with PyCharm. If the breakpoint is set at the line if test_spec:, test_spec is None, but if the breakpoint is set either at the line with open(...) or at yaml.load(), test_spec gets loaded properly. In the end, I added a time.sleep(0.2) statement before with open(...), and then it works all the time.
What is the likely cause of this? Is it a problem with with open(...) or with yaml.load()? How do I get it right without the sleep?
Edited on June 27, 2018:
I did further debugging and found the line in the code that makes the difference. In the file /usr/local/lib/python3.5/dist-packages/yaml/reader.py on my machine:
def update_raw(self, size=4096):
    data = self.stream.read(size)
    if self.raw_buffer is None:
        self.raw_buffer = data
    else:
        self.raw_buffer += data
    self.stream_pointer += len(data)
    if not data:
        self.eof = True
If the breakpoint is set on the first line (data = ...), data is read fine with the content of the file; however, if the breakpoint is set on the second line (if self.raw_buffer is None:), data is read in as an empty string, which causes a StreamEndEvent and thus the empty return from yaml.load().
I could not step into self.stream.read(size), which only got me to some code in /usr/lib/python3.5/codecs.py.
I don't think it is the Python library that caused this problem; it probably has something to do with my code. I noticed this happens after I run a test that involves spawning two child processes connected by a pipe and killing the second process with terminate(). I checked the program with psutil: there is only one thread, no child processes, and no open files after the run, so it looks clean. But then the new request files could not be read unless I added a sleep or set a breakpoint before the stream read. If the second process, also in a pipe, terminates by itself, the issue does not occur.
If no breakpoint is set and I just print f.tell() before calling yaml.load(f), it is always 0, whether yaml.load(f) returns None or not.
PyYAML got a new release yesterday (2018-06-26). There was no announcement that this included an API break but, as can be expected from the major version number change, there was one.
The (unsafe) load() that you use has been renamed to danger_load() by the merge of this PR.
You can pin your PyYAML install to 3.x (pip install "pyyaml<4") or change your code to use danger_load(). The best solution would probably be to write explicit representers for the objects that are now dumped using !!python/path_to_your_type, so that you can use safe_load().
I could not find any announcement of possible breakage in the documentation.
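For reference, a minimal sketch of the safe-loading route, assuming the test specs contain only plain YAML mappings, sequences, and scalars (no !!python/... tags):
import yaml

def load_test_spec(file_path):
    # safe_load() refuses arbitrary Python object construction and behaves
    # the same on PyYAML 3.x and 4.x for plain YAML documents.
    with open(file_path, 'r') as f:
        return yaml.safe_load(f)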

File closes before async call finishes, causing IO error

I wrote a package that includes a function to upload something asynchronously. The intent is that the user can use my package, open a file, and upload it async. The problem is, depending on how the user writes their code, I get an IO error.
# EXAMPLE 1
with open("my_file", "rb") as my_file:
    package.upload(my_file)
# I/O operation on closed file error

# EXAMPLE 2
my_file = open("my_file", "rb")
package.upload(my_file)
# everything works
I understand that in the first example the file is closing immediately because the call is async. I don't know how to fix this though. I can't tell the user they can't open files in the style of example 1. Can I do something in my package.upload() implementation to prevent the file from closing?
You can use os.dup to duplicate the file descriptor and shield the async process from a close in the caller. The duplicated handle shares other characteristics of the original such as the current file position, so you are not completely shielded from bad things the caller can do.
This also limits your process to things that have file descriptors. If you stick to using the standard file calls, then a user can hand in any file-like object instead of just a file on disk.
import os

def upload(my_file):
    # Duplicate the descriptor so a close() in the caller can't invalidate our handle.
    my_file = os.fdopen(os.dup(my_file.fileno()))
    # ...queue for async
If you are using with to open the file, it will be closed as soon as execution leaves the with block. In your case, just pass the filename and open the file inside the asynchronous function.
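A minimal sketch of that suggestion, assuming the package can accept a path and run the transfer on a thread pool; _send stands in for the package's real upload logic:
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)

def upload(path):
    # Accept a path rather than an open file object, so the file is opened,
    # read, and closed entirely inside the background task.
    def task():
        with open(path, "rb") as f:
            data = f.read()
        _send(data)  # placeholder for the package's actual transfer code
    return _pool.submit(task)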

Python (Watchdog) - Waiting for file to be created correctly

I'm new to Python and I'm trying to implement good "file creation" detection. If I do not put a time.sleep(x), my files are processed incorrectly because they are still being "created" in the folder (the write buffer is not yet empty).
How can I circumvent this thing without waiting x seconds every time a file is created?
This is my code:
Main:
while 1:
    if len(parser()) > 0:  # arguments are valid
        if len(parser()) == 3:
            log_path = parser()['log_path']
        else:
            log_path = os.getcwd()
        paths = parser()
        if paths:
            handler = Event_Handler()
            observer = Observer()
            observer.schedule(handler, paths['src_fld'], True)
            observer.start()
            try:
                while True:
                    time.sleep(1)
            except KeyboardInterrupt:
                observer.stop()
            observer.join()
    else:
        exit(1)
Event_Handler class:
class Event_Handler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            time.sleep(1)
As I said, without that time.sleep(1) if I try to process a big file I'll fail since it's still not completely written.
For the sake of any future readers who stumble upon this question, as I have, the answer appears to be that you cannot. Watchdog does not and will not support any feature to tell if a file is "complete" as Windows doesn't allow for it and Watchdog is meant to be system-agnostic.
If you're on Linux or some distro of it, inotify is probably a safe bet. Otherwise, on Windows, the best solutions I've found are:
Upload a big file, bigfile, and then another file, bigfile-complete. When you find a file name-complete, you go back and upload/transfer/react to the original file name. In this case, your files would all be added to the monitored directory in a queue going file, file-complete, file2, file2-complete, ...
Poll the size of the file until it has remained fixed for a suitable length of time (see the sketch below). When it hasn't changed for long enough that you can be reasonably certain it is finished, react to it as normal.
Similarly, when a file is being uploaded to your directory in bits and pieces, it will generate a constant stream of file-modified Watchdog events. You can poll these instead of file size, waiting until they've stopped for a reasonable length of time, and then assume the file is complete and proceed.
None of these solutions are perfect, but this seems to be an inherent issue to Watchdog on Windows. Unfortunately the "perfect" solution seems to be "swap to Linux and use inotify".
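A minimal sketch of the size-polling idea mentioned above; the interval and stability-count values are arbitrary and would need tuning:
import os
import time

def wait_until_stable(path, interval=1.0, checks=3):
    # Consider the file "complete" once its size has stopped changing for
    # `checks` consecutive polling intervals.
    last_size = -1
    stable = 0
    while stable < checks:
        size = os.path.getsize(path)
        if size == last_size:
            stable += 1
        else:
            stable = 0
            last_size = size
        time.sleep(interval)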
Try reading the file in a while loop:
def on_created(event):
    ...
    # WAITING FOR FILE TRANSFER
    file = None
    while file is None:
        try:
            file = open(event.src_path)
        except OSError:
            file = None
            print("WAITING FOR FILE TRANSFER....")
            time.sleep(3)
            continue
Instead of using elapsed time as an indicator, the cleanest solution would be to monitor only certain types of files, using the patterns variable of a PatternMatchingEventHandler.
Simply append '.temp' to every file you are uploading/writing, and rename them to their real name when they're finished.
Set the patterns to look for '*.temp' files, and monitor their renaming to whatever type of file you desire using the FileSystemMovedEvent event (and its associated Handler.on_moved() method) and its dest_path value, which will include the new name of the file, now completely written.
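A hedged sketch of that pattern; the handler class name and watched path are illustrative:
import time
from watchdog.events import PatternMatchingEventHandler
from watchdog.observers import Observer

class TempRenameHandler(PatternMatchingEventHandler):
    def __init__(self):
        # Only react to files ending in ".temp".
        super(TempRenameHandler, self).__init__(patterns=["*.temp"])

    def on_moved(self, event):
        # Fired when "something.temp" is renamed; dest_path is the final,
        # completely written file.
        print("finished: " + event.dest_path)

observer = Observer()
observer.schedule(TempRenameHandler(), "/path/to/watch")
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()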

Python: Reread contents of a file

I have a file that an application updates every few seconds, and I want to extract a single number field in that file, and record it into a list for use later. So, I'd like to make an infinite loop where the script reads a source file, and any time it notices a change in a particular figure, it writes that figure to an output file.
I'm not sure why I can't get Python to notice that the source file is changing:
#!/usr/bin/python
import re
from time import gmtime, strftime, sleep

def write_data(new_datapoint):
    output_path = '/media/USBHDD/PythonStudy/torrent_data_collection/data_one.csv'
    outfile = open(output_path, 'a')
    outfile.write(new_datapoint)
    outfile.close()

forever = 0
previous_data = "0"

while forever < 1:
    input_path = '/var/lib/transmission-daemon/info/stats.json'
    infile = open(input_path, "r")
    infile.seek(0)
    contents = infile.read()
    uploaded_bytes = re.search('"uploaded-bytes":\s(\d+)', contents)
    if uploaded_bytes:
        current_time = strftime("%Y-%m-%d %X", gmtime())
        current_data = uploaded_bytes.group(1)
        if current_data != previous_data:
            write_data("," + current_time + "$" + uploaded_bytes.group(1))
            previous_data = uploaded_bytes.group(1)
        infile.close()
        sleep(5)
    else:
        print "couldn't write" + strftime("%Y-%m-%d %X", gmtime())
        infile.close
        sleep(60)
As it is now, the (messy) script writes once correctly, and then I can see that although my source file (stats.json) is changing, my script never picks up on any changes. It keeps running, but my output file doesn't grow.
I thought that an open() and a close() would do the trick, and then tried throwing in a .seek(0).
What file method am I missing to ensure that python re-opens and re-reads my source file, (stats.json)?
Unless you implement some synchronization mechanism or can somehow guarantee atomic reads and writes, I think you are asking for race conditions and subtle bugs here.
Imagine the "reader" accessing the file while the "writer" hasn't completed its write cycle: there is a risk of reading incomplete or inconsistent data. On "modern" systems you could also hit the cache and not see file modifications "live" as they happen.
I can think of two possible solutions:
You forgot the parentheses on the close in the else of the infinite loop.
infile.close --> infile.close()
Or the program that is changing the JSON file is not closing the file, and therefore the file on disk is not actually changing.
Two problems I see:
Are you sure your file is really updated on the filesystem? I do not know what operating system you are running your code on, but caching may kick your a$$ in this case if the file is not flushed by the producer.
Your problem may be worth solving with a pipe instead of a file, though I cannot guarantee what transmission will do if it gets stuck writing to the pipe when your consumer is dead.
Answering your problems, consider using one of the following:
pyinotify
watchdog
watcher
These modules are intended to monitor changes on the filesystem and then call the proper actions. The method in your example is primitive, carries a big performance penalty, and has a couple of other problems already mentioned in other answers.
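For illustration, a minimal watchdog sketch that reacts only when stats.json is modified instead of re-reading it on a timer; the handler name is made up and the path comes from the question:
import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class StatsChangedHandler(FileSystemEventHandler):
    def on_modified(self, event):
        # React only to the file we care about.
        if event.src_path.endswith('stats.json'):
            print('stats.json changed; re-read and compare it here')

observer = Observer()
observer.schedule(StatsChangedHandler(), '/var/lib/transmission-daemon/info')
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()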
Ilya, would it help to check (with os.path.getmtime) whether stats.json changed before you process the file?
Moreover, I'd suggest taking advantage of the fact that it's a JSON file:
import json
import os
import sys

dir_name = '/home/klaus/.config/transmission/'
# stats.json of the daemon might be elsewhere
file_name = 'stats.json'

full_path = os.path.join(dir_name, file_name)
with open(full_path) as fp:
    data = json.load(fp)

print data['uploaded-bytes']
Thanks for all the answers; unfortunately my error was in the shell, not in the Python script.
The cause of the problem turned out to be the way I was putting the script in the background. I was pressing Ctrl+Z, which I thought would put the task in the background. But it does not: Ctrl+Z only suspends the task and returns you to the shell; a subsequent bg command is necessary for the script to run its infinite loop in the background.

Why won't my script write to a file?

import time
import traceback
import sys
import tools
from BeautifulSoup import BeautifulSoup

f = open("randomwords.txt", "w")
while 1:
    try:
        page = tools.download("http://wordnik.com/random")
        soup = BeautifulSoup(page)
        si = soup.find("h1")
        w = si.string
        print w
        f.write(w)
        f.write("\n")
        time.sleep(3)
    except:
        traceback.print_exc()
        continue
f.close()
It prints just fine. It just won't write to the file. It's 0 bytes.
You can never leave the while loop, hence the f.close() call will never be reached and the stream buffer to the file will never be flushed.
Let me explain a little further: in your exception handler you've included continue, so there's no exit from the loop condition. Perhaps you should add some sort of indicator that you've reached the end of the page instead of looping on a static 1. Then you'd reach the close call and the information would be written to the file.
A bare except is almost certainly a bad idea; you should only handle the exception you expect to see. Then if it does something totally unexpected you will still get a useful error trace about it.
import time
import tools
from BeautifulSoup import BeautifulSoup

def scan_file(url, logf):
    try:
        page = tools.download(url)
    except IOError:
        print("Couldn't read url {0}".format(url))
        return
    try:
        soup = BeautifulSoup(page)
        w = soup.find("h1").string
    except AttributeError:
        print("Couldn't find <h1> tag")
        return
    print(w)
    logf.write(w)
    logf.write('\n')

def main():
    with open("randomwords.txt", "a") as logf:
        try:
            while True:
                time.sleep(3)
                scan_file("http://wordnik.com/random", logf)
        except KeyboardInterrupt:
            pass  # Ctrl-C ends the loop; the with block then closes the file

if __name__ == "__main__":
    main()
Now you can close the program by typing Ctrl-C, and the "with" clause will ensure that the log file is closed properly.
From what I understand, you want to output a random word every three seconds into a file. But buffering will take place, so you will not see your words until the buffer has grown too large, typically on the order of 4 KB.
I suggest that in your loop you add an f.flush() call before the sleep() line.
Also, as wheaties suggested, you should have proper exception handling (if I want to stop your program, I will likely send a SIGINT using Ctrl+C, and your program won't stop cleanly in this case) and a proper exit path.
I'm sure that when you test your program you kill it hard to stop it, and whatever it has written is lost because the file is never properly closed. If your program could exit normally, you would have close()d the file, and close() triggers a flush(), so you would have something written to your file.
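A minimal sketch of the flush suggestion; get_random_word stands in for the scraping code from the question:
import time

f = open("randomwords.txt", "w")
try:
    while True:
        w = get_random_word()  # placeholder for the BeautifulSoup scraping
        f.write(w + "\n")
        f.flush()              # push the buffered data to the OS right away
        time.sleep(3)
except KeyboardInterrupt:
    pass
finally:
    f.close()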
Read the answer posted by wheaties.
And if you want to force the file's buffer to be written to disk, read:
http://docs.python.org/library/stdtypes.html#file.flush
