So I'm writing a script to download a file and restart the download from where it left off if it doesn't complete. That part's taken care of.
I took out all of those parts out to show just the part that doesn't work. The amazon url here I just made up, so it won't actually download anything if you run this, but replacing the url with an actual download link, would:
import urllib
import time
import os
file_name = "setup.exe"
web_page = urllib.FancyURLopener().open("https://s3.amazonaws.com/some_bucket/files/"+file_name)
while True:
data = web_page.read(8192)
if not data:
print "done"
break
#print os.getcwd()
with open(file_name, "ab") as outputFile:
outputFile.write(data)
#print "going..."
#time.sleep(1)
What happens is (and this is ONLY when trying to download EXE files), The process will read from web_page a seemingly random number of times (between 1 and 20ish), then throws an IOError: 13, Permission denied. Again, with a .gif or .mov, or a few other things I've tested, the permission denied error is never thrown.
Additionally, uncommenting the time.sleep(1) line resolves the issue. It's as though the with statement doesn't fully close the file before continuing.
I thought the with statement was supposed to handle the close, no?
I also thought that perhaps SOMEHOW my current directory was being changed, but uncommenting that never reveals it (though by the same logic it wouldn't necessarily have to).
(What's also weird is that if I run this script from the desktop [so that it also writes to the desktop] with Aptana opened in front of it, the permission denied error won't occur, yet the second I minimize the text editor to focus the desktop, the error is thrown -- I attribute this to Aptana taking up a lot of resources while it's opened, therefore slowing down the other process and being kind of like a time.sleep??)
Thanks very much for any pointers.
I don't understand why you're going to the trouble to re-open and close the file for each network read. As Pavel suggests, this might give a virus scanner a chance to open (and lock?) the file to scan it. Why not just open it once, do all your I/O, then close it? (I suppose it may have something to do with the code you omitted.)
Instead of:
while True:
data = web_page.read(8192)
if not data:
print "done"
break
with open(file_name, "ab") as outputFile:
outputFile.write(data)
Try:
with open(file_name, "ab") as outputFile:
while True:
data = web_page.read(8192)
if not data:
print "done"
break
outputFile.write(data)
Related
I'm writing a web scraper that has to download 8000 files in total. In my script i download the files consecutively and delete the previous one after the relevant information has been extracted. To delete the file, i use "os.remove(downloaded_file)". So far, in 500+ downloads, 3 times it didnt remove the file, but just removed the contents of the file, so an exception happened when the script tried to copy things from an empty file. Has anyone experienced this or can explain what is happening?
Working on windows 10
I couldn't fine any relevant information on this error so far.
def copy_to_master_and_delete_df(downloaded_file,master_file):
'''open a downloaded csv file, copy the data (line 10), append to master file and delete the downloaded file'''
while not os.path.exists(downloaded_file):
time.sleep(0.5)
log(f'waiting for {bank} {quarter} to download')
with open(downloaded_file, encoding='utf-8') as df:
data = list(df.readlines())[-1]
os.remove(downloaded_file)
while os.path.exists(downloaded_file):
time.sleep(0.1)
log(f'waiting for {bank} {quarter} to be deleted')
with open(master_file, 'a', encoding='utf-8') as mf:
mf.write(data)
On data = list(df.readlines())[-1] it gives an exception:
Exception has occurred: IndexError
list index out of range
It happens because of what is described earlier, the contents are removed, but not the file itself.
To work around this problem a little, i have put an infinite
while os.path.exists(downloaded_file):
time.sleep(0.1)
log(f'waiting for {bank} {quarter} to be deleted')
that allows me to manually delete the file and have the script not break down.
I'm asking for help because it went to a next level. The script somehow jumped over the line where i check if the file is deleted (again, the content is deleted but the file is not) and downloaded the next one, so the script crashed when it looked into an empty file.
Any ideas to why this is happening or how to handle this?
I suspect that's a buffer flush problem. Try following the remove with a call to os.sync(), os.fsync() on Windows, or see here for the buffering option to disable buffering when opening a file.
I've encountered this strange problem with opening/closing files in python. I am trying to do the same thing in python that i was doing successfully in matlab, and i am getting a problem communicating with some software through text files. I've come up with a strange workaround to solve the problem, but i dont understand why it works.
I have software that communicates with a some lab equipment. To communicate with this software, i write a file ('wavefile.txt') to a specific folder, containing parameters to send to the device. I then write another file named 'request.txt' containing the location of this first file ('wavefile.txt') which contains the parameters to send to the device. The software is constantly checking this folder to find the file named 'request.txt' and once it finds it, it will read the parameters in the file which is specified by the text in 'request.txt' and then delete 'request.txt'. The software/equipment developer instructs to give a 50 ms second delay before closing the 'request.txt' file.
original matlab code that works:
home = cd;
cd \\CREOL-FAST-01\data
fileID = fopen('request.txt', 'wt');
proj = 'C:\\dazzler\\data\\wavefile.txt';
fprintf(fileID, proj);
pause(0.05);
fclose('all');
cd(home);
original python code that does not work:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
proj = r'C:\dazzler\data\wavefile.txt'
file.write(proj)
time.sleep(0.05)
os.chdir(home)
Every time the device program reads the 'request.txt' when its working with matlab, it deletes it immediately after matlab closes it. When i run that code with python it works SOMETIMES, maybe 1 in every 5 tries will be successful and the parameters are sent. The 'request.txt' file is always deleted with the python code above,but the parameters i've input are clearly not sent to my lab device. My guess is that when i write the file in python, the device program is able to read it before python writes the text to it, so its just opening the blank file, not applying any parameters, and then deleting it.
My workaround in python:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
fileh = open('request.txt', 'w+')
proj = r'C:\dazzler\data\wavefile.txt'
fileh.write(proj)
time.sleep(0.05)
print(fileh.read())
time.sleep(0.05)
fileh.close()
This method in python seems to work 100% of the time. I open the file in w+ mode, and using fileh.read() is absolutely necessary. if i delete that line and still include the extra sleeptime, it will again work about 1 in 5 tries. This seems really strange to me. Any explanation, or better solutions?
My guess (which could be wrong) is that the file is being read before it is completely flushed. I would try using the flush() method after the write to make sure that the complete data is written to the file. You might also need the os.fsync() method to make sure the data is flushed properly. Try something like this:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
proj = r'C:\dazzler\data\wavefile.txt'
file.write(proj)
file.flush()
os.fsync()
time.sleep(0.05)
os.chdir(home)
Not knowing any details about the particular equipment and other software you are using it's hard to say. One guess is the difference in buffering on write calls.
From this blog post on Matlab's fwrite: "The default behavior for fprintf and fwrite is to flush the file buffer after each call to either of these functions"
Whereas for Python's open:
When no buffering argument is given, the default buffering policy
works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying
device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On
many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use
line buffering. Other text files use the policy described above for
binary files.
To test this guess change:
with open('request.txt', 'w') as file:
proj = r'C:\dazzler\data\wavefile.txt'
to:
with open('request.txt', 'w', buffer=1) as file:
proj = 'C:\\dazzler\\data\\wavefile.txt\n'
The problem is probably that you are doing the delay while the file is still open and thus not written to disk. Try something like this:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
proj = r'C:\dazzler\data\wavefile.txt'
file.write(proj)
time.sleep(0.05)
os.chdir(home)
The only difference here is that the sleep is done after the file is closed (the file is closed when the with block ends), and thus the delay doesn't happen until after the text is written to disk.
To put it in words, what you are doing is:
Open (and create) file
Write text to a buffer (in memory, not on disk)
Wait 50 ms
Close (and write) the file
What you want to do is:
Open (and create) file
Write text to a buffer (in memory, not on disk)
Close (and write) the file
Wait 50 ms
So what you end up with is a period of at least 50 ms where the text file has been created, but where there is nothing in it because the text is sitting in your computer memory not on disk.
To be honest, I would put as little in the with block as possible, to avoid issues like this. So I would write it like so:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
proj = r'C:\dazzler\data\wavefile.txt'
with open('request.txt', 'w') as file:
file.write(proj)
time.sleep(0.05)
os.chdir(home)
Also keep in mind that you also can't do the opposite: assume that no text is written until you close. For small files like this, that will probably happen. But when the file is written to disk depends on a lot of factors. When a file is closed, it is written, but it may be written before that too.
I have a Python script that runs properly on my laptop, but when running on my raspberry pi, the following code does not seem to be working properly. Specifically, "TextFile.txt" is not being updated and/or saved.
openfile = open('/PATH/TextFile.txt','w')
for line in lines:
if line.startswith(start):
openfile.write(keep+'\n')
print ("test 1")
else:
openfile.write(line)
print ("test 2")
openfile.close()
I am seeing "test 1" and "test 2" in my output, so I know that the code is being reached, paths are correct, etc
It may be due to a permissions problem. I am running the script from the terminal by using:
usr/bin/python PATH/script.py
Python is owned by "root" and script.py is owned by "Michael".
My first guess:
Does the file exist? If it does not exist then you cannot write to it. Try this to create the file if it does not exist: file = open('myfile.dat', 'w+')
Additionally manually opening and closing file handles is bad practice in python. The with statement handles the opening and closing of the resource automatically for you:
with open("myfile.dat", "w+") as f:
#doyourcalculations with the file object here
for line in f:
print line
All, thank you for your input. I was able to figure out that it was writing to the new file, but it was overwriting with the same text. The reason was because ".startswith" was returning false when I expected true. The misconception was due to the difference between how Windows and Unix treat new line characters (/n /r).
Since your code is running, there should be a file somewhere.
You call "PATH/script.py", but there is "/PATH/TextFile.txt" in your program. Is the slash before PATH a mistake? Have you checked the path in your program is really where you are looking for the output file?
I have a homebrew web based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box not present on the production system.
In linux this is a non-issue (the local dev box is a windows system).
I have the following code
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')
data = file.read(settings.READ_SIZE)
while data:
dec_data = decrypt.update(data)
temp_file.write(dec_data)
data = file.read(settings.READ_SIZE)
# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).
My issue is on the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since windows cares a metric ton about binary files if I don't specify 'rb' the file will not read to end on the data read loop; however, for some reason since I am also writing to temp_file it never completely reads to the end of the file...UNLESS i add a + after the b 'rb+'.
if i change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+') everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus it fails and cannot read the entire file...
Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
data = file.read(settings.READ_SIZE)
while data:
dec_data = decrypt.update(data)
response.write(dec_data)
data = file.read(settings.READ_SIZE)
# no dumps to be taken when finishing up the decrypt process...
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
Clarification
The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes, when I don't specify b in windows it cycles through the loop twice and returns some crappy data (because python thinks it is interacting with a string file not a binary file).
When I specify b it takes 10-15 seconds to completely read in the file, but it does it succesfully, and the code completes normally.
When I am concurrently writing to another file as i read in from the source file (as in the first example) if I do not specify rb+ it displays the same behavior as not even specifying b which is, that it only reads a couple segments from the file before closing the handle and moving on, i end up with an incomplete file and the decryption fails.
I'm going to take a guess here:
You have some other program that's continually replacing the files you're trying to read.
On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.
On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.
Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.
In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…
Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*
* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.
i have made a python script that performs a nagios check. The functionality of the script is pretty simple it just parses a log and matches some info witch is used to construct the nagios check output. The log is a snmptrapd log witch records the traps from other servers and logs them in /var/log/snmptrapd after witch i just parse them with the script. In order to have the latest traps i erase the log from python each time after reading it. In order to preserve the info i have made a cron job that copies the content of the log into another log at an time interval a bit smaller than the nagios check interval. The thing that i don't understand is why is the log growing so much (i mean the messages log which has i guess 1000 times more info is smaller). From what i've seen in the log there are a lot of special characters like ^# and i think that this is done by the way i'm manipulating the file from pyton but seeing that i olny have like three weeks of experience with it I can't seem to figure out the problem.
The script code is the following:
import sys, os, re
validstring = "OK"
filename = "/var/log/snmptrapd.log"
if os.stat(filename)[6] == 0:
print validstring
sys.exit()
else:
f = open(filename,"r")
sharestring = ""
line1 = []
patte0 = re.compile("[0-9]+-[0-9]+-[0-9]+")
patte2 = re.compile("NG: [a-zA-Z\s=0-9]+.*")
for line in f:
line1 = line.split(" ")
if re.search(patte0,line1[0]):
sharestring = sharestring + line1[1] + " "
continue
result2 = re.search(patte2,line)
if result2:
result22 = result2.group()
result22 = result22.replace("NG:","")
sharestring = sharestring + result22 + " "
f.close()
f1 = open(filename,"w")
f1.close()
print sharestring
sys.exit(2)
~
The log looks like:
2012-07-11 04:17:16 Some IP(via UDP: [this is an ip]:port) TRAP, SNMP v1, community somestring
SNMPv2-SMI::enterprises.OID Some info which is not necesarry
SNMPv2-MIB::sysDescrOID = STRING: info which i'm matching
I'm pretty sure that it has something to do with the my way of erasing the file but i can't figure it out. If you have some idea i would be really interested. Thank you.
As an information about the size i have 93 lines(so says Vim) and the log occupies 161K and that is not ok because the lines are quite short.
OK it has nothing to do with the way i read and erased the file. Is something in the snmptrapd daemon that is doing this when i'm erasing it's log file. I have modified my code and now i send SIGSTOP to snmptrapd reight before i open the file, and i make my modifications to the file and then i send SIGCONT after i'm done but it seem i experience the same behavior. The new code looks like(the different parts):
else:
command = "pidof snmptrapd"
p=subprocess.Popen(shlex.split(command),stdout=subprocess.PIPE)
pidstring = p.stdout.readline()
patte1 = re.compile("[0-9]+")
pidnr = re.search(patte1,pidstring)
pid = pidnr.group()
os.kill(int(pid), SIGSTOP)
time.sleep(0.5)
f = open(filename,"r+")
sharestring = ""
and
sharestring = sharestring + result22 + " "
f.truncate(0)
f.close()
time.sleep(0.5)
os.kill(int(pid), SIGCONT)
print sharestring
I'm thinking of stopping the daemon erasing the file and after that recreating it with the proper permissions and starting the daemon.
I don't think you can, but here are some things to try
Truncating a File
f1 = open(filename, 'w')
f1.close()
is a hacky side effect way of deleting a files contents and will probably be causing undesired side effects depending on the underlying OS if other applications have that file open.
Using the File Object method truncate()
truncate([size])
Truncate the file's size. If the optional size argument is present,
the file is truncated to (at most) that size. The size defaults to the
current position. The current file position is not changed. Note that
if a specified size exceeds the file's current size, the result is
platform-dependent: possibilities include that the file may remain
unchanged, increase to the specified size as if zero-filled, or
increase to the specified size with undefined new content.
Availability: Windows, many Unix variants.
Probably the only determinist way to do this is
stop the snmptrapd process at the start of the script, use the proper os module function remove and then recreate the file and restart the snmptrapd daemon at the end of the script.
os.remove(path)
Remove (delete) the file path. If path is a directory, OSError is
raised; see rmdir() below to remove a directory. This is identical to
the unlink() function documented below. On Windows, attempting to
remove a file that is in use causes an exception to be raised; on
Unix, the directory entry is removed but the storage allocated to the
file is not made available until the original file is no longer in
use.
Shared resource concern
You still might have problems with having two processes trying to fight for writing to a single file without some kind of locking mechanism and having non-deterministic things happening to the file. I bet you can send a SIGINT or something similar to your daemon process and get it to re-read the file or something, check your documentation.
Manipulating shared resources, especially file resources without exclusive locking is going to be trouble, especially with filesystem caching and application caching of data.