Python caching file reads? File reads are faster on second execution - python

I wrote a simple caching script that searches through a directory for certain files, hashes that directory and then reads and processes the files found in that directory in a loop. It then writes the hash plus whatever data I was able to pull to a file.
On the first execution it takes about twenty seconds, which is expected as there are lots of big files to regex-match line by line. However, on subsequent executions it runs MUCH faster, around 4 seconds. This seems to persist even if I start a new instance of the Python interpreter or close out my IDE (Spyder) entirely.
What's interesting though is that if I run the script on a different folder, it takes about 20 seconds again, then back to 4 seconds every time after. Also interesting is that running it with Python 3 on those same two folders also takes 20 seconds the first time, then again back to 4 seconds every time after. If I rename the folder, it only takes 4 seconds on the first run. So far I haven't been able to recreate the 20 second runtime at all.
It seems to me that Python or the OS must be internally caching the file reads (just an educated guess as I think the open() calls are the most expensive).
I'm not using anything fancy here btw, just Python's built-in open() and looping through the lines, the re module, and a very minimal hash module that can be found here.
Here's one of the functions that opens and processes these files in a loop (the others are virtually the same, just searching in different paths)
def createGlobalDataDictionary(self, path):
    dict1 = {}
    dict2 = {}
    start = timeit.default_timer()
    self.filelist[path] = hashfunc.dirhash(path, 'sha1')
    for i in os.listdir(path):
        if i.endswith(".m"):
            if re.match('.*foo.*', i):
                openi = open(os.path.join(path, i), "r")
                for line in openi:
                    if re.match('.*bar.*', line):
                        key = line.split('.')[0]
                        value = line.partition('\'')[-1].rpartition('\'')[0]
                        dict1[key] = value
                openi.close()
            elif re.match('.*baz.*', i):
                openi = open(os.path.join(path, i), "r")
                for line in openi:
                    if re.match(".*qux.*", line):
                        key = line.split('.')[0]
                        value = line.partition('\'')[-1].rpartition('\'')[0]
                        dict2[key] = value
                openi.close()
    return (dict1, dict2)
I'm using WinPython 32-Bit 2.7.6.3 and WinPython 64-Bit 3.4.2.4
I've been trying to figure out what's causing this behavior all day and I'm pretty stumped so any help is appreciated.
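One quick way to test the OS-cache guess is to time nothing but the raw reads, twice in a row, in a fresh interpreter; a rough sketch, where the folder path is just a placeholder for one of the slow directories:
import os
import timeit

def read_all(path):
    # Read every matching file end to end, with no regex work at all.
    for name in os.listdir(path):
        if name.endswith(".m"):
            with open(os.path.join(path, name), "rb") as f:
                while f.read(1024 * 1024):
                    pass

folder = r"C:\some\folder"  # placeholder: one of the directories that was slow on first run
print(timeit.timeit(lambda: read_all(folder), number=1))  # first pass
print(timeit.timeit(lambda: read_all(folder), number=1))  # second pass
If the second pass is drastically faster even though Python did identical work both times, the speedup is coming from the operating system's file cache rather than from anything Python itself caches.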


Loop an existing script

I'm using a script from a third party that I can't modify or show (let's call it original.py) which takes a file and produces some calculations. At the end it outputs a result (using the print statement).
Since I have many files, I decided to make a second script that gets all the wanted files and runs them through original.py:
1st get list of all files to run
2nd run each file through the original.py
3rd obtain results from each file
I have the 1st and 2nd step. However, the end result only saves the calculations from the last file it read.
import sys
import original
import glob
import os

fn = str(sys.argv[1])
for filename in sys.argv[1:]:
    print(filename)
ficheiros = [f for f in glob.glob(fn)]
for ficheiro in ficheiros:
    original.file = bytes(ficheiro, 'utf-8')
    original.function()
To summarize:
Knowing I can't change the original script (which is written with a print statement), how can I obtain the results for each loop? Is there a better way than using a for loop?
The first script can be invoked with python original.py
It requires the file to be changed manually inside the script in the original.file line.
This script outputs the result in the console and I redirect it with: python original.py > result.txt
At the moment when I try to run my script, it reads all the correct files in the folder but only returns the results for the last file.
(I tried to reformulate the question; hopefully it's easier to understand.)
The problem is due to a mistake in the line `ficheiros = [f for f in glob.glob(fn)]`: it's only reading one file, hence only outputting one result.
Thanks for the time.sleep() trick in the comments.
Solved:
I changed the initial part to:
fn = str(sys.argv[1])
ficheiros = []
for filename in sys.argv[1:]:
    ficheiros.append(filename)
    #print(filename)
and now it correctly reads all the files and it outputs all the results
Depending on your operating system there are different ways to take what is printed to the console and append it to a file.
For example, on Linux you could run the file that calls original.py for every input like this: python yourfile.py >> outputfile.txt, which will effectively save everything that is printed into outputfile.txt.
The syntax is similar on Windows.
I'm not quite sure what you're asking, but you could try one of these:
Either redirecting all output to a file for later use, by running the script like so: python secondscript.py > outfilename.txt
Or, and this might or might not work for you, redefining the print command as a function that outputs the result how you want, e.g.:
def print(x):
    with open('outfile.txt', 'a') as f:   # append, so each printed result is kept rather than overwritten
        f.write('example: ' + x + '\n')
If you choose the second option, I recommend saving the old print function (oldprint = print) so you can restore and use the regular print later.
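A minimal sketch of that second option, assuming Python 3 on both sides (where print is a built-in function): for the replacement to be picked up inside the imported original module, it has to be installed on builtins rather than just defined locally, and results.txt is only an example name:
import builtins

oldprint = builtins.print                  # keep a handle on the regular print

def capturing_print(*args, **kwargs):
    with open('results.txt', 'a') as f:    # append, so every file's result is kept
        f.write(' '.join(str(a) for a in args) + '\n')
    oldprint(*args, **kwargs)              # still echo to the console

builtins.print = capturing_print           # prints inside original.py now land in results.txt too
# ... call original.function() for each file here ...
builtins.print = oldprint                  # restore the regular print afterwards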
I don't know if I got exactly what you want. You have a first script named original.py which takes some arguments and returns things in the form of print statements, and you would like to grab these print statements in your script to do things with them?
If so, a solution could be the subprocess module:
Let's say that this is original.py:
print("Hi, I'm original.py")
print("print me!")
And this is main.py:
import subprocess

script_path = "original.py"
print("Executing ", script_path)
process = subprocess.Popen(["python3", script_path], stdout=subprocess.PIPE)
for line in process.stdout:
    print(line.decode("utf8"))
You can easily add more arguments in the Popen call like ["arg1", "arg2",] etc.
Output:
Executing original.py
Hi, I'm original.py
print me!
and you can grab the lines in the main.py to do what you want with them.
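As a sketch of that last step, assuming Python 3.5+ (for subprocess.run) and using result.txt purely as an example name, the captured output can be written wherever you like:
import subprocess

script_path = "original.py"                 # same example script as above
result = subprocess.run(["python3", script_path], stdout=subprocess.PIPE)
output = result.stdout.decode("utf8")       # everything original.py printed, as one string

with open("result.txt", "a") as f:          # append, so repeated runs accumulate
    f.write(output)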

Getting number of lines in a text file without readlines

Let's say I have a program that uses a .txt file to store data it needs to operate. Because it's a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through the data in it, so that my program leaves as much memory free as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3).
for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        # ask for a password
if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")
Assume that validpassword is set to True or False where I ask for a password. My question is: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only take a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot find the number of lines and add to it as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of to do this is not too great: it involves calling readlines one line at a time over a range so big the text file must be smaller, and then stopping when I get an error. I would prefer not to use this way, so any suggestions would be appreciated.
You can just iterate over the file handle directly, which will then iterate over it line-by-line:
for line in file:
    if username == line.strip():
        validusername = True
        break
Other than that, you can’t really tell how many lines a file has without looking at it completely. You do know how big a file is, and you could make some assumptions on the character count for example (UTF-8 ruins that though :P); but you don’t know how long each line is without seeing it, so you don’t know where the line breaks are and as such can’t tell how many lines there are in total. You still would have to look at every character one-by-one to see if a new line begins or not.
So instead of that, we just iterate over the file and stop whenever we have read a whole line (that's when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
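If you do still need the line count itself (for progress reporting, say), a common memory-friendly idiom is to let that same iteration do the counting; a minimal sketch, with users.txt as an example name:
with open("users.txt") as f:          # "users.txt" is just an example name
    num_lines = sum(1 for _ in f)     # reads one line at a time, keeps none of them
print(num_lines)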
Yes, the good news is you can find the number of lines in a text file without readlines, for line in file, etc. More specifically, in Python you can use byte functions, random access, parallel operation, and regular expressions instead of slow sequential text line processing. A parallel line counter for text files such as CSVs is particularly suitable for SSD devices, which have fast random access, when combined with many processor cores. I used a 16 core system with an SSD to store the Higgs Boson dataset as a standard file, which you can go download to test on. Even more specifically, here are fragments from working code to get you started. You are welcome to freely copy and use it, but if you do then please cite my work, thank you:
import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file like CSV file line counter is particularly suitable for SSD which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False 1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest = cmd['unitTest']
    fileName = cmd['fileName']
    balanceFactor = cmd['balanceFactor']
    numProcesses = cmd['numProcesses']

# Do arithmetic to divide partitions into startbyte, endbyte strips among workers (2 lists of int)
# Best number of strips to use is 2x to 3x number of workers, for workload balancing
# import numpy as np  # long heavy import but i love numpy syntax
def PartitionDataToWorkers(workers, items, balanceFactor=2):
    strips = balanceFactor * workers
    step = int(round(float(items) / strips))
    startPos = list(range(1, items + 1, step))
    if len(startPos) > strips:
        startPos = startPos[:-1]
    endPos = [x + step - 1 for x in startPos]
    endPos[-1] = items
    return startPos, endPos

def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte - 1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

if 0 == unitTest:
    # Run app, not unit tests.
    fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
    startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
    p = Pool(numProcesses)
    partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
    globalSum = sum(partialSum)
    print(globalSum)
else:
    print("Running unit tests")  # Bash commands like: head --bytes 96 beer.csv are how I found the correct values.
    fileName = 'beer.csv'  # byte 98 is a newline
    assert(8 == ReadFileSegment(1, 288, fileName))
    assert(1 == ReadFileSegment(1, 100, fileName))
    assert(0 == ReadFileSegment(1, 97, fileName))
    assert(1 == ReadFileSegment(97, 98, fileName))
    assert(1 == ReadFileSegment(98, 99, fileName))
    assert(0 == ReadFileSegment(99, 99, fileName))
    assert(1 == ReadFileSegment(98, 98, fileName))
    assert(0 == ReadFileSegment(97, 97, fileName))
    print("OK")
The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance test results. That said, if you change some of this code to use Cython or something, you might get even more speed.
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program.
I wondered if compiling the regex just one time and passing it to all workers will improve speed. Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers is dominating.
One more thing. Does parallel CSV file reading even help, I wondered? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need for doing machine learning where you generally need to partition a dataset into training, dev, and testing examples.
Higgs Boson dataset:
https://archive.ics.uci.edu/ml/datasets/HIGGS
If you want the number of lines in a file that badly, why don't you just use len?
with open("filename") as f:
    num = len(f.readlines())

opening and writing a large binary file python

I have a homebrew web based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box not present on the production system.
In linux this is a non-issue (the local dev box is a windows system).
I have the following code
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)
# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).
My issue is with the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since Windows cares a metric ton about binary files, if I don't specify 'rb' the file will not read to the end in the data read loop; however, for some reason, since I am also writing to temp_file it never completely reads to the end of the file... UNLESS I add a + after the b: 'rb+'.
If I change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+') everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus it fails and cannot read the entire file...
Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)
# no dumps to be taken when finishing up the decrypt process...
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
Clarification
The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes; when I don't specify b on Windows it cycles through the loop twice and returns some crappy data (because Python thinks it is interacting with a string file, not a binary file).
When I specify b it takes 10-15 seconds to completely read in the file, but it does it successfully, and the code completes normally.
When I am concurrently writing to another file as I read in from the source file (as in the first example), if I do not specify rb+ it displays the same behavior as not even specifying b, which is that it only reads a couple of segments from the file before closing the handle and moving on; I end up with an incomplete file and the decryption fails.
I'm going to take a guess here:
You have some other program that's continually replacing the files you're trying to read.
On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.
On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.
Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.
In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…
Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*
* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.
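For reference, here is a rough sketch of the write-to-a-temporary-file-then-rename pattern described above, on the side of the program that produces the file. It assumes Python 3.3+, where os.replace overwrites an existing destination in a single step; as noted above, on Windows this can still fail if a reader has the destination open without delete sharing, so it is a sketch of the idea rather than a complete solution:
import os
import tempfile

def atomic_write(path, data):
    # Write to a temporary file in the same directory, then swap it into place,
    # so a reader only ever sees the old complete file or the new complete file.
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)   # one-step replacement of the destination
    except Exception:
        os.unlink(tmp_path)
        raise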

Saving data in Python without a text file?

I have a python program that just needs to save one line of text (a path to a specific folder on the computer).
I've got it working to store it in a text file and read from it; however, I'd much prefer a solution where the python file is the only one.
And so, I ask: is there any way to save text in a Python program even after it's closed, without any new files being created?
EDIT: I'm using py2exe to make the program an .exe file afterwards: maybe the file could be stored in there, and so it's as though there is no text file?
You can save the file name in the Python script and modify it in the script itself, if you like. For example:
import re, sys

savefile = "widget.txt"
x = input("Save file name?:")
lines = list(open(sys.argv[0]))
out = open(sys.argv[0], "w")
for line in lines:
    if re.match("^savefile", line):
        line = 'savefile = "' + x + '"\n'
    out.write(line)
This script reads itself into a list then opens itself again for writing and amends the line in which savefile is set. Each time the script is run, the change to the value of savefile will be persistent.
I wouldn't necessarily recommend this sort of self-modifying code as good practice, but I think this may be what you're looking for.
Seems like what you want to do would better be solved using the Windows Registry - I am assuming that since you mentioned you'll be creating an exe from your script.
The following snippet tries to read a string from the registry and, if it doesn't find it (such as when the program is started for the first time), it will create this string. No files, no mess... except that there will be a registry entry lying around. If you remove the software from the computer, you should also remove the key from the registry. Also be sure to change the MyCompany, MyProgram and My String designators to something more meaningful.
See the Python _winreg API for details.
import _winreg as wr

key_location = r'Software\MyCompany\MyProgram'

try:
    key = wr.OpenKey(wr.HKEY_CURRENT_USER, key_location, 0, wr.KEY_ALL_ACCESS)
    value = wr.QueryValueEx(key, 'My String')
    print('Found value:', value)
except:
    print('Creating value.')
    key = wr.CreateKey(wr.HKEY_CURRENT_USER, key_location)
    wr.SetValueEx(key, 'My String', 0, wr.REG_SZ, 'This is what I want to save!')
wr.CloseKey(key)
Note that the _winreg module is called winreg in Python 3.
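For completeness, the read side under Python 3 looks essentially the same with the renamed module; a minimal sketch, using the same example key path and value name (QueryValueEx returns a (value, type) pair):
import winreg as wr

key_location = r'Software\MyCompany\MyProgram'
with wr.OpenKey(wr.HKEY_CURRENT_USER, key_location) as key:   # key handles support "with" in Python 3
    value, value_type = wr.QueryValueEx(key, 'My String')
    print('Found value:', value)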
Why don't you just put it at the beginning of the code? E.g. start your code with:
import ... #import statements should always go first
path = 'what you want to save'
And now you have path saved as a string

How to read and truncate the snmptrapd log file without restarting the daemon

I have made a Python script that performs a Nagios check. The functionality of the script is pretty simple: it just parses a log and matches some info, which is used to construct the Nagios check output. The log is an snmptrapd log, which records the traps from other servers and logs them in /var/log/snmptrapd, after which I just parse them with the script. In order to have the latest traps, I erase the log from Python each time after reading it. In order to preserve the info, I have made a cron job that copies the content of the log into another log at a time interval a bit smaller than the Nagios check interval.
The thing that I don't understand is why the log is growing so much (I mean, the messages log, which I guess has 1000 times more info, is smaller). From what I've seen in the log there are a lot of special characters like ^# and I think that this is caused by the way I'm manipulating the file from Python, but seeing that I only have like three weeks of experience with it, I can't seem to figure out the problem.
The script code is the following:
import sys, os, re

validstring = "OK"
filename = "/var/log/snmptrapd.log"

if os.stat(filename)[6] == 0:
    print validstring
    sys.exit()
else:
    f = open(filename, "r")
    sharestring = ""
    line1 = []
    patte0 = re.compile("[0-9]+-[0-9]+-[0-9]+")
    patte2 = re.compile("NG: [a-zA-Z\s=0-9]+.*")
    for line in f:
        line1 = line.split(" ")
        if re.search(patte0, line1[0]):
            sharestring = sharestring + line1[1] + " "
            continue
        result2 = re.search(patte2, line)
        if result2:
            result22 = result2.group()
            result22 = result22.replace("NG:", "")
            sharestring = sharestring + result22 + " "
    f.close()
    f1 = open(filename, "w")
    f1.close()
    print sharestring
    sys.exit(2)
The log looks like:
2012-07-11 04:17:16 Some IP(via UDP: [this is an ip]:port) TRAP, SNMP v1, community somestring
SNMPv2-SMI::enterprises.OID Some info which is not necesarry
SNMPv2-MIB::sysDescrOID = STRING: info which i'm matching
I'm pretty sure that it has something to do with my way of erasing the file, but I can't figure it out. If you have some idea I would be really interested. Thank you.
As for the size: I have 93 lines (so says Vim) and the log occupies 161K, which is not OK because the lines are quite short.
OK, it has nothing to do with the way I read and erased the file. It is something in the snmptrapd daemon that is doing this when I'm erasing its log file. I have modified my code and now I send SIGSTOP to snmptrapd right before I open the file, make my modifications to the file, and then send SIGCONT after I'm done, but it seems I experience the same behavior. The new code looks like this (only the parts that differ):
else:
    command = "pidof snmptrapd"
    p = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE)
    pidstring = p.stdout.readline()
    patte1 = re.compile("[0-9]+")
    pidnr = re.search(patte1, pidstring)
    pid = pidnr.group()
    os.kill(int(pid), SIGSTOP)
    time.sleep(0.5)
    f = open(filename, "r+")
    sharestring = ""
and
sharestring = sharestring + result22 + " "
f.truncate(0)
f.close()
time.sleep(0.5)
os.kill(int(pid), SIGCONT)
print sharestring
I'm thinking of stopping the daemon, erasing the file, and after that recreating it with the proper permissions and starting the daemon again.
I don't think you can, but here are some things to try
Truncating a File
f1 = open(filename, 'w')
f1.close()
is a hacky, side-effect way of deleting a file's contents and will probably cause undesired side effects, depending on the underlying OS, if other applications have that file open.
Using the File Object method truncate()
truncate([size])
Truncate the file's size. If the optional size argument is present,
the file is truncated to (at most) that size. The size defaults to the
current position. The current file position is not changed. Note that
if a specified size exceeds the file's current size, the result is
platform-dependent: possibilities include that the file may remain
unchanged, increase to the specified size as if zero-filled, or
increase to the specified size with undefined new content.
Availability: Windows, many Unix variants.
Probably the only deterministic way to do this is to stop the snmptrapd process at the start of the script, use the proper os module function (os.remove), and then recreate the file and restart the snmptrapd daemon at the end of the script. A rough sketch of this sequence follows the quoted documentation below.
os.remove(path)
Remove (delete) the file path. If path is a directory, OSError is
raised; see rmdir() below to remove a directory. This is identical to
the unlink() function documented below. On Windows, attempting to
remove a file that is in use causes an exception to be raised; on
Unix, the directory entry is removed but the storage allocated to the
file is not made available until the original file is no longer in
use.
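A rough sketch of that sequence, where the service command and the file permissions are assumptions that depend on the distribution and on how snmptrapd is run:
import os
import subprocess

log_path = "/var/log/snmptrapd.log"

subprocess.check_call(["service", "snmptrapd", "stop"])   # assumption: sysvinit-style service command
try:
    # ... read and parse log_path here ...
    os.remove(log_path)                      # delete the old log outright
    open(log_path, "w").close()              # recreate it empty
    os.chmod(log_path, 0o600)                # assumption: match the permissions snmptrapd expects
finally:
    subprocess.check_call(["service", "snmptrapd", "start"])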
Shared resource concern
You still might have problems with having two processes trying to fight for writing to a single file without some kind of locking mechanism and having non-deterministic things happening to the file. I bet you can send a SIGINT or something similar to your daemon process and get it to re-read the file or something, check your documentation.
Manipulating shared resources, especially file resources without exclusive locking is going to be trouble, especially with filesystem caching and application caching of data.
