Advice on writing to a log file with Python

I have some code that will need to write about 20 bytes of data every 10 seconds.
I'm on Windows 7 using Python 2.7.
Can you recommend the approach that puts the least strain on the OS/hard drive?
I was thinking about opening and closing the same file every 10 seconds:
f = open('log_file.txt', 'w')
f.write(information)
f.close()
Or should I keep it open and just flush() the data and not close it as often?
What about SQLite? Will it improve performance and be less intensive than the open-and-close file operations?
(Isn't it just a flat-file database, so essentially equivalent to a text file anyway?)
What about MySQL? (It uses a local server/process; I'm not sure of the specifics of when/how it saves data to the HDD.)
I'm just worried about not frying my hard drive and about improving the performance of this logging procedure. I will be receiving new log information about every 10 seconds, and this will be going on 24/7.
Your advice?
i.e. think of programs like uTorrent that save large amounts of data constantly for long periods of time (my log file is significantly less data than what such "downloader-type" programs write).
import random
import time

def get_data():
    letters = 'isn\'t the code obvious'
    data = ''
    for i in xrange(20):
        data += random.choice(letters)
    return data

while True:
    f = open('log_file.txt', 'w')
    f.write(get_data())
    f.close()
    time.sleep(10)
My CPU starts whining after about 15 seconds... (or is that my HDD?)

As expected, Python comes with a great tool for this; have a look at the logging module.
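For illustration, a minimal sketch using logging with a rotating file handler (the file name, size limit, and logged string are just placeholders):

import logging
import logging.handlers
import time

# A rotating handler keeps the log from growing without bound;
# maxBytes and backupCount here are arbitrary placeholder values.
handler = logging.handlers.RotatingFileHandler(
    'log_file.txt', maxBytes=1024 * 1024, backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))

logger = logging.getLogger('my_logger')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

while True:
    logger.info('new 20-byte record')  # replace with the real data
    time.sleep(10)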

Use the logging framework. This is exactly what it is designed to do.
Edit: Balls, beaten to it :).

Don't worry about "frying" your hard drive - 20 bytes every 10 seconds is a small fraction of the data written to the disk in the normal operation of the OS.
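If you still want to cut down on the open/close churn, here is a minimal sketch of the keep-the-file-open-and-flush approach the question mentions (append mode assumed; flush() only pushes Python's buffer to the OS, which then writes to disk on its own schedule):

import time

f = open('log_file.txt', 'a')
try:
    while True:
        f.write('new 20-byte record\n')  # placeholder for the real data
        f.flush()  # hand the data to the OS; it decides when to hit the disk
        time.sleep(10)
finally:
    f.close()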


Python in-memory files for caching large files

I am doing very large data processing (16GB) and I would like to try to speed up the operations by storing the entire file in RAM in order to deal with disk latency.
I looked into the existing libraries but couldn't find anything that would give me the flexibility of interface I need.
Ideally, I would like to use something with an integrated read_line() method so that the behavior is similar to the standard file-reading interface.
Homer512 answered my question somewhat; mmap is a nice option that results in caching.
Alternatively, Numpy has some interfaces for parsing byte streams from a file directly into memory and it also comes with some string-related operations.
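For illustration, a minimal sketch of the mmap route (the path is a placeholder; mmap objects expose readline(), so the usage stays close to the standard file interface):

import mmap

# Memory-map the file: pages are loaded lazily and cached by the OS,
# and the mapping supports readline() like a regular file object.
with open('/path/to/large_file.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    line = mm.readline()
    while line:
        # process the line (as bytes) here
        line = mm.readline()
    mm.close()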
Just read the file once. The OS will cache it for the next read operations. It's called the page cache.
For example, run this code and see what happens (replace the path with one pointing to a file on your system):
import time

filepath = '/home/mostafa/Videos/00053.MTS'
n_bytes = 200_000_000  # read 200 MB

t0 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t1 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t2 = time.perf_counter()

print(f'duration 1 = {t1 - t0}')
print(f'duration 2 = {t2 - t1}')
On my Linux system, I get:
duration 1 = 2.1419905399670824
duration 2 = 0.10992361599346623
Note how much faster the second read operation is, even though the file was closed and reopened. This is because, in the second read, the file is being read from RAM, thanks to the Linux kernel. Also note that if you run this program again, you'll see that both operations finish quickly, because the file was cached from the previous run:
duration 1 = 0.10143039299873635
duration 2 = 0.08972924604313448
So in short, just use this code to preload the file into RAM:
with open(filepath, 'rb') as f:
    f.read()
Finally, it's worth mentioning that the page cache is practically non-deterministic: your file might not get fully cached, especially if it is large compared to the available RAM. But in cases like that, caching the whole file might not be practical anyway. In short, the OS page cache is not the most sophisticated approach, but it is good enough for many use cases.

Randomized-offset binary raw disk writes with no caching whatsoever

For my application, I am attempting to determine whether a data backup system missed any writes. I am doing this by writing an incrementing integer counter to a 1GB virtual disk, and to make sure no writes were missed I can look at the reverted snapshot and see if there were any gaps (i.e. if I see 1, 2, 3, 0, 0, 6, 7 I know that the backup didn't get writes 4 and 5 correctly). This is all on a CentOS 7 VM, with mostly Python 2.7 scripts for writes/reads (speed isn't a huge concern)
A big part of my issues has been caching: since I'm simulating random I/O, writes are often flushed from caches and written to disk out of order. This makes every test appear as a false positive, since it looks like some data is missing at the time of the snapshot. Again, I don't really care about efficiency, so I don't mind really slow writes. Reads can use caching; that's not a problem, and it doesn't matter much one way or the other.
Here are the things I have done to try to disable caching:
disabled the disk write cache with sudo hdparm -W 0 /dev/sdb, where /dev/sdb is the disk being written to
wrote to a raw disk with no filesystem, so no filesystem caching
set the buffering argument of open in the Python script to 0 (no Python write buffering)
Is it basically an impossible task to make sure that my writes get put on the disk in sequential order? All I need is write #(n) to happen before write #(n+1), and #(n+1) before #(n+2), etc.
This is the Python script I'm using to write to the disk (SIZE and PRIME change based on the size of the disk and a random seed):
from struct import pack, unpack
import sys

SIZE, PRIME = [x], [x]

# random I/O traversal iterator
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield (ctr % b)
        ctr += a

with open('/dev/sdb', 'rb+', buffering=0) as f:
    index_gen = rand_index_generator(PRIME, SIZE)
    # random traversal using iterator above, write counter to file
    for counter in xrange(1, SIZE - 16):
        f.seek(index_gen.next() * 4)
        f.write(pack('>I', counter))
Then to validate, I traverse in the same order and watch for gaps of unwritten data. This is after reverting the VM back to the snapshot. I know all the traversal and writing machinery works, since validation runs smoothly with no missed writes before reverting, but I think some "written" data dies in RAM and never makes it to disk.
I'll take any suggestions to guarantee the write order I need for this application.
Found out the answer to this question. I had misunderstood the effect of writing to a raw disk: it did not eliminate OS caching, since I was still going through the OS to write to the raw disk. Oops.
To bypass OS caches you should use os.open and pass the os.O_DIRECT and os.O_SYNC flags to make sure writes happen in the correct sequence (more info on those flags) and are not stuck in volatile memory. I used mmap with an os-level file descriptor, but you could adapt this to normal file handles as well.
The page size is specific to your operating system; on Linux it is typically 4096 bytes.
The top section of the code stayed the same but here is the write loop:
import os
import mmap

PAGESIZE = 4096
filedesc = os.open('/dev/sdb', os.O_DIRECT | os.O_SYNC | os.O_RDWR)

for counter in xrange(1, SIZE - 16):
    write_loc = index_gen.next() * 4
    page_dist = (write_loc % PAGESIZE)
    offset = write_loc - page_dist
    bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
    bytemap[page_dist:(page_dist + 4)] = pack('>I', counter)
    bytemap.flush()
    bytemap.close()
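For completeness, a hedged sketch of the read-back verification pass described in the question (Python 2, matching the rest of this question; SIZE and PRIME are the same placeholders as above, and buffered reads are fine here because only the writes need to bypass the caches):

from struct import unpack

# Revisit the offsets in the same pseudo-random order as the writer and
# report any counter value that did not survive the reverted snapshot.
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield ctr % b
        ctr += a

with open('/dev/sdb', 'rb') as f:
    index_gen = rand_index_generator(PRIME, SIZE)
    for counter in xrange(1, SIZE - 16):
        f.seek(next(index_gen) * 4)
        value = unpack('>I', f.read(4))[0]
        if value != counter:
            print('gap at write #%d (found %d)' % (counter, value))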

Python asynchronous file download + parsing + outputting to JSON

To briefly explain the context, I am downloading SEC prospectus data, for example. After downloading I want to parse each file to extract certain data, then output the parsed dictionary to a JSON file which consists of a list of dictionaries. I would use a SQL database for output, but the research cluster admins at my university are being slow getting me access. If anyone has any suggestions for how to store the data for easy reading/writing later I would appreciate it; I was thinking about HDF5 as a possible alternative.
Here is a minimal example of what I am doing, with the spots that I think need to be improved labeled:
import json
import os
from math import ceil
from multiprocessing import Pool

import grequests
from tqdm import tqdm

BASE_URL = 'https://www.sec.gov/Archives/'

def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions
    return data

if __name__ == "__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths = (12, 62, 12, 12, 44)
        items += [[item[sum(widths[:j]):sum(widths[:j+1])].strip()
                   for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]

    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls)/filelimit)):
        print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ', end='\r', flush=True)
        resp += [r for r in grequests.map(
            (grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit]))]

    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file, resp, chunksize=20)
        rs.wait()
        prospectus = rs.get()

    with open('prospectus_data.json', 'w') as f:
        json.dump(prospectus, f)
The getformurls.sh referenced is a bash script I wrote; it was faster than doing it in Python since I could use grep. The code for that is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[A|B]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of urls (~2k) on my laptop (uses 100% of my CPU and ~20GB of RAM: ~10GB for the file downloads and another ~10GB when the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster, it spins up to ~100GB RAM and ~3TB swap usage, then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally when you have runaway memory usage with a Pool it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you...
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often but not so low that it slows things down. (Maybe a minute's worth of processing... just as a guess.)
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file, resp, chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside the worker process you already have. Pulling 18K URLs' worth of data and caching it all up front is going to take time and memory. If it's all encapsulated in the worker, you'll minimize data usage and you won't need to spin up so many open file handles. A sketch of that idea follows.
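To illustrate (this is a hedged sketch, not a drop-in replacement: download_and_classify is a hypothetical helper, classify_file is the function from the question, and requests.Response objects expose .url just like the grequests responses did):

import json
import requests
from multiprocessing import Pool

def download_and_classify(url):
    # Hypothetical worker: fetch one filing and parse it immediately,
    # so only a handful of responses are ever held in memory at once.
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        return {'flag': 'DOWNLOAD ERROR', 'link': url}
    return classify_file(resp)  # classify_file as defined in the question

if __name__ == '__main__':
    urls = []  # built exactly as in the question
    with Pool(maxtasksperchild=5) as p:
        prospectus = p.map(download_and_classify, urls, chunksize=20)
    with open('prospectus_data.json', 'w') as f:
        json.dump(prospectus, f)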

How come Python 3 deserializes json files so fast?

Recently, I had to json.load() a file: a 13,045 KB file from a network location, over a slow corporate VPN. I did it in an interactive shell and decided to go grab a coffee in the meantime, because this would surely take ages to load. I didn't even manage to stand up and my code was done reading the JSON and translating its half million lines into a beautiful dictionary.
How does it happen that Python is able to do this so efficiently? What optimizations are used? And why does it take Sublime a few minutes to show this JSON as text?
>>> f_path = r"Z:\some\network\location\data.json"
>>> import os
>>> os.stat(f_path).st_size
13357297
>>> with measurement_tools.SimpleBenchmark():
...     with open(f_path) as f:
...         t_dict = json.load(f)
Took 0.203125 seconds.
Edit: SimpleBenchmark() is just my own context manager measuring tool doing basically t1-t0.
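For reference, a minimal sketch of what such a timing context manager might look like (the real SimpleBenchmark isn't shown in the question):

import time

class SimpleBenchmark:
    """Tiny context manager that prints the elapsed wall-clock time."""

    def __enter__(self):
        self.t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print('Took %s seconds.' % (time.perf_counter() - self.t0))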
Edit 2: since there were questions about why I consider it fast, I compared it with serializing the same dictionary to a JSON file in the same location, which took roughly 10 times as long (well over 2 seconds).

Python MD5 Hash Faster Calculation

I will try my best to explain my problem and my line of thought on how I think I can solve it.
I use this code
import os
import hashlib

for root, dirs, files in os.walk(downloaddir):
    for infile in files:
        f = open(os.path.join(root, infile), 'rb')
        filehash = hashlib.md5()
        while True:
            data = f.read(10240)
            if len(data) == 0:
                break
            filehash.update(data)
        f.close()
        print "FILENAME: ", infile
        print "FILE HASH: ", filehash.hexdigest()
and using start = time.time() and elapsed = time.time() - start I measure how long it takes to calculate a hash. Pointing my code at a 653 MB file, this is the result:
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.624
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.373
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.540
OK, so about 12 seconds on a 653 MB file. My problem is that I intend to use this code in a program that will run through multiple files, some of which might be 4/5/6 GB, and it will take way longer to calculate. What I am wondering is whether there is a faster way for me to calculate the hash of a file, maybe by doing some multithreading? I used another script to check the CPU usage second by second and I see that my code is only using 1 out of my 2 CPUs, and only at 25% max. Is there any way I can change this?
Thank you all in advance for the given help.
Hash calculation in your case will almost certainly be I/O bound (unless you run it on a machine with a really slow processor), so multithreading or processing multiple files at once probably won't yield the expected results.
Arranging the files over multiple drives or on a faster (SSD) drive would probably help, even though that is probably not the solution you are looking for.
Aren't disk operations a bottleneck here?
Assuming 80MB/sec read speed (this is how my hard disk performs), it takes about 8 seconds to read the file.
For what it's worth, doing this:
c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso
takes 9.671 seconds on my laptop (2GHz core2 duo with an 80 GB SATA laptop hard drive).
As others have mentioned, MD5s are disk-bound, but your 12 second benchmark is probably pretty close to the fastest you could get.
Also, python's md5sum.py uses 8096 for the buffer size (even though I'm sure they meant either 4096 or 8192).
It helped me to increase my buffer size, up to a point. I started with 1024 and multiplied it by 2^N, increasing N each time starting from 1. With this method, I found that on my system a buffer size of 65536 seemed to be about as good as it would get. However, it only gave me about a 7% improvement in running time.
Profiling indicated that about 80% of the time is spent in the MD5 update method and the other 20% is reading in the file. Since MD5 is a serial algorithm and the Python algorithm is already implemented in C, I don't think that there is much that you can do to speed up the MD5 part. You can try calculating the MD5s of two different files in parallel, but as everyone has said, you're ultimately going to be limited by the disk access speed.
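If you want to try the parallel route anyway, here is a minimal sketch of hashing several files at once with a process pool (worth it mainly when the files sit on different disks, since a single disk remains the bottleneck; the file names are placeholders):

import hashlib
from multiprocessing import Pool

def md5_of_file(path, blocksize=65536):
    # Hash one file in fixed-size chunks to keep memory use flat.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return path, h.hexdigest()

if __name__ == '__main__':
    paths = ['file1.iso', 'file2.iso']  # placeholder file names
    pool = Pool(processes=2)
    for path, digest in pool.map(md5_of_file, paths):
        print('%s  %s' % (digest, path))
    pool.close()
    pool.join()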
