Throttling Popen() calls - python

How much danger is there from starting too many processes with Popen() before the initial Popens have resolved?
I am doing some processing on a directory filled with PDFs. I iterate over each file and do two things using external calls.
First, I get an HTML representation from the Xpdf-based pdftohtml tool (pdfminer is too slow). This outputs only the first page:
html = check_output(['pdftohtml.exe','-f','1','-l','1','-stdout','-noframes',pdf])
Then, if my conditions are met (I identify that it is the right document), I call tabula-extractor on it to extract a table. This is a slow, long-running process compared to checking the document, and it only happens on maybe 1 in 20 files.
If I just do call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', .....]), I will spend a long time waiting for the extraction to complete when I could be checking more files (I've got 4 cores and 16 GB of RAM, and Tabula doesn't seem to multithread).
So instead, I am using Popen() to avoid blocking.
Popen(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', '-o', csv, '-f', 'CSV', '-a', "'",topBorder, ',', leftBorder, ',', bottomBorder, ',', rightBorder, "'", '-p', '1', pdf])
# where csv is the name of the output file and pdf is the name of the input
I don't care about the return value (tabula is creating a csv file, so I can always see after the fact whether it was created successfully). Doing it this way means that I can keep checking files in the background and starting more tabula processes as needed (again, only about 1 in 20).
This works, but it gets backlogged and ends up running a ton of tabula processes at once. So my questions are:
Is this bad? It makes the computer slow for anything else, but as long as it doesn't crash and is working as fast as it can, I don't really mind (all 4 cores sit at 100% the whole time, but memory usage doesn't go above 5.5GB, so it appears CPU-bound).
If it is bad, what is the right way to improve it? Is there a convenient way to say, queue up tabula processes so there are always 1-2 running per core, but I am not trying to process 30 files at once?

Is there a convenient way to say, queue up tabula processes so there are always 1-2 running per core, but I am not trying to process 30 files at once?
Yes, the multiprocessing module does just that.
import multiprocessing
import subprocess

def process_pdf(path):
    # blocks until this tabula run finishes; the pool caps how many run at once
    subprocess.call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', path, ...])

pool = multiprocessing.Pool(3)  # 3 processes
results = []
for path in search_for_files():
    results.append(pool.apply_async(process_pdf, [path]))
for result in results:
    result.wait()
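Since process_pdf spends nearly all of its time blocked on an external process rather than running Python bytecode, a thread-backed pool would work just as well and skips pickling the arguments; multiprocessing.dummy exposes the same Pool interface built on threads:

from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads
pool = Pool(3)  # at most 3 concurrent tabula runs, as before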

Related

Multiprocessing of CSV chunks in Python using Pool

I'm trying to process a very large CSV file (.gz files of around 6 GB) on a VM, and to speed things up I'm looking into various multiprocessing tools. I'm pretty new to this, so I'm learning as I go, but so far what I've gathered from a day of research is that Pool works great for CPU-bound tasks.
I'm processing a very large CSV by dividing it into chunks of a set size and processing those chunks individually. The goal is to process these chunks in parallel, but without needing to first create a list of dataframes holding all the chunks, as that would take a really long time in itself. Chunk processing is almost entirely pandas-based (not sure if that's relevant), so I can't use dask. One of the processing functions then writes my results to an outfile. Ideally I would like to preserve the order of the results, but if I can't do that I can try to work around it later. Here's what I've got so far:
if __name__ == "__main__":
parser = parse()
args = parser.parse_args()
a = Analysis( vars( args ) )
attributes = vars( a )
count = 0
pool = mp.Pool( processes = mp.cpu_count() )
for achunk in pd.read_csv( a.File,
compression = 'gzip',
names = inputHeader,
chunksize = simsize,
skipinitialspace = True,
header = None
):
pool.apply_async( a.beginProcessChunk( achunk,
start_time,
count
)
)
count += 1
This ultimately takes the same amount of time as running it serially (tested on a small file), and it actually takes a tiny bit longer. I'm not sure exactly what I'm doing wrong, but I'm assuming that putting the pool function inside a loop won't make the loop process in parallel. I'm really new to this, so maybe I'm just missing something trivial; I'm sorry in advance for that. Could anyone give me some advice on this and/or tell me how exactly I can make this work?
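For what it's worth, pool.apply_async(a.beginProcessChunk(achunk, ...)) evaluates a.beginProcessChunk(...) immediately in the parent process and only submits its return value to the pool, so the work still happens serially. A minimal sketch of the corrected submission, reusing the names from the snippet above and assuming a and the chunks are picklable:

# pass the callable and its arguments separately; the pool then runs it in a worker
results = []
for achunk in pd.read_csv( a.File, compression = 'gzip', names = inputHeader,
                           chunksize = simsize, skipinitialspace = True, header = None ):
    results.append( pool.apply_async( a.beginProcessChunk,
                                      ( achunk, start_time, count ) ) )
    count += 1
pool.close()
pool.join()  # wait for all submitted chunks to finish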

Python asynchronous file download + parsing + outputting to JSON

To briefly explain the context, I am downloading SEC prospectus data, for example. After downloading, I want to parse the file to extract certain data, then output the parsed dictionary to a JSON file consisting of a list of dictionaries. I would use a SQL database for output, but the research cluster admins at my university are being slow getting me access. If anyone has any suggestions for how to store the data for easy reading/writing later, I would appreciate it; I was thinking about HDF5 as a possible alternative.
A minimal example of what I am doing, with the spots that I think need improvement labeled:
def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions

if __name__=="__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths=(12,62,12,12,44)
        items += [[item[sum(widths[:j]):sum(widths[:j+1])].strip() for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]
    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls)/filelimit)):
        print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ',end='\r',flush=True)
        resp += [r for r in grequests.map((grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit]))]
    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file,resp,chunksize=20)
        rs.wait()
        prospectus = rs.get()
    with open('prospectus_data.json','w') as f:
        json.dump(prospectus,f)
The getformurls.sh referenced is a bash script I wrote that was faster than doing it in Python since I could use grep. The code for that is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[A|B]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open, so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of urls (~2k) on my laptop (uses 100% of my CPU and ~20 GB of RAM: ~10 GB for the file downloads and another ~10 GB when the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster, it spins up to ~100 GB RAM and ~3 TB of swap usage and then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally when you have runaway memory usage with a Pool, it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you...
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often, but not so low that it slows things down. (Maybe a minute's worth of processing... just as a guess.)
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file,resp,chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside the worker process you already have. Pulling 18K URLs and caching all that data up front is going to take time and memory. If it's all encapsulated in the worker, you'll minimize data usage and you won't need to spin up so many open file handles.
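A minimal sketch of that suggestion, assuming a hypothetical classify_from_url wrapper that downloads with requests and reuses classify_file from the question (a requests Response has a .url attribute, so the parsing code keeps working):

import requests
from multiprocessing import Pool

def classify_from_url(url):
    # download inside the worker, so only a handful of responses exist at once
    return classify_file(requests.get(url))

if __name__ == "__main__":
    with Pool(maxtasksperchild=5) as p:
        prospectus = p.map(classify_from_url, urls, chunksize=20)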

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into, e.g., a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list:
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln%size].sanitize(line)
    return data
Note:
source is the file name
size is the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I've been searching for days but couldn't find anything that matches my purpose; if such a resource exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures; it is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. This would normally slow down reading, but modern HDDs implement a technique named NCQ (Native Command Queueing): it works by giving high priority to read operations on sectors with addresses near the current position of the HDD head, improving the overall speed of reads issued from multiple threads.
To recommend an efficient data structure in Python for your program, you need to mention what operations you will perform on the data (delete, add, insert, search, append, and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try reducing the need for this amount of RAM by loading only the data necessary for the computation, then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
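A minimal sketch of that overlap, using a reader thread that feeds a bounded queue while the main thread computes (process() is a hypothetical compute step, and chunk_lines is an arbitrary batch size):

import threading
from queue import Queue

def read_chunks(path, q, chunk_lines=100000):
    # producer: reads the next chunk while the consumer is still computing
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                q.put(chunk)
                chunk = []
    if chunk:
        q.put(chunk)
    q.put(None)  # sentinel: no more data

def run(path):
    q = Queue(maxsize=2)  # small buffer keeps memory bounded
    reader = threading.Thread(target=read_chunks, args=(path, q))
    reader.start()
    while True:
        chunk = q.get()
        if chunk is None:
            break
        process(chunk)  # hypothetical: do the computation on this chunk
    reader.join()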
(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which can convert a smaller file into the resulting list of records and save it as a pickled file
run that Python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result.
To gain anything, you have to be careful, as overhead can overcome all possible gains from parallel runs:
as Python uses a Global Interpreter Lock (GIL), do not use threads for parallel processing; use processes. As processes cannot simply pass data around, you have to pickle the data and let the other (final integrating) part read the result from the file.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts. To use the power of your cores, best fit the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time on switching between processes).
create a list of items (records) and pickle the list as one item, rather than pickling particular items. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have this option, use them.
In my small test, I created a file with 100 thousand lines with content "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB", etc., and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also take the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use the script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You may do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines to the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using a process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly (the multiprocessing library pickles arguments and return values automatically when passing them between processes).
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes the large file is available in the form of four smaller files.
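If the split utility is not available, the splitting step can also be sketched in pure Python, streaming so the large file never has to fit in memory (split_on_lines and the chunk naming are made up for illustration):

import os

def split_on_lines(source, parts, out_prefix="chunk"):
    # cut `source` into `parts` files, breaking only at line boundaries
    chunk_size = os.path.getsize(source) // parts + 1
    names = []
    with open(source) as f:
        for i in range(parts):
            name = "%s%02d" % (out_prefix, i)
            names.append(name)
            written = 0
            with open(name, "w") as out:
                for line in f:
                    out.write(line)
                    written += len(line)
                    # roll over to the next chunk file, except for the last one,
                    # which drains the remainder of the input
                    if written >= chunk_size and i < parts - 1:
                        break
    return names

print(split_on_lines("long.txt", 4))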

Python append to tarfile in parallel

import tarfile
from cStringIO import StringIO
from io import BytesIO

unique_keys = ['1:bigstringhere...:5'] * 5000
file_out = BytesIO()
tar = tarfile.open(mode='w:bz2', fileobj=file_out)
for k in unique_keys:
    id, mydata, s_index = k.split(':')
    inner_fname = '%s_%s.data' % (id, s_index)
    info = tarfile.TarInfo(inner_fname)
    info.size = len(mydata)
    tar.addfile(info, StringIO(mydata))
tar.close()
I would like the above loop to add to the tarfile (tar) in parallel for faster execution.
Any ideas?
You cannot write multiple files to the same tarfile, at the same time. If you try to do so, the blocks will get intermingled, and it will be impossible to extract them.
You could do it by starting multiple threads, then each thread can open a tarfile, write to it, and close it.
I believe you can probably join tarfiles end-to-end. Normally, this would involve reading the tarfiles back in at the end, but since this is all in memory (and presumably the size is small enough to allow that), this won't be so much of an issue.
If you take this approach, you don't want 5000 individual threads - 5000 threads will make the box stop responding (at least for a while), and the compression will be awful. Limit yourself to 1 thread per processor, and divide the work by the threads.
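A minimal sketch of that division of work, where each worker process builds its own in-memory .tar.bz2 from a slice of the keys (build_tar is a made-up name, and joining the resulting archives back into one file is left out):

import tarfile
from io import BytesIO
from multiprocessing import Pool, cpu_count

def build_tar(keys):
    # each worker compresses its own slice into a separate in-memory archive
    buf = BytesIO()
    with tarfile.open(mode='w:bz2', fileobj=buf) as tar:
        for k in keys:
            id, mydata, s_index = k.split(':')
            info = tarfile.TarInfo('%s_%s.data' % (id, s_index))
            data = mydata.encode()
            info.size = len(data)
            tar.addfile(info, BytesIO(data))
    return buf.getvalue()

if __name__ == '__main__':
    unique_keys = ['1:bigstringhere...:5'] * 5000
    n = cpu_count()
    slices = [unique_keys[i::n] for i in range(n)]  # one slice per core
    with Pool(n) as pool:
        archives = pool.map(build_tar, slices)  # list of .tar.bz2 byte strings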
Also, your code, as written, will create a tar with 5000 files, all called 1_5.data, and with the contents "bigstringhere...". I'm assuming this is just an example. If not, create a tarfile with a single file, close it (to flush it), then duplicate the result 5000 times (e.g. if you then want to write it to disk, just write the entire BytesIO 5000 times).
I believe the most expensive part of this is the compression - you could use the external program 'pigz', which does gzip compression in parallel.
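For the pigz route, a sketch that streams an uncompressed tar into a pigz subprocess, which then compresses on all cores (this produces a .tar.gz rather than a .tar.bz2, and assumes pigz is installed):

import subprocess
import tarfile
from io import BytesIO

unique_keys = ['1:bigstringhere...:5'] * 5000
with open('out.tar.gz', 'wb') as f:
    pigz = subprocess.Popen(['pigz', '-c'], stdin=subprocess.PIPE, stdout=f)
    # 'w|' writes an uncompressed tar stream straight into pigz's stdin
    with tarfile.open(mode='w|', fileobj=pigz.stdin) as tar:
        for k in unique_keys:
            id, mydata, s_index = k.split(':')
            info = tarfile.TarInfo('%s_%s.data' % (id, s_index))
            data = mydata.encode()
            info.size = len(data)
            tar.addfile(info, BytesIO(data))
    pigz.stdin.close()  # signal end of input so pigz can finish
    pigz.wait()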

Failing to continually save and update a .CSV file after a period of time

I have written a small program that reads values from two pieces of equipment every minute and then saves them to a .csv file. I wanted the file to be updated and saved after every collection of every point, so that if the PC crashes or another problem occurs, no data loss occurs. To do that, I open the file (ab mode), use writerow, and then close the file, in a loop. The time between collections is about 1 minute. This works quite well, but the problem is that after 5-6 hours of data collection it stops saving to the .csv file and does not raise any errors; the code continues to run, with the graph being updated like nothing happened, but opening the .csv file reveals that data is lost. I would like to know if there is something wrong with the code I am using. I should also note that I am running a subprocess from this that does live plotting, but I do not think it would cause an issue... I added those code lines as well.
##Initial file declaration and header
with open(filename,'wb') as wdata:
    savefile=csv.writer(wdata,dialect='excel')
    savefile.writerow(['System time','Time from Start(s)','Weight(g)','uS/cm','uS','Measured degC','%/C','Ideal degC','/cm'])

##Open Plotting Subprocess
draw=subprocess.Popen('TriPlot.py',shell=True,stdin=subprocess.PIPE,stdout=subprocess.PIPE)

##data collection loop
while True:
    # Collect data x and y; waits for data for about 60 seconds
    # (no sleep or pause command used; the pyserial interface is used)

    ## Send Data to subprocess
    draw.stdin.write('%d\n' % tnow)
    draw.stdin.write('%d\n' % s_w)
    draw.stdin.write('%d\n' % tnow)
    draw.stdin.write('%d\n' % float(s_c[5]))

    ##Saving data Section
    wdata=open(filename,'ab')
    savefile=csv.writer(wdata,dialect='excel')
    savefile.writerow([tcurrent,tnow,s_w,s_c[5],s_c[7],s_c[9],s_c[11],s_c[13],s_c[15]])
    wdata.close()
P.S. This code uses the following packages for code not shown: pyserial, csv, os, subprocess, Tkinter, string, numpy, time and wx.
If draw.stdin.write() blocks, it probably means that you are not consuming draw.stdout in a timely manner. The docs warn about the deadlock due to a full OS pipe buffer.
If you don't need the output, you could set stdout=devnull where devnull = open(os.devnull, 'wb'); otherwise there are several approaches to read the output without blocking your code: threads, select, tempfile.TemporaryFile.
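A minimal sketch of the devnull variant, keeping the Popen call from the question:

import os
import subprocess

# discard the child's stdout so its pipe buffer can never fill up and block
devnull = open(os.devnull, 'wb')
draw = subprocess.Popen('TriPlot.py', shell=True,
                        stdin=subprocess.PIPE, stdout=devnull)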
