Multiprocessing of CSV chunks in Python using Pool - python

I'm trying to process a very large CSV file (.gz files of around 6 GB) on a VM, and to speed things up I'm looking into various multiprocessing tools. I'm pretty new to this so I'm learning as I go but so far what I got from a day of research is that pool works great for CPU reliant tasks.
I'm processing a very large CSV by dividing it into chunks of set size, and processing those chunks individually. The goal is to be able to process these chunks in parallel, but without needing to first create a list of dataframes with all the chunks in it, as that would take a really long time in itself. Chunk processing is almost entirely pandas based (not sure if that's relevant) so I can't use dask. One of the processing functions then writes my results to an outfile. Ideally I would like to preserve the order of the results, but if I can't do that I can try to work around it later. Here's what I got so far:
if __name__ == "__main__":
parser = parse()
args = parser.parse_args()
a = Analysis( vars( args ) )
attributes = vars( a )
count = 0
pool = mp.Pool( processes = mp.cpu_count() )
for achunk in pd.read_csv( a.File,
compression = 'gzip',
names = inputHeader,
chunksize = simsize,
skipinitialspace = True,
header = None
):
pool.apply_async( a.beginProcessChunk( achunk,
start_time,
count
)
)
count += 1
This ultimately takes the same amount of time as running it serially (tested on a small file), and it actually takes a tiny bit longer. I'm not sure exactly what I'm doing wrong but I'm assuming that putting the pool function inside a loop won't make the loop process in parallel. I'm really new to this so maybe I'm just missing something trivial, so I'm sorry in advance for that. Could anyone give me some advice on this and/or tell me how exactly I can make this work?

Related

Python asynchronous file download + parsing + outputting to JSON

To briefly explain context, I am downloading SEC prospectus data for example. After downloading I want to parse the file to extract certain data, then output the parsed dictionary to a JSON file which consists of a list of dictionaries. I would use a SQL database for output, but the research cluster admins at my university are being slow getting me access. If anyone has any suggestions for how to store the data for easy reading/writing later I would appreciate it, I was thinking about HDF5 as a possible alternative.
A minimal example of what I am doing with the spots that I think I need to improved labeled.
def classify_file(doc):
try:
data = {
'link': doc.url
}
except AttributeError:
return {'flag': 'ATTRIBUTE ERROR'}
# Do a bunch of parsing using regular expressions
if __name__=="__main__":
items = list()
for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
stream = os.popen('bash ./getformurls.sh ' + d)
stacked = stream.read().strip().split('\n')
# split each line into the fixed-width fields
widths=(12,62,12,12,44)
items += [[item[sum(widths[:j]):sum(widths[:j+1])].strip() for j in range(len(widths))] for item in stacked]
urls = [BASE_URL + item[4] for item in items]
resp = list()
# PROBLEM 1
filelimit = 100
for i in range(ceil(len(urls)/filelimit)):
print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ',end='\r',flush=True)
resp += [r for r in grequests.map((grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit]))]
# PROBLEM 2
with Pool() as p:
rs = p.map_async(classify_file,resp,chunksize=20)
rs.wait()
prospectus = rs.get()
with open('prospectus_data.json') as f:
json.dump(prospectus,f)
The getfileurls.sh referenced is a bash script I wrote that was faster than doing it in python since I could use grep, the code for that is
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[A|B]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of urls (~2k) on my laptop (uses 100% of my cpu and ~20GB of RAM ~10GB for the file downloads and another ~10GB when the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster it spins up to ~100GB RAM and ~3TB swap usage then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create an generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally when you have runaway memory usage with a Pool it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this but it's so common of an issue that Python now has a built-in parameter to do it for you...
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is but you generally want to set it low enough that resources can be freed fairly often but not so low that it slows things down. (Maybe a minutes worth of processing... just as a guess)
with Pool(maxtasksperchild=5) as p:
rs = p.map_async(classify_file,resp,chunksize=20)
rs.wait()
prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside of the worker process you already have. Pulling 18K worth of URLs and caching all that data initially is going to take time and memory. If it's all encapsulated in the worker, you'll minimize data usage and you wont need to spin up so many open file handles.

Python Multiprocessing to speed up I/O and Groupby/Sum

I have a dataset with ~200 million rows, ~10 grouping variables, and ~20 variables to sum, and is a ~50GB csv. The first thing I did was see what the runtime was run sequentially but in chunks. It's a bit more complicated because some of the groupbys are actually in another dataset at a different aggregation level so it's only ~200mb. So the relevant code right now looks like this:
group_cols = ['cols','to','group','by']
cols_to_summarize = ['cols','to','summarize']
groupbys = []
df = pd.read_csv("file/path/df.csv",chunksize=1000000)
for chunk in df:
chunk = chunk.merge(other_df,left_on="id",right_index=True,how="inner")
groupbys.append(chunk.groupby(group_cols)[cols_to_summarize].sum())
finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Each chunk takes roughly 5 seconds to process so the 200 chunks takes about 15-20 minutes. The server I'm working on has 16 cores, so I'm hoping to get a bit of speedup here, if I could get it to 2-3 minutes that would be amazing.
But when I try to use multiprocess I'm struggling to get much speedup at all. Based on my googling I thought that would help with reading in CSVs but I'm wondering if multiple processes can't read the same CSV and maybe I should split it up first? Here's what I tried and it took longer than the sequential run:
def agg_chunk(start):
[pull in small dataset]
chunk = pd.read_csv("file/path/df.csv",skiprows=range(1,start+1),nrows=1000000)
chunk = chunk.merge(other_df,left_on="id",right_index=True,how="inner")
return chunk.groupby(group_cols)[cols_to_summarize].sum()
if __name__ == "__main__":
pool = mp.Pool(16)
r = list(np.array(range(200))*1000000)
groupbys = pool.map(agg_chunk,r)
finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Is there a better way to do this? The extra [pull in small dataset] piece takes about 5 seconds, but doubling the time per process and then dividing by 16 should still be a pretty good speedup right? Instead the parallel version has been running for half an hour and is still not complete. Also is there some way to pass the dataset to each process instead of making each re-create it again?

Optimizing processing of a large excel file

I am working on a large dataset where I need to read excel files and then find valid numbers but the task takes enormous time for only 500k data. For valid numbers, I am using google phonelib. processing can be done in an async way as they are independent.
parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)),meta=('Valid','object'))
for background
phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))
gives output as True
I expect the output to be less than 10sec but it takes around 40s
just been playing with this and you might just need to repartition your dataframe to allow the computation to be run in parallel
I start by generating some data:
import csv
import random
with open('tmp.csv', 'w') as fd:
out = csv.writer(fd)
out.writerow(['id', 'number'])
for i in range(500_000):
a = random.randrange(1000, 2999)
b = random.randrange(100_000, 899_999)
out.writerow([i+1, f'+44 {a} {b}'])
note that these are mostly valid UK numbers.
I then run something similar to your code:
from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers
def fn(num):
return phonenumbers.is_valid_number(phonenumbers.parse(num))
with Client(processes=True):
df = dd.read_csv('tmp.csv')
# repartition to increase parallelism
df = df.repartition(npartitions=8)
df['valid'] = df.number.apply(fn, meta=('valid', 'object'))
out = df.compute()
this takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. which indicates dask has quite a bit of runtime overhead as I'd expect it to be much faster than that. if I remove the call to repartition it takes a longer than I'm willing to wait and top only shows a single process running
note that if I rewrite it to do the naive thing in multiprocessing I get much better results:
from multiprocessing import Pool
import pandas as pd
df = pd.read_csv('tmp.csv')
with Pool(4) as pool:
df['valid'] = pool.map(fn, df['number'])
which reduces runtime to ~11 seconds and is even less code here as a bonus

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file, 10Gb, and I want to load it into, i.e., a list or whatever more efficient, please suggest.
Here is the way I load the file content into a list
def load(source, size):
data = [[] for _ in range(size)]
ln = 0
with open(source, 'r') as input:
for line in input:
ln += 1
data[ln%size].sanitize(line)
return data
Note:
source: is file name
size: is the number of concurrent process, I divide data into [size] of sublist.
for parallel computing using MPI in python.
Please advise how to load data more efficient and faster. I'm searching for days but I couldn't get any results matches my purpose and if there exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in continues blocks in the H.D.D then I don't know a way to read it faster than reading the file starting form the first bytes to the end.
But if the file is fragmented, create multiple threads each reading a part of the file. The must slow down the process of reading but modern HDDs implement a technique named NCQ (Native Command Queueing). It works by giving high priority to the read operation on sectors with addresses near the current position of the HDD head. Hence improving the overall speed of read operation using multiple threads.
To mention an efficient data structure in Python for your program, you need to mention what operations will you perform to the data? (delete, add, insert, search, append and so on) and how often?
By the way, if you use commodity hardware, 10GBs of RAM is expensive. Try reducing the need for this amount of RAM by loading the necessary data for computation then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
(original) Solution using pickling
The strategy for your task can go this way:
split the large file to smaller ones, make sure they are divided on line boundaries
have Python code, which can convert smaller files into resulting list of records and save them as
pickled file
run the python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking pickled files one by one, loading the list from it and appending it
to final result.
To gain anything, you have to be careful as overhead can overcome all possible gains from parallel
runs:
as Python uses Global Interpreter Lock (GIL), do not use threads for parallel processing, use
processes. As processes cannot simply pass data around, you have to pickle them and let the other
(final integrating) part to read the result from it.
try to minimize number of loops. For this reason it is better to:
do not split the large file to too many smaller parts. To use power of your cores, best fit
the number of parts to number of cores (or possibly twice as much, but getting higher will
spend too much time on swithing between processes).
pickling allows saving particular items, but better create list of items (records) and pickle
the list as one item. Pickling one list of 1000 items will be faster than 1000 times pickling
small items one by one.
some tasks (spliting the file, calling the conversion task in parallel) can be often done faster
by existing tools in the system. If you have this option, use that.
In my small test, I have created a file with 100 thousands lines with content "98-BBBBBBBBBBBBBB",
"99-BBBBBBBBBBB" etc. and tested converting it to list of numbers [...., 98, 99, ...].
For spliting I used Linux command split, asking to create 4 parts preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used following script, converting the content into file with
extension .pickle and containing pickled list.
# chunk2pickle.py
import pickle
import sys
def process_line(line):
return int(line.split("-", 1)[0])
def main(fname, pick_fname):
with open(pick_fname, "wb") as fo:
with open(fname) as f:
pickle.dump([process_line(line) for line in f], fo)
if __name__ == "__main__":
fname = sys.argv[1]
pick_fname = fname + ".pickled"
main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used parallel tool (which has to be installed into
system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel, adjust it to your system or leave it out and it will
default to number of cores you have.
parallel can also get list of parameters (input file names in our case) by other means like ls
command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle
def main(file_names):
res = []
for fname in file_names:
with open(fname, "rb") as f:
res.extend(pickle.load(f))
return res
if __name__ == "__main__":
file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
# here you have the list of records you asked for
records = main(file_names)
print records
In my answer I have used couple of external tools (split and parallel). You may do similar task
with Python too. My answer is focusing only on providing you an option to keep Python code for
converting lines to required data structures. Complete pure Python answer is not covered here (it
would get much longer and probably slower.
Solution using process Pool (no explicit pickling needed)
Following solution uses multiprocessing from Python. In this case there is no need to pickle results
explicitly (I am not sure, if it is done by the library automatically, or it is not necessary and
data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool
def process_line(line):
return int(line.split("-", 1)[0])
def process_chunkfile(fname):
with open(fname) as f:
return [process_line(line) for line in f]
def main(file_names, cores=4):
p = Pool(cores)
return p.map(process_chunkfile, file_names)
if __name__ == "__main__":
file_names = ["xaa", "xab", "xac", "xad"]
# here you have the list of records you asked for
# warning: records are in groups.
record_groups = main(file_names)
for rec_group in record_groups:
print(rec_group)
This updated solution still assumes, the large file is available in form of four smaller files.

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd
def read_file_to_dataframe(file_path):
df = pd.read_csv(file_path)
df['some_extra_column'] = 'some_extra_value'
return df
if __name__ == '__main__':
path = '/path/to/my/files'
delayed_collection = list()
for rootdir, subdirs, files in os.walk(path):
for filename in files:
if filename.endswith('.csv'):
file_path = os.path.join(rootdir, filename)
delayed_reader = delayed(read_file_to_dataframe)(file_path)
delayed_collection.append(delayed_reader)
df = dd.from_delayed(delayed_collection)
print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method as well as using the single-threaded scheduler as well as doing both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd
def create_df_from_bytesio(bytesio):
df = pd.read_csv(bytesio)
return df
def create_bytesio_from_bytes(block):
bytesio = BytesIO(block)
return bytesio
path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)
delayed_collection = list()
for datafile in blocks:
for block in datafile:
bytesio = delayed(create_bytesio_from_bytes)(block)
df = delayed(create_df_from_bytesio)(bytesio)
delayed_collection.append(df)
dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading in all of the data into a Pandas dataframe, which will eventually blow up memory. Instead it's better to operate on the Dask dataframe and only compute on small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.

Categories

Resources