Python Multiprocessing to speed up I/O and Groupby/Sum

I have a dataset with ~200 million rows, ~10 grouping variables, and ~20 variables to sum; it's a ~50 GB CSV. The first thing I did was check the runtime when run sequentially in chunks. It's a bit more complicated because some of the groupby columns actually come from another dataset at a different aggregation level, which is only ~200 MB. So the relevant code right now looks like this:
import pandas as pd

group_cols = ['cols','to','group','by']
cols_to_summarize = ['cols','to','summarize']

# other_df is the ~200 MB lookup table, loaded elsewhere
groupbys = []
df = pd.read_csv("file/path/df.csv", chunksize=1000000)
for chunk in df:
    chunk = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    groupbys.append(chunk.groupby(group_cols)[cols_to_summarize].sum())

finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Each chunk takes roughly 5 seconds to process, so the 200 chunks take about 15-20 minutes. The server I'm working on has 16 cores, so I'm hoping to get a decent speedup here; if I could get it to 2-3 minutes that would be amazing.
But when I try to use multiprocessing I'm struggling to get much speedup at all. Based on my googling I thought it would help with reading in CSVs, but I'm wondering whether multiple processes can't read the same CSV and whether I should split it up first. Here's what I tried, and it took longer than the sequential run:
import multiprocessing as mp
import numpy as np
import pandas as pd

def agg_chunk(start):
    # [pull in small dataset]
    chunk = pd.read_csv("file/path/df.csv", skiprows=range(1, start + 1), nrows=1000000)
    chunk = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    return chunk.groupby(group_cols)[cols_to_summarize].sum()

if __name__ == "__main__":
    pool = mp.Pool(16)
    r = list(np.array(range(200)) * 1000000)
    groupbys = pool.map(agg_chunk, r)
    finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Is there a better way to do this? The extra [pull in small dataset] piece takes about 5 seconds, but doubling the time per process and then dividing by 16 should still be a pretty good speedup, right? Instead the parallel version has been running for half an hour and still isn't done. Also, is there some way to pass the small dataset to each process instead of making each one re-create it?
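One likely culprit in the parallel version is skiprows: every worker has to scan the file from the top just to find its starting row, so the later chunks each cost far more than 5 seconds. A minimal sketch of an alternative, not from the thread: let the parent stream chunks with chunksize and hand them to workers via imap, and load the small lookup table once per worker with a Pool initializer (the names init_worker, other_df.csv, and index_col="id" are assumptions for illustration):

import multiprocessing as mp
import pandas as pd

group_cols = ['cols', 'to', 'group', 'by']
cols_to_summarize = ['cols', 'to', 'summarize']

other_df = None  # populated once per worker by the initializer

def init_worker(small_path):
    # load the ~200 MB lookup table once per worker, not once per chunk
    global other_df
    other_df = pd.read_csv(small_path, index_col="id")  # hypothetical path/column

def agg_chunk(chunk):
    merged = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    return merged.groupby(group_cols)[cols_to_summarize].sum()

if __name__ == "__main__":
    reader = pd.read_csv("file/path/df.csv", chunksize=1000000)
    with mp.Pool(16, initializer=init_worker,
                 initargs=("file/path/other_df.csv",)) as pool:
        groupbys = list(pool.imap(agg_chunk, reader))
    finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()

Note that the parent still parses the CSV serially and each chunk is pickled to a worker, so if parsing dominates the 5 seconds per chunk this won't approach a 16x speedup; in that case, splitting the file on disk and letting each worker read its own piece is probably the better route.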

Related

Multiprocessing of CSV chunks in Python using Pool

I'm trying to process a very large CSV file (.gz files of around 6 GB) on a VM, and to speed things up I'm looking into various multiprocessing tools. I'm pretty new to this so I'm learning as I go, but so far what I've gathered from a day of research is that Pool works great for CPU-bound tasks.
I'm processing a very large CSV by dividing it into chunks of a set size and processing those chunks individually. The goal is to process these chunks in parallel, but without first building a list of dataframes holding all the chunks, as that would take a really long time in itself. Chunk processing is almost entirely pandas-based (not sure if that's relevant), so I can't use dask. One of the processing functions writes my results to an outfile. Ideally I would like to preserve the order of the results, but if I can't do that I can try to work around it later. Here's what I've got so far:
if __name__ == "__main__":
    parser = parse()
    args = parser.parse_args()
    a = Analysis(vars(args))
    attributes = vars(a)
    count = 0
    pool = mp.Pool(processes=mp.cpu_count())
    for achunk in pd.read_csv(a.File,
                              compression='gzip',
                              names=inputHeader,
                              chunksize=simsize,
                              skipinitialspace=True,
                              header=None):
        pool.apply_async(a.beginProcessChunk(achunk,
                                             start_time,
                                             count))
        count += 1
This ultimately takes about the same amount of time as running it serially (tested on a small file), actually a tiny bit longer. I'm not sure exactly what I'm doing wrong, but I'm assuming that putting the pool call inside a loop won't make the loop run in parallel. I'm really new to this, so maybe I'm just missing something trivial; sorry in advance for that. Could anyone give me some advice on this and/or tell me how exactly I can make this work?
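The reason the loop above runs serially is that a.beginProcessChunk(achunk, ...) is called immediately in the parent process, so apply_async only ever receives its return value. A minimal sketch of the corrected submission, assuming beginProcessChunk returns its result and that the Analysis instance and the chunks pickle cleanly:

results = []
for achunk in pd.read_csv(a.File,
                          compression='gzip',
                          names=inputHeader,
                          chunksize=simsize,
                          skipinitialspace=True,
                          header=None):
    # pass the callable and its arguments; the pool calls it in a worker
    results.append(pool.apply_async(a.beginProcessChunk,
                                    args=(achunk, start_time, count)))
    count += 1

pool.close()
pool.join()
ordered = [r.get() for r in results]  # results come back in submission order

Collecting the AsyncResult objects in a list and calling .get() in order also preserves the original chunk order, which addresses the ordering concern above.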

Optimizing processing of a large excel file

I am working on a large dataset where I need to read Excel files and then find valid phone numbers, but the task takes an enormous amount of time for only 500k rows. For validating numbers I am using Google's phonenumbers library. The processing can be done asynchronously since the rows are independent.
parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)),meta=('Valid','object'))
For background,
phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))
returns True.
I expect this to take less than 10 seconds, but it takes around 40 seconds.
I've just been playing with this, and you might just need to repartition your dataframe to allow the computation to run in parallel.
I start by generating some data:
import csv
import random

with open('tmp.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerow(['id', 'number'])
    for i in range(500_000):
        a = random.randrange(1000, 2999)
        b = random.randrange(100_000, 899_999)
        out.writerow([i+1, f'+44 {a} {b}'])
note that these are mostly valid UK numbers.
I then run something similar to your code:
from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers

def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

with Client(processes=True):
    df = dd.read_csv('tmp.csv')
    # repartition to increase parallelism
    df = df.repartition(npartitions=8)
    df['valid'] = df.number.apply(fn, meta=('valid', 'object'))
    out = df.compute()
this takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. that indicates dask has quite a bit of runtime overhead, as I'd expect it to be much faster than that. if I remove the call to repartition it takes longer than I'm willing to wait, and top only shows a single process running
note that if I rewrite it to do the naive thing in multiprocessing I get much better results:
from multiprocessing import Pool
import pandas as pd

df = pd.read_csv('tmp.csv')
with Pool(4) as pool:
    df['valid'] = pool.map(fn, df['number'])
which reduces the runtime to ~11 seconds and, as a bonus, is even less code

h5py create_dataset loop slow

I'm trying to create an HDF5 file where each dataset is a 90x18 numpy array. I'm looking to create 2,704,332 total datasets in the file, with an approximate final size of 40 GB.
with h5py.File('allDaysData.h5', 'w') as hf:
    for x in list:
        start = datetime.datetime.now()
        hf.create_dataset(x, data=currentData)
        end = datetime.datetime.now()
        print(end - start)
When running this, the create_dataset call takes no longer than 0.0004 seconds in the beginning. Once the file hits around 6 GB it abruptly switches to taking 0.08 seconds per dataset.
Is there some sort of limit on datasets for hdf5 files?
There is a related answer.
In that answer you can see that the performance of create_dataset decreases as the number of iterations grows. Since h5py stores data in a special structure, I think it is because h5py needs more time to index the datasets.
There are two solutions. One is to use the keyword libver='latest'; it improves performance significantly, even though the generated file will be incompatible with older versions. The second is to aggregate your arrays into a few larger datasets, for example aggregating every 1024 arrays into one.
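A minimal sketch of both suggestions, assuming the 90x18 arrays are available in memory (dataset_names and all_arrays below are placeholders, not from the question):

import h5py
import numpy as np

# Option 1: newer file-format metadata; dataset creation stays fast, but very
# old HDF5 libraries may not read the file.
with h5py.File('allDaysData.h5', 'w', libver='latest') as hf:
    for name in dataset_names:                     # placeholder list of names
        hf.create_dataset(name, data=np.zeros((90, 18)))

# Option 2: pack every 1024 arrays into one dataset and index into it later.
batch = 1024
with h5py.File('aggregated.h5', 'w') as hf:
    for i in range(0, len(all_arrays), batch):     # all_arrays is a placeholder
        block = np.stack(all_arrays[i:i + batch])  # shape (<=1024, 90, 18)
        hf.create_dataset('block_%d' % (i // batch), data=block)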

How to read .xls in parallel using pandas?

I would like to read a large .xls file in parallel using pandas.
Currently I am using this:
import multiprocessing as mp
import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CHUNKSIZE = 100000  # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_excel(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes
    funclist = []
    for df in reader:
        # process each data frame
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
While this runs, I don't think it actually speeds up the process of reading the file. Is there a more efficient way of achieving this?
Just for your information: I'm reading a 13 MB, 29,000-line CSV in about 4 seconds (not using parallel processing).
Arch Linux, AMD Phenom II X2, Python 3.4, python-pandas 0.16.2.
How big is your file, and how long does it take to read it? That would help in understanding the problem better.
Is your Excel sheet very complex? Maybe read_excel has difficulty processing that complexity?
Suggestion: install gnumeric and use its helper tool ssconvert to translate the file to CSV, then switch your program to read_csv. Check the time used by ssconvert and the time taken by read_csv.
By the way, python-pandas saw major improvements going from version 0.13 to 0.16, so it's useful to check that you have a recent version.
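A rough sketch of that suggestion, assuming gnumeric's ssconvert is installed and on the PATH and that the workbook has a single sheet:

import subprocess
import pandas as pd

# one-time conversion; ssconvert infers the output format from the .csv extension
subprocess.run(["ssconvert", "LARGEFILE.xlsx", "LARGEFILE.csv"], check=True)

df = pd.read_csv("LARGEFILE.csv")  # read_csv is typically much faster than read_excel
print(len(df))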

Performance issue with loop on datasets with h5py

I want to apply a simple function to the datasets contained in an HDF5 file.
I am using code similar to this:
import h5py

data_sums = []
with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        # My goal is similar to the line above, but this line is enough
        # to replicate the problem:
        data[()]
It goes very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically.
If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect.
The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration.
Iterating over smaller chunks does not solve the issue.
I suppose I am filling some memory space and that the process slows down when it is full, but I do not understand why.
How can I circumvent this performance issue?
I'm running Python 2.6.5 on Ubuntu 10.04.
Edit:
The following code does not slow down if the second line of the loop is uncommented. It does slow down without it:
f = h5py.File(path to file, "r")
list_name = f["data"].keys()
f.close()

import numpy as np

for name in list_name:
    f = h5py.File(d.storage_path, "r")
    # name = list_name[0]  # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)
    data[:, 1].sum()
    print "."
    f.close()
Edit: I found out that accessing the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
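As an aside (not from the thread), that observation matches how HDF5 lays data out on disk for the default contiguous, row-major layout: a slice along the first dimension is one contiguous read, while a slice along a later dimension is strided across the whole dataset. A minimal illustration, with placeholder file and dataset names:

import h5py

with h5py.File("some_file.h5", "r") as f:   # placeholder file name
    dset = f["data"]["some_dataset"]        # placeholder dataset name
    row = dset[0, :]    # first-dimension access: one contiguous read
    col = dset[:, 1]    # higher-dimension access: strided over every row
    # one common workaround: read the dataset into memory once,
    # then slice the in-memory numpy array instead
    arr = dset[()]
    col_fast = arr[:, 1]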
platform?
On Windows 64-bit, Python 2.6.6, I have seen some weird issues when crossing a 2 GB barrier (I think) if you have allocated memory in small chunks.
you can see it with a script like this:
ix = []
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)
you can see that it will run pretty fast for a while, and then suddenly slow down.
but if you run it in larger blocks:
ix = []
for i in xrange(20000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000000)
it doesn't seem to have the problem (though it will run out of memory, depending on how much you have - 8GB here).
weirder yet, if you eat the memory using large blocks, and then clear the memory (ix=[] again, so back to almost no memory in use), and then re-run the small block test, it isn't slow anymore.
I think there was some dependence on the pyreadline version - 2.0-dev1 helped a lot with these sorts of issues, but I don't remember too much. When I tried it now, I don't really see this issue anymore - both slow down significantly around 4.8 GB, which with everything else I have running is about where it hits the limits of physical memory and starts swapping.
