Optimizing processing of a large Excel file - Python

I am working on a large dataset where I need to read Excel files and then flag valid phone numbers, but the task takes an enormous amount of time for only 500k rows. For validation I am using Google's phonenumbers library. The processing can be done asynchronously, as the rows are independent.
import dask
import dask.dataframe as dd
import pandas as pd
import phonenumbers

parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)), meta=('Valid', 'object'))
For background:
phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))
returns True.
I expect the run to take less than 10 seconds, but it takes around 40 seconds.

I've just been playing with this, and you might simply need to repartition your dataframe to allow the computation to run in parallel.
I start by generating some data:
import csv
import random

with open('tmp.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerow(['id', 'number'])
    for i in range(500_000):
        a = random.randrange(1000, 2999)
        b = random.randrange(100_000, 899_999)
        out.writerow([i+1, f'+44 {a} {b}'])
Note that these are mostly valid UK numbers.
I then run something similar to your code:
from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers

def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

with Client(processes=True):
    df = dd.read_csv('tmp.csv')
    # repartition to increase parallelism
    df = df.repartition(npartitions=8)
    df['valid'] = df.number.apply(fn, meta=('valid', 'object'))
    out = df.compute()
This takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. That suggests Dask carries quite a bit of runtime overhead, as I'd expect it to be much faster than that. If I remove the call to repartition, it takes longer than I'm willing to wait and top shows only a single process running.
Note that if I rewrite it to do the naive thing with multiprocessing, I get much better results:
from multiprocessing import Pool
import pandas as pd

df = pd.read_csv('tmp.csv')
with Pool(4) as pool:
    df['valid'] = pool.map(fn, df['number'])
This reduces the runtime to ~11 seconds and, as a bonus, is even less code.

Related

Parallelize 20K requests + filter & concat results into 1 dataframe

I need to make around 20K API calls. Each one returns a CSV file, I then have to perform some operations on that file, and finally I concatenate all the results into a single dataframe.
I've completed this task sequentially, but the issue is that each API call takes around 1 second, so the whole job takes around 6 hours. I would like to parallelize the task, as I can make up to 100 simultaneous API calls and up to 1000 calls per minute.
I've tried several things but I'm struggling... I have managed to parallelize the tasks and complete 200 API calls in about 8 seconds, but I can't concatenate all the results into a single dataframe... Would appreciate any help.
Thanks! :)
This is what I have:
from concurrent.futures import ThreadPoolExecutor
I have a few issues recreating the data:
start_time = time.time() # invalid start time
df = df[df.gap > 30] # returns an empty df since all values for gap are below 30
Suggestion:
You can set &fmt=json to have an easier time creating your df
df = pd.DataFrame(r.json())
If I change those things, your code works as expected.
Please provide a list of tickers and a reproducible example. (I have an API key)
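In the meantime, here is a minimal sketch of the thread-pool route you started on. It assumes a scrape(ticker) function, as in the code below, that returns a DataFrame on success and None on failure; the worker count is a placeholder.
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def scrape_all(tickers, max_workers=100):
    # run the API calls in parallel threads
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        frames = list(executor.map(scrape, tickers))
    # drop failed calls (None), then concatenate everything into one dataframe
    frames = [f for f in frames if f is not None]
    return pd.concat(frames, ignore_index=True)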
Alternatively, you could use dask.dataframe, which gives an API much like that of pandas and handles parallelization over multiple threads, processes, or physical servers:
import dask
import dask.distributed
import dask.dataframe
import pandas as pd
# start a dask client
client = dask.distributed.Client()
tickers = tickers[:200]
# map your job over the tickers
futures = client.map(scrape, tickers)
# wait for the jobs to finish, then filter out the Null return values
dask.distributed.wait(futures)
non_null_futures = [f for f in futures if f.type != type(None)]
# convert the futures to a single dask.dataframe
ddf = dask.dataframe.from_delayed(non_null_futures)
# if desired, you can now convert to pandas
df = ddf.compute()
# alternatively, you could write straight to disk, giving
# a partitioned binary file, e.g.
ddf.to_parquet("myfile.parquet")

Python Multiprocessing to speed up I/O and Groupby/Sum

I have a dataset with ~200 million rows, ~10 grouping variables, and ~20 variables to sum; it is a ~50 GB CSV. The first thing I did was check the runtime when run sequentially but in chunks. It's a bit more complicated because some of the grouping variables actually come from another dataset at a different aggregation level, which is only ~200 MB. The relevant code currently looks like this:
import pandas as pd

group_cols = ['cols', 'to', 'group', 'by']
cols_to_summarize = ['cols', 'to', 'summarize']
groupbys = []
df = pd.read_csv("file/path/df.csv", chunksize=1000000)
for chunk in df:
    chunk = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    groupbys.append(chunk.groupby(group_cols)[cols_to_summarize].sum())
finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Each chunk takes roughly 5 seconds to process, so the 200 chunks take about 15-20 minutes. The server I'm working on has 16 cores, so I'm hoping to get a decent speedup here; if I could get it down to 2-3 minutes that would be amazing.
But when I try to use multiprocessing I'm struggling to get much speedup at all. Based on my googling I thought it would help with reading in CSVs, but I'm wondering whether multiple processes can't read the same CSV and I should split it up first. Here's what I tried, and it took longer than the sequential run:
import multiprocessing as mp
import numpy as np
import pandas as pd

def agg_chunk(start):
    # [pull in small dataset]
    chunk = pd.read_csv("file/path/df.csv", skiprows=range(1, start+1), nrows=1000000)
    chunk = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    return chunk.groupby(group_cols)[cols_to_summarize].sum()

if __name__ == "__main__":
    pool = mp.Pool(16)
    r = list(np.array(range(200)) * 1000000)
    groupbys = pool.map(agg_chunk, r)
    finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Is there a better way to do this? The extra [pull in small dataset] piece takes about 5 seconds, but doubling the time per process and then dividing by 16 should still give a pretty good speedup, right? Instead, the parallel version has been running for half an hour and still isn't complete. Also, is there some way to pass the small dataset to each process instead of making each one re-create it?
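One pattern that addresses the last point is to load the small dataset once per worker process through the Pool's initializer, so every task in that worker reuses it. Below is a minimal sketch under that assumption: load_small_dataset() is a hypothetical stand-in for the "[pull in small dataset]" step, and group_cols, cols_to_summarize and the CSV path come from the code above. Note that skiprows still forces each task to scan the file from the top, so splitting the CSV (or letting dask.dataframe chunk it) may matter as much as the pool itself.
import multiprocessing as mp
import pandas as pd

other_df = None  # filled in once per worker by the initializer

def init_worker():
    # runs once in each worker process
    global other_df
    other_df = load_small_dataset()  # hypothetical helper for the ~200 MB dataset

def agg_chunk(start):
    chunk = pd.read_csv("file/path/df.csv", skiprows=range(1, start + 1), nrows=1000000)
    chunk = chunk.merge(other_df, left_on="id", right_index=True, how="inner")
    return chunk.groupby(group_cols)[cols_to_summarize].sum()

if __name__ == "__main__":
    with mp.Pool(16, initializer=init_worker) as pool:
        groupbys = pool.map(agg_chunk, [i * 1000000 for i in range(200)])
    finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()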

Load a single large file from client to dask workers

How do I make a single large file of 8 GB accessible to all the worker nodes in dask? I have tried pd.read_csv() with chunksize and client.scatter, but it is taking quite a long time. I am running it on macOS.
This is my code:
import time
import pandas as pd
import dask
import dask.distributed as distributed
import dask.dataframe as dd
import dask.delayed as delayed
from dask.distributed import Client, progress

client = Client('IP:PORT')  # scheduler address placeholder
print(client)
print(client.scheduler_info())

f = []
chunksize = 10 ** 6
for chunk in pd.read_csv('file.csv', chunksize=chunksize):
    f_in = client.scatter(chunk)
    f.append(f_in)
print("read")

ddf = dd.from_delayed(f)
ddf = ddf.groupby(['col1'])[['col2']].sum()
future = client.compute(ddf)
print(future)
progress(future)
result = client.gather(future)
print(result)
I'm stuck on this. Thanks in advance!
Dask will chunk the file itself, as long as it's a .csv file (not compressed); I'm not sure why you are trying to chunk it yourself. Just do:
import dask.dataframe as dd
df = dd.read_csv('data*.csv')
In your workflow, you are loading the CSV data locally, parsing it into dataframes, and then transmitting serialised versions of those dataframes to the workers, one at a time.
Some possible solutions:
copy the file to each worker (which is wasteful in space terms), or put it in some location they can all see, like a shared file-system or cloud storage
use client.upload_file, which wasn't really designed for a large payload, and would also replicate to every worker
use dask.bytes.read_bytes to read the blocks of data in series as before and persist those to the workers, so at least you incur no serialisation cost and the parsing effort is shared between the workers; a rough sketch of this follows below.
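A minimal sketch of that last option, assuming a connected Client as in your code and a placeholder block size; read_bytes splits the file on newlines into delayed byte blocks, which the workers then parse into dataframes:
import io
import pandas as pd
import dask.dataframe as dd
from dask import delayed
from dask.bytes import read_bytes

# split the file into ~64 MB blocks that each end cleanly on a newline
sample, blocks = read_bytes('file.csv', delimiter=b'\n', blocksize=64 * 2**20)

def to_df(block):
    # parse one block of raw bytes into a pandas dataframe on a worker
    # (header handling is elided here for brevity)
    return pd.read_csv(io.BytesIO(block), header=None)

parts = [delayed(to_df)(block) for block in blocks[0]]  # blocks holds one list per file
ddf = dd.from_delayed(parts)
ddf = client.persist(ddf)  # keep the parsed partitions on the workers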

Slow Performance with Python Dask bag?

I'm trying out some tests of dask.bag to prepare for a big text-processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing dask run about 5 to 6 times slower than a straight single-threaded text-processing function.
Can someone explain where I'll see the speed benefits of running dask over a large number of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance compared to straight single-threaded text processing?
Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
This data isn't exactly the same as the data I'm working against. In my other case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:
import dask.bag as db
from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))
Here were my results:
[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077
Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text function creates one task per filename. I suspect you're simply being swamped by overhead.
The solution here is probably to process several files in one task. The read_text function doesn't give you any options to do this, but you could switch to dask.delayed, which provides a bit more flexibility, and then convert to a dask.bag later if preferred.
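A minimal sketch of that idea, assuming the same Reuters glob as above and an arbitrary batch size; each delayed call reads a whole batch of files and returns one list of words, so the number of tasks drops from one per file to one per batch:
import glob
import dask
import dask.bag as db

files = sorted(glob.glob("./reuters/*.ascii"))
batch_size = 1000  # arbitrary; tune so each task does a meaningful amount of work

@dask.delayed
def read_batch(paths):
    # read a batch of files and return all of their words as a single list
    words = []
    for path in paths:
        with open(path, "r") as f:
            words.extend(f.read().split())
    return words

batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
b = db.from_delayed([read_batch(batch) for batch in batches])
print(b.frequencies().topk(5, lambda x: x[1]).compute())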

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd

def read_file_to_dataframe(file_path):
    df = pd.read_csv(file_path)
    df['some_extra_column'] = 'some_extra_value'
    return df

if __name__ == '__main__':
    path = '/path/to/my/files'
    delayed_collection = list()
    for rootdir, subdirs, files in os.walk(path):
        for filename in files:
            if filename.endswith('.csv'):
                file_path = os.path.join(rootdir, filename)
                delayed_reader = delayed(read_file_to_dataframe)(file_path)
                delayed_collection.append(delayed_reader)

    df = dd.from_delayed(delayed_collection)
    print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method as well as using the single-threaded scheduler as well as doing both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd

def create_df_from_bytesio(bytesio):
    df = pd.read_csv(bytesio)
    return df

def create_bytesio_from_bytes(block):
    bytesio = BytesIO(block)
    return bytesio

path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)

delayed_collection = list()
for datafile in blocks:
    for block in datafile:
        bytesio = delayed(create_bytesio_from_bytes)(block)
        df = delayed(create_df_from_bytesio)(bytesio)
        delayed_collection.append(df)

dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
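For reference, a one-line sketch reusing the dd alias and the glob path from your code:
ddf = dd.read_csv('/path/to/my/files/*.csv', blocksize=100 * 2**20)  # ~100 MB blocks per partition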
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
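On more recent Dask releases the same effect is usually expressed with the scheduler keyword instead; a sketch (the exact spelling depends on your Dask version, and dask_df is the dataframe from edit #2):
import dask
dask.config.set(scheduler='synchronous')  # use the single-threaded scheduler globally
# or per call:
result = dask_df.compute(scheduler='synchronous')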
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading all of the data into a Pandas dataframe, which will eventually blow up memory. Instead, it's better to operate on the Dask dataframe and compute only small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.
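For example, a one-off conversion to Parquet is a small sketch of that idea (it assumes a Parquet engine such as pyarrow or fastparquet is installed and reuses the glob path from the question); later runs can then skip CSV parsing entirely:
import dask.dataframe as dd

ddf = dd.read_csv('/path/to/my/files/*.csv')
ddf.to_parquet('/path/to/my/files_parquet')         # write a partitioned Parquet dataset once
ddf = dd.read_parquet('/path/to/my/files_parquet')  # subsequent reads are faster and typed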
