Load a single large file from client to dask workers - python

How do I make a single large file (8 GB) accessible to all the worker nodes in dask? I have tried pd.read_csv() with chunksize together with client.scatter, but it is taking quite a long time. I am running this on macOS.
This is my code:
import time
import pandas as pd
import dask
import dask.distributed as distributed
import dask.dataframe as dd
import dask.delayed as delayed
from dask.distributed import Client, progress

client = Client('IP:PORT')
print(client)
print(client.scheduler_info())

f = []
chunksize = 10 ** 6
for chunk in pd.read_csv('file.csv', chunksize=chunksize):
    f_in = client.scatter(chunk)
    f.append(f_in)

print("read")
ddf = dd.from_delayed(f)
ddf = ddf.groupby(['col1'])[['col2']].sum()
future = client.compute(ddf)
print(future)
progress(future)
result = client.gather(future)
print(result)
Stuck with it. Thanks in advance!

Dask will chunk the file for you as long as it's a .csv file (not compressed); I'm not sure why you are trying to chunk it yourself. Just do:
import dask.dataframe as dd
df = dd.read_csv('data*.csv')

In your workflow, you are loading the CSV data locally, parsing it into dataframes, and then transmitting serialised versions of those dataframes to the workers, one at a time.
Some possible solutions:
- copy the file to each worker (which is wasteful in space terms), or put it in some location they can all see, like a shared file-system or cloud storage;
- use client.upload_file, which wasn't really designed for a large payload and would also replicate the file to every worker;
- use dask.bytes.read_bytes to read the blocks of data in series as before, and persist those to the workers, so at least you suffer no serialisation cost and the parsing effort is shared between the workers (a sketch of this approach follows below).
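Below is a minimal sketch of the third option, reusing the placeholder scheduler address, file name, and column names from the question; the block size and the header handling are assumptions and may need adjusting for the real file:
from io import BytesIO

import dask
import dask.dataframe as dd
import pandas as pd
from dask.bytes import read_bytes
from dask.distributed import Client

client = Client('IP:PORT')  # placeholder scheduler address, as in the question

# Split the file into ~100MB blocks of bytes that break cleanly on newlines.
sample, blocks = read_bytes('file.csv', delimiter=b'\n', blocksize=100 * 2**20)

# The sample bytes start with the header row; reuse its column names for every block.
columns = list(pd.read_csv(BytesIO(sample), nrows=0).columns)

def parse(block, first):
    # Runs on a worker: parse the raw bytes that the worker already holds.
    return pd.read_csv(BytesIO(block),
                       header=0 if first else None,
                       names=None if first else columns)

parts = [dask.delayed(parse)(block, i == 0) for i, block in enumerate(blocks[0])]
ddf = dd.from_delayed(parts)
ddf = client.persist(ddf)  # materialise the partitions on the workers, parsed in parallel

result = ddf.groupby(['col1'])[['col2']].sum().compute()
print(result)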

Related

How to read large sas file with pandas and export to parquet using multiprocessing?

I'm trying to make a script that reads a sas7bdat file and exports it to parquet using pandas, but I'm struggling to get good performance with large files (>1 GB and more than 1 million rows). Doing some research, I found that using multiprocessing could help me, but I can't make it work. The code runs with no errors, but no parquet files are created.
Here is what I have so far:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

arq = path_to_my_file

def sas_mult_process(data):
    for i, df in enumerate(data):
        df.to_parquet(f"{'hist_dif_base_pt'+str(i)}.parquet")

file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(sas_mult_process, file_reader)
Can anyone see where my mistake is?
You use the term multiprocessing all over the place, yet your code is not using multiprocessing but rather multithreading. It appears that you are trying to break up the input file into dataframe chunks and have each chunk become a separate output file. If that is so, you would want to pass each chunk to your worker sas_mult_process, which would then process that single chunk. I am assuming that converting the input to parquet involves more than just I/O and entails some CPU processing; therefore, multiprocessing would be the better choice.
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

arq = path_to_my_file

def sas_mult_process(tpl):
    """
    This worker function is passed an index and a single chunk
    as a tuple.
    """
    i, df = tpl  # Unpack
    # The following f-string can be simplified:
    df.to_parquet(f"hist_dif_base_pt{i}.parquet")

# Required for Windows:
if __name__ == '__main__':
    file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(sas_mult_process, enumerate(file_reader))

Parallelize 20K requests + filter & concat results into 1 dataframe

I need to make around 20K API calls; each one returns a CSV file, on which I then have to perform some operations, and finally I concatenate all the results into a single dataframe.
I've completed this task sequentially, but the issue is that each API call takes around 1 second, so the whole job takes around 6 hours. I would therefore like to parallelize the task, as I can make up to 100 simultaneous API calls and up to 1,000 calls per minute.
I've tried several things, but I'm struggling. I have managed to parallelize the tasks and complete 200 API calls in about 8 seconds, but I can't concatenate all the results into a single dataframe. I would appreciate any help.
Thanks! :)
This is what I have:
from concurrent.futures import ThreadPoolExecutor
I have a few issues recreating the data:
start_time = time.time() # invalid start time
df = df[df.gap > 30] # returns an empty df since all values for gap are below 30
Suggestion:
You can set &fmt=json to have an easier time creating your df
df = pd.DataFrame(r.json())
If I change those things, your code works as expected.
Please provide a list of tickers and a reproducible example. (I have an API key)
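In case it helps, here is a minimal sketch of the thread-pool + concat pattern; scrape and tickers are placeholders standing in for the code from the question, not real API details:
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def scrape(ticker):
    # Placeholder for the real API call and filtering; return None to skip a ticker.
    return pd.DataFrame({"ticker": [ticker], "gap": [42]})

tickers = ["AAA", "BBB", "CCC"]  # would be the full ~20K list

with ThreadPoolExecutor(max_workers=100) as executor:
    frames = [df for df in executor.map(scrape, tickers) if df is not None]

result = pd.concat(frames, ignore_index=True)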
Alternatively, you could use dask.dataframe, which gives an API much like that of pandas and handles parallelization over multiple threads, processes, or physical servers:
import dask
import dask.distributed
import dask.dataframe
import pandas as pd
# start a dask client
client = dask.distributed.Client()
tickers = tickers[:200]
# map your job over the tickers
futures = client.map(scrape, tickers)
# wait for the jobs to finish, then filter out the Null return values
dask.distributed.wait(futures)
non_null_futures = [f for f in futures if f.type != type(None)]
# convert the futures to a single dask.dataframe
ddf = dask.dataframe.from_delayed(non_null_futures)
# if desired, you can now convert to pandas
df = ddf.compute()
# alternatively, you could write straight to disk, giving
# a partitioned binary file, e.g.
ddf.to_parquet("myfile.parquet")

Optimizing processing of a large excel file

I am working on a large dataset where I need to read Excel files and then find valid phone numbers, but the task takes an enormous amount of time for only 500k rows. To validate the numbers, I am using the phonenumbers library (a port of Google's libphonenumber). The processing can be done in an async way, as the rows are independent.
parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)),meta=('Valid','object'))
For background,
phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))
returns True.
I expect the task to take less than 10 seconds, but it takes around 40s.
I've just been playing with this, and you might just need to repartition your dataframe to allow the computation to be run in parallel.
I start by generating some data:
import csv
import random

with open('tmp.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerow(['id', 'number'])
    for i in range(500_000):
        a = random.randrange(1000, 2999)
        b = random.randrange(100_000, 899_999)
        out.writerow([i+1, f'+44 {a} {b}'])
note that these are mostly valid UK numbers.
I then run something similar to your code:
from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers

def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

with Client(processes=True):
    df = dd.read_csv('tmp.csv')
    # repartition to increase parallelism
    df = df.repartition(npartitions=8)
    df['valid'] = df.number.apply(fn, meta=('valid', 'object'))
    out = df.compute()
This takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. That indicates dask has quite a bit of runtime overhead here; I'd expect it to be much faster than that. If I remove the call to repartition, it takes longer than I'm willing to wait, and top only shows a single process running.
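For reference, a plain single-threaded pandas version of the same check (a sketch for comparison; not code from the answer) looks something like this:
import pandas as pd
import phonenumbers

df = pd.read_csv('tmp.csv')
df['valid'] = df.number.apply(
    lambda num: phonenumbers.is_valid_number(phonenumbers.parse(num)))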
Note that if I rewrite it to do the naive thing with multiprocessing, I get much better results:
from multiprocessing import Pool
import pandas as pd

df = pd.read_csv('tmp.csv')
with Pool(4) as pool:
    df['valid'] = pool.map(fn, df['number'])
which reduces the runtime to ~11 seconds and, as a bonus, is even less code.

ERROR - error from worker No such file or directory: 'filepath'

I have a sample data set present on my local machine and I'm trying to do some basic operations on a cluster.
import dask.dataframe as ddf
from dask.distributed import Client

client = Client('IP address of the scheduler')
csvdata = ddf.read_csv('Path to the CSV file')
The client is connected to a scheduler, which in turn is connected to two workers (on other machines).
My questions may be pretty trivial.
Should this CSV file be present on the other worker nodes?
I seem to get file-not-found errors.
Using:
futures = client.scatter(csvdata)
x = ddf.from_delayed([future], meta=df)
# Price is a column in the data
df.Price.sum().compute(get=client.get)  # returns "dd.Scalar<series-..., dtype=float64>". How do I access it?
client.submit(sum, x.Price)  # returns "distributed.utils - ERROR - 6dc5a9f58c30954f77913aa43c792cc8"
Also, I did refer to
Loading local file from client onto dask distributed cluster and http://distributed.readthedocs.io/en/latest/manage-computation.html
I think I'm mixing up a lot of things here and my understanding is muddled.
Any help would be really appreciated.
Yes, here dask.dataframe is assuming that the files you refer to in your client code are also accessible by your workers. If this is not the case then you will have to read in your data explicitly on your local machine and scatter it out to your workers.
It looks like you're trying to do exactly this, except that you're scattering dask dataframes rather than pandas dataframes. You will actually have to concretely load the pandas data from disk before you scatter it. If your data fits in memory then you should be able to do exactly what you're doing now, but replace the dd.read_csv call with pd.read_csv:
import pandas as pd

csvdata = pd.read_csv('Path to the CSV file')
[future] = client.scatter([csvdata])
x = ddf.from_delayed([future], meta=csvdata).repartition(npartitions=10).persist()
# Price is a column in the data
x.Price.sum().compute(get=client.get)  # Should return a number
If your data is too large then you might consider using dask locally to read and scatter data to your cluster piece by piece.
import dask
import dask.dataframe as dd

ddf = dd.read_csv('filename')
# compute with the single-threaded local scheduler, scattering each partition as we go
futures = ddf.map_partitions(lambda part: client.scatter([part])[0]).compute(get=dask.get)
ddf = dd.from_delayed(list(futures), meta=ddf.meta)

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd

def read_file_to_dataframe(file_path):
    df = pd.read_csv(file_path)
    df['some_extra_column'] = 'some_extra_value'
    return df

if __name__ == '__main__':
    path = '/path/to/my/files'
    delayed_collection = list()
    for rootdir, subdirs, files in os.walk(path):
        for filename in files:
            if filename.endswith('.csv'):
                file_path = os.path.join(rootdir, filename)
                delayed_reader = delayed(read_file_to_dataframe)(file_path)
                delayed_collection.append(delayed_reader)

    df = dd.from_delayed(delayed_collection)
    print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method, with the single-threaded scheduler, and with both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd

def create_df_from_bytesio(bytesio):
    df = pd.read_csv(bytesio)
    return df

def create_bytesio_from_bytes(block):
    bytesio = BytesIO(block)
    return bytesio

path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)

delayed_collection = list()
for datafile in blocks:
    for block in datafile:
        bytesio = delayed(create_bytesio_from_bytes)(block)
        df = delayed(create_df_from_bytesio)(bytesio)
        delayed_collection.append(df)

dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
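For example (a sketch; the exact block size is an assumption and up to you):
import dask.dataframe as dd

# blocksize takes a byte count; this asks for ~100MB blocks
ddf = dd.read_csv('/path/to/my/files/*.csv', blocksize=100 * 2**20)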
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading all of the data into a Pandas dataframe, which will eventually blow up memory. Instead, it's better to operate on the Dask dataframe and only compute small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.
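For example, a one-off conversion could look like this (a sketch, assuming a dask version with Parquet support via fastparquet or pyarrow):
import dask.dataframe as dd

# Convert the CSVs once, then read the Parquet data back for later analyses.
ddf = dd.read_csv('/path/to/my/files/*.csv')
ddf.to_parquet('/path/to/my/files/parquet/')

ddf = dd.read_parquet('/path/to/my/files/parquet/')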
