Using Dask to parallelize HDF read-translate-write

TL;DR: We're having issues parallelizing Pandas code with Dask that reads and writes from the same HDF
I'm working on a project that generally requires three steps: reading, translating (or combining), and writing the data. For context, we're working with medical records: we receive claims in different formats, translate them into a standardized format, then write them back to disk. Ideally, I'd like to save intermediate datasets in some form that I can access via Python/Pandas later.
Currently, I've chosen HDF as my data storage format, but I'm running into runtime issues. On a large population, my code can currently take upwards of a few days. This has led me to investigate Dask, but I'm not sure I'm applying Dask in the best way to my situation.
What follows is a working example of my workflow, hopefully with enough sample data to give a sense of the runtime issues.
Read (in this case Create) data
import pandas as pd
import numpy as np
import dask
from dask import delayed
from dask import dataframe as dd
import random
from datetime import timedelta
from pandas.io.pytables import HDFStore
member_id = range(1, 10000)
window_start_date = pd.to_datetime('2015-01-01')
start_date_col = [window_start_date + timedelta(days=random.randint(0, 730)) for i in member_id]
# Eligibility records
eligibility = pd.DataFrame({'member_id': member_id,
                            'start_date': start_date_col})
eligibility['end_date'] = eligibility['start_date'] + timedelta(days=365)
eligibility['insurance_type'] = np.random.choice(['HMO', 'PPO'], len(member_id), p=[0.4, 0.6])
eligibility['gender'] = np.random.choice(['F', 'M'], len(member_id), p=[0.6, 0.4])
(eligibility.set_index('member_id')
            .to_hdf('test_data.h5',
                    key='eligibility',
                    format='table'))

# Inpatient records
inpatient_record_number = range(1, 20000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in inpatient_record_number]
inpatient = pd.DataFrame({'inpatient_record_number': inpatient_record_number,
                          'service_date': service_date})
inpatient['member_id'] = np.random.choice(list(range(1, 10000)), len(inpatient_record_number))
inpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(inpatient_record_number))
(inpatient.set_index('member_id')
          .to_hdf('test_data.h5',
                  key='inpatient',
                  format='table'))

# Outpatient records
outpatient_record_number = range(1, 30000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in outpatient_record_number]
outpatient = pd.DataFrame({'outpatient_record_number': outpatient_record_number,
                           'service_date': service_date})
outpatient['member_id'] = np.random.choice(range(1, 10000), len(outpatient_record_number))
outpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(outpatient_record_number))
(outpatient.set_index('member_id')
           .to_hdf('test_data.h5',
                   key='outpatient',
                   format='table'))
Translate/Write data
Sequential approach
def pull_member_data(member_i):
    inpatient_slice = pd.read_hdf('test_data.h5', 'inpatient', where='index == "{}"'.format(member_i))
    outpatient_slice = pd.read_hdf('test_data.h5', 'outpatient', where='index == "{}"'.format(member_i))
    return inpatient_slice, outpatient_slice

def create_visits(inpatient_slice, outpatient_slice):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER into medical 'visits'
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    visits_stacked = pd.concat([inpatient_slice, outpatient_slice]).reset_index().sort_values('service_date')
    visits_stacked.insert(0, 'visit_id', range(1, len(visits_stacked) + 1))
    return visits_stacked

def save_visits_to_hdf(visits_slice):
    with HDFStore('test_data.h5', mode='a') as store:
        store.append('visits', visits_slice)

# Read in the data by member_id, perform some operation
def translate_by_member(member_i):
    inpatient_slice, outpatient_slice = pull_member_data(member_i)
    visits_slice = create_visits(inpatient_slice, outpatient_slice)
    save_visits_to_hdf(visits_slice)

def run_translate_sequential():
    # Simple approach: Loop through each member sequentially
    for member_i in member_id:
        translate_by_member(member_i)

run_translate_sequential()
The above code takes ~9 minutes to run on my machine.
Dask approach
def create_visits_dask_version(visits_stacked):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    len_of_visits = visits_stacked.shape[0]
    visits_stacked_1 = (visits_stacked
                        .sort_values('service_date')
                        .assign(visit_id=range(1, len_of_visits + 1))
                        .set_index('visit_id')
                        )
    return visits_stacked_1

def run_translate_dask():
    # Approach 2: Dask, with individual writes to HDF
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    visits.to_hdf('test_data_dask.h5', 'visits')

run_translate_dask()
This Dask approach takes 13 seconds(!)
While this is a great improvement, we're generally curious about a few things:
Given this simple example, is the approach of using Dask dataframes, concatenating them, and using groupby/apply the best approach?
In reality, we have multiple processes like this that read from and write to the same HDF file. Our original codebase was structured in a way that allowed the entire workflow to run one member_id at a time. When we tried to parallelize it, it sometimes worked on small samples, but most of the time produced a segmentation fault. Are there known issues with parallelizing workflows like this that read from and write to HDF files? We're working on producing an example of this as well, but figured we'd post this here in case it triggers suggestions (or in case this code helps someone facing a similar problem).
Any and all feedback appreciated!

In general, groupby-apply will be fairly slow. It is hard to re-sort data like this, especially in limited memory.
In general I recommend using the Parquet format (dask.dataframe has to_parquet and read_parquet functions). You are much less likely to get segfaults than with HDF files.
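For the example above, that swap might look something like the sketch below. This is an illustration only: it reuses the create_visits_dask_version function from the question, and dask's Parquet writer needs pyarrow or fastparquet installed.
import dask.dataframe as dd

# read the claim tables from the existing HDF store
inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')

# stack and build visits per member, as in the question
stacked = dd.concat([inpatient_dask, outpatient_dask])
visits = stacked.groupby('member_id').apply(create_visits_dask_version)

# write the result as a Parquet dataset instead of HDF;
# later steps can read it back with dd.read_parquet('visits.parquet')
visits.to_parquet('visits.parquet')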

Related

Apply a function to each group when the groups are split across multiple files, without concatenating all the files

My data comes from BigQuery, exported to a GCS bucket as CSV files; if the export is large enough, BigQuery automatically splits the data into several chunks. Because the data is time series, a given time series might be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.
Here are some constraints on the data:
The data is sorted by TimeseriesID and TimeID
The number of rows in each file may vary, but each file has at least 1 row (a single-row file is very unlikely)
The starting TimeID is not always 0
The length of each time series may vary, but a series is scattered across at most 2 files; no time series spans 3 different files
Here's the initial setup to illustrate the problem:
import pandas as pd
import numpy as np

# Please take note this is just for simplicity. The actual goal is not to calculate the mean for every group,
# but to apply a custom_func to each TimeseriesID
def custom_func(x):
    return np.mean(x)

# Please take note this is just for simplicity. In reality, I read the files one by one since reading all the data at once is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
This would be pretty trivial if I could just concat all the files, but the problem is that if I concatenate all the dataframes, the result won't fit in memory.
The output I want should be similar to this, but without concatenating all the files:
pd.concat([df1,df2,df3],axis=0).groupby('TimeseriesID').agg({"value":custom_func})
I'm also aware of vaex and dask, but I want to stick with plain pandas for the time being.
I'm also open to solutions that involve modifying the BigQuery export to split the files better.
The approach presented by the OP, using concat with millions of records, would be overkill for memory and other resources. I tested the OP's code in a Google Colab notebook and it is a poor approach:
import pandas as pd
import numpy as np
import time
# Please take note this is just for simplicity. The actual goal is not to calculate mean for all group, but to apply a custom_func to each Timeseries ID
def custom_func(x):
    return np.mean(x)
# Please take note this is just for simplicity. In actual, I read the file one by one since reading all the data is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
start = time.time()
df = pd.concat([df1,df2,df3]).groupby('TimeseriesID').agg({"value":custom_func})
elapsed = (time.time() - start)
print(elapsed)
print(df.head())
output will be:
0.023952960968017578
                  value
TimeseriesID
A             11.666667
B             16.250000
C             20.000000
D             18.333333
As you can see, 'concat' takes time to process; with only a few records the cost is barely noticeable, but it adds up.
The approach should be as follows:
Get the files with only the data you are going to process, i.e. only the workable columns.
Build a dictionary of keys and values from the processed files; if necessary, collect the values for each key into its own file. You can store the results in a 'results' directory as json/csv:
A.csv will have all key 'A' values
...
n.csv will have all key 'n' values
Iterate through the results directory and start building your final output inside a dictionary:
{'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
Apply the custom function to each key's list of values:
{'A': 11.666666666666666, 'B': 16.25, 'C': 20.0, 'D': 18.333333333333332}
You can check the logic using the code below; I use json to store the data:
from google.colab import files
import json
import pandas as pd
#initial dataset
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
#get unique keys and its values
df1.groupby('TimeseriesID')['value'].apply(list).to_json('df1.json')
df2.groupby('TimeseriesID')['value'].apply(list).to_json('df2.json')
df3.groupby('TimeseriesID')['value'].apply(list).to_json('df3.json')
#as this is an example you can download the output as jsons
files.download('df1.json')
files.download('df2.json')
files.download('df3.json')
Update 06/10/2021
I have tuned the code for the OP's needs. This part creates the refined files.
from google.colab import files
import json

# you should use your own function to get the data from the file
def retrieve_data(uploaded, file):
    return json.loads(uploaded[file].decode('utf-8'))

# you should use your own function to get a list of files to process
def retrieve_files():
    return files.upload()

key_list = []

# call a function that gets a list of files to process
file_to_process = retrieve_files()

# read every raw file:
for file in file_to_process:
    file_data = retrieve_data(file_to_process, file)
    for key, value in file_data.items():
        if key not in key_list:
            key_list.append(key)
            with open(f'{key}.json', 'w') as new_key_file:
                new_json = json.dumps({key: value})
                new_key_file.write(new_json)
        else:
            with open(f'{key}.json', 'r+') as key_file:
                raw_json = key_file.read()
                old_json = json.loads(raw_json)
                new_json = json.dumps({key: old_json[key] + value})
                key_file.seek(0)
                key_file.write(new_json)

for key in key_list:
    files.download(f'{key}.json')

print(key_list)
Update 07/10/2021
I have updated the code to avoid confusion. This part processes the refined files.
import time
import numpy as np
import pandas as pd

# Once we get the refined values we can use them to apply custom functions
def custom_func(x):
    return np.mean(x)

# Get key and data content from a single json
def get_data(file_data):
    content = file_data.popitem()
    return content[0], content[1]

# load key list and build our refined dictionary
refined_values = []

# call a function that gets a list of files to process
# (retrieve_files and retrieve_data are defined in the previous snippet)
file_to_process = retrieve_files()

start = time.time()
# read every refined file:
for file in file_to_process:
    # read content of file n
    file_data = retrieve_data(file_to_process, file)
    # parse and apply function per file read
    key, data = get_data(file_data)
    func_output = custom_func(data)
    # start building refined list
    refined_values.append([key, func_output])
elapsed = (time.time() - start)
print(elapsed)

df = pd.DataFrame.from_records(refined_values, columns=['TimerSeriesID', 'value']).sort_values(by=['TimerSeriesID'])
df = df.reset_index(drop=True)
print(df.head())
output will be:
0.00045609474182128906
  TimerSeriesID      value
0             A  11.666667
1             B  16.250000
2             C  20.000000
3             D  18.333333
To summarize:
When handling large datasets, you should always focus on the data you are actually going to use and keep it minimal, using only the workable values.
Processing times are faster when operations are performed with basic operators or Python's native libraries.

Using Multiprocessing on a Collection of DataFrames

Setup
I have multiple datasets, each with its own DataFrame. I'm running calculations within them before comparing my results to a separate DataFrame, which we can think of as constraints.
For example, let's say we have 2 sets of data in a dictionary:
df_data_1 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
df_data_2 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
data_sets = {'data_1': df_data_1, 'data_2': df_data_2}
and one set of constraints:
df_constraints = pd.DataFrame([['a', 10, 20, 10000000],
                               ['b', 100, 200, 20000000],
                               ['c', 1000, 2000, 30000000]])
df_constraints.columns = ['index', 'sumMin', 'sumMax', 'productMax']
df_constraints.set_index('index', inplace=True)
Visually: (screenshots of data_set_1, data_set_2, and the constraints table omitted)
Function
I'm making calculations within each set of data and then comparing them to a set of constraints. To keep my question simple, I am only comparing the data to the first row of constraints here, but in reality I have to compare the results of my calculations within each data set against up to 20 sets of constraints.
Here is a simplified version of the function that I am trying to have run in parallel:
def test_func(df_data, df_constraints):
    # Run some calculations
    df = df_data.copy()
    df['sum'] = df.sum(axis=1)
    df['product'] = df.product(axis=1)
    # Compare results to constraints
    df['sumFit'] = ((df['sum'] > df_constraints.loc['a', 'sumMin']) &
                    (df['sum'] < df_constraints.loc['a', 'sumMax']))
    df['productFit'] = df['product'] < df_constraints.loc['a', 'productMax']
    # Analyze results
    count_sumFits = df['sumFit'].sum()
    count_productFits = df['productFit'].sum()
    df_results = pd.DataFrame([['data_set_1', count_sumFits, count_productFits]],
                              columns=['DataSet', 'FittingSums', 'FittingProducts'])
    df_results.set_index('DataSet', inplace=True)
    return df_results
Sequential Version
I can run this function sequentially over each set of data, iterating through the dictionary with a while loop and appending the results as shown here, but with increased complexity this takes far longer than I would like. (It's ugly, but it works.)
n = 0
while n < len(data_sets):
    data_set_names = list(data_sets.keys())
    df_temp = test_func(data_sets[data_set_names[n]], df_constraints)
    df_all_results.loc[n, 'FittingSums'] = df_temp.loc[0, 'FittingSums']
    df_all_results.loc[n, 'FittingProducts'] = df_temp.loc[0, 'FittingProducts']
    n += 1
The Problem
When I have 25 data sets and I'm running more complex analysis with more calculations, the run time ends up being minutes long, which led me to pursue concurrency/multiprocessing. I'm hoping to make this significantly faster, as it is one step of many that I'm trying to optimize before running them all a few thousand times.
So, Multiprocessing...
Because I need to pass two arguments to the function, I've been looking at mp.Pool.starmap and pool.map(partial(test_func, b=df_constraints), data_sets), but I haven't been able to get either method to work.
ex.1) mp.Pool.starmap
if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    output = pool.starmap(test_file.test_func, zip(data_sets, itertools.repeat(df_constraints)))
This is as far as I've been able to get. Is it possible to process data concurrently like this and then append the results to a dataframe? I don't need them to be in any particular order; I just want to get the data into the right format.
I don't fully understand your code and your logic, but try replacing data_sets with data_sets.values():
if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    output = pool.starmap(test_file.test_func, zip(data_sets.values(), itertools.repeat(df_constraints)))
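To address the second part of the question (collecting the outputs into one dataframe), here is a minimal sketch under the assumption that test_func and the data are defined as in the question and that pandas is imported as pd; since starmap returns a list of the per-dataset result frames, they can simply be concatenated:
if __name__ == '__main__':
    with mp.Pool(processes=8) as pool:
        output = pool.starmap(test_func, zip(data_sets.values(), itertools.repeat(df_constraints)))
    # output is a list of one-row DataFrames, returned in input order,
    # so they can be stacked into the final results table
    df_all_results = pd.concat(output, axis=0)
    print(df_all_results)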

pd.read_sav and pyreadstat are so slow. How can I speed up pandas for big data if I have to use the SAV/SPSS file format?

I've been transitioning away from SPSS for syntax writing/data management where I work, to Python and pandas for higher levels of functionality and programming. The issue is that reading SPSS files into pandas is SO slow. I work with bigger datasets (1 million or more rows, often with 100+ columns). There seem to be some pretty cool libraries out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).
Are there any tips on how to accomplish faster data processing, outside of computer upgrades and/or file chunking?
You can try to parallelize reading your file:
As an example, I have a file "big.sav" that is 294000 rows x 666 columns. Reading it with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds; by parallelizing it I get it down to 29 seconds.
First, I create a file worker.py:
def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df
and then in the main script I have this:
import multiprocessing as mp
from time import time
import pandas as pd
import pyreadstat
from worker import worker
# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows
# calculate number of cores in the machine, this could also be set manually to some number, i.e. 8
numcores = mp.cpu_count()
# calculate the chunksize and offsets
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
chunksize = divs[0]
offsets = [indx*chunksize for indx in range(numcores)]
# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]
pool = mp.Pool(processes=numcores)
# let's go!
t0=time()
chunks = pool.map(worker, jobs)
t1=time()
print(t1-t0) # this prints 29 seconds
# chunks is a list of dataframes in the right order
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
EDIT:
pyreadstat version 1.0.3 has had a big improvement in performance of about 5x.
In addition, a new function "read_file_multiprocessing" has been added that is a wrapper around the code previously shared in this answer. It can give up to another 3x improvement, for (up to) a 15x improvement compared to the previous version!
You can use the function like this:
import pyreadstat
fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)

How to read data in HDF5 format file partially when the data is too large to read fully

I am analysing HDF5-format data for scientific research purposes, using Python's h5py library.
The HDF file I want to read is very large: its file size is about 20GB, and the main part of its data is a 400000*10000 float matrix. I tried to read the data all at once, but my development environment, Spyder, was forcibly terminated because it ran out of memory. Is there any way to read the file partially and avoid this problem?
Use pd.read_hdf with the columns argument. See the example below:
import numpy as np
import pandas as pd
from contexttimer import Timer

def create_sample_df():
    with Timer() as t:
        df = pd.DataFrame(np.random.rand(100000, 5000))
        df.to_hdf('file.h5', 'df', format='table')
    print('create_sample_df: %.2fs' % t.elapsed)

def read_full_df():
    """ data is too large to read fully """
    with Timer() as t:
        df = pd.read_hdf('file.h5')
    print('read_full_df: %.2fs' % t.elapsed)

def read_df_with_start_stop():
    """ to quick look all columns """
    with Timer() as t:
        df = pd.read_hdf('file.h5', start=0, stop=5)
    print('read_df_with_start_stop: %.2fs' % t.elapsed)

def read_df_with_columns():
    """ to read dataframe (hdf5) with necessary columns """
    with Timer() as t:
        df = pd.read_hdf('file.h5', columns=list(range(4)))
    print('read_df_with_columns: %.2fs' % t.elapsed)

if __name__ == '__main__':
    create_sample_df()
    read_full_df()
    read_df_with_start_stop()
    read_df_with_columns()

# outputs:
# create_sample_df: 51.25s
# read_full_df: 5.21s
# read_df_with_start_stop: 0.03s
# read_df_with_columns: 4.44s
read_df_with_columns only reduces the space cost; it does not necessarily improve speed. And this assumes the HDF5 file was saved in table format (otherwise the columns argument cannot be applied).
You can also slice h5py datasets like numpy arrays, so you could work on a number of subsets instead of the whole dataset (e.g. four 100000*10000 subsets).
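A rough sketch of that idea is below; the file name 'data.h5' and dataset name 'main' are placeholders, since the question does not give them:
import h5py
import numpy as np

with h5py.File('data.h5', 'r') as f:
    dset = f['main']                # the 400000 x 10000 float dataset
    n_rows = dset.shape[0]
    chunk_rows = 100000             # read ~100000 rows at a time
    for start in range(0, n_rows, chunk_rows):
        # only this slice is loaded into memory as a numpy array
        block = dset[start:start + chunk_rows, :]
        # process the block here, e.g.
        print(start, np.mean(block))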

Multiprocess Python/Numpy code for processing data faster

I am reading in hundreds of HDF files and processing the data of each HDF separately. However, this takes an awful amount of time, since it works on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.
So far, I came up with this:
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii = dates.index(date)
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf, dates)
However, this is obviously not correct. Can you follow my chain of thought as to what I want to do? What do I need to change?
Try joblib for a friendlier multiprocessing wrapper:
from joblib import Parallel, delayed

def myhdf(date):
    # do work
    return np.mean(records)

results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
The Pool class's map function is like the standard Python library's map function: you're guaranteed to get your results back in the order you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner, and then filter them afterwards.
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        return np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']

pool = Pool(len(dates))
results = pool.map(myhdf, dates)
results = [result for result in results if result]
results = np.array(results)
If you really do want results as soon as they are available, you can use imap_unordered:
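A minimal sketch of that, assuming the same myhdf and dates as above (imap_unordered yields each result as soon as its worker finishes, so the output order is no longer tied to the input order):
pool = Pool(len(dates))
results = []
for result in pool.imap_unordered(myhdf, dates):
    if result is not None:      # skip dates whose files had no records
        results.append(result)
results = np.array(results)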
