Multiprocess Python/Numpy code for processing data faster - python

I am reading in hundreds of HDF files and processing the data of each HDF separately. However, this takes an awful lot of time, since it processes one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.
So far, I came up with this:
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii = dates.index(date)
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf, dates)
However, this is obviously not correct. Can you follow my chain of thought and see what I am trying to do? What do I need to change?

Try joblib for a friendlier multiprocessing wrapper:
import numpy as np
from joblib import Parallel, delayed

def myhdf(date):
    # read the HDF file for this date, as in the question
    records = read_my_hdf('data/mydata/', 'no2track' + date)
    return np.mean(records) if records.size else None

# n_jobs=-1 uses all available cores; results come back in the same order as dates
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)

The Pool class's map function is like the standard Python library's map function: you're guaranteed to get your results back in the order that you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner, and then filter them afterwards.
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        return np.mean(records)

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']

pool = Pool(len(dates))
results = pool.map(myhdf, dates)
results = [result for result in results if result is not None]  # drop dates with no records
results = np.array(results)
If you really do want results as soon as they are available, you can use imap_unordered.
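A minimal sketch of that variant, assuming the same myhdf and dates as above; since results no longer come back in input order, a small (hypothetical) wrapper tags each result with its date so they can be matched up afterwards:

def myhdf_keyed(date):
    # hypothetical wrapper: return the date alongside the result
    return date, myhdf(date)

with Pool(len(dates)) as pool:
    results_by_date = dict(pool.imap_unordered(myhdf_keyed, dates))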

Related

pd.read_sav and pyreadstat are so slow. How can I speed up pandas for big data if I have to use the SAV/SPSS file format?

I've been transitioning away from SPSS for syntax writing/data management where I work to Python and pandas for higher levels of functionality and programming. The issue is, reading SPSS files into pandas is SO slow. I work with bigger datasets (1 million or more rows, often with 100+ columns). It seems that there are some pretty cool plugins out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).
Are there any tips on how to accomplish faster data processing, outside of computer upgrades and/or file chunking?
You can try to parallelize reading your file:
As an example, I have a file "big.sav" which is 294000 rows x 666 columns. Reading the file with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing it I get 29 seconds.
First, I create a file worker.py:
def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df
and then in the main script I have this:
import multiprocessing as mp
from time import time

import pandas as pd
import pyreadstat

from worker import worker

# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows

# calculate the number of cores in the machine; this could also be set manually to some number, i.e. 8
numcores = mp.cpu_count()

# calculate the chunksize and offsets
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
chunksize = divs[0]
offsets = [indx * chunksize for indx in range(numcores)]

# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]

pool = mp.Pool(processes=numcores)

# let's go!
t0 = time()
chunks = pool.map(worker, jobs)
t1 = time()
print(t1 - t0)  # this prints 29 seconds

# chunks is a list of dataframes in the right order;
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
EDIT:
pyreadstat version 1.0.3 has had a big improvement in performance of about 5x.
In addition, a new function "read_file_multiprocessing" has been added; it is a wrapper around the code shared previously in this answer. It can give up to another 3x improvement, making (up to) a 15x improvement compared to the previous version!
You can use the function like this:
import pyreadstat
fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)

Dask high memory consumption when loading multiple Pandas dataframes on dictionary

I have a folder (7.7GB) with multiple pandas dataframes stored in the Parquet file format. I need to load all these dataframes into a Python dictionary, but since I only have 32GB of RAM, I use the .loc method to load just the data that I need.
When all the dataframes are loaded in memory in the Python dictionary, I create a common index from the indexes of all of the data, then I reindex all the dataframes with the new index.
I developed two scripts to do this: the first one in a classic sequential way, the second one using Dask in order to get some performance improvement from all the cores of my Threadripper 1920x.
Sequential code:
# Standard library imports
import os
import pathlib
import time

# Third party imports
import pandas as pd

# Local application imports

class DataProvider:

    def __init__(self):
        self.data = dict()

    def load_parquet(self, source_dir: str, timeframe_start: str, timeframe_end: str) -> None:
        t = time.perf_counter()
        symbol_list = list(file for file in os.listdir(source_dir) if file.endswith('.parquet'))

        # updating containers
        for symbol in symbol_list:
            path = pathlib.Path.joinpath(pathlib.Path(source_dir), symbol)
            name = symbol.replace('.parquet', '')
            self.data[name] = pd.read_parquet(path).loc[timeframe_start:timeframe_end]
        print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')
        t = time.perf_counter()

        # building index
        index = None
        for symbol in self.data:
            if index is not None:
                index.union(self.data[symbol].index)
            else:
                index = self.data[symbol].index
        print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')
        t = time.perf_counter()

        # reindexing data
        for symbol in self.data:
            self.data[symbol] = self.data[symbol].reindex(index=index, method='pad').itertuples()
        print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')

if __name__ == '__main__' or __name__ == 'builtins':
    source = r'WindowsPath'
    x = DataProvider()
    x.load_parquet(source_dir=source, timeframe_start='2015', timeframe_end='2015')
Dask code:
# Standard library imports
import os
import pathlib
import time

# Third party imports
from dask.distributed import Client
import pandas as pd

# Local application imports

def __load_parquet__(directory, timeframe_start, timeframe_end):
    return pd.read_parquet(directory).loc[timeframe_start:timeframe_end]

def __reindex__(new_index, df):
    return df.reindex(index=new_index, method='pad').itertuples()

if __name__ == '__main__' or __name__ == 'builtins':
    client = Client()
    source = r'WindowsPath'
    start = '2015'
    end = '2015'

    t = time.perf_counter()
    file_list = [file for file in os.listdir(source) if file.endswith('.parquet')]

    # build data
    data = dict()
    for file in file_list:
        path = pathlib.Path.joinpath(pathlib.Path(source), file)
        symbol = file.replace('.parquet', '')
        data[symbol] = client.submit(__load_parquet__, path, start, end)
    print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')
    t = time.perf_counter()

    # build index
    index = None
    for symbol in data:
        if index is not None:
            index.union(data[symbol].result().index)
        else:
            index = data[symbol].result().index
    print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')
    t = time.perf_counter()

    # reindex
    for symbol in data:
        data[symbol] = client.submit(__reindex__, index, data[symbol].result())
    print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')
I found the results pretty weird.
Sequential code:
max memory consumption during computations: 30.2GB
memory consumption at the end of computations: 15.6GB
total memory consumption (without Windows and others): 11.6GB
Loaded data in 54.289 seconds.
Built index in 0.428 seconds.
Reindexed data in 9.666 seconds.
Dask code:
max memory consumption during computations: 25.2GB
memory consumption at the end of computations: 22.6GB
total memory consumption (without Windows and others): 18.9GB
Loaded data in 0.638 seconds.
Built index in 27.541 seconds.
Reindexed data in 30.179 seconds.
My questions:
Why is the memory consumption at the end of the computation so much higher with Dask?
Why does building the common index and reindexing all the dataframes take so much time with Dask?
Also, when using the Dask code, the console prints the following warning:
C:\Users\edit\Anaconda3\envs\edit\lib\site-packages\distributed\worker.py:901: UserWarning: Large object of size 5.41 MB detected in task graph:
  (DatetimeIndex(['2015-01-02 09:30:00', '2015-01-02 ... s x 5 columns])
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  % (format_bytes(len(b)), s))
Even though the warning's suggestions are really good, I don't get what's wrong with my code. Why is it saying to keep data on workers? I thought that with the submit method I'm sending all the data to my client, so the workers have easy access to all the data. Thank you all for the help.
I am not an expert at all, just trying to help. You might want to try not using time.perf_counter and see if that changes anything.
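Regarding the client.scatter warning in the output: a minimal sketch of the pattern it suggests, applied to the reindex loop from the question (assuming the same client, data, and index variables; this is only an illustration of the warning's advice, not a tested fix):

# scatter the big common index to the workers once, instead of shipping it with every task
index_future = client.scatter(index, broadcast=True)

for symbol in data:
    # passing the existing future instead of .result() avoids pulling each
    # dataframe back to the client before resubmitting it
    data[symbol] = client.submit(__reindex__, index_future, data[symbol])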

Applying parallelization when updating dictionary values

datasets = {}
datasets['df1'] = df1
datasets['df2'] = df2
datasets['df3'] = df3
datasets['df4'] = df4

def prepare_dataframe(dataframe):
    return dataframe.apply(lambda x: x.astype(str).str.lower().str.replace('[^\w\s]', ''))

for key, value in datasets.items():
    datasets[key] = prepare_dataframe(value)
I need to prepare the data in some dataframes for further analysis. I would like to parallelize the for loop that updates the dictionary with a prepared dataframe. This code will eventually run on a machine with dozens of cores and thousands of dataframes. On my local machine I do not appear to be using more than a single core in the prepare_dataframe function.
I have looked at Numba and Joblib but I cannot find a way to work with dictionary values in either library.
Any insight would be very much appreciated!
You can use the multiprocessing library. You can read about its basics here.
Here is the code that does what you need:
from multiprocessing import Pool

def prepare_dataframe(dataframe):
    # do whatever you want here
    # changes made here are *not* global
    # return a modified version of what you want
    return dataframe

def worker(dict_item):
    key, value = dict_item
    return (key, prepare_dataframe(value))

def parallelize(data, func):
    data_list = list(data.items())
    pool = Pool()
    data = dict(pool.map(func, data_list))
    pool.close()
    pool.join()
    return data

datasets = parallelize(datasets, worker)
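Since the question also mentions Joblib: the same keys/values split works there too. A minimal sketch, assuming the datasets dictionary and the prepare_dataframe function defined in the question:

from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores; results come back in the same order as the keys
keys = list(datasets)
prepared = Parallel(n_jobs=-1)(delayed(prepare_dataframe)(datasets[k]) for k in keys)
datasets = dict(zip(keys, prepared))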

Python parallel data processing

We have a dataset which has approximately 1.5MM rows. I would like to process it in parallel. The main function of the code is to look up master information and enrich the 1.5MM rows. The master is a two-column dataset with roughly 25,000 rows. However, I am unable to make the multiprocessing work and test its scalability properly. Can someone please help? The cut-down version of the code is as follows:
import pandas
from multiprocessing import Pool

def work(data):
    mylist = []
    #Business Logic
    return mylist.append(data)

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    chunksize = 2
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=data_df, chunksize=20)
        pool.close()
        pool.join()
    print('Result :', result)
The work method will have the business logic, and I would like to pass the partitioned data_df into work to enable parallel processing. The sample data is as follows:
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally I would like to process 6 rows in each partition.
Please help.
Thanks and Regards
Bala
First, work is returning the output of mylist.append(data), which is None. I assume (and if not, I suggest) you want to return a processed DataFrame.
To distribute the load, you could use numpy.array_split to split the large DataFrame into a list of 6-row DataFrames, which are then processed by work.
import pandas
import math
import numpy as np
from multiprocessing import Pool

def work(data):
    #Business Logic
    return data  # Return it as a DataFrame

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    rows_per_workload = 6
    num_loads = math.ceil(data_df.shape[0] / float(rows_per_workload))
    split_df = np.array_split(data_df, num_loads)  # A list of DataFrames
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=split_df)
        result = pandas.concat(result)  # Stitch them back together
        pool.close()
        pool.join()
    print('Result :', result)
My best recommendation is for you to use the chunksize parameter in read_csv (Docs) and iterate over the chunks. This way you won't crash your RAM trying to load everything, plus if you want you can, for example, use threads to speed up the process.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    ...  # process each chunk here
I'm not sure if this answers your specific question, but I hope it helps.
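For completeness, a minimal sketch of combining read_csv chunks with a thread pool, assuming a work(chunk) function like the one in the question (threads help most when the work is I/O-bound; CPU-bound logic may still need processes):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # placeholder for the business logic from the question
    return chunk

with ThreadPoolExecutor(max_workers=4) as executor:
    chunks = pd.read_csv('bigfile.csv', chunksize=500000)
    results = list(executor.map(work, chunks))

result = pd.concat(results)
print('Result :', result)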

Using Dask to parallelize HDF read-translate-write

TL;DR: We're having issues parallelizing Pandas code with Dask that reads and writes from the same HDF
I'm working on a project that generally requires three steps: reading, translating (or combining data), and writing these data. For context, we're working with medical records, where we receive claims in different formats, translate them into a standardized format, then re-write them to disk. Ideally, I'm hoping to save intermediate datasets in some form that I can access via Python/Pandas later.
Currently, I've chosen HDF as my data storage format, however I'm having trouble with runtime issues. On a large population, my code currently can take upwards of a few days. This has led me to investigate Dask, but I'm not positive I've applied Dask best to my situation.
What follows is a working example of my workflow, hopefully with enough sample data to get a sense of runtime issues.
Read (in this case Create) data
import pandas as pd
import numpy as np
import dask
from dask import delayed
from dask import dataframe as dd
import random
from datetime import timedelta
from pandas.io.pytables import HDFStore

member_id = range(1, 10000)
window_start_date = pd.to_datetime('2015-01-01')
start_date_col = [window_start_date + timedelta(days=random.randint(0, 730)) for i in member_id]

# Eligibility records
eligibility = pd.DataFrame({'member_id': member_id,
                            'start_date': start_date_col})
eligibility['end_date'] = eligibility['start_date'] + timedelta(days=365)
eligibility['insurance_type'] = np.random.choice(['HMO', 'PPO'], len(member_id), p=[0.4, 0.6])
eligibility['gender'] = np.random.choice(['F', 'M'], len(member_id), p=[0.6, 0.4])
(eligibility.set_index('member_id')
            .to_hdf('test_data.h5',
                    key='eligibility',
                    format='table'))

# Inpatient records
inpatient_record_number = range(1, 20000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in inpatient_record_number]
inpatient = pd.DataFrame({'inpatient_record_number': inpatient_record_number,
                          'service_date': service_date})
inpatient['member_id'] = np.random.choice(list(range(1, 10000)), len(inpatient_record_number))
inpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(inpatient_record_number))
(inpatient.set_index('member_id')
          .to_hdf('test_data.h5',
                  key='inpatient',
                  format='table'))

# Outpatient records
outpatient_record_number = range(1, 30000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in outpatient_record_number]
outpatient = pd.DataFrame({'outpatient_record_number': outpatient_record_number,
                           'service_date': service_date})
outpatient['member_id'] = np.random.choice(range(1, 10000), len(outpatient_record_number))
outpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(outpatient_record_number))
(outpatient.set_index('member_id')
           .to_hdf('test_data.h5',
                   key='outpatient',
                   format='table'))
Translate/Write data
Sequential approach
def pull_member_data(member_i):
    inpatient_slice = pd.read_hdf('test_data.h5', 'inpatient', where='index == "{}"'.format(member_i))
    outpatient_slice = pd.read_hdf('test_data.h5', 'outpatient', where='index == "{}"'.format(member_i))
    return inpatient_slice, outpatient_slice

def create_visits(inpatient_slice, outpatient_slice):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER into medical 'visits'
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    visits_stacked = pd.concat([inpatient_slice, outpatient_slice]).reset_index().sort_values('service_date')
    visits_stacked.insert(0, 'visit_id', range(1, len(visits_stacked) + 1))
    return visits_stacked

def save_visits_to_hdf(visits_slice):
    with HDFStore('test_data.h5', mode='a') as store:
        store.append('visits', visits_slice)

# Read in the data by member_id, perform some operation
def translate_by_member(member_i):
    inpatient_slice, outpatient_slice = pull_member_data(member_i)
    visits_slice = create_visits(inpatient_slice, outpatient_slice)
    save_visits_to_hdf(visits_slice)

def run_translate_sequential():
    # Simple approach: Loop through each member sequentially
    for member_i in member_id:
        translate_by_member(member_i)

run_translate_sequential()
The above code takes ~9 minutes to run on my machine.
Dask approach
def create_visits_dask_version(visits_stacked):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    len_of_visits = visits_stacked.shape[0]
    visits_stacked_1 = (visits_stacked
                        .sort_values('service_date')
                        .assign(visit_id=range(1, len_of_visits + 1))
                        .set_index('visit_id'))
    return visits_stacked_1

def run_translate_dask():
    # Approach 2: Dask, with individual writes to HDF
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    visits.to_hdf('test_data_dask.h5', 'visits')

run_translate_dask()
This Dask approach takes 13 seconds(!)
While this is a great improvement, we're generally curious about a few things:
Given this simple example, is the approach of using Dask dataframes, concatenating them, and using groupby/apply the best approach?
In reality, we have multiple processes like this that read from the same HDF, and write to the same HDF. Our original codebase was structured in a manner that allowed for running the entire workflow one member_id at a time. When we tried to parallelize them, it sometimes worked on small samples, but most of the time produced a segmentation fault. Are there known issues with parallelizing workflows like this, reading/writing with HDFs? We're working on producing an example of this as well, but figured we'd post this here in case this triggers suggestions (or if this code helps someone facing a similar problem).
Any and all feedback appreciated!
In general, groupby-apply will be fairly slow. It is generally challenging to re-sort data like this, especially in limited memory.
In general I recommend using the Parquet format (dask.dataframe has to_parquet and read_parquet functions). You are much less likely to get segfaults than with HDF files.
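A minimal sketch of what that swap could look like for the example above, assuming the same HDF keys and that a Parquet engine such as pyarrow or fastparquet is installed:

import dask.dataframe as dd

# one-off conversion: write each HDF table out to its own Parquet dataset
for key in ['eligibility', 'inpatient', 'outpatient']:
    dd.read_hdf('test_data.h5', key).to_parquet(key + '.parquet')

# downstream steps then read Parquet instead of HDF
inpatient_dask = dd.read_parquet('inpatient.parquet')
outpatient_dask = dd.read_parquet('outpatient.parquet')
stacked = dd.concat([inpatient_dask, outpatient_dask])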
