Setup
I have multiple datasets, each with its own DataFrame. I'm running calculations within each of them before comparing my results to a separate DataFrame, which we can think of as constraints.
For example, let's say we have two sets of data in a dictionary:
df_data_1 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
df_data_2 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
data_sets = {'data_1': df_data_1, 'data_2': df_data_2}
and one set of constraints:
df_constraints = pd.DataFrame([['a', 10, 20, 10000000],
                               ['b', 100, 200, 20000000],
                               ['c', 1000, 2000, 30000000]])
df_constraints.columns = ['index', 'sumMin', 'sumMax', 'productMax']
df_constraints.set_index('index', inplace=True)
Visually (screenshots of data_set_1, data_set_2, and the constraints DataFrame omitted).
Function
I'm making calculations within each set of data and then comparing them to a set of constraints. For the sake of simplifying my question, I am only comparing the data to the first row of constraints here, but in reality I have to compare the results of my calculations within each data set to up to 20 sets of constraints.
Here is a simplified version of the function that I am trying to have run in parallel:
def test_func(df_data, df_constraints):
    # Run some calculations
    df = df_data.copy()
    df['sum'] = df.sum(axis=1)
    df['product'] = df.product(axis=1)
    # Compare results to constraints
    df['sumFit'] = ((df['sum'] > df_constraints.loc['a', 'sumMin']) &
                    (df['sum'] < df_constraints.loc['a', 'sumMax']))
    df['productFit'] = df['product'] < df_constraints.loc['a', 'productMax']
    # Analyze results
    count_sumFits = df['sumFit'].sum()
    count_productFits = df['productFit'].sum()
    df_results = pd.DataFrame([['data_set_1', count_sumFits, count_productFits]],
                              columns=['DataSet', 'FittingSums', 'FittingProducts'])
    df_results.set_index('DataSet', inplace=True)
    return df_results
Sequential Version
I can run this function sequentially on each set of data, iterating through the dictionary with a while loop and then appending the results as shown here, but with increased complexity this takes far longer than I would like. (It's ugly, but it works.)
df_all_results = pd.DataFrame(columns=['FittingSums', 'FittingProducts'])  # collects the per-data-set results
n = 0
while n < len(data_sets):
    data_set_names = list(data_sets.keys())
    df_temp = test_func(data_sets[data_set_names[n]], df_constraints)
    df_all_results.loc[n, 'FittingSums'] = df_temp['FittingSums'].iloc[0]
    df_all_results.loc[n, 'FittingProducts'] = df_temp['FittingProducts'].iloc[0]
    n += 1
The Problem
When I have 25 data sets and I'm running a more complex analysis with more calculations, the run time ends up being minutes long, which has led me to pursue concurrency/multiprocessing. I'm hoping to make this significantly faster, as it is one of many steps I'm trying to optimize before running them all a few thousand times.
So, Multiprocessing...
Because I need to pass two arguments to the function, I've been looking at mp.Pool.starmap and pool.map(partial(test_func, df_constraints=df_constraints), data_sets), but I haven't been able to get either method to work.
ex.1) mp.Pool.starmap
if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    output = pool.starmap(test_file.test_func, zip(data_sets, itertools.repeat(df_constraints)))
This is as far as I've been able to get. Is it possible to process data concurrently like this and then append the results to a DataFrame? I don't need them to be in any particular order; I just want to get the data into the right format.
I don't fully understand your code and your logic, but try replacing data_sets with data_sets.values():
if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    output = pool.starmap(test_file.test_func, zip(data_sets.values(), itertools.repeat(df_constraints)))
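For completeness, here is a minimal sketch of how you could then gather the results into a single DataFrame, assuming test_func lives in your test_file module and returns a one-row DataFrame as in the question. starmap blocks and returns a plain list, so you can simply concatenate its output:

import itertools
import multiprocessing as mp

import pandas as pd

import test_file  # your module containing test_func, as in the question

if __name__ == '__main__':
    with mp.Pool(processes=8) as pool:
        # each worker receives one (df_data, df_constraints) pair
        output = pool.starmap(test_file.test_func,
                              zip(data_sets.values(), itertools.repeat(df_constraints)))
    # output is a list of one-row result DataFrames; order doesn't matter here,
    # so just stack them together
    df_all_results = pd.concat(output)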
Related
Set Up
This is part two of a question that I posted regarding accessing results from multiple processes.
For part one, see: Link to Part One
I have a complex set of data that I need to compare to various sets of constraints concurrently, but I'm running into multiple issues. The first issue is getting results out of my multiple processes, and the second is getting anything beyond an extremely simple function to run concurrently.
Example
I have multiple sets of constraints that I need to compare against some data, and I would like to do this concurrently because I have a lot of sets of constraints. In this example I'll just be using two sets of constraints.
Jupyter Notebook
Create Some Sample Constraints & Data
# Create a set of constraints
constraints = pd.DataFrame([['2x2x2', 2, 2, 2], ['5x5x5', 5, 5, 5], ['7x7x7', 7, 7, 7]],
                           columns=['Name', 'First', 'Second', 'Third'])
constraints.set_index('Name', inplace=True)

# Create a second set of constraints
constraints2 = pd.DataFrame([['4x4x4', 4, 4, 4], ['6x6x6', 6, 6, 6], ['7x7x7', 7, 7, 7]],
                            columns=['Name', 'First', 'Second', 'Third'])
constraints2.set_index('Name', inplace=True)

# Create some sample data
items = pd.DataFrame([['a', 2, 8, 2], ['b', 5, 3, 5], ['c', 7, 4, 7]],
                     columns=['Name', 'First', 'Second', 'Third'])
items.set_index('Name', inplace=True)
Running Sequentially
If I run this sequentially I can get my desired results, but with the data that I am actually dealing with it can take over 12 hours. Here is what it looks like run sequentially, so that you know what my desired result looks like.
# Function
def seq_check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()

    df_items['Product'] = df_items.product(axis=1)
    df_constraints['Product'] = df_constraints.product(axis=1)

    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint, 'Product']

    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True

    df_res = df_items.iloc[:, 7:]
    return df_res
constraint_sets = [constraints, constraints2, ...]
results = {}
counter = 0
for df in constraint_sets:
    res = seq_check_constraint(df, items)
    results['constraints'+str(counter)] = res
    counter += 1
or uglier:
df_res1 = seq_check_constraint(constraints, items)
df_res2 = seq_check_constraint(constraints2, items)
results = {'constraints0':df_res1, 'constraints1': df_res2}
As a result of running these sequentially, I end up with DataFrames like the ones shown here (screenshot omitted).
I'd ultimately like to end up with a dictionary or list of the DataFrames, or to be able to append the DataFrames all together. The order in which I get the results doesn't matter to me; I just want to have them all together and need to be able to do further analysis on them.
What I've Tried
So this brings me to my attempts at multiprocessing. From what I understand, you can use either Queues or Managers to handle shared data and memory, but I haven't been able to get either to work. I am also struggling to get my function, which takes two arguments, to execute within a Pool at all.
Here is my code as it stands right now using the same sample data from above:
Function
def check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()

    df_items['Product'] = df_items.product(axis=1)  # Mathematical Product
    df_constraints['Product'] = df_constraints.product(axis=1)

    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint, 'Product']

    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True

    df_res = df_items.iloc[:, 7:]
    return df_res
Jupyter Notebook
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        res = p.map_async(mpf.check_constraint, (df_ns.constraint_sets, itertools.repeat(items)))
        print(res.get())
and my current error:
TypeError: check_constraint() missing 1 required positional argument: 'df_items_input'
The easiest way is to create a list of tuples (where each tuple represents one set of arguments to the function) and pass it to starmap.
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        check_constraint_args = []
        for constraint in df_ns.constraint_sets:
            check_constraint_args.append((constraint, items))
        res = p.starmap(mpf.check_constraint, check_constraint_args)
        print(res)  # starmap blocks and returns a plain list of results, so no .get() is needed
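If you then want the results keyed the same way as in your sequential version, a small sketch (assuming res is the list returned by starmap above) could look like this; pd.concat also accepts the resulting dictionary directly if you prefer one stacked DataFrame:

import pandas as pd

# starmap returns one result DataFrame per argument tuple, in the same order
results = {'constraints' + str(i): df_res for i, df_res in enumerate(res)}

# or stack everything into a single DataFrame keyed by constraint set for further analysis
df_all = pd.concat(results)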
There is a huge dataframe with hundreds of thousands of rows for every single client. I want to summarise this dataframe into another dataframe where a single row contains the summarised data of all the rows of that client.
The problem is that this is not the only such code; there are roughly 1000 more similar lines, and it takes a long time to execute. But when I run the equivalent in R, it is 10 times faster. I'm attaching the R code for reference as well.
Is there a way I can make the Python code as fast as the R code?
Python Code:
for i in range(len(client)):
    print(i)
    sub = data.loc[data['Client Name']==client['Client Name'][i], :]
    client['requests'][i] = len(sub)
    client['ppt_req'][i] = len(sub)/sub['CID'].nunique()
    client['approval'][i] = (((sub['verify']=='Yes').sum())/client['requests'][i])*100
    client['denial'][i] = (((sub['verify']=='No').sum())/client['requests'][i])*100
    client['male'][i] = (((sub['gender']=='Male').sum())/client['requests'][i])*100
    client['female'][i] = (((sub['gender ']=='female').sum())/client['requests'][i])*100
R Code:
for(i in 1:nrow(client))
{
  print(i)
  #i=1
  sub <- subset(data, data$Client.Name==client$Client.Name[i])
  client$requests[i] <- nrow(sub)
  client$ppt_req[i] <- nrow(sub)/(length(unique(sub$CID)))
  client$approval[i] <- ((as.numeric(table(sub$verify=="Yes")["TRUE"]))/client$requests[i])*100
  client$denial[i] <- ((as.numeric(table(sub$verify=="No")["TRUE"]))/client$requests[i])*100
  client$male[i] <- ((as.numeric(table(sub$gender)["Male"]))/client$requests[i])*100
  client$female[i] <- ((as.numeric(table(sub$gender)["Female"]))/client$requests[i])*100
}
Iterating over a Pandas dataframe with Python loops is very slow.
But the main issue comes from the line data.loc[data['Client Name']==client['Client Name'][i],:], which walks through the whole dataframe data for each client. This means the line ends up iterating over >100,000 strings >100,000 times, so tens of billions of costly string comparisons are made. Not to mention that the per-group computations are replicated for each client.
You can solve this by using a groupby on client names followed by a merge.
Here is a sketch of code (untested):
# If the number of client names in `data` is much larger than in `client`,
# one can filter `data` before applying the next `groupby` using:
# client['Client Name'].unique()

# Generate a compact dataframe containing the information for each
# possible client name that appears in `data`.
clientDataInfos = pd.DataFrame(
    {
        'Client Name': name,  # keep the key so the merge below can align on it
        'requests': len(group),
        'ppt_req': len(group) / group['CID'].nunique(),
        'approval': (((group['verify']=='Yes').sum()) / len(group)) * 100,
        'denial': (((group['verify']=='No').sum()) / len(group)) * 100,
        'male': (((group['gender']=='Male').sum()) / len(group)) * 100,
        'female': (((group['gender ']=='female').sum()) / len(group)) * 100
    } for name, group in data.groupby('Client Name')
)

# Extend `client` with the precomputed information in `clientDataInfos`.
# The extended columns should not already appear in `client`.
client = client.merge(clientDataInfos, on='Client Name')
I'd like to know the best way to get distances from the Google Maps distance API for my dataframe of coordinates (origin & destination), which is around 75k rows.
#Origin #Destination
1 (40.7127837, -74.0059413) (34.0522342, -118.2436849)
2 (41.8781136, -87.6297982) (29.7604267, -95.3698028)
3 (39.9525839, -75.1652215) (40.7127837, -74.0059413)
4 (41.8781136, -87.6297982) (34.0522342, -118.2436849)
5 (29.7604267, -95.3698028) (39.9525839, -75.1652215)
So far my code iterates through the dataframe and calls the API copying the distance value into the new "distance" column.
df['distance'] = ""
for index, row in df.iterrows():
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":  # Handle "no result" exception
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
        df['distance'].iloc[index] = KM
    else:
        df['distance'].iloc[index] = 0
df.to_csv('distance.csv')
I get the desired result, but from what I've read iterating through a dataframe is rather inefficient and should be avoided. It took 20 seconds for 240 rows, so it would take about 1h30 to process the whole dataframe. Note that once this is done there is no need to re-run it all; only a few new rows (~500) will be added each month.
What would be the best solution here?
Edit: if anybody has experience with the Google distance API and its limitations, any tips/best practices are welcome.
I tried to find out whether there are any limitations on concurrent calls here, but I couldn't find anything. A few suggestions:
Avoid loops
Regarding your code, I'd first skip the for loop and use apply:
def get_gmaps_distance(row):
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
    else:
        KM = 0
    return KM

df["distance"] = df.apply(get_gmaps_distance, axis=1)
Split your dataframe and use multiprocessing
import multiprocessing as mp
import numpy as np
import pandas as pd

def parallelize(fun, vec, cores=mp.cpu_count()):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

# split your dataframe into as many chunks as there are cores
df = np.array_split(df, mp.cpu_count())

# this applies your function to every chunk
def parallel_distance(x):
    x["distance"] = x.apply(get_gmaps_distance, axis=1)
    return x

df = parallelize(parallel_distance, df)
df = pd.concat(df, ignore_index=True, sort=False)
Do not calculate the same distance twice (save $$$)
In case you have duplicate rows, you should drop the duplicates before calling the API:
grp = df.drop_duplicates(["origin", "destination"]).reset_index(drop=True)
Here I didn't overwrite df, as it possibly contains more information you need, and you can merge the results back into it.
grp["distance"] = grp.apply(get_gmaps_distance, axis=1)
df = pd.merge(df, grp, how="left")
Reduce decimals
You should ask yourself this question: do I really need to be accurate to the 7th decimal place? As 1 degree of latitude is ~111 km, the 7th decimal place gives you a precision of ~1 cm. You get the idea from this when-less-is-more example, where reducing the decimals improved the model.
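As a minimal sketch of that idea, assuming origin and destination hold (lat, lon) tuples of floats as in the sample above (round_coords is a hypothetical helper, and 5 decimals is just an illustrative choice), you could round the coordinates before dropping duplicates so that near-identical points collapse into a single API call:

# Round coordinates to 5 decimal places (~1 m precision) so that nearly identical
# origin/destination pairs become exact duplicates and are only queried once.
def round_coords(point, ndigits=5):
    lat, lon = point
    return (round(lat, ndigits), round(lon, ndigits))

df["origin"] = df["origin"].apply(round_coords)
df["destination"] = df["destination"].apply(round_coords)
# ...then drop_duplicates(["origin", "destination"]) as shown above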
Conclusion
If you can eventually use all the suggested methods, you could get some interesting improvements. I'd like you to report them here, as I don't have a personal API key to try them myself.
TL;DR: We're having issues parallelizing Pandas code with Dask that reads from and writes to the same HDF file.
I'm working on a project that generally requires three steps: reading, translating (or combining data), and writing these data. For context, we're working with medical records, where we receive claims in different formats, translate them into a standardized format, then re-write them to disk. Ideally, I'm hoping to save intermediate datasets in some form that I can access via Python/Pandas later.
Currently, I've chosen HDF as my data storage format; however, I'm having trouble with runtime. On a large population, my code can currently take upwards of a few days to run. This has led me to investigate Dask, but I'm not positive I've applied Dask in the best way to my situation.
What follows is a working example of my workflow, hopefully with enough sample data to get a sense of runtime issues.
Read (in this case Create) data
import pandas as pd
import numpy as np
import dask
from dask import delayed
from dask import dataframe as dd
import random
from datetime import timedelta
from pandas.io.pytables import HDFStore
member_id = range(1, 10000)
window_start_date = pd.to_datetime('2015-01-01')
start_date_col = [window_start_date + timedelta(days=random.randint(0, 730)) for i in member_id]
# Eligibility records
eligibility = pd.DataFrame({'member_id': member_id,
                            'start_date': start_date_col})
eligibility['end_date'] = eligibility['start_date'] + timedelta(days=365)
eligibility['insurance_type'] = np.random.choice(['HMO', 'PPO'], len(member_id), p=[0.4, 0.6])
eligibility['gender'] = np.random.choice(['F', 'M'], len(member_id), p=[0.6, 0.4])
(eligibility.set_index('member_id')
            .to_hdf('test_data.h5',
                    key='eligibility',
                    format='table'))

# Inpatient records
inpatient_record_number = range(1, 20000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in inpatient_record_number]
inpatient = pd.DataFrame({'inpatient_record_number': inpatient_record_number,
                          'service_date': service_date})
inpatient['member_id'] = np.random.choice(list(range(1, 10000)), len(inpatient_record_number))
inpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(inpatient_record_number))
(inpatient.set_index('member_id')
          .to_hdf('test_data.h5',
                  key='inpatient',
                  format='table'))

# Outpatient records
outpatient_record_number = range(1, 30000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in outpatient_record_number]
outpatient = pd.DataFrame({'outpatient_record_number': outpatient_record_number,
                           'service_date': service_date})
outpatient['member_id'] = np.random.choice(range(1, 10000), len(outpatient_record_number))
outpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(outpatient_record_number))
(outpatient.set_index('member_id')
           .to_hdf('test_data.h5',
                   key='outpatient',
                   format='table'))
Translate/Write data
Sequential approach
def pull_member_data(member_i):
    inpatient_slice = pd.read_hdf('test_data.h5', 'inpatient', where='index == "{}"'.format(member_i))
    outpatient_slice = pd.read_hdf('test_data.h5', 'outpatient', where='index == "{}"'.format(member_i))
    return inpatient_slice, outpatient_slice

def create_visits(inpatient_slice, outpatient_slice):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER into medical 'visits'
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    visits_stacked = pd.concat([inpatient_slice, outpatient_slice]).reset_index().sort_values('service_date')
    visits_stacked.insert(0, 'visit_id', range(1, len(visits_stacked) + 1))
    return visits_stacked

def save_visits_to_hdf(visits_slice):
    with HDFStore('test_data.h5', mode='a') as store:
        store.append('visits', visits_slice)

# Read in the data by member_id, perform some operation
def translate_by_member(member_i):
    inpatient_slice, outpatient_slice = pull_member_data(member_i)
    visits_slice = create_visits(inpatient_slice, outpatient_slice)
    save_visits_to_hdf(visits_slice)

def run_translate_sequential():
    # Simple approach: Loop through each member sequentially
    for member_i in member_id:
        translate_by_member(member_i)

run_translate_sequential()
The above code takes ~9 minutes to run on my machine.
Dask approach
def create_visits_dask_version(visits_stacked):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    len_of_visits = visits_stacked.shape[0]
    visits_stacked_1 = (visits_stacked
                        .sort_values('service_date')
                        .assign(visit_id=range(1, len_of_visits + 1))
                        .set_index('visit_id')
                        )
    return visits_stacked_1

def run_translate_dask():
    # Approach 2: Dask, with individual writes to HDF
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    visits.to_hdf('test_data_dask.h5', 'visits')

run_translate_dask()
This Dask approach takes 13 seconds(!)
While this is a great improvement, we're generally curious about a few things:
Given this simple example, is the approach of using Dask dataframes, concatenating them, and using groupby/apply the best approach?
In reality, we have multiple processes like this that read from and write to the same HDF. Our original codebase was structured so that the entire workflow could be run one member_id at a time. When we tried to parallelize those runs, it sometimes worked on small samples, but most of the time produced a segmentation fault. Are there known issues with parallelizing workflows like this that read from and write to HDF files? We're working on producing an example of this as well, but figured we'd post this here in case it triggers suggestions (or in case this code helps someone facing a similar problem).
Any and all feedback appreciated!
In general, groupby-apply will be fairly slow; it is challenging to re-sort data like this, especially with limited memory.
In general, I recommend using the Parquet format (dask.dataframe has to_parquet and read_parquet functions). You are much less likely to get segfaults than with HDF files.
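As a rough sketch of what that swap might look like in the example above (the Parquet paths here are illustrative, and a Parquet engine such as pyarrow or fastparquet is assumed to be installed), the workflow from run_translate_dask() could read and write Parquet instead of HDF:

import dask.dataframe as dd

# One-time conversion of the existing HDF tables to Parquet datasets (illustrative paths)
dd.read_hdf('test_data.h5', 'inpatient').to_parquet('inpatient.parquet')
dd.read_hdf('test_data.h5', 'outpatient').to_parquet('outpatient.parquet')

# Same workflow as run_translate_dask(), but reading from and writing to Parquet
inpatient_dask = dd.read_parquet('inpatient.parquet')
outpatient_dask = dd.read_parquet('outpatient.parquet')
stacked = dd.concat([inpatient_dask, outpatient_dask])
visits = stacked.groupby('member_id').apply(create_visits_dask_version)
visits.to_parquet('visits.parquet')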
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that the sum(...) aggregation looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1, 2, 3, 1, 2, 3, 1, 2, 3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1, 5, 9)  # random quantity is ok for this test
    frame['catid'] = np.random.randint(1, 5, 9)  # random category is ok for this test
    return frame

def pivotSegment(segmentNumber, passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1, 5+1)
    span['shipmentid'] = 1
    span['qty'] = 0
    frame = frame.append(span)
    return frame.pivot_table(['qty'], index=['shipmentid'], columns='catid',
                             aggfunc='sum', fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin, segMax):
    segment = pivotSegment(i, frame)
    store.append('data', frame[(i*3):(i*3 + 3)])
    store.append('pivotedData', segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData', chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
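For instance, a minimal sketch of such a query against the store above, assuming the grouping column was made queryable when appending (e.g. store.append('df', chunk, data_columns=['shipmentid'])) or is part of the index:

# Read only the rows for one shipment instead of the whole table
subset = store.select('df', where='shipmentid == 1')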
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.
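A rough sketch of that last idea, assuming the stored table holds numeric columns plus a grouping column named 'someKey' as in the question (the two-pass structure is the point here, not the exact names):

import pandas as pd

# First pass: collect every group key and the remaining column names across all chunks
keys = set()
cols = None
for df in store.select('df', chunksize=50000):
    keys.update(df['someKey'].unique())
    cols = df.columns.drop('someKey')

# Second pass: start from a zero-filled result of the correct shape and add each
# chunk's per-group sums; reindexing each partial result means `res` never has to grow
res = pd.DataFrame(0.0, index=sorted(keys), columns=cols)
for df in store.select('df', chunksize=50000):
    part = df.groupby('someKey').sum()
    res += part.reindex(index=res.index, columns=res.columns, fill_value=0)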