To summarize: how can I perform groupby operations in parallel, limited to a controlled number of groups at a time, while writing the result of each group's apply function to disk?
My problem: I'm trying to create a supervised structure for regression models from information on a lot of clients, separated into years. From the same clients I have to build different models, with different inputs X and labels Y, so my idea is to create a single X and Y dataframe holding all variables at once, and slice each one according to the task. For example, X could hold information on salary, age or sex, but model 1 would use only age and sex, while model 2 would use only salary.
As clients are not present every year, I can only use clients that are present from one period to the next one.
Instead of selecting the intersection of clients for each pair of contiguous years, I'm trying to concatenate the whole dataset and perform groupby operations by client ID (and then filter by year sequence, for example keeping the rows where the difference between periods is 1). The problem with using Dask for this task is that the distributed workers run low on memory (even after increasing the limit to 30 GB each). Note that for each group I'm creating a new dataframe, so I'm not reducing each group to a single number, hence the memory-intensive operation.
What I'm currently doing is performing a groupby operation, then iterating over the groupby object and writing to disk sequentially, for example:
x_file = open('X.csv', 'w')
for name, group in concatenated_data.groupby('ID'):
    data_x = my_func(group)  # In my real code, my_func returns x and y dataframes
    data_x.to_csv(x_file, header=None)
x_file.close()
which writes the data sequentially, applying my_func, which selects the x and y for each group.
What I want is to perform the operation for a controlled number of groups (let's say 3 at a time), writing the result of each group to disk (maybe with data_x.to_csv(x_file, single_file=True)).
Of course I can do the same with a dask dataframe and iterate over the groupby object using get_group(), but I don't believe it will run in parallel while also keeping the memory in check.
EDIT: Example
# Let's say I have 3 csv files:
data = ['./data_2016', './data_2017', './data_2018']  # Each file contains millions of rows (1 per client ID) and around 85 columns
# and certain variables
x_vars = ['x1', 'x2', 'x3']  # x variables
y_vars = ['y1', 'y2', 'x1']  # note that some variables can appear in both x and y (like using today's salary to predict tomorrow's salary)
data = [pd.read_csv(x) for x in data]
def func1(df_):
    # do some preprocessing stuff
    return df_
data = map(func1, data)  # Some preprocessing and adding some columns (for example a column for the year)
concatenated_data = pd.concat(data, axis=0)  # Big dataframe, all clients from 2016-2018 stacked as rows
def my_func(df_):  # function applied above
    df_ = df_.sort_values('year')  # order by year
    df_['Diff'] = df_.year.diff()  # calculating the difference between years
    df_['shifted'] = df_.Diff.shift(-1)  # calculate the shift of the difference
    # For example, *client z* may be present in 2016 and 2018, thus his year difference is 2.
    # I can't use *client z*'s x_vars to predict y (only a single-period-ahead regression)
    x = df_.loc[df_['shifted'] == 1, x_vars]  # select only contiguous years
    y = df_.loc[df_['Diff'] == 1, y_vars]  # the same, but a year ahead of x
    return (x, y)
# ... Iteration over groupby object
Instead of using groupby() to reduce, I'm expanding the single big dataframe into x and y dataframes, in which y holds information one period ahead of x.
As you can see, using a dask dataframe groupby (omitted for simplicity) would parallelize the my_func operation, but as I understand it, it would also wait until all nodes' operations are completed, thus depleting my memory. What I would like is to perform my_func for a certain number of groups (ideally as many as memory can hold), finish them, save them to disk (without problems related to parallel saving), and finally proceed to the next batch of groups.
Maybe I can use some dask delayed objects, but I don't think it will make good use of my memory if I set the batches manually.
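To make the batching idea concrete, something like the following sketch is roughly what I picture (not tested; the run_group helper, the batch size of 3, and the use of concurrent.futures instead of Dask are just placeholders on top of the my_func and concatenated_data defined above):

from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def run_group(pair):
    # pair is a (name, group) tuple coming from the groupby iterator
    _, group = pair
    return my_func(group)

batch_size = 3  # number of groups processed concurrently; tune to fit memory
group_iter = iter(concatenated_data.groupby('ID'))

with open('X.csv', 'w') as x_file, open('Y.csv', 'w') as y_file, \
     ProcessPoolExecutor(max_workers=batch_size) as pool:
    while True:
        batch = list(islice(group_iter, batch_size))
        if not batch:
            break
        # only batch_size groups (plus their x/y results) are in flight at any moment
        for x, y in pool.map(run_group, batch):
            x.to_csv(x_file, header=False)
            y.to_csv(y_file, header=False)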
I'm not sure if this is what you are looking for.
Generate data
import pandas as pd
import numpy as np
import dask.dataframe as dd
import os

n = 200
df = pd.DataFrame({"grp": np.random.choice(list("abcd"), n),
                   "x": np.random.randn(n),
                   "y": np.random.randn(n),
                   "z": np.random.randn(n)})
df.to_csv("file.csv", index=False)
# we will need this later on
df.to_parquet("file.parquet", index=False)
Pandas solution
# we save our files in a given folder
fldr = "output1"
os.makedirs(fldr, exist_ok=True)

# we read only the columns we need
cols2read = ["grp", "x", "y"]
df = pd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr))
Dask solution
This is basically the same, but we read with Dask, add meta to our apply, and call compute at the end.
# we save our files in a given folder
fldr = "output2"
os.makedirs(fldr, exist_ok=True)

# we read only the columns we need
cols2read = ["grp", "x", "y"]
df = dd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr), meta='f8')\
  .compute()
Working with parquet
Here I suggest you work with parquet, as it's going to be way more efficient.
cols2read = ["grp", "x", "y"]
df = dd.read_parquet("file.parquet",
                     columns=cols2read)
df.to_parquet("output3/",
              partition_on="grp")
Inside output3 you can find several folders called grp=a and so on. Each of them could eventually contain several files, but you can read all of them with pd.read_parquet("output3/grp=a").
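If you then need to run a per-group function (like my_func from the question) over those partitions, a minimal sketch could look like the following; processing one partition folder at a time is an assumption to keep memory bounded, and the output file names are placeholders:

import glob
import pandas as pd

for part in sorted(glob.glob("output3/grp=*")):
    # each partition folder holds exactly one group's rows
    group = pd.read_parquet(part)
    x, y = my_func(group)  # my_func as sketched in the question
    x.to_csv("X.csv", mode="a", header=False, index=False)
    y.to_csv("Y.csv", mode="a", header=False, index=False)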
Related
I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get a DF that has columns <Feature1, Feature2, ..., FeatureN, Label> --> I will call this derived DF ML_input DF moving forward. This DF is used to train an ML model.
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works, but I want to speed this up by parallelizing the work on each row and accumulating the results. I would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated.
Current code is below.
Note: I have looked into using df.apply(), but two issues seem to be:
apply in itself does not seem to parallelize things.
I don't know how to make apply handle this one-row-converted-to-multiple-rows issue (any pointers here will also help).
Current code
def get_training_dataframe(dfin):
    X = []
    for index, row in dfin.iterrows():
        ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
        for ts, frame in ts_frame_dict.items():
            features = get_features(frame)
            if features is not None:
                X += [features]
    return pd.DataFrame(X, columns=FEATURE_NAMES)
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
from itertools import chain

import pandas as pd

def get_training_dataframe(dfin):
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    frames = chain(*(d.values() for d in frame_dicts))
    features = map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)
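For the line profiler suggestion, a typical workflow (hypothetical script and placeholder names, not part of your code) is to decorate the function you want to measure and run the script through kernprof:

# profile_feats.py -- hypothetical script containing the function above
# pip install line_profiler
# run with:  kernprof -l -v profile_feats.py
# kernprof injects the @profile decorator at runtime, so no import is needed for it

@profile                      # added only while profiling
def get_training_dataframe(dfin):
    ...                       # same body as the version above

if __name__ == "__main__":
    df_sample = load_some_rows()      # placeholder: load a representative sample
    get_training_dataframe(df_sample)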
I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of same type per group. I want to combine all 50 datasets in each group into a single dataset. In other words, instead of 2 x 50 datasets I want 2x1 dataset. How can I accomplish this? The file is 18.4 Gb in size. I am a novice at working with large datasets. I am working in python with h5py.
Thanks!
Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is all of your datasets are in 1 HDF5 file.
You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I like this for 2 reasons: 1) the code is simpler/easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).
I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.
Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
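A minimal sketch of such a consistency check (assuming the file and group layout created in the example below; the function name is my own) might look like:

import h5py

def check_consistent(h5f, grp):
    # collect (shape, dtype) for every dataset in the group and make sure they all match
    shapes_dtypes = {(h5f[grp][ds].shape, h5f[grp][ds].dtype) for ds in h5f[grp].keys()}
    if len(shapes_dtypes) > 1:
        raise ValueError(f"Group {grp!r} mixes shapes/dtypes: {shapes_dtypes}")

with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1:
    for grp in h5f1.keys():
        check_consistent(h5f1, grp)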
Code to create the Example data (2 groups, 5 datasets each):
import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5', 'w') as h5f1:
    a0, a1, a2, a3 = 100, 20, 20, 10
    grp1 = h5f1.create_group('group1')
    for ds in range(1, 6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0, a1, a2, a3)
        grp1.create_dataset(f'dset_{ds:02d}', data=arr)
    grp2 = h5f1.create_group('group2')
    for ds in range(1, 6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0, a1, a2, a3)
        grp2.create_dataset(f'dset_{ds:02d}', data=arr)
Code to merge the data (2 groups, 1 5D dataset each -- my preference):
with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5', 'w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i, ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                                                    shape=(ds_shape + (ds_cnt,)), maxshape=(ds_shape + (None,)))
            # Now add data to the merged dataset
            merge_ds[:, :, :, :, i] = h5f1[grp][ds]
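To pull one of the original 4D arrays back out of the 5D merged dataset, you just slice on the last axis; a small sketch using the file created above (the index assumes datasets were copied in name order, as in the loop above):

import h5py

with h5py.File('SO_69937402_2x1_5d.h5', 'r') as h5f:
    # the 3rd original dataset of group1 sits at index 2 on axis=4
    arr = h5f['group1']['merged_ds'][:, :, :, :, 2]
    print(arr.shape)  # (100, 20, 20, 10) with the example data above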
Code to merge the data (2 groups, 1 4D dataset each):
with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5', 'w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # if dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                                                    maxshape=[ds_shape[0], ds_shape[1], ds_shape[2], None])
            else:
                # otherwise, resize the merged dataset to hold new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3] + ds2_shape[3], axis=3)
                merge_ds[:, :, :, ds2_shape[3]:ds2_shape[3] + ds1_shape[3]] = h5f1[grp][ds]
I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I am basing my interpolation results. The table df contains the new data, which needs the developed factors interpolated based on hf.
Right now my workaround is to first filter each table based on the Ded_amount percentage, then perform the interpolation into an empty data frame and append after each loop.
I feel like this is inefficient and there is a better way to do it; I'm looking for feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate

known_data = {'AOI': [80000, 100000, 150000, 200000, 300000, 80000, 100000, 150000, 200000, 300000],
              'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%'],
              'factor': [0.797, 0.774, 0.739, 0.733, 0.719, 0.745, 0.737, 0.715, 0.711, 0.709]}
new_data = {'AOI': [85000, 120000, 130000, 250000, 310000, 85000, 120000, 130000, 250000, 310000],
            'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%']}

hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)

deduct_fact = pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table = hf[hf['Ded_amount'] == deduct]
    aoi_table = df[df['Ded_amount'] == deduct]
    x = deduct_table['AOI']
    y = deduct_table['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    xnew = aoi_table[['AOI']]
    ynew = f(xnew)
    append_frame = aoi_table
    append_frame['Factor'] = ynew
    deduct_fact = deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data = {'AOI': [80000, 100000, 150000, 200000, 300000, 80000, 100000, 150000, 200000, 300000],
              'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%'],
              'factor': [0.797, 0.774, 0.739, 0.733, 0.719, 0.745, 0.737, 0.715, 0.711, 0.709]}
new_data = {'AOI': [85000, 120000, 130000, 250000, 310000, 85000, 120000, 130000, 250000, 310000],
            'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%']}

hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)

# Create this column now
df['Factor'] = None

# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())

for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount'] == deduction_amount]['AOI'], hf[hf['Ded_amount'] == deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount'] == deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
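As a side note, a boolean-mask assignment would avoid the row-by-row apply entirely; a small sketch of that variant (reusing the hf and df built above, not part of the original answer):

from scipy import interpolate  # hf and df are the dataframes built in the block above

for deduction_amount in hf['Ded_amount'].unique():
    known = hf[hf['Ded_amount'] == deduction_amount]
    f = interpolate.interp1d(known['AOI'], known['factor'], fill_value="extrapolate")
    mask = df['Ded_amount'] == deduction_amount
    # interpolate all matching rows at once instead of calling f row by row inside apply
    df.loc[mask, 'Factor'] = f(df.loc[mask, 'AOI'])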
So I know you're never supposed to iterate over a Pandas DataFrame, but I can't find another way around this problem.
I have a bunch of different time series, say they're end-of-day stock prices. They're in a DataFrame like this:
Ticker Price
0 AAA 10
1 AAA 11
2 AAA 10.5
3 BBB 100
4 BBB 110
5 CCC 60
etc.
For each Ticker, I want to take a variety of models and train them on successively larger batches of data. Specifically, I want to take a model, train it on day 1's data, and predict day 2. Then train the same model on days 1 and 2 and predict day 3, and so on. For each day, I want to slice up to the day before and predict on that subset [day0:dayN-1].
Essentially I'm implementing sklearn's TimeSeriesSplit, except I'm doing it myself because the models I'm training aren't in sklearn (for example, one model is Prophet).
The idea is I try a bunch of models on a bunch of different Tickers, then I see which models work well for which Tickers.
So my basic code for running one model on all my data looks like:
import pandas as pd

def make_predictions(df):
    res = pd.DataFrame()
    for ticker in df.ticker.unique():
        df_ticker = df[df['ticker'] == ticker]
        for i, _ in df_ticker.iterrows():
            X = df_ticker[0:i]
            X = do_preparations(X)          # do some processing to prepare the data
            m = train_model(X)              # train the model
            forecast = make_predictions(m)  # predict one week
            df_ticker.loc[i, 'preds'] = forecast['y'][0]
        res = pd.concat([res, df_ticker])
    return res
But my code runs super slow. Can I speed this up somehow?
I can't figure out how I would use .apply() or any of the other common anti-iterating techniques.
Consider several items:
First, avoid the quadratic copying that comes from calling pd.concat inside a loop. Instead, build a list/dict of data frames to be concatenated once outside the loop.
Second, avoid DataFrame.iterrows since you only use i. Instead, traverse the index.
Third, for compactness, avoid unique() with a subsequent subset [...]. Instead, use groupby() in a dictionary or list comprehension, which may be slightly faster than the list.append approach; because of your multiple steps, an inner defined function is needed.
The inner loop may be unavoidable, as you are really running different models.
def make_predictions(df):
    def proc_model(sub_df):
        for i in sub_df.index:
            X = sub_df.loc[0:i]
            X = do_preparations(X)          # do some processing to prepare the data
            m = train_model(X)              # train the model
            forecast = make_predictions(m)  # predict one week
            sub_df.loc[i, 'preds'] = forecast['y'][0]
        return sub_df

    # BUILD DICTIONARY OF DATA FRAMES
    df_dict = {i: proc_model(g) for i, g in df.groupby('ticker')}

    # CONCATENATE DATA FRAMES
    res = pd.concat(df_dict, ignore_index=True)
    return res
I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is why my column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time due to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6

df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity)) for mz, intensity in i.scans]),
                        dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices

while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time()-t, time.time()-t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t, time.time()-t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means writing it to disk to some other format first/etc.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so a SparseDataFrame I think is the best option since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
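For anyone curious what "scipy sparse matrices and handling the indexing on my own" might look like, here is a minimal sketch of one way to do it (a column-by-column COO build converted to CSC; the series_list/all_indices names and the bookkeeping dictionaries are my own illustration, not the original code):

import numpy as np
from scipy import sparse

# assume `series_list` is the list of pandas Series built earlier (one per column)
# and `all_indices` is the union of their indices
row_pos = {idx: n for n, idx in enumerate(sorted(all_indices))}  # index label -> row number

rows, cols, vals, col_names = [], [], [], []
for col_num, s in enumerate(series_list):
    col_names.append(s.name)
    rows.extend(row_pos[idx] for idx in s.index)
    cols.extend([col_num] * len(s))
    vals.extend(s.values)

# CSC gives fast column slicing; row/column lookups go through row_pos and col_names
mat = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(len(row_pos), len(series_list))).tocsc()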