DASK: Is there an equivalent of numpy.select for dask? - python

I'm using Dask to load an 11-million-row CSV into a dataframe and perform calculations. I've reached a point where I need conditional logic: if this, then that, else other.
If I were using pandas, for example, I could do the following, where np.select is used with a list of conditions and a matching list of results. This statement takes about 35 seconds to run - not bad, but not great:
df["AndHeathSolRadFact"] = np.select(
[
(df['Month'].between(8,12)),
(df['Month'].between(1,2) & df['CloudCover']>30) #Array of CONDITIONS
], #list of conditions
[1, 1], #Array of RESULTS (must match conditions)
default=0) #DEFAULT if no match
What I am hoping to do is use dask to do this, natively, in a dask dataframe, without having to first convert my dask dataframe to a pandas dataframe, and then back again.
This allows me to:
- Use multithreading
- Use a dataframe that is larger than available ram
- Potentially speed up the result.
Sample CSV
Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
Full Code for minimum viable sample
import dask.dataframe as dd   # Dask dataframes implement the pandas API
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
from timeit import default_timer as timer

start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')

# Pull the data into pandas, then convert back to a Dask dataframe
# because we want that juicy parallelism
df = ddf.compute()
ddf2 = dd.from_pandas(df, npartitions=4)
del df

print(ddf2.head())
#print(ddf.tail())

end = timer()
print(end - start)

# Clean up remaining dataframes
del ddf2

So, the most performant answer I was able to come up with was:
#Create a helper column where we store the value we want to set the column to later.
ddf['Helper'] = 1
#Create the column where we will be setting values, and give it a default value
ddf['AndHeathSolRadFact'] = 0
#Break the logic out into separate where clauses. Rather than looping, we select the rows
#where the conditions are met and then set the value we want. We have to use the helper
#column because we cannot set values directly, but we can match from another column.
#First, a very simple clause. If Temperature is greater than or equal to 8, make
#AndHeathSolRadFact equal to the value in Helper
#Note that at the end, after the comma, we preserve the existing cell value if the condition is not met
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(ddf.Temperature >= 8, ddf.AndHeathSolRadFact)
#A more complex example
#this is the same as the above, but demonstrates how to use a compound select statement where
#we evaluate multiple conditions and then set the value.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(((ddf.Temperature == 6.8) & (ddf.RH == 99.3)), ddf.AndHeathSolRadFact)
I'm a newbie at this, but I'm assuming this approach counts as being vectorised. It makes full use of the array and evaluates very quickly.
Adding the new column, filling it with 0, evaluating both select statements and replacing the values in the target rows only added 0.2s to the processing time on an 11m row dataset with npartitions = 4.
Similar approaches in pandas took around 45 seconds.
The only thing left to do is to remove the helper column once we're done. Currently, I'm not sure how to do this.
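A hedged note, not from the original post: Dask's drop appears to mirror the pandas API here, so removing the helper column should be as simple as the line below (assuming a reasonably recent Dask version).
# Remove the temporary Helper column once AndHeathSolRadFact has been set
ddf = ddf.drop(columns=['Helper'])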

It sounds like you're looking for dd.Series.where.
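A related option that isn't in the answers above: keep the np.select logic as-is and apply it per partition with map_partitions. A minimal hedged sketch, assuming ddf is the Dask DataFrame read from Weathergrids.csv and that it has the Month and CloudCover columns used in the question's pandas example:
import numpy as np
import dask.dataframe as dd

def add_solrad_fact(pdf):
    # pdf is an ordinary pandas DataFrame (a single partition)
    pdf = pdf.copy()
    pdf['AndHeathSolRadFact'] = np.select(
        [
            pdf['Month'].between(8, 12),
            pdf['Month'].between(1, 2) & (pdf['CloudCover'] > 30),
        ],
        [1, 1],
        default=0,
    )
    return pdf

# Dask infers the output schema by running the function on an empty partition
ddf = ddf.map_partitions(add_solrad_fact)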

Related

Filter Dask DataFrame rows by specific values of index

Is there an effective way to select specific rows in a Dask DataFrame?
I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me).
Are there any more effective solutions than
ddf.loc[ddf.index.isin(list_of_index_values)]
ddf.loc[~ddf.index.isin(list_of_index_values)]
?
You can use the query method. You haven't provided a usable example, but the format would be something like this:
list_of_index_values = [6, 3]
ddf.query('column in @list_of_index_values')
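A tiny self-contained illustration of the pattern (pandas syntax; local variables are referenced with @ inside the query string). Whether the @ lookup resolves the same way on a Dask DataFrame is an assumption on my part; if it doesn't, the list can be passed explicitly, for example via the local_dict keyword that pandas' query accepts.
import pandas as pd

df = pd.DataFrame({'column': [1, 3, 5, 6, 8]})
list_of_index_values = [6, 3]

# Keep only the rows whose 'column' value is in the list
print(df.query('column in @list_of_index_values'))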
EDIT: Just for fun. I did this in pandas but I wouldn't expect much variance.
No clue what's stored in the index, but I assumed int.
from random import randint
from datetime import datetime as dt
import pandas as pd

# build huge random dataset
lst = []
for i in range(100000000):
    lst.append(randint(0, 100000))

# build huge random index
index = []
for i in range(1000000):
    index.append(randint(0, 100000))

df = pd.DataFrame(lst, columns=['values'])

isin = dt.now()
df[df['values'].isin(index)]
print(f'total execution time for isin {dt.now()-isin}')

query = dt.now()
df.query('values in @index')
print(f'total execution time for query {dt.now()-query}')
# total execution time for isin 0:01:22.914507
# total execution time for query 0:01:13.794499
If your index is sequential, however:
time = dt.now()
df[df['values']>100000]
print(dt.now()-time)
# 0:00:00.128209
It's not even close. You can even build out a range
time = dt.now()
df[(df['values']>100000) | (df['values'] < 500)]
print(dt.now()-time)
# 0:00:00.650321
Obviously the third method isn't always an option, but it's something to keep in mind if speed is a priority and you just need indices between two values or some such.
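Since the original question is about Dask, a hedged aside not in the answer above: if the Dask index is sorted and the divisions are known, label slicing with .loc is the cheap way to grab an index range, because only the relevant partitions are touched. A minimal sketch:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'idx': range(1_000_000), 'value': 1}).set_index('idx')
ddf = dd.from_pandas(pdf, npartitions=8)   # divisions are known for a sorted index

# Only the partitions covering this index range are read
subset = ddf.loc[100_000:500_000]
print(len(subset))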

Create new dataframe with condition per groupby in pandas

I am trying to create a new dataframe based on a condition per groupby.
Suppose I have a dataframe with Name, Flag and Month columns.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find the minimum Month where Flag equals 0, per Name.
I have used .loc to define my condition. It works fine, but the performance is quite poor when applied to 10 million rows.
Is there a more efficient way to do this?
Thank you!
Just had this same scenario yesterday, where I took a 90-second process down to about 3 seconds. Because speed is your concern (like mine was), and you aren't tied to pandas alone, I would recommend using Numba and NumPy. The catch is that you're going to have to brush up on your data structures and types to get a good grasp of what Numba is really doing with JIT. Once you do though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your Name column, try unique IDs. Flag and Month already look good.
name_ids = []
for i, name in enumerate(np.unique(df["Name"])):
    name_ids.append({i: name})
Then, create a function and loop the old-fashioned way:
from numba import njit

@njit
def really_fast_numba_loop(data):
    for row in data:
        # do stuff with each row here
        pass
    return data

new_df = really_fast_numba_loop(data)
The first time your function is called in your file, it will be about the same speed as it would be elsewhere, but every call after that will be lightning fast. So the trick is finding the balance between what to put in the function and what to put in the loop outside it.
In either case, when you're done processing your values, convert your name_ids back to strings and wrap your data in pd.DataFrame.
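Putting those pieces together for the question's actual task (minimum Month where Flag is 0, per Name): a minimal hedged sketch. The helper name min_month_per_name and the use of pd.factorize for the integer IDs are my own choices, not from the answer above.
import numpy as np
import pandas as pd
from numba import njit

data = {'Name': ['A', 'A', 'B', 'B'], 'Flag': [0, 1, 0, 1], 'Month': [1, 2, 1, 2]}
df = pd.DataFrame(data)

# Encode names as integer codes so Numba only ever sees numeric arrays
codes, uniques = pd.factorize(df['Name'])

@njit
def min_month_per_name(codes, flags, months, n_names):
    out = np.full(n_names, 13, dtype=np.int64)   # 13 = sentinel above any real month
    for i in range(codes.shape[0]):
        if flags[i] == 0 and months[i] < out[codes[i]]:
            out[codes[i]] = months[i]
    return out

mins = min_month_per_name(codes, df['Flag'].to_numpy(),
                          df['Month'].to_numpy(), len(uniques))
# Map the integer IDs back to the original names
result = pd.DataFrame({'Name': uniques, 'Month': mins})
print(result)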
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!

Alternative method for two way interpolation

I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount (%). I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I base my interpolation results. Table df contains the new data whose factors need to be interpolated based on hf.
Right now my workaround is to first filter each table based on the Ded_amount percentage, then perform the interpolation into an empty data frame, appending after each loop iteration.
I feel like this is inefficient and there is a better way to do it; I'm looking for feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate

known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)

deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and appending them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)

# Create this column now
df['Factor'] = None

# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())

for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolated value for the AOI column of df (this is what you called xnew) if the deduction amount matches; otherwise it just returns the existing value unchanged.
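Not part of the original answer, but for comparison: the row-wise apply can be avoided entirely by interpolating each deduction group in one shot with np.interp. A hedged sketch, reusing hf and df from the snippet above (note that np.interp clamps at the edges instead of extrapolating, unlike fill_value="extrapolate"):
import numpy as np
import pandas as pd

pieces = []
for ded, new_grp in df.groupby('Ded_amount'):
    # Known points for this deduction level, sorted so np.interp gets increasing x
    known_grp = hf[hf['Ded_amount'] == ded].sort_values('AOI')
    new_grp = new_grp.copy()
    # Interpolate the whole group at once instead of row by row
    new_grp['Factor'] = np.interp(new_grp['AOI'], known_grp['AOI'], known_grp['factor'])
    pieces.append(new_grp)

result = pd.concat(pieces)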

Iterating through start, finish and class values in Python

I have a little script that creates a new column in my pandas dataset called class, and assigns class values for a given time range. It works well, but suddenly I have thousands of time ranges to input, and wondered if it might be possible to write some kind of loop which gets the three columns (start, finish, and class) from a pandas dataframe.
To complicate things, the time ranges in dataframe 1 are of irregular length (e.g. a nanosecond, 30 seconds, 4 minutes), while in dataframe 2 (which contains the accelerometer data) the time series increases in increments of 0.010 seconds. Any help appreciated, as I'm new to Python.
conditions = [
    (X['DATETIME'] < '2017-11-17 07:31:07') & (X['DATETIME'] >= '2017-11-17 00:00:00'),
    (X['DATETIME'] < '2017-11-17 07:32:35') & (X['DATETIME'] >= '2017-11-17 07:31:07'),
    (X['DATETIME'] < '2017-11-17 09:01:05') & (X['DATETIME'] >= '2017-11-17 08:58:39')
]
classes = ['0', '1', '2']
X['CLASS'] = np.select(conditions, classes, default='5')
There are many possible solutions to this; you could use for loops as you said, etc. But since you are new to Python, I think this answer will show you more of the power of Python and its great packages. I will use the numpy package here, and I suppose that your first table is in a pandas data frame named X, while the second is in one named conditions.
import numpy as np
X['CLASS'] = conditions['CLASS'].iloc[
    np.digitize(X['DATETIME'].view('i8'),
                conditions['Start'].view('i8')) - 1
].values
Don't worry, I won't leave you hanging there. np.digitize takes its first argument and bins it based on the bin borders defined by the second argument. So here you get the index of the condition corresponding to the time in the given row.
There are a couple of details to be noted:
.view('i8') provides a view of the datetime object which can be easily used by the numpy package (if you are interested, you can read more about the details)
-1 is needed to realign the results (the value right after the start of the first condition would get a value of 1, but we want it to start from 0).
in the end we use the iloc function of the conditions['CLASS'] series to map these indices to the class values.
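To make the recipe concrete, here is a small self-contained example. The toy data and the variable names X and conditions are illustrative, and the integer view of the timestamps mirrors the answer's .view('i8') trick.
import numpy as np
import pandas as pd

conditions = pd.DataFrame({
    'Start': pd.to_datetime(['2017-11-17 00:00:00',
                             '2017-11-17 07:31:07',
                             '2017-11-17 08:58:39']),
    'CLASS': ['0', '1', '2'],
})

X = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2017-11-17 03:00:00',
                                '2017-11-17 07:31:30',
                                '2017-11-17 09:00:00']),
})

# Bin each timestamp against the start times; -1 makes the indices zero-based
idx = np.digitize(X['DATETIME'].to_numpy().view('i8'),
                  conditions['Start'].to_numpy().view('i8')) - 1
X['CLASS'] = conditions['CLASS'].iloc[idx].values
print(X)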

Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is that the column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6

df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity)) for mz, intensity in i.scans]), dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices

while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time()-t, time.time()-t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t, time.time()-t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means writing it to disk to some other format first/etc.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so a SparseDataFrame I think is the best option since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds, which is acceptable (versus pandas taking ~150 seconds per append on my full dataset). I'd love to know how to make pandas match this speed.
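The update doesn't include code, so here is a minimal hedged sketch of what "scipy sparse matrices with my own indexing" could look like. The helper columns_to_sparse and the dict-per-column input format are illustrative assumptions, not the poster's actual code.
import numpy as np
from scipy import sparse

def columns_to_sparse(columns, all_index_values):
    # Map every possible row label to a fixed integer position
    index_positions = {val: pos for pos, val in enumerate(sorted(all_index_values))}
    rows, cols, vals = [], [], []
    for col_num, column in enumerate(columns):
        for label, intensity in column.items():
            rows.append(index_positions[label])
            cols.append(col_num)
            vals.append(intensity)
    shape = (len(index_positions), len(columns))
    # CSC gives cheap column slicing; the row map lets us translate labels later
    return sparse.coo_matrix((vals, (rows, cols)), shape=shape).tocsc(), index_positions

# Example usage with two tiny "columns"
cols = [{1.5: 10, 2.25: 7}, {2.25: 3, 4.0: 1}]
matrix, row_map = columns_to_sparse(cols, {1.5, 2.25, 4.0})
print(matrix.toarray())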
