I have the following code which reads a csv file and then analyzes it. One patient has more than one illness and I need to find how many times an illness is seen on all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
"Count_of_Patientes_Having":[],
"Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
for i in ids:
illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row-by-row inside a loop which can involve multiple in-memory copying. Usually this is reminiscent of general purpose Python and not Pandas programming which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
.merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
.drop(columns=['key'])
)
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
I have a file with some data that looks like
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
I can process this data and do math on it just fine:
import sys
import numpy as np
import pandas as pd
def main():
if(len(sys.argv) != 2):
print "Takes one filename as argument"
sys.exit()
file_name = sys.argv[1]
data = pd.read_csv(file_name, sep=" ", header=None)
data.columns = ["timestep", "mux", "muy", "muz"]
t = data["timestep"].count()
c = np.zeros(t)
for i in range(0,t):
for j in range(0,i+1):
c[i-j] += data["mux"][i-j] * data["mux"][i]
c[i-j] += data["muy"][i-j] * data["muy"][i]
c[i-j] += data["muz"][i-j] * data["muz"][i]
for i in range(t):
print c[i]/(t-i)
The expected result for my sample input above is
42.5
62.0
84.5
110.0
This math is finding the time correlation function for my data, which is the time-average of all permutations of the pairs of products in each column.
I would like to generalize this program to
work on n number of columns (in the i/j loop for example), and
be able to read in the column names from the file, so as to not have them hard-coded in
Which numpy or pandas methods can I use to accomplish this?
We can reduce it to one loop, as we would make use of array-slicing and use sum ufunc to operate along the rows of the dataframe and thus in the process make it generic to cover any number of columns, like so -
a = data.values
t = data["timestep"].count()
c = np.zeros(t)
for i in range(t):
c[:i+1] += (a[:i+1,1:]*a[i,1:]).sum(axis=1)
Explanation
1) a[:i+1,1:] is the slice of all rows until the i+1-th row and all columns starting from the second column, i.e mux, muy and so on.
2) Similarly, for [i,1:], that's the i-th row and all columns from second column onwards.
To keep it "pandas-way", simply replace a[ with data.iloc[.
consider the df
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect something that looks like this
this was edited
what I had was incorrect. #ivan_pozdeev and #boud noticed this was a centered window and that was not my intention. Appologies for the confusion.
everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
the df you gave us is :
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
you could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this. overall this should be relatively time efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
# Extract the index names and values
vals = df.A.values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
# Perform bin based summing and subtract the repeated ones
IDsums = np.bincount(id_arr,vals)
allsums = IDsums[:-1] + IDsums[1:]
allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
# Extract the index names and values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
# Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern
# Pad with 0 and length of array at either ends as we use diff later on
shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
search_id = np.hstack((0,shiftIDs,date_id[-1]+1))
# Find the start of those shifting indices
# Generate ID based on shifts and do bin based summing of dataframe
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
IDsums = np.bincount(id_arr,df.A.values)
# Sum each group of 3 elems with a stride of 2, make dataframe if needed
allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
f = c.size+S-W
out = c[:f:S].copy()
for i in range(1,W):
out += c[i:f+i:S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
dt = df.index.values
indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
WL = ((indx[-1]+1)//S)
c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
freq_str = str(S)+'D'
grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
vals = df.A.values
N = (df.shape[0]-W+2*S-1)//S
n = vals.strides[0]
out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
strides=(S*n,n)).sum(1)
index_idx = (W-1)+S*np.arange(N)
out_index = df.index[index_idx]
return pd.DataFrame(out,index=out_index,columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
index=0
currentsum=0
while index<lim_index:
for _ in range(Mp):
#np.take is awkward, requiring a full list of indices to take
for i in range(currentsum,currentsum+nsums-1):
sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Note that it produces the first sum at kth element, not nth (this can be changed but by sacrificing elegance - a number of dummy iterations before the main loop - and is more elegantly done by prepending data with extra zeros and discarding a number of first sums)
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
parallelizing between any number or workers by splitting the data is as easy (if n>k, chunks after the first one should be fed extra elements at the start)
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns (click the image to see it full-size).
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
RM=range(M) #cache for efficiency
RMp=range(Mp) #cache for efficiency
index=0
currentsum=0
currentsum_ranges=[range(currentsum,currentsum+nsums-1)
for currentsum in range(nsums)] #cache for efficiency
while index<lim_index:
for _ in RMp:
#np.take is unusable as it allocates another array rather than view
for i in currentsum_ranges[currentsum]:
sums[i%nsums]+=data[index]
index+=1
for _ in RM:
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: 1-d ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
if not data.ndim==1: raise TypeError("One-dimensional array required")
cdef int lim_index=data.size-k+1
cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
cdef int rindex = 0
cdef int nsums = int(np.ceil(float(n)/k))
cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)
#optional speedup for dtype=np.int
cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
cdef int[:] cdata = data
cdef int[:] csums = sums
cdef int[:] cresult = result
cdef int M=n%k
cdef int Mp=k-M
cdef int index=0
cdef int currentsum=0
cdef int _,i
while index<lim_index:
for _ in range(Mp):
#np.take is unusable as it allocates another array rather than view
for i in range(currentsum,currentsum+nsums-1):
if use_int_buffer: csums[i%nsums]+=cdata[index] #optional speedup
else: sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
if use_int_buffer:
for i in range(nsums): csums[i]+=cdata[index] #optional speedup
else: sums+=data[index]
index+=1
if use_int_buffer: cresult[rindex]=csums[currentsum] #optional speedup
else: result[rindex]=sums[currentsum]
currentsum=(currentsum+1)%nsums
rindex+=1
return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
ret = np.cumsum(a.values)
ret[n:] = ret[n:] - ret[:-n]
return pd.DataFrame( ret[n-1:][-1::-k][::-1],
index=a[n-1:][-1::-k][::-1].index )
rolling_sum(df,n=6,k=4) # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
this isn't quite perfect yet, but I've gotta go make fake blood for a haloween party tonight... you should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. it doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
for i in range(max(normtime)+1):
yield np.searchsorted(normtime, i) #yield beginning of window index
def sumwindow(mat,i , width): #i is the start of the window returned by yieldstep
normtime = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
j = np.searchsorted(normtime, i, side='right')-1 #find the right side of the window
#return rightmost timestamp of window in seconds from unix epoch and sum of window
return mat[0,j], mat[1,j] - mat[1,i] #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0
I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years or data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
LENGHT = 2**25
N = 30
cumulator = pd.Series()
for i in range(N):
# change the index for each new_entry to mimick the fact
# that out appliance meters have different start and end time.
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indicies. Then allocate memory once for that cumulator. Hence no more fragmentation issues.
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH+N))
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
del new_entry
np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penaultiate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
for index in cumulator.index:
try:
cumulator[index] += new_entry[index]
except KeyError:
pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd
LENGTH = 2**26
N = 10
DTYPE = np.int
# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)
# Get a numpy array from the Series' buffer
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)
# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)),
mode='constant', constant_values=((0, 0)))
# np.pad could be replaced by new_entry.reindex(index, fill_value=0)
# but np.pad is faster and more memory efficient than reindex
del new_entry
np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
del aligned_new_entry
del cumulator_arr
print cumulator.head(N*2)
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0, inplace=True).sum(1)
I have pandas data object - data - that is stored as Series of Series. The first series is indexed on ID1 and the second on ID2.
ID1 ID2
1 10259 0.063979
14166 0.120145
14167 0.177417
14244 0.277926
14245 0.436048
15021 0.624367
15260 0.770925
15433 0.918439
15763 1.000000
...
1453 812690 0.752274
813000 0.755041
813209 0.756425
814045 0.778434
814474 0.910647
814475 1.000000
Length: 19726, dtype: float64
I have a function that uses values from this object for further data processing. Here is the function:
#Function
def getData(ID1, randomDraw):
dataID2 = data[ID1]
value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
return value
I use np.vectorize to apply this function on a DataFrame - dataFrame - that has about 22 million rows.
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
where ID1 and RAND are columns with values that are feeding into the function.
The code takes about 6 hours to process everything. A similar implementation in Java takes only about 6 minutes to get through 22 million rows of data.
On running a profiler on my program I find that the most expensive call is the indexing into data and the second most expensive is searchsorted.
Function Name: pandas.core.series.Series.__getitem__
Elapsed inclusive time percentage: 54.44
Function Name: numpy.core.fromnumeric.searchsorted
Elapsed inclusive time percentage: 25.49
Using data.loc[ID1] to get data makes the program even slower. How can I make this faster? I understand that Python cannot achieve the same efficiency as Java but 6 hours compared to 6 minutes seems too much of a difference. Maybe I should be using a different data structure/ functions? I am using Python 2.7 and PTVS IDE.
Adding a minimum working example:
import numpy as np
import pandas as pd
np.random.seed = 0
#Creating a dummy data object - Series within Series
alt = pd.Series(np.array([ 0.25, 0.50, 0.75, 1.00]), index=np.arange(1,5))
data = pd.Series([alt]*1500, index=np.arange(1,1501))
#Creating dataFrame -
nRows = 200000
d = {'ID1': np.random.randint(1500, size=nRows) + 1
,'RAND': np.random.uniform(low=0.0, high=1.0, size=nRows)}
dataFrame = pd.DataFrame(d)
#Function
def getData(ID1, randomDraw):
dataID2 = data[ID1]
value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
return value
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
You may get a better performance with this code:
>>> def getData(ts):
... dataID2 = data[ts.name]
... i = np.searchsorted(dataID2.values, ts.values, side='left')
... return dataID2.index[i]
...
>>> dataFrame['ID2'] = dataFrame.groupby('ID1')['RAND'].transform(getData)