Create a sampling rate array - python

I have a df with 40 000 000 points that looks like this:
A
0 0.50
1 0.90
2 5.94
.
40 000 000 84.53
As the data does not have any time, I am trying to create a time array to the df but every time that I do it I get Memory Errors. Sampling time = 60 kHz
I tried shrinking the data by slicing it and instead of taking 40 000 000 points. I checked and the important data for me lay between 20000001:40000000. I have tried to take less data points e.g. 20 000 but still, whenever I create the Time array I get the memory error.
N = N.iloc[20000001:40000000] #Lock data
N = N[0 : len(N) : 1000] # Slice by 1000 increments
N['Time'] = np.arange(0, len(N), 1/60000)
How could I create a Time array without killing my memory? Am I doing something wrong?

You may write a generator of floats similar to xrange (in python2) or range (python3). They lack of float support so we write it by ourselves:
def frange(end_number, fraction):
end_idx = end_number * fraction
idx = 0
while idx < end_idx:
yield float(idx) / fraction
idx += 1
a = frange(20, 3)
print([i for i in a]) # see how it works
b = frange(40000000, 60000) # no memory error

Related

fastest way to filter a 2d numpy array

I am trying to filter a numpy array of array, I have done a function like the following:
#nb.njit
def numpy_filter (npX):
n = np.full (npX.shape[0], True)
for npo_index in range(npX.shape[0]):
n[npo_index] = npX[npo_index][0] < 2000 and npX[npo_index][1] < 4000 and npX[npo_index][2] < 5000
return npX[n]
It took 1.75s (numba njit mode) for len of array = 600K , while it only take < 0.5s for list [x for x in obj1 if x[0] < 2000 and x[1] < 4000 and x[2] < 5000]
Is there any better implementation could have a filtering function that could make it run faster?
Generally with Pandas/NumPy arrays, you'll get the best performance if you
avoid iterating over the array
only create soft copies or views of the base array
create a minimal number of intermediate Python objects
Pandas is probably your friend here, allowing you to create a view of the backing NumPy arrays and operate on the rows of each via a shared index
starting data
This creates a random array of the same shape as your source data, with values range from 0-10000
>>> import numpy as np
>>> arr = np.random.rand(600000, 3) * 10000
>>> arr
array([[8079.54193993, 925.74430028, 2031.45569251],
[8232.74161149, 2347.42814063, 7571.21287502],
[7435.52165567, 756.74380534, 1023.12181186],
...,
[2176.36643662, 5374.36584708, 637.43482263],
[2645.0737415 , 9059.42475818, 3913.32941652],
[3626.54923011, 1494.57126083, 6121.65034039]])
create a Pandas DataFrame
This creates view over your source data so you can work with all the rows together using a shared index
>>> import pandas as pd
>>> df = pd.DataFrame(arr)
>>> df
0 1 2
0 8079.541940 925.744300 2031.455693
1 8232.741611 2347.428141 7571.212875
2 7435.521656 756.743805 1023.121812
3 4423.799649 2256.125276 7591.732828
4 6892.019075 3170.699818 1625.226953
... ... ... ...
599995 642.104686 3164.107206 9508.818253
599996 102.819102 3068.249711 1299.341425
599997 2176.366437 5374.365847 637.434823
599998 2645.073741 9059.424758 3913.329417
599999 3626.549230 1494.571261 6121.650340
[600000 rows x 3 columns]
filter
This gets a filtered view of the index for each columns and uses the combined result filter the DataFrame
>>> df[(df[0] < 2000) & (df[1] < 4000) & (df[2] < 5000)]
0 1 2
35 1829.777633 1333.083450 1928.982210
38 653.584288 3129.089395 4753.734920
71 1354.736876 279.202816 5.793797
97 1381.531847 551.465381 3767.436640
115 183.112455 1573.272310 1973.143995
... ... ... ...
599963 1895.537096 1695.569792 1866.575164
599970 1061.011239 51.534961 1014.290040
599988 1780.535714 2311.671494 1012.828410
599994 878.643910 352.858091 3014.505666
599996 102.819102 3068.249711 1299.341425
[24067 rows x 3 columns]
benchmark maybe to follow, but it's very fast
The jit function has not been warmed up, after the first run, the result shows it only takes 0.07s to finish the task.
Make your jit function return only the mask n, dont send npX[n].
Since jit compiler can not fix the the return size of filtered array there is chance it will slow down.
Do the filtering i.e. npX[n] outside jit function. That should speed up.
Also to get actual time better add signature with decorater, this will force eager compilation.
The optimization way in both numpy and numba are different so you always experiment which will be faster. But when speed is almost same you can add parallel option which will make it more faster(I guess you already know that)

How to fix the code which is interrupted by OS due to out-of-memory & optimize the code?

I have a problem with creating numpy.ndarray.
This is the part of code.
.....
def make_weighted_pedge(graph):
number_of_nodes = len(graph.nodes())
l_ij = np.full([number_of_nodes, number_of_nodes], np.nan)
# If two nodes are adjacent, its minimum path length is 1
for node in graph.nodes():
neighbor_nodes = [n for n in graph.neighbors(node)]
for adj_node in neighbor_nodes:
l_ij[int(node) - 1][int(adj_node) - 1] = 1
l_ij[int(adj_node) - 1][int(node) - 1] = 1
After I run this code, I get this error message!
"Process finished with exit code 137 (interrupted by signal 9: SIGKILL)"
np.nan is a float. np.full(...) produces an ndarray of dtype int64. Your 20k x 20k ndarray will be ~3.2Gb.
In [45]: np.full((1000,1000),np.nan).nbytes
Out[45]: 8000000 # times 400 for shape (20k,20k)
It doesn't look like you need floats so you could reduce the size by a factor of 4 or 8 using dtypes np.uint16 or np.uint8
In [47]: np.zeros((1000,1000), dtype=np.uint16).nbytes
Out[47]: 2000000
In [48]: np.zeros((1000,1000), dtype=np.uint8).nbytes
Out[48]: 1000000
np.uint16 will limit the max path length to 65535 and np.uint8 will limit it to 255.
If you need the np.nan values, the best you can do is a factor of four using np.float16
np.full((number_of_nodes, number_of_nodes),np.nan,dtype=np.float16)
Data type objects

Performance issues when using plt.hist() for array with 2500 rows

I am using pandas, numpy and matplotlib.pyplot to create a histogram. I have a bottleneck when using the command plt.hist() (NB I have done import matplotlib.pyplot as plt)
The data (called data) is a DataFrame that is 2501 rows long and its .head() looks like this:
>>> print data.head()
Step X
0 0 2.16261125638519e+01
1 100 2.05988670714102e+01
2 200 2.04719904741260e+01
3 300 2.02079668696942e+01
4 400 1.98093907227472e+01
I then use .values to return an array that looks like this:
>>> h = data['X'].values
>>> print h
['2.16261125638519e+01' '2.05988670714102e+01' '2.04719904741260e+01' ...
'1.90247338878272e+01' '1.90866656009559e+01' '1.91424583903948e+01']
Finally, where I am having problems, I call plt.hist(h) and this was still running after waiting for 10 minutes. (NB I have not performed this to completion)
Is there anyway to speed this up as I would have thought that a length of order 10^3 would not take this long to run?
UPDATE: The values of the array h were strings so I have changed them to floats but am having the same problem as described:
>>> h = h.astype(float)
>>> print h
[21.62611256 20.59886707 20.47199047 ... 19.02473389 19.0866656
19.14245839]

daily data, resample every 3 days, calculate over trailing 5 days efficiently

consider the df
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect something that looks like this
this was edited
what I had was incorrect. #ivan_pozdeev and #boud noticed this was a centered window and that was not my intention. Appologies for the confusion.
everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
the df you gave us is :
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
you could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this. overall this should be relatively time efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
# Extract the index names and values
vals = df.A.values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
# Perform bin based summing and subtract the repeated ones
IDsums = np.bincount(id_arr,vals)
allsums = IDsums[:-1] + IDsums[1:]
allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
# Extract the index names and values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
# Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern
# Pad with 0 and length of array at either ends as we use diff later on
shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
search_id = np.hstack((0,shiftIDs,date_id[-1]+1))
# Find the start of those shifting indices
# Generate ID based on shifts and do bin based summing of dataframe
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
IDsums = np.bincount(id_arr,df.A.values)
# Sum each group of 3 elems with a stride of 2, make dataframe if needed
allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
f = c.size+S-W
out = c[:f:S].copy()
for i in range(1,W):
out += c[i:f+i:S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
dt = df.index.values
indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
WL = ((indx[-1]+1)//S)
c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
freq_str = str(S)+'D'
grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
vals = df.A.values
N = (df.shape[0]-W+2*S-1)//S
n = vals.strides[0]
out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
strides=(S*n,n)).sum(1)
index_idx = (W-1)+S*np.arange(N)
out_index = df.index[index_idx]
return pd.DataFrame(out,index=out_index,columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
index=0
currentsum=0
while index<lim_index:
for _ in range(Mp):
#np.take is awkward, requiring a full list of indices to take
for i in range(currentsum,currentsum+nsums-1):
sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Note that it produces the first sum at kth element, not nth (this can be changed but by sacrificing elegance - a number of dummy iterations before the main loop - and is more elegantly done by prepending data with extra zeros and discarding a number of first sums)
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
parallelizing between any number or workers by splitting the data is as easy (if n>k, chunks after the first one should be fed extra elements at the start)
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns (click the image to see it full-size).
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
RM=range(M) #cache for efficiency
RMp=range(Mp) #cache for efficiency
index=0
currentsum=0
currentsum_ranges=[range(currentsum,currentsum+nsums-1)
for currentsum in range(nsums)] #cache for efficiency
while index<lim_index:
for _ in RMp:
#np.take is unusable as it allocates another array rather than view
for i in currentsum_ranges[currentsum]:
sums[i%nsums]+=data[index]
index+=1
for _ in RM:
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: 1-d ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
if not data.ndim==1: raise TypeError("One-dimensional array required")
cdef int lim_index=data.size-k+1
cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
cdef int rindex = 0
cdef int nsums = int(np.ceil(float(n)/k))
cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)
#optional speedup for dtype=np.int
cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
cdef int[:] cdata = data
cdef int[:] csums = sums
cdef int[:] cresult = result
cdef int M=n%k
cdef int Mp=k-M
cdef int index=0
cdef int currentsum=0
cdef int _,i
while index<lim_index:
for _ in range(Mp):
#np.take is unusable as it allocates another array rather than view
for i in range(currentsum,currentsum+nsums-1):
if use_int_buffer: csums[i%nsums]+=cdata[index] #optional speedup
else: sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
if use_int_buffer:
for i in range(nsums): csums[i]+=cdata[index] #optional speedup
else: sums+=data[index]
index+=1
if use_int_buffer: cresult[rindex]=csums[currentsum] #optional speedup
else: result[rindex]=sums[currentsum]
currentsum=(currentsum+1)%nsums
rindex+=1
return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
ret = np.cumsum(a.values)
ret[n:] = ret[n:] - ret[:-n]
return pd.DataFrame( ret[n-1:][-1::-k][::-1],
index=a[n-1:][-1::-k][::-1].index )
rolling_sum(df,n=6,k=4) # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
this isn't quite perfect yet, but I've gotta go make fake blood for a haloween party tonight... you should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. it doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
for i in range(max(normtime)+1):
yield np.searchsorted(normtime, i) #yield beginning of window index
def sumwindow(mat,i , width): #i is the start of the window returned by yieldstep
normtime = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
j = np.searchsorted(normtime, i, side='right')-1 #find the right side of the window
#return rightmost timestamp of window in seconds from unix epoch and sum of window
return mat[0,j], mat[1,j] - mat[1,i] #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0

Add two Pandas Series or DataFrame objects in-place?

I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years or data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
LENGHT = 2**25
N = 30
cumulator = pd.Series()
for i in range(N):
# change the index for each new_entry to mimick the fact
# that out appliance meters have different start and end time.
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indicies. Then allocate memory once for that cumulator. Hence no more fragmentation issues.
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH+N))
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
del new_entry
np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penaultiate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
for index in cumulator.index:
try:
cumulator[index] += new_entry[index]
except KeyError:
pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd
LENGTH = 2**26
N = 10
DTYPE = np.int
# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)
# Get a numpy array from the Series' buffer
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)
# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)),
mode='constant', constant_values=((0, 0)))
# np.pad could be replaced by new_entry.reindex(index, fill_value=0)
# but np.pad is faster and more memory efficient than reindex
del new_entry
np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
del aligned_new_entry
del cumulator_arr
print cumulator.head(N*2)
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0, inplace=True).sum(1)

Categories

Resources