I have a dataset of cars being spotted on two different cameras, and I need to calculate the average time it takes to travel from camera 1 to camera 2. The database looks like this:
"ID","PLATE", "CSPOTID", "INOUT_FLAG","EVENT_TIME"
"33173","xys8","1","0","2020-08-27 08:24:53"
"33174","asd4","1","0","2020-08-27 08:24:58"
"33175","------","2","1","2020-08-27 08:25:03"
"33176","asd4","1","0","2020-08-27 08:25:04"
"33177","ghj1","1","0","2020-08-27 08:25:08"
...
Currently my code works as intended and calculates the average time between rows, but with a large dataset and a constant flow of incoming data it takes too much time.
import numpy as np, matplotlib.pyplot as plt, pandas as pd, collections, sys, operator, datetime

df = pd.read_csv('tmetrics_base2.csv', quotechar='"', skipinitialspace=True, delimiter=',',
                 dtype={"ID": int, "PLATE": "string", "CSPOTID": int, "INOUT_FLAG": int, "EVENT_TIME": "string"})
data = df.as_matrix()

# Sort values by PLATE
dfSortedByPlate = df.sort_values(['PLATE', 'EVENT_TIME'])

# List of already tested PLATEs
TestedPlate = []
resultList = []

# Iterate through all rows in the db
for i, j in dfSortedByPlate.iterrows():
    # If PLATE is "------", skip it
    if j[1] == "-------":
        continue
    if j[1] in TestedPlate:
        continue
    TestedPlate.append(j[1])
    for ii, jj in dfSortedByPlate.iterrows():
        if j[1] != jj[1]:
            continue
        if j[1] == jj[1]:
            dt1 = datetime.datetime.strptime(jj[4], '%Y-%m-%d %H:%M:%S')
            dt2 = datetime.datetime.strptime(j[4], '%Y-%m-%d %H:%M:%S')
            Travel_time = []
            Travel_time.append((dt1 - dt2).total_seconds())
            # Keep only travel times between 3 minutes and 3000 seconds (50 minutes)
            if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
                resultList.append((dt1 - dt2).total_seconds())
                # print((dt1 - dt2).total_seconds())
            print(sum(resultList) / len(resultList))
            placeholdertime = jj[4]
I have sorted the database by plate number so that the comparison should be fairly quick. Any advice or pointers on where I could increase the run speed would be greatly appreciated.
Also, I am unsure how long I should expect calculations like these to take; I don't have experience with data at this scale.
You can speed up your code by removing the unnecessary for loops. Built-in pandas functions are typically faster than iterating through the rows of the df. For instance, you can replace the two for loops with:
# Get only the relevant plates
df_relevant = dfSortedByPlate[dfSortedByPlate['PLATE'] != "-------"]

# Test the relevant plates
for i, j in df_relevant.iterrows():
    df_same_plate_j = df_relevant[df_relevant['PLATE'] == j[1]]
    for ii, jj in df_same_plate_j.iterrows():
        dt1 = datetime.datetime.strptime(jj[4], '%Y-%m-%d %H:%M:%S')
        dt2 = datetime.datetime.strptime(j[4], '%Y-%m-%d %H:%M:%S')
        Travel_time = []
        Travel_time.append((dt1 - dt2).total_seconds())
        # Keep only travel times between 3 minutes and 3000 seconds (50 minutes)
        if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
            resultList.append((dt1 - dt2).total_seconds())
            # print((dt1 - dt2).total_seconds())
        print(sum(resultList) / len(resultList))
        placeholdertime = jj[4]
df_relevant now contains all plates that you want to test. Then, df_same_plate_j gets the rows in df_relevant that have the same plate as row j. Then you do the rest. This way, the number of items you are iterating over is much less.
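If you want to go a step further, both loops can be dropped entirely. Here is a minimal sketch of that idea (my addition, not the answer's code); it assumes the column names from the question, the "------" placeholder plate shown in the sample data, and the same 3-minute/50-minute thresholds as the original loop.
df = df[df['PLATE'] != "------"].copy()
df['EVENT_TIME'] = pd.to_datetime(df['EVENT_TIME'])
df = df.sort_values(['PLATE', 'EVENT_TIME'])

# Seconds between consecutive sightings of the same plate
gaps = df.groupby('PLATE')['EVENT_TIME'].diff().dt.total_seconds()
valid = gaps[(gaps > 180) & (gaps < 3000)]
print(valid.mean())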
Just a few suggestions:
Read only what you need:
df = pd.read_csv('data_raw.csv',
                 quotechar='"',
                 skipinitialspace=True,
                 delimiter=',',
                 usecols=['PLATE', 'EVENT_TIME'],
                 index_col=['PLATE'])
Convert the EVENT_TIME column to datetime (you don't have to do that row by row):
df['EVENT_TIME'] = pd.to_datetime(df['EVENT_TIME'])
Sort (you already did that):
df.sort_index(inplace=True)
df.sort_values(by='PLATE', inplace=True)
Fetch the plates, excluding the one that isn't needed:
plates = set(df.index).difference({"------"})
Process the plate-chunks:
for plate in plates:
    print(df.loc[plate])
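To turn that loop into the travel-time calculation, one possible body looks like this (a sketch, assuming EVENT_TIME has already been converted with pd.to_datetime as above and using the question's 3-minute/50-minute filter):
result = []
for plate in plates:
    # df.loc[[plate]] selects every sighting of this plate (PLATE is the index)
    times = df.loc[[plate], 'EVENT_TIME'].sort_values()
    gaps = times.diff().dt.total_seconds().dropna()
    result.extend(gaps[(gaps > 180) & (gaps < 3000)])
if result:
    print(sum(result) / len(result))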
UPDATE: My question has been fully answered. I have applied jarmod's answer to my program, and although the code looks neater, it has not affected the speed at which my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can increase the speed; it takes about 30 seconds, and I know this portion of the code is slowing it down. I have shown my real code in the second block of code below. Also, the speed is strongly determined by the Range I set; with a short range it is quite fast.
I have this sample code that shows the calculation I need to do for forecasting and extracting values. I use the for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the average forecast accuracy for a given number of months ahead.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are nearly identical except for one number, and reading the CSV files so many times slows the program down.
Is there a way I can combine these functions, perhaps by adding another parameter? My biggest concern was that it would be hard to return the separate numbers and categorize them. In other words, I would ideally like to have only one function for all 12 monthly accuracy predictions; the way I can see to do that would be to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which reads the file before the current file and compares the predicted value for the date associated with the current file), then store all the values of twomonthaccuracy, and so on, so I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt

def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal

def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding='Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal

onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24, 36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal += twomonthaccuracy(basefilenumber)
    onetotallist.append(onemonthaccuracy(basefilenumber))
    twototallist.append(twomonthaccuracy(basefilenumber))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12

x = [1, 2]
y = [onetotalpermonth, twototalpermonth]
z = [1, 2]
w = [onetotallist, twototallist]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x, y)
plt.show()
This is the real block of code I am using in my program, perhaps something is slowing it down that I am unaware of?
#other parts of code
#StartRange = yearvalue + Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....

def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)

N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange), int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber, n))
onetotal = total[1] / Range
twototal = total[2] / Range
threetotal = total[3] / Range
fourtotal = total[4] / Range
fivetotal = total[5] / Range  #... all the way to 12
onetotallist = total_by_month_list[1]
twototallist = total_by_month_list[2]
threetotallist = total_by_month_list[3]
fourtotallist = total_by_month_list[4]
fivetotallist = total_by_month_list[5]  #... all the way to 12
# alot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)

N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20, 30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2 here, but you could make it 12 or any other value) and then creates a total_by_month list that holds N results, one per month, initialized to zero so that each monthly total starts at zero.
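One further speed-up worth considering (my addition, not part of the answer): with the nested loop, every numbered CSV is read from disk many times. A small sketch of caching the reads, assuming the files really are named <number>.csv and filtered as in the snippets above:
from functools import lru_cache
import pandas as pd

@lru_cache(maxsize=None)
def read_file(filenumber):
    # each numbered CSV is parsed only once, however often it is requested
    return pd.read_csv(str(filenumber) + '.csv', encoding='Latin-1')

def nmonthaccuracy(basefilenumber, n):
    base = read_file(basefilenumber)
    nmonth = read_file(basefilenumber - n)
    basefilevalue = base.loc[base['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthvalue = nmonth.loc[nmonth['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue) / int(basefilevalue)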
I am trying to write code to construct a DataFrame consisting of cointegrating pairs of portfolios (the portfolio price series are cointegrated). In this case, stocks in a portfolio are selected from the S&P 500 and they have equal weights.
Also, for economic reasons, the portfolios must include the same sectors.
For example:
if stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select stocks from the [IT] and [Financial] sectors.
There is no correct number of stocks in a portfolio, so I'm considering about 10 to 20 stocks for each of them. However, the number of combinations is on the order of (500 choose 10), so I have a computation-time issue.
The following is my code:
import itertools
import time

import pandas as pd
import statsmodels.tsa.stattools as ts

def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    res = pd.DataFrame()
    regress1, regress2 = pd.ols(x=x, y=y), pd.ols(x=y, y=x)
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    if test1[1] < pvalue and test1[1] < test2[1] and\
       regress1.beta["x"] > beta_lower and regress1.beta["x"] < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([regress1.beta["x"], test1[1]])
        res = res.T
        res.columns = ["beta", "pvalue"]
        return res
    elif test2[1] < pvalue and regress2.beta["x"] > beta_lower and\
         regress2.beta["x"] < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([regress2.beta["x"], test2[1]])
        res = res.T
        res.columns = ["beta", "pvalue"]
        return res
    else:
        pass
def coint(dataFrame, nstocks=2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas dataFrame, in this case data['Adj Close'], row = time, col = tickers
    # pvalue = level of significance of adf test
    # nstocks = number of stocks considered for adf test (equal weight)
    # if nstocks > 2, coint returns cointegration between portfolios
    # beta_lower = lower bound for slope of linear regression
    # beta_upper = upper bound for slope of linear regression
    a = time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:int(nstocks/2)]), list(pair[int(nstocks/2):])
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.ix[xind]["Sector"])
        ySector = list(SNP.ix[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[list(xName)].sum(axis=1), dataFrame[list(yName)].sum(axis=1)
            res1 = adf(x, y, xName, yName)
            if res1 is None:
                continue
            elif res.size == 0:
                res = res1
                sec = pd.DataFrame(sector, index=res.index, columns=["sector"])
                print("added : ", pair)
            else:
                res = res.append(res1)
                sec = sec.append(pd.DataFrame(sector, index=[res.index[-1]], columns=["sector"]))
                print("added : ", pair)
    res = pd.concat([res, sec], axis=1)
    res = res.sort_values(by=["pvalue"], ascending=True)
    b = time.time()
    print("time taken : ", b-a, "sec")
    return res
When nstocks=2 this takes about 263 seconds, but as nstocks increases, the loop takes a lot longer (more than a day).
I collected the 'Adj Close' data from Yahoo Finance using pandas_datareader.data; the index is time and the columns are the different tickers.
Any suggestions or help would be appreciated.
I don't know what computer you have, but I would advise you to use some kind of multiprocessing for the loop. I haven't looked really hard into your code, but as far as I can see, res and sec could be moved into shared memory objects, and the individual iterations parallelized with multiprocessing.
If you have a decent CPU it can improve performance 4-6 times. If you have access to some kind of HPC it can do wonders.
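A rough sketch of that idea (my addition, not a drop-in replacement): rather than sharing res and sec across processes, it is usually simpler to have each worker handle one combination and return its result, then concatenate in the parent. The sketch assumes a fork-based start method so the workers can see dataFrame, nstocks and adf from the question, and it omits the sector filter for brevity.
from multiprocessing import Pool
import itertools
import pandas as pd

def check_pair(pair):
    # hypothetical worker: the body of the `for pair in tcomb:` loop,
    # rewritten to return a small DataFrame (or None) instead of appending
    xName, yName = list(pair[:len(pair)//2]), list(pair[len(pair)//2:])
    x, y = dataFrame[xName].sum(axis=1), dataFrame[yName].sum(axis=1)
    return adf(x, y, xName, yName)

if __name__ == '__main__':
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    with Pool(processes=4) as pool:                      # tune to your core count
        results = pool.map(check_pair, tcomb, chunksize=1000)
    res = pd.concat([r for r in results if r is not None])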
I'd recommend using a profiler to narrow down the most time consuming calls, and the number of loops (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:
https://docs.python.org/3.6/library/profile.html
You can either invoke it in your code:
import cProfile
cProfile.run('your_function(inputs)')
Or if a script is an easier entrypoint:
python -m cProfile [-o output_file] [-s sort_order] your-script.py
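If the raw output is hard to read, the standard-library pstats module can sort and trim it; for example (the stats file name here is just a placeholder):
import cProfile
import pstats

cProfile.run('your_function(inputs)', 'profile.out')   # write the stats to a file
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)         # top 20 calls by cumulative time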
Consider the df:
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect something that looks like this
This was edited:
What I had was incorrect. #ivan_pozdeev and #boud noticed this was a centered window and that was not my intention. Apologies for the confusion.
Everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
The df you gave us is:
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
You could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this; overall it should be relatively time-efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are a few NumPy based solutions using bin based summing, covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False, indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0, np.arange(2, date_id[-1], 3), date_id[-1]+1))
    shifts = np.searchsorted(date_id, search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)), reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr, vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id, vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums, index=out_index, columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False, indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2, date_id[-1], 3)[:, None] + np.arange(2)).ravel()
    search_id = np.hstack((0, shiftIDs, date_id[-1]+1))

    # Find the start of those shifting indices
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id, search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)), reps)
    IDsums = np.bincount(id_arr, df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums, index=out_index, columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False, dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(), df.A.values)
    out = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out, index=out_index, columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False, dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(), df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1, W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0, ((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx, df.A.values, minlength=S*WL+(W-S))
    out = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1, 'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt, 'right')-1]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals, shape=(N, W),
                                          strides=(S*n, n)).sum(1)
    index_idx = (W-1) + S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
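As a quick sanity check (my addition): on the test frames for Scenarios #1 and #3, the convolution-based vectorized_app3 and its sliced-summation variant vectorized_app3_v2 compute the same windowed sums, so their outputs can be compared directly:
# Both variants should agree exactly for the same stride S and window W
pd.testing.assert_frame_equal(vectorized_app3(df, S=S, W=W),
                              vectorized_app3_v2(df, S=S, W=W))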
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's an algorithm that calculates all the sums in a single iteration over the array. This is the base, unoptimized version intended to showcase the algorithm (optimized versions for pure Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being moved to the C level.
from __future__ import division
import numpy as np

#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100, size=100000)

def calc_trailing_data_with_interval(data, n, k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.

    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index = len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums, dtype=data.dtype)
    M = n % k
    Mp = k - M

    index = 0
    currentsum = 0

    while index < lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum, currentsum+nsums-1):
                sums[i % nsums] += data[index]
            index += 1
        for _ in range(M):
            sums += data[index]
            index += 1
        yield sums[currentsum]
        currentsum = (currentsum+1) % nsums
Note that it produces the first sum at the kth element, not the nth (this can be changed, but only by sacrificing elegance with a number of dummy iterations before the main loop; it is more elegantly done by prepending data with extra zeros and discarding a number of the first sums).
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
Parallelizing between any number of workers by splitting the data is just as easy (if n>k, chunks after the first one should be fed extra elements at the start).
To deduce the algorithm, I wrote out a sample case where a decent number of sums are calculated simultaneously, in order to see the patterns.
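For reference, a minimal way to drive the generator (matching the list(<call>) timing mentioned above and the question's window of 5 sampled every 3):
# Sums of 5 trailing elements, emitted at every 3rd element
trailing_sums = list(calc_trailing_data_with_interval(data, n=5, k=3))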
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data, n, k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.

    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index = len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums, dtype=data.dtype)
    M = n % k
    Mp = k - M
    RM = range(M)    #cache for efficiency
    RMp = range(Mp)  #cache for efficiency

    index = 0
    currentsum = 0
    currentsum_ranges = [range(currentsum, currentsum+nsums-1)
                         for currentsum in range(nsums)]  #cache for efficiency

    while index < lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i % nsums] += data[index]
            index += 1
        for _ in RM:
            sums += data[index]
            index += 1
        yield sums[currentsum]
        currentsum = (currentsum+1) % nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
import numpy as np
cimport numpy as np

def calc_trailing_data_with_interval(np.ndarray data, int n, int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.

    #type data: 1-d ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    if not data.ndim == 1: raise TypeError("One-dimensional array required")
    cdef int lim_index = data.size - k + 1
    cdef np.ndarray result = np.empty(data.size//k, dtype=data.dtype)
    cdef int rindex = 0
    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums, dtype=data.dtype)

    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype == np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result

    cdef int M = n % k
    cdef int Mp = k - M

    cdef int index = 0
    cdef int currentsum = 0
    cdef int _, i

    while index < lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum, currentsum+nsums-1):
                if use_int_buffer: csums[i % nsums] += cdata[index]   #optional speedup
                else: sums[i % nsums] += data[index]
            index += 1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i] += cdata[index]       #optional speedup
            else: sums += data[index]
            index += 1
        if use_int_buffer: cresult[rindex] = csums[currentsum]        #optional speedup
        else: result[rindex] = sums[currentsum]
        currentsum = (currentsum+1) % nsums
        rindex += 1
    return result
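For completeness, one common way to build such a Cython module (my addition; the file name trailing_sums.pyx is just a placeholder):
# setup.py -- build with:  python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("trailing_sums.pyx"),   # the .pyx file holding the function above
    include_dirs=[np.get_include()],              # needed because of `cimport numpy`
)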
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame(ret[n-1:][-1::-k][::-1],
                        index=a[n-1:][-1::-k][::-1].index)

rolling_sum(df, n=6, k=4)   # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
This isn't quite perfect yet, but I've gotta go make fake blood for a Halloween party tonight... You should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. It doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking.
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int)   #integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i)                #yield beginning of window index

def sumwindow(mat, i, width): #i is the start of the window returned by yieldstep
    normtime = ((mat[0,i:] - mat[0,i]) / width).astype(int)   #same as before, but we norm to window width
    j = np.searchsorted(normtime, i, side='right') - 1        #find the right side of the window
    #return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - mat[1,i]   #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
    return df.rolling(window, center=True).sum()[ndays-1::ndays]

rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0
I have a list of records (person_id, start_date, end_date) as follows:
person_records = [['1', '08/01/2011', '08/31/2011'],
                  ['1', '09/01/2011', '09/30/2011'],
                  ['1', '11/01/2011', '11/30/2011'],
                  ['1', '12/01/2011', '12/31/2011'],
                  ['1', '01/01/2012', '01/31/2012'],
                  ['1', '03/01/2012', '03/31/2012']]
The records for each person are sorted in ascending order of start_date. The periods are consolidated by combining records based on the dates, recording the start_date of the first period as the start date and the end_date of the last period as the end date. BUT, if the time between the end of one period and the start of the next is 32 days or less, we should treat this as a continuous period. Otherwise, we treat it as two periods:
consolidated_person_records = [['1', '08/01/2011', '09/30/2011'],
                               ['1', '11/01/2011', '03/31/2012']]
Is there any way to do this using Python connected components?
I thought about your question, and I originally wrote a routine that would map the date intervals into a 1D binary array, where each entry in the array is a day and consecutive days are consecutive entries. With this data structure, you can perform dilation and erosion to fill in small gaps, thus merging the intervals, and then map the consolidated intervals back into date ranges. Thus we use standard raster connected-components logic to solve your problem, as per your idea (a graph-based connected-components approach could work as well...).
This works fine, and I can post the code if you are really interested, but then I wondered what the advantages of the former approach are over the simple routine of just iterating through the (pre-sorted) date ranges and merging the next into the current if the gap is small.
Here is the code for the simple routine, and it takes about 120 microseconds to run using the sample data. If you expand the sample data by repeating it 10,000 times, this routine takes about 1 second on my computer.
When I timed the morphology-based solution, it was about 2x slower. It might work better under certain circumstances, but I would suggest we try simple first and see if there's a real problem that requires a different algorithmic approach.
from datetime import datetime
from datetime import timedelta
import numpy as np
The sample data provided in the question:
SAMPLE_DATA = [['1', '08/01/2011', '08/31/2011'],
               ['1', '09/01/2011', '09/30/2011'],
               ['1', '11/01/2011', '11/30/2011'],
               ['1', '12/01/2011', '12/31/2011'],
               ['1', '01/01/2012', '01/31/2012'],
               ['1', '03/01/2012', '03/31/2012'],
               ['2', '11/11/2011', '11/30/2011'],
               ['2', '12/11/2011', '12/31/2011'],
               ['2', '01/11/2014', '01/31/2014'],
               ['2', '03/11/2014', '03/31/2014']]
The simple approach:
def simple_method(in_data=SAMPLE_DATA, person='1', fill_gap_days=31, printit=False):
    date_format_str = "%m/%d/%Y"
    dat = np.array(in_data)
    dat = dat[dat[:, 0] == person, 1:]  # just this person's data
    # assume date intervals are already sorted by start date
    new_intervals = []
    cur_start = None
    cur_end = None
    gap_days = timedelta(days=fill_gap_days)
    for (s_str, e_str) in dat:
        dt_start = datetime.strptime(s_str, date_format_str)
        dt_end = datetime.strptime(e_str, date_format_str)
        if cur_end is None:
            cur_start = dt_start
            cur_end = dt_end
            continue
        else:
            if cur_end + gap_days >= dt_start:
                # merge, keep existing cur_start, extend cur_end
                cur_end = dt_end
            else:
                # new interval, save previous and reset current to this
                new_intervals.append((cur_start, cur_end))
                cur_start = dt_start
                cur_end = dt_end
    # make sure final interval is saved
    new_intervals.append((cur_start, cur_end))
    if printit:
        print_it(person, new_intervals, date_format_str)
    return new_intervals
And here's the simple pretty printing function to print the ranges.
def print_it(person, consolidated_ranges, fmt):
    for (s, e) in consolidated_ranges:
        print(person, s.strftime(fmt), e.strftime(fmt))
Running in ipython as follows. Note that printing the result can be turned off for timing the computation.
In [10]: _ = simple_method(printit=True)
1 08/01/2011 09/30/2011
1 11/01/2011 03/31/2012
Running in ipython with %timeit macro:
In [8]: %timeit simple_method(in_data=SAMPLE_DATA)
10000 loops, best of 3: 118 µs per loop
In [9]: %timeit simple_method(in_data=SAMPLE_DATA*10000)
1 loops, best of 3: 1.06 s per loop
[EDIT 8 Feb 2016: To make a long answer longer...]
As I prefaced in my response, I did create a morphological / 1D connected-components version, and in my timing it was about 2x slower. But for the sake of completeness, I'll show the morphological method, and maybe others will have insight into whether there's a big opportunity for speed-up left somewhere in it.
#using same imports as previous code with one more
import calendar as cal

def make_occupancy_array(start_year, end_year):
    """
    Represents the time between the start and end years, inclusively, as a 1-D array
    of 'pixels', where each pixel corresponds to a day. Consecutive days are thus
    mapped to consecutive pixels. We can perform morphology on this 1D array to
    close small gaps between date ranges.
    """
    years_days = [(yr, 366 if cal.isleap(yr) else 365) for yr in range(start_year, end_year+1)]
    YD = np.array(years_days)  # like [ (2011, 365), (2012, 366), ... ] in ndarray form
    total_num_days = YD[:, 1].sum()
    occupancy = np.zeros((total_num_days,), dtype='int')
    return YD, occupancy
With the occupancy array to represent the time intervals, we need two functions to map from dates to positions in the array and the inverse.
def map_date_to_position(dt, YD):
    """
    Maps the datetime value to a position in the occupancy array
    """
    # the start position is the offset to day 1 in the dt1 year,
    # plus the day of year - 1 for dt1 (day of year is 1-based indexed)
    yr = dt.year
    assert yr in YD[:, 0]  # guard...YD should include all years for this person's dates
    position = YD[YD[:, 0] < yr, 1].sum()  # the sum of the days in years before this year
    position += dt.timetuple().tm_yday - 1
    return position

def map_position_to_date(pos, YD):
    """
    Inverse of map_date_to_position, this maps a position in the
    occupancy array back to a datetime value
    """
    yr_offsets = np.cumsum(YD[:, 1])
    day_offsets = yr_offsets - pos
    idx = np.flatnonzero(day_offsets > 0)[0]
    year = YD[idx, 0]
    day_of_year = pos if idx == 0 else pos - yr_offsets[idx-1]
    # construct datetime as first of year plus day offset in year
    dt = datetime.strptime(str(year), "%Y")
    dt += timedelta(days=int(day_of_year)+1)
    return dt
The following function fills the relevant part of the occupancy array given start and end dates (inclusive) and optionally extends the end of the range by a gap-filling margin (like 1-sided dilation).
def set_occupancy(dt1, dt2, YD, occupancy, fill_gap_days=0):
    """
    For a date range starting dt1 and ending, inclusively, dt2,
    sets the corresponding 'pixels' in occupancy vector to 1.
    If fill_gap_days > 0, then the end 'pixel' is extended
    (dilated) by this many positions, so that we can fill
    the gaps between intervals that are close to each other.
    """
    pos1 = map_date_to_position(dt1, YD)
    pos2 = map_date_to_position(dt2, YD) + fill_gap_days
    occupancy[pos1:pos2] = 1
Once we have the consolidated intervals in the occupancy array, we need to read them back out into date intervals, optionally performing 1-sided erosion if we'd previously done gap filling.
def get_occupancy_intervals(OCC, fill_gap_days=0):
    """
    Find the runs in the OCC array corresponding
    to the 'dilated' consecutive positions, and then
    'erode' back to the correct end dates by subtracting
    the fill_gap_days.
    """
    starts = np.flatnonzero(np.diff(OCC) > 0)  # where runs of nonzeros start
    ends = np.flatnonzero(np.diff(OCC) < 0)    # where runs of nonzeros end
    ends -= fill_gap_days  # erode back to original length prior to dilation
    return [(s, e) for (s, e) in zip(starts, ends)]
Putting it all together...
def morphology_method(in_data=SAMPLE_DATA, person='1', fill_gap_days=31, printit=False):
    date_format_str = "%m/%d/%Y"
    dat = np.array(in_data)
    dat = dat[dat[:, 0] == person, 1:]  # just this person's data

    # for the intervals of this person, get starting and ending years
    # we assume the data is already sorted
    #start_year = datetime.strptime(dat[0, 0], date_format_str)
    #end_year = datetime.strptime(dat[-1, 1], date_format_str)
    start_times = [datetime.strptime(d, date_format_str) for d in dat[:, 0]]
    end_times = [datetime.strptime(d, date_format_str) for d in dat[:, 1]]
    start_year = start_times[0].year
    end_year = end_times[-1].year

    # create the occupancy array, dilated so that each interval
    # is extended by fill_gap_days to 'fill in' the small gaps
    # between intervals
    YD, OCC = make_occupancy_array(start_year, end_year)
    for (s, e) in zip(start_times, end_times):
        set_occupancy(s, e, YD, OCC, fill_gap_days)

    # return the intervals from OCC after having filled gaps,
    # and trim end dates back to original position.
    consolidated_pos = get_occupancy_intervals(OCC, fill_gap_days)

    # map positions back to date-times
    consolidated_ranges = [(map_position_to_date(s, YD), map_position_to_date(e, YD)) for
                           (s, e) in consolidated_pos]

    if printit:
        print_it(person, consolidated_ranges, date_format_str)
    return consolidated_ranges
09/30/2011 + 32 days = 11/01/2011, so your example doesn't work. You probably meant 31 days or less.
When working with dates in python, you can use datetime and timedelta from the datetime module. Use strptime and strftime to convert from/to strings like '09/01/2011'.
I prefer to convert everything to datetime's at the beginning, do all the date related processing, then convert back to date strings at the end if needed.
from datetime import datetime, timedelta
PERSON_ID = 0
START_DATE = 1
END_DATE = 2

def consolidate(records, maxgap=timedelta(days=31)):
    consolidated = []
    consolidated_start = records[0][START_DATE]
    consolidated_end = records[0][END_DATE]
    for person_id, start_date, end_date in records:
        if start_date <= consolidated_end + maxgap:
            consolidated_end = end_date
        else:
            consolidated.append([person_id, consolidated_start, consolidated_end])
            consolidated_start = start_date
            consolidated_end = end_date
    else:
        # the for loop's else clause: save the final open interval once the loop finishes
        consolidated.append([person_id, consolidated_start, consolidated_end])
    return consolidated
fmt = "%m/%d/%Y"
records = [[id, datetime.strptime(start, fmt), datetime.strptime(end, fmt)] for id, start, end in person_records]
records = consolidate(records)
records = [[id, start.strftime(fmt), end.strftime(fmt)] for id, start, end in records]
EDIT: Here's a version of consolidate() using connected_components:
import numpy as np
from scipy.sparse.csgraph import connected_components

def consolidate(records, maxgap=32):
    person_id = records[0][0]
    dates = np.array([[rec[1].date(), rec[2].date()] for rec in records], dtype='datetime64')
    start_dates, end_dates = dates.T
    gaps = start_dates[1:] - end_dates[:-1]
    conns = np.diagflat(gaps < np.timedelta64(maxgap, 'D'), 1)
    num_comps, comps = connected_components(conns)
    return [[person_id,
             min(start_dates[comps == i]).astype(datetime),
             max(end_dates[comps == i]).astype(datetime)
            ] for i in range(num_comps)
           ]
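A minimal usage sketch (my addition, not part of the answer), assuming the records have already been converted to datetimes as in the earlier snippet and are sorted so that each person's rows are contiguous:
from itertools import groupby
from operator import itemgetter

consolidated = []
for pid, recs in groupby(records, key=itemgetter(PERSON_ID)):
    consolidated.extend(consolidate(list(recs)))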
I want to draw a chart in my Python application, but the source numpy array is too large to do this directly (about 1,000,000+ points). I want to take the mean value of neighbouring elements. My first idea was to do it C++-style:
step = 19000   # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
cur = 0
res = []
while cur < len(index):
    next = cur
    while next < len(index) and index[next] == index[cur]:
        next += 1
    res.append(np.mean(value[cur:next]))
    cur = next
but this solution works very slowly. I then tried this:
step = 19000   # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
data = np.arange(index[0], index[-1] + 1, step)
res = [value[index == i].mean() for i in data]
pass
This solution is slower than the first one. What is the best solution for this problem?
np.histogram can provide sums over arbitrary bins. If you have time series, e.g.:
import numpy as np
data = np.random.rand(1000) # Random numbers between 0 and 1
t = np.cumsum(np.random.rand(1000)) # Random time series, from about 1 to 500
then you can calculate the binned sums across 5 second intervals using np.histogram:
t_bins = np.arange(0., 500., 5.) # Or whatever range you want
sums = np.histogram(t, t_bins, weights=data)[0]
If you want the mean rather than the sum, remove the weights and divide by the bin tallies:
means = sums / np.histogram(t, t_bins)[0]
This method is similar to the one in this answer.
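Applied to the question's setup, the same idea might look like this (a sketch; dt, value and step are the names from the question, and empty bins are guarded against):
bins = np.arange(dt[0], dt[-1] + step, step)        # bin edges every `step` time units
sums = np.histogram(dt, bins, weights=value)[0]     # sum of values per bin
counts = np.histogram(dt, bins)[0]                  # number of samples per bin
means = sums / np.maximum(counts, 1)                # mean per bin, avoiding division by zero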