Reducing numpy array for drawing chart - python

I want to draw chart in my python application, but source numpy array is too large for doing this (about 1'000'000+). I want to take mean value for neighboring elements. The first idea was to do it in C++-style:
step = 19000 # every 19 seconds (for example) make new point with neam value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
cur = 0
res = []
while cur < len(index):
next = cur
while next < len(index) and index[next] == index[cur]:
next += 1
res.append(np.mean(value[cur:next]))
cur = next
but this solution works very slow. I tried to do like this:
step = 19000 # every 19 seconds (for example) make new point with neam value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
data = np.arange(index[0], index[-1] + 1, step)
res = [value[index == i].mean() for i in data]
pass
This solution is slower than the first one. What is the best solution for this problem?

np.histogram can provide sums over arbitrary bins. If you have time series, e.g.:
import numpy as np
data = np.random.rand(1000) # Random numbers between 0 and 1
t = np.cumsum(np.random.rand(1000)) # Random time series, from about 1 to 500
then you can calculate the binned sums across 5 second intervals using np.histogram:
t_bins = np.arange(0., 500., 5.) # Or whatever range you want
sums = np.histogram(t, t_bins, weights=data)[0]
If you want the mean rather than the sum, remove the weights and use the bin tallys:
means = sums / np.histogram(t, t_bins)][0]
This method is similar to the one in this answer.

Related

Efficient time series sliding window function

I am trying to create a sliding window for a time series. So far I have a function that I managed to get working that lets you take a given series, set a window size in seconds and then create a rolling sample. My issue is that it is taking very long to run and seems like an inefficient approach.
# ========== create dataset =========================== #
import pandas as pd
from datetime import timedelta, datetime
timestamp_list = ["2022-02-07 11:38:08.625",
"2022-02-07 11:38:09.676",
"2022-02-07 11:38:10.084",
"2022-02-07 11:38:10.10000",
"2022-02-07 11:38:11.2320"]
bid_price_list = [1.14338,
1.14341,
1.14340,
1.1434334,
1.1534334]
df = pd.DataFrame.from_dict(zip(timestamp_list, bid_price_list))
df.columns = ['timestamp','value']
# make date time object
df.timestamp = [datetime.strptime(time_i, "%Y-%m-%d %H:%M:%S.%f") for time_i in df.timestamp]
df.head(3)
timestamp value timestamp_to_sec
0 2022-02-07 11:38:08.625 1.14338 2022-02-07 11:38:08
1 2022-02-07 11:38:09.676 1.14341 2022-02-07 11:38:09
2 2022-02-07 11:38:10.084 1.14340 2022-02-07 11:38:10
# ========== create rolling time-series function ====== #
# get the floor of time (second value)
df["timestamp_to_sec"] = df["timestamp"].dt.floor('s')
# set rollling window length in seconds
window_dt = pd.Timedelta(seconds=2)
# containers for rolling sample statistics
n_list = []
mean_list = []
std_list =[]
# add dt (window) seconds to the original time which was floored to the second
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"] + window_dt
# get unique end times
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)
# remove end times that are greater than the last actual time, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]
# loop running the sliding window (time_i is the end time of each window)
for time_i in time_unique_endlist:
# start time of each rolling window
start_time = time_i - window_dt
# sample for each time period of sliding window
rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]
# calculate the sample statistics
n_list.append(len(rolling_sample)) # store n observation count
mean_list.append(rolling_sample.mean()) # store rolling sample mean
std_list.append(rolling_sample.std()) # store rolling sample standard deviation
# plot histogram for each sample of the rolling sample
#plt.hist(rolling_sample.value, bins=10)
# tested and n_list brought back the correct values
>>> n_list
[2,3]
Is there a more efficient way of doing this, a way I could improve my interpretation or an open-source package that allows me to run a rolling window like this? I know that there is the .rolling() in pandas but that rolls on the values. I want something that I can use on unevenly-spaced data, using the time to define the fixed rolling window.
It seems like this is the best performance, hope it helps anyone else.
# set rollling window length in seconds
window_dt = pd.Timedelta(seconds=2)
# add dt seconds to the original timestep
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"] + window_dt
# unique end time
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)
# remove end values that are greater than the last actual value, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]
# containers for rolling sample statistics
mydic = {}
counter = 0
# loop running the rolling window
for time_i in time_unique_endlist:
start_time = time_i - window_dt
# sample for each time period of sliding window
rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]
# calculate the sample statistics
mydic[counter] = {
"sample_size":len(rolling_sample),
"sample_mean":rolling_sample["value"].mean(),
"sample_std":rolling_sample["value"].std()
}
counter = counter + 1
# results in a DataFrame
results = pd.DataFrame.from_dict(mydic).T

Vectorize operation on dataframe where I need to subset another dataframe (pearson correlation)

What's the best way to do an operation on a dataframe that, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe has the similarity between every to pairs of items. For starters, I'll assume every similarity as zero and calculate the correct similarity later.
import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance
items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))},
index = item_item_idx
)
My next dataframe has the rating every user gave for every item. For sake of simplification, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1,2,3,4,5]
ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
{'rating': np.random.randint(low = 1, high = 6, size = len(users)*len(items))},
columns = ['rating'],
index = ratings_idx
)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select to from rating_df the vector of ratings for each item, and calculate the cosine distance between those two.
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
def similarity(ii, iu):
for index, row in ii.iterrows():
v = iu.loc[index[0]]
u = iu.loc[index[1]]
row['similarity'] = distance.cosine(v, u)
return(ii)
import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.01002s to run this. In problem with 10k items, I estimate it would take in th ballpark of 20 hours to run. Not good.
The thing is, I'm iterating over rows, my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I did so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
v = rating_df.loc[idx[0]]
u = rating_df.loc[idx[1]]
return distance.cosine(v, u)
start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961s to execute. Slower than just iterating over rows.
So this was a naive attempt to vectorize. Is it even possible to do? What other options I have to improve the runtime?
Thanks for the attention.
For your given example I'd just pivot it into an array and move on with my life.
from sklearn.metrics.pairwise import cosine_similarity
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
index=rating_df.index, columns=rating_df.index)
>>> cs_df
item 1 2 3 4
item
1 1.000000 0.877346 0.660529 0.837611
2 0.877346 1.000000 0.608781 0.852029
3 0.660529 0.608781 1.000000 0.758098
4 0.837611 0.852029 0.758098 1.000000
This would be more difficult with a giant, highly-sparse array. Sklearn cosine_similarity takes sparse arrays though so as long as your number of items is reasonable (since the output matrix will be dense) this should be solvable.
Same thing but different. Work with numpy arrays. Fine for small arrays but with 10k rows you'll have some large arrays.
import numpy as np
data = rating_df.unstack().values # shape (4,5)
udotv = np.dot(data,data.T) # shape (4,4)
mag_data = np.linalg.norm(data,axis=1)
mag = mag_data * mag_data[:,None]
cos_sim = 1 - (udotv / mag)
df['sim2'] = cos_sim.flatten()
4k users and 14k items pretty much blows up my poor computer. I'm going to have to look how sklearn.metrics.pairwise.cosine_similarity handles that large data.

daily data, resample every 3 days, calculate over trailing 5 days efficiently

consider the df
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect something that looks like this
this was edited
what I had was incorrect. #ivan_pozdeev and #boud noticed this was a centered window and that was not my intention. Appologies for the confusion.
everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
the df you gave us is :
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
you could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this. overall this should be relatively time efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
# Extract the index names and values
vals = df.A.values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
# Perform bin based summing and subtract the repeated ones
IDsums = np.bincount(id_arr,vals)
allsums = IDsums[:-1] + IDsums[1:]
allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
# Extract the index names and values
indx = df.index.values
# Extract IDs for bin based summing
mask = np.append(False,indx[1:] > indx[:-1])
date_id = mask.cumsum()
# Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern
# Pad with 0 and length of array at either ends as we use diff later on
shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
search_id = np.hstack((0,shiftIDs,date_id[-1]+1))
# Find the start of those shifting indices
# Generate ID based on shifts and do bin based summing of dataframe
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
IDsums = np.bincount(id_arr,df.A.values)
# Sum each group of 3 elems with a stride of 2, make dataframe if needed
allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]
# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
f = c.size+S-W
out = c[:f:S].copy()
for i in range(1,W):
out += c[i:f+i:S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
dt = df.index.values
indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
WL = ((indx[-1]+1)//S)
c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
freq_str = str(S)+'D'
grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
vals = df.A.values
N = (df.shape[0]-W+2*S-1)//S
n = vals.strides[0]
out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
strides=(S*n,n)).sum(1)
index_idx = (W-1)+S*np.arange(N)
out_index = df.index[index_idx]
return pd.DataFrame(out,index=out_index,columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
index=0
currentsum=0
while index<lim_index:
for _ in range(Mp):
#np.take is awkward, requiring a full list of indices to take
for i in range(currentsum,currentsum+nsums-1):
sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Note that it produces the first sum at kth element, not nth (this can be changed but by sacrificing elegance - a number of dummy iterations before the main loop - and is more elegantly done by prepending data with extra zeros and discarding a number of first sums)
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
parallelizing between any number or workers by splitting the data is as easy (if n>k, chunks after the first one should be fed extra elements at the start)
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns (click the image to see it full-size).
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data,n,k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
lim_index=len(data)-k+1
nsums = int(np.ceil(n/k))
sums = np.zeros(nsums,dtype=data.dtype)
M=n%k
Mp=k-M
RM=range(M) #cache for efficiency
RMp=range(Mp) #cache for efficiency
index=0
currentsum=0
currentsum_ranges=[range(currentsum,currentsum+nsums-1)
for currentsum in range(nsums)] #cache for efficiency
while index<lim_index:
for _ in RMp:
#np.take is unusable as it allocates another array rather than view
for i in currentsum_ranges[currentsum]:
sums[i%nsums]+=data[index]
index+=1
for _ in RM:
sums+=data[index]
index+=1
yield sums[currentsum]
currentsum=(currentsum+1)%nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
"""Iterate over `data', computing sums of `n' trailing elements
for each `k'th element.
#type data: 1-d ndarray
#param n: number of trailing elements to sum up
#param k: interval with which to calculate sums
"""
if not data.ndim==1: raise TypeError("One-dimensional array required")
cdef int lim_index=data.size-k+1
cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
cdef int rindex = 0
cdef int nsums = int(np.ceil(float(n)/k))
cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)
#optional speedup for dtype=np.int
cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
cdef int[:] cdata = data
cdef int[:] csums = sums
cdef int[:] cresult = result
cdef int M=n%k
cdef int Mp=k-M
cdef int index=0
cdef int currentsum=0
cdef int _,i
while index<lim_index:
for _ in range(Mp):
#np.take is unusable as it allocates another array rather than view
for i in range(currentsum,currentsum+nsums-1):
if use_int_buffer: csums[i%nsums]+=cdata[index] #optional speedup
else: sums[i%nsums]+=data[index]
index+=1
for _ in range(M):
if use_int_buffer:
for i in range(nsums): csums[i]+=cdata[index] #optional speedup
else: sums+=data[index]
index+=1
if use_int_buffer: cresult[rindex]=csums[currentsum] #optional speedup
else: result[rindex]=sums[currentsum]
currentsum=(currentsum+1)%nsums
rindex+=1
return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
ret = np.cumsum(a.values)
ret[n:] = ret[n:] - ret[:-n]
return pd.DataFrame( ret[n-1:][-1::-k][::-1],
index=a[n-1:][-1::-k][::-1].index )
rolling_sum(df,n=6,k=4) # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
this isn't quite perfect yet, but I've gotta go make fake blood for a haloween party tonight... you should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. it doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
for i in range(max(normtime)+1):
yield np.searchsorted(normtime, i) #yield beginning of window index
def sumwindow(mat,i , width): #i is the start of the window returned by yieldstep
normtime = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
j = np.searchsorted(normtime, i, side='right')-1 #find the right side of the window
#return rightmost timestamp of window in seconds from unix epoch and sum of window
return mat[0,j], mat[1,j] - mat[1,i] #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0

Student - np.random.choice: How to isolate and tally hit frequency within a np.random.choice range

Currently learning Python and very new to Numpy & Panda
I have pieced together a random generator with a range. It uses Numpy and I am unable to isolate each individual result to count the iterations within a range within my random's range.
Goal: Count the iterations of "Random >= 1000" and then add 1 to the appropriate cell that correlates to the tally of iterations. Example in very basic sense:
#Random generator begins... these are first four random generations
Randomiteration0 = 175994 (Random >= 1000)
Randomiteration1 = 1199 (Random >= 1000)
Randomiteration2 = 873399 (Random >= 1000)
Randomiteration3 = 322 (Random < 1000)
#used to +1 to the fourth row of column A in CSV
finalIterationTally = 4
#total times random < 1000 throughout entire session. Placed in cell B1
hits = 1
#Rinse and repeat to custom set generations quantity...
(The logic would then be to +1 to A4 in the spreadsheet. If the iteration tally would have been 7, then +1 to the A7, etc. So essentially, I am measuring the distance and frequency of that distance between each "Hit")
My current code includes a CSV export portion. I do not need to export each individual random result any longer. I only need to export the frequency of each iteration distance between each hit. This is where I am stumped.
Cheers
import pandas as pd
import numpy as np
#set random generation quantity
generations=int(input("How many generations?\n###:"))
#random range and generator
choices = range(1, 100000)
samples = np.random.choice(choices, size=generations)
#create new column in excel
my_break = 1000000
if generations > my_break:
n_empty = my_break - generations % my_break
samples = np.append(samples, [np.nan] * n_empty).reshape((-1, my_break)).T
#export results to CSV
(pd.DataFrame(samples)
.to_csv('eval_test.csv', index=False, header=False))
#left uncommented if wanting to test 10 generations or so
print (samples)
I believe you are mixing up iterations and generations. It sounds like you want 4 iterations for N numbers of generations, but your bottom piece of code does not express the "4" anywhere. If you pull all your variables out to the top of your script it can help you organize better. Panda is great for parsing complicated csvs, but for this case you don't really need it. You probably don't even need numpy.
import numpy as np
THRESHOLD = 1000
CHOICES = 10000
ITERATIONS = 4
GENERATIONS = 100
choices = range(1, CHOICES)
output = np.zeros(ITERATIONS+1)
for _ in range(GENERATIONS):
samples = np.random.choice(choices, size=ITERATIONS)
count = sum([1 for x in samples if x > THRESHOLD])
output[count] += 1
output = map(str, map(int, output.tolist()))
with open('eval_test.csv', 'w') as f:
f.write(",".join(output)+'\n')

Generalizing a method in Python

I'm trying to create a method that an array I have of wind and demand averages through out the day in 15 minute intervals for 1 year. I want to turn this into a daily average for each of the 365 days. Here's what I have so far:
dailyAvg = [] # this creates the initial empty array for method below
def createDailyAvg(p): # The method
i = 0
while i < 35140: # I have 35140 data points in my array
dailyAvg.append(np.mean(p[i:i+95])) #Creates the avg of the 96, 15 minute avg
i += 95
return dailyAvg
dailyAvgWind = createDailyAvg(Wind) # Wind is the array of 15 minute avg's.
dailyAvgDemand = createDailyAvg(M) # M is the array of demand avg's
So far I can get this done if I write this twice, but that's not good programming. I want to figure out how I can use this one method on both data sets. Thanks.
You just need to make dailyAvg local to the function. This way, it will be initialized to an empty list every time the function executes (I bet the problem was that the function's result grew and grew and grew, appending the new averages, but not deleting the previous ones)
def createDailyAvg(p): # The method
dailyAvg = [] # this creates the initial empty array for this method below
i = 0
while i < 35140: # I have 35140 data points in my array
dailyAvg.append(np.mean(p[i:i+96])) #Creates the avg of the 96, 15 minute avg
i += 96
return dailyAvg
dailyAvgWind = createDailyAvg(Wind) # Wind is the array of 15 minute avg's.
dailyAvgDemand = createDailyAvg(M) # M is the array of demand avg's
Also, I replaced 95 with 96 in two places, as the end of a slice excludes the specified end.
def createDailyAvg(w,m): # The method
dailyAvg = [[],[]] # this creates the initial empty array for method below
i = 0
while i < 35140: # I have 35140 data points in my array
dailyAvg[0].append(np.mean(w[i:i+95])) #Creates the avg of the 96, 15 minute avg
dailyAvg[1].append(np.mean(m[i:i+95]))
i += 95
return dailyAvg
dailyAvg = createDailyAvg(Wind,M)
dailyAvgWind = dailyAvg[0] # Wind is the array of 15 minute avg's.
dailyAvgDemand = dailyAvg[1] # M is the array of demand avg's

Categories

Resources