pyfinance module: rolling OLS with a min_window option - Python

I'm trying to do a simple linear regression with the pyfinance package, using PandasRollingOLS to get rolling regression betas.
It works, but I would like the function to offer a min_window option.
I would like a min_window in the rolling OLS function because with a window of 90 it does not produce any OLS estimates for the first 90 values. I would like to perform an expanding OLS, starting once there are at least 12 observations (min_window) and growing until there are 90 observations, and from then on roll with a window of 90.
I tried to understand the code of the package, but I was not able to work min_window into it.
I would like this kind of function (this is the __init__ of the PandasRollingOLS class, with the new min_window parameter added):
def __init__(self, y, x=None, window=None, min_window=None, has_const=False, use_const=True):
I think I would need to update the code of utils.rolling_windows, posted below. Can someone help me, please?
import numpy as np
from pandas import DataFrame, Series


def rolling_windows(a, window):
    """Creates rolling-window 'blocks' of length `window` from `a`.

    Note that the orientation of rows/columns follows that of pandas.

    Example
    -------
    import numpy as np
    onedim = np.arange(20)
    twodim = onedim.reshape((5, 4))

    print(twodim)
    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]
     [12 13 14 15]
     [16 17 18 19]]

    print(rolling_windows(onedim, 3)[:5])
    [[0 1 2]
     [1 2 3]
     [2 3 4]
     [3 4 5]
     [4 5 6]]

    print(rolling_windows(twodim, 3)[:5])
    [[[ 0  1  2  3]
      [ 4  5  6  7]
      [ 8  9 10 11]]

     [[ 4  5  6  7]
      [ 8  9 10 11]
      [12 13 14 15]]

     [[ 8  9 10 11]
      [12 13 14 15]
      [16 17 18 19]]]
    """
    if window > a.shape[0]:
        raise ValueError('Specified `window` length of {0} exceeds length of'
                         ' `a`, {1}.'.format(window, a.shape[0]))
    if isinstance(a, (Series, DataFrame)):
        a = a.values
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    shape = (a.shape[0] - window + 1, window) + a.shape[1:]
    strides = (a.strides[0],) + a.strides
    windows = np.squeeze(np.lib.stride_tricks.as_strided(a, shape=shape,
                                                         strides=strides))
    # In cases where window == len(a), we actually want to "unsqueeze" to 2d.
    # I.e., we still want a "windowed" structure with 1 window.
    if windows.ndim == 1:
        windows = np.atleast_2d(windows)
    return windows
Thank you all!
Alessandro

I am struggling with this myself at the moment using PandasRollingOLS. My temporary conclusion is to simply take care of it before the regression, i.e. to drop every column that never accumulates min_window non-NaN values in a row before running the regressions:
min_window = 3
df.loc[:, ~(df.rolling(min_window).count() < min_window).all()]
Note that this requires that your DataFrame has NaNs (which is, I guess, why you want a min_window):
NaN  NaN
0.5  NaN
0.8  NaN
0.7  0.5
0.6  0.4
This might be a temporary (ugly) solution until a Python guru stumbles upon your post.
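As a side note: if switching libraries is an option, statsmodels has a rolling OLS that supports exactly this expanding-until-window behaviour. Below is a minimal sketch, assuming a reasonably recent statsmodels in which RollingOLS accepts min_nobs and expanding arguments (check the keyword names against your installed version), with made-up example data:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# Made-up data: y regressed on a single factor x.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200), name='x')
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# expanding=True estimates on growing windows until `window` observations
# are available; min_nobs delays the first estimate until 12 observations.
model = RollingOLS(y, sm.add_constant(x), window=90, min_nobs=12,
                   expanding=True)
res = model.fit()
print(res.params.head(15))  # NaN before row 12, then expanding estimates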

Related

How to efficiently subtract values from each column with numpy

I have a 2D array of shape (50, 50). I need to subtract a value from each column of this array (skipping the first), where the value is calculated based on the index of the column. For example, using a for loop it would look something like this:
for idx in range(1, A[0, :].shape[0]):
    A[0, idx] -= idx * (...)  # simple calculations with idx
Now, of course, this works fine, but it's very slow, and performance is critical for my application. I've tried computing the values to be subtracted with np.fromfunction() and then subtracting them from the original array, but the results differ from those obtained by the iterative for-loop subtraction:
func = lambda i, j: j * (...)  # some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (1, 50))
A[0, 1:] -= subtraction_matrix
What am I doing wrong? Or is there some other method that would be better? Any help is appreciated!
All your code snippets indicate that you require the subtraction to happen only in the first row of A (though you've not explicitly mentioned that), so I'm proceeding with that understanding.
Referring to your use of np.fromfunction(), you can use the subtraction_matrix as below:
A[0, 1:] -= subtraction_matrix[1:]
Testing it out (assuming shape (5, 5) instead of (50, 50)):
import numpy as np

A = np.arange(25).reshape(5, 5)
print(A)

func = lambda j: j * 10  # some simple calculation
subtraction_matrix = np.fromfunction(np.vectorize(func), (5,), dtype=A.dtype)
A[0, 1:] -= subtraction_matrix[1:]
print(A)
Output:
[[ 0  1  2  3  4]   # print(A), before subtraction
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

[[  0  -9 -18 -27 -36]   # print(A), after subtraction
 [  5   6   7   8   9]
 [ 10  11  12  13  14]
 [ 15  16  17  18  19]
 [ 20  21  22  23  24]]
If you want the subtraction to happen in all rows of A, just use the line A[:, 1:] -= subtraction_matrix[1:] instead of A[0, 1:] -= subtraction_matrix[1:].
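For this particular pattern, np.fromfunction plus np.vectorize is not strictly needed either; here is a sketch of the same first-row subtraction built directly from np.arange (keeping the hypothetical j * 10 rule from above):
import numpy as np

A = np.arange(25).reshape(5, 5)
# Column indices 1..4, each scaled by the (hypothetical) rule j * 10.
A[0, 1:] -= np.arange(1, 5) * 10
print(A)  # same result as above: first row becomes [0 -9 -18 -27 -36]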

Using generator items selectively

Let's say I have some arrays/lists that contain a lot of values, which means that loading several of them into memory at once would result in a memory error. One way to circumvent this is to produce the arrays/lists with a generator, and then use them when needed. However, with generators you don't have as much control as with arrays/lists, and that is my problem.
Let me explain.
As an example I have the following code, which produces a generator with some small lists. So yeah, this is not memory intensive at all, just an example:
import numpy as np
np.random.seed(10)
number_of_lists = range(0, 5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
If I iterate over this list I get the following:
for i in generator_list:
    print(i)
>> [9 4 0 1 9 0 1 8 9 0]
>> [8 6 4 3 0 4 6 8 1 8]
>> [4 1 3 6 5 3 9 6 9 1]
>> [9 4 2 6 7 8 8 9 2 0]
>> [6 7 8 1 7 1 4 0 8 5]
What I would like to do is sum element wise for all the lists (axis = 0). So the above should in turn result in:
[36, 22, 17, 17, 28, 16, 28, 31, 29, 14]
To do this I could use the following, where 10 is the length of one of the lists:
total = np.zeros(10, dtype=int)  # a plain list would be extended, not summed, by `+=` with an array
for i in generator_list:
    total += i
So far so good. I am not sure if there is a better/more optimized way of doing it, but it works.
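(For what it's worth, here is a minimal sketch of the same element-wise sum written as a reduction over the generator, assuming all rows have equal length:)
import functools
import numpy as np

np.random.seed(10)
generator_list = (np.random.randint(0, 10, 10) for i in range(5))

# functools.reduce applies np.add pairwise, consuming the generator once.
total = functools.reduce(np.add, generator_list)
print(total)  # [36 22 17 17 28 16 28 31 29 14]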
My problem is that I would like to determine which lists in the generator_list I want to use. For example, what if I wanted to sum the first list [index 0] twice, the third once, and the last twice, i.e.:
[9 4 0 1 9 0 1 8 9 0]
[9 4 0 1 9 0 1 8 9 0]
[4 1 3 6 5 3 9 6 9 1]
[6 7 8 1 7 1 4 0 8 5]
[6 7 8 1 7 1 4 0 8 5]
>> [34, 23, 19, 10, 37, 5, 19, 22, 43, 11]
How would I go about doing that ?
And before any questions arise about why I want to do it this way: in my real case, getting the arrays into the generator takes some time. I could in principle just build a new generator with the lists in the desired order, but again, that would mean waiting for them. And if that has to happen thousands of times (as with bootstrapping), it would take some time. With the first generator I have ALL available lists; I just wish to use them selectively, so I don't have to create a new generator every time I want to mix it up and sum a new set of arrays/lists.
import numpy as np

np.random.seed(10)
number_of_lists = range(5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)

indices = [0, 0, 2, 4, 4]
assert sorted(indices) == indices, "only works for a sorted index list"

# sum_ = [0] * 10
# I prefer this:
sum_ = np.zeros((10,), dtype=int)

generator_index = -1
for index in indices:
    while generator_index < index:
        vector = next(generator_list)
        generator_index += 1
    sum_ += vector
print(sum_)
outputs
[34 23 19 10 37 5 19 22 43 11]
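If the indices cannot be assumed to be sorted, one variation (a sketch under the same setup as above) is to count the multiplicity of each index first and then walk the generator once:
import numpy as np
from collections import Counter

np.random.seed(10)
generator_list = (np.random.randint(0, 10, 10) for i in range(5))

counts = Counter([4, 0, 2, 0, 4])  # order of the indices no longer matters
last = max(counts)

sum_ = np.zeros((10,), dtype=int)
for i, vector in enumerate(generator_list):
    if i in counts:
        sum_ += counts[i] * vector  # add each list as many times as requested
    if i == last:
        break  # stop consuming the generator as early as possible
print(sum_)  # [34 23 19 10 37  5 19 22 43 11]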

Downsample array in Python

I have basic 2-D numpy arrays and I'd like to "downsample" them to a coarser resolution. Is there a simple numpy or scipy module that can easily do this? I should also note that this array is being displayed geographically via Basemap modules.
scikit-image has a working implementation of downsampling, although they shy away from calling it downsampling since it is not downsampling in the DSP sense, if I understand correctly:
http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.block_reduce
It works very well, and it is the only downsampler I found in Python that can deal with np.nan in the image. I have downsampled gigantic images with it very quickly.
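Here is a minimal sketch of the kind of call involved (block_size and func are block_reduce's parameters; passing np.nanmean as func is what makes it tolerate np.nan):
import numpy as np
from skimage.measure import block_reduce

img = np.arange(16, dtype=float).reshape(4, 4)
img[0, 0] = np.nan

# Reduce each 2x2 block to its NaN-aware mean -> a (2, 2) result.
small = block_reduce(img, block_size=(2, 2), func=np.nanmean)
print(small)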
When downsampling, interpolation is the wrong thing to do. Always use an aggregated approach.
I use block means to do this, using a "factor" to reduce the resolution.
import numpy as np
from scipy import ndimage
def block_mean(ar, fact):
    assert isinstance(fact, int), type(fact)
    sx, sy = ar.shape
    X, Y = np.ogrid[0:sx, 0:sy]
    regions = sy // fact * (X // fact) + Y // fact
    res = ndimage.mean(ar, labels=regions, index=np.arange(regions.max() + 1))
    res.shape = (sx // fact, sy // fact)
    return res
E.g., a (100, 200)-shaped array downsampled with a factor of 5 (5x5 blocks) gives a (20, 40) result:
ar = np.random.rand(20000).reshape((100, 200))
block_mean(ar, 5).shape # (20, 40)
imresize and ndimage.interpolation.zoom look like they do what you want.
I haven't tried imresize before, but here is how I have used ndimage.interpolation.zoom:
a = np.arange(64).reshape(8, 8)
a = ndimage.interpolation.zoom(a, .5)  # decimate resolution
a is then a 4x4 matrix with interpolated values in it.
Easiest way:
You can use the array[0::2] notation, which only considers every second index. E.g.:
array = np.array([[i + j for i in range(0, 10)] for j in range(0, 10)])
down_sampled = array[0::2, 0::2]
print("array\n", array)
print("array2\n", down_sampled)
has the output:
array
[[ 0 1 2 3 4 5 6 7 8 9]
[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]
[ 5 6 7 8 9 10 11 12 13 14]
[ 6 7 8 9 10 11 12 13 14 15]
[ 7 8 9 10 11 12 13 14 15 16]
[ 8 9 10 11 12 13 14 15 16 17]
[ 9 10 11 12 13 14 15 16 17 18]]
array2
[[ 0 2 4 6 8]
[ 2 4 6 8 10]
[ 4 6 8 10 12]
[ 6 8 10 12 14]
[ 8 10 12 14 16]]
Because the OP just wants a coarser resolution, I thought I would share my way of reducing the number of pixels by half in each dimension. It takes the mean of 2x2 blocks, and can be applied multiple times to reduce by further factors of 2.
import numpy as np
from scipy.ndimage import convolve

array = np.random.rand(8, 8)  # example input
array_downsampled = convolve(array,
                             np.array([[0.25, 0.25],
                                       [0.25, 0.25]]))[:array.shape[0]:2, :array.shape[1]:2]
xarray's coarsen method can downsample an xarray.Dataset or xarray.DataArray:
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.coarsen.html
http://xarray.pydata.org/en/stable/computation.html#coarsen-large-arrays
For example:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))
# Create a 10x10 array of random numbers
a = xr.DataArray(np.random.rand(10,10)*100, dims=['x', 'y'])
# "Downscale" the array, mean of blocks of size (2x2)
b = a.coarsen(x=2, y=2).mean()
# "Downscale" the array, mean of blocks of size (5x5)
c = a.coarsen(x=5, y=5).mean()
# Plot and cosmetics
a.plot(ax=ax1)
ax1.set_title("Full Data")
b.plot(ax=ax2)
ax2.set_title("mean of (2x2) boxes")
c.plot(ax=ax3)
ax3.set_title("mean of (5x5) boxes")
This might not be what you're looking for, but I thought I'd mention it for completeness.
You could try installing scikits.samplerate (docs), which is a Python wrapper for libsamplerate. It provides nice, high-quality resampling algorithms -- BUT, as far as I can tell, it only works in 1D. You might be able to resample your 2D signal first along one axis and then along the other, but I'd think that might counteract the benefits of high-quality resampling to begin with.
This will take an image of any resolution and return a version a quarter of the size along each axis, by keeping every fourth pixel of the image array.
import cv2
import numpy as np

def quarter_res_drop(im):
    resized_image = im[0::4, 0::4]
    cv2.imwrite('resize_result_image.png', resized_image)
    return resized_image

im = cv2.imread('Your_test_image.png', 1)
quarter_res_drop(im)

Python: Shrink/Extend 2D arrays in fractions

There are 2D arrays of numbers, produced as outputs of some numerical process, in shapes 1x1, 3x3, 5x5, ..., corresponding to different resolutions.
At one stage an average, i.e. a 2D array value of shape nxn, needs to be produced.
If the outputs were of consistent shape, say all 11x11, the solution would be obvious:
element_wise_mean_of_all_arrays.
For the problem of this post, however, the arrays come in different shapes, so the obvious way does not work!
I thought the kron function might be of some help, but it wasn't. For example, if an array has shape 17x17, how do I make it 21x21? And the same for all the others, from 1x1, 3x3, ..., to build a constant-shaped array, say 21x21.
It can also be the case that an array is bigger than the target shape, e.g. a 31x31 array to be shrunk into 21x21.
You can imagine the problem as the very common task of shrinking or enlarging images.
What are possible efficient approaches to doing these jobs on 2D arrays in Python, using numpy, scipy, etc.?
Updates:
Here is a slightly optimized version of the accepted answer below:
def resize(X, shape=None):
    if shape is None:
        return X
    m, n = shape
    Y = np.zeros((m, n), dtype=X.dtype)
    k = len(X)
    p, q = k / m, k / n  # fractional steps for the nearest-neighbour mapping
    for i in range(m):
        Y[i, :] = X[int(i * p), np.int_(np.arange(n) * q)]
    return Y
It works perfectly; however, do you all agree it is the best choice in terms of efficiency? If not, any improvement?
# Expanding ---------------------------------
>>> X = np.array([[1,2,3],[4,5,6],[7,8,9]])
[[1 2 3]
[4 5 6]
[7 8 9]]
>>> resize(X,[7,11])
[[1 1 1 1 2 2 2 2 3 3 3]
[1 1 1 1 2 2 2 2 3 3 3]
[1 1 1 1 2 2 2 2 3 3 3]
[4 4 4 4 5 5 5 5 6 6 6]
[4 4 4 4 5 5 5 5 6 6 6]
[7 7 7 7 8 8 8 8 9 9 9]
[7 7 7 7 8 8 8 8 9 9 9]]
# Shrinking ---------------------------------
>>> X = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
>>> resize(X,(2,2))
[[ 1 3]
[ 9 11]]
A final note: the code above could easily be translated to Fortran for the highest possible performance.
I'm not sure I understand exactly what you are trying to do, but if it is what I think, the simplest way would be:
import numpy

wanted_size = 21
a = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = numpy.zeros((wanted_size, wanted_size))

for i in range(wanted_size):
    for j in range(wanted_size):
        idx1 = i * len(a) // wanted_size
        idx2 = j * len(a) // wanted_size
        b[i][j] = a[idx1][idx2]
You could maybe replace b[i][j] = a[idx1][idx2] with some custom function, like the average of a 3x3 block centered at a[idx1][idx2], or some interpolation function.
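The double loop can also be collapsed into a single fancy-indexing step; here is a sketch of the same nearest-neighbour mapping (checked against the expanding example above):
import numpy as np

def resize_nn(X, shape):
    """Nearest-neighbour resize of a 2D array via fancy indexing."""
    m, n = shape
    rows = (np.arange(m) * X.shape[0] / m).astype(int)
    cols = (np.arange(n) * X.shape[1] / n).astype(int)
    return X[np.ix_(rows, cols)]

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(resize_nn(X, (7, 11)))  # same 7x11 output as resize(X, [7, 11])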

Python: select function

With this code:
import scipy
from scipy import *

x = r_[1:15]
print(x)
a = select([x > 7, x >= 4], [x, x + 10])
print(a)
I get this answer:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[ 0 0 0 14 15 16 17 8 9 10 11 12 13 14]
But why do I have zeros at the beginning and not at the end? Thanks in advance.
You seem to be using numpy.
From the documentation for numpy.select():
numpy.select(condlist, choicelist, default=0)
...
default: The element inserted in output when all conditions evaluate to False.
The conditions are checked in order, and the first condition that is true determines the output: where x > 7 the output takes elements from x; where only x >= 4 holds, it takes elements from x + 10; and where both conditions are false, i.e. where x < 4, you get default, which is 0. So you get 3 zeros at the beginning.
You don't get any zeros at the end because there at least one of the conditions is true (in fact, both are).
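A quick sketch showing the precedence and the default in action:
import numpy as np

x = np.arange(1, 15)
# Conditions are evaluated in order: x > 7 is checked before x >= 4,
# and `default` fills in where neither condition holds.
a = np.select([x > 7, x >= 4], [x, x + 10], default=-1)
print(a)  # [-1 -1 -1 14 15 16 17  8  9 10 11 12 13 14]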
