How to calculate a weighted average on a triangular similarity matrix - python

I have a triangular similarity matrix like this:
[[3, 1, 2, 0],
 [1, 3, 0, 0],
 [1, 0, 0, 0],
 [0, 0, 0, 0]]
How do I calculate a weighted average for each row while discarding the zero elements?

You could sum along the second axis and divide by the number of non-zero values per row. The where argument of np.divide lets you divide only where a condition is satisfied; by setting it to a mask of the rows that have non-zero values, you avoid a division-by-zero error:
import numpy as np

a = np.array([[3, 1, 2, 0],
              [1, 3, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]])
m = (a != 0).sum(1)
np.divide(a.sum(1), m, where=m != 0)
# array([2., 2., 1., 0.])
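One caveat: where the where condition is False, np.divide leaves the corresponding output element uninitialized unless an out array is supplied, so the trailing 0. above is not actually guaranteed. Passing an explicit out makes the result deterministic (same call, just with out added):
np.divide(a.sum(1), m, out=np.zeros(a.shape[0]), where=m != 0)
# array([2., 2., 1., 0.])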

Loop over each row, then loop over each element. When looping over the elements, don't include zeros. If you find only elements which are zero, just add zero (or whatever you want the default value to be) to your list.
weighted_averages = []
for row in matrix:
    total_weight = 0
    number_of_weights = 0
    for element in row:
        if element != 0:
            total_weight += element
            number_of_weights += 1
    if number_of_weights == 0:
        weighted_averages.append(0)
    else:
        weighted_averages.append(total_weight / number_of_weights)
weighted_averages in your case comes back as:
[2.0, 2.0, 1.0, 0]
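The same logic can be condensed with a per-row list comprehension (an equivalent sketch, not the original answer's code):
weighted_averages = []
for row in matrix:
    nonzero = [e for e in row if e != 0]  # drop the zero elements first
    weighted_averages.append(sum(nonzero) / len(nonzero) if nonzero else 0)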

You can use numpy masked arrays to calculate a weighted average that ignores the zeros.
import numpy as np

a = np.array([
    [3, 1, 2, 0],
    [1, 3, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0]
])
weights = np.array([1, 2, 3, 4])
# create a mask where elements are 0
ma = np.ma.masked_equal(a, 0)
# take the masked weighted average
ans = np.ma.average(ma, weights=weights, axis=1)
# fill masked points (fully-zero rows) with 0
ans.filled(0)
Output:
array([1.83333333, 2.33333333, 1.        , 0.        ])
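As a side note, the same masked-array machinery with no weights reproduces the plain per-row average from the question (a quick check, reusing ma from above):
np.ma.average(ma, axis=1).filled(0)
# array([2., 2., 1., 0.])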
Just Python:
ar = [[3, 1, 2, 0],
      [1, 3, 0, 0],
      [1, 0, 0, 0],
      [0, 0, 0, 0]]
weight = [1, 2, 3, 4]
ans = []
for li in ar:
    wa = 0  # weighted sum
    we = 0  # sum of weights
    for index, ele in enumerate(li):
        if ele != 0:
            wa += weight[index] * ele
            we += weight[index]
    if we != 0:
        ans.append(wa / we)
    else:
        ans.append(0)
ans
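ans here evaluates to [1.8333333333333333, 2.3333333333333335, 1.0, 0], matching the masked-array output above.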

Related

Vectorized computation of returns from positions and bid-ask series

I have two series (bids and asks), and a list of positions. All have shape (T, S).
In my example below, I have T=5 timesteps and S=3 symbols.
The positions represent a portfolio allocation for each symbol at each timestep. For example, if I have 5% of asset 1, 10% of asset 2 and 85% of asset 3 at timestep 4, then positions[4] is [0.05, 0.1, 0.85].
When buying (the position in an asset increases from t to t+1), the ask prices should be used. When selling, the bid prices should be used. This is because I am assuming a strategy that only buys/sells with market orders, so I need to "cross the spread" each time.
How can I compute my returns given a list of positions over time, bid prices and ask prices?
For simplicity, all prices are quoted in the first asset (whose price is always 1), and the starting position is [1, 0, 0].
What I would like is a vectorized implementation of the compute_returns_loop() function (getting rid of the for-loop over timesteps).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

T, S = 5, 3
asks = np.array([[1, 1, 1, 1, 1], [3, 3, 3.5, 4, 4], [13, 13, 15, 17, 21]]).T  # (T, S)
bids = np.array([[1, 1, 1, 1, 1], [1, 1, 2.5, 2, 3], [11, 11, 13, 15, 19]]).T  # (T, S)

positions_test_1 = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]])          # (T, S)
positions_test_2 = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]])          # (T, S)
positions_test_3 = np.array([[1, 0, 0], [1, 0, 0], [0.5, 0, 0.5], [0.5, 0, 0.5], [1, 0, 0]])  # (T, S)
positions_test_4 = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]])          # (T, S)

# Quick visualization
plt.plot(asks[:, 2])
plt.plot(bids[:, 2])

# Here "absolute" means in the currency of the respective asset;
# positions are expressed as a ratio of the total portfolio allocation
def compute_returns_loop(positions, asks, bids):
    mids = (asks + bids) / 2
    current_absolute_position = positions[0].astype(float)
    for t in range(1, asks.shape[0]):
        unrealized_worth = (mids[t] * current_absolute_position).sum()
        target_absolute_position = positions[t] / mids[t] * unrealized_worth
        absolute_transactions = target_absolute_position - current_absolute_position
        current_absolute_position[1:] += absolute_transactions[1:]
        cost = np.where(absolute_transactions[1:] > 0, asks[t, 1:], bids[t, 1:])
        current_absolute_position[0] -= (cost * absolute_transactions[1:]).sum()
    return current_absolute_position[0]

def compute_returns_vectorized(positions, asks, bids):
    mids = (asks + bids) / 2
    # TODO
    return None

print(compute_returns_loop(positions_test_1, asks, bids))  # should be 1
print(compute_returns_loop(positions_test_2, asks, bids))  # should be ~1.2
print(compute_returns_loop(positions_test_3, asks, bids))  # should be ~1.1
print(compute_returns_loop(positions_test_4, asks, bids))  # should be 1
# replacing compute_returns_loop with compute_returns_vectorized should give approximately the same results

Python: distance from index to 1s in binary mask

I have a binary mask like this:
X = [[0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 1],
     [0, 0, 1, 1, 1, 1],
     [0, 0, 0, 1, 1, 1]]
I have a certain index in this array and want to compute the distance from that index to the closest 1 in the mask. If there's already a 1 at that index, the distance should be zero.
Examples (assuming Manhattan distance):
distance(X, idx=(0, 5)) == 0 # already is a 1 -> distance is zero
distance(X, idx=(1, 2)) == 2 # second row, third column
distance(X, idx=(0, 0)) == 5 # upper left corner
Is there already existing functionality like this in Python/NumPy/SciPy? Both Euclidean and Manhattan distance would be fine.
I'd prefer to avoid computing distances for the entire matrix (as that is pretty big in my case), and only get the distance for my one index.
Here's one for the Manhattan distance metric for a single entry -
import numpy as np

def bwdist_manhattan_single_entry(X, idx):
    X = np.asarray(X)  # in case X is a plain list of lists
    nz = np.argwhere(X == 1)
    return np.abs((idx - nz).sum(1)).min()
Sample run -
In [143]: bwdist_manhattan_single_entry(X, idx=(0,5))
Out[143]: 0
In [144]: bwdist_manhattan_single_entry(X, idx=(1,2))
Out[144]: 2
In [145]: bwdist_manhattan_single_entry(X, idx=(0,0))
Out[145]: 5
Optimize further for performance by extracting only the boundary elements of the blobs of 1s -
from scipy.ndimage import binary_erosion  # scipy.ndimage.morphology in older SciPy

def bwdist_manhattan_single_entry_v2(X, idx):
    X = np.asarray(X)
    k = np.ones((3, 3), dtype=int)
    nz = np.argwhere((X == 1) & (~binary_erosion(X, k, border_value=1)))
    return np.abs((idx - nz).sum(1)).min()
The number of elements in nz with this method is smaller than with the earlier one, hence the speedup.
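A quick way to sanity-check that claim in IPython (an illustrative sketch only; X_big is a hypothetical larger random mask, and the timings depend on the blob structure):
X_big = (np.random.rand(512, 512) > 0.9).astype(int)
%timeit bwdist_manhattan_single_entry(X_big, (0, 0))
%timeit bwdist_manhattan_single_entry_v2(X_big, (0, 0))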
You can use scipy.ndimage.distance_transform_cdt (exposed as scipy.ndimage.morphology.distance_transform_cdt in older SciPy) to compute the "taxicab" (Manhattan) distance transform:
import numpy as np
import scipy.ndimage

x = np.array([[0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]])
d = scipy.ndimage.distance_transform_cdt(1 - x, 'taxicab')
print(d[0, 5])
# 0
print(d[1, 2])
# 2
print(d[0, 0])
# 5
You can do it like this:
def Manhattan_distance(X, idx):
    return min(abs(i - idx[0]) + abs(j - idx[1])
               for i, row in enumerate(X)
               for j, val in enumerate(row)
               if val == 1)

Searching for vectors within a numpy matrix

Given the matrix ixs of indices below, I am looking for the rows of ixs that are equivalent to a given row ix (itself a row of ixs), except that dimension 1 may take any value and dimension 3 must be 1.
ixs = np.asarray([
    [0, 0, 3, 0, 1],  # 0. current value of `ix`
    [0, 0, 3, 1, 1],  # 1.
    [0, 1, 3, 0, 0],  # 2.
    [0, 1, 3, 0, 1],  # 3.
    [0, 1, 3, 1, 1],  # 4.
    [0, 2, 3, 0, 1],  # 5.
    [0, 2, 3, 1, 1]   # 6.
])
ix = np.asarray([0, 0, 3, 0, 1])
So with ix of [0, 0, 3, 0, 1], I'd be looking at all rows below it (rows 1..6) for the pattern [0, *, 3, 1, 1], i.e. 1. [0, 0, 3, 1, 1], 4. [0, 1, 3, 1, 1], and 6. [0, 2, 3, 1, 1].
What's the best (concise) way to get those vectors?
Here is an easy-to-understand approach using cdist:
We use a weighted Hamming distance between ix and every row of ixs. This distance is 0 if the rows are identical (we use that to double-check that ix is in ixs) and adds a penalty for every difference. We choose the weights such that a difference in position 0, 2 or 4 adds 3/11 while a difference in position 1 or 3 adds 1/11. Later, we keep only vectors with distance < 1/4; this lets through vectors that deviate from ix in positions 1 or 3 (or both) and blocks all others. We then check separately for a 1 in position 3.
from scipy.spatial.distance import cdist

# compute the distance; note that weights are automatically normalized to sum 1
d = cdist([ix], ixs, "hamming", w=[3, 1, 3, 1, 3])[0]
# find ix
ixloc = d.argmin()
# make sure it's exactly ix
assert d[ixloc] == 0
# filter out all rows that differ in col 0, 2 or 4
hits, = ((d < 1/4) & (ixs[:, 3] == 1)).nonzero()
# only keep hits below the row of ix:
hits = hits[hits.searchsorted(ixloc):]
hits
# array([1, 4, 6])
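For comparison, the same rows can be picked out with a plain boolean mask (a sketch, not part of the original answer): match ix exactly in positions 0, 2 and 4, and require a 1 in position 3:
same_024 = (ixs[:, [0, 2, 4]] == ix[[0, 2, 4]]).all(axis=1)
hits, = (same_024 & (ixs[:, 3] == 1)).nonzero()
hits = hits[hits > 0]  # keep only rows below ix, which sits at row 0 here
hits
# array([1, 4, 6])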
This solution only uses numpy (very fast) with several logical operations.
At the end, it gives the matching rows.
ixs = np.matrix([
    [0, 0, 3, 0, 1],  # 0. current value of `ix`
    [0, 0, 3, 1, 1],  # 1.
    [0, 1, 3, 0, 0],  # 2.
    [0, 1, 3, 0, 1],  # 3.
    [0, 1, 3, 1, 1],  # 4.
    [0, 2, 3, 0, 1],  # 5.
    [0, 2, 3, 1, 1]   # 6.
])
# copy so that zeroing a column below does not modify ixs itself
newixs = ixs.copy()
# since the second column does not matter, we just assign it 0 in the new matrix
newixs[:, 1] = 0
# compare each row against the 0-indexed row, multiply the True/False values
# by 1 to get 0/1 values, and take the average at the row level;
# if the average is 1, then all values in that row match
mask = ((newixs == newixs[0]) * 1).mean(axis=1) == 1
# convert the matrix to an array for masking
mask = np.squeeze(np.asarray(mask))
# using the mask, we select the matched rows
ixs[mask, :]

matrix([[0, 0, 3, 0, 1],
        [0, 1, 3, 0, 1],
        [0, 2, 3, 0, 1]])

Changing items in numpy array

I want to change items in array B into 0, according to the following criteria based on the rows of A (toy code):
import numpy as np
A = np.array([[1, 3], [2, 5], [6, 2]])
B = np.array([[1, 1, 0, 0, 0],
              [1, 0, 0, 2, 0],
              [0, 0, 2, 2, 2],
              [0, 0, 0, 2, 0],
              [6, 6, 0, 0, 0]])
for i in A:
    if i[1] <= 2:
        B[B == i[0]] = 0
# result
>>> B
array([[1, 1, 0, 0, 0],
       [1, 0, 0, 2, 0],
       [0, 0, 2, 2, 2],
       [0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0]])
But in a numpy way, that is, with NO 'for' loops :) Thanks!
You can use a conditional list comprehension to collect the first value of each pair whose second value is less than or equal to two (for the example A, only the last row qualifies, giving the value 6).
Then use boolean indexing with np.isin to find the elements in B that are contained within the values from the previous condition, and set those values to zero.
target_val = 2
B[np.isin(B, [a[0] for a in A if a[1] <= target_val])] = 0
>>> B
array([[1, 1, 0, 0, 0],
       [1, 0, 0, 2, 0],
       [0, 0, 2, 2, 2],
       [0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0]])
Alternatively, you could also use np.where instead of the in-place assignment.
np.where(np.isin(B, [a[0] for a in A if a[1] <= target_val]), 0, B)
In one line: B[np.isin(B, A[A[:, 1] <= 2][:, 0])] = 0
Explanation:
c = A[:, 1] <= 2   # the original `if i[1] <= 2:` check, vectorized over the rows of A,
                   # i.e. a mask of the rows whose second value is <= 2
d = A[c][:, 0]     # index with the mask, and select the old `i[0]` values, here just `6`
e = np.isin(B, d)  # mask B according to where its values are among the above
B[e] = 0           # and zero out those positions, i.e. where the old B value is 6

2-D Matrix: Finding and deleting columns that are subsets of other columns

I have a problem where I want to identify and remove columns in a logic matrix that are subsets of other columns. i.e. [1, 0, 1] is a subset of [1, 1, 1], but neither of [1, 1, 0] and [0, 1, 1] is a subset of the other. I wrote out a quick piece of code that identifies the columns that are subsets; it does (n^2-n)/2 checks using a couple of nested for loops.
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows, cols = A.shape
columns = [True] * cols
for i in range(cols):
    for j in range(i + 1, cols):
        diff = A[:, i] - A[:, j]
        if all(diff >= 0):
            print("%d is a subset of %d" % (j, i))
            columns[j] = False
        elif all(diff <= 0):
            print("%d is a subset of %d" % (i, j))
            columns[i] = False
B = A[:, columns]
The solution should be
>>> print(B)
[[1 0 0]
 [0 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]]
For massive matrices though, I'm sure there's a way that I could do this faster. One thought is to eliminate subset columns as I go so I'm not checking columns already known to be subsets. Another thought is to vectorize this so I don't have O(n^2) operations. Thank you.
Since the A matrices I'm actually dealing with are 5000x5000 and sparse with about 4% density, I decided to try a sparse matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix A to the list of sets D is not as fast as it could be. Any ideas on how to do this better are appreciated.
Solution
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows, cols = A.shape
drops = np.zeros(cols, dtype=bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are already known to be subsets
for i in range(cols):
    if drops[i]:
        continue
    col1 = D[i]
    for j in range(i + 1, cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]: continue` here, but that was slower
            print("%d is a subset of %d" % (j, i))
            drops[j] = True
        elif col1.issubset(col2):
            print("%d is a subset of %d" % (i, j))
            drops[i] = True
            break
B = A[:, ~drops]
print(B)
Here's another approach using NumPy broadcasting -
A[:,~((np.triu(((A[:,:,None] - A[:,None,:])>=0).all(0),1)).any(0))]
A detailed commented explanation is listed below -
# Perform elementwise subtractions keeping the alignment along the columns
sub = A[:, :, None] - A[:, None, :]
# Look for >= 0 subtractions, as they indicate the subset criterion
mask3D = sub >= 0
# Check if all elements along each column satisfy that criterion, giving us a 2D
# mask which represents the relationship between all columns against each other
# for the subset criterion
mask2D = mask3D.all(0)
# Finally get the valid column mask by checking, for each column in the 2D mask,
# for at least one True element, sans the diagonal elements.
# Index into the input array with it for the final output.
colmask = ~(np.triu(mask2D, 1).any(0))
out = A[:, colmask]
Define col1 to be a subset of col2 if and only if col1.dot(col1) == col1.dot(col2) (for 0/1 columns, the dot product just counts overlapping 1s, so this says every 1 in col1 lines up with a 1 in col2).
Define col1 and col2 to be the same if and only if each is a subset of the other.
I split the work into two steps. First get rid of all but one of any group of equivalent columns. Then remove strict subsets.
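A quick sanity check of that definition on the toy vectors from the question (not part of the original answer):
import numpy as np
col1 = np.array([1, 0, 1])
col2 = np.array([1, 1, 1])
print(col1.dot(col1) == col1.dot(col2))  # True  -> col1 is a subset of col2
print(col2.dot(col2) == col2.dot(col1))  # False -> col2 is not a subset of col1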
Solution
import numpy as np

def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops

def drop_subsets(A):
    N = A.T.dot(A)
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops

def drop_strict(A):
    A1, d1 = drop_duplicates(A)
    A2, d2 = drop_subsets(A1)
    d1[~d1] = d2
    return A2, d1

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

B, drops = drop_strict(A)
Demonstration
print(B)
print()
print(drops)

[[1 0 0]
 [0 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]]

[False  True False False  True  True]
Explanation
N = A.T.dot(A) is the matrix of every pairwise dot product of A's columns. Per the definition of subset at the top, this will come in handy.
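For the example A above, this Gram matrix works out to (shown for illustration):
N = A.T.dot(A)
print(N)
# [[5 2 1 2 1 3]
#  [2 3 1 3 1 1]
#  [1 1 4 2 4 1]
#  [2 3 2 4 2 1]
#  [1 1 4 2 4 1]
#  [3 1 1 1 1 3]]
# e.g. N[3, 1] == N[1, 1] == 3, so column 1 is a subset of column 3,
# and rows/columns 2 and 4 of N are identical, so those columns are duplicates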
def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    # (N == D)[i, j] being True identifies A[:, i] as a subset
    # of A[:, j] if i < j. The relationship is reversed if j < i.
    # If A[:, j] is a subset of A[:, i] and vice versa, then we have
    # equivalent columns. Taking the lower triangle ensures we
    # leave one.
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops

def drop_subsets(A):
    N = A.T.dot(A)
    # without concern for removing equivalent columns, this
    # removes any column that has an off-diagonal equal to the diagonal
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops
