I have a matrix where each column is a different brand.
Each row is a different category.
I have a separate matrix which is the desired outcome.
I need to find the optimal mix (percentages) of the columns of the first matrix so that the result equals the second matrix.
First matrix:
C | 105 130 120
P |   1   3   5
F |   2   4   2
The goal is to design a mix that has these attributes:
Optimal:
C | 245
P |   6
F |   7
What formula is this?
If I understand you correctly, you are actually searching for a solution to a set of linear equations. Assuming you want to "mix" each of the columns of the matrix to get to the final target, you're actually looking for a vector x such that target = M @ x. The solution is to multiply by the inverse, x = inv(M) @ target. Using numpy, this translates to
import numpy

M = numpy.array([[105, 130, 120],
                 [1, 3, 5],
                 [2, 4, 2]])
target = numpy.array([[245], [6], [7]])
x = numpy.linalg.inv(M) @ target
x is
array([[0.11940299],
[1.57462687],
[0.23134328]])
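As a side note, explicitly inverting the matrix is rarely necessary; numpy.linalg.solve solves the same system and is generally more accurate and faster. A minimal sketch of the same computation:

import numpy
M = numpy.array([[105, 130, 120],
                 [1, 3, 5],
                 [2, 4, 2]])
target = numpy.array([245, 6, 7])
x = numpy.linalg.solve(M, target)  # same solution as inv(M) @ target, without forming the inverse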
I am comparing the Jaccard distance matrix I get when I process a dataset using pdist and a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.
I think one of the following is the cause:
My implementation of jaccard distance calculation is wrong
scipy.spatial.distance.pdist(metric = 'jaccard') and scipy.spatial.distance.jaccard calculate Jaccard distance in different ways (seems unlikely, as they're both in scipy.spatial.distance)
squareform is doing something to my data, potentially a normalisation
The docs for squareform go a bit over my head, so some form of normalisation might be what's happening. However, the squareform-ed distance matrix does not have the same relative distance magnitudes between cells, which is confusing (e.g. row 0 in my DIY distance matrix is 0, 0.571429, 1, and with pdist is 0, 1, 1 - the middle value is twice as high with pdist).
Can anyone explain why I'm getting a different distance matrix when it's being analysed with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # I don't care about every value in the array for my use case, so I don't want to include them in my comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)
data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])
# =============================================================================
# DIY distance matrix
# =============================================================================
#set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])
# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
Looks like pdist considers objects at a given index when comparing arrays, rather than just what objects are present in the array itself - if I change data_array[1] to 3, 4, 5, 4, 5 then the distance matrix changes to reflect the fact that data_array[0][3:5] == data_array[1][3:5]:
0 0.6 1
0.6 0 1
1 1 0
The behaviour is discussed here, but the arrays don't have to be boolean based on the above tests (if the arrays were treated as boolean then the distance matrix would not change, since all values are non-zero and would therefore be treated as True).
The DIY function considered the objects present rather than the index at which those objects were found, hence the discrepancy!
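Two follow-up notes: squareform only reshapes the condensed distance vector returned by pdist into a square matrix; it does not normalise anything. And if the set-based behaviour is what you want from pdist, it also accepts a callable metric, so the DIY function can be plugged in directly (a sketch reusing jaccard_dissimilarity and data_array from above):

dist_custom = squareform(pdist(data_array, metric=lambda u, v: jaccard_dissimilarity(u, v, None)))
# dist_custom should now reproduce the set-based distances of dist_diy rather than the index-wise ones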
I would like to know whether there is a Python equivalent of something I can do in Mathematica.
I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. Both of these work properly, but I find them very time-consuming with larger matrices (3000×3000 elements, for example).
The problem, described in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which outputs the unchanged matrix a.
Perhaps you are looking for this:
a[p[...,None], p] = b
Array a after the above assignment looks like this:
[[100 1 2 200 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[300 16 17 400 19]
[ 20 21 22 23 24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated together, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment statement then performs an element-wise assignment at these locations of a, using the respective element values from the RHS.
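An equivalent spelling uses numpy.ix_, which builds the same open mesh from the two 1-D index arrays (a small sketch):

a[np.ix_(p, p)] = b  # same effect as a[p[..., None], p] = b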
I'm looking for an efficient function to automatically produce betas for every possible multiple regression model given a dependent variable and set of predictors as a DataFrame in python.
For example, given this set of data:
https://i.stack.imgur.com/YuPuv.jpg
The dependent variable is 'Cases per Capita' and the columns following are the predictor variables.
In a simpler example:
Student   Grade   Hours Slept   Hours Studied   ...
-------   -----   -----------   -------------   ---
A         90      9             1               ...
B         85      7             2               ...
C         100     4             5               ...
...       ...     ...           ...             ...
where the beta matrix output would look as such:
Regression   Hours Slept   Hours Studied
----------   -----------   -------------
1            #             N/A
2            N/A           #
3            #             #
The table size would be [2^n - 1] where n is the number of variables, so in the case with 5 predictors and 1 dependent, there would be 31 regressions, each with a different possible combination of beta calculations.
The process is described in greater detail here and an actual solution that is written in R is posted here.
I am not aware of any package that already does this. But you can create all those combinations (2^n-1), where n is the number of columns in X (independent variables), and fit a linear regression model for each combination and then get coefficients/betas for each model.
Here is how I would do it, hope this helps
from sklearn import datasets, linear_model
import numpy as np
from itertools import combinations

# test dataset
X, y = datasets.load_boston(return_X_y=True)
X = X[:, :3]  # original X has 13 columns, only taking the first n=3 columns here

# create all 2^n-1 (here 7, because n=3) combinations of columns, where n is the number of features/independent variables
all_combs = []
for i in range(X.shape[1]):
    all_combs.extend(combinations(range(X.shape[1]), i + 1))

# print the 2^n-1 combinations
print('2^n-1 combinations are:')
print(all_combs)

## create the betas/coefficients matrix, with 2^n-1 rows and one column per feature in X, filled with NaN
betas = np.zeros([len(all_combs), X.shape[1]]) + np.nan

## fit a model for each combination of columns and write the coefficients into the betas matrix
lr = linear_model.LinearRegression()
for regression_no, comb in enumerate(all_combs):
    lr.fit(X[:, comb], y)
    betas[regression_no, comb] = lr.coef_

## print the coefficients of each model
print('Regression No'.center(15) + " ".join(['column {}'.format(i).center(10) for i in range(X.shape[1])]))
print('_' * 50)
for index, beta in enumerate(betas):
    print('{}'.format(index + 1).center(15), " ".join(['{:.4f}'.format(beta[i]).center(10) for i in range(X.shape[1])]))
results in
2^n-1 combinations are:
[(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
Regression No   column 0   column 1   column 2
__________________________________________________
      1         -0.4152    nan        nan
      2         nan        0.1421     nan
      3         nan        nan        -0.6485
      4         -0.3521    0.1161     nan
      5         -0.2455    nan        -0.5234
      6         nan        0.0564     -0.5462
      7         -0.2486    0.0585     -0.4156
First of all thank you for any support. This is my first question published as usually my doubts are solved reading through other user's questions.
Here is my question: I have a number (n) of sets with common elements. These elements are usually added sequentially, creating new sets, but I do not have the sequence, and the sequence is what I am trying to find. The sequence is not always perfect, so at some points I have to settle for the closest match, accepting some uncertainty where the sequence is not 'perfect'.
I coded it using set theory, searching sequentially for the set that contains all the other sets; when I do not reach the last set, I start again from the smallest set towards the biggest.
I gave some thought to the topic and found, in theory, a more robust and generic approach. The idea is to build a square matrix with the n sets as row index (i) and the n sets as column index (j). The element i,j will be equal to 1 when set j is contained in set i.
Here I have an example with sets A to G:
A={a, b, c, d1, d2, e, f};
B={b, c, d1, d2, e, f};
C={c, d1, d2, e, f};
D={d1, f, g};
E={d2, f, g};
F={f, g};
G={g};
If I create the matrix assuming sequence B, E, C, F, D, A, G, I would have:
    B  E  C  F  D  A  G
B   1  1  1  1  1  0  1
E   0  1  0  1  0  0  1
C   0  1  1  1  1  0  1
F   0  0  0  1  0  0  1
D   0  0  0  1  1  0  1
A   1  1  1  1  1  1  1
G   0  0  0  0  0  0  1
I should get this matrix transformed into the following matrix:
    A  B  C  D  E  F  G
A   1  1  1  1  1  1  1
B   0  1  1  1  1  1  1
C   0  0  1  1  1  1  1
D   0  0  0  1  0  1  1
E   0  0  0  0  1  1  1
F   0  0  0  0  0  1  1
G   0  0  0  0  0  0  1
Which shows one of the two possible sequences: A, B, C, D, E, F, G.
Here I add a picture as well, as I am not sure the matrices are shown clearly.
My first question is how you would recommend handling this matrix (what kind of data type should I use that has typical functions to swap rows and columns?).
My second question is whether there is already a matrix transformation function for this task.
From my (small) experience, most used types for matrices are lists and numpy.ndarrays.
For column swaps in particular, I would recommend numpy. There are many array creation routines in numpy: you either give the list with the data explicitly, or you create an array based on the shape you want. Example:
>>> import numpy as np
>>> np.array([1, 2, 3])
array([1, 2, 3])
>>> np.array([[1, 2, 3], [1, 2, 3]])
array([[1, 2, 3],
       [1, 2, 3]])
>>> np.zeros((2, 2))
array([[0., 0.],
       [0., 0.]])
np.zeros accepts a shape as an argument (number of rows and columns for matrices). Of course, you can create arrays with how many dimensions you want.
numpy is quite complex regarding indexing its arrays. For a matrix you have:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> a[0] # row indexing
array([0, 1, 2])
>>> a[1, 1] # element indexing
4
>>> a[:, 2] # column indexing
array([2, 5])
Hopefully the examples are self-explanatory. Regarding the column index, : means "over all the values". So you specify a column index and the fact that you want all the values on that column.
For swapping rows and columns it's pretty short:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> a[[0, 1]] = a[[1, 0]] # row swapping
>>> a
array([[3, 4, 5],
       [0, 1, 2]])
>>> a[:, [0, 2]] = a[:, [2, 0]] # column swapping
>>> a
array([[5, 4, 3],
       [2, 1, 0]])
Here advanced indexing is used. Each dimension (called an axis by numpy) can accept a list of indices, so you can get 2 or more rows/columns at the same time from a matrix.
You don't have to ask for them in a certain order. numpy gives you the values in the order you ask for them.
Swapping rows is done by asking numpy for the two rows in reversed order and saving them in their original positions. It actually mirrors the pythonic way of swapping values between 2 variables (although wrapped in a more complex indexing expression):
a, b = b, a
Regarding matrix transformation, it depends on what you are looking for.
Using the swapping ideas from the answer above, I made my own functions to find all the swaps needed to obtain the triangular matrix.
Here I write the code:
def simple_sort_matrix(matrix):
    orden = np.array([i for i in range(len(matrix[0]))])
    change = True
    while change:
        rows_index = row_index_ones(matrix)
        change = False
        #for i in range(len(rows_index)-1):
        i = 0
        while i

def swap_row_and_column(matrix, i, j):
    matrix[[i, j]] = matrix[[j, i]]        # row swapping
    matrix[:, [i, j]] = matrix[:, [j, i]]  # column swapping
    return matrix

def row_index_ones(matrix):
    return np.sum(matrix, axis=1)
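For illustration, swap_row_and_column can be exercised on a small made-up matrix (a minimal sketch, not one of the example matrices from the question):

import numpy as np
m = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 1, 1]])
print(swap_row_and_column(m, 0, 2))
# swaps row 0 with row 2 and column 0 with column 2 in place, giving:
# [[1 1 1]
#  [0 1 0]
#  [0 1 1]]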
Best regards,
Pablo
I have an array A, say :
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element in A by the median of its four nearest neighbors, without taking into account the value at the given position... for example :
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize times, so I want to avoid for loops or similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter but it is not producing the desired output :
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B=[7,4,4,5,6,7,6,6], which is clearly not the correct answer.
Any idea is welcome.
One way could be to use np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing, rolling by 2, -1 and -2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to sum your 4 arrays, as for each index you have the 4 neighbors in one of them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median; the median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
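Putting the corrected idea together, the four rolled copies can be stacked and the median taken along the new axis (a minimal sketch):

import numpy as np
A = np.array([1, 2, 3, 4, 5, 6, 7, 8])
neighbours = np.stack([np.roll(A, s) for s in (2, 1, -1, -2)])  # the four wrap-around neighbours of each position
B = np.median(neighbours, axis=0)  # B[2] == 3.0, matching the example; the edges wrap, which the question says it does not care about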
I'd make a rolling, central window of length 5 in pandas, and apply the median function to the values of the window, the middle one masked away:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
mask = np.array(np.ones(5), bool)
mask[5//2] = False
import pandas as pd
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
     0
0  NaN
1  NaN
2  3.0
3  4.0
4  5.0
5  6.0
6  NaN
7  NaN
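Since the array is large, it may be worth passing raw=True to apply, so each window reaches the lambda as a plain ndarray instead of a Series, which is usually faster (a small variation on the line above, using the same r5 and mask):

result = r5.apply(lambda x: np.median(x[mask]), raw=True)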