I noticed previous versions of my question suggested the use of queries, but I have distinct data frames that do not share column names. I want to code this formula without for loops, using only the apply function:
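The formula, as the loop code below implements it, can be written as

\[ \mu_j = \frac{\sum_i p_{ij}\, x_i}{\sum_i p_{ij}} \]

where x_i is the i-th row of X and j runs over A, B, C.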
Here are the variables, initialized. mu stands for μ and the other variables are as follows:
import numpy as np
import pandas as pd

mu = pd.DataFrame(0.0, index=['A', 'B', 'C'], columns=['x', 'y'])
pij = pd.DataFrame(np.random.randn(500, 3), columns=['A', 'B', 'C'])
X = pd.DataFrame(np.random.randn(500, 2), columns=['x', 'y'])
Next, I am able to solve this with nested for loops:
for j in range(len(mu)):
    for i in range(len(X)):
        mu.iloc[j, :] += pij.iloc[i, j] * X.loc[i, ['x', 'y']]
    mu.iloc[j, :] = mu.iloc[j, :] / pij.iloc[:, j].sum()
mu
x y
A 0.147804 0.169263
B -0.299590 -0.828494
C -0.199637 0.363423
My question is whether it is possible to avoid the nested for loops, or at least remove one of them. I have made feeble attempts to no avail; even my initial attempts result in multiple NaNs.
The code you pasted suggests you meant the index on mu on the left-hand side of the formula to be j, so I'll assume that's the case.
Also, since you generated random matrices for your example, my results will turn out different from yours, but I checked that your pasted code gives the same results as my code on the matrices I generated.
The numerator of the RHS of the formula can be computed with the appropriate transpose and matrix multiplication:
>>> num = pij.transpose().dot(X)
>>> num
x y
A -30.352924 -22.405490
B 14.889298 -16.768464
C -24.671337 9.092102
The denominator is simply summing over columns:
>>> denom = pij.sum()
>>> denom
A 23.460325
B 20.106702
C -46.519167
dtype: float64
Then the "division" is element-wise division by column:
>>> num.divide(denom, axis='index')
x y
A -1.293798 -0.955037
B 0.740514 -0.833974
C 0.530348 -0.195449
I'd normalize pij first then take inner product with X. The formula looks like:
mu = (pij / pij.sum()).T.dot(X)
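For example, a quick sanity check against the loop result (assuming mu still holds the values computed by the nested loops above):

mu_vec = (pij / pij.sum()).T.dot(X)
print(np.allclose(mu, mu_vec))   # should print True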
Python and coding beginner here. I started learning python a couple of days ago, no prior coding experience, and I've started learning about functions. Since python is really useful for mathematical operations, I'm trying to tie this with what I'm learning in my linear algebra class. So here's the question. Beware a lot of reading ahead!
I'm trying to multiply two random matrices using python, without numpy (otherwise I could use numpy.dot and numpy.matrix). If we have 2 matrices X and Y, of dimensions a×b and b×c respectively, matrix multiplication only works if the number of columns of X equals the number of rows of Y. In order to write a program that can do the matrix multiplication, here's what I tried.
First I defined my function as def mat_mul(A, B), with dimensions a×b and b×c respectively. I then have the matrix Z as the product of the matrices A and B, which starts out empty: Z = []. Here's where my thought process is a bit wobbly. I think that first I need a for loop that iterates through the rows of A, for a in range(0, len(A)):, followed by another for loop that iterates through the columns of A, for b in range(0, len(A[0])), followed by another for loop to iterate through the rows of B, for c in range(0, len(B)), and finally a last for loop that iterates through the columns of B, for d in range(0, len(B[0])). Now I should have the product matrix Z but I'm not sure how I should write it. Would it be Z += A[a] * B[d]?
Sorry for the long explanation, I just thought I'd share my thought process as well.
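For reference, a minimal three-nested-loop sketch of plain-Python matrix multiplication (the name mat_mul and the list-of-rows representation are illustrative assumptions, not code from the question):

def mat_mul(A, B):
    # A is a x b, B is b x c, both stored as lists of rows.
    rows, inner, cols = len(A), len(B), len(B[0])
    Z = [[0] * cols for _ in range(rows)]      # result Z is a x c
    for i in range(rows):                      # row of A
        for j in range(cols):                  # column of B
            for k in range(inner):             # shared dimension b
                Z[i][j] += A[i][k] * B[k][j]
    return Z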
I just don't know how to explain what I need. I'm not looking for any codes, but just tutorial and direction to get to where I need to be.
Example: I have numbers in a csv file, a and b are in different columns:
header1,header2
a,b
a1,b1
a2,b2
a3,b3
a4,b4
a5,b5
a6,b6
so how would I create something like
[a*b + a1*b1 + a2*b2 + ... + a6*b6] / [sum of all b values]
OK, so I know how to code the denominator using pandas, but how would I code the numerator?
What is this process called, and where can I find a tutorial for it?
I don't know if this is the best method, but it should work. You can create a new column in pandas which is the product of a and b (here 'a' and 'b' stand for your two columns, header1 and header2):
df['product'] = df['a']*df['b']
You can then simply use sum() to get the sum of column b and of the product column, and divide the product sum by the b sum:
ans = df['product'].sum() / df['b'].sum()
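A small end-to-end sketch of the same idea, using the column names from your sample (the filename data.csv is made up):

import pandas as pd

df = pd.read_csv('data.csv')                      # columns: header1, header2
df['product'] = df['header1'] * df['header2']
ans = df['product'].sum() / df['header2'].sum()
print(ans)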
Not sure if this is the best method to use, but you could use a list comprehension along with the zip() function. With these two, you can get the numerator like this:
[a*b for a, b in zip(df['header1'], df['header2'])]
Chapter 3 of Dive into Python 3 has more on list comprehensions. Here is the documentation on zip() and here are a few examples of its usage.
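In the same spirit, the numerator and denominator can be combined in a couple of lines (assuming df is the DataFrame loaded from the csv):

numerator = sum(a * b for a, b in zip(df['header1'], df['header2']))
result = numerator / df['header2'].sum()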
For alpha and k fixed integers with i < k also fixed, I am trying to encode a sum of the form
where all the x and y variables are known beforehand. (this is essentially the alpha coordinate of a big iterated matrix-vector multiplication)
For a normal sum varying over one index, I usually create a 1-D array A, set A[i] equal to the i-th entry of the sum, and then use sum(A). In the above instance, though, the entries of the innermost sum depend on the indices of the previous sum, which in turn depend on the indices of the sum before that, all the way back out to the first sum, which prevents me from using this tactic in a straightforward manner.
I tried making a 2-D array B of appropriate length and width and setting row 0 to be the entries of the innermost sum, then row 1 to be the entries of the next sum times sum(np.transpose(B), 0), and so on, but the value of the first sum (of row 0) needs to vary with each entry in row 1, since that sum still has indices dependent on our position in row 1, and so on all the way up to sum k-i.
A sum which allows for a 'variable' filled in by each position of the array it is summing through would thus do the trick, but I can't find anything along these lines in numpy, and my attempts to hack one together have so far failed. My intuition says there is a solution that involves summing along the axes of a (k-i)-dimensional array, but I haven't been able to make this precise yet. Any assistance is greatly appreciated.
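For concreteness, the sum being described (judging from the vectorized answers below, so this reading is an assumption) appears to have the form

\[ S_\alpha = \sum_{j_0=0}^{n_0-1}\sum_{j_1=0}^{n_1-1}\cdots\sum_{j_{k-i}=0}^{n_{k-i}-1} x_{\alpha j_0}\, y_{j_0}\, x_{j_0 j_1}\, y_{j_1}\cdots x_{j_{k-i-1} j_{k-i}}\, y_{j_{k-i}} \]

with k - i + 1 nested sums and upper bounds n_0, ..., n_{k-i}.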
One simple attempt to hard-code something like this would be:
for j0 in range(0, n0):
    for j1 in range(0, n1):
        ....
Edit: (a vectorized version)
You could do something like this: (I didn't test it)
temp = np.ones(n[k-i])
for j in range(0, k-i):
    temp = x[:n[k-i-1-j], :n[k-i-j]] @ (y[:n[k-i-j]] * temp)
result = x[alpha, :n[0]] @ (y[:n[0]] * temp)
The basic idea is that you try to press it into a matrix-vector form. (Note that @ is the Python 3 matrix-multiplication operator.)
Edit: You should note that you need to change the "k-1" to where the innermost sum is (I just did it for all sums up to index k-i)
This is 95% identical to @sehigle's answer, but includes a generic N vector:
def nested_sum(XX, Y, N, alpha):
    # Start from ones for the innermost sum, then collapse one sum per loop pass.
    intermediate = np.ones(N[-1], dtype=XX.dtype)
    for n1, n2 in zip(N[-2::-1], N[:0:-1]):
        intermediate = np.sum(XX[:n1, :n2] * Y[:n2] * intermediate, axis=1)
    return np.sum(XX[alpha, :N[0]] * Y[:N[0]] * intermediate)
Similarly, I have no knowledge of the expression, so I'm not sure how to build appropriate tests. But it runs :\
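For what it's worth, here is one way such a test could look, assuming the sum really is the iterated matrix-vector form both answers implement (the sizes and data below are made up):

import itertools
import numpy as np

rng = np.random.default_rng(0)
XX = rng.standard_normal((6, 6))
Y = rng.standard_normal(6)
N = [4, 3, 5]        # one upper bound per summation index
alpha = 2

def brute_force(XX, Y, N, alpha):
    # Literal nested sum: x[alpha,j0]*y[j0] * x[j0,j1]*y[j1] * ... over all index tuples.
    total = 0.0
    for js in itertools.product(*(range(n) for n in N)):
        term = XX[alpha, js[0]] * Y[js[0]]
        for a, b in zip(js[:-1], js[1:]):
            term *= XX[a, b] * Y[b]
        total += term
    return total

print(np.isclose(nested_sum(XX, Y, N, alpha), brute_force(XX, Y, N, alpha)))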
I have a problem: I have 2 lists with x and y values, and I would like to create a function based on them. But the problem is that I would like to build a function like this one:
f(x) = a * (x-b)**c
I already know about scipy.interpolate, but I couldn't find anything there that returns a function like this one.
Is there a reasonably easy way to find the values of a, b and c that fit best?
thanks for your help!
Edit:
here is what my current values of x and y look like:
I created this function :
def problem(values):
    s = sum((y - values[0] * (x - values[1]) ** values[2]) ** 2 for x, y in zip(X, Y))
    return s
and I tried to find the best values of a, b and c with scipy.optimize.minimize, but I don't know which values of a, b and c I should start with...
values = minimize(problem,(a,b,c))
(Edited to account for the OP's added code and sub-question.)
The general idea is to use a least-squares minimization to find the "best" values of a, b, and c. First define a function whose parameters are a, b, c and which returns the sum of the squares of the differences between the given y values and the calculated values of a * (x-b)**c. (That function can be written as a one-liner.) Then use an optimization routine, such as one found in scipy, to minimize that function's value. The resulting values of a, b, c are what you want; use them to define your desired function.
There are a few details to examine, such as restrictions on the allowed values of a, b, c, but those depend somewhat on your lists of x and y values.
Now that you have shown a graph of your x and y values, I see that your values are all positive and the function is generally increasing. For that common situation I would use the initial values
a = 1.0
b = 0.0
c = 1.0
That gives a straight line through the origin, in fact the line y = x, which is often a decent first guess. In your case the x and y values have very different scales, with y about a hundred times larger than x, so you would probably get better results by changing the value of a:
a = 100.0
b = 0.0
c = 1.0
I can see even better values and some restrictions on the end values but I would prefer to keep this answer more general and useful for other similar problems.
Your function problem() looks correct to me, though I would have written it a little differently for better clarity. Be sure to test it.
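A minimal sketch of that recipe (the data arrays and the name sum_of_squares below are made up for illustration; the function plays the same role as your problem()):

import numpy as np
from scipy.optimize import minimize

# Made-up data for illustration; substitute your own X and Y lists.
X = np.linspace(1.0, 10.0, 50)
Y = 100.0 * (X - 0.5) ** 1.2

def sum_of_squares(values):
    a, b, c = values
    base = X - b
    if np.any(base <= 0):   # keep (x - b)**c real, per the restrictions mentioned above
        return np.inf
    return np.sum((Y - a * base ** c) ** 2)

# Start from the suggested guess: roughly y = 100 * x.
result = minimize(sum_of_squares, x0=(100.0, 0.0, 1.0), method='Nelder-Mead')
a, b, c = result.x
print(a, b, c)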
def problem(a, b, c, d):
    return a * (x[d] - b) ** c
I guess this is what you are after, with d being the index into the x array. Not sure where y comes into it.
Is there an easier way to get the sum of all values (assuming they are all numbers) in an ndarray :
import numpy as np
m = np.array([[1,2],[3,4]])
result = 0
(dim0,dim1) = m.shape
for i in range(dim0):
    for j in range(dim1):
        result += m[i, j]
print(result)
The above code seems somewhat verbose for a straightforward mathematical operation.
Thanks!
Just use numpy.sum():
result = np.sum(matrix)
or equivalently, the .sum() method of the array:
result = matrix.sum()
By default this sums over all elements in the array - if you want to sum over a particular axis, you should pass the axis argument as well, e.g. matrix.sum(0) to sum over the first axis.
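For example, with the 2×2 array from your question:
>>> m = np.array([[1, 2], [3, 4]])
>>> m.sum()
10
>>> m.sum(axis=0)   # sum over the first axis (column sums)
array([4, 6])
>>> m.sum(axis=1)   # row sums
array([3, 7])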
As a side note your "matrix" is actually a numpy.ndarray, not a numpy.matrix - they are different classes that behave slightly differently, so it's best to avoid confusing the two.
Yes, just use the sum method:
result = m.sum()
For example,
In [17]: m = np.array([[1,2],[3,4]])
In [18]: m.sum()
Out[18]: 10
By the way, NumPy has a matrix class which is different from "regular" numpy arrays, so calling a regular ndarray matrix causes some cognitive dissonance. To help others understand your code, you may want to change the name matrix to something else.