Pandas force matrix multiplication - python

I would like to force the matrix multiplication "orientation" using Python Pandas, between DataFrames and DataFrames, DataFrames and Series, and Series and Series.
As an example, I tried the following code:
import pandas
t = pandas.Series([1, 2])
print(t.T.dot(t))
Which outputs: 5
But I expect this:
[[1 2]
 [2 4]]
Pandas is great, but this inability to control the orientation of matrix multiplications is what I find most frustrating, so any help would be greatly appreciated.
PS: I know Pandas implicitly uses the index to align operands when computing the matrix product, but it seems this behavior can't be switched off!

Here:
In [1]: import pandas, numpy as np
In [2]: t = pandas.Series([1, 2])
In [3]: np.outer(t, t)
Out[3]:
array([[1, 2],
       [2, 4]])

Anyone coming to this now may want to consider pandas.Series.to_frame(). It's kind of clunky.
Here's the original question's example:
import pandas as pd
t = pd.Series([1, 2])
# t.to_frame() is a (2, 1) column vector; t.to_frame().T is a (1, 2) row vector
t.to_frame().dot(t.to_frame().T)
Which yields:
In [3]: t.to_frame().dot(t.to_frame().T)
Out[3]:
   0  1
0  1  2
1  2  4

Solution found by y-p:
https://github.com/pydata/pandas/issues/3344#issuecomment-16533461
from numpy.random import randint
from pandas import DataFrame
from pandas.util.testing import makeCustomDataframe as mkdf

a = mkdf(3, 5, data_gen_f=lambda r, c: randint(1, 100))
b = mkdf(5, 3, data_gen_f=lambda r, c: randint(1, 100))
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
print(a)
print(b)
print(c)
assert (a.iloc[0, :].values * b.iloc[:, 0].values.T).sum() == c.iloc[0, 0]
C0       C_l0_g0  C_l0_g1  C_l0_g2  C_l0_g3  C_l0_g4
R0
R_l0_g0       39       87       88        2       65
R_l0_g1       59       14       76       10       65
R_l0_g2       93       69        4       29       58

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0       76       88       11
R_l0_g1       66       73       47
R_l0_g2       78       69       15
R_l0_g3       47        3       40
R_l0_g4       54       31       31

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0    19174    17876     7933
R_l0_g1    15316    13503     4862
R_l0_g2    16429    15382     7284
The assert here is just a sanity check that the result is indeed a correct matrix multiplication.
The key seems to be this line:
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
What this does is compute the dot product of a's and b's underlying values, but force the resulting DataFrame c to take a's index and b's columns, effectively turning the raw dot product into a pandas-style matrix multiplication that keeps the labels. (You lose the columns of a and the index of b, but that is semantically correct: in a matrix multiplication you sum over those axes, so it would be meaningless to keep them.)
This is a bit awkward, but it seems simple enough as long as it is consistent with the rest of the API (I still have to test what the result will be with Series x DataFrame and Series x Series; I will post my findings here).
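For the Series cases, the same trick seems to carry over: drop to the raw values (or use np.outer) and rebuild the labels yourself. A minimal sketch of Series x Series as a labeled outer product (the variable names here are mine):

import numpy as np
import pandas as pd

s = pd.Series([1, 2], index=['x', 'y'])
# outer product, with s's index reused on both axes of the result
outer = pd.DataFrame(np.outer(s.values, s.values), index=s.index, columns=s.index)
print(outer)
#    x  y
# x  1  2
# y  2  4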

Related

Assign values from small matrix to specified places in larger matrix

I would like to know if there exists a similar way of doing the following (shown as Mathematica code in the original post, not reproduced here) in Python.
I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. These two approaches work properly, but I find them very time consuming with larger matrices (3000×3000 elements, for example).
The problem described in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which outputs the unchanged matrix a (the fancy index a[p] returns a copy, so the assignment writes into a temporary rather than into a itself):
Perhaps you are looking for this:
a[p[..., None], p] = b
Array a after the above assignment looks like this:
[[100   1   2 200   4]
 [  5   6   7   8   9]
 [ 10  11  12  13  14]
 [300  16  17 400  19]
 [ 20  21  22  23  24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated together, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment then performs an element-wise write at these locations of a, using the corresponding element values from the right-hand side.
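For what it's worth, np.ix_ is a standard NumPy helper that builds the same broadcastable pair of index arrays and may read more clearly; a minimal sketch on the example above:

import numpy as np

a = np.arange(25).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])

# np.ix_(p, p) is equivalent to (p[..., None], p): it selects the 2x2 submesh
a[np.ix_(p, p)] = b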

Select columns based on a range and single integer using iloc()

Given the data frame below I want to select columns A and D to F.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(5, 6)), columns=list('ABCDEF'))
In R I could say
df[ , c(1, 4:6)]
I want something similar for pandas. With
df.iloc[:, slice(3, 6, 1)]
I get columns D to F. But how can I add column A?
You can use np.r_ to pass combinations of slices and indices. Since you seem to know the labels, you can use get_loc to obtain the corresponding iloc positions.
import numpy as np
idx = np.r_[df.columns.get_loc('A'),
            df.columns.get_loc('D'):df.columns.get_loc('F') + 1]
# idx = np.r_[0, 3:6]
df.iloc[:, idx]
    A   D   E   F
0  38  71  62  63
1  60  93  72  94
2  57  33  30  51
3  88  54  21  39
4   0  53  41  20
Another option is np.split
df_split = pd.concat([np.split(df, [1], axis=1)[0],
                      np.split(df, [3], axis=1)[1]], axis=1)
There you don't need to know the column names, just their positions.
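If you prefer to stay within pandas, one more position-based alternative (a sketch, not from the original answers) is to concatenate two positional slices directly:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)), columns=list('ABCDEF'))
# column A is position 0; columns D to F are positions 3 to 5
subset = pd.concat([df.iloc[:, [0]], df.iloc[:, 3:6]], axis=1)
print(subset.columns.tolist())  # ['A', 'D', 'E', 'F']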

How to implement fast numpy array computation with multiple occuring slice indices?

I was recently wondering how I could bypass the following numpy behavior.
Starting with a simple example:
import numpy as np
a = np.array([[1,2,3,4,5,6,7,8,9,0], [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
then:
b = a.copy()
b[:, [0,1,4,8]] = b[:, [0,1,4,8]] + 50
print(b)
...results in printing:
[[51 52  3  4 55  6  7  8 59  0]
 [61 62 13 14 65 16 17 18 69 10]]
but when one index appears twice in the fancy index:
c = a.copy()
c[:, [0,1,4,4,8]] = c[:, [0,1,4,4,8]] + 50
print(c)
giving:
[[51 52  3  4 55  6  7  8 59  0]
 [61 62 13 14 65 16 17 18 69 10]]
(in short: they do the same thing)
Could I also have the operation executed twice for index 4?
Or, more generally: if an index i appears r times in the fancy index, can the above expression be applied r times instead of numpy taking it into account only once? And what if we replace "50" with something that differs for every occurrence of i?
For my current code, I used:
w[p1] = w[p1] + D[pix]
where "pix" and "p1" are numpy arrays with dtype int and the same length, in which some integers may appear multiple times.
(So one may have pix = [..., 1, 1, 1, 2, 2, 3, ...] at the same time as p1 = [..., 21, 32, 13, 23, 11, 78, ...]; on its own, this expression takes for a repeated index only its first occurrence, e.g. the first 1 and the corresponding 21, and discards the rest of the ones.)
Of course using a for loop would solve the problem easily. The point is that both the integers and the sizes of the arrays are huge, so it would cost a lot of computational resources to use for-loops instead of efficient numpy-array routines. Any ideas, links to existing documentation etc.?
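One documented NumPy route for exactly this (a sketch, not stated in the original thread) is np.add.at, the unbuffered equivalent of +=: it applies the operation once per occurrence of a repeated index, and it accepts per-occurrence values:

import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
              [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
c = a.copy()
# unbuffered in-place add: column 4 is incremented twice
np.add.at(c, (slice(None), [0, 1, 4, 4, 8]), 50)
print(c)
# [[ 51  52   3   4 105   6   7   8  59   0]
#  [ 61  62  13  14 115  16  17  18  69  10]]

# per-occurrence values work too, e.g. for the w[p1] += D[pix] case:
# np.add.at(w, p1, D[pix])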

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets (the original post illustrated the desired layout with an image).
If you don't mind a NumPy-based vectorized solution, based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T, b.values.T).T  # func from linked solution post
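The linked function is not reproduced in this thread; for reference, a sketch of its approach (row-wise Pearson correlation between every row of A and every row of B):

import numpy as np

def corr2_coeff(A, B):
    # row-wise mean-centering
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # row-wise sums of squares
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # correlation matrix between rows of A and rows of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.outer(ssA, ssB))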
Sample run -
In [621]: a
Out[621]:
    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  23  12
2  78  35   0  72  31
3  84  25  14  56   0
4  26  82  13  14  34

In [622]: b
Out[622]:
    A   B   C    D   E
0  45  45  98    0  24
1  24  87  52   23  12
2  65  65  32    1  65
3  65  52  32  365   3
4  65  12  12   53  65

In [623]: corr2_coeff(a.values.T, b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 ,  0.48775228, -0.07401011],
       [ 0.0306753 , -0.0705457 ,  0.48801177,  0.34685977, -0.33942737],
       [-0.26626431, -0.01983468,  0.66110713, -0.50872017,  0.68350413],
       [ 0.58095645, -0.55231196, -0.32053858,  0.38416478, -0.62403866],
       [ 0.01652716,  0.14000468, -0.58238879,  0.12936016,  0.28602349]])
This achieves exactly what you want:
import pandas as pd
from scipy.stats import pearsonr

# create a new DataFrame whose index and columns both come from a's columns,
# so the values align on the diagonal
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate through
# the columns in a (a more robust way would be to iterate through the
# intersection of the two column sets, in case your actual DataFrames'
# columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # the Pearson r value from the tuple
    c.loc[col, col] = correl                  # assign on the diagonal for that column
Edit: well, it achieved exactly what you wanted until the question was modified. It can easily be adapted, though:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
           A          B          C         D          E
A   0.713185  -0.592371  -0.970444  0.487752 -0.0740101
B  0.0306753 -0.0705457   0.488012   0.34686  -0.339427
C  -0.266264 -0.0198347   0.661107  -0.50872   0.683504
D   0.580956  -0.552312  -0.320539  0.384165  -0.624039
E  0.0165272   0.140005  -0.582389   0.12936   0.286023
I use this function, which breaks it down with NumPy:
import numpy as np
import pandas as pd

def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    # Pearson r = (sum(xy) - sum(x) * sum(y) / n) / (n * std(x) * std(y))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
Demo:
corr_ab(a, b)
Do you have to use Pandas? This seems doable via numpy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))

Python subsetting and slicing with [index,:] format

I've seen pandas DataFrames subsetted using the [index, :] notation when [index] alone would suffice.
Using a simple toy example:
df = pd.DataFrame({'a':[1,5,10,15,20,50,88]})
idx = [2,4,6]
We can call the iloc method using either of these:
df.iloc[idx,:]
df.iloc[idx]
To get results:
    a
2  10
4  20
6  88
Are there any differences between the call methods? Should I prefer the use of one over the other?
In df.iloc[idx, :] the colon slices over the columns. In Python, when you use [:] you slice over the whole range. As an example:
df = pd.DataFrame({'a':[1,5,10,15,20,50,88], 'b':[1,5,10,15,20,50,88]})
idx = [2,4,6]
Without columns slicing:
df.iloc[idx]
output:
    a   b
2  10  10
4  20  20
6  88  88
With columns slicing:
df.iloc[idx,:1]
output:
    a
2  10
4  20
6  88
In this case the question is whether you want to explicitly slice over all the columns. In my modest opinion, I think the standard form df.iloc[idx] is clearer.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position
Mainly, they're the same. From the docs:
Axes left out of the specification are assumed to be :. (e.g.
p.loc['a'] is equivalent to p.loc['a', :, :])
See also: Different Choices for Indexing.
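A quick sanity check (a minimal sketch) confirming the two spellings return identical results:

import pandas as pd

df = pd.DataFrame({'a': [1, 5, 10, 15, 20, 50, 88]})
idx = [2, 4, 6]
# identical row selection, all columns in both cases
assert df.iloc[idx].equals(df.iloc[idx, :])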
