Pandas force matrix multiplication - python

I would like to force the matrix multiplication "orientation" using Python Pandas, between DataFrames and DataFrames, DataFrames and Series, and Series and Series.
As an example, I tried the following code:
import pandas
t = pandas.Series([1, 2])
print(t.T.dot(t))
Which outputs: 5
But I expect this:
[[1 2]
 [2 4]]
Pandas is great, but this inability to control the orientation of matrix multiplications is what I find most frustrating, so any help would be greatly appreciated.
PS: I know Pandas implicitly uses the index to align operands when computing the matrix product, but it seems this behavior can't be switched off!

Here:
In [1]: import pandas, numpy as np
In [2]: t = pandas.Series([1, 2])
In [3]: np.outer(t, t)
Out[3]:
array([[1, 2],
       [2, 4]])

Anyone coming to this now may want to consider pandas.Series.to_frame(). It's kind of clunky.
Here's the original question's example:
import pandas as pd
t = pd.Series([1, 2])
# t.to_frame() is a (2, 1) column vector; t.to_frame().T is a (1, 2) row vector
t.to_frame().dot(t.to_frame().T)
Which yields:
In [3]: t.to_frame().dot(t.to_frame().T)
Out[3]:
   0  1
0  1  2
1  2  4

Solution found by y-p:
https://github.com/pydata/pandas/issues/3344#issuecomment-16533461
from numpy.random import randint
from pandas import DataFrame
from pandas.util.testing import makeCustomDataframe as mkdf

a = mkdf(3, 5, data_gen_f=lambda r, c: randint(1, 100))
b = mkdf(5, 3, data_gen_f=lambda r, c: randint(1, 100))
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
print(a)
print(b)
print(c)
assert (a.iloc[0, :].values * b.iloc[:, 0].values.T).sum() == c.iloc[0, 0]
C0       C_l0_g0  C_l0_g1  C_l0_g2  C_l0_g3  C_l0_g4
R0
R_l0_g0       39       87       88        2       65
R_l0_g1       59       14       76       10       65
R_l0_g2       93       69        4       29       58

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0       76       88       11
R_l0_g1       66       73       47
R_l0_g2       78       69       15
R_l0_g3       47        3       40
R_l0_g4       54       31       31

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0    19174    17876     7933
R_l0_g1    15316    13503     4862
R_l0_g2    16429    15382     7284
The assert here is just a sanity check that the result is indeed a correct matrix multiplication.
The key seems to be this line:
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
What this does is compute the dot product of a's and b's underlying values, but force the resulting DataFrame c to take a's index and b's columns, effectively turning the raw dot product into a pandas-style matrix multiplication that keeps the labels. (You lose the columns of a and the index of b, but that is semantically correct: in a matrix multiplication you sum over those axes, so it would be meaningless to keep them.)
This is a bit awkward, but it seems simple enough as long as it is consistent with the rest of the API (I still have to test what the result will be with Series x DataFrame and Series x Series; I will post my findings here).
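For the Series cases, the same trick seems to carry over: drop to the raw values (or use np.outer) and rebuild the labels yourself. A minimal sketch of Series x Series as a labeled outer product (the variable names here are mine):

import numpy as np
import pandas as pd

s = pd.Series([1, 2], index=['x', 'y'])
# outer product, with s's index reused on both axes of the result
outer = pd.DataFrame(np.outer(s.values, s.values), index=s.index, columns=s.index)
print(outer)
#    x  y
# x  1  2
# y  2  4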

Related

Assign values from small matrix to specified places in larger matrix

I would like to know if there exists a similar way of doing the following (shown as Mathematica code in the original post, not reproduced here) in Python.
I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. These two approaches work properly, but I find them very time consuming with larger matrices (3000×3000 elements, for example).
The problem described in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which outputs the unchanged matrix a (the fancy index a[p] returns a copy, so the assignment writes into a temporary rather than into a itself):
Perhaps you are looking for this:
a[p[..., None], p] = b
Array a after the above assignment looks like this:
[[100   1   2 200   4]
 [  5   6   7   8   9]
 [ 10  11  12  13  14]
 [300  16  17 400  19]
 [ 20  21  22  23  24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated together, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment then performs an element-wise write at these locations of a, using the corresponding element values from the right-hand side.
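For what it's worth, np.ix_ is a standard NumPy helper that builds the same broadcastable pair of index arrays and may read more clearly; a minimal sketch on the example above:

import numpy as np

a = np.arange(25).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])

# np.ix_(p, p) is equivalent to (p[..., None], p): it selects the 2x2 submesh
a[np.ix_(p, p)] = b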

Select columns based on a range and single integer using iloc()

Given the data frame below I want to select columns A and D to F.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(5, 6)), columns=list('ABCDEF'))
In R I could say
df[ , c(1, 4:6)]
I want something similar for pandas. With
df.iloc[:, slice(3, 6, 1)]
I get columns D to F. But how can I add column A?
You can use np.r_ to pass combinations of slices and indices. Since you seem to know the labels, you can use get_loc to obtain the corresponding iloc positions.
import numpy as np
idx = np.r_[df.columns.get_loc('A'),
            df.columns.get_loc('D'):df.columns.get_loc('F') + 1]
# idx = np.r_[0, 3:6]
df.iloc[:, idx]
    A   D   E   F
0  38  71  62  63
1  60  93  72  94
2  57  33  30  51
3  88  54  21  39
4   0  53  41  20
Another option is np.split
df_split = pd.concat([np.split(df, [1], axis=1)[0],
                      np.split(df, [3], axis=1)[1]], axis=1)
There you don't need to know the column names, just their positions.
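If you prefer to stay within pandas, one more position-based alternative (a sketch, not from the original answers) is to concatenate two positional slices directly:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)), columns=list('ABCDEF'))
# column A is position 0; columns D to F are positions 3 to 5
subset = pd.concat([df.iloc[:, [0]], df.iloc[:, 3:6]], axis=1)
print(subset.columns.tolist())  # ['A', 'D', 'E', 'F']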

How to implement fast numpy array computation with multiple occuring slice indices?

I was recently wondering how I could bypass the following numpy behavior.
Starting with a simple example:
import numpy as np
a = np.array([[1,2,3,4,5,6,7,8,9,0], [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
then:
b = a.copy()
b[:, [0,1,4,8]] = b[:, [0,1,4,8]] + 50
print(b)
...results in printing:
[[51 52  3  4 55  6  7  8 59  0]
 [61 62 13 14 65 16 17 18 69 10]]
but when one index appears twice in the fancy index:
c = a.copy()
c[:, [0,1,4,4,8]] = c[:, [0,1,4,4,8]] + 50
print(c)
giving:
[[51 52  3  4 55  6  7  8 59  0]
 [61 62 13 14 65 16 17 18 69 10]]
(in short: they do the same thing)
Could I also have the operation executed twice for index 4?
Or, more generally: if an index i appears r times in the fancy index, can the above expression be applied r times instead of numpy taking it into account only once? And what if we replace "50" with something that differs for every occurrence of i?
For my current code, I used:
w[p1] = w[p1] + D[pix]
where "pix" and "p1" are numpy arrays with dtype int and the same length, in which some integers may appear multiple times.
(So one may have pix = [..., 1, 1, 1, 2, 2, 3, ...] at the same time as p1 = [..., 21, 32, 13, 23, 11, 78, ...]; on its own, this expression takes for a repeated index only its first occurrence, e.g. the first 1 and the corresponding 21, and discards the rest of the ones.)
Of course using a for loop would solve the problem easily. The point is that both the integers and the sizes of the arrays are huge, so it would cost a lot of computational resources to use for-loops instead of efficient numpy-array routines. Any ideas, links to existing documentation etc.?
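One documented NumPy route for exactly this (a sketch, not stated in the original thread) is np.add.at, the unbuffered equivalent of +=: it applies the operation once per occurrence of a repeated index, and it accepts per-occurrence values:

import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
              [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
c = a.copy()
# unbuffered in-place add: column 4 is incremented twice
np.add.at(c, (slice(None), [0, 1, 4, 4, 8]), 50)
print(c)
# [[ 51  52   3   4 105   6   7   8  59   0]
#  [ 61  62  13  14 115  16  17  18  69  10]]

# per-occurrence values work too, e.g. for the w[p1] += D[pix] case:
# np.add.at(w, p1, D[pix])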

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets (the original post illustrated the desired layout with an image).
If you don't mind a NumPy-based vectorized solution, based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T, b.values.T).T  # func from linked solution post
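The linked function is not reproduced in this thread; for reference, a sketch of its approach (row-wise Pearson correlation between every row of A and every row of B):

import numpy as np

def corr2_coeff(A, B):
    # row-wise mean-centering
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # row-wise sums of squares
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # correlation matrix between rows of A and rows of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.outer(ssA, ssB))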
Sample run -
In [621]: a
Out[621]:
    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  23  12
2  78  35   0  72  31
3  84  25  14  56   0
4  26  82  13  14  34

In [622]: b
Out[622]:
    A   B   C    D   E
0  45  45  98    0  24
1  24  87  52   23  12
2  65  65  32    1  65
3  65  52  32  365   3
4  65  12  12   53  65

In [623]: corr2_coeff(a.values.T, b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 ,  0.48775228, -0.07401011],
       [ 0.0306753 , -0.0705457 ,  0.48801177,  0.34685977, -0.33942737],
       [-0.26626431, -0.01983468,  0.66110713, -0.50872017,  0.68350413],
       [ 0.58095645, -0.55231196, -0.32053858,  0.38416478, -0.62403866],
       [ 0.01652716,  0.14000468, -0.58238879,  0.12936016,  0.28602349]])
This achieves exactly what you want:
import pandas as pd
from scipy.stats import pearsonr

# create a new DataFrame whose index and columns both come from a's columns,
# so the values align on the diagonal
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate through
# the columns in a (a more robust way would be to iterate through the
# intersection of the two column sets, in case your actual DataFrames'
# columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # the Pearson r value from the tuple
    c.loc[col, col] = correl                  # assign on the diagonal for that column
Edit: well, it achieved exactly what you wanted until the question was modified. It can easily be adapted, though:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
           A          B          C         D          E
A   0.713185  -0.592371  -0.970444  0.487752 -0.0740101
B  0.0306753 -0.0705457   0.488012   0.34686  -0.339427
C  -0.266264 -0.0198347   0.661107  -0.50872   0.683504
D   0.580956  -0.552312  -0.320539  0.384165  -0.624039
E  0.0165272   0.140005  -0.582389   0.12936   0.286023
I use this function, which breaks it down with NumPy:
import numpy as np
import pandas as pd

def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    # Pearson r = (sum(xy) - sum(x) * sum(y) / n) / (n * std(x) * std(y))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
Demo:
corr_ab(a, b)
Do you have to use Pandas? This seems doable via numpy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))

Python subsetting and slicing with [index,:] format

I've seen pandas DataFrames subsetted using the [index, :] notation when [index] alone would suffice.
Using a simple toy example:
df = pd.DataFrame({'a':[1,5,10,15,20,50,88]})
idx = [2,4,6]
We can call the iloc method using either of these:
df.iloc[idx,:]
df.iloc[idx]
To get results:
    a
2  10
4  20
6  88
Are there any differences between the call methods? Should I prefer the use of one over the other?
In df.iloc[idx, :] the colon slices over the columns. In Python, when you use [:] you slice over the whole range. As an example:
df = pd.DataFrame({'a':[1,5,10,15,20,50,88], 'b':[1,5,10,15,20,50,88]})
idx = [2,4,6]
Without columns slicing:
df.iloc[idx]
output:
    a   b
2  10  10
4  20  20
6  88  88
With columns slicing:
df.iloc[idx,:1]
output:
    a
2  10
4  20
6  88
In this case the question is whether you want to explicitly slice over all the columns. In my modest opinion, I think the standard form df.iloc[idx] is clearer.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position
Mainly, they're the same. From the docs:
Axes left out of the specification are assumed to be :. (e.g.
p.loc['a'] is equivalent to p.loc['a', :, :])
See also: Different Choices for Indexing.
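A quick sanity check (a minimal sketch) confirming the two spellings return identical results:

import pandas as pd

df = pd.DataFrame({'a': [1, 5, 10, 15, 20, 50, 88]})
idx = [2, 4, 6]
# identical row selection, all columns in both cases
assert df.iloc[idx].equals(df.iloc[idx, :])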
