Matrix multiplication in pandas - python

I have numeric data stored in two DataFrames x and y. The inner product from numpy works but the dot product from pandas does not.
In [63]: x.shape
Out[63]: (1062, 36)
In [64]: y.shape
Out[64]: (36, 36)
In [65]: np.inner(x, y).shape
Out[65]: (1062L, 36L)
In [66]: x.dot(y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-76c015be254b> in <module>()
----> 1 x.dot(y)
C:\Programs\WinPython-64bit-2.7.3.3\python-2.7.3.amd64\lib\site-packages\pandas\core\frame.pyc in dot(self, other)
    888         if (len(common) > len(self.columns) or
    889                 len(common) > len(other.index)):
--> 890             raise ValueError('matrices are not aligned')
    891
    892         left = self.reindex(columns=common, copy=False)
ValueError: matrices are not aligned
Is this a bug or am I using pandas wrong?

Not only must the shapes of x and y be compatible, but the column labels of x must also match the index labels of y. Otherwise this code in pandas/core/frame.py will raise a ValueError:
if isinstance(other, (Series, DataFrame)):
    common = self.columns.union(other.index)
    if (len(common) > len(self.columns) or
            len(common) > len(other.index)):
        raise ValueError('matrices are not aligned')
If you just want to compute the matrix product without making the column names of x match the index names of y, then use the NumPy dot function:
np.dot(x, y)
The column labels of x must match the index labels of y because the pandas dot method reindexes both operands: if the column order of x and the index order of y do not naturally match, they are made to match before the matrix product is computed:
left = self.reindex(columns=common, copy=False)
right = other.reindex(index=common, copy=False)
The NumPy dot function does no such thing. It will just compute the matrix product based on the values in the underlying arrays.
Here is an example which reproduces the error:
import pandas as pd
import numpy as np
columns = ['col{}'.format(i) for i in range(36)]
x = pd.DataFrame(np.random.random((1062, 36)), columns=columns)
y = pd.DataFrame(np.random.random((36, 36)))
print(np.dot(x, y).shape)
# (1062, 36)
print(x.dot(y).shape)
# ValueError: matrices are not aligned
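If you do want to use the pandas dot method, one way to make the call succeed is to align the labels first. A minimal sketch continuing the example above (assigning x's columns as y's index is just one possible fix):
# Align the labels so pandas can match x's columns with y's index.
y.index = x.columns
print(x.dot(y).shape)
# (1062, 36)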

pandas copy numerical values from column based on categorical condition and put in new column

The purpose of this code is to:
1. Create a dummy data set that contains 2 columns with 25 rows, filled with values between 0 and 100.
2. Calculate the peaks and troughs of the data and put them in a new column called value.
3. In order to plot the data and visualize the result I need numerical values, so I want to create 2 additional columns, peak and trough, containing the numerical values, located at the same rows as the peak or trough labels in the value column.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelmax, argrelmin

data = np.random.randint(0, 100, (25, 2))
df = pd.DataFrame(data, columns=['a', 'b'])

peak = argrelmax(data=np.array(df.b), axis=0, order=2, mode='clip')[0]
for x in peak:
    if x == df.index[x]:
        df.loc[x, 'value'] = 'peak'
        df.loc[x, 'peak'] = df.b.iloc[[x]]

trough = argrelmin(data=np.array(df.b), axis=0, order=2, mode='clip')[0]
for x in trough:
    if x == df.index[x]:
        df.loc[x, 'value'] = 'trough'
        df.loc[x, 'trough'] = df.b.iloc[[x]]

df
Here is the error:
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      3 if x == df.index[x]:
      4     df.loc[x, 'value'] = 'peak'
----> 5     df.loc[x, 'peak'] = df.b.iloc[[x]]

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
   1950             return ser.reindex(ax)._values
   1951
-> 1952         raise ValueError("Incompatible indexer with Series")
   1953
   1954     def _align_frame(self, indexer, df: ABCDataFrame):

ValueError: Incompatible indexer with Series
I haven't looked much into the logic, but I think the problem is the indexing: df.b.iloc[[x]] returns a one-element Series, and assigning a Series to the single cell df.loc[x, 'peak'] is what raises the error. Use a single pair of square brackets so a scalar is returned:
df.loc[x, 'peak'] = df.b.iloc[x]
and likewise here:
df.loc[x, 'trough'] = df.b.iloc[x]
A more efficient and simpler way is to use np.where:
peak = argrelmax(data=np.array(df.b), axis=0, order=2, mode='clip')[0]
df['peak'] = np.where(df.index.isin(peak), df.b, np.nan)
trough = argrelmin(data=np.array(df.b), axis=0, order=2, mode='clip')[0]
df['trough'] = np.where(df.index.isin(trough), df.b, np.nan)
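For completeness, the categorical value column can be filled in the same vectorized style. A minimal sketch using np.select (this step is my addition, not part of the original answer; the empty-string default is arbitrary):
# Label each row 'peak', 'trough', or leave it empty.
conditions = [df.index.isin(peak), df.index.isin(trough)]
df['value'] = np.select(conditions, ['peak', 'trough'], default='')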

Matrix multiplication

The function takes two 2D NumPy matrices as arguments and returns one 2D NumPy array: the product of the two input matrices.
This function will perform the operation Z=X×Y, where X and Y are the function arguments. Remember how to perform matrix-matrix multiplication:
First, you need to make sure the matrix dimensions line up. For computing X×Y, the number of columns of X (the first matrix) must match the number of rows of Y (the second matrix); these are referred to as the "inner dimensions". Matrix dimensions are usually cited as "rows by columns", so the second dimension of the first operand X is on the "inside" of the operation, as is the first dimension of the second operand Y. If the operation were instead Y×X, you would need to make sure that the number of columns of Y matches the number of rows of X. If these numbers don't match, you should return None.
Second, you'll need to create your output matrix, Z. The dimensions of this matrix will be the "outer dimensions" of the two operands: if we're computing X×Y, then Z's dimensions will have the same number of rows as X (the first matrix), and the same number of columns as Y (the second matrix).
Third, you'll need to compute the pairwise dot products. If the operation is X×Y, then these dot products are between the ith row of X and the jth column of Y, and each result goes in Z[i][j]. So first you'll find the dot product of row 0 of X with column 0 of Y and put it in Z[0][0], then the dot product of row 0 of X with column 1 of Y in Z[0][1], and so on, until every row of X has been dotted with every column of Y.
You can use numpy.array, but no functions associated with computing matrix products (and definitely not the @ operator).
You CAN use numpy.dot, but ONLY to multiply vectors, since it can also be used to multiply full matrices.
import numpy as np

def mm_multiply(X, Y):
    X = [[1,2], [2,1]]
    Y = [[2,3], [3,3]]
    X = np.array(X)
    Y = np.array(Y)
    [I, J] = X.shape
    [K, H] = Y.shape
    Z = np.zeros((I, H))
    for i in range(I):
        for j in range(H):
            for k in range(K):
                Z[i,j] += X[i,k] * Y[k,j]
    print(Z)
My submission also complains that mv_multiply is not defined, but that comes from the previous problem and might go away with the correct code here. The other error I'm getting is below. Any help will be much appreciated! Thanks in advance!
TypeError Traceback (most recent call last)
<ipython-input-38-1b8bf5d47d82> in <module>
4 A = np.random.random((48, 683))
5 B = np.random.random((683, 58))
----> 6 np.testing.assert_allclose(mm_multiply(A, B), A @ B)
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
1491 header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
1492 assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1493 verbose=verbose, header=header, equal_nan=equal_nan)
1494
1495
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
779 return
780
--> 781 val = comparison(x, y)
782
783 if isinstance(val, bool):
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in compare(x, y)
1486 def compare(x, y):
1487 return np.core.numeric.isclose(x, y, rtol=rtol, atol=atol,
-> 1488 equal_nan=equal_nan)
1489
1490 actual, desired = np.asanyarray(actual), np.asanyarray(desired)
/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in isclose(a, b, rtol, atol, equal_nan)
2519 y = array(y, dtype=dt, copy=False, subok=True)
2520
-> 2521 xfin = isfinite(x)
2522 yfin = isfinite(y)
2523 if all(xfin) and all(yfin):
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
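For reference, here is a corrected sketch of the function (my reconstruction, not a posted solution): the hardcoded test matrices are removed, the inner dimensions are checked so that None is returned on a mismatch, and the result is returned instead of printed. numpy.dot is used only on vector pairs, per the assignment's rules.
import numpy as np

def mm_multiply(X, Y):
    # The inner dimensions must line up: columns of X == rows of Y.
    if X.shape[1] != Y.shape[0]:
        return None
    Z = np.zeros((X.shape[0], Y.shape[1]))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            # Dot product of row i of X with column j of Y.
            Z[i, j] = np.dot(X[i, :], Y[:, j])
    return Z

# The failing test from the traceback should now pass:
A = np.random.random((48, 683))
B = np.random.random((683, 58))
np.testing.assert_allclose(mm_multiply(A, B), A @ B)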

Can I parallelize `numpy.bincount` using `xarray.apply_ufunc`?

I want to parallelize the numpy.bincount function using the apply_ufunc API of xarray and the following code is what I've tried:
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(2, 16, 32),
                  dims=['time', 'y', 'x'],
                  coords={'time': np.array(['2019-04-18', '2019-04-19'],
                                           dtype='datetime64'),
                          'y': np.arange(16), 'x': np.arange(32)})
f = xr.DataArray(da.data.reshape((2, 512)), dims=['time', 'idx'])

x = da.x.values
y = da.y.values
r = np.sqrt(x[np.newaxis, :]**2 + y[:, np.newaxis]**2)

nbins = 4
if x.max() > y.max():
    ri = np.linspace(0., y.max(), nbins)
else:
    ri = np.linspace(0., x.max(), nbins)
ridx = np.digitize(np.ravel(r), ri)

func = lambda a, b: np.bincount(a, weights=b)
xr.apply_ufunc(func, xr.DataArray(ridx, dims=['idx']), f)
but I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-974a8f0a89e8> in <module>()
12
13 func = lambda a, b: np.bincount(a, weights=b)
---> 14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_ufunc(func, *args, **kwargs)
979 signature=signature,
980 join=join,
--> 981 exclude_dims=exclude_dims)
982 elif any(isinstance(a, Variable) for a in args):
983 return variables_ufunc(*args)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_dataarray_ufunc(func, *args, **kwargs)
208
209 data_vars = [getattr(a, 'variable', a) for a in args]
--> 210 result_var = func(*data_vars)
211
212 if signature.num_outputs > 1:
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, *args, **kwargs)
558 raise ValueError('unknown setting for dask array handling in '
559 'apply_ufunc: {}'.format(dask))
--> 560 result_data = func(*input_data)
561
562 if signature.num_outputs == 1:
<ipython-input-203-974a8f0a89e8> in <lambda>(a, b)
11 ridx = np.digitize(np.ravel(r), ri)
12
---> 13 func = lambda a, b: np.bincount(a, weights=b)
14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
ValueError: object too deep for desired array
I am kind of lost where the error is stemming from and help would be greatly appreciated...
np.bincount only accepts 1D inputs, which is why apply_ufunc fails here; one way around that is np.apply_along_axis. The catch is that apply_along_axis iterates over 1D slices of the first argument to the applied function and not any of the others. If I understand your use-case correctly, you actually want to iterate over 1D slices of the weights (weights in the np.bincount signature), not the integer array (x in the np.bincount signature).
One way to work around this is to write a thin wrapper function around np.bincount that simply switches the order of the arguments:
def wrapped_bincount(weights, x):
    return np.bincount(x, weights=weights)
We can then use np.apply_along_axis with this function for your use-case:
def apply_bincount_along_axis(x, weights, axis=-1):
    return np.apply_along_axis(wrapped_bincount, axis, weights, x)
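As a quick sanity check on plain NumPy arrays (the toy sizes and values here are my own):
import numpy as np

x = np.array([0, 1, 1, 2])        # integer bin labels
weights = np.random.rand(3, 4)    # three independent rows of weights
out = apply_bincount_along_axis(x, weights, axis=-1)
print(out.shape)                  # (3, 3): one bincount per row of weights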
Finally, we can wrap this new function for use with xarray using apply_ufunc, noting that it can be automatically parallelized with dask. Also note that we do not need to provide an axis argument, because xarray will automatically move the input core dimension dim to the last position in the weights array before applying the function:
def xbincount(x, weights):
    if len(x.dims) != 1:
        raise ValueError('x must be one-dimensional')
    dim, = x.dims
    nbins = int(x.max()) + 1  # plain int, as output_sizes expects
    return xr.apply_ufunc(apply_bincount_along_axis, x, weights,
                          input_core_dims=[[dim], [dim]],
                          output_core_dims=[['bin']], dask='parallelized',
                          output_dtypes=[float], output_sizes={'bin': nbins})
Applying this function to your example then looks like:
xbincount(ridx, f)
<xarray.DataArray (time: 2, bin: 5)>
array([[ 0. , 7.934821, 34.066872, 51.118065, 152.769169],
[ 0. , 11.692989, 33.262936, 44.993856, 157.642972]])
Dimensions without coordinates: time, bin
As desired it also works with dask arrays:
xbincount(ridx, f.chunk({'time': 1}))
<xarray.DataArray (time: 2, bin: 5)>
dask.array<shape=(2, 5), dtype=float64, chunksize=(1, 5)>
Dimensions without coordinates: time, bin
I know this is a bit late, but here is an alternative for computing bincount with multiple sets of weights. Please refer to @spencerkclark's answer for information about parallelizing the function.
A warning before using this: the function bincount_2d_SLOW is only meant to demonstrate the idea. Do not use it in your code directly; it is very slow!
I will explain at the end why the idea behind this function can greatly speed up your code relative to the solution posted by @spencerkclark, but only if you are computing the bincount multiple times over the same set of groups.
The idea of the code is that while we can't use np.bincount with 2d weights, we can convert 2d weights into 1d data that is directly usable by np.bincount.
The way we do this is:
1. Repeat the grouping column along the 2nd dimension of the weights, so the grouping and the weights have the same shape.
2. Offset the grouping values along the 2nd dimension, so that each set of weights has its own unique grouping values. This way, we group along each set of weights separately.
3. Flatten the data so it is 1D. Now we can run np.bincount.
4. Finally, reshape the result.
def bincount_2d_SLOW(x, weights=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_groups = x.max() + 1
    n_dims = weights.shape[1]
    # Expand x to the same shape as weights
    repeated_x = np.tile(x, (n_dims, 1)).T
    # Offset each column so bincount counts each set of weights separately
    repeated_x = repeated_x + n_groups * np.arange(n_dims)
    # Flatten
    repeated_x = repeated_x.flatten()
    # Compute bincount and reshape to (n_groups, n_dims)
    return np.bincount(repeated_x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
Here is why the idea of this function can speed up your code: if you are computing the bincount many times using the same set of groups, you can pre-compute the tiled-and-flattened groupings, and suddenly the code is incredibly fast. Here is an alternative function (I also added an option to specify n_groups, which can speed up the code even more):
def bincount_2d(x, weights=None, n_groups=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_dims = weights.shape[1]
    if n_groups is None:
        n_groups = (x.max() + 1) // n_dims
    return np.bincount(x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
In some testing I did, bincount_2d_SLOW is about 1/3 slower than apply_bincount_along_axis. But bincount_2d was about 2x faster than apply_bincount_along_axis when I didn't specify n_groups, and when I did specify n_groups, it was about 3x faster.
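For illustration, a minimal usage sketch of that precomputation idea (the sizes, seed, and variable names here are my own, not from the original posts; it assumes every group value actually occurs in x so the bincount has full length):
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=512)   # group labels 0..3
weights = rng.random((512, 2))     # two sets of weights

n_groups = int(x.max()) + 1
n_dims = weights.shape[1]
# Precompute once, reuse for every weights array that shares these groups.
tiled_x = (np.tile(x, (n_dims, 1)).T + n_groups * np.arange(n_dims)).flatten()

result = bincount_2d(tiled_x, weights=weights, n_groups=n_groups)
print(result.shape)  # (4, 2): per-group sums, one column per set of weights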

ValueError: setting an array element with a sequence

I was trying to calculate the trend of temperature:
ntimes, ny, nx = tempF.shape
print tempF.shape
trend = MA.zeros((ny, nx), dtype=float)
print trend.shape
for y in range(ny):
    for x in range(nx):
        trend[y, x] = numpy.polyfit(tdum, tempF[:, y, x], 1)
print trend
the result is
(24, 241, 480)
(241, 480)
ValueError                                Traceback (most recent call last)
<ipython-input-31-4ac068601e48> in <module>()
12 for y in range (0,ny):
13 for x in range (0,nx):
---> 14 trend[y,x] = numpy.polyfit(tdum, tempF[:,y,x],1)
15
16
/home/charcoalp/anaconda2/envs/pyn_test/lib/python2.7/site-packages/numpy/ma/core.pyc in __setitem__(self, indx, value)
3272 if _mask is nomask:
3273 # Set the data, then the mask
-> 3274 _data[indx] = dval
3275 if mval is not nomask:
3276 _mask = self._mask = make_mask_none(self.shape, _dtype)
ValueError: setting an array element with a sequence.
I've only used Python for a few days. Can anyone help me? Thank you.
When you create the ny by nx zeros ndarray, you can specify which type is stored in its elements. polyfit with degree 1 returns a 1x2 array of floats (slope and intercept), so to store a pair of float values in each cell of your zeros array, choose the following dtype instead of float:
trend = numpy.zeros((ny, nx), dtype='2f')
After that, you can easily store your length-2 arrays as elements of the trend ndarray.
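A minimal runnable sketch of that suggestion, with small stand-in data since tempF and tdum are not shown (the sizes here are my own):
import numpy as np

ntimes, ny, nx = 24, 5, 6               # small stand-in dimensions
tdum = np.arange(ntimes, dtype=float)   # stand-in time axis
tempF = np.random.rand(ntimes, ny, nx)  # stand-in temperature data

# dtype='2f' gives each cell room for the two polyfit coefficients.
trend = np.zeros((ny, nx), dtype='2f')
for y in range(ny):
    for x in range(nx):
        trend[y, x] = np.polyfit(tdum, tempF[:, y, x], 1)

print(trend.shape)  # (5, 6, 2): slope and intercept per grid point
Note that np.polyfit also accepts a 2D y argument, so the double loop can be replaced with a single call: np.polyfit(tdum, tempF.reshape(ntimes, -1), 1) returns a (2, ny*nx) array of coefficients.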

Convert pandas to numpy.ndarray for sparse.hstack

I'm trying to solve the following problem:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
In fact, I get X1 from TfidfVectorizer, but I leave that code out for brevity.
I want to apply sparse.hstack to use both variables in a regression.
I convert the pandas column to a numpy.ndarray as below:
X2 = df['a'].as_matrix()
type(X2)
numpy.ndarray
X = sparse.hstack((X1,X2))
ValueError Traceback (most recent call last)
<ipython-input-38-9493e3833c5d> in <module>()
----> 1 X = sparse.hstack((X1,X2))
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
579 elif brow_lengths[i] != A.shape[0]:
580 raise ValueError('blocks[%d,:] has incompatible '
--> 581 'row dimensions' % i)
582
583 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions
What's wrong?
I've done it as below; it works:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
X2 = df['a'].reset_index()
X2 = X2.iloc[:,[1]].values
X = sparse.hstack((X1,X2))
Your arrays must have the same first dimension and each block must be two-dimensional: X2 from as_matrix() has shape (10,), which sparse.hstack treats as a single row of 10 columns, while X1 has 10 rows. You can check this with X1.shape and X2.shape (shape is an attribute, not a method).
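An alternative sketch that avoids the intermediate DataFrame steps is to reshape the flat array into a column before stacking (to_numpy and the reshape are my suggestion, assuming a reasonably recent pandas):
# Reshape the flat (10,) array into a (10, 1) column so the row counts match.
X2 = df['a'].to_numpy().reshape(-1, 1)
X = sparse.hstack((X1, X2))
print(X.shape)  # (10, 10001)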
