Apply Box Cox transformation to two columns simultaneously - python

I want to apply a Box-Cox transformation to two different columns. The twist is that I'm being asked to choose the lambda that's optimal for both columns simultaneously.
scipy.stats.boxcox only accepts one-dimensional arrays.
How can I apply a Box-Cox transformation to two columns subject to lambda_1 = lambda_2?
Here's my data.
I would like to transform the columns SPEED and CAP.
import pandas as pd
from scipy import stats
df = pd.read_csv('https://raw.githubusercontent.com/BenjaminKay/berndt-econometrics/master/data/floppy_ver/CHAP4.DAT/COLE',
sep='\t')
stats.boxcox(df[['SPEED','CAP']].values)
ValueError: Data must be 1-dimensional.

It sounds like you want boxcox to treat the two columns as a single data set. You could merge them into a single 1-d array, apply boxcox, and then restore the shape afterwards, as in the following.
Get the values as a 2-d array:
In [63]: data = df[['SPEED','CAP']].values
Pass the data to boxcox; use the .ravel() method to flatten data into a 1-d array before passing in the data:
In [64]: result1d, lam = stats.boxcox(data.ravel())
In [65]: lam
Out[65]: -0.02063317824310837
Reshape result1d back to the original 2-d shape:
In [66]: result = result1d.reshape(data.shape)
In [67]: result.shape
Out[67]: (91, 2)
In [68]: result[:8]
Out[68]:
array([[-1.82384013, 7.23194418],
[-4.09393704, 3.25939313],
[-3.80017243, 4.39314839],
[-3.80017243, 4.39314839],
[-3.80017243, 4.39314839],
[-3.80017243, 4.39314839],
[-3.11153324, 5.01897958],
[-3.11153324, 5.01897958]])
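For a reusable version, here is a minimal sketch of the same ravel/reshape idea wrapped in a helper. The function name boxcox_shared_lambda and the synthetic log-normal data are assumptions made for illustration (they stand in for the CSV above), not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive stand-in data (Box-Cox requires positive values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "SPEED": rng.lognormal(mean=0.0, sigma=0.5, size=50),
    "CAP": rng.lognormal(mean=3.0, sigma=0.8, size=50),
})

def boxcox_shared_lambda(frame, columns):
    """Box-Cox transform several columns using one pooled lambda (hypothetical helper)."""
    data = frame[columns].to_numpy(dtype=float)
    # Flatten so boxcox sees a single 1-d sample and estimates a single lambda.
    flat, lam = stats.boxcox(data.ravel())
    transformed = pd.DataFrame(flat.reshape(data.shape),
                               columns=columns, index=frame.index)
    return transformed, lam

transformed, lam = boxcox_shared_lambda(demo, ["SPEED", "CAP"])

# Applying the pooled lambda to one column on its own reproduces that column.
speed_check = stats.boxcox(demo["SPEED"].to_numpy(), lmbda=lam)
```

Keep in mind that pooling treats both columns as one sample when estimating lambda, so the value found is the one that makes the combined data look most normal, which may or may not be what "optimal for both" is meant to mean in your assignment.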

Related

How can I put two NumPy arrays into a matrix with two columns?

I am trying to put two NumPy arrays into a matrix or horizontally stack them. Each array is 76 elements long, and I want the ending matrix to have 76 rows and 2 columns. I basically have a velocity/frequency model and want to have two columns with corresponding frequency/velocity values in each row.
Here is my code ('f' is frequency and 'v' the velocity values, previously already defined):
print(f.shape)
print(v.shape)
print(type(f))
print(type(v))
x = np.concatenate((f, v), axis = 1)
This returns
(76,)
(76,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
And an error about the concatenate line that says:
AxisError: axis 1 is out of bounds for array of dimension 1
Besides concatenate, I've also tried hstack, as well as vstack and transposing with .T, and I get the same error. I've also tried using Pandas, but I need to use NumPy, because when I save the result to a txt/dat file, Pandas adds an extra index column that I do not need.
Your problem is that your vectors are one-dimensional, like in this example:
f_1d = np.array([1,2,3,4])
print(f_1d.shape)
> (4,)
As you can see, only the first dimension is given. So instead you could create your vectors like this:
f = np.expand_dims(np.array([1,2,3,4]), axis=1)
v = np.expand_dims(np.array([5,6,7,8]), axis=1)
print(f.shape)
print(v.shape)
>(4,1)
>(4,1)
As you may notice, the second dimension is equal to one, but now your vector is represented in matrix form.
It is now possible to transpose the matrix-vectors:
f_t = f.T
v_t = v.T
print(f_t.shape)
> (1,4)
Instead of using concatenate, you could use vstack or hstack to create cleaner code:
x = np.hstack((f,v))
x_t = np.vstack((f_t,v_t))
print(x.shape)
print(x_t.shape)
>(4,2)
>(2,4)
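As an alternative sketch, np.column_stack accepts 1-d arrays directly and places each one in its own column, so no expand_dims step is needed. The f and v values below are made-up placeholders for the frequency and velocity arrays, and the in-memory buffer stands in for the txt/dat file.

```python
import io
import numpy as np

# Placeholder 1-d arrays standing in for the real frequency/velocity data.
f = np.linspace(1.0, 76.0, 76)
v = np.linspace(100.0, 175.0, 76)

# column_stack turns each 1-d array into one column of the result.
x = np.column_stack((f, v))

# savetxt writes only the values, with no extra index column.
buffer = io.StringIO()
np.savetxt(buffer, x, fmt="%.3f")
first_line = buffer.getvalue().splitlines()[0]
```

To write to disk instead, pass a filename to np.savetxt in place of the buffer.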

Correlation of columns of two arrays in python

I have two arrays: 900x421 and 900x147. I need to correlate all columns from these arrays so that the output is 421x147. In MATLAB, the corr() function does this, but I can't find a function that does the same in Python.
The numpy.corrcoef function is the way to go. Its x and y arguments are expected to have the same shape, which yours do not, so concatenate the two arrays and pass them as a single argument instead. Let's say arr1 has shape 900x421 and arr2 has shape 900x147. You can do the following:
import numpy as np
two_arrays = np.concatenate((arr1, arr2), axis=1) # 900x568
corr = np.corrcoef(two_arrays.T) # 568x568 array
desired_output = corr[0:421, 421:]
By default, np.corrcoef treats each row as a variable and each column as an observation. That is why we need to transpose the array.
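Here is a small sketch of the same slicing trick on random data, with one entry checked against a direct pairwise correlation. The array sizes (50x4 and 50x3) are arbitrary stand-ins for 900x421 and 900x147, and rowvar=False is used in place of the transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
arr1 = rng.normal(size=(50, 4))   # stand-in for the 900x421 array
arr2 = rng.normal(size=(50, 3))   # stand-in for the 900x147 array

n1 = arr1.shape[1]
stacked = np.concatenate((arr1, arr2), axis=1)   # 50 x 7
# rowvar=False tells corrcoef that columns are the variables.
full = np.corrcoef(stacked, rowvar=False)        # 7 x 7
# Upper-right block: columns of arr1 against columns of arr2.
cross = full[:n1, n1:]                           # 4 x 3

# Direct correlation of one column pair, for comparison with cross[1, 2].
single = np.corrcoef(arr1[:, 1], arr2[:, 2])[0, 1]
```

Note that the full matrix also contains the within-array correlations, which are computed and then discarded; for very wide arrays that extra work may matter.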

Replicating a matrix in pandas or numpy to a certain size

I have a matrix A of shape (41, 41), stored as a DataFrame.
B is a matrix of shape (7154, 8240), stored as an ndarray.
I want to replicate A (keeping the whole 41x41 matrix intact) to the size of B. It will not fit exactly, so the rows and columns that do not fit should simply be clipped.
This is to be able to multiply A*B.
I tried this code, but I cannot multiply with a float.
repeat = pd.concat([A]*(B.shape[0]/A.shape[0]), axis=0, ignore_index=True)
filter_large = pd.concat([repeat]*(B.shape[1]/A.shape[1]), axis=1, ignore_index=True)
filter_l = filter_large.values # change from a dataframe to a numpy array
AB = A*filter_l
I should mention that I've tried numpy.resize but it does not keep the matrix intact, mixing up all rows which is not what I want.
This code will do what you ask for:
shapeMultiples = (np.ceil(B.shape[0]/A.shape[0]).astype(int), np.ceil(B.shape[1]/A.shape[1]).astype(int))
res = np.tile(A, shapeMultiples)[:B.shape[0], :B.shape[1]]
Explanation:
np.tile(A, reps) repeats the matrix A multiple times along each axis. How often it is repeated is specified for each axis in reps.
For your example it should be repeated B.shape[0]/A.shape[0] times along axis 0 and B.shape[1]/A.shape[1] times along axis 1. However, you have to round these values up to make sure the tiled result is at least as large as B, which is what np.ceil does. Since reps is expected to be a tuple of integers but ceil returns floats, we have to cast the values to int.
In the final step we cut off the excess so the result fits the size of B, using [:B.shape[0], :B.shape[1]].
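The tile-then-clip step is easier to see on a tiny example. This is only a sketch: the 2x2 A and 5x3 B are made up, and integer ceiling division is used in place of np.ceil, which avoids the float-to-int cast.

```python
import numpy as np

A = np.arange(4).reshape(2, 2)   # small stand-in for the 41x41 matrix
B = np.ones((5, 3))              # small stand-in for the 7154x8240 matrix

# Ceiling division per axis: how many copies of A are needed to cover B.
reps = tuple(-(-b // a) for a, b in zip(A.shape, B.shape))

# Tile A, then clip the overhang so the result matches B exactly.
res = np.tile(A, reps)[:B.shape[0], :B.shape[1]]

# Element-wise product now works because the shapes agree.
product = res * B
```

Passing a DataFrame to np.tile returns a plain ndarray, so column labels are not carried over.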

How to flatten an array to a matrix in Numpy?

I am looking for an elegant way to flatten an array of arbitrary shape to a matrix based on a single parameter that specifies the dimension to retain. For illustration, I would like
def my_func(input, dim):
# code to compute output
return output
Given, for example, an input array of shape 2x3x4, output should be an array of shape 12x2 for dim=0, an array of shape 8x3 for dim=1, and an array of shape 6x4 for dim=2. If I want to flatten the last dimension only, then this is easily accomplished by
input.reshape(-1, input.shape[-1])
But I would like to support an arbitrary dim (elegantly, without going through all possible cases and checking with if conditions, etc.). It might be possible by first swapping dimensions, so that the dimension of interest is trailing, and then applying the operation above.
Any help?
We can permute axes and reshape -
# a is input array; axis is input axis/dim
np.moveaxis(a,axis,-1).reshape(-1,a.shape[axis])
Functionally, this pushes the specified axis to the back and then reshapes: that axis keeps its length and becomes the second axis, while the remaining axes are merged to form the first axis.
Sample runs -
In [32]: a = np.random.rand(2,3,4)
In [33]: axis = 0
In [34]: np.moveaxis(a,axis,-1).reshape(-1,a.shape[axis]).shape
Out[34]: (12, 2)
In [35]: axis = 1
In [36]: np.moveaxis(a,axis,-1).reshape(-1,a.shape[axis]).shape
Out[36]: (8, 3)
In [37]: axis = 2
In [38]: np.moveaxis(a,axis,-1).reshape(-1,a.shape[axis]).shape
Out[38]: (6, 4)
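Wrapped up as a function in the shape the question asked for, this could look like the sketch below. The name flatten_keep_axis is an assumption chosen for illustration (it also avoids shadowing the built-in input).

```python
import numpy as np

def flatten_keep_axis(arr, axis):
    """Flatten arr to 2-d, keeping `axis` as the columns (hypothetical helper)."""
    arr = np.asarray(arr)
    # Move the kept axis to the end, then merge all other axes into the rows.
    return np.moveaxis(arr, axis, -1).reshape(-1, arr.shape[axis])

a = np.arange(24).reshape(2, 3, 4)
shapes = [flatten_keep_axis(a, ax).shape for ax in range(a.ndim)]

# Each output row holds the values along the kept axis at one position
# of the remaining axes.
first_row_axis0 = flatten_keep_axis(a, 0)[0]
```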

Convert two numpy array to dataframe

I want to convert two NumPy arrays to one DataFrame containing two columns.
The first NumPy array 'images' is of shape (1020, 1024).
The second NumPy array 'label' is of shape (1020,).
My core code is:
images=np.array(images)
label=np.array(label)
l=np.array([images,label])
dataset=pd.DataFrame(l)
But it turns out to be an error saying that:
ValueError: could not broadcast input array from shape (1020,1024) into shape (1020)
What should I do to convert these two NumPy arrays into two columns in one DataFrame?
You can't stack them easily, especially if you want them as different columns, because you can't insert a 2D array in one column of a DataFrame, so you need to convert it to something else, for example a list.
So something like this would work:
import pandas as pd
import numpy as np
images = np.array(images)
label = np.array(label)
dataset = pd.DataFrame({'label': label, 'images': list(images)}, columns=['label', 'images'])
This will create a DataFrame with 1020 rows and 2 columns, where each item in the second column is a 1D array of length 1024.
Coming from engineering, I like the visual side of creating matrices.
# label has shape (1020,) and images has shape (1020, 1024),
# so images must be transposed before the rows can be stacked
matrix_aux = np.vstack([label, images.T])  # shape (1025, 1020)
matrix = np.transpose(matrix_aux)  # shape (1020, 1025), label in column 0
df_lab_img = pd.DataFrame(matrix)
It takes a little more code but leaves you with the NumPy array too.
You can also use hstack
import pandas as pd
import numpy as np
dataset = pd.DataFrame(np.hstack((images, label.reshape(-1, 1))))
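The answers above produce two different layouts, which is easier to see on small data. In this sketch the 5x4 images array and the label values are made up, and the variable names nested and wide are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the (1020, 1024) images and (1020,) label arrays.
images = np.arange(20, dtype=float).reshape(5, 4)
label = np.array([0, 1, 0, 1, 1])

# Layout 1: two columns, with each 'images' cell holding one 1-d array.
nested = pd.DataFrame({"label": label, "images": list(images)})

# Layout 2: one column for the label plus one column per pixel.
wide = pd.DataFrame(np.column_stack((label, images)))
```

The nested layout matches the "two columns" request literally, while the wide layout is usually easier to pass to numerical code because every cell is a plain number.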
