converting pd.Series of strings into ndarray - python

I extract an array of words from a pandas column:
X = np.array(tab1['word'])
Example of X: array(['dog', 'cat'], dtype=object)
X is built from a pandas Series of 665 strings.
I then transform each word into an ndarray of shape (1, 270):
for i in range(len(X)):
    tmp = X[i]
    z = func(tmp)  # function that returns an ndarray of shape (1, 270)
    X[i] = z
My end goal is an ndarray of shape (665, 270),
but instead I get shape (665,).
I also can't reshape it. When I try X.reshape(665,270)
I get this error:
ValueError: cannot reshape array of size 665 into shape (665,270)
The func(word) function could be any function, for example:
def func(word):
    a = np.arange(0, 270)
    a = a.reshape(1, 270)
    return a
Any ideas why this happens?

The problem is how to convert a pandas Series of strings into a NumPy array using a function that, given a string, returns a (1, n) array.
Here is the solution:
import pandas as pd
import numpy as np
# You have a series of strings
X = pd.Series(['aaa'] * 665)
# You have a transformative func that returns a (1, n) np.array
def func(word, n=270):
    return np.zeros((1, n))
# You apply the function to the series and vertically stack the results
Xs = np.vstack(X.apply(func))
# You check for the desired shape
print(Xs.shape)
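As a rough illustration of why the loop in the question ends up with shape (665,): writing arrays back into an object array only replaces its elements, so the container stays one-dimensional and there is nothing to reshape. The sketch below uses three words instead of 665, and `func` is a stand-in for the real transform, as in the answer above.

```python
import numpy as np

def func(word, n=270):
    # Stand-in for the real transform: returns a (1, n) array
    return np.zeros((1, n))

words = np.array(['dog', 'cat', 'cow'], dtype=object)

# Assigning into an object array stores each (1, n) array as a single
# element; the outer array keeps its original 1-D shape.
boxed = words.copy()
for i in range(len(boxed)):
    boxed[i] = func(boxed[i])
print(boxed.shape, boxed.dtype)

# Collecting the results in a list and stacking them builds a real 2-D array.
stacked = np.vstack([func(w) for w in words])
print(stacked.shape)
```

With 665 words the same pattern would give the (665, 270) array the question asks for.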

The key lines below are:
z = list(func(tmp)) # converting returned value from func to a list
and
result = np.array([x for x in X.values])
Here is my full test code:
import numpy as np
import pandas as pd
def func(tmp):
    return np.array([t for t in tmp])
X = pd.Series({'a': 'abc', 'x': 'xyz', 'j': 'jkl', 'z': 'zzz'})
for i in range(len(X)):
    tmp = X[i]
    z = list(func(tmp))  # converting returned value from func to a list
    X[i] = z
result = np.array([x for x in X.values])
Then type result in the console; you'll see it is a (4, 3) ndarray.
In [3]: result
Out[3]:
array([['a', 'b', 'c'],
       ['x', 'y', 'z'],
       ['j', 'k', 'l'],
       ['z', 'z', 'z']], dtype='<U1')

Related

How to sum a single column array with another array (going column by column)?

The code below allows me to add a vector to each row of a given matrix using Numpy:
import numpy as np
m = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 1, 0])
print("Original vector:")
print(v)
print("Original matrix:")
print(m)
result = np.empty_like(m)
for i in range(4):
    result[i, :] = m[i, :] + v
print("\nAfter adding the vector v to each row of the matrix m:")
print(result)
How do I perform a similar addition operation, but going column by column?
I have tried the following:
import numpy as np
array1 = np.array([[5,5,3],[2,2,3]])
print(array1)
addition = np.array([[1],[1]])
print(addition)
for i in range(3):
    array1[:,i] = array1[:,i] + addition
print(array1)
However, I get the following broadcasting error:
ValueError: could not broadcast input array from shape (2,2) into shape (2)
Just match the number of dimensions; NumPy will broadcast the arrays as needed. In the first example, it should be:
result = m + v.reshape((1, -1))
In the second example, addition is already 2-D with shape (2, 1), so it is just:
array1 + addition
If addition were 1-D instead, e.g. np.array([1, 1]), you could add the missing dimension with NumPy's None syntax and then do the addition:
array1 += addition[:, None]
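A short sketch of the shapes involved, using the arrays from the question. The name `flat` is introduced here only for illustration; it is the 1-D version of the per-row vector.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 1, 0])            # shape (3,): one value per column
row_result = m + v                  # v is broadcast as if it had shape (1, 3)

array1 = np.array([[5, 5, 3], [2, 2, 3]])
addition = np.array([[1], [1]])    # shape (2, 1): one value per row
col_result = array1 + addition      # broadcast across the 3 columns

flat = np.array([1, 1])            # shape (2,): hypothetical 1-D variant
same = array1 + flat[:, None]       # flat[:, None] has shape (2, 1)

print(row_result)
print(col_result)
```

The loop in the question fails because array1[:, i] has shape (2,), and adding a (2, 1) array to it broadcasts to (2, 2), which cannot be written back into a length-2 column.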

Split a numpy array using masking in python

I have a NumPy array my_arr of size 100x20. I want to create a function that receives a 2-D NumPy array my_arr and an index x and returns two arrays: test_arr of size 1x20 and train_arr of size 99x20. test_arr corresponds to the row of my_arr at index x, and train_arr contains the remaining rows. I tried to follow a solution using masking:
def split_train_test(my_arr, x):
    a = np.ma.array(my_arr, mask=False)
    a.mask[x, :] = True
    a = np.array(a.compressed())
    return a
Apparently this is not working as I wanted. How can I return a NumPy array as a result, with the train and test arrays split properly?
You can use simple indexing and numpy.delete for this:
def split_train_test(my_arr, x):
    return np.delete(my_arr, x, 0), my_arr[x:x+1]
my_arr = np.arange(10).reshape(5,2)
train, test = split_train_test(my_arr, 2)
train
#array([[0, 1],
# [2, 3],
# [6, 7],
# [8, 9]])
test
#array([[4, 5]])
You can also use a boolean index as the mask:
def split_train_test(my_arr, x):
    # define mask
    mask = np.zeros(my_arr.shape[0], dtype=bool)
    mask[x] = True  # True only at index x, False elsewhere
    return my_arr[mask, :], my_arr[~mask, :]
Sample run:
test_arr, train_arr = split_train_test(np.random.rand(100, 20), x=10)
print(test_arr.shape, train_arr.shape)
(1, 20) (99, 20)
EDIT:
If someone is looking for the general case where more than one row needs to be allocated to the test array (say an 80%-20% split), x can also be an array of indices:
my_arr = np.random.rand(100, 20)
x = np.random.choice(np.arange(my_arr.shape[0]), int(my_arr.shape[0]*0.8), replace=False)
test_arr, train_arr = split_train_test(my_arr, x)
print(test_arr.shape, train_arr.shape)
(80, 20) (20, 20)
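For context on why the masked-array attempt in the question does not return what was expected: compressed() returns the unmasked values as a 1-D array, so the row structure is lost. The sketch below shows this on a small 5x2 array and, as one possible repair, reshapes the result back to two columns; it assumes whole rows are masked, as in the question.

```python
import numpy as np

my_arr = np.arange(10).reshape(5, 2)
x = 2

a = np.ma.array(my_arr, mask=False)
a.mask[x, :] = True           # mask the whole row x
flat = a.compressed()         # 1-D array of the unmasked values

# Because entire rows were masked, the row structure can be restored.
train = flat.reshape(-1, my_arr.shape[1])
test = my_arr[x:x + 1]        # slicing keeps the result 2-D

print(flat.shape, train.shape, test.shape)
```

np.delete or a boolean row mask, as in the answers above, avoids the round trip through a flattened array.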

NumPy: Improperly Creating 2-D Array

I'm reading in data and trying to create a NumPy array of shape (194, 1). So it should look like: [[4], [0], [9], ...]
I'm doing this:
def parse_data(file_name):
    data = []
    target = []
    with open(file_name) as f:
        for line in f:
            temp = line.split()
            x = [float(x) for x in temp[:2]]
            y = float(temp[2])
            data.append(np.array(x))
            target.append(np.array(y))
    return np.array(data), np.array(target)
x, y = parse_data("data.txt")
When I inspect y.shape, it's (194,), not (194, 1) as I expected.
The x has shape (194,2) as I'd expect, however.
Any idea what I'm doing incorrectly?
Thanks!
You seem to have expected np.array(y) to automatically turn your scalar y into a 1-element row. That's not how NumPy works.
np.array(y) is 0-dimensional. Putting a bunch of those in a list and calling array on the list produces a 1-dimensional result, not a 2-dimensional one.
When np.array() is called on a list of 0-dimensional arrays, it stacks them along a new axis into a single 1-dimensional array, giving you your (194,) shape.
You can always reshape y to your desired shape:
def parse_data(file_name):
    data = []
    target = []
    with open(file_name) as f:
        for line in f:
            temp = line.split()
            x = [float(x) for x in temp[:2]]
            y = float(temp[2])
            data.append(np.array(x))
            target.append(y)
    return np.array(data), np.array(target).reshape(-1, 1)
x, y = parse_data("data.txt")
Of course you can also fix your problem with:
target.append(np.array([y]))
An example of the behavior I stated above:
import numpy as np
a = np.array(5)
b = np.array(4)
v = [a, b]
v
>>>[array(5), array(4)]
np.array(v)
>>>array([5, 4]) # the 0-d arrays are stacked into one 1-d array
I'd skip the np.array in the iteration.
def parse_data(file_name):
    data = []
    target = []
    with open(file_name) as f:
        for line in f:
            temp = line.split()
            x = [float(x) for x in temp[:2]]
            y = float(temp[2])
            data.append(x)
            target.append(y)
    return np.array(data), np.array(target)
This would create data like:
[[1.0, 2.0],[3.0, 4.0], ....]
and target like
[1.2, 3.2, 3.1, ...]
np.array(data) then turns the list of lists into a 2d array, and np.array(target) turns the list of numbers into a 1d array.
It is then easy to reshape or add a dimension to the 1d array, making it (1,n) or (n,1) or whatever you need.
Remember the basic array construction methods are:
np.array([1,2,3]) # 1d
np.array([[1,2],[3,4]]) # 2d
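A small sketch of the shapes discussed in this thread, using two values in place of the 194 targets. It shows that a list of 0-d arrays becomes a 1-D array, and three ways to get a column instead.

```python
import numpy as np

a = np.array(5)                      # 0-d array, shape ()
b = np.array(4)
stacked = np.array([a, b])           # shape (2,)

col_reshape = stacked.reshape(-1, 1)       # shape (2, 1)
col_newaxis = stacked[:, np.newaxis]       # shape (2, 1)

# Wrapping each scalar in a 1-element array also yields a column.
wrapped = np.array([np.array([5]), np.array([4])])   # shape (2, 1)

print(a.shape, stacked.shape, col_reshape.shape, wrapped.shape)
```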

Multi-dimensional gather in Tensorflow

The general solution to this question is being worked on in this github issue, but I was wondering if there are workarounds using tf.gather (or something else) to achieve array indexing using a multi-index. One solution I came up with was to broadcast multiply each index in the multi-idx with the cumulative product of the tensor shape, which produces indices suitable for indexing the flattened tensor:
import tensorflow as tf
import numpy as np
def __cumprod(l):
    # Get the length and make a copy
    ll = len(l)
    l = [v for v in l]
    # Reverse cumulative product
    for i in range(ll-1):
        l[ll-i-2] *= l[ll-i-1]
    return l
def ravel_multi_index(tensor, multi_idx):
    """
    Returns a tensor suitable for use as the index
    on a gather operation on argument tensor.
    """
    if not isinstance(tensor, (tf.Variable, tf.Tensor)):
        raise TypeError('tensor should be a tf.Variable')
    if not isinstance(multi_idx, list):
        multi_idx = [multi_idx]
    # Shape of the tensor in ints
    shape = [i.value for i in tensor.get_shape()]
    if len(shape) != len(multi_idx):
        raise ValueError("Tensor rank is different "
                         "from the multi_idx length.")
    # Work out the shape of each tensor in the multi_idx
    idx_shape = [tuple(j.value for j in i.get_shape()) for i in multi_idx]
    # Ensure that each multi_idx tensor is length 1
    assert all(len(i) == 1 for i in idx_shape)
    # Create a list of reshaped indices. New shape will be
    # [1, 1, dim[0], 1] for the 3rd index in multi_idx
    # for example.
    reshaped_idx = [tf.reshape(idx, [1 if i != j else dim[0]
                                     for j in range(len(shape))])
                    for i, (idx, dim)
                    in enumerate(zip(multi_idx, idx_shape))]
    # Figure out the base indices for each dimension
    base = __cumprod(shape)
    # Now multiply base indices by each reshaped index
    # to produce the flat index
    return (sum(b*s for b, s in zip(base[1:], reshaped_idx[:-1]))
            + reshaped_idx[-1])
# Shape and slice starts and sizes
shape = (Z, Y, X) = 4, 5, 6
Z0, Y0, X0 = 1, 1, 1
ZS, YS, XS = 3, 3, 4
# Numpy matrix and index
M = np.random.random(size=shape)
idx = [
    np.arange(Z0, Z0+ZS).reshape(ZS,1,1),
    np.arange(Y0, Y0+YS).reshape(1,YS,1),
    np.arange(X0, X0+XS).reshape(1,1,XS),
]
# Tensorflow matrix and indices
TM = tf.Variable(M)
TF_flat_idx = ravel_multi_index(TM, [
    tf.range(Z0, Z0+ZS),
    tf.range(Y0, Y0+YS),
    tf.range(X0, X0+XS)])
TF_data = tf.gather(tf.reshape(TM,[-1]), TF_flat_idx)
with tf.Session() as S:
    S.run(tf.initialize_all_variables())
    # Obtain data via flat indexing
    data = S.run(TF_data)
    # Check that it agrees with data obtained
    # by numpy smart indexing
    assert np.all(data == M[idx])
However, this only works on tensors of rank 3, because of a (current) limitation that restricts broadcasts to tensors of rank 3.
At the moment I can only think of doing a chained gather, transpose, gather, transpose, gather, but this is unlikely to be efficient, e.g.
shape = (8, 9, 10)
A = tf.random_normal(shape)
data = tf.gather(tf.transpose(tf.gather(A, [1, 3]), [1,0,2]), ...)
Any ideas?
It sounds like you want tf.gather_nd.
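The flat-index idea from the question can be checked in plain NumPy, which avoids depending on a particular TensorFlow version. This is only a sketch of the arithmetic: it uses NumPy's own np.ravel_multi_index and np.ix_, and the slice bounds are the ones from the question. tf.gather_nd, which the answer points to, takes coordinate tuples directly, so this flattening step should not be needed there; check the TensorFlow documentation for its exact index layout.

```python
import numpy as np

shape = (4, 5, 6)
M = np.random.random(size=shape)

# Open-mesh index for the block M[1:4, 1:4, 1:5]
idx = np.ix_(np.arange(1, 4), np.arange(1, 4), np.arange(1, 5))

# Expand the three index arrays to a common (3, 3, 4) shape, then flatten
full_idx = tuple(np.broadcast_arrays(*idx))
flat_idx = np.ravel_multi_index(full_idx, shape)
via_flat = M.reshape(-1)[flat_idx]

# The same flat index by hand: multiply each coordinate by its row-major
# stride (the reverse cumulative product of the trailing dimensions).
strides = (shape[1] * shape[2], shape[2], 1)
manual = idx[0] * strides[0] + idx[1] * strides[1] + idx[2] * strides[2]

print(via_flat.shape)
```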

NumPy Vectorize a function, unknown shape

I have a function that takes a NumPy array as a parameter, for example:
def f(arr):
    return arr.sum()
and I want to apply it to each vector along the last axis of A to build a new NumPy array, so if A.shape = (14,12,7), then myfunc(A).shape = (14,12)
i.e.
myfunc(A)[x, y] = f(A[x, y])
Note that len(A.shape) is not specified.
You can apply sum along the last axis:
A.sum(axis=-1)
For example:
In [1]: np.ones((14,12,7)).sum(axis=-1).shape
Out[1]: (14, 12)
If you have a generic function you can use apply_along_axis:
np.apply_along_axis(sum, -1, A)
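A minimal sketch of the generic case, wrapping np.apply_along_axis so that it works for any number of dimensions. The names f and myfunc follow the question; the test arrays are made up for illustration.

```python
import numpy as np

def f(arr):
    # Any function that reduces a 1-D array to a scalar
    return arr.sum()

def myfunc(A):
    # Apply f to every 1-D slice along the last axis, whatever the rank of A
    return np.apply_along_axis(f, -1, A)

A = np.arange(14 * 12 * 7).reshape(14, 12, 7)
out = myfunc(A)
print(out.shape)

B = np.ones((2, 3, 4, 5))
print(myfunc(B).shape)
```

apply_along_axis calls f once per slice in Python, so when a built-in reduction with an axis argument exists (as with sum), that is usually the faster choice.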
