I have several arrays, some containing floats and others strings; all the arrays have the same length. When I use numpy.column_stack on these arrays, the function converts the floats to strings, for example:
a = np.array([3.4,3.4,6.4])
b = np.array(['holi','xlo','xlo'])
B = np.column_stack((a,b))
print B
>>> [['3.4' 'holi']
 ['3.4' 'xlo']
 ['6.4' 'xlo']]
type(B[0,0])
>>> numpy.string_
Why? Is it possible to avoid it?
Thanks a lot for your time.
The easiest structured array approach is with the rec.fromarrays function:
In [1411]: a=np.array([3.4,3.4,6.4]); b=np.array(['holi','xlo','xlo'])
In [1412]: B = np.rec.fromarrays([a,b],names=['a','b'])
In [1413]: B
Out[1413]:
rec.array([(3.4, 'holi'), (3.4, 'xlo'), (6.4, 'xlo')],
dtype=[('a', '<f8'), ('b', '<U4')])
In [1414]: B['a']
Out[1414]: array([ 3.4, 3.4, 6.4])
In [1415]: B['b']
Out[1415]:
array(['holi', 'xlo', 'xlo'],
dtype='<U4')
Check its docs for more parameters. But basically it constructs an empty array of the correct compound dtype and copies your arrays into the respective fields.
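As a rough sketch of what rec.fromarrays does internally (an allocation plus per-field copies; the field names here are just the ones chosen above):

```python
import numpy as np

a = np.array([3.4, 3.4, 6.4])
b = np.array(['holi', 'xlo', 'xlo'])

# Allocate an empty array with the compound dtype, then copy
# each input array into its own field
B = np.empty(len(a), dtype=[('a', a.dtype), ('b', b.dtype)])
B['a'] = a
B['b'] = b
```

Each field keeps its original dtype, so `B['a']` is still float64.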
To store such mixed-type data, you would typically either use object dtype arrays or structured arrays. Going with object dtype arrays: convert either of the input arrays to object dtype up front, then stack it alongside the rest. The other arrays are converted to object dtype automatically, giving a stacked array of that type. Thus, we would have an implementation like so:
np.column_stack((a.astype(object), b))
Sample run to show how to construct a stacked array and retrieve the individual arrays back -
In [88]: a
Out[88]: array([ 3.4, 3.4, 6.4])
In [89]: b
Out[89]:
array(['holi', 'xlo', 'xlo'],
dtype='|S4')
In [90]: out = np.column_stack((a.astype(object),b))
In [91]: out
Out[91]:
array([[3.4, 'holi'],
[3.4, 'xlo'],
[6.4, 'xlo']], dtype=object)
In [92]: out[:,0].astype(float)
Out[92]: array([ 3.4, 3.4, 6.4])
In [93]: out[:,1].astype(str)
Out[93]:
array(['holi', 'xlo', 'xlo'],
dtype='|S4')
Consider the following Python code that multiplies two complex numbers:
import numpy as np
a = np.matrix('28534314.10478439+28534314.10478436j').astype(np.complex128)
b = np.matrix('-1.39818115e+09+1.39818115e+09j').astype(np.complex128)
#Verify values
print(a)
print(b)
c=np.dot(a.getT(),b)
#Verify product
print(c)
Now the product should be -7.979228021897728000e+16 + 48j which is correct when I run on Spyder. However, if I receive the values a and b from a sender to a receiver via MPI on an MPI4py program (I verify that they have been received correctly) the product is wrong and specifically -7.97922801e+16+28534416.j. In both cases I am using numpy 1.14.3 and Python 2.7.14. The only difference in the latter case is that prior to receiving the values I initialize the matrices with:
a = np.empty_like(np.matrix([[0]*(1) for i in range(1)])).astype(np.complex128)
b = np.empty_like(np.matrix([[0]*(1) for i in range(1)])).astype(np.complex128)
and then the function MPI::Comm::Irecv() gives them the correct values.
What is going wrong in the latter case if the a and b are correct but c is wrong? Is numpy arbitrarily setting the imaginary part since it's quite smaller than the real part of the product?
First, this doesn't address the MPI issue, but since it was raised in the comments:
np.matrix can take a string argument and produce a numeric matrix from it. Also notice that the shape is (1,1):
In [145]: a = np.matrix('28534314.10478439+28534314.10478436j')
In [146]: a
Out[146]: matrix([[28534314.10478439+28534314.10478436j]])
In [147]: a.dtype
Out[147]: dtype('complex128')
String input to np.array produces a string:
In [148]: a = np.array('28534314.10478439+28534314.10478436j')
In [149]: a
Out[149]: array('28534314.10478439+28534314.10478436j', dtype='<U36')
But omit the quotes and we get the complex array, with shape () (0d):
In [151]: a = np.array(28534314.10478439+28534314.10478436j)
In [152]: a
Out[152]: array(28534314.10478439+28534314.10478436j)
In [153]: a.dtype
Out[153]: dtype('complex128')
And the product of these values:
In [154]: b = np.array(-1.39818115e+09+1.39818115e+09j)
In [155]: a*b # a.dot(b) same thing
Out[155]: (-7.979228021897728e+16+48j)
Without using mp, I assume the initialization and setting is something like this:
In [179]: x=np.empty_like(np.matrix([[0]*(1) for i in range(1)])).astype(np.complex128)
In [180]: x[:]=a
In [181]: x
Out[181]: matrix([[28534314.10478439+28534314.10478436j]])
In [182]: y=np.empty_like(np.matrix([[0]*(1) for i in range(1)])).astype(np.complex128)
In [183]: y[:]=b
In [184]: y
Out[184]: matrix([[-1.39818115e+09+1.39818115e+09j]])
In [185]: x*y
Out[185]: matrix([[-7.97922802e+16+48.j]])
It may be worth trying np.zeros_like instead of np.empty_like. That ensures the imaginary part starts at 0, instead of something random. Then if the MPI process is only setting the real part, you should see a different result.
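A minimal sketch of that suggestion (note np.matrix is deprecated in current NumPy; a 2-D np.array behaves the same way here):

```python
import numpy as np

a = np.matrix('28534314.10478439+28534314.10478436j').astype(np.complex128)
b = np.matrix('-1.39818115e+09+1.39818115e+09j').astype(np.complex128)

# zeros_like guarantees a 0+0j starting value; empty_like leaves
# whatever happened to be in that memory
x = np.zeros_like(np.matrix([[0]])).astype(np.complex128)
y = np.zeros_like(np.matrix([[0]])).astype(np.complex128)
x[:] = a
y[:] = b
c = x * y
```

With a fully-assigned element the product's imaginary part is tiny relative to the real part, as in the question; a garbage imaginary part left by empty_like would show up immediately.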
I am trying to use a function (preprocessing.scale) on a list of data. I am new to mapreduce/parallelism in Python - I would like to process this on a large list of data to improve performance.
Example:
X = [1,2,3,4]
Using the syntax:
list(map(preprocessing.scale, X))
I get this error:
TypeError: Singleton array array(1.0) cannot be considered a valid collection.
I think that is because of the return type of the function, but I am not sure how to fix this. Any help would be greatly appreciated!
You don't need (or want) to use the map function here, as it runs a Python-level loop under the hood.
Almost all sklearn methods are vectorized and accept list-like objects (lists, numpy arrays, etc.), which is much faster than the map(...) approach.
Demo:
In [121]: from sklearn.preprocessing import scale
In [122]: X = [1,2,3,4]
In [123]: scale(X)
Out[123]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
the same demo using numpy array:
In [39]: x = np.array(X)
In [40]: x
Out[40]: array([1, 2, 3, 4])
In [41]: scale(x)
DataConversionWarning: Data with input dtype int32 was converted to float64 by the scale function.
warnings.warn(msg, _DataConversionWarning)
Out[41]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
it expects float dtype, so we can easily convert our numpy array to float dtype on the fly:
In [42]: scale(x.astype('float64'))
Out[42]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
Executing list(map(preprocessing.scale, X)) is equivalent to executing [preprocessing.scale(a) for a in X].
Given this, what you are currently doing is scaling a singleton (a single observation). You cannot scale a single item, and that is where the function breaks. Try preprocessing.scale(X[0]) and you will get the same error.
Why are you trying to run it like that, rather than just passing the whole array: preprocessing.scale(X)?
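To see why scaling needs the whole collection, here is roughly what scale computes for 1-D input, sketched in plain NumPy (sklearn's scale also handles 2-D input and optional centering flags, which this omits):

```python
import numpy as np

X = np.array([1, 2, 3, 4], dtype=float)

# Standardize the whole collection at once: subtract the mean and
# divide by the standard deviation. A single element has no spread
# (std of 0), which is why scaling one item at a time fails.
scaled = (X - X.mean()) / X.std()
```

The result matches the `scale(X)` output shown above: mean 0, standard deviation 1.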
My values are currently showing as 1.00+e09 in an array (type float64). I would like them to show 1000000000 instead. Is this possible?
Make a sample array
In [206]: x=np.array([1e9, 2e10, 1e6])
In [207]: x
Out[207]: array([ 1.00000000e+09, 2.00000000e+10, 1.00000000e+06])
We can convert to ints - except notice that the largest one is too large for the default int32:
In [208]: x.astype(int)
Out[208]: array([ 1000000000, -2147483648, 1000000])
In [212]: x.astype(np.int64)
Out[212]: array([ 1000000000, 20000000000, 1000000], dtype=int64)
Writing a csv with the default float format (used regardless of the array dtype):
In [213]: np.savetxt('text.txt',x)
In [214]: cat text.txt
1.000000000000000000e+09
2.000000000000000000e+10
1.000000000000000000e+06
We can specify a format:
In [215]: np.savetxt('text.txt',x, fmt='%d')
In [216]: cat text.txt
1000000000
20000000000
1000000
Potentially there are 3 issues:
integer v float in the array itself, i.e. its dtype
display or print of the array
writing the array to a csv file
It is a printing option; see the documentation on printing options. Briefly stated: you need to use the suppress option when printing:
np.set_printoptions(suppress=True) # for small floating point.
np.set_printoptions(suppress=True, formatter={'all':lambda x: str(x)})
Does numpy have the cell2mat function? Here is the link to matlab. I found an implementation of something similar but it only works when we can split it evenly. Here is the link.
In a sense Python has had 'cells' a lot longer than MATLAB - the list. A Python list is a direct substitute for a 1d cell (or rather, a cell with a size-1 dimension). A 2d cell can be represented as a nested list. numpy arrays with dtype object also work; I believe that is what scipy.io.loadmat uses to render cells in .mat files.
np.array() converts a list, or lists of lists, etc., to an ndarray. Sometimes it needs help specifying the dtype. It also tries to render the input as a high-dimensional array where possible.
np.array([1,2,3])                      # 1d int array
np.array(['1',2,'abc'],dtype=object)   # 1d object array
np.array([[1,2,3],[1,2],[3]])          # ragged lists - object array (newer NumPy requires dtype=object)
np.array([[1,2],[3,4]])                # 2x2 int array
And MATLAB structures map onto Python dictionaries or objects.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
loadmat can also represent structures as numpy structured (record) arrays.
There is np.concatenate that takes a list of arrays, and its convenience derivatives vstack, hstack, dstack. Mostly they tweak the dimensions of the arrays, and then concatenate on one axis.
Here's a rough approximation to the MATLAB cell2mat example:
C = {[1], [2 3 4];
[5; 9], [6 7 8; 10 11 12]}
Construct ndarrays with the same shapes:
In [61]: c11=np.array([[1]])
In [62]: c12=np.array([[2,3,4]])
In [63]: c21=np.array([[5],[9]])
In [64]: c22=np.array([[6,7,8],[10,11,12]])
Join them with a combination of hstack and vstack - i.e. concatenate along the matching axes.
In [65]: A=np.vstack([np.hstack([c11,c12]),np.hstack([c21,c22])])
# or A=np.hstack([np.vstack([c11,c21]),np.vstack([c12,c22])])
producing:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Or more generally (and compactly)
In [75]: C=[[c11,c12],[c21,c22]]
In [76]: np.vstack([np.hstack(c) for c in C])
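On NumPy 1.13+ the same assembly can be written with np.block, which nests lists of blocks much the way cell2mat nests cells:

```python
import numpy as np

c11 = np.array([[1]])
c12 = np.array([[2, 3, 4]])
c21 = np.array([[5], [9]])
c22 = np.array([[6, 7, 8], [10, 11, 12]])

# np.block stitches one matrix together from nested lists of blocks,
# concatenating along the matching axes
A = np.block([[c11, c12], [c21, c22]])
```

This produces the same 3x4 array as the hstack/vstack combination above.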
I usually use object arrays as a replacement for Matlab's cell arrays. For example:
cell_array = np.array([[np.arange(10)],
[np.arange(30,40)] ],
dtype='object')
This is a 2x1 object array containing length-10 numpy vectors. I can get the cell2mat functionality with:
arr = np.concatenate(cell_array).astype('int')
This returns a 2x10 int array. You can change .astype('int') to whatever data type you need, or grab the dtype from one of the objects in your cell_array:
arr = np.concatenate(cell_array).astype(cell_array[0].dtype)
Good luck!
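One caveat: with equal-length vectors, np.array may absorb them into a single 3-D object array rather than a 2x1 array of vectors; pre-allocating the object array avoids that ambiguity. A sketch:

```python
import numpy as np

# Pre-allocate the object array, then fill it, so np.array never
# gets a chance to merge the equal-length vectors into one 3-D array
cell_array = np.empty((2, 1), dtype=object)
cell_array[0, 0] = np.arange(10)
cell_array[1, 0] = np.arange(30, 40)

# cell2mat-style flattening back to a plain 2x10 int array
arr = np.vstack([cell_array[i, 0] for i in range(2)])
```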
So, this should be a really straightforward thing but for whatever reason, nothing I'm doing to convert an array of strings to an array of floats is working.
I have a two column array, like so:
Name Value
Bob 4.56
Sam 5.22
Amy 1.22
I try this:
for row in myarray[1:,]:
row[1]=float(row[1])
And this:
for row in myarray[1:,]:
row[1]=row[1].astype(1)
And this:
myarray[1:,1] = map(float, myarray[1:,1])
And they all seem to do something, but when I double check:
type(myarray[9,1])
I get
<type 'numpy.string_'>
Numpy arrays must have a single dtype unless the array is structured. Since you have some strings in the array, all elements must be strings.
If you wish to have a complex dtype, you may do so:
import numpy as np
a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')], dtype = [('name','S3'),('val',float)])
Note that a is now a 1d structured array, where each element is a tuple matching the compound dtype.
You can access the values using their field name:
In [21]: a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')],
...: dtype = [('name','S3'),('val',float)])
In [22]: a
Out[22]:
array([('Bob', 4.56), ('Sam', 5.22), ('Amy', 1.22)],
dtype=[('name', 'S3'), ('val', '<f8')])
In [23]: a['val']
Out[23]: array([ 4.56, 5.22, 1.22])
In [24]: a['name']
Out[24]:
array(['Bob', 'Sam', 'Amy'],
dtype='|S3')
The type of the objects in a numpy array is determined at the initialisation of that array. If you want to change it later, you must cast the array, not the objects within it:
myNewArray = myArray.astype(float)
Note: upcasting can happen implicitly; for downcasting you need the astype method.
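Applied to the question's mixed array (names in one column, numeric strings in the other), casting just the value column as a whole looks like this sketch:

```python
import numpy as np

# Mixed array as in the question: a header row plus string data
myarray = np.array([['Name', 'Value'],
                    ['Bob', '4.56'],
                    ['Sam', '5.22'],
                    ['Amy', '1.22']])

# astype returns a *new* float array; assigning floats back into the
# string array would just re-stringify them, which is why the
# element-by-element loops appeared to do nothing
values = myarray[1:, 1].astype(float)
```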
For further information see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.chararray.astype.html