Get numpy array from an array of tuples - python

I was using numpy, MySQLdb and scipy, and ended up with an array of tuples from a MySQL cursor execution. Then I used np.fromiter. Now I have an array of tuples that looks like this:
>>> A
array([('bob', 0.43), ('dan', 0.24), ('bill', 0.14),
       ('sharen', 0.28), ..., ('zena', 0.24), ('zoe', 0.39)],
      dtype=[('f0', 'S10'), ('f1', '<f4')])
How do I make a numpy array for the first part of each tuple? I tried:
>>> Names = A[:][0]
I also tried:
>>> Names = np.array(A[:][0])
But that didn't work; it only gave me the first tuple. I couldn't find any documentation for this specific case. I want numpy arrays like this:
>>> Names
array(['bob', 'bill', ... all the other names ...])
>>> Numbers
array([0.43, 0.24, ...])
Thanks in advance.

What you have there is a structured array (one compound dtype, one tuple per row).
The first field in your array is named 'f0'. You can tell that from the dtype (A.dtype).
You access a whole field with A['f0'] (attribute access like A.f0 also works if you first view the array as a recarray, e.g. A.view(np.recarray)).
Names = A['f0']
Numbers = A['f1']
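As a small self-contained sketch, reusing the dtype from the question with only the first three rows:

```python
import numpy as np

# Rebuild a small structured array shaped like the one in the question
A = np.array([('bob', 0.43), ('dan', 0.24), ('bill', 0.14)],
             dtype=[('f0', 'S10'), ('f1', '<f4')])

# Indexing by field name gives a plain 1-D array per column
Names = A['f0']    # byte strings, dtype 'S10'
Numbers = A['f1']  # float32

print(Names)    # [b'bob' b'dan' b'bill']
print(Numbers)
```

Note that the names come back as byte strings because of the 'S10' dtype; `Names.astype(str)` converts them to regular strings if needed.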

Related

Numpy memmap in-place sort of a large matrix by column

I'd like to sort a matrix of shape (N, 2) on the first column where N >> system memory.
With in-memory numpy you can do:
x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]
But that appears to require that x[:,0].argsort() fit in memory, which won't work for memmap where N >> system memory (please correct me if this assumption is wrong).
Can I achieve this sort in-place with numpy memmap?
(assume heapsort is used for sorting and simple numeric data types are used)
The solution may be simple, using the order argument to an in-place sort. Of course, order requires field names, so those have to be added first.
d = x.dtype
x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
array([[(2, 10)],
       [(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])
The field names are strings, corresponding to the column indices. Sorting can be done in place with
x.sort(order='0', axis=0)
Then convert back to a regular array with the original datatype
x.view(d)
array([[ 1, 20],
       [ 2, 10]])
That should work, although you may need to change how the view is taken depending on how the data is stored on disk, see the docs
For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
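To make the recipe above concrete, here is a hypothetical sketch applying the same field-view trick to an actual memmap. The file path and tiny shape are made up for illustration; with real data you would open the existing file with mode='r+':

```python
import os
import tempfile
import numpy as np

# Toy on-disk (N, 2) int64 array; a real case would memmap the existing file
fname = os.path.join(tempfile.mkdtemp(), 'demo.dat')
x = np.memmap(fname, dtype='<i8', mode='w+', shape=(3, 2))
x[:] = [[2, 10], [3, 5], [1, 20]]

# View the rows as a structured dtype so sort(order=...) can run in place,
# directly on the memory-mapped buffer
d = x.dtype
v = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
v.sort(order='0', axis=0, kind='heapsort')

print(np.asarray(x))   # rows now ordered by the first column
```

Because v is just a view, the sort mutates the bytes on disk; no second array of size N is allocated.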
@user2699 answered the question beautifully. I'm adding this solution as a simplified example in case you don't mind keeping your data as a structured array, which does away with the view.
import numpy as np
filename = '/tmp/test'
x = np.memmap(filename, dtype=[('index', '<f2'),('other1', '<f2'),('other2', '<f2')], mode='w+', shape=(2,))
x[0] = (2, 10, 30)
x[1] = (1, 20, 20)
print(x.shape)
print(x)
x.sort(order='index', axis=0, kind='heapsort')
print(x)
This prints:
(2,)
[(2., 10., 30.) (1., 20., 20.)]
[(1., 20., 20.) (2., 10., 30.)]
Also the dtype formats are documented here.

python ndarray : how do you combine the list picking only certain data

Updated:
CSV File Contents:
iterations,fitness,time,fevals
0,498.0,0.003076461,11
10,500.0,0.004095651,21
I read a csv file using np.genfromtxt with this syntax
a = np.genfromtxt (fname1, delimiter=",",dtype=float,names=True)
b = np.genfromtxt (fname2, delimiter=",",dtype=float,names=True)
When I checked the datatype, it reported that it is as a
'numpy.ndarray'>
'numpy.ndarray'>
The shape that is reported was :
(2,)
(2,)
Since the resulting content is supposed to be a 2D array, how can I convert a and b so that I can join the two using the first column as a join key and then apply a mean over columns 1 and 2 only?
Because you used the names option of genfromtxt, numpy created a structured array: a and b are of the following kind:
array([(0.0, 498.0, 0.003076461, 11.0), (10.0, 500.0, 0.004095651, 21.0)],
      dtype=[('iterations', '<f8'), ('fitness', '<f8'), ('time', '<f8'), ('fevals', '<f8')])
So to average the two values, just do:
np.mean([a[1][2],b[1][2]])
If you want the ordinary array type, get rid of the names option and use skip_header instead (the older skiprows keyword was removed in numpy 1.10):
a = np.genfromtxt('a.txt', delimiter=',', dtype=float, skip_header=1)
b = np.genfromtxt('b.txt', delimiter=',', dtype=float, skip_header=1)
print(np.mean([a[1, 2], b[1, 2]]))
a is now an ordinary numpy.array object:
array([[ 0.00000000e+00, 4.98000000e+02, 3.07646100e-03, 1.10000000e+01],
       [ 1.00000000e+01, 5.00000000e+02, 4.09565100e-03, 2.10000000e+01]])
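Neither snippet above does the join the question asks about. Here is one possible sketch with structured arrays, assuming the first column ('iterations') is the join key and both arrays are sorted by it; the contents of b are invented for illustration:

```python
import numpy as np

dt = [('iterations', '<f8'), ('fitness', '<f8'), ('time', '<f8'), ('fevals', '<f8')]
a = np.array([(0.0, 498.0, 0.003076461, 11.0),
              (10.0, 500.0, 0.004095651, 21.0)], dtype=dt)
b = np.array([(0.0, 496.0, 0.0030, 12.0),
              (10.0, 502.0, 0.0041, 20.0)], dtype=dt)

# Keep only the rows whose key appears in both files, aligned by key
common = np.intersect1d(a['iterations'], b['iterations'])
a_j = a[np.isin(a['iterations'], common)]
b_j = b[np.isin(b['iterations'], common)]

# Mean of the 'fitness' and 'time' columns across the two runs
mean_fitness = (a_j['fitness'] + b_j['fitness']) / 2
mean_time = (a_j['time'] + b_j['time']) / 2
print(mean_fitness)   # [497. 501.]
```

This relies on both files listing the same iteration values in the same order; for irregular keys you would need to sort or reindex first.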

Converting astropy.table.columns to a numpy array

I'd like to plot points:
points = np.random.multivariate_normal(mean=(0,0), cov=[[0.4,9],[9,10]],size=int(1e4))
print(points)
[[-2.50584156 2.77190372]
[ 2.68192136 -3.83203819]
...,
[-1.10738221 -1.72058301]
[ 3.75168017 5.6905342 ]]
print(type(points))
<class 'numpy.ndarray'>
data = ascii.read(datafile)
type(data['ra'])
astropy.table.column.Column
type(data['dec'])
astropy.table.column.Column
and then I try:
points = np.array([data['ra']], [data['dec']])
and get a
TypeError: data type not understood
Thoughts?
An astropy Table Column object can be converted to a numpy array using the data attribute:
In [7]: c = Column([1, 2, 3])
In [8]: c.data
Out[8]: array([1, 2, 3])
You can also convert an entire table to a numpy structured array with the as_array() Table method (e.g. data.as_array() in your example).
BTW, I think the actual problem is not the astropy Column but your numpy array creation statement. It should probably be:
arr = np.array([data['ra'], data['dec']])
This works with Column objects.
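For instance, with two plain arrays standing in for data['ra'] and data['dec'] (astropy Column objects behave the same way inside np.array):

```python
import numpy as np

# Stand-ins for data['ra'] and data['dec']
ra = np.array([10.5, 11.2, 12.9])
dec = np.array([-3.1, 0.4, 2.2])

points = np.array([ra, dec])   # shape (2, N): one row per coordinate
xy = np.array([ra, dec]).T     # shape (N, 2): one row per point

print(points.shape, xy.shape)  # (2, 3) (3, 2)
```

Whether you want the (2, N) or (N, 2) orientation depends on what the plotting or fitting code downstream expects.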
The signature of numpy.array is numpy.array(object, dtype=None, ...).
Hence, when calling np.array([data['ra']], [data['dec']]), [data['ra']] is the object to convert to a numpy array and [data['dec']] is taken as the data type, which is not understood (as the error says).
It's not actually clear from the question what you are trying to achieve instead; possibly something like
points = np.array([data['ra'], data['dec']])
Keep in mind, though, that if all you actually want is to plot the points, you don't need to convert to arrays at all. The following works just fine:
from matplotlib import pyplot as plt
plt.scatter(data['ra'], data['dec'])

3D Pandas DataFrame from csv

I am fairly new to pandas, and need to import a 3D array of tuples from a data file. In the file, the data is formatted as so:
[[(1.1, 1.2), (1.3, 1.4)], [(1.5, 1.6), (1.7, 1.8)], [(1.9, 1.10), (1.11, 1.12)], [(1.13, 1.14), (1.15, 1.16)]]
[[(2.1, 2.2), (2.3, 2.4)], [(2.5, 2.6), (2.7, 2.8)], [(2.9, 2.10), (2.11, 2.12)], [(2.13, 2.14), (2.15, 2.16)]]
[[(3.1, 3.2), (3.3, 3.4)], [(3.5, 3.6), (3.7, 3.8)], [(3.9, 3.10), (3.11, 3.12)], [(3.13, 3.14), (3.15, 3.16)]]
I would like to be able to import this into a data frame such that (for this example) the dimensionality would be 3x4x2 (with another x2, if you want to count the dimensions of the tuples, though those don't necessarily need their own dimension, so long as I can access them as tuples).
In actuality, my data set is much larger than this (with dimensions of roughly 13000x2000x2), so I would like to keep any manual editing that might be needed to a minimum, though I should be able to change how the data is formatted in the file with some simple scripts, if a different format would help.
Even though eval is a dangerous tool, it gives a one-liner here to collect the data:
with open('data.csv') as f:
    a = np.array([eval(x) for x in f.readlines()])
Check:
In [59]: a.shape
Out[59]: (3, 4, 2, 2)
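If the security worry matters, ast.literal_eval is a near drop-in replacement that only parses Python literals and cannot execute arbitrary code. A sketch with two inline lines standing in for the file contents:

```python
import numpy as np
from ast import literal_eval

# Two lines standing in for the rows of data.csv
lines = [
    "[[(1.1, 1.2), (1.3, 1.4)], [(1.5, 1.6), (1.7, 1.8)]]",
    "[[(2.1, 2.2), (2.3, 2.4)], [(2.5, 2.6), (2.7, 2.8)]]",
]
a = np.array([literal_eval(line) for line in lines])
print(a.shape)   # (2, 2, 2, 2)
```

The tuples become the trailing length-2 axis, exactly as with the eval version.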
There is no such thing as a multidimensional dataframe with pandas.
You could think of several dataframes and have them relate to each other with an extra column as an id.
Or you could flatten your 3D array into a dataframe with several columns:
your rows would be the observations, in this case 3
your columns would be the flattened output, 4 x 2 = 8
You could use numpy to reshape:
new_array = numpy.reshape(array, (3, 8))

Converting numpy string array to float: Bizarre?

So, this should be a really straightforward thing but for whatever reason, nothing I'm doing to convert an array of strings to an array of floats is working.
I have a two column array, like so:
Name Value
Bob 4.56
Sam 5.22
Amy 1.22
I try this:
for row in myarray[1:,]:
    row[1] = float(row[1])
And this:
for row in myarray[1:,]:
    row[1] = row[1].astype(1)
And this:
myarray[1:,1] = map(float, myarray[1:,1])
And they all seem to do something, but when I double check:
type(myarray[9,1])
I get
<type 'numpy.string_'>
Numpy arrays must have one dtype unless it is structured. Since you have some strings in the array, they must all be strings.
If you wish to have a complex dtype, you may do so:
import numpy as np
a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')], dtype = [('name','S3'),('val',float)])
Note that a is now a 1d structured array, where each element is a tuple of type dtype.
You can access the values using their field name:
In [21]: a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')],
...: dtype = [('name','S3'),('val',float)])
In [22]: a
Out[22]:
array([('Bob', 4.56), ('Sam', 5.22), ('Amy', 1.22)],
dtype=[('name', 'S3'), ('val', '<f8')])
In [23]: a['val']
Out[23]: array([ 4.56, 5.22, 1.22])
In [24]: a['name']
Out[24]:
array(['Bob', 'Sam', 'Amy'],
dtype='|S3')
The dtype of a numpy array is fixed when the array is created. If you want to change it later, you must cast the whole array, not the objects within it:
myNewArray = myArray.astype(float)
Note that astype returns a new array; assigning converted values back into a string array (as the loops above do) just converts them straight back to strings.
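Applied to a two-column array like the one in the question (a sketch assuming the header row is row 0, as the [1:,] slices above suggest):

```python
import numpy as np

myarray = np.array([['Name', 'Value'],
                    ['Bob', '4.56'],
                    ['Sam', '5.22'],
                    ['Amy', '1.22']])

# astype returns a NEW float array; the original string array is untouched
values = myarray[1:, 1].astype(float)
print(values)   # [4.56 5.22 1.22]
```

The names column can be kept separately with `myarray[1:, 0]`, or the whole thing can be loaded as a structured array as shown in the first answer.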
For further information see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.chararray.astype.html
