Convert list of tuples to structured numpy array - python

I have a list of Num_tuples tuples that all have the same length Dim_tuple
xlist = [tuple_1, tuple_2, ..., tuple_Num_tuples]
For definiteness, let's say Num_tuples=3 and Dim_tuple=2
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
I want to convert xlist into a structured numpy array xarr using a user-provided list of column names user_names and a user-provided list of variable types user_types
user_names = [name_1, name_2, ..., name_Dim_tuple]
user_types = [type_1, type_2, ..., type_Dim_tuple]
So in the creation of the numpy array,
dtype = [(name_1,type_1), (name_2,type_2), ..., (name_Dim_tuple, type_Dim_tuple)]
In the case of my toy example desired end product would look something like:
xarr['name1']=np.array([1,2,3])
xarr['name2']=np.array([1.1,1.2,1.3])
How can I slice xlist to create xarr without any loops?

A list of tuples is the correct way of providing data to a structured array:
In [273]: xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
In [274]: dt=np.dtype('int,float')
In [275]: np.array(xlist,dtype=dt)
Out[275]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
In [276]: xarr = np.array(xlist,dtype=dt)
In [277]: xarr['f0']
Out[277]: array([1, 2, 3])
In [278]: xarr['f1']
Out[278]: array([ 1.1, 1.2, 1.3])
or if the names are important:
In [280]: xarr.dtype.names=['name1','name2']
In [281]: xarr
Out[281]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
dtype=[('name1', '<i4'), ('name2', '<f8')])
http://docs.scipy.org/doc/numpy/user/basics.rec.html#filling-structured-arrays

hpaulj's answer is interesting but horrifying :)
The modern Pythonic way to have named columns is to use pandas, a highly popular package built on top of numpy:
import pandas as pd
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
# Cast name1 to int because pandas' default is float
df = pd.DataFrame(xlist, columns=['name1', 'name2']).astype({'name1':int})
print(df)
This gives you a DataFrame, df, which is the structure you want:
name1 name2
0 1 1.1
1 2 1.2
2 3 1.3
You can do all kinds of wonderful things with this, like slicing and various operations.

Related

Getting unexpected shape while slicing a numpy array

I have a 4D numpy array. While slicing for multiple indices in a single dimension, my axis get interchanged. Am I missing something trivial here.
import numpy as np
from smartprint import smartprint as prints
a = np.random.rand(50, 60, 70, 80)
b = a[:, :, :, [2,3,4]]
prints (b.shape) # this works as expected
c = a[1, :, :, [2,3,4]]
prints (c.shape) # here, I see the axes are interchanged
Output:
b.shape : (50, 60, 70, 3)
c.shape : (3, 60, 70)
Here are some observations that may help explain the problem.
Start with a 3d array, with the expect strides:
In [158]: x=np.arange(24).reshape(2,3,4)
In [159]: x.shape,x.strides
Out[159]: ((2, 3, 4), (48, 16, 4))
Advanced indexing on the last axis:
In [160]: y=x[:,:,[0,1,2,3]]
In [161]: y.shape, y.strides
Out[161]: ((2, 3, 4), (12, 4, 24))
Notice that the strides are not in the normal C-contiguous order. For a 2d array we'd describe this a F-contiguous. It's an obscure indexing detail that usually doesn't matter.
Apparently when doing this indexing it first makes an array with the last, the indexed dimension, first:
In [162]: y.base.shape
Out[162]: (4, 2, 3)
In [163]: y.base.strides
Out[163]: (24, 12, 4)
y is this base with swapped axes, a view of its base.
The case with a slice in the middle is
In [164]: z=x[1,:,[0,1,2,3]]
In [165]: z.shape, z.strides
Out[165]: ((4, 3), (12, 4))
In [166]: z.base # its own base, not a view
Transposing z to the expected (3,4) shape would switch the strides to (4,12), F-contiguous.
With the two step indexing, we get an array with the expect shape, but the F strides. And its base looks a lot like z.
In [167]: w=x[1][:,[0,1,2,3]]
In [168]: w.shape, w.strides
Out[168]: ((3, 4), (4, 12))
In [169]: w.base.shape, w.base.strides
Out[169]: ((4, 3), (12, 4))
The docs justify the switch in axes by saying that there's an ambiguity when performing advanced indexing with a slice in the middle. It's perhaps clearest when using a (2,1) and (4,) indices:
In [171]: w=x[[[0],[1]],:,[0,1,2,3]]
In [172]: w.shape, w.strides
Out[172]: ((2, 4, 3), (48, 12, 4))
The middle, size 3 dimension, is "tacked on last". With x[1,:,[0,1,2,3]] that ambibuity argument isn't as good, but apparently it's using the same indexing method. When this was raised in github issues, the claim was that reworking the indexing to correct this was too difficult. Individual cases might be corrected, but a comprehensive change was too complicated.
This dimension switch seems to come up on SO a couple of times a year, an annoyance, but not a critical issue.

change the dtype of a numpy structured array in place without changing other dtypes

I have an input structured array with unknown number of columns and rows. Among those fields, some are timestamps. However, those timestamps are all np.datetime64[ns] data type. I would like to change them all into np.datetime64[m] data type without changing other columns.
I have tried
array["time"] = array["time"].astype("datetime64[m]")
but it will not change the original structured array. Any suggestions? Thanks
Define two dtypes:
In [220]: dt1 = [('x','int'),('time','datetime64[ns]')]
In [221]: dt2 = [('x','int'),('time','datetime64[m]')]
an array with the first:
In [222]: arr = np.array([(1, '2022-01-23T00:34:59.1233'),(2, '2022-01-23T03:00:01.234')],dt1)
In [223]: arr
Out[223]:
array([(1, '2022-01-23T00:34:59.123300000'),
(2, '2022-01-23T03:00:01.234000000')],
dtype=[('x', '<i8'), ('time', '<M8[ns]')])
Looks like astype properly converts the times:
In [224]: arr.astype(dt2)
Out[224]:
array([(1, '2022-01-23T00:34'), (2, '2022-01-23T03:00')],
dtype=[('x', '<i8'), ('time', '<M8[m]')])
Or as I commented:
In [225]: res = np.zeros(2, dt2)
In [226]: res
Out[226]:
array([(0, '1970-01-01T00:00'), (0, '1970-01-01T00:00')],
dtype=[('x', '<i8'), ('time', '<M8[m]')])
In [227]: for field in arr.dtype.names: res[field]= arr[field]
In [228]: res
Out[228]:
array([(1, '2022-01-23T00:34'), (2, '2022-01-23T03:00')],
dtype=[('x', '<i8'), ('time', '<M8[m]')])
The recfunctions usually do something along this line - define a new dtype, and copy fields as needed.

Numpy/Pandas: Merge two numpy arrays based on one array efficiently

I have two numpy arrays comprised of two-set tuples:
a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato", ...]
The first element in the tuple is the identifier and shared between both arrays, so I want to create a new numpy array which looks like this:
[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]
My current solution works, but it's way too inefficient for me. Looks like this:
aggregate_collection = []
for tuple_set in a:
for tuple_set2 in b:
if tuple_set[0] == tuple_set2[0] and other_condition:
temp_tup = (tuple_set[0], other tuple values)
aggregate_collection.append(temp_tup)
How can I do this efficiently?
I'd concatenate these into a data frame and just groupby+agg
(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
.groupby(0)
.agg(lambda s: [s.name, *s])[1])
where 0 and 1 are the default column names given by creating a dataframe via pd.DataFrame. Change it to your column names.
In [278]: a = [(1, "alpha"), (2, 3)]
...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]
Note that if I try to make an array from a I get something quite different.
In [281]: np.array(a)
Out[281]:
array([['1', 'alpha'],
['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)
defaultdict is a handy tool for collecting like-keyed values
In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
...: k,v = tup
...: dd[k].append(v)
...:
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})
which can be cast as a list of tuples with:
In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
I'm using a+b to join the lists, since it apparently doesn't matter where the tuples occur.
Out[288] is even a poor numpy fit, since the tuples differ in size, and items (other than the first) might be strings or numbers.

Extract array (column name, data) from Pandas DataFrame

This is my first question at Stack Overflow.
I have a DataFrame of Pandas like this.
a b c d
one 0 1 2 3
two 4 5 6 7
three 8 9 0 1
four 2 1 1 5
five 1 1 8 9
I want to extract the pairs of column name and data whose data is 1 and each index is separate at array.
[ [(b,1.0)], [(d,1.0)], [(b,1.0),(c,1.0)], [(a,1.0),(b,1.0)] ]
I want to use gensim of python library which requires corpus as this form.
Is there any smart way to do this or to apply gensim from pandas data?
Many gensim functions accept numpy arrays, so there may be a better way...
In [11]: is_one = np.where(df == 1)
In [12]: is_one
Out[12]: (array([0, 2, 3, 3, 4, 4]), array([1, 3, 1, 2, 0, 1]))
In [13]: df.index[is_one[0]], df.columns[is_one[1]]
Out[13]:
(Index([u'one', u'three', u'four', u'four', u'five', u'five'], dtype='object'),
Index([u'b', u'd', u'b', u'c', u'a', u'b'], dtype='object'))
To groupby each row, you could use iterrows:
from itertools import repeat
In [21]: [list(zip(df.columns[np.where(row == 1)], repeat(1.0)))
for label, row in df.iterrows()
if 1 in row.values] # if you don't want empty [] for rows without 1
Out[21]:
[[('b', 1.0)],
[('d', 1.0)],
[('b', 1.0), ('c', 1.0)],
[('a', 1.0), ('b', 1.0)]]
In python 2 the list is not required since zip returns a list.
Another way would be
In [1652]: [[(c, 1) for c in x[x].index] for _, x in df.eq(1).iterrows() if x.any()]
Out[1652]: [[('b', 1)], [('d', 1)], [('b', 1), ('c', 1)], [('a', 1), ('b', 1)]]

merging indexed array in Python

Suppose that I have two numpy arrays of the form
x = [[1,2]
[2,4]
[3,6]
[4,NaN]
[5,10]]
y = [[0,-5]
[1,0]
[2,5]
[5,20]
[6,25]]
is there an efficient way to merge them such that I have
xmy = [[0, NaN, -5 ]
[1, 2, 0 ]
[2, 4, 5 ]
[3, 6, NaN]
[4, NaN, NaN]
[5, 10, 20 ]
[6, NaN, 25 ]
I can implement a simple function using search to find the index but this is not elegant and potentially inefficient for a lot of arrays and large dimensions. Any pointer is appreciated.
See numpy.lib.recfunctions.join_by
It only works on structured arrays or recarrays, so there are a couple of kinks.
First you need to be at least somewhat familiar with structured arrays. See here if you're not.
import numpy as np
import numpy.lib.recfunctions
# Define the starting arrays as structured arrays with two fields ('key' and 'field')
dtype = [('key', np.int), ('field', np.float)]
x = np.array([(1, 2),
(2, 4),
(3, 6),
(4, np.NaN),
(5, 10)],
dtype=dtype)
y = np.array([(0, -5),
(1, 0),
(2, 5),
(5, 20),
(6, 25)],
dtype=dtype)
# You want an outer join, rather than the default inner join
# (all values are returned, not just ones with a common key)
join = np.lib.recfunctions.join_by('key', x, y, jointype='outer')
# Now we have a structured array with three fields: 'key', 'field1', and 'field2'
# (since 'field' was in both arrays, it renamed x['field'] to 'field1', and
# y['field'] to 'field2')
# This returns a masked array, if you want it filled with
# NaN's, do the following...
join.fill_value = np.NaN
join = join.filled()
# Just displaying it... Keep in mind that as a structured array,
# it has one dimension, where each row contains the 3 fields
for row in join:
print row
This outputs:
(0, nan, -5.0)
(1, 2.0, 0.0)
(2, 4.0, 5.0)
(3, 6.0, nan)
(4, nan, nan)
(5, 10.0, 20.0)
(6, nan, 25.0)
Hope that helps!
Edit1: Added example
Edit2: Really shouldn't join with floats... Changed 'key' field to an int.

Categories

Resources