Suppose that I have two numpy arrays of the form
x = [[1,2]
[2,4]
[3,6]
[4,NaN]
[5,10]]
y = [[0,-5]
[1,0]
[2,5]
[5,20]
[6,25]]
is there an efficient way to merge them such that I have
xmy = [[0, NaN, -5 ]
[1, 2, 0 ]
[2, 4, 5 ]
[3, 6, NaN]
[4, NaN, NaN]
[5, 10, 20 ]
[6, NaN, 25 ]]
I can implement a simple function using search to find the index but this is not elegant and potentially inefficient for a lot of arrays and large dimensions. Any pointer is appreciated.
See numpy.lib.recfunctions.join_by
It only works on structured arrays or recarrays, so there are a couple of kinks.
First you need to be at least somewhat familiar with structured arrays. The NumPy documentation on structured arrays is a good starting point if you're not.
import numpy as np
import numpy.lib.recfunctions
# Define the starting arrays as structured arrays with two fields ('key' and 'field')
dtype = [('key', int), ('field', float)]
x = np.array([(1, 2),
(2, 4),
(3, 6),
(4, np.nan),
(5, 10)],
dtype=dtype)
y = np.array([(0, -5),
(1, 0),
(2, 5),
(5, 20),
(6, 25)],
dtype=dtype)
# You want an outer join, rather than the default inner join
# (all values are returned, not just ones with a common key)
join = np.lib.recfunctions.join_by('key', x, y, jointype='outer')
# Now we have a structured array with three fields: 'key', 'field1', and 'field2'
# (since 'field' was in both arrays, it renamed x['field'] to 'field1', and
# y['field'] to 'field2')
# This returns a masked array, if you want it filled with
# NaN's, do the following...
join.fill_value = np.nan
join = join.filled()
# Just displaying it... Keep in mind that as a structured array,
# it has one dimension, where each row contains the 3 fields
for row in join:
    print(row)
This outputs:
(0, nan, -5.0)
(1, 2.0, 0.0)
(2, 4.0, 5.0)
(3, 6.0, nan)
(4, nan, nan)
(5, 10.0, 20.0)
(6, nan, 25.0)
Hope that helps!
Edit1: Added example
Edit2: Really shouldn't join with floats... Changed 'key' field to an int.
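If the keys are unique within each array, the same outer join can also be sketched in plain NumPy. This is not the join_by approach above, just a minimal alternative that assumes ordinary 2-D float arrays with the key in column 0; the names keys and merged are made up for illustration.

```python
import numpy as np

# Plain 2-D float arrays: column 0 is the key, column 1 is the value.
x = np.array([[1, 2], [2, 4], [3, 6], [4, np.nan], [5, 10]], dtype=float)
y = np.array([[0, -5], [1, 0], [2, 5], [5, 20], [6, 25]], dtype=float)

# Sorted union of all keys seen in either array.
keys = np.union1d(x[:, 0], y[:, 0])

# Start with NaN everywhere, then fill in the key column.
merged = np.full((keys.size, 3), np.nan)
merged[:, 0] = keys

# Every key of x and y is present in keys, so searchsorted returns
# the exact row each value belongs in (assumes no duplicate keys).
merged[np.searchsorted(keys, x[:, 0]), 1] = x[:, 1]
merged[np.searchsorted(keys, y[:, 0]), 2] = y[:, 1]

print(merged)
```

Each additional array to merge would need one more column and one more searchsorted line.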
Related
I have a 4D numpy array. While slicing for multiple indices in a single dimension, my axes get interchanged. Am I missing something trivial here?
import numpy as np
from smartprint import smartprint as prints
a = np.random.rand(50, 60, 70, 80)
b = a[:, :, :, [2,3,4]]
prints (b.shape) # this works as expected
c = a[1, :, :, [2,3,4]]
prints (c.shape) # here, I see the axes are interchanged
Output:
b.shape : (50, 60, 70, 3)
c.shape : (3, 60, 70)
Here are some observations that may help explain the problem.
Start with a 3d array, with the expected strides:
In [158]: x=np.arange(24).reshape(2,3,4)
In [159]: x.shape,x.strides
Out[159]: ((2, 3, 4), (48, 16, 4))
Advanced indexing on the last axis:
In [160]: y=x[:,:,[0,1,2,3]]
In [161]: y.shape, y.strides
Out[161]: ((2, 3, 4), (12, 4, 24))
Notice that the strides are not in the normal C-contiguous order. For a 2d array we'd describe this as F-contiguous. It's an obscure indexing detail that usually doesn't matter.
Apparently when doing this indexing it first makes an array with the last, the indexed dimension, first:
In [162]: y.base.shape
Out[162]: (4, 2, 3)
In [163]: y.base.strides
Out[163]: (24, 12, 4)
y is this base with swapped axes, a view of its base.
The case with a slice in the middle is
In [164]: z=x[1,:,[0,1,2,3]]
In [165]: z.shape, z.strides
Out[165]: ((4, 3), (12, 4))
In [166]: z.base # its own base, not a view
Transposing z to the expected (3,4) shape would switch the strides to (4,12), F-contiguous.
With the two-step indexing, we get an array with the expected shape, but the F strides. And its base looks a lot like z.
In [167]: w=x[1][:,[0,1,2,3]]
In [168]: w.shape, w.strides
Out[168]: ((3, 4), (4, 12))
In [169]: w.base.shape, w.base.strides
Out[169]: ((4, 3), (12, 4))
The docs justify the switch in axes by saying that there's an ambiguity when performing advanced indexing with a slice in the middle. It's perhaps clearest when using (2,1) and (4,) shaped indices:
In [171]: w=x[[[0],[1]],:,[0,1,2,3]]
In [172]: w.shape, w.strides
Out[172]: ((2, 4, 3), (48, 12, 4))
The middle, size 3 dimension, is "tacked on last". With x[1,:,[0,1,2,3]] that ambiguity argument isn't as good, but apparently it's using the same indexing method. When this was raised in GitHub issues, the claim was that reworking the indexing to correct this was too difficult. Individual cases might be corrected, but a comprehensive change was too complicated.
This dimension switch seems to come up on SO a couple of times a year, an annoyance, but not a critical issue.
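For the 4D case in the question, here is a small sketch of three ways to get the indexed axis back in last position. The array is shrunk to (5, 6, 7, 8) to keep it cheap, and the variable names are only illustrative.

```python
import numpy as np

a = np.random.rand(5, 6, 7, 8)

# Scalar and list are both advanced indices, separated by slices,
# so the broadcast (size 3) dimension is placed first.
c = a[1, :, :, [2, 3, 4]]

# Option 1: index in two steps, so only one advanced index is used at a time.
d = a[1][:, :, [2, 3, 4]]

# Option 2: replace the scalar with a length-1 slice, then drop that axis.
e = a[1:2, :, :, [2, 3, 4]][0]

# Option 3: keep the original indexing and move the first axis to the end.
f = np.moveaxis(c, 0, -1)

print(c.shape)                    # (3, 6, 7)
print(d.shape, e.shape, f.shape)  # (6, 7, 3) each
```

All three options hold the same values; they differ only in memory layout (strides), as described above.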
I have two numpy arrays made up of two-element tuples:
a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato"), ...]
The first element in the tuple is the identifier and shared between both arrays, so I want to create a new numpy array which looks like this:
[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]
My current solution works, but it's way too inefficient for me. Looks like this:
aggregate_collection = []
for tuple_set in a:
    for tuple_set2 in b:
        if tuple_set[0] == tuple_set2[0] and other_condition:
            temp_tup = (tuple_set[0], other tuple values)
            aggregate_collection.append(temp_tup)
How can I do this efficiently?
I'd concatenate these into a data frame and just groupby+agg
(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
.groupby(0)
.agg(lambda s: [s.name, *s])[1])
where 0 and 1 are the default column names given by creating a dataframe via pd.DataFrame. Change it to your column names.
In [278]: a = [(1, "alpha"), (2, 3)]
...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]
Note that if I try to make an array from a I get something quite different.
In [281]: np.array(a)
Out[281]:
array([['1', 'alpha'],
['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)
defaultdict is a handy tool for collecting like-keyed values
In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
...:     k,v = tup
...:     dd[k].append(v)
...:
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})
which can be cast as a list of tuples with:
In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
I'm using a+b to join the lists, since it apparently doesn't matter where the tuples occur.
Out[288] is an even poorer fit for numpy, since the tuples differ in size, and the items (other than the first) might be strings or numbers.
I have a dataframe and a vocabulary file that I am trying to fit to the dataframe's string columns. I am using scikit-learn's CountVectorizer, which produces a sparse matrix. I need to take the output of the sparse matrix and merge it with the dataframe, matching each sparse row to the corresponding dataframe row.
Code:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["You can catch more flies with honey than you can with vinegar.",
"You can lead a horse to water, but you can't make him drink.",
"search not cleaning up on hard delete",
"updating firmware version failed",
"increase not service topology s memory",
"Nothing Matching Here"
]
vocabulary = ["catch more","lead a horse", "increase service", "updating" , "search", "vinegar", "drink", "failed", "not"]
vectorizer = CountVectorizer(analyzer=u'word', vocabulary=vocabulary,lowercase=True,ngram_range=(0,19))
SpraseMatrix = vectorizer.fit_transform(docs)
Below is the sparse matrix output:
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
Now, what I am looking to do is build a string for each row of the sparse matrix and add it to the corresponding document.
Example: for doc 3 ("updating firmware version failed"), I am looking to get "3:1 7:1" from the sparse matrix (i.e. the column indices of "updating" and "failed" and their frequencies) and add this to row 3 of the data frame.
I tried the code below, but it produces flattened output, whereas I want to get the submatrix for each row index, loop through it, and build a concatenated string for each row such as "3:1 7:1", then add this string as a new column to the data frame for each corresponding row.
cx = SpraseMatrix.tocoo()
for i,j,v in zip(cx.row, cx.col, cx.data):
    print((i,j,v))
(0, 0, 1)
(0, 5, 1)
(1, 6, 1)
(2, 4, 1)
(2, 8, 1)
(3, 3, 1)
(3, 7, 1)
(4, 8, 1)
I'm not entirely following what you want, but maybe the lil format will be easier to work with:
In [1122]: M = sparse.coo_matrix(([1,1,1,1,1,1,1,1],([0,0,1,2,2,3,3,4],[0,5,6,4,
...: 8,3,7,8])))
In [1123]: M
Out[1123]:
<5x9 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>
In [1124]: print(M)
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
In [1125]: Ml = M.tolil()
In [1126]: Ml.data
Out[1126]: array([list([1, 1]), list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)
In [1127]: Ml.rows
Out[1127]: array([list([0, 5]), list([6]), list([4, 8]), list([3, 7]), list([8])], dtype=object)
Its attributes are organized by row, which appears to be how you want it.
In [1130]: Ml.rows[3]
Out[1130]: [3, 7]
In [1135]: for i,(rd) in enumerate(zip(Ml.rows, Ml.data)):
...:     print(' '.join(['%s:%s'%ij for ij in zip(*rd)]))
...:
0:1 5:1
6:1
4:1 8:1
3:1 7:1
8:1
You can also iterate through the rows of the csr format, but that requires a bit more math with the .indptr attribute.
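A rough sketch of that csr variant, using the same 5x9 sample matrix as above. The name row_strings is made up here, and attaching the result to a dataframe column is left out, since it assumes the matrix has exactly one row per dataframe row.

```python
from scipy import sparse

# Same sample matrix as in the lil example above.
M = sparse.coo_matrix(([1, 1, 1, 1, 1, 1, 1, 1],
                       ([0, 0, 1, 2, 2, 3, 3, 4],
                        [0, 5, 6, 4, 8, 3, 7, 8])))
Mc = M.tocsr()
Mc.sort_indices()  # make sure column indices are ascending within each row

row_strings = []
for r in range(Mc.shape[0]):
    # indptr[r]:indptr[r+1] is the slice of indices/data belonging to row r
    start, stop = Mc.indptr[r], Mc.indptr[r + 1]
    cols = Mc.indices[start:stop]
    vals = Mc.data[start:stop]
    row_strings.append(' '.join(f'{c}:{v}' for c, v in zip(cols, vals)))

for s in row_strings:
    print(s)
```

A row with no stored elements gives an empty string, so the list stays aligned with the row order.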
I have a list of Num_tuples tuples that all have the same length Dim_tuple
xlist = [tuple_1, tuple_2, ..., tuple_Num_tuples]
For definiteness, let's say Num_tuples=3 and Dim_tuple=2
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
I want to convert xlist into a structured numpy array xarr using a user-provided list of column names user_names and a user-provided list of variable types user_types
user_names = [name_1, name_2, ..., name_Dim_tuple]
user_types = [type_1, type_2, ..., type_Dim_tuple]
So in the creation of the numpy array,
dtype = [(name_1,type_1), (name_2,type_2), ..., (name_Dim_tuple, type_Dim_tuple)]
In the case of my toy example, the desired end product would look something like:
xarr['name1']=np.array([1,2,3])
xarr['name2']=np.array([1.1,1.2,1.3])
How can I slice xlist to create xarr without any loops?
A list of tuples is the correct way of providing data to a structured array:
In [273]: xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
In [274]: dt=np.dtype('int,float')
In [275]: np.array(xlist,dtype=dt)
Out[275]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
In [276]: xarr = np.array(xlist,dtype=dt)
In [277]: xarr['f0']
Out[277]: array([1, 2, 3])
In [278]: xarr['f1']
Out[278]: array([ 1.1, 1.2, 1.3])
or if the names are important:
In [280]: xarr.dtype.names=['name1','name2']
In [281]: xarr
Out[281]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
dtype=[('name1', '<i4'), ('name2', '<f8')])
http://docs.scipy.org/doc/numpy/user/basics.rec.html#filling-structured-arrays
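To use the user-provided names and types directly, the dtype can be built by zipping the two lists. A minimal sketch, assuming user_names and user_types have the same length as each tuple (the concrete values below are just the toy example from the question):

```python
import numpy as np

xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
user_names = ['name1', 'name2']   # assumed example values
user_types = [int, float]         # assumed example values

# A dtype spec is a list of (name, type) pairs, one per tuple element.
dtype = list(zip(user_names, user_types))
xarr = np.array(xlist, dtype=dtype)

print(xarr['name1'])
print(xarr['name2'])
```

The only loop involved is the implicit one inside zip over the column names, not over the data.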
hpaulj's answer is interesting but horrifying :)
The modern Pythonic way to have named columns is to use pandas, a highly popular package built on top of numpy:
import pandas as pd
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
# pandas normally infers an integer dtype for name1 here; the explicit cast just makes the intended type clear
df = pd.DataFrame(xlist, columns=['name1', 'name2']).astype({'name1':int})
print(df)
This gives you a DataFrame, df, which is the structure you want:
name1 name2
0 1 1.1
1 2 1.2
2 3 1.3
You can do all kinds of wonderful things with this, like slicing and various operations.
I want to iterate over a numpy array, starting at the index of the highest value and working through to the lowest value.
import numpy as np #imports numpy package
elevation_array = np.random.rand(5,5) #creates a random array 5 by 5
print(elevation_array) # prints the array out
ravel_array = np.ravel(elevation_array)
sorted_array_x = np.argsort(ravel_array)
sorted_array_y = np.argsort(sorted_array_x)
sorted_array = sorted_array_y.reshape(elevation_array.shape)
for index, rank in np.ndenumerate(sorted_array):
    print(index, rank)
I want it to print out:
index of the highest value
index of the next highest value
index of the next highest value etc
If you want numpy doing the heavy lifting, you can do something like this:
>>> a = np.random.rand(100, 100)
>>> sort_idx = np.argsort(a, axis=None)
>>> np.column_stack(np.unravel_index(sort_idx[::-1], a.shape))
array([[13, 62],
[26, 77],
[81, 4],
...,
[83, 40],
[17, 34],
[54, 91]], dtype=int64)
You first get an index that sorts the whole array, and then convert that flat index into pairs of indices with np.unravel_index. The call to np.column_stack simply joins the two arrays of coordinates into a single one, and could be replaced by the Python list(zip(*np.unravel_index(sort_idx[::-1], a.shape))) to get a list of tuples instead of an array.
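Put together as a loop, that might look like the sketch below. The 2x2 array is only sample data, and the names order and visited are made up for illustration.

```python
import numpy as np

elevation = np.array([[2.0, 7.0],
                      [1.0, 4.0]])

# Flat indices that sort the whole array, reversed so the highest comes first.
order = np.argsort(elevation, axis=None)[::-1]

# Convert the flat indices back into (row, column) coordinates.
rows, cols = np.unravel_index(order, elevation.shape)

visited = []
for r, c in zip(rows, cols):
    visited.append((int(r), int(c)))
    print((int(r), int(c)), elevation[r, c])
```

For this sample the loop visits (0, 1), (1, 1), (0, 0), (1, 0), i.e. the values 7, 4, 2, 1 in that order.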
Try this:
>>> from operator import itemgetter
>>> a = np.array([[2, 7], [1, 4]])
>>> a
array([[2, 7],
       [1, 4]])
>>> sorted(np.ndenumerate(a), key=itemgetter(1), reverse=True)
[((0, 1), 7),
((1, 1), 4),
((0, 0), 2),
((1, 0), 1)]
You can iterate over this list if you so wish. Essentially I am telling the function sorted to order the elements of np.ndenumerate(a) according to the key itemgetter(1). The function itemgetter(1) gets the second (index 1) element from the tuples ((0, 1), 7), ((1, 1), 4), ... (i.e. the values) generated by np.ndenumerate(a).