Numpy/Pandas: Merge two numpy arrays based on one array efficiently

I have two numpy arrays comprised of two-set tuples:
a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato"), ...]
The first element of each tuple is an identifier shared between both arrays, so I want to create a new numpy array which looks like this:
[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]
My current solution works, but it's way too inefficient for me. Looks like this:
aggregate_collection = []
for tuple_set in a:
    for tuple_set2 in b:
        if tuple_set[0] == tuple_set2[0] and other_condition:  # other_condition is a placeholder
            temp_tup = (tuple_set[0], other_tuple_values)  # placeholder for the remaining values
            aggregate_collection.append(temp_tup)
How can I do this efficiently?

I'd concatenate these into a DataFrame and just use groupby + agg:
(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
   .groupby(0)
   .agg(lambda s: [s.name, *s])[1])
where 0 and 1 are the default column names assigned when creating a DataFrame via pd.DataFrame; change them to your own column names.
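For reference, a minimal runnable sketch of this approach with the sample data (a sketch only: agg with a list-returning lambda works in current pandas, though behavior has varied across releases):
import pandas as pd

a = [(1, "alpha"), (2, 3)]
b = [(1, "zylo"), (1, "xen"), (2, "potato")]

merged = (pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
          .groupby(0)
          .agg(lambda s: [s.name, *s])[1])  # s.name is the group key
print([tuple(v) for v in merged])
# expected: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]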

In [278]: a = [(1, "alpha"), (2, 3)]
...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]
Note that if I try to make an array from a I get something quite different.
In [281]: np.array(a)
Out[281]:
array([['1', 'alpha'],
       ['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)
defaultdict is a handy tool for collecting like-keyed values
In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
     ...:     k, v = tup
     ...:     dd[k].append(v)
     ...:
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})
which can be cast as a list of tuples with:
In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
I'm using a+b to join the lists, since it apparently doesn't matter which array a tuple comes from. The Out[288] result is also a poor fit for numpy, since the tuples differ in length and the items (other than the first) mix strings and numbers.
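If you still need a numpy container for the ragged result, about the best fit is a 1-D object array; a minimal sketch (assuming you only need numpy for interoperability, since object arrays give up vectorization):
import numpy as np

tuples = [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
arr = np.empty(len(tuples), dtype=object)  # slots that hold arbitrary Python objects
arr[:] = tuples  # slice-assign so numpy doesn't try to broadcast the tuples
print(arr)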

Related

List of tuples for each pandas dataframe slice

I need to do something very similar to this question: Pandas convert dataframe to array of tuples
The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.
Supposing this is my data set:
    t_id  A     B
0   AAAA  1   2.0
1   AAAA  3   4.0
2   AAAA  5   6.0
3   BBBB  7   8.0
4   BBBB  9  10.0
...
I want to produce as output:
[[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]
That is, one list for 'AAAA', another for 'BBBB' and so on.
I've tried with two nested for loops. It seems to work, but it is taking too long (actual data set has ~1M rows):
result = []
for t in df['t_id'].unique():
    tuple_list = []
    for x in df[df['t_id'] == t].iterrows():
        row = x[1][['A', 'B']]
        tuple_list.append(tuple(row))
    result.append(tuple_list)
Is there a faster way to do it?
You can group by column t_id, iterate through the groups, and convert each sub-DataFrame into a list of tuples:
[g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
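A hedged alternative sketch that builds the tuples with zip per group instead of the records round-trip (note zip yields numpy scalars such as np.int64 rather than plain ints; time both on your own data):
result = [list(zip(g['A'], g['B'])) for _, g in df.groupby('t_id')]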
I think this should work too:
import pandas as pd
import itertools
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})
tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
print(output)
Out:
[[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]

Select specific index, column pairs from pandas dataframe

I have a dataframe x:
x = pd.DataFrame(np.random.randn(3,3), index=[1,2,3], columns=['A', 'B', 'C'])
x
          A         B         C
1  0.256668 -0.338741  0.733561
2  0.200978  0.145738 -0.409657
3 -0.891879  0.039337  0.400449
and I would like to select a bunch of index/column pairs to populate a new Series. For example, I could select [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')], which would generate a list, array, or series with 4 elements:
[0.256668, -0.338741, 0.256668, 0.400449]
Any idea of how I should do that?
I think get_value() and lookup() are faster:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randn(3, 3), index=[1, 2, 3], columns=['A', 'B', 'C'])
locations = [(1, "A"), (1, "B"), (1, "A"), (3, "C")]
print(x.get_value(1, "A"))
row_labels, col_labels = zip(*locations)
print(x.lookup(row_labels, col_labels))
If your pairs are positions instead of index/column names,
row_position = [0,0,0,2]
col_position = [0,1,0,2]
x.values[row_position, col_position]
Or get the positions from np.searchsorted:
row_position = np.searchsorted(x.index, row_labels, sorter=np.argsort(x.index))
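Note that get_value() and lookup() were deprecated and later removed in modern pandas. A sketch of an equivalent, assuming unique index and column labels, translates labels to positions and indexes the underlying array once:
rows = x.index.get_indexer(row_labels)
cols = x.columns.get_indexer(col_labels)
values = x.to_numpy()[rows, cols]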
Using ix should be able to locate the elements in the data frame, like this:
import pandas as pd
# using your data sample
df = pd.read_clipboard()
df
Out[170]:
          A         B         C
1  0.256668 -0.338741  0.733561
2  0.200978  0.145738 -0.409657
3 -0.891879  0.039337  0.400449
# however you cannot store A, B, C... as they are undefined names
l = [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')]
# you can also use a for/loop, simply iterate the list and LOCATE the element
map(lambda x: df.ix[x[0], x[1]], l)
Out[172]: [0.25666800000000001, -0.33874099999999996, 0.25666800000000001, 0.400449]
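Since ix was removed in pandas 1.0 (and map returns a lazy iterator on Python 3), a hedged modern rewrite of the same idea:
l = [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')]
values = [df.loc[r, c] for r, c in l]  # .loc is the label-based successor to .ix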

Convert list of tuples to structured numpy array

I have a list of Num_tuples tuples that all have the same length Dim_tuple
xlist = [tuple_1, tuple_2, ..., tuple_Num_tuples]
For definiteness, let's say Num_tuples=3 and Dim_tuple=2
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
I want to convert xlist into a structured numpy array xarr using a user-provided list of column names user_names and a user-provided list of variable types user_types
user_names = [name_1, name_2, ..., name_Dim_tuple]
user_types = [type_1, type_2, ..., type_Dim_tuple]
So in the creation of the numpy array,
dtype = [(name_1,type_1), (name_2,type_2), ..., (name_Dim_tuple, type_Dim_tuple)]
In the case of my toy example desired end product would look something like:
xarr['name1']=np.array([1,2,3])
xarr['name2']=np.array([1.1,1.2,1.3])
How can I slice xlist to create xarr without any loops?
A list of tuples is the correct way of providing data to a structured array:
In [273]: xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
In [274]: dt=np.dtype('int,float')
In [275]: np.array(xlist,dtype=dt)
Out[275]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])
In [276]: xarr = np.array(xlist,dtype=dt)
In [277]: xarr['f0']
Out[277]: array([1, 2, 3])
In [278]: xarr['f1']
Out[278]: array([ 1.1, 1.2, 1.3])
or if the names are important:
In [280]: xarr.dtype.names=['name1','name2']
In [281]: xarr
Out[281]:
array([(1, 1.1), (2, 1.2), (3, 1.3)],
      dtype=[('name1', '<i4'), ('name2', '<f8')])
http://docs.scipy.org/doc/numpy/user/basics.rec.html#filling-structured-arrays
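To plug in the user-provided names and types directly, a minimal sketch (zip pairs each name with its type to build the dtype, matching the dtype list in the question):
import numpy as np

xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
user_names = ['name1', 'name2']
user_types = [int, float]
dt = np.dtype(list(zip(user_names, user_types)))  # [('name1', int), ('name2', float)]
xarr = np.array(xlist, dtype=dt)
print(xarr['name1'])  # [1 2 3]
print(xarr['name2'])  # [1.1 1.2 1.3]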
hpaulj's answer is interesting but horrifying :)
The modern Pythonic way to have named columns is to use pandas, a highly popular package built on top of numpy:
import pandas as pd
xlist = [(1, 1.1), (2, 1.2), (3, 1.3)]
# Cast name1 to int because pandas' default is float
df = pd.DataFrame(xlist, columns=['name1', 'name2']).astype({'name1':int})
print(df)
This gives you a DataFrame, df, which is the structure you want:
   name1  name2
0      1    1.1
1      2    1.2
2      3    1.3
You can do all kinds of wonderful things with this, like slicing and various operations.
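And if a structured numpy array is still needed downstream, a short sketch: to_records converts the DataFrame back while keeping the column names:
rec = df.to_records(index=False)
print(rec.dtype.names)  # ('name1', 'name2')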

How to turn pandas dataframe row into ordereddict fast

I'm looking for a fast way to get a row of a pandas DataFrame into an OrderedDict without using lists. Lists are fine, but with large data sets they take too long. I am using the fiona GIS reader, whose rows are OrderedDicts with the schema giving the data types, and I use pandas to join data. In many cases the rows will have different types, so I was thinking that turning them into a numpy array with type string might do the trick.
This is implemented in pandas 0.21.0+ in the to_dict function with the into parameter:
from collections import OrderedDict

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print(df)
   a  b
0  1  2
1  3  4
d = df.to_dict(into=OrderedDict, orient='index')
print(d)
OrderedDict([(0, OrderedDict([('a', 1), ('b', 2)])), (1, OrderedDict([('a', 3), ('b', 4)]))])
Unfortunately you can't just do an apply (since it fits it back to a DataFrame):
In [1]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
In [2]: df
Out[2]:
   a  b
0  1  2
1  3  4
In [3]: from collections import OrderedDict
In [4]: df.apply(OrderedDict)
Out[4]:
   a  b
0  1  2
1  3  4
But you can use a list comprehension with iterrows:
In [5]: [OrderedDict(row) for i, row in df.iterrows()]
Out[5]: [OrderedDict([('a', 1), ('b', 2)]), OrderedDict([('a', 3), ('b', 4)])]
If whatever you are passing the result to can accept a generator rather than a list, this will usually be more efficient:
In [6]: (OrderedDict(row) for i, row in df.iterrows())
Out[6]: <generator object <genexpr> at 0x10466da50>
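If iterrows turns out to be the bottleneck, a hedged sketch using the usually faster itertuples (assuming only the column values matter, not the index):
from collections import OrderedDict
rows = (OrderedDict(zip(df.columns, row)) for row in df.itertuples(index=False, name=None))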

Iterate over numpy array in a specific order based on values

I want to iterate over a numpy array, starting at the index of the highest value and working through to the lowest value.
import numpy as np  # imports numpy package

elevation_array = np.random.rand(5, 5)  # creates a random 5-by-5 array
print(elevation_array)  # prints the array out
ravel_array = np.ravel(elevation_array)
sorted_array_x = np.argsort(ravel_array)
sorted_array_y = np.argsort(sorted_array_x)
sorted_array = sorted_array_y.reshape(elevation_array.shape)
for index, rank in np.ndenumerate(sorted_array):
    print(index, rank)
I want it to print out:
index of the highest value
index of the next highest value
index of the next highest value etc
If you want numpy doing the heavy lifting, you can do something like this:
>>> a = np.random.rand(100, 100)
>>> sort_idx = np.argsort(a, axis=None)
>>> np.column_stack(np.unravel_index(sort_idx[::-1], a.shape))
array([[13, 62],
       [26, 77],
       [81,  4],
       ...,
       [83, 40],
       [17, 34],
       [54, 91]], dtype=int64)
You first get an index that sorts the whole array, and then convert that flat index into pairs of indices with np.unravel_index. The call to np.column_stack simply joins the two arrays of coordinates into a single one, and could be replaced by the Python zip(*np.unravel_index(sort_idx[::-1], a.shape)) to get a list of tuples instead of an array.
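For example, to visit the values themselves in descending order using those coordinates, a short sketch:
for i, j in zip(*np.unravel_index(sort_idx[::-1], a.shape)):
    value = a[i, j]  # visited from the highest value down to the lowest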
Try this:
>>> from operator import itemgetter
>>> a = np.array([[2, 7], [1, 4]])
>>> a
array([[2, 7],
       [1, 4]])
>>> sorted(np.ndenumerate(a), key=itemgetter(1), reverse=True)
[((0, 1), 7),
 ((1, 1), 4),
 ((0, 0), 2),
 ((1, 0), 1)]
You can iterate over this list if you wish. Essentially I am telling sorted to order the elements of np.ndenumerate(a) according to the key itemgetter(1). itemgetter(1) picks out the second element (index 1, i.e. the value) from each tuple ((0, 1), 7), ((1, 1), 4), ... generated by np.ndenumerate(a).
