I've got a numpy array. What is the fastest way to compute all the permutations of orderings?
What I mean is, given the first element in my array, I want a list of all the elements that sequentially follow it. Then given the second element, a list of all the elements that follow it.
x = np.array(["a", "b", "c", "d"])
For example, with my array: b, c, and d follow a; c and d follow b; and d follows c.
So a potential output looks like:
[
["a","b"],
["a","c"],
["a","d"],
["b","c"],
["b","d"],
["c","d"],
]
I will need to do this several million times so I am looking for an efficient solution.
I tried something like:
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
but it's actually slower than looping...
You can use itertools' chain.from_iterable and combinations together with np.fromiter for this. It involves no explicit Python loop, though it is still not a pure NumPy solution:
>>> from itertools import combinations, chain
>>> arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
>>> arr.reshape(arr.size // 2, 2)
array([['a', 'b'],
['a', 'c'],
['a', 'd'],
...,
['b', 'c'],
['b', 'd'],
['c', 'd']],
dtype='|S1')
Timing comparisons:
>>> x = np.array(["a", "b", "c", "d"]*100)
>>> %%timeit
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
...
10 loops, best of 3: 29.2 ms per loop
>>> %%timeit
arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
arr.reshape(arr.size // 2, 2)
...
100 loops, best of 3: 6.63 ms per loop
I've been browsing the source and it seems the tri functions have had some very substantial improvements relatively recently. The file is all Python so you can just copy it into your directory if that helps.
I seem to have completely different timings to Ashwini Chaudhary, even after taking this into account.
It is very important to know the size of the arrays you want to do this on; if it is small you should cache intermediates like triu_indices.
The fastest code for me was:
import numpy

def triangularize(x):
    xs, ys = numpy.triu_indices(len(x), 1)
    return numpy.array([x[xs], x[ys]]).T
unless the x array is small.
If x is small, caching was best:
triu_cache = {}

def triangularize_cached(x):
    if len(x) in triu_cache:
        xs, ys = triu_cache[len(x)]
    else:
        xs, ys = numpy.triu_indices(len(x), 1)
        triu_cache[len(x)] = xs, ys
    return numpy.array([x[xs], x[ys]]).T
I wouldn't do this for large x because of memory requirements.
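For reference, a quick sanity check of the non-cached version on the asker's 4-element array (my addition; it assumes the triangularize function above and a plain import numpy):
import numpy
x = numpy.array(["a", "b", "c", "d"])
print(triangularize(x))
# [['a' 'b']
#  ['a' 'c']
#  ['a' 'd']
#  ['b' 'c']
#  ['b' 'd']
#  ['c' 'd']]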
I have a numpy array and a dictionary similar to below:
arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}
For each key-value pair (k,v) in d, k should appear exactly v times in arr1 in its second column. Clearly that doesn't happen here.
So what I want to do is, from arr1, I want to create another array where every element in its second column appears exactly the number of times it's supposed to according to d. In other words, my desired outcome is:
np.array([['a1','x'],['a2','x'],['a5','z']])
I can get my desired outcome using list comprehension:
ans = [[x1,x2] for x1,x2 in arr1 if np.count_nonzero(arr1==x2)==d[x2]]
but I was wondering if it was possible to do it only using numpy.
After a bit of playing around with np.argsort(), I found a pure numpy solution. You just have to reorder the unique elements of arr1's second column (and their counts) so they line up with the order of the keys in an array version of d.
arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}
# create arrays from d
keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
# count the unique elements in arr1[:,1]
unqs, cts = np.unique(arr1[:,1], return_counts=True)
# only keep track of elements that appear in arr1
mask = np.isin(keys,unqs)
keys, vals = keys[mask], vals[mask]
# sort the unique values and corresponding counts according to keys
idx1 = np.argsort(np.argsort(keys))
idx2 = np.argsort(unqs)
unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]
# filter values by whether the counts match
correct = unqs[vals==cts]
# keep subarray where the counts match
ans = arr1[np.isin(arr1[:,1],correct)]
print(ans)
# [['a1' 'x']
# ['a2' 'x']
# ['a5' 'z']]
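The double argsort used for idx1 can look opaque; here is a tiny sketch (my addition, not part of the solution above) of what it computes — the rank of each key, i.e. the inverse permutation of a single argsort:
import numpy as np

keys = np.array(['x', 'z', 'y', 'w'])
order = np.argsort(keys)   # [3, 0, 2, 1] -- indices that sort keys alphabetically
ranks = np.argsort(order)  # [1, 3, 2, 0] -- rank of each key: 'x'->1, 'z'->3, 'y'->2, 'w'->0
# indexing an alphabetically sorted companion array with `ranks` re-aligns it with `keys`,
# which is exactly what unqs[idx2][idx1] and cts[idx2][idx1] do above.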
This does what you want:
import numpy as np
arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
# get the actual counts of values in arr1
counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
# determine what values to keep, as their count matches the desired count
keep = [x for x in d if x in counts and d[x] == counts[x]]
# filter down the array
result = arr1[list(map(lambda x: x[1] in keep, arr1))]
Quite possibly there's a more optimal way to do this in numpy, but I don't know how big the set you're applying to is, or how often you need to do this, to say whether looking for it is worth it.
Edit: Note that you need to scale things up to decide what is a good solution. Your original solution is great for toy examples; it outperforms both answers. But the numpy solution provided by @NewbieAF beats the rest handily if you scale up to what may be more realistic workloads:
from random import randint
from timeit import timeit

import numpy as np


def original(arr1, d):
    return [[x1, x2] for x1, x2 in arr1 if np.count_nonzero(arr1 == x2) == d[x2]]


def f1(arr1, d):
    # get the actual counts of values in arr1
    counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
    # determine what values to keep, as their count matches the desired count
    keep = [x for x in d if x in counts and d[x] == counts[x]]
    # filter down the array
    return arr1[list(map(lambda x: x[1] in keep, arr1))]


def f2(arr1, d):
    # create arrays from d
    keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
    # count the unique elements in arr1[:,1]
    unqs, cts = np.unique(arr1[:, 1], return_counts=True)
    # only keep track of elements that appear in arr1
    mask = np.isin(keys, unqs)
    keys, vals = keys[mask], vals[mask]
    # sort the unique values and corresponding counts according to keys
    idx1 = np.argsort(np.argsort(keys))
    idx2 = np.argsort(unqs)
    unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]
    # filter values by whether the counts match
    correct = unqs[vals == cts]
    return arr1[np.isin(arr1[:, 1], correct)]


def main():
    arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
    d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
    print(timeit(lambda: original(arr1, d), number=10000))
    print(timeit(lambda: f1(arr1, d), number=10000))
    print(timeit(lambda: f2(arr1, d), number=10000))

    counts = [randint(1, 3) for _ in range(10000)]
    arr1 = np.array([['x', f'{n}'] for n in range(10000) for _ in range(counts[n])])
    d = {f'{n}': randint(1, 3) for n in range(10000)}
    print(timeit(lambda: original(arr1, d), number=10))
    print(timeit(lambda: f1(arr1, d), number=10))
    print(timeit(lambda: f2(arr1, d), number=10))


main()
Result:
0.14045359999999998
0.2402685
0.5027185999999999
46.7569239
5.893172499999999
0.08729539999999503
The numpy solution is slow on a toy example, but orders of magnitude faster on a large input. Your solution seems pretty good, but when scaled up it loses out to the non-numpy solution, which avoids the extra numpy calls.
Consider the size of your problem. If the problem is small, you should pick your own solution, for readability. If the problem is medium-sized, you might pick mine for the bump in performance. If the problem is large (either in size or frequency used), you should opt for the all numpy solution, sacrificing readability for speed.
I have a problem with list-to-array conversion. I have a list from a csv file, like
a=[['1','a'],['2','b']]
Now I only want the first column, the numbers '1' and '2', and to convert them to a numpy array. How do I accomplish this? Using b = np.array(a) puts all items into the array as strings.
You can use numpy.fromiter with operator.itemgetter. Note that a standard NumPy array is not a good choice for mixed types (it falls back to dtype object), as all the data is then stored as pointers to Python objects.
a = [['1', 'a'], ['2', 'b']]
from operator import itemgetter
res = np.fromiter(map(itemgetter(0), a), dtype=int)
print(res)
# [1 2]
Some performance benchmarking:
a = [['1', 'a'], ['2', 'b']] * 10000
%timeit np.fromiter(map(itemgetter(0), a), dtype=int) # 4.31 ms per loop
%timeit np.array(a)[:, 0].astype(int) # 15.1 ms per loop
%timeit np.array([i[0] for i in a]).astype(int) # 8.3 ms per loop
If you need a structured array of mixed types:
x = np.array([(int(i[0]), i[1]) for i in a],
dtype=[('val', 'i4'), ('text', 'S10')])
print(x)
# [(1, b'a') (2, b'b')]
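Field access on the structured array then works by name (a small usage sketch; 'val' and 'text' are the field names declared in the dtype above):
print(x['val'])   # [1 2]
print(x['text'])  # [b'a' b'b']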
You'd first need to create a new list that only contains the first values of the lists in a. For example:
c = []
for row in a:
    c.append(row[0])

b = np.array(c)
More Pythonic would probably be a list comprehension:
c = [x[0] for x in a]
b = np.array(c)
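Note that this gives an array of strings ('1', '2'); if you want numbers, as asked in the question, you can pass a dtype when building the array (a small addition to the snippet above):
b = np.array(c, dtype=int)  # array([1, 2])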
import numpy as np
a = [['1', 'a'], ['2', 'b']]
print(np.array(a)[:, 0].astype(int))
Try this:
a = np.array([int(i[0]) for i in a])
I've got a huge numpy array whose elements are strings. I'd like to replace each string with its first character. For example, if
C[0] = 'A90CD'
I want to replace it with
C[0] = 'A'
In a nutshell, I was thinking of applying a regex in a loop, where I have a dictionary of regex expressions like
'^A.+$' => 'A'
'^B.+$' => 'B'
etc
How can I apply these regexes over the numpy array? Or is there a better method to achieve the same result?
There's no need for regex here. Just convert your array to a 1-character string dtype, using astype -
v = np.array(['abc', 'def', 'ghi'])
>>> v.astype('<U1')
array(['a', 'd', 'g'],
dtype='<U1')
Alternatively, you can change its view and stride. Here's a slightly optimised version for equal-sized strings -
>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
dtype='<U1')
And here's a more generalised version of the .view method, which also works for arrays of strings with differing lengths. Thanks to Paul Panzer for the suggestion -
>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
dtype='<U1')
Performance
y = np.array([x * 20 for x in v]).repeat(100000)
y.shape
(300000,)
len(y[0]) # they're all the same length - `abcabcabc...`
60
Now, the timings -
# `astype` conversion
%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop
# `view` for equal sized string arrays
%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop
# Paul Panzer's version for differing length strings
%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
The view method is faster by a huge margin.
However, use with caution, as the memory is shared.
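To make that caveat concrete, here is a small sketch (my addition, using the v array from above) showing that writing through the view also changes the original array:
w = v.view('<U1')[::len(v[0])]
w[0] = 'z'
print(v)  # ['zbc' 'def' 'ghi'] -- the first element of v was modified through w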
If you're interested in a more general solution that finds you the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re module, compiling a pattern and searching inside a list comprehension.
>>> import re
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
And, its performance on the same setup above -
%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop
I have to find which lists in a nested list contain a word, and return a boolean numpy array.
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
word = 'c'
result = [1, 0, 1, 1]
I'm using this list comprehension to do it and it works
np.array([word in x for x in nested_list])
But I'm working with a nested list with 700k lists inside, so it takes a lot of time. Also, I have to do it many times; the lists are static but the word can change.
One pass with the list comprehension takes 0.36 s. Is there a faster way to do it?
We could flatten out the elements in all sub-lists to give us a 1D array. Then, we simply look for any occurrence of 'c' within the limits of each sub-list in the flattened 1D array. Thus, with that philosophy, we could use two approaches, based on how we count the occurrence of any c.
Approach #1 : One approach with np.bincount -
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0
Since, as stated in the question, nested_list won't change across iterations, we can re-use everything and just loop for the final step.
Approach #2 : Another approach with np.add.reduceat reusing arr and lens from previous one -
grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0
When looping through a list of words, we can keep this approach vectorized for the final step by using np.add.reduceat along an axis and using broadcasting to give us a 2D array boolean, like so -
np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Sample run -
In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
In [345]: words
Out[345]: ['c', 'b']
In [346]: lens = np.array([len(i) for i in nested_list])
...: arr = np.concatenate(nested_list)
...: grp_idx = np.append(0,lens[:-1].cumsum())
...:
In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]:
array([[ True, False, True, True], # matches for 'c'
[ True, True, True, False]]) # matches for 'b'
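To make the reuse point concrete, here is a small wrapper (my addition, not part of the answer above) that precomputes arr and grp_idx once for the static nested_list and then answers one word per call:
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

# precompute once, since nested_list never changes
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
grp_idx = np.append(0, lens[:-1].cumsum())

def contains(word):
    # boolean mask: does each sub-list contain `word`?
    return np.add.reduceat(arr == word, grp_idx) != 0

print(contains('c'))   # [ True False  True  True]
print(contains('b'))   # [ True  True  True False]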
A generator expression would be preferable to a list comprehension when iterating once (in terms of performance). A solution using the numpy.fromiter function:
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)
print(arr)
The output:
[1 0 1 1]
https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
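If you want booleans rather than 0/1 integers (as the question's wording suggests), the same pattern should work with dtype=bool:
arr_bool = np.fromiter((words in l for l in nested_list), bool)
print(arr_bool)
# [ True False  True  True]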
How much time is it taking you to finish your loop? In my test case it only takes a few hundred milliseconds.
import random
# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [ [random.choice(a) for x in range(random.randint(1,30))]
for n in range(700000)]
%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop
Reducing each internal list to a set gives some time savings...
nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop
And once you have turned it into a list of sets, you can build a list of boolean tuples. No real time savings though.
%%timeit -n 10
words = list('abcde')
b = [tuple(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?
No, you cannot do that. For numpy, np.array(["abc", "def", "ghi"]) is a 1D array of strings, so you cannot use 2D slicing.
You could either define your array as a 2D array of characters, or simply use a list comprehension for slicing:
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')
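For completeness, here is a sketch of the first option — building a 2D array of single characters up front (my addition; it assumes equal-length strings, and on Python 3 the dtype shows as '<U1' rather than '|S1'):
In [5]: chars = np.asarray([list(s) for s in my_list])

In [6]: chars[1:, 1:]
Out[6]:
array([['e', 'f'],
       ['h', 'i']], dtype='<U1')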
Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
dtype='|S3')
A S1,S2 dtype views each element as 2 strings, with 1 and 2 char each:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
dtype=[('f0', 'S1'), ('f1', 'S2')])
select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
dtype='|S2')
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']],
dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
dtype='|S2')
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
Some timings with the 1000 x 64 list that wflynny tests:
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop   # my computer is slower
In [99]: timeit np.array(my_list_64).view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr = np.array(my_list_64)
     ...: arr.view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
     ...:
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
See my edit history for my earlier notes on np.char.
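If you want the view trick as a reusable helper, here is a minimal sketch (my addition, not part of the answer above; it assumes a 1D array of equal-length byte strings and drops the first element plus the first character of each remaining string):
import numpy as np

def drop_first_row_and_char(arr):
    # assumes dtype 'S<n>' with every string exactly n bytes long
    n = arr.dtype.itemsize                        # characters per string
    chars = arr.view('S1').reshape(len(arr), n)   # one character per cell
    return chars[1:, 1:].flatten().view('S%d' % (n - 1))

my_list = np.array([b'abc', b'def', b'ghi'])
print(drop_first_row_and_char(my_list))   # [b'ef' b'hi']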
As per Joe Kington here, python is very good at string manipulations and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would urge against it.
[s[1:] for s in my_list[1:]]
is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
for _ in range(randint(2, 64))])
for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.
Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it how you like, and then build it back together. Before this, that would have required a copy, as @hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
I'm working on another layer of abstraction, specifically for string arrays, called np.char.slice_ (currently work-in-progress in PR #20694, but the code is functional). If that gets accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])
Your slicing syntax is incorrect. You only need to do my_list[1:] to get what you need. If you want the elements repeated twice in a list, you can do something = list(my_list[1:]) * 2.