I have a huge numpy array whose elements are strings. I'd like to replace each string with its first character. For example, if
C[0] = 'A90CD'
I want to replace it with
C[0] = 'A'
In a nutshell, I was thinking of applying a regex in a loop, where I have a dictionary of regex patterns like
'^A.+$' => 'A'
'^B.+$' => 'B'
etc
How can I apply these regexes over the numpy array? Or is there a better method to achieve the same result?
There's no need for regex here. Just convert your array to a 1-character string dtype, using astype -
v = np.array(['abc', 'def', 'ghi'])
>>> v.astype('<U1')
array(['a', 'd', 'g'],
dtype='<U1')
Alternatively, you can change its view and stride. Here's a slightly optimised version for equal-sized strings -
>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
dtype='<U1')
And here's a more generalised version of the .view method, which also works for arrays of strings with differing lengths. Thanks to Paul Panzer for the suggestion -
>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
dtype='<U1')
Performance
y = np.array([x * 20 for x in v]).repeat(100000)
y.shape
(300000,)
len(y[0]) # they're all the same length - `abcabcabc...`
60
Now, the timings -
# `astype` conversion
%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop
# `view` for equal sized string arrays
%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop
# Paul Panzer's version for differing length strings
%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
The view method is faster by a huge margin.
However, use with caution, as the memory is shared.
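For example, here's a minimal sketch (reusing the small v array from above, with numpy imported as np) showing that a write to the original array shows through the view:
>>> v = np.array(['abc', 'def', 'ghi'])
>>> first = v.view('<U1')[::len(v[0])]   # strided view of each string's first character
>>> v[0] = 'xyz'                         # writing into the original array...
>>> print(first)                         # ...is visible through the view
['x' 'd' 'g']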
If you're interested in a more general solution that finds the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re module, compiling a pattern and searching inside a list comprehension.
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
And its performance on the same setup as above -
%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop
I am using numpy. I have a matrix with 1 column and N rows, and I want to get an array from it with N elements.
For example, if I have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1, 2, 3, 4]).
To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?
Thanks!
If you'd like something a bit more readable, you can do this:
A = np.squeeze(np.asarray(M))
Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that's a bit less easy to read.
result = M.A1
https://numpy.org/doc/stable/reference/generated/numpy.matrix.A1.html
matrix.A1
1-d base array
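A minimal usage sketch, assuming M is the matrix from the question:
>>> import numpy as np
>>> M = np.matrix([[1], [2], [3], [4]])
>>> M.A1                  # flattened 1-D base-class ndarray
array([1, 2, 3, 4])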
A, = np.array(M.T)
It depends what you mean by elegance, I suppose, but that's what I would do.
You can try the following variant:
result=np.array(M).flatten()
If you care about speed:
np.array(M).ravel()
But if you care about memory:
np.asarray(M).ravel()
Or you could try to avoid some temps with
A = M.view(np.ndarray)
A.shape = -1
First, Mv = numpy.asarray(M.T), which gives you a 1x4 but still 2D array.
Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].
This will convert the matrix into an array:
A = np.ravel(M).T
The ravel() and flatten() functions from numpy are two techniques I would try here. I would like to add to the posts made by Joe, Siraj, bubble and Kevad.
Ravel:
A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)
Flatten:
M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)
numpy.ravel() is faster, since it is a library-level function that does not make a copy of the array when it can avoid one. However, any change in array A will carry over to the original array M if you are using numpy.ravel().
flatten() is slower than ravel() because it always returns a copy, but if you use flatten() to create A, changes in A will not be carried over to the original array M.
numpy.squeeze() and M.reshape(-1) are slower than flatten() and ravel().
%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop
%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop
%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop
%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop
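To make the view-versus-copy point above concrete, here is a minimal sketch (using a plain ndarray M, as in the flatten example):
import numpy as np

M = np.array([[1], [2], [3], [4]])
A_view = M.ravel()      # usually a view when the data is contiguous
A_copy = M.flatten()    # always a copy

A_view[0] = 99          # writing through the ravel result...
print(M.ravel())        # [99  2  3  4]  ...changes the original M

A_copy[1] = 99          # writing into the flatten result...
print(M.ravel())        # [99  2  3  4]  ...leaves M untouched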
Came in a little late, hope this helps someone,
np.array(M.flat)
I have to find which lists in a nested list contain a given word, and return a boolean numpy array.
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
word = 'c'
result = [1, 0, 1, 1]
I'm using this list comprehension to do it and it works
np.array([word in x for x in nested_list])
But I'm working with a nested list that has 700k lists inside, so it takes a lot of time. Also, I have to do this many times: the lists are static, but the word can change.
One pass with the list comprehension takes 0.36 s. Is there a faster way to do it?
We could flatten out the elements in all sub-lists to give us a 1D array. Then, we simply look for any occurrence of 'c' within the limits of each sub-list in the flattened 1D array. Thus, with that philosophy, we could use two approaches, based on how we count the occurrence of any c.
Approach #1 : One approach with np.bincount -
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0
Since, as stated in the question, nested_list won't change across iterations, we can re-use everything and just loop over the words for the final step, as sketched below.
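A minimal sketch of that re-use pattern (the list of words here is just an assumed example):
import numpy as np

nested_list = [['a','b','c'], ['a','b'], ['b','c'], ['c']]

lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size), lens)      # computed once, since nested_list is static

for word in ['c', 'b']:                          # assumed example of the changing words
    out = np.bincount(ids, arr == word) != 0     # one boolean per sub-list
    print(word, out)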
Approach #2 : Another approach with np.add.reduceat reusing arr and lens from previous one -
grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0
When looping through a list of words, we can keep the final step vectorized by using np.add.reduceat along an axis and broadcasting to get a 2D boolean array, like so -
np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Sample run -
In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
In [345]: words
Out[345]: ['c', 'b']
In [346]: lens = np.array([len(i) for i in nested_list])
...: arr = np.concatenate(nested_list)
...: grp_idx = np.append(0,lens[:-1].cumsum())
...:
In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]:
array([[ True, False,  True,  True],    # matches for 'c'
       [ True,  True,  True, False]])   # matches for 'b'
A generator expression would be preferable when iterating once (in terms of performance). A solution using the numpy.fromiter function:
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)
print(arr)
The output:
[1 0 1 1]
https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
How much time is it taking you to finish your loop? In my test case it only takes a few hundred milliseconds.
import random
# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [ [random.choice(a) for x in range(random.randint(1,30))]
for n in range(700000)]
%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop
Reducing each internal list to a set gives some time savings...
nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop
And once you have turned it into a list of sets, you can build a list of boolean tuples. No real time savings though.
%%timeit -n 10
words = list('abcde')
b = [tuple(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?
No, you cannot do that directly. To numpy, np.array(["abc", "def", "ghi"]) is a 1D array of strings, so you cannot use 2D slicing.
You could either define your array as a 2D array of characters, or simply use a list comprehension for the slicing:
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')
Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
dtype='|S3')
An 'S1,S2' dtype views each element as 2 strings, with 1 and 2 chars respectively:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
dtype=[('f0', 'S1'), ('f1', 'S2')])
select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
dtype='|S2')
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
       ['d', 'e', 'f'],
       ['g', 'h', 'i']],
      dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
dtype='|S2')
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
Some timings with the 1000 x 64 list that wflynny tests:
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop    # my computer is slower
In [99]: timeit np.array(my_list_64).view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr = np.array(my_list_64)
     ...: arr.view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
See my edit history for my earlier notes on np.char.
As per Joe Kington's answer here, Python is very good at string manipulation, and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would advise against it.
[s[1:] for s in my_list[1:]]
is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
for _ in range(randint(2, 64))])
for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.
Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it how you like, and then build it back together. Before this, a copy was required, as hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
I'm working on another layer of abstraction, specifically for string arrays, called np.char.slice_ (currently work-in-progress in PR #20694, but the code is functional). If that gets accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])
Your slicing syntax is incorrect. You only need to do my_list[1:] to get what you need. If you want to copy the elements twice onto a list, you can do something = my_list[1:].extend(my_list[1:])
I'm wondering why this tuple construction:
x = tuple((t for t in range(100000)))
# 0.014001131057739258 seconds
took longer than this list comprehension:
y = [z for z in range(100000)]
# 0.005000114440917969 seconds
I learned that tuple operations are faster than list operations since tuples are immutable.
Edit: after I changed the code to:
x = tuple(t for t in range(100000))
y = list(z for z in range(100000))
>>>
0.009999990463256836
0.0
>>>
These are the results: the tuple is still the slower one.
Tuple operations aren't necessarily faster. Being immutable at most opens the door to more optimisations, but that doesn't mean Python does them or that they apply in every case.
The difference here is very marginal, and - without profiling to confirm - it seems likely that it relates to the generator version having an extra name lookup and function call. As mentioned in the comments, if you rewrite the list comprehension as a call to list wrapped around a generator expression, the difference will likely shrink.
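As a rough sketch of that fairer comparison (absolute timings will vary by machine):
import timeit

# both sides now pay for a generator expression plus a constructor call
print(timeit.timeit('tuple(t for t in range(100000))', number=100))
print(timeit.timeit('list(t for t in range(100000))', number=100))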
Using comparable methods of testing, the tuple is slightly faster:
In [12]: timeit tuple(t for t in range(100000))
100 loops, best of 3: 7.41 ms per loop
In [13]: timeit list(t for t in range(100000))
100 loops, best of 3: 7.53 ms per loop
Calling list does actually create a list:
In [19]: x = list(t for t in range(10))
In [20]: x
Out[20]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
We can also see that calling list on the generator does not allocate as much space as using a list comprehension:
In [28]: x = list(t for t in range(10))
In [29]: sys.getsizeof(x)
Out[29]: 168
In [30]: x = [t for t in range(10)]
In [31]: sys.getsizeof(x)
Out[31]: 200
So both operations are very similar.
A better comparison would be creating lists and tuples as subelements:
In [41]: timeit tuple((t,) for t in range(1000000))
10 loops, best of 3: 151 ms per loop
In [42]: timeit list([t] for t in range(1000000))
1 loops, best of 3: 247 ms per loop
Now we see a much larger difference.
I've got a numpy array. What is the fastest way to compute all the permutations of orderings?
What I mean is, given the first element in my array, I want a list of all the elements that sequentially follow it. Then given the second element, a list of all the elements that follow it.
So given my list: b, c, & d follow a. c & d follow b, and d follows c.
x = np.array(["a", "b", "c", "d"])
So a potential output looks like:
[
["a","b"],
["a","c"],
["a","d"],
["b","c"],
["b","d"],
["c","d"],
]
I will need to do this several million times so I am looking for an efficient solution.
I tried something like:
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
but it's actually slower than looping...
You can use itertools functions like chain.from_iterable and combinations together with np.fromiter for this. This involves no explicit Python loop, but it's still not a pure NumPy solution:
>>> from itertools import combinations, chain
>>> arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
>>> arr.reshape(arr.size // 2, 2)
array([['a', 'b'],
       ['a', 'c'],
       ['a', 'd'],
       ...,
       ['b', 'c'],
       ['b', 'd'],
       ['c', 'd']],
      dtype='|S1')
Timing comparisons:
>>> x = np.array(["a", "b", "c", "d"]*100)
>>> %%timeit
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
...
10 loops, best of 3: 29.2 ms per loop
>>> %%timeit
arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
arr.reshape(arr.size // 2, 2)
...
100 loops, best of 3: 6.63 ms per loop
I've been browsing the source and it seems the tri functions have had some very substantial improvements relatively recently. The file is all Python so you can just copy it into your directory if that helps.
I seem to have completely different timings to Ashwini Chaudhary, even after taking this into account.
It is very important to know the size of the arrays you want to do this on; if it is small you should cache intermediates like triu_indices.
The fastest code for me was:
def triangalize_1(x):
    xs, ys = numpy.triu_indices(len(x), 1)
    return numpy.array([x[xs], x[ys]]).T
unless the x array is small.
If x is small, caching was best:
triu_cache = {}

def triangalize_1(x):
    if len(x) in triu_cache:
        xs, ys = triu_cache[len(x)]
    else:
        xs, ys = numpy.triu_indices(len(x), 1)
        triu_cache[len(x)] = xs, ys
    return numpy.array([x[xs], x[ys]]).T
I wouldn't do this for large x because of memory requirements.
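For reference, a quick check (a minimal sketch using the question's small x) that triangalize_1 produces exactly the pairs asked for:
import numpy

def triangalize_1(x):
    xs, ys = numpy.triu_indices(len(x), 1)
    return numpy.array([x[xs], x[ys]]).T

x = numpy.array(["a", "b", "c", "d"])
print(triangalize_1(x))
# [['a' 'b']
#  ['a' 'c']
#  ['a' 'd']
#  ['b' 'c']
#  ['b' 'd']
#  ['c' 'd']]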