Multi-dimensional slicing a list of strings with numpy - python

Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?

No, you cannot do that. For numpy np.array(["abc", "def", "ghi"]) is a 1D array of strings, therefore you cannot use 2D slicing.
You could either define your array as a 2D array or characters, or simply use list comprehension for slicing,
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')

Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
dtype='|S3')
A S1,S2 dtype views each element as 2 strings, with 1 and 2 char each:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
dtype=[('f0', 'S1'), ('f1', 'S2')])
select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
dtype='|S2')
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']],
dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
dtype='|S2')
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
Some timings with the 1000 x 64 list that wflynny tests
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop # mine's slower computer
In [99]: timeit np.array(my_list_64).view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr =np.array(my_list_64)
.....: arr.view('S1').reshape(64,-1)[1:,1:].flatten().view('S63') .....:
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
See my edit history for my earlier notes on np.char.

As per Joe Kington here, python is very good at string manipulations and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would urge against it.
[s[1:] for s in my_list[1:]]
is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
for _ in range(randint(2, 64))])
for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.

Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it how you like, and then build it back together. Before this would require a copy, as #hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
I'm working on another layer of abstraction, specifically for string arrays called np.slice_ (currently work-in-progress in PR #20694, but the code is functional). If that should get accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])

Your slicing is incorrectly syntaxed. You only need to do my_list[1:] to get what you need. If you want to copy the elements twice onto a list, You can do something = mylist[1:].extend(mylist[1:])

Related

Can't append 2 strings the way I want [duplicate]

I'm a complete rookie to Python, but it seems like a given string is able to be (effectively) arbitrary length. i.e. you can take a string str and keeping adding to it: str += "some stuff...". Is there a way to make an array of such strings?
When I try this, each element only stores a single character
strArr = numpy.empty(10, dtype='string')
for i in range(0,10)
strArr[i] = "test"
On the other hand, I know I can initialize an array of certain length strings, i.e.
strArr = numpy.empty(10, dtype='s256')
which can store 10 strings of up to 256 characters.
You can do so by creating an array of dtype=object. If you try to assign a long string to a normal numpy array, it truncates the string:
>>> a = numpy.array(['apples', 'foobar', 'cowboy'])
>>> a[2] = 'bananas'
>>> a
array(['apples', 'foobar', 'banana'],
dtype='|S6')
But when you use dtype=object, you get an array of python object references. So you can have all the behaviors of python strings:
>>> a = numpy.array(['apples', 'foobar', 'cowboy'], dtype=object)
>>> a
array([apples, foobar, cowboy], dtype=object)
>>> a[2] = 'bananas'
>>> a
array([apples, foobar, bananas], dtype=object)
Indeed, because it's an array of objects, you can assign any kind of python object to the array:
>>> a[2] = {1:2, 3:4}
>>> a
array([apples, foobar, {1: 2, 3: 4}], dtype=object)
However, this undoes a lot of the benefits of using numpy, which is so fast because it works on large contiguous blocks of raw memory. Working with python objects adds a lot of overhead. A simple example:
>>> a = numpy.array(['abba' for _ in range(10000)])
>>> b = numpy.array(['abba' for _ in range(10000)], dtype=object)
>>> %timeit a.copy()
100000 loops, best of 3: 2.51 us per loop
>>> %timeit b.copy()
10000 loops, best of 3: 48.4 us per loop
You could use the object data type:
>>> import numpy
>>> s = numpy.array(['a', 'b', 'dude'], dtype='object')
>>> s[0] += 'bcdef'
>>> s
array([abcdef, b, dude], dtype=object)

Numpy.dot nests vector when multiplying [duplicate]

I am using numpy. I have a matrix with 1 column and N rows and I want to get an array from with N elements.
For example, if i have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1,2,3,4]).
To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?
Thanks!
If you'd like something a bit more readable, you can do this:
A = np.squeeze(np.asarray(M))
Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that's a bit less easy to read.
result = M.A1
https://numpy.org/doc/stable/reference/generated/numpy.matrix.A1.html
matrix.A1
1-d base array
A, = np.array(M.T)
depends what you mean by elegance i suppose but thats what i would do
You can try the following variant:
result=np.array(M).flatten()
np.array(M).ravel()
If you care for speed; But if you care for memory:
np.asarray(M).ravel()
Or you could try to avoid some temps with
A = M.view(np.ndarray)
A.shape = -1
First, Mv = numpy.asarray(M.T), which gives you a 4x1 but 2D array.
Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].
This will convert the matrix into array
A = np.ravel(M).T
ravel() and flatten() functions from numpy are two techniques that I would try here. I will like to add to the posts made by Joe, Siraj, bubble and Kevad.
Ravel:
A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)
Flatten:
M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)
numpy.ravel() is faster, since it is a library level function which does not make any copy of the array. However, any change in array A will carry itself over to the original array M if you are using numpy.ravel().
numpy.flatten() is slower than numpy.ravel(). But if you are using numpy.flatten() to create A, then changes in A will not get carried over to the original array M.
numpy.squeeze() and M.reshape(-1) are slower than numpy.flatten() and numpy.ravel().
%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop
%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop
%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop
%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop
Came in a little late, hope this helps someone,
np.array(M.flat)

Extract the first letter from each string in a numpy array

I got a huge numpy array where elements are strings. I like to replace the strings with the first alphabet of the string. For example if
C[0] = 'A90CD'
I want to replace it with
C[0] = 'A'
IN nutshell, I was thinking of applying regex in a loop where I have a dictionary of regex expression like
'^A.+$' => 'A'
'^B.+$' => 'B'
etc
How can I apply this regex over the numpy arrays ? Or is there any better method to achieve the same ?
There's no need for regex here. Just convert your array to a 1 byte string, using astype -
v = np.array(['abc', 'def', 'ghi'])
>>> v.astype('<U1')
array(['a', 'd', 'g'],
dtype='<U1')
Alternatively, you change its view and stride. Here's a slightly optimised version for equal sized strings. -
>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
dtype='<U1')
And here's the more generalised version of .view method, but this works for arrays of strings with differing length. Thanks to Paul Panzer for the suggestion -
>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
dtype='<U1')
Performance
y = np.array([x * 20 for x in v]).repeat(100000)
y.shape
(300000,)
len(y[0]) # they're all the same length - `abcabcabc...`
60
Now, the timings -
# `astype` conversion
%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop
# `view` for equal sized string arrays
%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop
# Paul Panzer's version for differing length strings
%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
The view method is faster by a huge margin.
However, use with caution, as the memory is shared.
If you're interested in a more general solution that finds you the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re module, compiling a pattern and searching inside a list comprehension.
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
And, its performance on the same setup above -
%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop

Preallocating ndarrays

How can I preallocate arrays of arrays so I can do appending a bit more efficiently. In Matlab there is a function called cell(required_length) which preallocates 'cells' which can store arrays.
I have an array that currently looks like:
a=np.array([[1,2],[1],[2,3,4]])
b=np.array([[20,2]])
However I wish to append 1000s of more array which are like the 'b' shown but varying in size.
This isn't just a quesiton of preallocating an array, such as np.empty((100,), dtype=int). It's as much a question about how to collect a large number of lists into one structure, whether it be a list or numpy array. The comparison with MATLAB cells is enough, in my opinion, to warrant further discussion.
I think you should be using Python lists. They can contain lists or other objects (incuding arrays) of varying sizes. You can easily append more items (or use extend to add multiple objects). Python has had them for ever; MATLAB added cells to approximate that flexibility.
np.arrays with dtype=object are similar - arrays of pointers to objects such as lists. For the most part they are just lists with an array wrapper. You can initial an array to some large size, and insert/set items.
A = np.empty((10,),dtype=object)
produces an array with 10 elements, each None.
A[0] = [1,2,3]
A[1] = [2,3]
...
You can also concatenate elements to an existing array, but the result is new one. There is a np.append function, but it is just a cover for concatenate; it should not be confused with the list append.
If it must be an array, you can easily construct it from the list at the end. That's what your np.array([[1,2],[1],[2,3,4]]) does.
How to add to numpy array entries of different size in a for loop (similar to Matlab's cell arrays)?
On the issue of speed, let's try simple time tests
def witharray(n):
result=np.empty((n,),dtype=object)
for i in range(n):
result[i]=list(range(i))
return result
def withlist(n):
result=[]
for i in range(n):
result.append(list(range(i)))
return result
which produce
In [111]: withlist(4)
Out[111]: [[], [0], [0, 1], [0, 1, 2]]
In [112]: witharray(4)
Out[112]: array([[], [0], [0, 1], [0, 1, 2]], dtype=object)
In [113]: np.array(withlist(4))
Out[113]: array([[], [0], [0, 1], [0, 1, 2]], dtype=object)
timetests
In [108]: timeit withlist(400)
1000 loops, best of 3: 1.87 ms per loop
In [109]: timeit witharray(400)
100 loops, best of 3: 2.13 ms per loop
In [110]: timeit np.array(withlist(400))
100 loops, best of 3: 8.95 ms per loop
Simply constructing a list of lists is fastest. But if the result must be an object type array, then assigning values to an empty array is faster.

Numpy matrix to array

I am using numpy. I have a matrix with 1 column and N rows and I want to get an array from with N elements.
For example, if i have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1,2,3,4]).
To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?
Thanks!
If you'd like something a bit more readable, you can do this:
A = np.squeeze(np.asarray(M))
Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that's a bit less easy to read.
result = M.A1
https://numpy.org/doc/stable/reference/generated/numpy.matrix.A1.html
matrix.A1
1-d base array
A, = np.array(M.T)
depends what you mean by elegance i suppose but thats what i would do
You can try the following variant:
result=np.array(M).flatten()
np.array(M).ravel()
If you care for speed; But if you care for memory:
np.asarray(M).ravel()
Or you could try to avoid some temps with
A = M.view(np.ndarray)
A.shape = -1
First, Mv = numpy.asarray(M.T), which gives you a 4x1 but 2D array.
Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].
This will convert the matrix into array
A = np.ravel(M).T
ravel() and flatten() functions from numpy are two techniques that I would try here. I will like to add to the posts made by Joe, Siraj, bubble and Kevad.
Ravel:
A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)
Flatten:
M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)
numpy.ravel() is faster, since it is a library level function which does not make any copy of the array. However, any change in array A will carry itself over to the original array M if you are using numpy.ravel().
numpy.flatten() is slower than numpy.ravel(). But if you are using numpy.flatten() to create A, then changes in A will not get carried over to the original array M.
numpy.squeeze() and M.reshape(-1) are slower than numpy.flatten() and numpy.ravel().
%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop
%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop
%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop
%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop
Came in a little late, hope this helps someone,
np.array(M.flat)

Categories

Resources