How can this function be vectorized? - python

I have a NumPy array with the following properties:
shape: (9986080, 2)
dtype: np.float32
I have a method that loops over the rows of the array, performs an operation, and writes the result to a new array:
def foo(arr):
    n_rows = arr.shape[0]
    new_arr = np.empty(n_rows, dtype=np.uint64)
    for i in range(n_rows):
        x, y = arr[i]
        if x < 0:
            e = '1'
        else:
            e = '2'
        if y > 0:
            n = '3'
        else:
            n = '4'
        new_arr[i] = int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))
    return new_arr

I agree with Iguananaut's comment that this data structure seems a bit odd. My biggest problem with it is that it is really tricky to try and vectorize the putting together of integers in a string and then re-converting that to an integer. Still, this will certainly help speed up the function:
def foo(arr):
    x_values = arr[:, 0]
    y_values = arr[:, 1]
    ones = np.ones(arr.shape[0], dtype=np.uint64)
    # 1 where x is negative, 2 otherwise
    e = np.char.array(np.where(x_values < 0, ones, ones * 2))
    # 3 where y is positive, 4 otherwise (same test as the original loop)
    n = np.char.array(np.where(y_values > 0, ones * 3, ones * 4))
    x_values = np.char.array(np.absolute(x_values))
    y_values = np.char.array(np.absolute(y_values))
    x_values = np.char.replace(x_values, '.', '')
    y_values = np.char.replace(y_values, '.', '')
    new_arr = np.char.add(np.char.add(x_values, e), np.char.add(y_values, n))
    return new_arr.astype(np.uint64)
Here, the x and y values of the input array are first split up. Then we use a vectorized computation to determine where e and n should be 1 or 2, 3 or 4. The last line uses a standard list comprehension to do the string merging bit, which is still undesirably slow for super large arrays but faster than a regular for loop. Also vectorizing the previous computations should speed the function up hugely.
Edit:
I was mistaken before. Numpy does have a nice way of handling string concatenation using the np.char.add() method. This requires converting x_values and y_values to Numpy character arrays using np.char.array(). Also for some reason, the np.char.add() method only takes two arrays as inputs, so it is necessary to first concatenate x_values and e and y_values and n and then concatenate these results. Still, this vectorizes the computations and should be pretty fast. The code is still a bit clunky because of the rather odd operation you are after, but I think this will help you speed up the function greatly.
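To make the chained np.char.add() calls concrete, here is a minimal sketch on hand-made digit strings. The sample values and the names x_digits, y_digits and codes are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Hypothetical sample data: digit strings and their one-character suffixes.
x_digits = np.array(["15", "203"])
e = np.array(["1", "2"])
y_digits = np.array(["47", "9"])
n = np.array(["3", "4"])

# np.char.add joins exactly two arrays element-wise, so the calls are chained.
left = np.char.add(x_digits, e)    # x digits followed by e
right = np.char.add(y_digits, n)   # y digits followed by n
codes = np.char.add(left, right).astype(np.uint64)
```

Each element of codes is the integer read from the concatenated string, which is the same construction the loop performs one row at a time.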

You can use np.apply_along_axis. Given a function that takes a row (or column) as its argument, it applies that function along the chosen axis and collects the results. Note that it still calls the function once per row in Python, so it tidies the code rather than removing the loop.
In your case, you can rewrite the function as below:
def foo(row):
    x, y = row
    if x < 0:
        e = '1'
    else:
        e = '2'
    if y > 0:
        n = '3'
    else:
        n = '4'
    return int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))
# Where you want to use it (arr is the (N, 2) input array):
new_arr = np.apply_along_axis(foo, 1, arr)

What is the fastest way to check if a value exists in a very large list?
7 in a
Clearest and fastest way to do it.
You can also consider using a set, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)
As stated by others, in can be very slow for large lists. Here are some comparisons of the performance of in, set and bisect. Note that the time (in seconds) is plotted on a log scale.
Code for testing:
import random
import bisect
import matplotlib.pyplot as plt
import math
import time
def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time
def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time
def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(b):  # guard against running past the end of b
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time
def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []
    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]
        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))
    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()
profile()
You could put your items into a set. Set lookups are very efficient.
Try:
s = set(a)
if 7 in s:
    pass  # do stuff
Edit: In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.
The original question was:
What is the fastest way to know if a value exists in a list (a list
with millions of values in it) and what its index is?
Thus there are two things to find:
is an item in the list, and
what is the index (if in the list).
To that end, I modified @xslittlegrass's code to compute indexes in all cases, and added an additional method.
Results
Methods are:
1. in: basically if x in b: return b.index(x)
2. try: try/except on b.index(x) (skips having to check if x in b)
3. set: basically if x in set(b): return b.index(x)
4. bisect: sort b together with its indexes, then binary search for x in the sorted b. (Note: this modifies @xslittlegrass's version, which returns the index in the sorted b rather than in the original b.)
5. reverse: build a reverse lookup dictionary d for b; then d[x] provides the index of x.
Results show that method 5 (reverse) is the fastest.
Interestingly, the try and the set methods are equivalent in time.
Test Code
import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools
def wrapper(func, *args, **kwargs):
    "Produce a zero-argument function that timeit can call."
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c
def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1
    return c
def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c
def method_bisect(a,b,c):
    "Finds indexes using bisection."
    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])
    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(bsorted):  # guard against running past the end of bsorted
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1] # index in the b array
    return c
def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c
def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]
    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]
        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))
    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))
    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()
profile()
def check_availability(element, collection: iter):
    return element in collection
Usage
check_availability('a', [1,2,3,4,'a','b','c'])
I believe this is the fastest way to know if a chosen value is in an array.
a = [4,2,3,1,5,6]
index = dict((y,x) for x,y in enumerate(a))
try:
    a_index = index[7]
except KeyError:
    print("Not found")
else:
    print("found")
This will only be a good idea if a doesn't change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.
Be aware that the in operator tests not only equality (==) but also identity (is). The in logic for lists is roughly equivalent to the following (it is actually written in C rather than Python, at least in CPython):
def list_contains(s, target):
    for element in s:
        if element is target:
            # fast check for identity implies equality
            return True
        if element == target:
            # slower check for actual equality
            return True
    return False
In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised. For example, numpy.NAN (an alias of numpy.nan; the alias was removed in NumPy 2.0, so use numpy.nan there) has the unusual property of not being equal to itself:
>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True
To distinguish between these unusual cases you could use any() like:
>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True
Note the in logic for lists with any() would be:
any(element is target or element == target for element in lst)
However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want of course (either with a list or with a set).
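The same identity shortcut can be seen without NumPy. This is a small sketch using the standard library's float("nan"); the variable names are illustrative.

```python
nan = float("nan")

# The list holds the very same object, so the identity check succeeds.
same_object = nan in [nan]

# A second NaN is a different object, and NaN never compares equal to NaN,
# so both the identity check and the equality check fail.
other_object = float("nan") in [nan]
```

Here same_object is True and other_object is False, so whether a NaN is "found" depends on object identity rather than on its value.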
If you only want to check the existence of one element in a list,
7 in list_data
is the fastest solution. Note though that
7 in set_data
is a near-free operation, independently of the size of the set! Creating a set from a large list is 300 to 400 times slower than in, so if you need to check for many elements, creating a set first is faster.
Plot created with perfplot:
import perfplot
import numpy as np
def setup(n):
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)
def list_in(data):
    return 7 in data[0]
def create_set_from_list(data):
    return set(data[0])
def set_in(data):
    return 7 in data[1]
b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()
It sounds like your application might gain advantage from the use of a Bloom Filter data structure.
In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often then the "found" result, you might see a speed up by adding a Bloom Filter.
For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for "python bloom filter library" will provide at least a couple useful implementations.
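As a rough illustration of the idea, here is a minimal Bloom filter sketch built only on the standard library. The class, its default sizes, and the find_index helper are hypothetical and untuned; a real application would more likely use an existing library.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several positions by salting one hash with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

def find_index(values, bloom, target):
    """Return the index of target in values, or -1, skipping the scan when possible."""
    if not bloom.might_contain(target):
        return -1
    try:
        return values.index(target)
    except ValueError:
        return -1  # the filter reported a false positive

values = [4, 2, 3, 1, 5, 6]
bloom = BloomFilter()
for v in values:
    bloom.add(v)
```

Calling find_index(values, bloom, 3) falls through to the slower list scan, while a value the filter rules out returns -1 without touching the list.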
This is not code, but the algorithm for very fast searching.
If your list and the value you are looking for are all numbers, this is pretty straightforward. For strings, see the note at the bottom.
- Let "n" be the length of your list.
- Optional step, if you need the index of the element: add a second column to the list holding the current index of each element (0 to n-1); see below.
- Sort your list or a copy of it (.sort()).
- Loop through:
  - Compare your number to the n/2th element of the list.
  - If larger, loop again between indexes n/2 and n.
  - If smaller, loop again between indexes 0 and n/2.
  - If the same: you found it.
- Keep narrowing the range until you have found it or only have 2 numbers left (below and above the one you are looking for).
This will find any element in about 20 steps for a list of 1,000,000 (log2(n), to be precise).
If you also need the original position of your number, look for it in the second (index) column.
If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.
Of course, this needs the up-front investment of sorting the list, but if you keep reusing the same list for checking, it may be worth it.
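The steps above can be sketched as follows, with the "second column" stored as (value, original_index) tuples. The function name and sample data are illustrative assumptions.

```python
def find_index_sorted(pairs, target):
    """Binary search over (value, original_index) pairs sorted by value."""
    lo, hi = 0, len(pairs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value, original_index = pairs[mid]
        if value == target:
            return original_index  # position in the unsorted list
        if value < target:
            lo = mid + 1           # keep searching the upper half
        else:
            hi = mid - 1           # keep searching the lower half
    return -1                      # not present

data = [4, 2, 3, 1, 5, 6]
# Sort once, keeping each value's original position alongside it.
pairs = sorted((value, index) for index, value in enumerate(data))
```

The sort is paid once; every later call to find_index_sorted(pairs, x) costs only a logarithmic number of comparisons.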
Edge case for spatial data
There are probably faster algorithms for handling spatial data (e.g. refactoring to use a k-d tree), but the special case of checking if a vector is in an array is useful:
If you have spatial data (i.e. cartesian coordinates)
If you have integer masks (i.e. array filtering)
In this case, I was interested in knowing if an (undirected) edge defined by two points was in a collection of (undirected) edges, such that
(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs
where pair constitutes two vectors of arbitrary length (i.e. shape (2,N)).
If the distance between these vectors is meaningful, then the test can be expressed by a floating point inequality like
test_result = Norm(v1 - v2) < Tol
and "Value exists in List" is simply any(test_result).
Example code and dummy test set generators for integer pairs and R3 vector pairs are below.
# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')
def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3 (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
def float_tol_pair_in_pairs(pair:np.ndarray, pairs:np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    NB this is expected to be called in iteration so no sanitization is performed.
    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)
    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum( abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    m2 = np.sum( abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    return m1 | m2
def get_unique_pairs(pairs:np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs
    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)
    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs
    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)

Two number Sum program in python O(N^2)

I am used to writing code in C++, but now I am trying to learn Python. I have heard that Python is very popular, so I thought I would give it a shot.
Currently I am preparing for company interview questions and can solve most of them in C++. Alongside that, I am trying to write the same solutions in Python. For things I am not familiar with, I do a Google search or watch tutorials.
While rewriting my previously solved easy interview questions in Python, I ran into a problem.
Problem: Given an array of integers, return indices of the two numbers such that they add up to a specific target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
def twoNum(*arr, t):
    cur = 0
    x = 0
    y = 0
    for i in range (len(arr) - 1):
        for j in range (len(arr) - 1):
            if(i == j):
                break
            cur = arr[i] + arr[j]
            if(t == cur):
                x = arr[i]
                y = arr[j]
                break
        if(t == cur):
            break
    print(f"{x} + {y} = {x+y} ")
arr = [3, 5, -4, 8, 11, 1, -1, 6]
target = 10
twoNum(arr, t=target)
So here is the problem: I defined x and y in the function, then assigned x = arr[i] and y = arr[j], and I am printing those values.
The output is: 0 + 0 = 10 (where target is 10)
I guess this is because I set x = 0 and y = 0 initially in the function and the values of x and y are not being updated. In the Outline section of VS Code I also saw that x and y are declared twice: once at the start of the function and once in the for loop.
Can anyone explain to me what is going on here?
For reference, here is an image of the code I wrote in C++
Change this:
def twoNum(*arr, t):
to this:
def twoNum(arr, t):
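In a function definition, * indicates a variable number of positional arguments, which Python collects into a tuple. It is not for pointers as in C++. With def twoNum(*arr, t), the call twoNum(arr, t=target) makes arr a one-element tuple containing your list, so len(arr) - 1 is 0, the loops never run, and x and y keep their initial values of 0.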
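A short sketch of the difference, using two hypothetical functions that only report what they received:

```python
def with_star(*arr, t):
    # All positional arguments are packed into the tuple `arr`.
    return arr, len(arr)

def without_star(arr, t):
    # `arr` is exactly the object that was passed in.
    return arr, len(arr)

numbers = [3, 5, -4, 8, 11, 1, -1, 6]
star_arr, star_len = with_star(numbers, t=10)
plain_arr, plain_len = without_star(numbers, t=10)
```

With the star, arr is a tuple of length 1 whose only element is the whole list; without it, arr is the list itself with length 8.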
Basically, what you are trying to do is write C code in Python.
I would instead focus first on how to write Python code in a 'pythonic' way. But for your question, solving it your way using brute force in Python:
In [173]: def two_num(arr, t):
     ...:     for idx, i in enumerate(arr):
     ...:         for j in arr[idx + 1:]:
     ...:             if i + j == t:
     ...:                 print(f"{i} + {j} = {t}")
     ...:                 return
Here's a way to implement a brute force approach using a list comprehension:
arr = [1,3,5,7,9]
target = 6
i,j = next((i,j) for i,n in enumerate(arr[:-1]) for j,m in enumerate(arr[i+1:],i+1) if n+m==target)
output:
print(f"arr[{i}] + arr[{j}] = {arr[i]} + {arr[j]} = {target}")
# arr[0] + arr[2] = 1 + 5 = 6
Perhaps even more pythonic would be to use itertools (combinations yields every index pair i < j exactly once):
from itertools import combinations
i,j = next((i,j) for (i,n),(j,m) in combinations(enumerate(arr), 2) if n+m==target)
When you get to implementing an O(n) solution, you should look into dictionaries:
d = { target-n:j for j,n in enumerate(arr) }
i,j = next( (i,d[m]) for i,m in enumerate(arr) if m in d and d[m] != i )
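A single-pass variant of the dictionary idea is sketched below as a function. The name two_sum and the choice to return None when no pair exists are assumptions for illustration.

```python
def two_sum(arr, target):
    """Return indices (i, j), i < j, with arr[i] + arr[j] == target, else None."""
    seen = {}  # value -> index of values visited so far
    for j, n in enumerate(arr):
        i = seen.get(target - n)
        if i is not None:      # `is not None` so that index 0 counts as a hit
            return i, j
        seen[n] = j
    return None

result = two_sum([3, 5, -4, 8, 11, 1, -1, 6], 10)
```

Because each element is looked up before it is stored, an element is never paired with itself, and the whole search touches every element at most once.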

SymPy: Expression for Summation of Symbols in a List

I'm writing a program that evaluates the power series sum_{m=0}^{oo} a[m]*x^m, where a[m] is defined recursively: a[m] = f(a[m-1]). I am generating symbols as follows:
a = list(sympy.symbols(' '.join([('a%d' % i) for i in range(10)])))
for i in range(1, LIMIT):
    a[i] = f_recur(a[i-1], i-1)
This lets me refer to the symbols a0,a1,...,a9 using a[0],a[1],...,a[9], and a[m] is a function of a[m-1] given by f_recur.
Now, I hope to code up the summation as follows:
m, x, y = sympy.symbols('m x y')
y = sympy.Sum(a[m]*x**m, (m, 0, 10))
But m is not an integer, so a[m] throws an exception.
In this situation, where symbols are stored in a list, how would you code the summation? Thanks for any help!
SymPy's Sum is designed as a sum with a symbolic index. You want a sum with a concrete index running through 0, ... 9. This could be Python's sum
y = sum([a[m]*x**m for m in range(10)])
or, which is preferable from a performance point of view,
y = sympy.Add(*[a[m]*x**m for m in range(10)])
In either case, m is not a symbol but an integer.
I have a work-around that does not use sympy.Sum:
x = sympy.symbols('x')
y = a[0]*x**0
for i in range(1, LIMIT):
    y += a[i]*x**i
This does the job, but sympy.Sum is not used.
Use IndexedBase instead of Symbol:
>>> from sympy import IndexedBase, Sum, symbols
>>> x, m = symbols('x m')
>>> a = IndexedBase('a')
>>> Sum(x**m*a[m],(m,1,3))
Sum(a[m]*x**m, (m, 1, 3))
>>> _.doit()
a[1]*x + a[2]*x**2 + a[3]*x**3
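For completeness, here is a self-contained sketch of the IndexedBase approach that also plugs in concrete coefficients after expansion. The coefficient rule a[k] = k + 1 is a made-up stand-in for the asker's f_recur, which is not shown in the question.

```python
import sympy

x = sympy.symbols("x")
m = sympy.symbols("m", integer=True)
a = sympy.IndexedBase("a")

series = sympy.Sum(a[m] * x**m, (m, 0, 3))  # symbolic, unevaluated sum
expanded = series.doit()                     # explicit terms in a[0]..a[3]

# Hypothetical coefficients a[k] = k + 1, substituted after expansion.
concrete = expanded.subs({a[k]: k + 1 for k in range(4)})
```

Keeping the sum symbolic until doit() is called lets the same expression be reused with different coefficient rules.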

Efficiently programming array elements to add up to a sum in python

I'm looking to implement in python a simple algorithm which takes as input an array and a sum, and finds a number X where if all elements in the array > X are converted to X, all the elements in the array should add up to the sum.
How do I do this efficiently?
Here is my code:
result = []
for _ in range(int(raw_input())):
    input_array = map(int,raw_input().split())
    sum_target = raw_input()
    for e in input_array:
        test_array = input_array
        test_array[test_array > e] = e  # supposed to replace all elements > e with e, but what's wrong here?
        if sum(test_array) == sum_target:
            result.append(e)
print result
Using the NumPy library (import numpy), you could replace the line
input_array = map(int,raw_input().split())
with
input_array = numpy.array(raw_input().split()).astype(int)
Then
test_array[test_array > e] = e
just works, and you can also use test_array.sum().
That is, if you want to alter the array in place. Note that test_array = input_array only creates a second name for the same array, so to leave input_array untouched you could replace
test_array = input_array
with
test_array = numpy.array(input_array)
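On the efficiency part of the question, one possible approach (an assumption, not something the answer above proposes) is to sort once and solve for the cap directly instead of testing every element. The sketch below uses plain Python and allows a non-integer cap; find_cap is a hypothetical name.

```python
def find_cap(values, target):
    """Return X with sum(min(v, X) for v in values) == target, or None."""
    ordered = sorted(values)
    n = len(ordered)
    if sum(ordered) == target:
        return ordered[-1]  # nothing needs capping
    kept = 0  # running sum of the elements left unchanged
    for i, v in enumerate(ordered):
        # Assume ordered[:i] stay as they are and the other n - i are capped.
        cap = (target - kept) / (n - i)
        lower = ordered[i - 1] if i > 0 else float("-inf")
        if lower <= cap <= v:
            return cap
        kept += v
    return None  # target exceeds the uncapped sum

cap = find_cap([1, 2, 3, 10], 9)
```

After the O(n log n) sort, the loop is a single pass, compared with recomputing the full sum for each candidate element.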

Vectorized code for a function to generate vector values

Suppose we have the function defined below, and we would like to evaluate it for n from 1 to L. I have struggled to vectorize it; the code is rather slow because of the outer for loop needed to call the function.
Details: L and K are large integers (e.g. 1000), and H_n is a float.
def multifrac_Brownian_motion(n, L, K, list_hurst, ind_hurst):
    t_ks = np.asarray(sorted(-np.array(range(1, K + 1))*(1./L)))
    t_ns = np.linspace(0, 1, num=L+1)
    t_n = t_ns[n]
    chi_k = np.random.randn(K)
    chi_lminus1 = np.random.randn(L)
    H_n = get_hurst_value(t_n, list_hurst, ind_hurst)
    part1 = 1./(np.random.gamma(0.5 + H_n))
    sums1 = np.dot((t_n - t_ks)**(H_n - 0.5) - ((-t_ks)**(H_n - 0.5)), chi_k)
    sums2 = np.dot((t_n - t_ns[:n])**(H_n - 0.5), chi_lminus1[:n])
    return part1*(1./np.sqrt(L))*(sums1 + sums2)
onelist = []
for n in range(1, L + 1):
    onelist.append(multifrac_Brownian_motion(n, L, K, list_hurst, ind_hurst=ind_hurst))
Update:
def list_hurst_funcs(M, seg_size=10):
    """Generate a list of Hurst function components
    Args:
        M: Int, number of hurst functions
        seg_size: Int, number of segmentations of interval [0, 1]
    Returns:
        list_hurst: List, list of hurst function components
    """
    list_hurst = []
    for i in range(M):
        seg_points = sorted(np.random.uniform(size=seg_size))
        funclist = np.random.uniform(size=seg_size + 1)
        list_hurst.append((seg_points, funclist))
    return list_hurst
def get_hurst_value(x, list_hurst, ind):
    if np.isscalar(x):
        x = np.array(float(x), ndmin=1)
    seg_points, funclist = list_hurst[ind]
    condlist = [x < seg_points[0]] +\
               [(x >= seg_points[s] and x < seg_points[s + 1])
                for s in range(len(seg_points) - 1)] +\
               [x >= seg_points[-1]]
    return np.piecewise(x, condlist=condlist, funclist=funclist)
One way to tackle a problem like this is to (try to) understand the big picture and come up with a different approach that treats everything as 2D or larger (L x K arrays). Another is to examine multifrac_Brownian_motion itself, trying to speed it up and, where possible, eliminate steps that depend on scalars or 1D arrays. In other words, work from the inside out. If we get enough of a speed improvement, it may not matter that we have to call it in a loop. Even better, the improvement may suggest ways of operating in higher dimensions.
As a start from inside out, I'd suggest replacing the t_ks calc with:
t_ks = -np.arange(K,0,-1)/L # 1./L if required by Py2 integer division
Since list_hurst, ind_hurst are the same for all n, I suspect you can move some time consuming parts of get_hurst_value outside the loop.
But I'd put most effort into improving that condlist construction. That's a list comprehension buried deep inside your outer loop.
piecewise also loops over those seg_points.
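One way the condlist construction might be avoided, sketched under the assumption that funclist holds one constant per segment (as list_hurst_funcs generates it): np.searchsorted finds the segment index for every x at once. The helper name get_hurst_values and the sample breakpoints are hypothetical.

```python
import numpy as np

def get_hurst_values(x, seg_points, funclist):
    """Vectorized piecewise-constant lookup (hypothetical helper)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # The number of breakpoints <= x is the index of the matching segment:
    # 0 for x < seg_points[0], len(seg_points) for x >= seg_points[-1].
    segment = np.searchsorted(seg_points, x, side="right")
    return np.asarray(funclist)[segment]

seg_points = [0.2, 0.5, 0.8]       # illustrative sorted breakpoints
funclist = [0.1, 0.3, 0.6, 0.9]    # one constant per segment
values = get_hurst_values([0.0, 0.2, 0.49, 0.8, 1.0], seg_points, funclist)
```

Because this accepts the whole t_ns array at once, the Hurst values for all n could be computed once, before the loop over n.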
