Extract dependency information between the variables of a numpy expression


I want to extract dependency information between the variables in a string that contains a Python 3 NumPy expression. For example:
import numpy as np
a = np.array([10, 20, 30])
b = np.array([10, 20, 30])
c = np.array([10, 20, 30])
d = 100
# The expression string will only contain numbers, arrays, and `NumPy` functions.
expression = 'b[1:3] = a[0:3:2] + np.sum(c[:]) + d'
deps = extract_dependencies(expression)
Then, the result should be:
deps: {
'b[0]': [],
'b[1]': ['a[0]', 'c[0]', 'c[1]', 'c[2]', 'd'],
'b[2]': ['a[2]', 'c[0]', 'c[1]', 'c[2]', 'd']
}
My problem is that I can't figure out how to implement extract_dependencies().
This is easy to solve if all the symbols in the expression are either not arrays or arrays with single-element indexing (e.g., foo, bar[0], baz[2]). It can be done by using either regular expressions or some basic text parsing.
However, if the variables are arrays, things get more complicated. For basic operations, one could use regular expressions to find the variables in the expression string, and then extract and map array indices accordingly. For example, it is easy to extract and match the indices of the expression a[0:2] = b[1:3]. Things become trickier when functions are used as part of the expression string because they are essentially "black boxes". You can't account for all possible function signatures, behaviors, and return values unless you hard-code every single one of them.
I was wondering if this could be solved using some clever use of Python's eval, exec, or ast trees.
Any ideas? :)
Thank you.
PS: The expression string is eventually evaluated using the asteval library. Hence, a solution that utilizes asteval will get extra points! :)

I (main author of asteval) think this is not possible, at least not in general.
As you say, even with your expression
expression = 'b[1:3] = a[0:3:2] + np.sum(c[:]) + d'
you need to know:
How strides work: a slice [x:y:n] means (x, x+n, x+2n, ...) up to y-1. OK, that's not too hard for 1D arrays.
What np.sum() does: that it returns a value with arrays summed along an axis. What happens if c = np.array([[5, 4], [9, -2]])? The expression works, but the element-by-element dependencies change. Which means that you have to know what np.sum does in detail.
What d is. What happens if it changes from a scalar to a 2-element array?
Basically, you're asking to figure out element-by-element dependencies lexically. You have to know what a, c, d, and np.sum are to know what the resulting dependencies will be. You cannot tell from the words alone.
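To illustrate the first point: turning a slice into concrete element indices needs the array length, which the expression string does not contain. A minimal sketch using the built-in slice.indices(); the helper name slice_elements is made up for illustration and only covers 1D arrays:

```python
def slice_elements(name, start, stop, step, length):
    """List the elements of a 1-D array `name` selected by start:stop:step."""
    # slice.indices() clips the slice to `length` and fills in missing parts
    indices = range(*slice(start, stop, step).indices(length))
    return [f"{name}[{i}]" for i in indices]


print(slice_elements("a", 0, 3, 2, length=3))        # ['a[0]', 'a[2]']
print(slice_elements("b", 1, 3, None, length=3))     # ['b[1]', 'b[2]']
# c[:] selects a different set of elements as soon as the length changes
print(slice_elements("c", None, None, None, length=5))
```

The length argument has to come from the actual array at run time, which is exactly the information a purely lexical analysis lacks.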
Python views slices and function calls as operations to be done at run time. So it parses this expression then expects that whatever is held by d can be added to the result of whatever np.sum returns with the function argument returned by evaluating an "all-null slice" on whatever c is. If those two things cannot be added together, it will fail at run time.
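What can be recovered lexically is the name-level picture: which variables appear on each side of the assignment. The sketch below uses the standard ast module for that; the function names are hypothetical, and it deliberately says nothing about individual elements, for the reasons given above.

```python
import ast


def base_name(node):
    """Strip subscripts/attributes: b[1:3] -> 'b', np.sum -> 'np'."""
    while isinstance(node, (ast.Subscript, ast.Attribute)):
        node = node.value
    return node.id if isinstance(node, ast.Name) else None


def name_level_dependencies(statement):
    """Map each assigned variable to the sorted names used on the right-hand side."""
    assign = ast.parse(statement).body[0]
    if not isinstance(assign, ast.Assign):
        raise ValueError("expected a single assignment statement")
    sources = sorted({n.id for n in ast.walk(assign.value) if isinstance(n, ast.Name)})
    return {base_name(target): sources for target in assign.targets}


print(name_level_dependencies('b[1:3] = a[0:3:2] + np.sum(c[:]) + d'))
# {'b': ['a', 'c', 'd', 'np']}
```

Note that np shows up as a dependency too; deciding whether a name is a module, a function, or data again requires looking at the run-time objects.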

Related

Converting MatLab for loop with array code to python

I was given code in MATLAB written by someone else and asked to convert it to Python. However, I do not know MATLAB. This is the code:
for i = 1:nWind
[input(a:b,:), t(a:b,1)] = EulerMethod(A(:,:,:,i),S(:,:,i),B(:,i),n,scale(:,i),tf,options);
fprintf("%d\n",i);
for j = 1:b
vwa = generate_wind([input(j,9);input(j,10)],A(:,:,:,i),S(:,:,i),B(:,i),n,scale(:,i));
wxa(j) = vwa(1);
wya(j) = vwa(2);
end
% Pick random indexes for filtered inputs
rand_index = randi(tf/0.01-1,1,filter_size);
inputf(c:d,:) = input(a+rand_index,:);
wxf(c:d,1) = wxa(1,a+rand_index);
wyf(c:d,1) = wya(1,a+rand_index);
wzf(c:d,1) = 0;
I am confused about what [input(a:b,:), t(a:b,1)] means, and whether wxf, wzf, and wyf are part of the MATLAB library or user-defined. Also, EulerMethod and generate_wind are separate classes. Can someone help me convert this code to Python?
The only thing I really changed so far is changing the for loop from:
for i = 1:nWind
to
for i in range(1,nWind):

There are several things to unpack here.
First, MATLAB indexing is 1-based, while Python indexing is 0-based. So, your for i = 1:nWind from MATLAB should translate to for i in range(0,nWind) in Python (with the zero optional). For nWind = 5, MATLAB would produce 1,2,3,4,5 while Python range would produce 0,1,2,3,4.
Second, wxf, wyf, and wzf are local variables. MATLAB is unique in that you can assign into specific indices at the same time variables are declared. These lines are assigning the first rows of wxa and wya (since their first index is 1) into the first columns of wxf and wyf (since their second index is 1). MATLAB will also expand an array if you assign past its end.
Without seeing the rest of the code, I don't really know what c and d are doing. If c is initialized to 1 before the loop and there's something like c = d+1; later, then it would be that your variables wxf, wyf, and wzf are being initialized on the first iteration of the loop and expanded on later iterations. This is a common (if frowned upon) pattern in MATLAB. If this is the case, you'd replicate it in Python by initializing to an empty list before the loop and using the list's extend() method inside the loop (though I bet it's frowned upon in Python, as well). But really, we need you to edit your question to include a, b, c, and d if you want to be sure this is really the case.
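If the grow-inside-the-loop guess is right, one way it might look in Python is to collect the per-iteration pieces in a plain list and build the array once at the end. This is only a sketch; the values appended here are placeholders, not the real wind data:

```python
import numpy as np

chunks = []                          # plain Python list, cheap to append to
for i in range(3):
    block = np.full(2, i)            # stand-in for the values computed in iteration i
    chunks.append(block)

wxf = np.concatenate(chunks)         # build the final array in one step
print(wxf)                           # [0 0 1 1 2 2]
```

If the final size is known up front, preallocating with np.zeros and assigning into slices is usually the simpler choice.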
Third, EulerMethod and generate_wind are functions, not classes. EulerMethod returns two outputs, which you'd probably replicate in Python by returning a tuple.
[input(a:b,:), t(a:b,1)] = EulerMethod(...); is assigning the two outputs of EulerMethod into specific ranges of input and t. Similar concepts as in points 1 and 2 apply here.
Those are the answers to what you expressed confusion about. Without sitting down and doing it myself, I don't have enough experience in Python to give more Python-specific recommendations.
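For what it's worth, here is a rough Python sketch of the multi-output assignment. The function euler_method is a hypothetical stand-in (the real EulerMethod is not shown in the question), and the way a and b are computed is an assumption about the layout:

```python
import numpy as np


def euler_method(n_steps):
    """Hypothetical stand-in for EulerMethod; returns two outputs as a tuple."""
    states = np.ones((n_steps, 3))                      # placeholder state rows
    times = 0.01 * np.arange(n_steps).reshape(-1, 1)    # placeholder time column
    return states, times


n_wind, n_steps = 2, 4
inputs = np.zeros((n_wind * n_steps, 3))   # renamed: `input` shadows a Python builtin
t = np.zeros((n_wind * n_steps, 1))

for i in range(n_wind):                    # MATLAB: for i = 1:nWind
    a = i * n_steps                        # assumed layout: one block of rows per i
    b = a + n_steps                        # Python slices exclude the end index
    # MATLAB: [input(a:b,:), t(a:b,1)] = EulerMethod(...)
    inputs[a:b, :], t[a:b, :] = euler_method(n_steps)

print(inputs.shape, t[:, 0])
```

Keep in mind that MATLAB's a:b includes b and counts from 1, so a MATLAB range a:b corresponds to the Python slice a-1:b.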

What is the simplest python equivalent to R `:` operator to create a sequence of numbers outside indexing

Is there a simple Python equivalent to R's : operator to create a vector of numbers? I only found range().
Example:
vector_example <- 1:4
vector_example
Output:
[1] 1 2 3 4

You mention range(). That's the standard answer for Python's equivalent. It returns a sequence. If you want the equivalent in a Python list, just create a list from the sequence returned by range():
range_list = list(range(1,5))
Result:
[1, 2, 3, 4]
I don't know R, but from your example, it appears that its : operator's second argument is inclusive; that is, that number is included in the resulting sequence. This is not true of Python's range() function. The second parameter passed to it is not included in the resulting sequence. So where you use 4 in your example, you want to use 5 with Python to get the same result.
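If you need this often, a tiny helper can hide the off-by-one. The name r_colon is made up for this sketch, and it only handles ascending integer sequences:

```python
def r_colon(start, stop):
    """Inclusive integer sequence, like start:stop in R when start <= stop."""
    return list(range(start, stop + 1))


print(r_colon(1, 4))   # [1, 2, 3, 4]
```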

I remember being frustrated by the lack of : to create sequences of consecutive numbers when I first switched from R to Python. In general, there is no direct equivalent to the : operator. Python sequences are more like R's seq() function.
While the base function range is alright, I personally prefer numpy.arange, as it is more flexible.
import numpy as np
# Create a simple array from 1 through 4
np.arange(1, 5)
# This is what I mean by "more flexible"
np.arange(1, 5).tolist()
Remember that Python lists and arrays are 0-indexed. As far as I know, all such intervals are right-open too, so np.arange(a, b) will exclude b.
PS: There are other functions, such as numpy.linspace which may suit your needs.
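A quick side-by-side of the two, as a sketch: np.arange takes a step and excludes the stop value, while np.linspace takes a number of points and includes the endpoint by default.

```python
import numpy as np

ints = np.arange(1, 5)               # stop is excluded: 1, 2, 3, 4
quarters = np.arange(0, 1, 0.25)     # non-integer steps are allowed
points = np.linspace(1, 4, num=4)    # endpoint is included: 1.0, 2.0, 3.0, 4.0

print(ints, quarters, points)
```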

How can i check that a list is in my array in python

For example, if I have:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
and I want to check whether the following list is the same as one of the rows that the array consists of:
B = [2,3,4]
I tried
B in A #which returns True
But the following also returns True, which should be false:
B = [2,2,2]
B in A

Try this generator comprehension. The builtin any() short-circuits so that you don't have extra evaluations that you don't need.
any(np.array_equal(row, B) for row in A)
For now, np.array_equal doesn't implement internal short-circuiting. In a different question the performance impact of different ways of accomplishing this is discussed.
As @Dan mentions below, broadcasting is another valid way to solve this problem, and it's often (though not always) a better way. For some rough heuristics, here's how you might want to choose between the two approaches. As with any other micro-optimization, benchmark your results.
Generator Comprehension
Reduced memory footprint (not creating the array B==A)
Short-circuiting (if the first row of A is B, we don't have to look at the rest)
When rows are large (definition depends on your system, but could be ~100 - 100,000), broadcasting isn't noticeably faster.
Uses builtin language features. You have numpy installed anyway, but I'm partial to using the core language when there isn't a reason to do otherwise.
Broadcasting
Fastest way to solve an extremely broad range of problems using numpy. Using it here is good practice.
If we do have to search through every row in A (i.e. if more often than not we expect B to not be in A), broadcasting will almost always be faster (not always a lot faster necessarily, see next point)
When rows are smallish, the generator expression won't be able to vectorize the computations efficiently, so broadcasting will be substantially faster (unless of course you have enough rows that short-circuiting outweighs that concern).
In a broader context where you have more numpy code, the use of broadcasting here can help to have more consistent patterns in your code base. Coworkers and future you will appreciate not having a mix of coding styles and patterns.

You can do it by using broadcasting like this:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,3,4]) # Or [2,3,4], a list will work fine here too
(B==A).all(axis=1).any()
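Broken into steps, the same idea also shows why the plain `in` test from the question misfires. As far as I know, `B in A` behaves like (A == B).any(), so a single matching element is enough. A small sketch with the question's failing example:

```python
import numpy as np

A = np.array([[2, 3, 4], [5, 6, 7]])
B = [2, 2, 2]

elementwise = (A == B)                  # B is broadcast against every row of A
print(elementwise.any())                # True: one matching element is enough

row_matches = elementwise.all(axis=1)   # a row counts only if every element matches
print(row_matches.any())                # False: no row equals B
```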

Using the built-in any. As soon as a matching row is found, it stops iterating and returns True.
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = [3,2,4]
if any(np.array_equal(B, x) for x in A):
    print(f'{B} inside {A}')
else:
    print(f'{B} NOT inside {A}')

You need to use .all() for comparing all the elements of list.
A = np.array([[2,3,4],[5,6,7]])
B = [2,3,4]
for i in A:
    if (i==B).all():
        print("Yes, B is present in A")
        break
EDIT: I added break to leave the loop as soon as the first occurrence is found. This matters for an example such as A = np.array([[2,3,4],[2,3,4]])
# print ("Yes, B is present in A")
Alternative solution using any:
any((i==B).all() for i in A)
# True

list((A[[i], :]==B).all() for i in range(A.shape[0]))
[True, False]
This will tell you what row of A is equal to B
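If you want the matching row numbers rather than a list of booleans, a vectorized variant could look like this (a sketch; the third row is added here just to show multiple matches):

```python
import numpy as np

A = np.array([[2, 3, 4], [5, 6, 7], [2, 3, 4]])
B = [2, 3, 4]

# Indices of the rows in which every element equals the matching element of B
matching_rows = np.flatnonzero((A == B).all(axis=1))
print(matching_rows)   # [0 2]
```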

Straightforward: you could use any() to go through a generator comparing the arrays with array_equal.
from numpy import array_equal
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,2,4])
in_A = lambda x, A : any((array_equal(a,x) for a in A))
print(in_A(B, A))
False

Matlab repr function

In Matlab, one can evaluate an arbitrary string as code using the eval function. E.g.
s = '{1, 2, ''hello''}' % char
c = eval(s) % cell
Is there any way to do the inverse operation; getting the literal string representation of an arbitrary variable? That is, recover s from c?
Something like
s = repr(c)
Such a repr function is built into Python, but I've not come across anything like it in Matlab, nor do I see a clear way of how to implement it myself.
The closest thing I know of is something like disp(c) which prints out a representation of c, but in a "readable" format as opposed to a literal code format.
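For comparison, this is the Python round trip I have in mind, shown with a list standing in for the cell array. ast.literal_eval is used here instead of eval because it only accepts literals:

```python
import ast

c = [1, 2, 'hello']
s = repr(c)                      # literal source-code form of c
print(s)                         # [1, 2, 'hello']

restored = ast.literal_eval(s)   # parse the literal back into an object
print(restored == c)             # True
```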

The closest there is in Matlab is mat2str, which works for numeric, character or logical 2D arrays (including vectors). (It doesn't work for ND arrays, cell arrays, struct arrays, or tables).
Examples:
>> a = [1 2; 3 4]; ar = mat2str(a), isequal(eval(ar), a)
ar =
'[1 2;3 4]'
ans =
logical
1
>> a = ['abc'; 'def']; ar = mat2str(a), isequal(eval(ar), a)
ar =
'['abc';'def']'
ans =
logical
1
In this related question and answers you can see:
A function I wrote for obtaining a string representation of 2D cell arrays with arbitrarily nested cell, numeric, char or logical arrays.
How to do what you want in Octave for arbitrary data types.

OK, I see your pain.
My advice would still be to provide a toString-style function built on fprintf, sprintf, and friends, but I understand that it may be tedious if you do not know the type of the data, and it also requires several subcases.
For a quick fix you can use evalc with the disp function you mentioned.
Something like this should work:
function out = repr(x)
out = evalc('disp(x)');
end
Or succinctly
repr = @(x) evalc('disp(x)');

Depending on exactly why you want to do this, your use case may be resolved with matlab.io.saveVariablesToScript
Here is the doc for it.
Hope that helps!

Python eval function with numpy arrays via string input with dictionaries

I am implementing code in Python in which the variables are stored in NumPy vectors. I need to perform a simple operation, something like (vec1+vec2^2)/vec3, where the sums and products are applied element by element (analogous to MATLAB's elementwise .* operation).
The problem is in my code that I have dictionary which stores all vectors:
var = {'a':np.array([1,2,2]),'b':np.array([2,1,3]),'c':np.array([3])}
The 3rd vector is just 1 number which means that I want to multiply this number by each element in other arrays like 3*[1,2,3]. And at the same time I have formula which is provided as a string:
formula = '2*a*(b/c)**2'
I am replacing the formula using Regexp:
formula_for_dict_variables = re.sub(r'([A-z][A-z0-9]*)', r'%(\1)s', formula)
which produces result:
2*%(a)s*(%(b)s/%(c)s)**2
and substitute the dictionary variables:
eval(formula%var)
When I have just pure numbers (not NumPy arrays), everything works, but when I place numpy arrays in the dict I receive an error.
Could you give an example how can I solve this problem or maybe suggest some different approach. Given that vectors are stored in dictionary and formula is a string input.
I also can store variables in any other container. The problem is that I don't know the name of variables and formula before the execution of code (they are provided by user).
Also I think iteration through each element in vectors probably will be slow given the python for loops are slow.

Using numexpr, then you could do this:
In [143]: import numexpr as ne
In [146]: ne.evaluate('2*a*(b/c)**2', local_dict=var)
Out[146]: array([ 0.88888889, 0.44444444, 4. ])

Pass the dictionary to python eval function:
>>> var = {'a':np.array([1,2,2]),'b':np.array([2,1,3]),'c':np.array([3])}
>>> formula = '2*a*(b/c)**2'
>>> eval(formula, var)
array([ 0.8889, 0.4444, 4. ])
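One detail worth knowing: when a dict is passed as the globals argument, eval inserts a __builtins__ entry into it. A variation that keeps var untouched is to pass it as the locals mapping instead. This is only a sketch, and emptying __builtins__ is not a security sandbox, so user-supplied formulas still deserve caution:

```python
import numpy as np

var = {'a': np.array([1, 2, 2]), 'b': np.array([2, 1, 3]), 'c': np.array([3])}
formula = '2*a*(b/c)**2'

# Globals: an empty namespace with no builtins; locals: the user's variables
result = eval(formula, {"__builtins__": {}}, var)
print(result)
```

The one-element array c is broadcast against a and b, so no explicit loop over the elements is needed.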
