I've been writing an lexer/parser/interpreter for my own language and so far all has been working. I've been following the examples over at Ruslan Spivak's blog (Github link to each article).
I wanted to extend my language grammar past what is written in the articles to include more operators like comparisons (<, >=, etc.) and also exponents (** or ^ in my language). I have this grammar:
expression : exponent ((ADD | SUB) exponent)*
exponent : term ((POWER) term)*
# this one is right-associative (powers **)
term : comparison ((MUL | DIV) comparison)*
comparison : factor ((EQUAl | L_EQUAL | LESS
N_EQUAL | G_EQUAL | GREATER) factor)*
# these are all binary operations
factor : NUM | STR | variable
| ADD factor | SUB factor
| LPAREN expr RPAREN
# different types of 'base' types like integers
# also contains parenthesised expressions which are evalutaed first
In terms of parsing tokens, I use the same method as used in Ruslan's blog. Here is one that will parse the exponent line, which handles addition and subtraction despite its name, as the grammar says that expressions are parsed as
exponent_expr (+ / -) exponent_expr
def exponent(self):
node = self.term()
while self.current_token.type in (ADD, SUB):
token = self.current_token
if token.type == ADD:
self.consume_token(ADD)
elif token.type == SUB:
self.consume_token(SUB)
node = BinaryOperation(left_node=node,
operator=token,
right_node=self.term())
return node
Now this parses left-associative tokens just fine (since the token stream comes left to right naturally), but I am stuck on how to parse right-associative exponents. Look at this expected in/out for reference:
>>> 2 ** 3 ** 2
# should be parsed as...
>>> 2 ** (3 ** 2)
# which is...
>>> 2 ** 9
# which returns...
512
# Mine, at the moment, parses it as...
>>> (2 ** 3) ** 2
# which is...
>>> 8 ** 2
# which returns...
64
To solve this, I tried switching the BinaryOperation() constructor's left and right nodes to make the current node the right and the new node the left, but this just makes 2**5 parse as 5**2 which gives me 25 instead of the expected 32.
Any approaches that I could try?
The fact that your exponent function actually parses expressions should have been a red flag. In fact, what you need is an expression function which parses expressions and an exponent function which parses exponentiations.
You've also mixed up the precedences of exponentiation and multiplication (and other operations), because 2 * x ** 4 does not mean (2 * x) ** 4 (which would be 16x⁴), but rather 2 * (x ** 4). By the same token, x * 3 < 17 does not mean x * (3 < 17), which is how your grammar will parse it.
Normally the precedences for arithmetics look something like this:
comparison <, <=, ==, ... ( lowest precedence)
additive +, -
multiplicative *, /, %
unary +, -
exponentiation **
atoms numbers, variables, parenthesized expressions, etc.
(If you had postfix operators like function calls, they would go in between exponentiation and atoms.)
Once you've reworked your grammar in this form, the exponent parser will look something like this:
def exponent(self):
node = self.term()
while self.current_token.type is POWER:
self.consume_token(ADD)
node = BinaryOperation(left_node=node,
operator=token,
right_node=self.exponent())
return node
The recursive call at the end produces right associativity. In this case recursion is acceptable because the left operand and the operator have already been consumed. Thus the recursive call cannot produce an infinite loop.
Related
What's the usage of the tilde operator in Python?
One thing I can think about is do something in both sides of a string or list, such as check if a string is palindromic or not:
def is_palindromic(s):
return all(s[i] == s[~i] for i in range(len(s) / 2))
Any other good usage?
It is a unary operator (taking a single argument) that is borrowed from C, where all data types are just different ways of interpreting bytes. It is the "invert" or "complement" operation, in which all the bits of the input data are reversed.
In Python, for integers, the bits of the twos-complement representation of the integer are reversed (as in b <- b XOR 1 for each individual bit), and the result interpreted again as a twos-complement integer. So for integers, ~x is equivalent to (-x) - 1.
The reified form of the ~ operator is provided as operator.invert. To support this operator in your own class, give it an __invert__(self) method.
>>> import operator
>>> class Foo:
... def __invert__(self):
... print 'invert'
...
>>> x = Foo()
>>> operator.invert(x)
invert
>>> ~x
invert
Any class in which it is meaningful to have a "complement" or "inverse" of an instance that is also an instance of the same class is a possible candidate for the invert operator. However, operator overloading can lead to confusion if misused, so be sure that it really makes sense to do so before supplying an __invert__ method to your class. (Note that byte-strings [ex: '\xff'] do not support this operator, even though it is meaningful to invert all the bits of a byte-string.)
~ is the bitwise complement operator in python which essentially calculates -x - 1
So a table would look like
i ~i
-----
0 -1
1 -2
2 -3
3 -4
4 -5
5 -6
So for i = 0 it would compare s[0] with s[len(s) - 1], for i = 1, s[1] with s[len(s) - 2].
As for your other question, this can be useful for a range of bitwise hacks.
One should note that in the case of array indexing, array[~i] amounts to reversed_array[i]. It can be seen as indexing starting from the end of the array:
[0, 1, 2, 3, 4, 5, 6, 7, 8]
^ ^
i ~i
Besides being a bitwise complement operator, ~ can also help revert a boolean value, though it is not the conventional bool type here, rather you should use numpy.bool_.
This is explained in,
import numpy as np
assert ~np.True_ == np.False_
Reversing logical value can be useful sometimes, e.g., below ~ operator is used to cleanse your dataset and return you a column without NaN.
from numpy import NaN
import pandas as pd
matrix = pd.DataFrame([1,2,3,4,NaN], columns=['Number'], dtype='float64')
# Remove NaN in column 'Number'
matrix['Number'][~matrix['Number'].isnull()]
The only time I've ever used this in practice is with numpy/pandas. For example, with the .isin() dataframe method.
In the docs they show this basic example
>>> df.isin([0, 2])
num_legs num_wings
falcon True True
dog False True
But what if instead you wanted all the rows not in [0, 2]?
>>> ~df.isin([0, 2])
num_legs num_wings
falcon False False
dog True False
I was solving this leetcode problem and I came across this beautiful solution by a user named Zitao Wang.
The problem goes like this for each element in the given array find the product of all the remaining numbers without making use of divison and in O(n) time
The standard solution is:
Pass 1: For all elements compute product of all the elements to the left of it
Pass 2: For all elements compute product of all the elements to the right of it
and then multiplying them for the final answer
His solution uses only one for loop by making use of. He computes the left product and right product on the fly using ~
def productExceptSelf(self, nums):
res = [1]*len(nums)
lprod = 1
rprod = 1
for i in range(len(nums)):
res[i] *= lprod
lprod *= nums[i]
res[~i] *= rprod
rprod *= nums[~i]
return res
Explaining why -x -1 is correct in general (for integers)
Sometimes (example), people are surprised by the mathematical behaviour of the ~ operator. They might reason, for example, that rather than evaluating to -19, the result of ~18 should be 13 (since bin(18) gives '0b10010', inverting the bits would give '0b01101' which represents 13 - right?). Or perhaps they might expect 237 (treating the input as signed 8-bit quantity), or some other positive value corresponding to larger integer sizes (such as the machine word size).
Note, here, that the signed interpretation of the bits 11101101 (which, treated as unsigned, give 237) is... -19. The same will happen for larger numbers of bits. In fact, as long as we use at least 6 bits, and treating the result as signed, we get the same answer: -19.
The mathematical rule - negate, and then subtract one - holds for all inputs, as long as we use enough bits, and treat the result as signed.
And, this being Python, conceptually numbers use an arbitrary number of bits. The implementation will allocate more space automatically, according to what is necessary to represent the number. (For example, if the value would "fit" in one machine word, then only one is used; the data type abstracts the process of sign-extending the number out to infinity.) It also does not have any separate unsigned-integer type; integers simply are signed in Python. (After all, since we aren't in control of the amount of memory used anyway, what's the point in denying access to negative values?)
This breaks intuition for a lot of people coming from a C environment, in which it's arguably best practice to use only unsigned types for bit manipulation and then apply 2s-complement interpretation later (and only if appropriate; if a value is being treated as a group of "flags", then a signed interpretation is unlikely to make sense). Python's implementation of ~, however, is consistent with its other design choices.
How to force unsigned behaviour
If we wanted to get 13, 237 or anything else like that from inverting the bits of 18, we would need some external mechanism to specify how many bits to invert. (Again, 18 conceptually has arbitrarily many leading 0s in its binary representation in an arbitrary number of bits; inverting them would result in something with leading 1s; and interpreting that in 2s complement would give a negative result.)
The simplest approach is to simply mask off those arbitrarily-many bits. To get 13 from inverting 18, we want 5 bits, so we mask with 0b11111, i.e., 31. More generally (and giving the same interface for the original behaviour):
def invert(value, bits=None):
result = ~value
return result if bits is None else (result & ((1 << bits) - 1))
Another way, per Andrew Jenkins' answer at the linked example question, is to XOR directly with the mask. Interestingly enough, we can use XOR to handle the default, arbitrary-precision case. We simply use an arbitrary-sized mask, i.e. an integer that conceptually has an arbitrary number of 1 bits in its binary representation - i.e., -1. Thus:
def invert(value, bits=None):
return value ^ (-1 if bits is None else ((1 << bits) - 1))
However, using XOR like this will give strange results for a negative value - because all those arbitrarily-many set bits "before" (in more-significant positions) the XOR mask weren't cleared:
>>> invert(-19, 5) # notice the result is equal to 18 - 32
-14
it's called Binary One’s Complement (~)
It returns the one’s complement of a number’s binary. It flips the bits. Binary for 2 is 00000010. Its one’s complement is 11111101.
This is binary for -3. So, this results in -3. Similarly, ~1 results in -2.
~-3
Output : 2
Again, one’s complement of -3 is 2.
This is minor usage is tilde...
def split_train_test_by_id(data, test_ratio, id_column):
ids = data[id_column]
in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
return data.loc[~in_test_set], data.loc[in_test_set]
the code above is from "Hands On Machine Learning"
you use tilde (~ sign) as alternative to - sign index marker
just like you use minus - is for integer index
ex)
array = [1,2,3,4,5,6]
print(array[-1])
is the samething as
print(array[~1])
What's the usage of the tilde operator in Python?
One thing I can think about is do something in both sides of a string or list, such as check if a string is palindromic or not:
def is_palindromic(s):
return all(s[i] == s[~i] for i in range(len(s) / 2))
Any other good usage?
It is a unary operator (taking a single argument) that is borrowed from C, where all data types are just different ways of interpreting bytes. It is the "invert" or "complement" operation, in which all the bits of the input data are reversed.
In Python, for integers, the bits of the twos-complement representation of the integer are reversed (as in b <- b XOR 1 for each individual bit), and the result interpreted again as a twos-complement integer. So for integers, ~x is equivalent to (-x) - 1.
The reified form of the ~ operator is provided as operator.invert. To support this operator in your own class, give it an __invert__(self) method.
>>> import operator
>>> class Foo:
... def __invert__(self):
... print 'invert'
...
>>> x = Foo()
>>> operator.invert(x)
invert
>>> ~x
invert
Any class in which it is meaningful to have a "complement" or "inverse" of an instance that is also an instance of the same class is a possible candidate for the invert operator. However, operator overloading can lead to confusion if misused, so be sure that it really makes sense to do so before supplying an __invert__ method to your class. (Note that byte-strings [ex: '\xff'] do not support this operator, even though it is meaningful to invert all the bits of a byte-string.)
~ is the bitwise complement operator in python which essentially calculates -x - 1
So a table would look like
i ~i
-----
0 -1
1 -2
2 -3
3 -4
4 -5
5 -6
So for i = 0 it would compare s[0] with s[len(s) - 1], for i = 1, s[1] with s[len(s) - 2].
As for your other question, this can be useful for a range of bitwise hacks.
One should note that in the case of array indexing, array[~i] amounts to reversed_array[i]. It can be seen as indexing starting from the end of the array:
[0, 1, 2, 3, 4, 5, 6, 7, 8]
^ ^
i ~i
Besides being a bitwise complement operator, ~ can also help revert a boolean value, though it is not the conventional bool type here, rather you should use numpy.bool_.
This is explained in,
import numpy as np
assert ~np.True_ == np.False_
Reversing logical value can be useful sometimes, e.g., below ~ operator is used to cleanse your dataset and return you a column without NaN.
from numpy import NaN
import pandas as pd
matrix = pd.DataFrame([1,2,3,4,NaN], columns=['Number'], dtype='float64')
# Remove NaN in column 'Number'
matrix['Number'][~matrix['Number'].isnull()]
The only time I've ever used this in practice is with numpy/pandas. For example, with the .isin() dataframe method.
In the docs they show this basic example
>>> df.isin([0, 2])
num_legs num_wings
falcon True True
dog False True
But what if instead you wanted all the rows not in [0, 2]?
>>> ~df.isin([0, 2])
num_legs num_wings
falcon False False
dog True False
I was solving this leetcode problem and I came across this beautiful solution by a user named Zitao Wang.
The problem goes like this for each element in the given array find the product of all the remaining numbers without making use of divison and in O(n) time
The standard solution is:
Pass 1: For all elements compute product of all the elements to the left of it
Pass 2: For all elements compute product of all the elements to the right of it
and then multiplying them for the final answer
His solution uses only one for loop by making use of. He computes the left product and right product on the fly using ~
def productExceptSelf(self, nums):
res = [1]*len(nums)
lprod = 1
rprod = 1
for i in range(len(nums)):
res[i] *= lprod
lprod *= nums[i]
res[~i] *= rprod
rprod *= nums[~i]
return res
Explaining why -x -1 is correct in general (for integers)
Sometimes (example), people are surprised by the mathematical behaviour of the ~ operator. They might reason, for example, that rather than evaluating to -19, the result of ~18 should be 13 (since bin(18) gives '0b10010', inverting the bits would give '0b01101' which represents 13 - right?). Or perhaps they might expect 237 (treating the input as signed 8-bit quantity), or some other positive value corresponding to larger integer sizes (such as the machine word size).
Note, here, that the signed interpretation of the bits 11101101 (which, treated as unsigned, give 237) is... -19. The same will happen for larger numbers of bits. In fact, as long as we use at least 6 bits, and treating the result as signed, we get the same answer: -19.
The mathematical rule - negate, and then subtract one - holds for all inputs, as long as we use enough bits, and treat the result as signed.
And, this being Python, conceptually numbers use an arbitrary number of bits. The implementation will allocate more space automatically, according to what is necessary to represent the number. (For example, if the value would "fit" in one machine word, then only one is used; the data type abstracts the process of sign-extending the number out to infinity.) It also does not have any separate unsigned-integer type; integers simply are signed in Python. (After all, since we aren't in control of the amount of memory used anyway, what's the point in denying access to negative values?)
This breaks intuition for a lot of people coming from a C environment, in which it's arguably best practice to use only unsigned types for bit manipulation and then apply 2s-complement interpretation later (and only if appropriate; if a value is being treated as a group of "flags", then a signed interpretation is unlikely to make sense). Python's implementation of ~, however, is consistent with its other design choices.
How to force unsigned behaviour
If we wanted to get 13, 237 or anything else like that from inverting the bits of 18, we would need some external mechanism to specify how many bits to invert. (Again, 18 conceptually has arbitrarily many leading 0s in its binary representation in an arbitrary number of bits; inverting them would result in something with leading 1s; and interpreting that in 2s complement would give a negative result.)
The simplest approach is to simply mask off those arbitrarily-many bits. To get 13 from inverting 18, we want 5 bits, so we mask with 0b11111, i.e., 31. More generally (and giving the same interface for the original behaviour):
def invert(value, bits=None):
result = ~value
return result if bits is None else (result & ((1 << bits) - 1))
Another way, per Andrew Jenkins' answer at the linked example question, is to XOR directly with the mask. Interestingly enough, we can use XOR to handle the default, arbitrary-precision case. We simply use an arbitrary-sized mask, i.e. an integer that conceptually has an arbitrary number of 1 bits in its binary representation - i.e., -1. Thus:
def invert(value, bits=None):
return value ^ (-1 if bits is None else ((1 << bits) - 1))
However, using XOR like this will give strange results for a negative value - because all those arbitrarily-many set bits "before" (in more-significant positions) the XOR mask weren't cleared:
>>> invert(-19, 5) # notice the result is equal to 18 - 32
-14
it's called Binary One’s Complement (~)
It returns the one’s complement of a number’s binary. It flips the bits. Binary for 2 is 00000010. Its one’s complement is 11111101.
This is binary for -3. So, this results in -3. Similarly, ~1 results in -2.
~-3
Output : 2
Again, one’s complement of -3 is 2.
This is minor usage is tilde...
def split_train_test_by_id(data, test_ratio, id_column):
ids = data[id_column]
in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
return data.loc[~in_test_set], data.loc[in_test_set]
the code above is from "Hands On Machine Learning"
you use tilde (~ sign) as alternative to - sign index marker
just like you use minus - is for integer index
ex)
array = [1,2,3,4,5,6]
print(array[-1])
is the samething as
print(array[~1])
I want to know why does the following happen.
The code below evaluates right side 1**3 first then 2**1
2**1**3 has the value of 2
However, for the below code left side 7//3 is evaluated first then 2*3. Finally 1+6-1=6.
1+7//3*3-1 has the value of 6
Take a look at the documentation of operator precedence. Although multiplication * and floor division // have the same precedence, you should take note of this part:
Operators in the same box group left to right (except for exponentiation, which groups from right to left).
For the convention of 213 being evaluated right-associative, see cross-site dupe on the math stackexchange site: What is the order when doing xyz and why?
The TL;DR is this: since the left-associative version (xy)z would just equal xy*z, it's not useful to have another (worse) notation for the same thing, so exponentiation should be right associative.
Almost all operators in Python (that share the same precedence) have left-to-right associativity. For example:
1 / 2 / 3 ≡ (1 / 2) / 3
One exception is the exponent operator which is right-to-left associativity:
2 ** 3 ** 4 ≡ 2 ** (3 ** 4)
That's just the way the language is defined, matching mathematical notation where abc ≡ a(bc).
If it were (ab)c, that would just be abc.
Per Operator Precedence, the operator is right associative: a**b**c**d == a**(b**(c**d)).
So, if you do this:
a,b,c,d = 2,3,5,7
a**b**c**d == a**(b**(c**d))
you should get true after a looooong time.
The Exponent operator in python has a Right to Left precedence. That is out of all the occurrences in an expression the calculation will be done from Rightmost to Leftmost. The exponent operator is an exception among all the other operators as most of them follow a Left to Right associativity rule.
2**1**3 = 2
The expression
1+7//3*3-1
It is a simple case of Left to Right associativity. As // and * operator share the same precedence, Associativity(one is the Left) is taken into account.
This is just how math typically works
213
This is the same as the first expression you used. To evaluate this with math, you'd work your way down, so 13=1 and then 21 which equals 2.
You can make sense of this just by thinking about the classic PEMDAS (or Please Excuse My Dear Aunt Sally) order of operations from mathematics. In your first one, 2**1**3 is equivalent to , which is really read as . Looking at it this way, you see that you do parenthesis (P) first (the 1**3).
In the second one, 1+7//3*3-1 == 6 you have to note that the MD and AS of PEMDAS are actually done in order of whichever comes first reading from left-to-right. It's simply a fault of language that we have to write one letter before another (that is, we could write this as PEDMAS and it still be correct if we treat the D and M appropriately).
All that to say, Python is treating the math exactly the same way as we should even if this were written with pen and paper.
I've been programming in Python for years but something extremely trivial has surprised me:
>>> -1 ** 2
-1
Of course, squaring any negative real number should produce a positive result. Probably Python's math is not completely broken. Let's look at how it parsed this expression:
>>> ast.dump(ast.parse('-1 ** 2').body[0])
Expr(
value=UnaryOp(
op=USub(),
operand=BinOp(
left=Num(n=1),
op=Pow(),
right=Num(n=2)
)
)
)
Ok, so it is treating it as if I had written -(1 ** 2). But why is the - prefix to 1 being treated as a separate unary subtraction operator, instead of the sign of the constant?
Note that the expression -1 is not parsed as the unary subtraction of the constant 1, but just the constant -1:
>>> ast.dump(ast.parse('-1').body[0])
Expr(
value=Num(n=-1)
)
The same goes for -1 * 2, even though it is syntactically nearly identical to the first expression.
>>> ast.dump(ast.parse('-1 * 2').body[0])
Expr(
value=BinOp(
left=Num(n=-1),
op=Mult(),
right=Num(n=2)
)
)
This behavior turns out to be common to many languages including perl, PHP, and Ruby.
It behaves just like the docs explain here:
2.4.4. Numeric literals
[...] Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.
and here:
6.5. The power operator
The power operator binds more tightly than unary operators on its
left; [...]
See also the precedence table from this part. Here is the relevant part from that table:
Operator | Description
-------------|---------------------------------
* | Multiplication, ...
+x, -x, ~x | Positive, negative, bitwise NOT
** | Exponentiation
This explains why the parse tree is different between the ** and * examples.
Suppose I would like to write a fairly simple programming language, and I want to implement operators such like 2 + 3 * 2 = 8
What is the general way to implement things like this?
I'm not sure how much detail you're interested in, but it sounds like you're looking to implement a parser. There's typically two steps:
The lexer reads over the text and converts it to tokens. For example, it might read "2 + 3 * 2" and convert it to INTEGER PLUS INTEGER STAR INTEGER
The parser reads in the tokens and tries to match them to rules. For example, you might have these rules:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr;
Product := Expr STAR Expr;
It reads the tokens and tries to apply the rules such that the start rule maps to the tokens its read in. In this case, it might do:
Expr := Sum
Expr := Expr PLUS Expr
Expr := INTEGER(2) PLUS Expr
Expr := INTEGER(2) PLUS Product
Expr := INTEGER(2) PLUS Expr STAR Expr
Expr := INTEGER(2) PLUS Integer(3) STAR Expr
Expr := INTEGER(2) PLUS Integer(3) STAR Integer(2)
There are many types of parsers. In this example I read from left to right, and started from the initial expression, working down until I'd replaced everything with a token, so this would be an LL parser. As it does this replacement, it can generate an abstract syntax tree that represents the data. The tree for this might look something like:
You can see that the Product rule is a child of the Sum rule, so it will end up happening first: 2 + (3 * 2). If the expression had been parsed differently we might've ended up with this tree:
Now we're calculating (2 + 3) * 2. It all comes down to which way the parser generates the tree
If you actually want to parse expressions, odds are you don't want to write the parser by hand. There are parser generators that take a configuration (called a grammar) similar to the one I used above, and generate the actual parser code. Parser generators will let you specify which rule should take priority, so for example:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr; [2]
Product := Expr STAR Expr; [1]
I labeled the Product rule as priority 1, and Sum as priority 2, so given the choice the generated parser will favor Product. You can also design the grammar itself such that the priority is built-in (this is the more common approach). For example:
Expr := Sum | INTEGER;
Sum := Expr PLUS Product;
Product := Term STAR INTEGER;
This forces the Products to be under the Sums in the AST. Naturally this grammar is very limited (for example, it wouldn't match 2 * 3 + 2), but a comprehensive grammar can be written that still embeds an order of operations automatically
You would need to write a parser for your fairly simple programming language. If you want to do this in Python, start by reading Ned Batchelder's blog post Python Parsing Tools.