I am learning Python, and need to format "From" fields received from IMAP. I tried it using str.find() and str.strip(), and also using regex. With find(), etc. my function runs quite a bit faster than with re (I timed it). So, when is it better to use re? Does anybody have any good links/articles related to that? Python documentation obviously doesn't mention that...
find only matches an exact sequence of characters, while a regular expression matches a pattern. Naturally, looking for an exact sequence is faster (even if your regex pattern is also an exact sequence, there is still some overhead involved).
As a consequence of the above, you should use find if you know the exact sequence, and a regular expression (or something else) when you don't. The exact approach you should use really depends on the complexity of the problem you face.
As a side note, the Python re module provides re.compile(), which lets you pre-compile a regex that you are going to use repeatedly. This can substantially improve speed if you apply the same pattern many times.
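As a rough sketch of what that looks like (the sample headers and the pattern are invented for illustration, not taken from your actual mail):

import re

# Hypothetical "From" headers; your real input may be formatted differently.
headers = ['From: "Alice Example" <alice@example.com>',
           'From: bob@example.org']

# Compile once, then reuse the compiled pattern for every header.
addr_re = re.compile(r'<?([\w.+-]+@[\w.-]+)>?')

for h in headers:
    m = addr_re.search(h)
    if m:
        print(m.group(1))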
If you intend to do something complex, you should use re. It scales better than string methods.
String methods are fine for simple tasks that are not worth the bother of a regular expression.
So it depends on what you are doing, but regular expressions are usually the more powerful choice.
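To make the trade-off concrete, here is a hedged sketch of both approaches on an invented "From" header; the exact header format and the pattern are assumptions, not something from your data:

import re

header = 'From: "Alice Example" <alice@example.com>'   # invented sample

# String-method version: fast, but assumes the address is always inside <...>.
start = header.find("<")
end = header.find(">", start + 1)
if start != -1 and end != -1:
    addr = header[start + 1:end]

# Regex version: a bit slower, but also copes with headers without angle brackets.
m = re.search(r'<([^>]+)>|(\S+@\S+)', header)
if m:
    addr = m.group(1) or m.group(2)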
I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as three different tokens unless I manually add a space between them, like x = 2.
It also fails to recognize the tokens at the beginning of each line.
(If I add spaces between every two tokens and at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and I still wasn't able to solve the first problem.
I also thought about splitting on operators, but I couldn't actually implement it; besides, I need the operators to be recognized as tokens too, so that might not be a good idea.
I would appreciate any solution or suggestion. Thank you.
f = open("code.txt")
input = f.read()
input = input.split(" ")   # splits only on single spaces, so "x=2" stays one token
f = open("code.txt")
input = f.read()
input1 = input.split("\n")     # first split into lines
for var in input1:
    var = var.split(" ")       # then split each line on spaces
If you try to handle an expression written both as x=2 and as x = 2 just by splitting, that isn't going to work.
What you are looking for is a solution that works with both, right?
A basic solution is to use an and operator combining the conditions you need to parse. Note that this solution isn't scalable, nor does it qualify as good practice, but it can help you work toward better (though harder) solutions.
if input.split(' ') and input.split('='):
An intermediate solution would be to use regex.
Regex isn't an easy topic, but you can check out the online documentation, and there are excellent online tools for testing your regex patterns.
Regex 101
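As one possible direction (not a complete C++ lexer; the token pattern below is a simplification I'm assuming for illustration), a single re.findall call can split x=2 into three tokens without any added spaces:

import re

code = "x=2\ny = x + 1\nfunction()"

# Identifiers, numbers, or any single non-space character (operators, parens, ...).
token_pattern = re.compile(r'[A-Za-z_]\w*|\d+|\S')
tokens = token_pattern.findall(code)
print(tokens)   # ['x', '=', '2', 'y', '=', 'x', '+', '1', 'function', '(', ')']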
The last option would be to convert your input into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as Clang.
This is a genuinely hard topic, so using it just to build a basic lexer will probably be very time consuming, but it might fit your needs.
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the next character.
Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
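Here is a rough sketch of that procedure under some simplifying assumptions (the token names and patterns are invented placeholders, nothing like a full C++ token set): at each position every pattern is tried, and the longest match wins.

import re

# Hypothetical lexeme patterns; a real lexer would need many more.
patterns = [
    ("NUMBER",     re.compile(r'\d+')),
    ("IDENTIFIER", re.compile(r'[A-Za-z_]\w*')),
    ("OPERATOR",   re.compile(r'[=+\-*/()]')),
    ("WHITESPACE", re.compile(r'\s+')),   # matched but ignored
]

def tokenize(text):
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pat in patterns:
            m = pat.match(text, pos)       # anchored at the current position
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError("no lexeme matches at position %d" % pos)
        name, m = best
        if name != "WHITESPACE":
            tokens.append((name, m.group()))
        pos = m.end()                      # continue scanning after the longest match
    return tokens

print(tokenize("x=2 + foo(10)"))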
You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically, using software designed for that purpose. These tools have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning to write more than one lexical analyser, you would be well advised to investigate some of them.
Notes
The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit              Shorthand
(something){0,1}      (something)?
(something){1}        (something)
(something){0,}       (something)*
(something){1,}       (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where the explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these, especially when the repeated pattern is more than a few characters. If you have to match three digits, some people like to write \d\d\d, but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road, if that number ever needs to change, I only need to change {3} to {n} rather than re-parse the regex in my head or worry about messing it up; it requires less mental effort.
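A quick, purely illustrative check that the two spellings behave the same:

import re

for s in ["123", "12", "1234"]:
    a = bool(re.fullmatch(r'\d\d\d', s))
    b = bool(re.fullmatch(r'\d{3}', s))
    print(s, a, b)   # the two columns agree for every input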
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to review their patterns, and that's when I would suggest changing those occurrences to the shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out whether they're available by testing them in your environment).
The problem I'd anticipate is that you may not be the only person who ever needs to work with your code.
Regexes are difficult enough for most people. Any time someone uses an unusual syntax, the question arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
I was wondering if anyone knew of a good Python library for evaluating text-based mathematical expressions. So for example,
>>> evaluate("Three plus nine")
12
>>> evaluate("Eight + two")
10
I've seen similar examples that people have done for numeric values and operators in a string. One method used eval to compute the literal value of the expression. And another method of doing this used regex to parse the text.
If there isn't an existing library that handles this well, I will probably end up using a combination of the regex and eval techniques. I just want to confirm that there isn't something like this already out there.
You could try pyparsing, which does general recursive descent parsing. In fact, here is something quite close to your second example.
Regarding your other suggestions:
See here about the security issues of eval (ironically, using it for a calculator).
Fundamentally, regular languages are weaker than the context-free languages recognized by pushdown automata. You shouldn't try to attack a general parsing problem with regexes.
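If nothing ready-made fits, here is a minimal sketch of the non-eval route; the vocabulary and operator table are invented and only cover trivial two-operand expressions like the ones above:

import operator

# Tiny, illustrative vocabulary; a real solution needs a proper number-word parser.
NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
OPS = {"plus": operator.add, "+": operator.add,
       "minus": operator.sub, "-": operator.sub}

def evaluate(text):
    left, op, right = text.lower().split()
    def value(word):
        return NUMBERS[word] if word in NUMBERS else int(word)
    return OPS[op](value(left), value(right))

print(evaluate("Three plus nine"))   # 12
print(evaluate("Eight + two"))       # 10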
For testing purposes on a project I'm working on, I have a need to, if given a regular expression, randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:
^[abcd]d+
Then I should be able to generate strings such as:
hnbbad
uduebbaef
9f8;djfew
skjcc98332f
...each of which does NOT match the regex, but NOT generate:
addr32
bdfd09usdj
cdddddd-9fdssee
...each of which DO. In other words, I want something like an anti-Xeger.
Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.
My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.
For example, the following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number; the only way the regex engine can verify that is by backtracking through every possible factorization of the length:
(?!(..+)\1+$).{10000}
I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.
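To see the construction at work without waiting on 10000-character strings, here is a scaled-down check (the bound is lowered to 10 purely so it runs instantly; this is only an illustration):

import re

# Same construction as above, but requiring only 10 characters instead of 10000.
prime_length = re.compile(r'(?!(..+)\1+$).{10}')

for n in range(10, 18):
    print(n, bool(prime_length.match('x' * n)))
# Only the prime lengths (11, 13, 17) print True.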
I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.
But having said that, if you have a magical library that generates matching text for any arbitrary regular expression, then all you need to do is generate a regular expression that matches all the strings that don't match your original expression.
Luckily this is possible using a negative lookahead:
^(?![\s\S]*(?:^[abcd]d+))
If you are willing to restrict yourself to a limited subset of regular expressions, then you can negate the regular expression using boolean logic. For example, ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. It is then possible to find a valid input for this regular expression in reasonable time.
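As a small, purely illustrative check that strings matching the negated pattern really do fail the original (and vice versa, for these samples):

import re

original = re.compile(r'^[abcd]d+')
negated = re.compile(r'^[^abcd]|^[abcd][^d]')

for s in ["addr32", "bdfd09usdj", "hnbbad", "9f8;djfew", "ax", "qdd"]:
    print(s, bool(original.match(s)), bool(negated.match(s)))
# For each of these samples exactly one of the two columns is True.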
I would do a loop, generating random strings of random length, and test whether each one matches the regexp. Repeat the loop until a non-matching string is found.
Obviously, this would be inefficient. Are you sure you cannot invert the regexp and generate a match on the inverted regexp?
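For what it's worth, a hedged sketch of that brute-force loop (the alphabet, length limit, and attempt cap are arbitrary choices):

import random
import re
import string

def random_non_match(pattern, alphabet=string.printable, max_len=20, attempts=100000):
    compiled = re.compile(pattern)
    for _ in range(attempts):
        length = random.randint(0, max_len)
        candidate = ''.join(random.choice(alphabet) for _ in range(length))
        if not compiled.match(candidate):    # use .search() if partial matches count
            return candidate
    raise RuntimeError("gave up; the pattern may match (nearly) everything")

print(random_non_match(r'^[abcd]d+'))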
No, this is impossible. There are infinitely many regexes that match every string in the known universe. For example:
/^/
/.*/
/[^"\\]*(\\.[^"\\]*)*$/
etc.
This is because all of these regexes can match the empty string, and every string contains the empty string!
Can we reduce the infinite number of possibilities by restricting generation to strings from a given character set?
For example, I can define the character set [QWERTYUIOP!##$%^%^&*))_] and require every randomly generated string to be drawn from it. Would that tame the infinite nature of this problem?
In fact, I too am looking for a utility like this, preferably in Python.
From Perl's documentation:
study takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing
many pattern matches on the string before it is next modified. This may or may not save
time, depending on the nature and number of patterns you are searching and the distribution
of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl's study doesn't really do much anymore. The regex compiler has gotten a whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the
string to be searched is made, so we know, for example, where all the
'k' characters are. From each search string, the rarest character is
selected, based on some static frequency tables constructed from some
C programs and English text. Only those places that contain this
"rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
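A very rough sketch of the idea, with big simplifications: character counts come from the text itself rather than Perl's static frequency tables, and the caller must supply a literal substring that every match is known to contain (something the real optimizer works out on its own).

import re
from collections import defaultdict

def build_index(text):
    # "study" step: remember every position of every character.
    index = defaultdict(list)
    for i, ch in enumerate(text):
        index[ch].append(i)
    return index

def find_all(text, index, literal_needle, pattern):
    # Pick the needle character that occurs least often in the text,
    # and only attempt the regex at positions where it appears.
    rarest = min(literal_needle, key=lambda ch: len(index[ch]))
    offset = literal_needle.index(rarest)
    compiled = re.compile(pattern)
    hits = []
    for pos in index[rarest]:
        start = pos - offset
        if start >= 0 and compiled.match(text, start):
            hits.append(start)
    return hits

text = "the quick brown fox jumps over the lazy dog " * 100
index = build_index(text)
print(find_all(text, index, "quick", r'quick brown')[:3])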
esmre is kind of vaguely similar. And as #Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
Or you could use suffix trees (here's one implementation, or here's a C extension with unicode support) or suffix arrays (implementation).