A simple regexp in python - python

My program is a simple calculator, so I need to parse te expression which the user types, to get the input more user-friendly. I know I can do it with regular expressions, but I'm not familar enough about this.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these simple quotes inside the parentheses. How can I do that?
UPDATE:
To be more clear, I want add simple quotes after every sequence of characters "MM(" and before the ")" which comes after it, and after every sequence of characters "func(" and before the "," which comes after it.

This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.

Related

Think you know Python RE? Here's a challenge

Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well the "hoho"s, and then just filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are
treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression (not just python's implementation). More specifically, it can't be detected using automata without memory. You'll have to use a different kind of automata.
The kind of grammar you're trying to discover (‫‪reduplication‬‬) is not regular. Moreover, it is not context-free.
Automata is the mechanism which allows regular expression match to be so efficient.

Regular expression code is not working (Python)

Assume I have a word AB1234XZY or even 1AB1234XYZ.
I want to extract ONLY 'AB1234' or 1AB1234 (ie. everything up until the letters at the end).
I have used the following code to extract that but it's not working:
base= re.match(r"^(\D+)(\d+)", word).group(0)
When I print base, it's not working for the second case. Any ideas why?
Your regex doesn't work for the second case because it starts with a number; the \D at the beginning of your pattern matches anything that ISN'T a number.
You should be able to use something quite simple for this--simpler, in fact, than anything else I see here.
'.*\d'
That's it! This should match everything up to and including the last number in your string, and ignore everything after that.
Here's the pattern working online, so you can see for yourself.
(.+?\d+)\w+ would give you what you want.
Or even something like this
^(.+?)[a-zA-Z]+$
re.match starts at the beginning of the string, and re.search simply looks for it in the string. both return the first match. .group(0) is everything included in the match, if you had capturing groups, then .group(1) is the first group...etc etc... as opposed to normal convention where 0 is the first index, in this case, 0 is a special use case meaning everything.
in your case, depending on what you really need to capture, maybe using re.search is better. and instead of using 2 groups, you can use (\D+\d+) keep in mind, it will capture the first (non-digits,digits) group. it might be sufficient for you, but you might want to be more specific.
after reading your comment "everything before the letters at the end"
this regex is what you need:
regex = re.compile(r'(.+)[A-Za-z]')

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them or I just asked yet another question not meant for RE.
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it lool like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

Python: regular expressions in control structures [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.
Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However if you are only interested in finding out of a string is entirely whitespace or not, you can use the str.isspace() function:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True
This is what you're looking for
if not line.isspace(): do something
Also, str.strip does not use regular expressions.
If you are really just want to find out if the line only consists of whitespace characters regex is a little overkill. You should got for the following instead:
if text.strip():
#do stuff
which is basically the same as:
if not text.strip() == "":
#do stuff
Python evaluates every non-empty string to True. So if text consists only of whitespace-characters, text.strip() equals "" and therefore evaluates to False.
The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that will treat it's input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time while other times it just meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense when applied to a regular expression.
Because re.match objects are "truthy" in boolean context (like most class instances), you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions on there that can be helpful. Or else it should be pretty easy to implement a re_split function yourself.

Conditional Regular Expressions

I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything which matchs includeRegEx but not matchs excludeRegEx. Attention: excludeRegEx is still in the negative form (as you can see in the example above), so it should be better to say: if something matches includeRegEx, I check if it also matches excludeRegEx, if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.
Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
if re.match(r"And(?!rea$).*", subject):
# Successful match
# Note that re.match always anchor the match
# to the start of the string.
else:
# Match attempt failed
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex

Categories

Resources