I want a match pattern with a rather long OR-pattern something like:
match item:
    case Really.Long.Qualified.Name.ONE | Really.Long.Qualified.Name.TWO | Really.Long.Qualified.Name.THREE | Some.Other.Patterns.Here:
        pass
This is obviously very annoying to have on a single line. However, PyCharm doesn't warn about the long line as it usually does, and it reports syntax errors if I use a line break (even an escaped one).
Is there any way to format this code more nicely, or must the entire pattern be on a single line? Is there a definitive source that establishes this? I couldn't find it in the PEP for match/case, for OR patterns in particular.
If the latter, why would that language design decision be made? It seems... not good....
You can wrap such a chain of patterns in a pair of parentheses.
match item:
    case (
        Really.Long.Qualified.Name.ONE |
        Really.Long.Qualified.Name.TWO |
        Really.Long.Qualified.Name.THREE |
        Some.Other.Patterns.Here
    ):
        pass
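As a quick sanity check, here is a self-contained sketch (using a short hypothetical enum in place of the long qualified names) showing that the parenthesized form parses and matches on Python 3.10+:
from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

item = Color.GREEN
match item:
    case (
        Color.RED |
        Color.GREEN
    ):
        print('matched')  # Color.GREEN lands here
    case _:
        print('no match')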
I'm trying to get a grasp on regular expressions, and I came across this one, used with the str.extract method:
movies['year']=movies['title'].str.extract('.*\((.*)\).*',expand=True)
It is supposed to detect and extract whatever is between the parentheses. So, given the string foobar (1995), it should return 1995. However, if I open a terminal and type the following
echo 'foobar (1995)' | grep '.*\((.*)\).*'
it matches the whole string instead of only the content between parentheses. I assume the method is working with the BRE flavor because of the parenthesis escaping, and so is grep (its default behavior). Also, the regex tester highlights the whole string in blue and the year (the capturing group) in green. Am I missing something here? The regex works perfectly inside Python.
First of all, the behavior of Pandas .str.extract() is quite expected: it returns only the capturing group contents. The pattern used with extract requires at least 1 capturing group:
pat : string
Regular expression pattern with capturing groups
If you use a named capturing group, the new column will be named after the named group.
The grep command you provided can be reduced to
grep '\((.*)\)'
as grep is capable of matching a line partially (it does not require a full-line match) and works on a per-line basis: once a match is found, the whole line is returned. To override that behavior, you may use the -o switch.
With grep, you cannot return only the capturing group contents. This can be worked around with a PCRE regex enabled by the -P option, but that option is not available on Mac, for example. sed or awk may help in those situations, too.
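In Python itself, by contrast, getting at just the group is straightforward; a minimal sketch:
import re

m = re.search(r'\((.*)\)', 'foobar (1995)')
print(m.group(0))  # '(1995)' -- the whole match
print(m.group(1))  # '1995' -- just the capturing group's contents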
Try using this:
movies['year'] = movies['title'].str.extract(r'.*\((\d{4})\).*', expand=False)
Set expand=True if you want it to return a DataFrame, or when applying multiple capturing groups.
A year is always composed of 4 digits, so the regex \((\d{4})\) matches any four-digit year between parentheses.
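To see it on a toy Series (a quick sketch; this movies frame is made up):
import pandas as pd

movies = pd.DataFrame({'title': ['foobar (1995)', 'baz (2001)']})
# extract() returns only what the capturing group matched
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)
print(movies['year'].tolist())  # ['1995', '2001']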
I'm trying to do a non-greedy negative match, and I need to capture it as well. I'm using these flags in Python, re.DOTALL | re.LOCALE | re.MULTILINE, to do multi-line cleanup of some text-file 'databases' in which each field begins on a new line with a backslash. Each record begins with an \lx field.
\lx foo
\ps n
\nt note 1
\ps v
\nt note
\ge happy
\nt note 2
\ge lonely
\nt note 3
\ge lonely
\dt 19/Dec/2011
\lx bar
...
I'm trying to ensure that each \ge field has a \ps field somewhere above it within its record, one for one. Currently, one \ps is often followed by several \ge and thus needs to be copied down, as with the two lonely \ge above.
Here's most of the needed logic: after any \ps field, but before encountering another \ps or \lx, find a \ge, then find another \ge. Capture everything so that the \ps field can be copied down to just before the second \ge.
And here's my non-functional attempt. Replace this:
^(\\ps\b.*?\n)((?!^\\(ps|lx)*?)^(\\ge.*?\n)((?!^\\ps)*?)^(\\ge.*?\n)
with this:
\1\2\3\4\1\5
I'm getting a memory error even on a tiny file (34 lines long). Of course, even if this worked, I would have to run it multiple times, since it's only trying to handle a second \ge, and not a third or fourth one. So any ideas in that regard would interest me as well.
UPDATE: Alan Moore's solution worked great, although there were cases that required a little tweaking. Sadly, I had to turn off DOTALL, since otherwise I couldn't prevent the first .* from including subsequent \ps fields, even with the non-greedy .*? form. But I was delighted to just now learn about the (?s) modifier at regular-expressions.info. It let me turn off DOTALL in general but still use it in the regexes that depend on it.
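For illustration, a minimal sketch of what the inline (?s) modifier does:
import re

# Without (?s), '.' stops at newlines; with it, '.' crosses them --
# scoped to this one pattern instead of a global re.DOTALL flag.
print(re.search(r'a.b', 'a\nb'))      # None
print(re.search(r'(?s)a.b', 'a\nb'))  # matches 'a\nb'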
Here is the suggested regex, condensed down to the one-line format I need:
^(?P<PS_BLOCK>(?P<PS_LINE>\\ps.*\n)(?:(?!\\(?:ps|lx|ge)).*\n)*\\ge.*\n)(?P<GE_BLOCK>(?:(?!\\(?:ps|lx|ge)).*\n)*\\ge.*\n)
That worked, but when I modified the example above, it inserted the \ps above "note 2". It also was treating \lxs and \ge2 the same as \lx and \ge (needed a few \b). So, I went with a slightly tweaked version:
^(?P<PS_BLOCK>(?P<PS_LINE>\\ps\b.*\n)(?:(?!\\(?:ps|lx|ge)\b).*\n)*\\ge\b.*\n)(?P<AFTER_GE1>(?:(?!\\(?:ps|lx|ge)\b).*\n)*)(?P<GE2_LINE>\\ge\b.*\n)
and this replacement string:
\g<PS_BLOCK>\g<AFTER_GE1>\g<PS_LINE>\g<GE2_LINE>
Thanks again!
You're getting the memory error because you used the DOTALL flag. If your data is really formatted the way you showed it, you don't need that flag anyway; the default behavior is exactly what you want. You don't need the non-greedy modifier (the trailing ?), either.
Try this regex:
prog = re.compile(r"""
    ^
    (?P<PS_BLOCK>
        (?P<PS_LINE>\\ps.*\n)
        (?:                     # Zero or more lines that
            (?!\\(?:ps|lx|ge))  # don't start with
            .*\n                # '\ps', '\lx', or '\ge'...
        )*
        \\ge.*\n                # ...followed by a '\ge' line.
    )
    (?P<GE_BLOCK>
        (?:                     # Again, zero or more lines
            (?!\\(?:ps|lx|ge))  # that don't start with
            .*\n                # '\ps', '\lx', or '\ge'...
        )*
        \\ge.*\n                # ...followed by a '\ge' line.
    )
    """, re.MULTILINE | re.VERBOSE)
The replacement string would be:
r'\g<PS_BLOCK>\g<PS_LINE>\g<GE_BLOCK>'
You'll still have to do multiple passes. If Python supported \G, that wouldn't be necessary, but you can use subn and check the number_of_subs_made return value.
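A sketch of that multi-pass loop (my illustration), assuming prog is the compiled pattern above and myfile.dta holds the data:
import re  # (prog compiled as above)

text = open('myfile.dta').read()
replacement = r'\g<PS_BLOCK>\g<PS_LINE>\g<GE_BLOCK>'
while True:
    text, n = prog.subn(replacement, text)
    if n == 0:  # no substitutions made on this pass, so we're done
        break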
Any time you're having problems with regexen and tell yourself "I would have to run it multiple times...", it's a clear sign that you need to write a parser :-)
The language seems to be pretty regular, so a parser should be easy to write, perhaps starting with something as simple as:
import datetime

def parse_line(line):
    kind, value = line.split(' ', 1)  # split on the first space
    kind = kind[1:]                   # remove the \
    parsed_value = globals().get('parse_' + kind, lambda x: x)(value)
    return (kind, parsed_value)

def parse_dt(value):
    # create a datetime.date from a string like "19/Dec/2011"
    return datetime.datetime.strptime(value.strip(), '%d/%b/%Y').date()
It's maybe a little too cute to use globals() to write a state machine, but it saves a ton of boilerplate code... :-)
Convert your input into a list of tuples:
records = [parse_line(line) for line in open("myfile.dta")]
Figuring out if there is always a ("ps", ..) tuple before a ("ge", ..) tuple should then be pretty easy -- e.g. by first noting where all the lx tuples are...
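A minimal sketch of that check, assuming the records list built above (my illustration, not from the original answer):
current_ps = None
for kind, value in records:
    if kind == 'lx':
        current_ps = None            # new record: forget the previous \ps
    elif kind == 'ps':
        current_ps = value
    elif kind == 'ge' and current_ps is None:
        print('\\ge without a preceding \\ps:', value)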
This is my code:
db = xapian.Database('path/to/database')
enquire = xapian.Enquire(db)
stemmer = xapian.Stem('<supported language>')
query_parser = xapian.QueryParser()
query_parser.set_database(db)
query_parser.set_stemmer(stemmer)
query_parser.set_default_op(xapian.Query.OP_OR)
xapian_flags = xapian.QueryParser.FLAG_BOOLEAN | xapian.QueryParser.FLAG_SYNONYM | xapian.QueryParser.FLAG_LOVEHATE
query = query_parser.parse_query('"this exact phrase"', xapian_flags)
enquire.set_query(query)
This isn't matching "this exact phrase" (I am able to achieve pretty much everything but exact matches). Note that I've included the double quotes mentioned in the documentation. Is there a way of achieving this?
By explicitly setting the flags on the query parser, you override the default of FLAG_PHRASE | FLAG_LOVEHATE | FLAG_BOOLEAN. What you've done, therefore, is turn on synonym support but turn off phrase searching, which is what the double quotes rely on.
Note that phrase searching isn't strictly the same as exact matching, although without more context it's difficult to advise whether this is the wrong approach for your situation.
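A minimal sketch of the fix implied here (the rest of the setup as in the question): keep FLAG_PHRASE in the set when overriding the defaults:
xapian_flags = (xapian.QueryParser.FLAG_BOOLEAN
                | xapian.QueryParser.FLAG_PHRASE
                | xapian.QueryParser.FLAG_LOVEHATE
                | xapian.QueryParser.FLAG_SYNONYM)
query = query_parser.parse_query('"this exact phrase"', xapian_flags)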
Given:
ABC
content 1
123
content 2
ABC
content 3
XYZ
Is it possible to create a regex that matches the shortest version of "ABC[\W\w]+?XYZ"
Essentially, I'm looking for "ABC followed by any characters terminating with XYZ, but don't match if I encounter ABC in between" (but think of ABC as a potential regex itself, as it would not always be a set length...so ABC or ABcC could also match)
So, more generally: REGEX1 followed by any character and terminated by REGEX2, not matching if REGEX1 occurs in between.
In this example, I would not want the first 4 lines.
(I'm sure this explanation could potentially need...further explanation haha)
EDIT:
Alright, I see the need for further explanation now! Thanks for the suggestions thus far. I'll at least give you all more to think about while I start looking into how each of your proposed solutions can be applied to my problem.
Proposal 1: Reverse the string contents and the regex.
This is certainly a very fun hack that solves the problem as I originally explained it. In simplifying the issue, though, I failed to mention that the same thing can happen in reverse, because the ending signature may also occur later on (and does in my specific situation). That introduces the problem illustrated below:
ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
MNO
content 5
XYZ
In this instance, I would check for something like "ABC through XYZ", meaning to catch [ABC, content 3, XYZ], but accidentally catching [ABC, content 1, 123, content 2, ABC, content 3, XYZ]. Reversing that would catch [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 3, XYZ] that we want. The point is to make this as generalized as possible, because I will also be searching for things that could have the same starting signature (the regex "ABC" in this case) and different ending signatures.
If there is a way to build the regexes so that they encapsulate this limitation, it would be much easier to reference that any time I build a regex to search this type of string, rather than creating a custom search algorithm that deals with it.
Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with IGNORECASE flag
This seems nice in the case that ABC is finite. Think of it as a regex in itself though. For example:
Hello!GoodBye!Hello.Later.
A VERY simplified version of what I'm trying to do. I would want "Hello.Later.", given the start regex Hello[!.] and the end regex Later[!.]. Running something as simple as Hello[!.].*Later[!.] would grab the entire string, but I'm looking to say that if the start regex Hello[!.] occurs between the first starting-regex match found and the first ending-regex match found, ignore it.
The convo below this proposal indicates that I might be limited by regular language limitations similar to the parentheses matching problem (Google it, it's fun to think about). The purpose of this post is to see if I do in fact have to resort to creating an underlying algorithm that handles the issue I'm encountering. I would very much like to avoid it if possible (in the simple example that I gave you above, it's pretty easy to design a finite state machine for...I hope that holds as it grows slightly more complex).
Proposal 3: ABC(?:(?!ABC).)*?XYZ with DOTALL flag
I like the idea of this if it actually allows ABC to be a regex. I'll have to explore this when I get in to the office tomorrow. Nothing looks too out of the ordinary at first glance, but I'm entirely new to python regex (and new to actually applying regexes in code instead of just in theory homework)
A regex solution would be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
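The prefix can itself be a regex here; a small sketch using the Hello[!.] / Later[!.] case from the question:
import re

s = 'Hello!GoodBye!Hello.Later.'
# Between prefix and suffix, consume a character only if another
# match of the prefix regex does not start there.
pattern = re.compile(r'Hello[!.](?:(?!Hello[!.]).)*?Later[!.]', re.DOTALL)
print(pattern.search(s).group())  # -> 'Hello.Later.'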
Edit
So after reading your further explanations, I would say that my previous proposal, as well as MRAB's, are somewhat similar and won't be of any help here. Your problem is actually the problem of nested structures.
Think of your 'prefix' and 'suffix' as symbols. You could easily replace them with an opening and a closing parenthesis or whatever, and what you want is to match only the smallest (and thus deepest) pair.
For example if your prefix is 'ABC.' and your suffix is 'XYZ.':
ABChello worldABCfooABCbarXYZ
You want to get only ABCbarXYZ.
It's the same if the prefix is ( and the suffix is ); for the string:
(hello world(foo(bar)
it would ideally match only (bar).
You would definitely have to use a context-free grammar (as programming languages do: the C grammar, the Python grammar) and a parser, or roll your own by combining regexes with your language's iteration and storage mechanisms.
But that's not possible with regular expressions alone. They would probably help in your algorithm, but they simply are not designed to handle this on their own; they're not the right tool for the job, any more than a screwdriver is for inflating tires. You will therefore have to use some external mechanism, nothing complicated, to store the context: your position in the nested stack. Using your regular expression within each single context may still be possible.
Finite state machines are finite, and nested structures have an arbitrary depth that would require the automaton to grow arbitrarily large; such languages are therefore not regular languages.
Since recursion in a grammar allows the definition of nested syntactic structures, any language (including any programming language) which allows nested structures is a context-free language, not a regular language. For example, the set of strings consisting of balanced parentheses [like a LISP program with the alphanumerics removed] is a context-free language
see here
Former proposal (not relevant anymore)
If I do:
>>> s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""
>>> r = re.compile(r'A+B+C+[^A]+[^B]+[^C]+XYZ', re.I)
>>> re.findall(r,s)
I get
['ABC\ncontent 3\nXYZ']
Is that what you want?
There is another method of solving this problem: not trying to do it in one regex. You could split the string by the first regex, and then use the second one on the last part.
Code is the best explanation:
s = """ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
XYZ"""
# capturing groups to preserve the matched section
prefix = re.compile('(ABC)')
suffix = re.compile('(XYZ)')
# prefix.split(s) == ['', 'ABC', ..., 'ABC', '\ncontent 3\nXYZ\ncontent 4\nXYZ']
# the last two items are prefixmatch ('ABC') and rest
prefixmatch, rest = prefix.split(s)[-2:]
# suffix.split(rest, 1) == ['\ncontent 3\n', 'XYZ', '\ncontent 4\nXYZ']
# the first two items are interior and suffixmatch
interior, suffixmatch = suffix.split(rest, 1)[:2]
# join the parts up.
result = '%s%s%s' % (prefixmatch, interior, suffixmatch)
# result == 'ABC\ncontent 3\nXYZ'
Some points:
there should be appropriate error handling (even just try: ... except ValueError: ... around the whole thing) for the case where either regex doesn't match at all and the list unpacking fails.
this assumes that the desired segment occurs immediately after the last occurrence of prefix; if not, you can iterate through the results of prefix.split(s) two at a time (starting at index 1) and do the same splitting trick with suffix to find all the matches (see the sketch after this list).
this is likely to be somewhat inefficient, since it creates quite a few intermediate data structures.
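A sketch of the iteration described in the second point, assuming the prefix/suffix patterns and s defined above:
# parts alternates: [before, prefixmatch, between, prefixmatch, after]
parts = prefix.split(s)
results = []
for i in range(1, len(parts), 2):           # parts[i] is a prefix match
    pieces = suffix.split(parts[i + 1], 1)  # split the following text once
    if len(pieces) >= 2:                    # the suffix actually matched
        results.append('%s%s%s' % (parts[i], pieces[0], pieces[1]))
# results == ['ABC\ncontent 3\nXYZ']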
I am attempting to create a regular expression to parse an address into five parts: "address1", which is the street address, "address2", which is the apartment number or whatever else shows up on line 2 of an address, the city, state, and zip code.
When I run this, Python (or Django) is throwing an error which states "unexpected end of pattern" when I run re.search. Can anyone tell me how to modify this regular expression to match correctly?
I'm very much a regular expression noob. I can make out most of what this one is supposed to do, but I could never have written it myself. I got this from http://regexlib.com/REDetails.aspx?regexp_id=472.
re.compile(r"""
(?x)^(?n:
(?<address1>
(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )
| (P.O. Box \d{1,5}))\s{1,2}
(?<city>
[A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0, 2})\, \x20
(?<state>
A[LKSZRAP] | C[AOT] | D[EC] | F[LM] | G[AU] | HI
| I[ADL N] | K[SY] | LA | M[ADEHINOPST] | N[CDEHJMVY]
| O[HKR] | P[ARW] | RI | S[CD] | T[NX] | UT | V[AIT]
| W[AIVY]
| [A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0,2})\x20
(?<zipcode>
(?!0{5})\d{5}(-\d {4})?)
)$"
""", re.VERBOSE)
Newlines added for readability. As a follow-up question, can this regex be separated into multiple lines like this for readability, or will it need to be all in one line to work (I could just concatenate the separate lines, I suppose)?
P.S. I know this smells like homework, but it is actually for work.
Edit: Actual code being used was requested, so here it is. I left it out because everything here is actually already up there, but perhaps it will help.
The function is part of a Django view, but that shouldn't matter too much for our purposes.
def parseAddress(address):
pattern = r"^(?n:(?<address1>(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\.\ Box\ \d{1,5}))\s{1,2}(?i:(?<address2>(((APT|APARTMENT|BLDG|BUILDING|DEPT|DEPARTMENT|FL|FLOOR|HNGR|HANGER|LOT|PIER|RM|ROOM|S(LIP|PC|T(E|OP))|TRLR|TRAILER|UNIT)\x20\w{1,5})|(BSMT|BASEMENT|FRNT|FRONT|LBBY|LOBBY|LOWR|LOWER|OFC|OFFICE|PH|REAR|SIDE|UPPR|UPPER)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\, \x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY]|[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\x20(?<zipcode>(?!0{5})\d{5}(-\d {4})?))$"
match = re.search(pattern, address)
I was using my home address as the input, but I tried "123 Main St., Austin, TX 12345" as input as well with the same result.
Some people might not consider this an answer, but bear with me for a minute.
I HIGHLY recommend AGAINST trying to parse street addresses with a regex. Street addresses are not "regular" in any sense of the word. There is infinite variation, and unless you restrict yourself to a very limited grammar, there will always be strings you cannot parse. A huge amount of time and money has been invested in solutions to parse addresses, starting with the US Post Office and the many, many providers of list cleanup services. Just Google "parsing street addresses" to get a hint of the scope of the problem. There are commercial solutions and some free solutions, but the comments on the web indicate that nobody gets it right all the time.
I also speak from experience. During the '80s I worked for a database typesetting company, and we had to parse addresses. We never were able to develop a solution that worked perfectly, and for data we captured ourselves (we had a large keyboarding department) we developed a special notation syntax so the operators could insert delimiters at the appropriate locations to help the parsing process.
Take a look at some of the free services out there. You will save yourself a lot of hassle.
Set the x (verbose) flag in the regex, i.e. (?x), or equivalently pass re.VERBOSE to re.compile.
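A tiny sketch of the verbose flag in action (a made-up pattern, just to show the multi-line layout):
import re

# With (?x) / re.VERBOSE, whitespace in the pattern is ignored and
# '#' starts a comment, so the pattern can span several lines.
pat = re.compile(r'''(?x)
    (?P<area>\d{3})  # area code
    -
    (?P<line>\d{4})  # line number
''')
print(pat.match('555-0199').groupdict())  # {'area': '555', 'line': '0199'}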
A non-regex answer: check out the Python library usaddress (there's also a web interface for trying it out).
Agree w/ Jim that regex isn't a good solution here. usaddress parses addresses probabilistically and is much more robust than regex-based parsers when dealing with messy addresses.
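A minimal sketch of what that looks like (assuming usaddress has been pip-installed):
import usaddress

addr = '123 Main St., Austin, TX 12345'
# parse() labels each token; tag() merges consecutive tokens that
# share a label into an OrderedDict plus an overall address type.
print(usaddress.parse(addr))
tagged, address_type = usaddress.tag(addr)
print(tagged, address_type)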
Your regex fails on the first character n, which you can verify as follows. Make a file test.py and put the following:
import re
re.compile(r'...')
where you fill in your pattern, of course :) Now run python -m pdb test.py, enter c to continue, and it will stop when the exception is raised. At that point, type l to see where in the code you are. You'll see it fails because source.next isn't in FLAGS. This source is just your pattern, so you can verify where it fails by typing print source.index.
Furthermore, after removing that n in front, the pattern fails at the first a of <address1>.
The (?n is strange, I can't find it in the documentation, so it seems to be an unsupported extension. As for the ?<address1>, I think this should be ?P<address1>. There are more problems with it, like (?i: and if I remove those and fix the ?P< stuff, I get an error about unbalanced parenthesis at the last parenthesis.
Jim Garrison (above) is correct - addresses are too varied to parse with a regex. I work for an address verification software company - SmartyStreets. Try out our LiveAddress API - the REST endpoint provides all the address components parsed in a nice, easy to use JSON response. Here's a sample:
https://github.com/smartystreets/LiveAddressSamples/blob/master/python/street-address.py