I am attempting to create a regular expression to parse an address into five parts: "address1", which is the street address, "address2", which is the apartment number or whatever else shows up on line 2 of an address, the city, state, and zip code.
When I run this, Python raises an "unexpected end of pattern" error from re.search. Can anyone tell me how to modify this regular expression to match correctly?
I'm very much a regular expression noob. I can make out most of what this one is supposed to do, but I could never have written it myself. I got this from http://regexlib.com/REDetails.aspx?regexp_id=472.
re.compile(r"""
(?x)^(?n:
(?<address1>
(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )
| (P.O. Box \d{1,5}))\s{1,2}
(?<city>
[A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0, 2})\, \x20
(?<state>
A[LKSZRAP] | C[AOT] | D[EC] | F[LM] | G[AU] | HI
| I[ADL N] | K[SY] | LA | M[ADEHINOPST] | N[CDEHJMVY]
| O[HKR] | P[ARW] | RI | S[CD] | T[NX] | UT | V[AIT]
| W[AIVY]
| [A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0,2})\x20
(?<zipcode>
(?!0{5})\d{5}(-\d {4})?)
)$"
""", re.VERBOSE)
Newlines added for readability. As a follow-up question, can this regex be separated into multiple lines like this for readability, or will it need to be all in one line to work (I could just concatenate the separate lines, I suppose)?
P.S. I know this smells like homework, but it is actually for work.
Edit: Actual code being used was requested, so here it is. I left it out because everything in it already appears above, but perhaps it will help.
The function is part of a Django view, but that shouldn't matter too much for our purposes.
def parseAddress(address):
    pattern = r"^(?n:(?<address1>(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\.\ Box\ \d{1,5}))\s{1,2}(?i:(?<address2>(((APT|APARTMENT|BLDG|BUILDING|DEPT|DEPARTMENT|FL|FLOOR|HNGR|HANGER|LOT|PIER|RM|ROOM|S(LIP|PC|T(E|OP))|TRLR|TRAILER|UNIT)\x20\w{1,5})|(BSMT|BASEMENT|FRNT|FRONT|LBBY|LOBBY|LOWR|LOWER|OFC|OFFICE|PH|REAR|SIDE|UPPR|UPPER)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\, \x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY]|[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\x20(?<zipcode>(?!0{5})\d{5}(-\d {4})?))$"
    match = re.search(pattern, address)
I was using my home address as the input, but I tried "123 Main St., Austin, TX 12345" as input as well with the same result.
Some people might not consider this an answer, but bear with me for a minute.
I HIGHLY recommend AGAINST trying to parse street addresses with a regex. Street addresses are not "regular" in any sense of the word. There is infinite variation, and unless you restrict yourself to a very limited grammar, there will always be strings you cannot parse. A huge amount of time and money has been invested in solutions to parse addresses, starting with the US Post Office and the many, many providers of list cleanup services. Just Google "parsing street addresses" to get a hint of the scope of the problem. There are commercial solutions and some free solutions, but the comments on the web indicate that nobody gets it right all the time.
I also speak from experience. During the '80s I worked for a database typesetting company, and we had to parse addresses. We never were able to develop a solution that worked perfectly, and for data we captured ourselves (we had a large keyboarding department) we developed a special notation syntax so the operators could insert delimiters at the appropriate locations to help the parsing process.
Take a look at some of the free services out there. You will save yourself a lot of hassle.
Set x (verbose) flag in regex, i.e.: (?x)
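A minimal illustration of the inline flag (the ZIP pattern here is just a toy example, not the asker's regex):

```python
import re

# (?x) at the very start of the pattern turns on VERBOSE mode inline,
# so whitespace and # comments inside the pattern are ignored.
zip_re = re.compile(r"""(?x)
    \d{5}        # five-digit ZIP
    (-\d{4})?    # optional ZIP+4 extension
""")
print(bool(zip_re.fullmatch("12345-6789")))  # True
```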
a non-regex answer: check out the python library usaddress (there's also a web interface for trying it out)
agree w/ Jim that regex isn't a good solution here. usaddress parses addresses probabilistically, and is much more robust than regex-based parsers when dealing with messy addresses.
Your regex fails on the first character, n, which you can verify as follows. Make a file test.py containing:
import re
re.compile(r'...')
where you fill in your pattern, of course :) Now run python -m pdb test.py, enter c to continue, and it will stop when the exception is raised. At that point type l to see where in the code you are. You will see it fails because source.next isn't in FLAGS. This source is just your pattern, so you can verify where it fails by typing print source.index.
Furthermore, after removing that n, the pattern fails at the first a of <address1>.
The (?n is strange; I can't find it in the documentation, so it seems to be an unsupported extension (it is a .NET regex option meaning "explicit capture only"). As for ?<address1>, in Python this should be ?P<address1>. There are more problems, such as (?i:, and if I remove those and fix the ?P< groups, I get an error about unbalanced parenthesis at the last parenthesis.
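For illustration, here is a much simplified pattern using Python's (?P<name>...) named-group syntax; this toy pattern is my own, not a fixed version of the full regex above:

```python
import re

# Python's syntax for named groups is (?P<name>...), not .NET's (?<name>...).
# A deliberately simple city/state/zip pattern:
pattern = re.compile(r"(?P<city>[A-Z][a-z]+),\s*(?P<state>[A-Z]{2})\s+(?P<zipcode>\d{5})")
m = pattern.search("123 Main St., Austin, TX 12345")
print(m.group("city"), m.group("state"), m.group("zipcode"))  # Austin TX 12345
```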
Jim Garrison (above) is correct - addresses are too varied to parse with a regex. I work for an address verification software company - SmartyStreets. Try out our LiveAddress API - the REST endpoint provides all the address components parsed in a nice, easy to use JSON response. Here's a sample:
https://github.com/smartystreets/LiveAddressSamples/blob/master/python/street-address.py
Related
I want a match pattern with a rather long OR-pattern something like:
match item:
    case Really.Long.Qualified.Name.ONE | Really.Long.Qualified.Name.TWO | Really.Long.Qualified.Name.THREE | Some.Other.Patterns.Here:
        pass
This is obviously very annoying to have on a single line. However, PyCharm doesn't seem to warn about the overlong line as it usually does, and it reports syntax errors if I use a line break (even an escaped one).
Is there any way to format this code more nicely, or must the entire pattern be on a single line? Is there a definitive source that establishes this - I couldn't find it in the PEP for match/case OR in particular.
If the latter, why would that language design decision be made? It seems... not good....
You can wrap such a chain of patterns in a pair of parentheses.
match item:
    case (
        Really.Long.Qualified.Name.ONE |
        Really.Long.Qualified.Name.TWO |
        Really.Long.Qualified.Name.THREE |
        Some.Other.Patterns.Here
    ):
        pass
I have written two regexes that I was originally combining with | (either/or). I need them to run separately, but what should be a simple change is not working the way I expected. I have tested both regexes with online tools, and they both work 100%. When run in the code, they both return: [].
For reference stringSoup is an html string.
Here was the original:
re.findall(r"(\(#([^)\s]+)\))|//.*instagram\.com/(\w+.*?)/(?:p)/g")
I need to run each re separately like so:
re.findall(r"(\(#([^)\s]+)\))/g", stringSoup)
re.findall(r"//.*instagram\.com/(\w+.*?)/(?:p)/g", stringSoup)
The first regex is to find usernames as (#username) The second is to find usernames as instagram.com/username
The original combined regex was working fine
After separation both of these are returning empty []
I'm not really certain I understand your question and some of the inputs, but I made a sample to hopefully re-create what you're trying to do:
\(#(?P<username1>[^)]+)\) # username is after '(#' and is everything up until ')'
| # or
.*instagram\.com\/(?P<username2>[^\/]+)\/p # username is between 'instagram.com/' and the next '/'
You can view it here. You can also remove the top half or the bottom half and see that each regex will only match that specific item. Note that using something like [^\/] might be a bit crude and you can make that more specific, but the above should give you what you need in a general sense.
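Here is a runnable sketch of that pattern; stringSoup below is a made-up sample, and note that the /g suffix (a JavaScript regex flag) has been dropped, since it has no meaning in Python:

```python
import re

stringSoup = 'follow (#alice) or see https://instagram.com/bobby/p/abc123/'
# In VERBOSE mode, # starts a comment, so the literal # must be escaped.
pattern = re.compile(r"""
    \(\#(?P<username1>[^)]+)\)              # username is after '(#' up to ')'
    |                                       # or
    instagram\.com/(?P<username2>[^/]+)/p   # username between 'instagram.com/' and '/p'
""", re.VERBOSE)
for m in pattern.finditer(stringSoup):
    print(m.group("username1") or m.group("username2"))  # alice, then bobby
```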
I am interested in parsing tracklistings in a variety of formats, containing lines such as:
artist - title
artist-title
artist / title
artist - "title"
1. artist - title
0:00 - artist - tit le
05 artist - title 12:20
artist - title [record label]
These are text files which generally contain one tracklist but which may also contain other stuff which I don't want to parse, so the regex ideally needs to be strict enough to not include lines which aren't tracklistings, although really this is probably a question of balance.
I am having some success with the following regex:
simple = re.compile(r"""
    ^
    (?P<time>\d?\d:\d\d)?       # track time in 00:00 or 0:00
    (
        (?P<number>\d{1,2})     # track number as 0 01
        [^\w]                   # not followed by word
    )?
    [-.)]?                      # possibly followed by something
    "?
    (?P<artist>[^"##]+)         # artist, anything except " or #
    "?
    \s[-/\u2013]\s              # dash surrounded by spaces, possibly unicode
    "?
    (?P<title>[^"##]+?)         # title, not greedy
    "?
    (?P<label>\[\w+\])?         # label, i.e. [something Records]
    (//|&\#13;)?                # remove some weird endings, i.e. ascii carriage return
    $
""", re.VERBOSE)
However, it's a bit horrible; I only started learning regex very recently. It has problems with lines like this:
an artist-a title # couldn't find ' - '
2 Croozin' - 2 Pumpin' # mistakes 2 as track number
05 artist - title 12:20 # doesn't work at all
In the case of 2 Croozin' - 2 Pumpin', the only way of telling that 2 isn't a track number is to take into account the surrounding context, i.e. look at the other tracks. (I forgot to mention this - these tracks are usually part of a tracklist)
So my question is, how can I improve this in general? Some ideas I've had are:
Use several regex, starting with very specific ones and carry on using less specific ones until it has parsed properly.
dump regex and use a proper parser such as pyparsing or parsley, which might be able to make better use of surrounding context, however I know absolutely nothing about parsing
use lookahead/lookbehind in a multiline regex to look at previous/next lines
use separate regex to get time, track number, artist, title
give up and do something less pointless
I can validate that it has parsed properly (to some degree) doing things such as making sure artists and titles are all different, tracks are in order, times are sensible, and possibly even check artists/titles/labels do actually exist.
At best, you are dealing with a context-sensitive grammar which moves you out of the realm of what regexps can handle alone and into parsing.
Even if your parser is implemented as a regexps and a pile of heuristics, it is still a parser and techniques from parsing will be valuable. Some languages have a chicken-egg problem: I'd like to call "The Artist Formerly Known as the Artist Formerly Known as Prince" an artist and not a track title, but until I see it a second time, I don't have the context to make that determination.
To amplify @JonClements' comment, if the files do contain internal metadata, there are plenty of tools to extract and manipulate that information. Even if internal metadata just increases the probability that "A Question of Balance" is an album title, you'll need that information.
Steal as many design approaches as you can: look for open source tag manipulators (e.g. EasyTag) and see how they do it. While you are learning, you might just find a tool that does your job for you.
Given:
ABC
content 1
123
content 2
ABC
content 3
XYZ
Is it possible to create a regex that matches the shortest version of "ABC[\W\w]+?XYZ"
Essentially, I'm looking for "ABC followed by any characters terminating with XYZ, but don't match if I encounter ABC in between" (but think of ABC as a potential regex itself, as it would not always be a set length...so ABC or ABcC could also match)
So, more generally: REGEX1 followed by any character and terminated by REGEX2, not matching if REGEX1 occurs in between.
In this example, I would not want the first 4 lines.
(I'm sure this explanation could potentially need...further explanation haha)
EDIT:
Alright, I see the need for further explanation now! Thanks for the suggestions thus far. I'll at least give you all more to think about while I start looking into how each of your proposed solutions can be applied to my problem.
Proposal 1: Reverse the string contents and the regex.
This is certainly a very fun hack that solves the problem based on what I explained. In simplifying the issue, I failed to also mention that the same thing could happen in reverse because the ending signature could exist later on also (and has proven to be in my specific situation). That introduces the problem illustrated below:
ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
MNO
content 5
XYZ
In this instance, I would check for something like "ABC through XYZ" meaning to catch [ABC, content 1, XYZ]...but accidentally catching [ABC, content 1, 123, content 2, ABC, content 3, XYZ]. Reversing that would catch [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 2, XYZ] that we want again. The point is to try to make it as generalized as possible because I will also be searching for things that could potentially have the same starting signature (regex "ABC" in this case), and different ending signatures.
If there is a way to build the regexes so that they encapsulate this sort of limitation, it could prove much easier to just reference that any time I build a regex to search for in this type of string, rather than creating a custom search algorithm that deals with it.
Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with IGNORECASE flag
This seems nice in the case that ABC is finite. Think of it as a regex in itself though. For example:
Hello!GoodBye!Hello.Later.
A VERY simplified version of what I'm trying to do. I would want "Hello.Later." given the start regex Hello[!.] and the end regex Later[!.]. Running something simple like Hello[!.].*Later[!.] would grab the entire string, but I'm looking to say that if the start regex Hello[!.] occurs between the first starting match and the first ending match, ignore it.
The convo below this proposal indicates that I might be limited by regular language limitations similar to the parentheses matching problem (Google it, it's fun to think about). The purpose of this post is to see if I do in fact have to resort to creating an underlying algorithm that handles the issue I'm encountering. I would very much like to avoid it if possible (in the simple example that I gave you above, it's pretty easy to design a finite state machine for...I hope that holds as it grows slightly more complex).
Proposal 3: ABC(?:(?!ABC).)*?XYZ with DOTALL flag
I like the idea of this if it actually allows ABC to be a regex. I'll have to explore this when I get in to the office tomorrow. Nothing looks too out of the ordinary at first glance, but I'm entirely new to python regex (and new to actually applying regexes in code instead of just in theory homework)
A regex solution would be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
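For example, on the sample string from the question:

```python
import re

# The tempered dot (?:(?!ABC).) refuses to match at any position where
# another ABC begins, so the match can never span a second ABC.
s = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ"
m = re.search(r"ABC(?:(?!ABC).)*?XYZ", s, re.DOTALL)
print(repr(m.group()))  # 'ABC\ncontent 3\nXYZ'
```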
Edit
So after reading your further explanations, I would say that my previous proposal, as well as MRAB's, is somewhat similar and won't be of any help here. Your problem is actually the problem of nested structures.
Think of your 'prefix' and 'suffix' as symbols. You could easily replace them with an opening and a closing parenthesis or whatever, and what you want is to be able to match only the smallest (and therefore deepest) pair ...
For example if your prefix is 'ABC.' and your suffix is 'XYZ.':
ABChello worldABCfooABCbarXYZ
You want to get only ABCbarXYZ.
It's the same if the prefix is (, and the suffix is ), the string:
(hello world(foo(bar)
It would match ideally only (bar) ...
You definitely have to use a context-free grammar (like programming languages do: the C grammar, the Python grammar) and a parser, or make your own by using regexes together with the iterating and storing mechanisms of your programming language.
But that's not possible with regular expressions alone. They would probably help in your algorithm, but they just are not designed to handle that by themselves. Not the right tool for the job ... you cannot inflate tires with a screwdriver. Therefore, you will have to use some external mechanism, not complicated though, to store the context: your position in the nesting stack. Using your regular expression within each single context may still be possible.
Finite state machines are finite, and nested structures have an arbitrary depth that would require your automaton to grow arbitrarily, thus they are not regular languages.
Since recursion in a grammar allows the definition of nested syntactic structures, any language (including any programming language) which allows nested structures is a context-free language, not a regular language. For example, the set of strings consisting of balanced parentheses [like a LISP program with the alphanumerics removed] is a context-free language
see here
Former proposal (not relevant anymore)
If I do:
>>> s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""
>>> r = re.compile(r'A+B+C+[^A]+[^B]+[^C]+XYZ', re.I)
>>> re.findall(r,s)
I get
['ABC\ncontent 3\nXYZ']
Is that what you want ?
There is another method of solving this problem: not trying to do it in one regex. You could split the string by the first regex, and then use the second one on the last part.
Code is the best explanation:
s = """ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
XYZ"""
# capturing groups to preserve the matched section
prefix = re.compile('(ABC)')
suffix = re.compile('(XYZ)')
# prefix.split(s) == ['', 'ABC', [..], 'ABC', '\ncontent 3\nXYZ\ncontent 4\nXYZ']
# prefixmatch ^^^^^ ^^^^^^^^^^^^ rest ^^^^^^^^^^^^^^^^
prefixmatch, rest = prefix.split(s)[-2:]
# suffix.split(rest,1) == ['\ncontent 3\n', 'XYZ', '\ncontent 4\nXYZ']
# ^^ interior ^^ ^^^^^ suffixmatch
interior, suffixmatch = suffix.split(rest,1)[:2]
# join the parts up.
result = '%s%s%s' % (prefixmatch, interior, suffixmatch)
# result == 'ABC\ncontent 3\nXYZ'
Some points:
there should be appropriate error handling (even just try: ... except ValueError: .. around the whole thing) to handle the case when either regex doesn't match at all and so the list unpacking fails.
this assumes that the desired segment will occur immediately after the last occurrence of prefix, if not, then you can iterate through the results of prefix.split(s) two at a time (starting at index 1) and do the same splitting trick with suffix to find all the matches.
this is likely to be reasonably inefficient, since it creates quite a few intermediate data structures.
I have a method that takes addresses from the web, and therefore, there are many known errors like:
123 Awesome St, Pleasantville, NY, Get Directions
Which I want to be:
123 Awesome St, Pleasantville, NY
Is there a web service or Python library that can help with this? It's fine for us to start creating a list of items like ", Get Directions" or a more generalized version of that, but I thought there might be a helper library for this kind of textual analysis.
If the address contains one of those bad strings, walk backwards till you find another non-whitespace character. If the character is one of your separators, say , or :, drop everything from that character onwards. If it's a different character, drop everything after that character.
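A rough sketch of that walk-backwards idea (strip_bad_suffix is a hypothetical helper name, and the bad-string list is an assumption):

```python
def strip_bad_suffix(address, bad_strings=("Get Directions",)):
    for bad in bad_strings:
        idx = address.find(bad)
        if idx == -1:
            continue
        # Walk backwards past whitespace to the previous real character.
        j = idx - 1
        while j >= 0 and address[j].isspace():
            j -= 1
        if j >= 0 and address[j] in ',:':
            address = address[:j]       # separator: drop it too
        else:
            address = address[:j + 1]   # other character: keep it
    return address

print(strip_bad_suffix('123 Awesome St, Pleasantville, NY, Get Directions'))
# 123 Awesome St, Pleasantville, NY
```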
Make a list of known bad strings. Then, you could take that list and use it to build a gigantic regex and use re.sub().
This is a naive solution, and isn't going to be particularly performant, but it does give you a clean way of adding known bad strings, by adding them to a file called .badstrings or similar and building the list from them.
Note that if you make bad choices about what these bad strings are, you will break the algorithm. But it should work for the simple cases you describe in the comments.
EDIT: Something like this is what I mean:
import re

def sanitize_address(address, regex):
    return regex.sub('', address)

badstrings = ['get directions', 'multiple locations']
base_regex = r'[,\s]+(' + '|'.join(badstrings) + ')'
regex = re.compile(base_regex, re.I)

address = '123 Awesome St, Pleasantville, NY, Get Directions'
print(sanitize_address(address, regex))
which outputs:
123 Awesome St, Pleasantville, NY
I would say that the task is impossible to do with a high degree of confidence unless the data is in a fixed format, or you have a gigantic address database to make matches against.
You could possibly get away with having a list of countries, and then a rule set per country that you use. The American rule set could include a list of states, cities and postal codes and a pattern to find street addresses. You would then drop anything that isn't either a state, city postal code or looks like a street address.
You'd still drop things that should be part of an address, though; Swedish addresses, at least, can consist of just the name of a farm instead of a street and number. If US countryside addresses are the same, there is just no way to know what is part of an address and what isn't unless you have access to a database of all US addresses. :-)
Here is a regex that will parse either one. If you have other examples, I can change the current regex to handle them as well:
(?<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+(?<city>(?:\w+\s?)+)[,]\s+(?<state>(?:\w+\s?)+)(?:$|[,])
This will even work for addresses in a format similar to mine (1234 North 1234 West, Pleasantville, NY).
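Note that (?<name>...) is .NET syntax; in Python the named groups must be written (?P<name>...). A possible Python adaptation of the same pattern:

```python
import re

# Same pattern as above, with (?<name>...) rewritten as (?P<name>...).
pattern = re.compile(
    r"(?P<address>(?:[0-9]+\s+(?:\w+\s?)+)+),\s+"
    r"(?P<city>(?:\w+\s?)+),\s+"
    r"(?P<state>(?:\w+\s?)+)(?:$|,)"
)
m = pattern.search("1234 North 1234 West, Pleasantville, NY")
print(m.group("address"), "|", m.group("city"), "|", m.group("state"))
# 1234 North 1234 West | Pleasantville | NY
```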