Simple text parsing library

Simple text parsing library - python

I have a method that takes addresses from the web, and therefore, there are many known errors like:
123 Awesome St, Pleasantville, NY, Get Directions
Which I want to be:
123 Awesome St, Pleasantville, NY
Is there a web service or Python library that can help with this? It's fine for us to start creating a list of items like ", Get Directions" or a more generalized version of that, but I thought there might be a helper library for this kind of textual analysis.

If the address contains one of those bad strings, walk backwards till you find another non-whitespace character. If the character is one of your separators, say , or :, drop everything from that character onwards. If it's a different character, drop everything after that character.
Make a list of known bad strings. Then, you could take that list and use it to build a gigantic regex and use re.sub().
This is a naive solution, and isn't going to be particularly performant, but it does give you a clean way of adding known bad strings, by adding them to a file called .badstrings or similar and building the list from them.
Note that if you make bad choices about what these bad strings are, you will break the algorithm. But it should work for the simple cases you describe in the comments.
EDIT: Something like this is what I mean:
import re
def sanitize_address(address, regex):
return regex.sub('', address)
badstrings = ['get directions', 'multiple locations']
base_regex = r'[,\s]+('+'|'.join(badstrings)+')'
regex = re.compile(base_regex, re.I)
address = '123 Awesome St, Pleasantville, NY, Get Directions'
print sanitize_address(address, regex)
which outputs:
123 Awesome St, Pleasantville, NY

I would say that the task is impossible to do with a high degree of confidence unless the data is in a fixed format, or you have a gigantic address database to make matches against.
You could possibly get away with having a list of countries, and then a rule set per country that you use. The American rule set could include a list of states, cities and postal codes and a pattern to find street addresses. You would then drop anything that isn't either a state, city postal code or looks like a street address.
You'd still drop things that should be a part of an address though, at least with Swedish addresses, that can include just the name of a farm instead of a street and number. If US country side addresses are the same there is just no way to know what is a part of an address and what isn't unless you have access to a database with all US addresses. :-)

Here is a Regex that will parse either one. If you have other examples, I can change the current Regex to work for it
(?<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+(?<city>(?:\w+\s?)+)[,]\s+(?<state>(?:\w+\s?)+)(?:$|[,])
this will even work for addresses that are in similar format to mine (1234 North 1234 West, Pleasantville, NY)

Related

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.

Use https://www.onlineocr.net/ co convert pdf to text,
It shows immediately the outcome, where names and on the same line with dialogs,
which could allow for a simple processing
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
Not sure will it work for longer dialogs.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link . That file is formatted differently. In that file
10 leading spaces will indicate the dialog, and 5 leading spaces the name - a the least for you sample screenshot
The regex is
r" (.+)(\n( [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
it puts in group 1 whatever starts with ten spaces and whatever starts with at the exactly five space into the second :
the subexpression " [^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but space, and the rest of symbols until the end of line are arbitrary. Since dialogs tend to be multiline that expression is followed with another plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the amount of spaces varies a bit (say 4-6 and 7 - 14 respectively) but has distinct section the regex needs to be adjusted by using variable repetition operator (curly braces {4, 6}) or optional spaces ?.
r" {7, 14}(.+)(\n( {4-6}[^ ].+\n)+)"
The last idea is to use preexisting list of names in play to match them e.g. (SOCIAL WORKER|JOHN|MARY|ARTUR). The https://www.onlineocr.net/ website still could be used to help spot and delete actions

In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead insuring the speaker name is directly followed by a new line (avoids false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead i.e. a non capturing expression insuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

regex to find LastnameFirstname with no space between in Python

i currently have several names that look like this
SmithJohn
smithJohn
O'BrienPeter
both of these have no spaces, but have a capital letter in between.
is there a regex to match these types of names (but won't match names like Smith, John, Smith John or Smith.John)? furthermore, how could i split up the last name and first name into two different variables?
thanks

If all you want is a string with a capital letter in the middle and lowercase letters around it, this should work okay: [a-z][A-Z] (make sure you use re.search and not match). It handles "O'BrienPeter" fine, but might match names like "McCutchon" when it shouldn't. It's impossible to come up with a regex, or any program really, that does that you want for all names (see Falsehoods Programmers Believe About Names).

As Brian points out, there's a question you need to ask yourself here: What guarantees do you have about the strings you will be processing?
Do you know without a doubt that the only capitals will be the beginnings of the names? Or could something like "McCutchonBrian", or in my case "Mallegol-HansenPhilip" have found its way in there as well?
In the greater context of software in general, you need to consider the assumptions you are going in with. Otherwise you're going to be solving a problem, that is in fact not the problem you have.

Ideas for slicing a BeautifulSoup string to compartmentalize into certain categories?

So i'm in the process of doing some webscraping using BeautifulSoup and am given sequence of strings that are in this format:
"PRICE. ADDRESS, PHONE#, " 'WEBSITE
to show you what I mean, here are two examples of how these strings are displayed in the HTML text.
"$10. 2109 W. Chicago Ave., 773-772-0406, "'theoldoaktap.com
"$9. 3619 North Ave., 773-772-8435, "'cemitaspuebla.com
What's the best way to go this? It would've been easy if a comma followed the price (could've just done split(",") and addressed them by index, but what other alternatives do I have now? Can't split by periods because some addresses with directional streets have periods in the front (i.e. W. Chicago Ave).
Would the best solution be to split() and extract the first string (price), and then make a new string with the leftover indexes and then go about splitting by the comma (split(","))? Seems super non-python-y and i'm not sure that would work either.
In the end, I want to end up with
Price = $10
Location = 2109 W. Chicago Ave
Phone# = 773-772-0406
Website = http://www.theoldoaktap.com
thank you all in advance. my brain is fried.

import re
test = '"$10. 2109 W. Chicago Ave., 773-772-0406, "\'theoldoaktap.com'
extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+), "\'<a href="([^"]+)"', test)
print extracted_entities.groups()
Basically, since your strings are pretty rigidly formatted, you can simply use regular expression to extract out its components using some predetermined patterns. If you are going to do these types of projects often, I highly suggest studying regex, it's a very powerful tool!
Reference: https://docs.python.org/2/howto/regex.html

Regex to match address with subpatterns

I am attempting to create a regular expression to parse an address into five parts: "address1", which is the street address, "address2", which is the apartment number or whatever else shows up on line 2 of an address, the city, state, and zip code.
When I run this, Python (or Django) is throwing an error which states "unexpected end of pattern" when I run re.search. Can anyone tell me how to modify this regular expression to match correctly?
I'm very much a regular expression noob. I can make out most of what this one is supposed to do, but I could never have written it myself. I got this from http://regexlib.com/REDetails.aspx?regexp_id=472.
re.compile(r"""
(?x)^(?n:
(?<address1>
(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )
| (P.O. Box \d{1,5}))\s{1,2}
(?<city>
[A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0, 2})\, \x20
(?<state>
A[LKSZRAP] | C[AOT] | D[EC] | F[LM] | G[AU] | HI
| I[ADL N] | K[SY] | LA | M[ADEHINOPST] | N[CDEHJMVY]
| O[HKR] | P[ARW] | RI | S[CD] | T[NX] | UT | V[AIT]
| W[AIVY]
| [A-Z]([a-z])
+ (\.?)(\x20[A-Z]([a-z])+){0,2})\x20
(?<zipcode>
(?!0{5})\d{5}(-\d {4})?)
)$"
""", re.VERBOSE)
Newlines added for readability. As a follow-up question, can this regex be separated into multiple lines like this for readability, or will it need to be all in one line to work (I could just concatenate the separate lines, I suppose)?
P.S. I know this smells like homework, but it is actually for work.
Edit: Actual code being used was requested, so here it is. I left it out because everything here is actually already up there, but perhaps it will help.
The function is part of a Django view, but that shouldn't matter too much for our purposes.
def parseAddress(address):
pattern = r"^(?n:(?<address1>(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\.\ Box\ \d{1,5}))\s{1,2}(?i:(?<address2>(((APT|APARTMENT|BLDG|BUILDING|DEPT|DEPARTMENT|FL|FLOOR|HNGR|HANGER|LOT|PIER|RM|ROOM|S(LIP|PC|T(E|OP))|TRLR|TRAILER|UNIT)\x20\w{1,5})|(BSMT|BASEMENT|FRNT|FRONT|LBBY|LOBBY|LOWR|LOWER|OFC|OFFICE|PH|REAR|SIDE|UPPR|UPPER)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\, \x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY]|[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\x20(?<zipcode>(?!0{5})\d{5}(-\d {4})?))$"
match = re.search(pattern, address)
I was using my home address as the input, but I tried "123 Main St., Austin, TX 12345" as input as well with the same result.

Some people might not consider this an answer, but bear with me for a minute.
I HIGHLY recommend AGAINST trying to parse street addresses with a regex. Street addresses are not "regular" in any sense of the word. There is infinite variation, and unless you restrict yourself to a very limited grammar, there will always be strings you cannot parse. A huge amount of time and money has been invested in solutions to parse addresses, starting with the US Post Office and the many, many providers of list cleanup services. Just Google "parsing street addresses" to get a hint of the scope of the problem. There are commercial solutions and some free solutions, but the comments on the web indicate that nobody gets it right all the time.
I also speak from experience. During the '80s I worked for a database typesetting company, and we had to parse addresses. We never were able to develop a solution that worked perfectly, and for data we captured ourselves (we had a large keyboarding department) we developed a special notation syntax so the operators could insert delimiters at the appropriate locations to help the parsing process.
Take a look at some of the free services out there. You will save yourself a lot of hassle.

Set x (verbose) flag in regex, i.e.: (?x)

a non-regex answer: check out the python library usaddress (there's also a web interface for trying it out)
agree w/ Jim that regex isn't a good solution here. usaddress parses addresses probabilistically, and is much more robust than regex-based parsers when dealing with messy addresses.

Your regex fails on the first character n, which you can verify as follows. Make a file test.py and put the following:
import re
re.compile(r'...')
where you fill in your pattern of course :) Now run python -m pdb test.py, enter c to continue and it will stop when the exception is raised. At that point type l to see where in the code you are. You see it fails because source.next isn't in FLAGS. This source is just your pattern, so you verify where it fails by typing print source.index.
Furthermore, removing that n in front, the pattern fails at the first a of <address1>.
The (?n is strange, I can't find it in the documentation, so it seems to be an unsupported extension. As for the ?<address1>, I think this should be ?P<address1>. There are more problems with it, like (?i: and if I remove those and fix the ?P< stuff, I get an error about unbalanced parenthesis at the last parenthesis.

Jim Garrison (above) is correct - addresses are too varied to parse with a regex. I work for an address verification software company - SmartyStreets. Try out our LiveAddress API - the REST endpoint provides all the address components parsed in a nice, easy to use JSON response. Here's a sample:
https://github.com/smartystreets/LiveAddressSamples/blob/master/python/street-address.py

Problem with eastern european characters when scraping data from the European Parliament Website

EDIT: thanks a lot for all the answers an points raised. As a novice I am a bit overwhelmed, but it is a great motivation for continuing learning python!!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians, however due to the many Eastern European names and the accents they use i get a lot of missing entries. Here is an example of what is giving me troubles (notice the accents at the end of the family name):
<td class="listcontentlight_left">
ANDRIKIENĖ, Laima Liucija
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
So far I have been using PyParser and the following code:
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
print(name)
However this does not catch the name from the html above. Any advice in how to proceed?
Best, Thomas
P.S: Here is all the code i have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *
page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
print(name)

I was able to show 31 names starting with A with code:
extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = ' -'''
name = Word(alphanums + alphas8bit + extended_chars + special_chars)
As John noticed you need more unicode characters (extended_chars) and some names have hypehen etc. (special chars). Count how many names you received and check if page has the same count as I do for 'A'.
Range 0x80-0x87F encode 2 bytes sequences in utf8 of probably all european languages. In pyparsing examples there is greetingInGreek.py for Greek and other example for Korean texts parsing.
If 2 bytes are not enough then try:
extended_chars = u''.join(unichr(c) for c in xrange(127, 65536, 1))

Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser. Beautiful Soup which lets you specify the location you're interested in using the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:
soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
print cell.a.string

Looks like you've got some kind of encoding problem if you are getting western European names OK (they have lots of accents etc also!). Show us all of your code plus the URL of a typical page that you are trying to scrape and has the East-only problem. Displaying the piece of html that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.
Update The offending character in that MEP's name is U+0116 (LATIN LETTER CAPITAL E WITH DOT ABOVE). So it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL unicode alphabetics ... not just Latin-n in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
Other observations:
(1) alphaNUMs ... digits in a name?
(2) names may include apostrophe and hyphen e.g. O'Reilly, Foughbarre-Smith

at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method, which, when given a character, will tell you what class that codepoint is assigned to acc to the unicode character category; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit or whatever.
on second thought and remiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and one other is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly what characters are used for each language will involve quite a bit of work to figure out. greek uses those fancy apostrophies and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.
so why not simply turn the tables and take advantage of the larger context those names appear in? i brosed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna, GALLAGHER, Pat the Cope (wow), McGUINNESS, Mairead. It would take some work to recover the ordinary case from the last names (maybe leave all the lower case letters in place, and lower-case any capital letters that are preceded by another capital letters), but to extract the names is, in fact simple:
fullname := lastname ", " firstname
lastname := character+
firstname := character+
that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it up in two parts. as i see it, all you have to look for is the first occurrence of a sequence of comma, space—everything before that is the last, anything behind that the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
as has been noted earlier, some names use hyphens; now there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alambert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw thos dits used correctly?
i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogey as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon this is 2010.
ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate) and once you get your head about it’s admittedly wonderlandish API (all you ever need is probably the find() method) it will be simple to fish out exactly those snippets of text you’re looking for.

Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute force reg exps). One function in particular is makeHTMLTags, which takes a single string argument (the base tag), and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening tag expression does far more than just return the equivalent of "<"+tag+">". It also:
handles upper/lower casing of the tag
itself
handles embedded attributes
(returning them as named results)
handles attribute names that have
namespaces
handles attribute values in single, double, or no quotes
handles empty tags, as indicated by a
trailing '/' before the closing '>'
can be filtered for specific
attributes using the withAttribute
parse action
So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:
aTag,aEnd = makeHTMLTags("a")
for t,_,_ in aTag.scanString(page):
if ";id=" in t.href:
print t.title
Now you get whatever is in the title attribute, regardless of character set.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.