Get a whole unicode sentence - python

I'm trying to parse a sentence like Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras. I want to first split the text by periods, then use whatever is before the colon as a label for the sentence after the colon.
Right now I have the following definition:
from pyparsing import *
unicode_printables = u''.join(unichr(c) for c in xrange(65536)
                              if not unichr(c).isspace())
def parse_test(text):
    label = Word(alphas) + Suppress(':')
    value = OneOrMore(Word(unicode_printables) | Literal(','))
    group = Group(label.setResultsName('label') + value.setResultsName('value'))
    exp = delimitedList(
        group,
        delim='.'
    )
    return exp.parseString(text)
It kind of works, but it drops the unicode characters (and anything else not in alphanums), and I'd rather have the value as a whole sentence instead of this: 'value': [(([u'Lote', u'Numero', u'1', ',', u'Marcelo', u'T', u'de', u'Alvear', u'500'], {}), 1).
Is there a simple way to tackle this?

To directly answer your question, wrap your value definition with originalTextFor, and this will give you back the string slice that the matching tokens came from, as a single string. You could also add a parse action, like:
value.setParseAction(lambda t : ' '.join(t))
But this would explicitly put a single space between each item, when there might have been no spaces (in the case of a ',' after a word), or more than one space. originalTextFor will give you the exact input substring. But even simpler, if you are just reading everything after the ':', would be to use restOfLine. (Of course, the simplest would be just to use split(':'), but I assume you are specifically asking how to do this with pyparsing.)
A couple of other notes:
xxx.setResultsName('yyy') can be shortened to just xxx('yyy'), improving the readability of your parser definition.
Your definition of value as OneOrMore(Word(unicode_printables) | Literal(',')) has a couple of problems. For one thing, ',' is included in the set of characters in unicode_printables, so ',' will be included in any parsed words. The best way to solve this is to use the excludeChars parameter to Word, so that your sentence words do not include commas: OneOrMore(Word(unicode_printables, excludeChars=',') | ','). You can also exclude other possible punctuation, like ';', '-', etc., just by adding them to the excludeChars string. (I just noticed that you are using '.' as the delimiter for a delimitedList - for this to work, you will have to include '.' as an excluded character too.)
Pyparsing is not like a regular expression in this regard - it does not do any lookahead to check whether the next token in the parser could match before letting the current token consume another character. That is why you have to do some extra work of your own to avoid reading too much. In general, something as open-ended as OneOrMore(Word(unicode_printables)) is very likely to eat up the entire rest of your input string.
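Putting those pieces together, here is an untested sketch of a revised parser (keeping the question's Python 2 setup, with unicode_printables as defined above, and excluding both ',' and '.' from words):

from pyparsing import (Word, alphas, Suppress, Group, OneOrMore,
                       Literal, delimitedList, originalTextFor)

def parse_test(text):
    label = Word(alphas) + Suppress(':')
    # words may not contain ',' or '.', which act as punctuation/delimiter
    word = Word(unicode_printables, excludeChars=',.')
    value = originalTextFor(OneOrMore(word | Literal(',')))
    group = Group(label('label') + value('value'))
    return delimitedList(group, delim='.').parseString(text)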

You should look into PyICU which provides access to the rich Unicode text library provided by ICU, including the BreakIterator class that provides a sentence finder.
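For example, a minimal sketch using PyICU's sentence BreakIterator (assuming the icu module is installed; the locale choice is illustrative):

import icu  # PyICU

text = u"Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras."
bi = icu.BreakIterator.createSentenceInstance(icu.Locale('es'))
bi.setText(text)
start = bi.first()
for end in bi:  # iterating a BreakIterator yields successive boundary offsets
    print(text[start:end])
    start = end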


Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
    (https://|http://)
    [a-zA-Z0-9.%+-\\/_]+
    )''', re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.'s in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), but doesn't really seem to affect this particular program I'm working on: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude some symbols at the end of the string you can use a negative lookbehind. For example, to disallow . and ,:
.*(?<![.,])$
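A quick, minimal illustration of that lookbehind (the URLs are just placeholders):

import re

ends_clean = re.compile(r'.*(?<![.,])$')
print(bool(ends_clean.match('http://example.com/')))   # True
print(bool(ends_clean.match('http://example.com/.')))  # False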
Answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parentheses would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding the comma and dot: actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+. The - character has special meaning: everything between those two characters -- by ascii code. So [A-a] is a valid range; it includes A-Z, but also a bunch of other characters that aren't A-Z. If you want to include - in the class, it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
For the comma, I actually don't see it represented in your regex, so I can't comment specifically on that; it shouldn't be allowed anywhere in the URL. In general though, you'll just want to add more groups/more conditions.
First, break apart the url into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the url to end with, say, a dot, then you could do something like [A-Za-z0-9] -- note the lack of a dot here, plus its length: only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
    r'(http://|https://)'       # Capture the scheme name
    r'([a-zA-Z0-9.%+-\\/_])'    # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
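As a rough sketch of that scheme/domain/endpoint split (illustrative only; real URLs are more permissive than these character classes allow):

import re

url_re = re.compile(
    r'(https?://)'                            # scheme
    r'([a-zA-Z0-9.-]+)'                       # domain: letters, digits, dots, hyphens
    r'(/[a-zA-Z0-9.%+_/-]*[a-zA-Z0-9_/-])?'   # optional endpoint, not ending in '.' or ','
)

m = url_re.search('See https://example.com/path/to/page. Next sentence.')
if m:
    print(m.group())  # https://example.com/path/to/page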
To answer the second question first: no, a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly, though, your regex had an issue separate from the fact that it was matching the final punctuation: the last - does not appear to be intended to define a range of characters (see the footnote below for why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regex. It is the same as the previous character class except that it does not include . (other punctuation is already excluded), so the matched pattern cannot end in a dot. The + (one or more) on the previous character class is accordingly reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re

webURLregex = re.compile(r'''(
    (https://|http://)
    [a-zA-Z0-9.%+\\/_-]*
    [a-zA-Z0-9%+\\/_-]
    )''', re.VERBOSE)

text = "... at http://www.google.com/. It says"   # renamed from str, to avoid shadowing the built-in
m = re.search(webURLregex, text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the second - does not appear to be intended to define a character range is based on the fact that, if it were, such a range would run from 053 to 0134 (octal), i.e. from '+' to '\', which would also include the uppercase alphabetical characters, making the A-Z redundant.

Regex to split text file in python

I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found, look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
Just change (\ |\.|\'|\d) to [\ \.\'\d] or (?:\ |\.|\'|\d). re.split returns the text of any capturing group in the pattern as extra list elements, so a character class or a non-capturing group keeps those fragments out of the result.
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
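This should print something like:

['OBAMA: said something ', "O'MALLEY: said something else ", 'GOV. HICKENLOOPER: said something else entirely']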
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # 1st split the string into lines based on the \n character
for line in lines:
    colon_pos = line.find(":", 0)  # Finds the position of the first colon in the line
    if colon_pos == -1:            # Skip lines without a colon (e.g. the trailing empty line)
        continue
    speaker, utterance = line[0:colon_pos].strip(), line[colon_pos+1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.

How do I build a tokenizing regex based iterator in python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings (these are actually extra matches to string start and string end) at the beginning and end of its list of token outputs, so to remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work, but it doesn't (and it also seems to behave wildly differently on other regex engines). For some reason it wouldn't build a single match from the string's start and the delimiters following it; the string start somehow also consumes the character following it! (This is also where I see divergence from other engines. Is this a BUG? Or does it have something to do with special non-corporeal characters and the or (|) operator in Python that I'm not aware of?) This solution also did nothing for the double match containing the string's end: once it matched the delimiters, it then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems: the split at the beginning doesn't contain the string start (but I don't care too much about that anyway, since I'm interested in the tokens themselves), and it also matches the string start when there are no delimiters at the beginning of the string. But the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place), but try as I might I couldn't get rid of the delimiters-then-end followed by end-alone match problem (which yields an additional empty string). Still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice: once with the preceding delimiters and once by itself (two group-2 matches).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I'd have a simple answer, but if you must know (and if you don't, read no further), it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns that are used to extract fields from semi-structured, classifier-driven messages.)
The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]'  # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                   |          # or
                   {0}+       # sequence of delimiters (at least one)
                   |          # or
                   (?<!{0})$  # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
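A minimal usage sketch, reusing the pairwise recipe from the question (the string value here is just an example):

from more_itertools import pairwise

string = " dasdha hasud hasuid "
for prev, curr in pairwise(search_pattern.finditer(string)):
    print(repr(string[prev.end():curr.start()]))
# prints 'dasdha', 'hasud', 'hasuid' -- no empty strings at the ends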
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead just use regex groups and a negated character class ([^...]) to achieve this:
import re

# match any number of consecutive non-delim chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '  # raw string: '-' must stay escaped inside the character class
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

How to add tags to negated words in strings that follow "not", "no" and "never"

How do I add the tag NEG_ to all words that follow not, no and never, up to the next punctuation mark, in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)
Will print:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is all kinds of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with strings like never going to work,. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with regexp. Rather I would (see the sketch below):
Split the input on punctuation characters.
For each fragment:
    Set a negation counter to 0.
    Split the fragment into words.
    For each word:
        Prepend the negation counter's number of NEG_ prefixes to the word (or counter mod 2, or 1 if it is greater than 0).
        If the original word is in {no, never, not}, increase the negation counter by one.
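A minimal sketch of that approach, without regexes (assuming words are space-separated and that a single trailing punctuation mark ends a negation scope):

NEGATORS = {'no', 'never', 'not'}
PUNCTUATION = '.,:;!?'

def tag_negations(text):
    out = []
    negating = 0
    for token in text.split(' '):
        # peel off one trailing punctuation mark, so 'work,' -> 'work' + ','
        word, punct = (token[:-1], token[-1]) if token and token[-1] in PUNCTUATION else (token, '')
        tagged = 'NEG_' + word if negating and word else word  # "1 if greater than 0" variant
        if word.lower() in NEGATORS:
            negating += 1
        if punct:
            negating = 0  # punctuation closes the negation scope
        out.append(tagged + punct)
    return ' '.join(out)

print(tag_negations('It was never going to work, he thought.'))
# It was never NEG_going NEG_to NEG_work, he thought.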
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
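A hedged sketch of those steps, using re.sub with a replacement function so the match-transform-reinsert happens in one pass (the keyword is captured too, so the match can be reassembled; the inner word regex mirrors the first answer's):

import re

def neg_tag(text):
    pattern = re.compile(r'\b(not?|never)\b([^.,:;!?]+)', re.IGNORECASE)
    def prepend_neg(m):
        # step 2: prepend NEG_ to each word in the matched tail (group 2)
        tail = re.sub(r'(\s+)(\w+)', r'\1NEG_\2', m.group(2))
        # step 3: reassemble the match in place
        return m.group(1) + tail
    return pattern.sub(prepend_neg, text)

print(neg_tag('He did not play so well, so he had to practice some more.'))
# He did not NEG_play NEG_so NEG_well, so he had to practice some more.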

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there may be leading and trailing whitespace around the commas), and can be quoted either with ' or with ". For example - this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks can only be prepended or trailed by spaces.
Such malformed rows should be recognized - best would be to somehow mark malformed value within row, but if regex doesn't match the whole row then it's also acceptable.
I'm trying to write a regex able to parse this, using either match() or findall(), but every single regex I come up with has some problems with edge cases.
So, maybe someone with experience in parsing something similar could help me on this?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
The csv module is not of much use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not that good an idea here, as (1) covering all edge cases can be tricky and (2) the writer output is not regular. Having written that, I've decided to check the mentioned pyparsing and either use it or write a custom FSM-like parser.
While the csv module is the right answer here, a regex that could do this is quite doable:
import re
r = re.compile(r'''
    \s*                  # Any whitespace.
    (                    # Start capturing here.
      [^,"']+?           # Either a series of non-comma non-quote characters.
      |                  # OR
      "(?:               # A double-quote followed by a string of characters...
          [^"\\]|\\.     # That are either non-quotes or escaped...
      )*                 # ...repeated any number of times.
      "                  # Followed by a closing double-quote.
      |                  # OR
      '(?:[^'\\]|\\.)*'  # Same as above, for single quotes.
    )                    # Done capturing.
    \s*                  # Allow arbitrary space before the comma.
    (?:,|$)              # Followed by a comma or the end of a string.
''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print r.findall(line)
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
    ^(?:                 # Capture from the start.
        # Below is the same regex as above, but condensed.
        # One tiny modification is that it allows empty values.
        # The first plus is replaced by an asterisk.
        \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$                  # And don't stop until the end.
''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'

if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.
Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.
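For instance, a minimal pyparsing sketch along those lines (illustrative; empty-field handling and quote unescaping would need tuning for your exact rules):

from pyparsing import Word, printables, quotedString, removeQuotes, delimitedList, Optional

value = quotedString.setParseAction(removeQuotes) | Word(printables, excludeChars=',')
row = delimitedList(Optional(value, default=''), delim=',')

line = """data1, data2 ,"data3'''", 'data4""',,,data5,"""
print(row.parseString(line).asList())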
Python has a standard library module to read csv files:
import csv
reader = csv.reader(open('file.csv'))
for line in reader:
    print line
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
You need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?
This probably sounds too simple, but really, from the looks of things, you are looking for a string that contains [a-zA-Z0-9]["']+[a-zA-Z0-9]. I mean, without in-depth testing against the data, what you're really looking for is a quote or double quote (or any combination) in between letters (you could also add numbers there).
Based on what you were asking, it really doesn't matter that it's a CSV; what matters is that you have data that doesn't conform. I believe just searching for a letter, then any combination of one or more " or ', and then another letter, will do it.
Now, are you looking for a count, or just a printout of each line that contains it, so you know which ones to go back and fix?
I'm sorry I don't know python regex's but in perl this would look something like this:
# Look for one or more letter/number at least one ' or " or more and at least one
# or more letter/number
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
# Prints the line if the above regex is found
print $line;
}
Simply convert that for when you look at each line.
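A rough Python equivalent of that Perl check (assuming line holds the current row as a string):

import re

# a quote or double quote (one or more) sandwiched between letters/numbers
if re.search(r'''[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+''', line):
    print(line)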
I'm sorry if I misunderstood the question
I hope it helps!
If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
  <line>
    <column>data1</column>
    <column>data2 </column>
    <column>data3'''</column>
    <column>data4""</column>
    <column/>
    <column/>
    <column>data5</column>
    <column/>
  </line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')
