Tokenization which works with terms that contain whitespace in Python? - python

My standard approach to tokenize a text using a regex in Python is this:
> text = "Los Angeles is in California"
> tokens = re.findall(r'\w+', text)
> tokens
['Los','Angeles','is','in','California']
A problem arises if I want to find the name Los Angeles in the above text.
What is the best way to find a needle which contains whitespace in a haystack?
I am asking a general question, because the solution should also work for a case like United States of America and for needles which don't contain whitespace.
For example a simple if "Los Angeles" in text (match) would not do, because if "for" in text would also return a match. But I am looking for full words only (match for and not California).

I suggest to use a text parser like NLTK for such tasks.
But for this case you can use following regex :
>>> re.findall(r'\b([A-Z]\w+ [A-Z]\w+)|(\w+)\b',text)
[('Los Angeles', ''), ('', 'is'), ('', 'in'), ('', 'California')]
the regex r'([A-Z]\w+ [A-Z]\w+)|(\w+)' will match 2 group the first is a pair word that its elements contain capital words! and the second will match a word!

The solution turned out to be simple:
re.search(r'\b'+needle+r'\b', haystack)

Related

To find words (one or more than one consecutive) having first uppercase in capital?

I need to write a regular expression in python, which can find words from text having first letter in uppercase, these words can be single one or consecutive ones.
For example, for the sentence
Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.
expexted output should be
'Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee'
I have written a regular expression for this,
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
but output of this is
'Dallas Buyer Club', 'Craig Borten, 'Melisa Wallack', 'Jean-Marc Valee'
It is only printing consecutive first uppercase words, not single words like
'American', 'Directed'
also the regular expression,
[A-Z][a-z]+
printing all words but individually,
'Dallas', 'Buyers', 'Club' and so on.
Please help me on this.
I think you mixed up the brackets (and make the regex a bit too complicated. Simply use:
[A-Z][a-z]+(?:\s[A-Z][a-z]+)*
So here we have a matching part [A-Za-z]+ and in order to match more groups, we simply use (...)* to repeat ... zero or more times. In the ... we include the separator(s) (here \s) and the group again ([A-Z][a-z]+).
This will however not include the hyphen between 'Jean' and 'Marc'. In order to include it as well, we can expand the \s:
[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*
Depending on some other characters (or sequences of characters) that are allowed, you may have to alter the [\s-] part further).
This then generates:
>>> rgx = re.compile(r'[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*')
>>> txt = r'Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.'
>>> rgx.findall(txt)
['Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee']
EDIT: in case the remaining characters can be uppercase as well, you can use:
[A-Z][A-Za-z]+(?:[\s-][A-Z][A-Za-z]+)*
Note that this will match words with two or more characters. In case single word characters should be matched as well, like 'J R R Tolkien', then you can write:
[A-Z][A-Za-z]*(?:[\s-][A-Z][A-Za-z]*)*

keep trailing punctuation in python nltk.word_tokenize

There's a ton available about removing punctuation, but I can't seem to find anything keeping it.
If I do:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']
the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']
I'd like this to always perform as the second case. For now, I'm hackishly doing:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")
since I feel pretty confident in throwing away "|||" at any given time, but don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this ?
It is a quirk of spelling that if a sentence ends with an abbreviated word, we only write one period, not two. The nltk's tokenizer doesn't "remove" it, it splits it off because sentence structure ("a sentence must end with a period or other suitable punctuation") is more important to NLP tools than consistent representation of abbreviations. The tokenizer is smart enough to recognize most abbreviations, so it doesn't separate the period in L.P. mid-sentence.
Your solution with ||| results in inconsistent sentence structure, since you now have no sentence-final punctuation. A better solution would be to add the missing period only after abbreviations. Here's one way to do this, ugly but as reliable as the tokenizer's own abbreviation recognizer:
toks = nltk.word_tokenize(test_str + " .")
if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
pass # Keep the added period
else:
toks = toks[:-1]
PS. The solution you have accepted will completely change the tokenization, leaving all punctuation attached to the adjacent word (along with other undesirable effects like introducing empty tokens). This is most likely not what you want.
Could you use re?
import re
test_str = "Some Co Inc. Other Co L.P."
print re.split('\s', test_str)
This will split the input string based on spacing, retaining your punctuation.

Correct Regex for Acronyms In Python

I want to find so called Acronyms in text is this the correct way of defining the regex for it?
My idea is that if something starts with capital and ends with capital letter it is acronym. Is this correct?
import re
test_string = "Department of Something is called DOS,
or DoS, or (DiS) or D.O.S. in United State of America, U.S.A./ USA"
pattern3=r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
print re.findall(pattern3, test_string)
and the out put is:
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']
Think that you can use the word boundary \b anchor for what you want to do
>>> regex = r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?"
>>> re.findall(regex, "AbIA AoP U.S.A.")
['AbIA', 'AoP', 'U.S.A.']

python, re.search / re.split for phrases which looks like a title, i.e. starting with an uppper case

I have a list of phrases (input by user) I'd like to locate them in a text file, for examples:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2./ I can also find variants of these phrases e.g. 'Blue team', final Match', 'best player' ... using re.I, if they ever appear in the text.
But I want to restrict to finding only variants of the input phrases with their first letter upper-cased e.g. 'Blue team' in the text, regardless how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo code I imagine generate something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
I think regular expressions won't let you specify just a region where the ignore case flag is applicable. However, you can generate a new version of the text in which all the characters have been lower cased, but the first one for every word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(pattern, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/p]']
or ['Barrack Obama', 'Bill Gates'].
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same
unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]].That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the
stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a
backslash: \[P\].
To return only the words inside the tags, place grouping parentheses
around .+?.
Try this :
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
# match start: match.start()
# match end (exclusive): match.end()
# matched text: match.group()
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall('\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
you can replace your pattern with
regex = ur"\[P\]([\w\s]+)\[\/P\]"
Use this pattern,
pattern = '\[P\].+?\[\/P\]'
Check here

Categories

Resources