How can I make my title-case regular expression match prefix titles?

I need to pull possible titles out of a chunk of text. For instance, I want to match phrases like "Joe Smith", "The Firm", or "United States of America". I now need to modify my expression to also match names that begin with a title of some kind (such as "Dr. Joe Smith"). Here's the regular expression I have:
import re

NON_CAPPED_WORDS = (
    # Articles
    'the',
    'a',
    'an',
    # Prepositions
    'about',
    'after',
    'as',
    'at',
    'before',
    'by',
    'for',
    'from',
    'in',
    'into',
    'like',
    'of',
    'on',
    'to',
    'upon',
    'with',
    'without',
)

TITLES = (
    'Dr\.',
    'Mr\.',
    'Mrs\.',
    'Ms\.',
    'Gov\.',
    'Sen\.',
    'Rep\.',
)

# These are words that don't match the normal title-case regex, but are still
# allowed in matches
IRREGULAR_WORDS = NON_CAPPED_WORDS + TITLES

non_capped_words_re = r'[\s:,]+|'.join(IRREGULAR_WORDS)

TITLE_RE = re.compile(r"""(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|{0})+\s*)""".format(non_capped_words_re))
Which builds the following regular expression:
(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|the[\s:,]+|a[\s:,]+|an[\s:,]+|about[\s:,]+|after[\s:,]+|as[\s:,]+|at[\s:,]+|before[\s:,]+|by[\s:,]+|for[\s:,]+|from[\s:,]+|in[\s:,]+|into[\s:,]+|like[\s:,]+|of[\s:,]+|on[\s:,]+|to[\s:,]+|upon[\s:,]+|with[\s:,]+|without[\s:,]+|Dr\.[\s:,]+|Mr\.[\s:,]+|Mrs\.[\s:,]+|Ms\.[\s:,]+|Gov\.[\s:,]+|Sen\.[\s:,]+|Rep\.)+\s*)
This doesn't seem to be working though:
>>> whitelisting.TITLE_RE.findall('Dr. Joe Smith')
[('Dr', 'Dr'), ('Joe Smith', 'Smith')]
Can someone who has better regex-fu help me fix this mess of a regex?

The problem seems to be that the first part of the expression, [A-Z0-9&][a-zA-Z0-9]*[\s,:-]*, is gobbling up the initial characters in your "prefix titles", since they are title-cased until you get to the period. So, when the + is repeating the subexpression and encounters 'Dr.', that initial part of the expression matches 'Dr', and leaves only the non-matching period.
One easy fix is to simply move the "special cases" to the front of the expression, so they're matched as a first resort, not a last resort (this essentially just moves {0} from the end of the expression to the front):
TITLE_RE = re.compile(r"""(?P<title>({0}|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
Result:
>>> TITLE_RE.findall('Dr. Joe Smith')
[('Dr. Joe Smith', 'Smith')]
I would probably go further and modify the expression to avoid all the repetition of [\s:,]+, but I'm not sure there's any real benefit, aside from making the formatted expression look a little nicer:
non_capped_words_re = '|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>((?:{0})[\s:,]+|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
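A quick check of that consolidated pattern against another sample (the sample string is my own, and I'd expect output along these lines):
>>> TITLE_RE.findall('Gov. Joe Smith of The Firm')
[('Gov. Joe Smith of The Firm', 'Firm')]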

Related

Python splitting text with line breaks into a list

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
Excerpt from the text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k)
             for k in content.split(" ")]
print(text_list)
This code is leaving in spaces and combining words in each item of the list like below
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks
Since it looks like you're parsing html text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'St Petersburg' likely got matched together because the character between the 't' and 'P' probably isn't a space, but a non-breaking space. If this were just html, I'd expect there to be &nbsp; or something of the sort, but it's possible that in your case there's some UTF non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
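If you do want to stick with a .split(" ")-based approach, one simple workaround (a sketch, assuming the stray character really is a U+00A0 non-breaking space) is to normalize it before splitting:
# replace non-breaking spaces (U+00A0) with ordinary spaces before splitting
content = content.replace('\xa0', ' ')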
In case the bare &lt (without the semicolon) is not your mistake but is actually in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
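A small side-by-side check of the two substitutions on a made-up sample (here only the second entity is missing its semicolon):
import re

sample = 'Archangel. &lt;the&lt I'
print(re.sub(r'&[^ ]+?;', '', sample))            # 'Archangel. the&lt I'  -- the bare &lt survives
print(re.sub(r'&[^ ;]+?(?=[ ;]);?', '', sample))  # 'Archangel. the I'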
Here is what can be done. You just need to replace any known entity code in advance:
import re

# define the special entity codes we want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg|)[; ]"

# remove the entity codes first, then substitute the unwanted characters as before
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip()
             for k in re.sub(special_syntax, ' ', content).split(" ")]

# remove empty strings from the list
filter_object = filter(lambda x: x != "", text_list)
list(filter_object)
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']

Python connect composed keywords in texts

So, I have a list of lowercase keywords. Let's say
keywords = ['machine learning', 'data science', 'artificial intelligence']
and a list of texts in lowercase. Let's say
texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
I need to transform the texts into:
[[['the', 'new',
'machine_learning',
'model',
'built',
'by',
'google',
'is',
'revolutionary',
'for',
'the',
'current',
'state',
'of',
'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
[['data_science',
'and',
'artificial_intelligence',
'are',
'two',
'different',
'fields',
'although',
'they',
'are',
'interconnected'],
['scientists',
'from',
'harvard',
'are',
'explaining',
'it',
'in',
'a',
'detailed',
'presentation',
'that',
'could',
'be',
'found',
'on',
'our',
'page']]]
What I do right now is check whether each keyword is in a text and, if so, replace it with the keyword joined by _. But this is of complexity m*n, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
I was trying to use Phraser, but I can't manage to build one with only my keywords.
Could someone suggest me a more optimized way of doing it?
The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)
You could, however, mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.
For example, let's first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that's just an empty-tuple. (The reason for this will become clear later.)
keywords = ['machine learning', 'data science', 'artificial intelligence']

texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict
After this step, combinations_dict is:
{('machine', 'learning'): ('machine_learning', ()),
('data', 'science'): ('data_science', ()),
('artificial', 'intelligence'): ('artificial_intelligence', ())}
Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.
For example:
def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair...
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]  # last solo token if any
Here we see the reason for the earlier () empty-tuples: that's the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn't found.
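To make the two possible outcomes of that dict.get() call concrete (using the combinations_dict built above):
# a pair that is in the dict collapses to its combined token and resets the buffer
combinations_dict.get(('machine', 'learning'), ('machine', ('learning',)))
# -> ('machine_learning', ())

# a pair that is not in the dict falls back to the default: emit the first
# token and keep the second one buffered for the next candidate pair
combinations_dict.get(('machine', 'model'), ('machine', ('model',)))
# -> ('machine', ('model',))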
Now designated combinations can be applied via:
tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts
...which reports retokenized_texts as:
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]
Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.
Real projects will want to use a more sophisticated tokenization that might either strip the punctuation, retain punctuation as tokens, or do other preprocessing - and as a result would properly pass 'artificial' as a token without the attached '.'. For example, a simple tokenization that just retains runs of word characters, discarding punctuation, would be:
import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts
Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:
tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts
Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
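Putting the pieces together to get the exact nested structure asked for in the question (a rough sketch: splitting sentences on '.' is my own shortcut here, and it reuses combinations_dict and combining_generator from above):
import re

retokenized_texts = [
    [list(combining_generator(re.findall(r'\w+', sentence), combinations_dict))
     for sentence in text.split('.') if sentence.strip()]
    for text in texts
]
retokenized_texts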
This is probably not the most Pythonic way to do it, but it works in 3 steps.
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.']
import re

# Add underscores
for idx, text in enumerate(texts):
    for keyword in keywords:
        reload_text = texts[idx]
        if keyword in text:
            texts[idx] = reload_text.replace(keyword, keyword.replace(" ", "_"))

# Split each text on "." to separate the sentences
for idx, text in enumerate(texts):
    texts[idx] = list(filter(None, text.split(".")))
print(texts)

# Split each sentence to get the words
for idx, text in enumerate(texts):
    for idx_s, sentence in enumerate(text):
        # map to delete every undesired character
        texts[idx][idx_s] = list(map(lambda x: re.sub(r"[,\.!?]", "", x), sentence.split()))
print(texts)
Output
[
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
],
[
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'],
['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
]
]

Split text without separating e.g. 'New York'

I know how to split a string into a list of words, like this:
some_string = "Siva is belongs to new York and he was living in park meadows mall apartment "
some_string.split()
# ['Siva', 'is', 'belongs', 'to', 'new', 'York', 'and', 'he', 'was', 'living', 'in', 'park', 'meadows', 'mall', 'apartment']
However, some of the words should not be separated, for example, "New York" and "Park Meadows Mall". I have saved such special cases in a list called some_list:
some_list = [('new York'), ('park meadows mall')]
where the desired result would be:
['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
Any ideas on how I can get this done?
You can reconstruct the split elements into their compound form. Ideally, you want to scan over your split string only once, checking each element against all possible replacements.
A naive approach is to transform some_list into a lookup table of all possible sequences given one word. For example, the element 'new' indicates a potential replacement for 'new', 'York'. Such a table can be built by splitting off the first word of each compound word:
replacements = {}
for compound in some_list:
    words = compound.split()  # 'new York' => ['new', 'York']
    try:  # 'new' => [['new', 'York'], ['new', 'Orleans']]
        replacements[words[0]].append(words)
    except KeyError:  # 'new' => [['new', 'York']]
        replacements[words[0]] = [words]
Using this, you can traverse your split string and test for each word whether it might be part of a compound word. The tricky part is to avoid adding trailing parts of compound words.
splitted_string = some_string.split()
compound_string = []
insertion_offset = 0

for index, word in enumerate(splitted_string):
    # we already added a compound string, skip its members
    if len(compound_string) + insertion_offset > index:
        continue
    # check if a compound word starts here...
    try:
        candidate_compounds = replacements[word]
    except KeyError:
        # definitely not, just keep the word
        compound_string.append(word)
    else:
        # try all possible compound words...
        for compound in candidate_compounds:
            if splitted_string[index:index+len(compound)] == compound:
                # offset by the extra words this compound consumed
                insertion_offset += len(compound) - 1
                compound_string.append(' '.join(compound))
                break
        # ...but otherwise, just keep the word
        else:
            compound_string.append(word)
This will stitch together all individual pieces of compound words:
>>> print(compound_string)
['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
Note that the ideal structure of the replacements table depends on your words in some_list. If there are no collisions of first words, you can skip the list of compound words and have only one compound word each. If there are many collisions, you may have to nest several tables inside to avoid having to try all candidates. The latter is especially important if some_string is large.
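For example, with colliding first words the table built above would come out roughly like this (the 'new Orleans' entry is made up purely for illustration):
some_list = ['new York', 'new Orleans', 'park meadows mall']

replacements = {}
for compound in some_list:
    words = compound.split()
    try:
        replacements[words[0]].append(words)
    except KeyError:
        replacements[words[0]] = [words]

print(replacements)
# {'new': [['new', 'York'], ['new', 'Orleans']], 'park': [['park', 'meadows', 'mall']]}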

Splitting a python string

I have a string in python that I want to split in a very particular manner. I want to split it into a list containing each separate word, except for the case when a group of words are bordered by a particular character. For example, the following strings would be split as such.
'Jimmy threw his ball through the window.'
becomes
['Jimmy', 'threw', 'his', 'ball', 'through', 'the', 'window.']
However, with a border character I'd want
'Jimmy |threw his ball| through the window.'
to become
['Jimmy', 'threw his ball', 'through', 'the', 'window.']
As an additional requirement, any - that appears immediately before a grouping phrase (i.e. outside the pipes) needs to appear inside it after splitting, i.e.,
'Jimmy |threw his| ball -|through the| window.'
would become
['Jimmy', 'threw his', 'ball', '-through the', 'window.']
I cannot find a simple, pythonic way to do this without a lot of complicated for loops and if statements. Is there a simple way to handle something like this?
This isn't something with an out-of-the-box solution, but here's a function that's pretty Pythonic that should handle pretty much anything you throw at it.
import re

def extract_groups(s):
    separator = re.compile(r"(-?\|[\w ]+\|)")
    components = separator.split(s)
    groups = []
    for component in components:
        component = component.strip()
        if len(component) == 0:
            continue
        elif component[0] in ['-', '|']:
            groups.append(component.replace('|', ''))
        else:
            groups.extend(component.split(' '))
    return groups
Using your examples:
>>> extract_groups('Jimmy threw his ball through the window.')
['Jimmy', 'threw', 'his', 'ball', 'through', 'the', 'window.']
>>> extract_groups('Jimmy |threw his ball| through the window.')
['Jimmy', 'threw his ball', 'through', 'the', 'window.']
>>> extract_groups('Jimmy |threw his| ball -|through the| window.')
['Jimmy', 'threw his', 'ball', '-through the', 'window.']
There's probably some regular expression solving your problem. You might get the idea from the following example:
import re
s = 'Jimmy -|threw his| ball |through the| window.'
r = re.findall('-?\|.+?\||[\w\.]+', s)
print r
print [i.replace('|', '') for i in r]
Output:
['Jimmy', '-|threw his|', 'ball', '|through the|', 'window.']
['Jimmy', '-threw his', 'ball', 'through the', 'window.']
Explanation:
-? optional minus sign
\|.+?\| pipes with at least one character in between
| or
[\w\.]+ at least one "word" character or .
In case , or ' can appear in the original string, the expression needs some fine tuning.
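One possible fine-tuning along those lines (a sketch only) is to allow , and ' inside the word branch of the alternation:
s = "Jimmy's |threw his, ball| through the window."   # made-up sample containing , and '
r = re.findall(r"-?\|.+?\||[\w\.,']+", s)
print [i.replace('|', '') for i in r]
# should give: ["Jimmy's", 'threw his, ball', 'through', 'the', 'window.']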
You can parse that format using a regex, although your choice of delimiter makes it rather an ugly one!
This code finds all sequences that consist either of a pair of pipe characters | enclosing zero or more non-pipe characters, or one or more characters that are neither pipes nor whitespace.
import re
s = 'Jimmy |threw his| ball -|through the| window.'
for seq in re.finditer(r' \| [^|]* \| | [^|\s]+ ', s, flags=re.X):
    print(seq.group())
output
Jimmy
|threw his|
ball
-
|through the|
window.

Split a large string into multiple substrings containing 'n' number of words via python

Source text: United States Declaration of Independence
How can one split the above source text into a number of sub-strings containing 'n' words each?
I use split(' ') to extract each word; however, I do not know how to do this for multiple words in one operation.
I could run through the list of words that I have and create another by gluing together words from the first list (whilst adding spaces), but my method isn't very pythonic.
text = """
When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature?s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation.
We hold these Truths to be self-evident, that all Men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness?-That to secure these Rights, Governments are instituted among Men, deriving their just Powers from the Consent of the Governed, that whenever any Form of Government becomes destructive of these Ends, it is the Right of the People to alter or abolish it, and to institute a new Government, laying its Foundation on such Principles, and organizing its Powers in such Form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient Causes; and accordingly all Experience hath shewn, that Mankind are more disposed to suffer, while Evils are sufferable, than to right themselves by abolishing the Forms to which they are accustomed. But when a long Train of Abuses and Usurpations, pursuing invariably the same Object, evinces a Design to reduce them under absolute Despotism, it is their Right, it is their Duty, to throw off such Government, and to provide new Guards for their future Security. Such has been the patient Sufferance of these Colonies; and such is now the Necessity which constrains them to alter their former Systems of Government. The History of the Present King of Great-Britain is a History of repeated Injuries and Usurpations, all having in direct Object the Establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid World.
"""
words = text.split()
subs = []
n = 4
for i in range(0, len(words), n):
    subs.append(" ".join(words[i:i+n]))
print subs[:10]
prints:
['When in the course', 'of human Events, it', 'becomes necessary for one', 'People to dissolve the', 'Political Bands which have', 'connected them with another,', 'and to assume among', 'the Powers of the', 'Earth, the separate and', 'equal Station to which']
or, as a list comprehension:
subs = [" ".join(words[i:i+n]) for i in range(0, len(words), n)]
You're trying to create n-grams? Here's how I do it, using the NLTK.
import re
import nltk

punct = re.compile(r'^[^A-Za-z0-9]+|[^a-zA-Z0-9]+$')
is_word = re.compile(r'[a-z]', re.IGNORECASE)

sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = nltk.tokenize.punkt.PunktWordTokenizer()

def get_words(sentence):
    return [punct.sub('', word) for word in word_tokenizer.tokenize(sentence) if is_word.search(word)]

def ngrams(text, n):
    for sentence in sentence_tokenizer.tokenize(text.lower()):
        words = get_words(sentence)
        for i in range(len(words) - (n - 1)):
            yield ' '.join(words[i:i+n])
Then
for ngram in ngrams(sometext, 3):
    print ngram
For a large string, an iterator is recommended for speed and a low memory footprint.
import re, itertools
# Original text
text = "When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature?s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation."
n = 10
# An iterator which will extract words one by one from text when needed
words = itertools.imap(lambda m:m.group(), re.finditer(r'\w+', text))
# The final iterator that combines words into n-length groups
word_groups = itertools.izip_longest(*(words,)*n)
for g in word_groups: print g
will get the following result:
('When', 'in', 'the', 'course', 'of', 'human', 'Events', 'it', 'becomes', 'necessary')
('for', 'one', 'People', 'to', 'dissolve', 'the', 'Political', 'Bands', 'which', 'have')
('connected', 'them', 'with', 'another', 'and', 'to', 'assume', 'among', 'the', 'Powers')
('of', 'the', 'Earth', 'the', 'separate', 'and', 'equal', 'Station', 'to', 'which')
('the', 'Laws', 'of', 'Nature', 'and', 'of', 'Nature', 's', 'God', 'entitle')
('them', 'a', 'decent', 'Respect', 'to', 'the', 'Opinions', 'of', 'Mankind', 'requires')
('that', 'they', 'should', 'declare', 'the', 'causes', 'which', 'impel', 'them', 'to')
('the', 'Separation', None, None, None, None, None, None, None, None)
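If you would rather have space-joined substrings than padded tuples, you can drop the None fill values that izip_longest adds when consuming the groups (same Python 2 style as above; this replaces the final loop):
for g in word_groups:
    print ' '.join(w for w in g if w is not None)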
