Correct Regex for Acronyms in Python

I want to find so-called acronyms in text. Is this the correct way of defining the regex for it? My idea is that if something starts with a capital letter and ends with a capital letter, it is an acronym. Is this correct?
import re
test_string = ("Department of Something is called DOS, "
               "or DoS, or (DiS) or D.O.S. in United State of America, U.S.A./ USA")
pattern3 = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
print(re.findall(pattern3, test_string))
and the output is:
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']

I think you can use the word boundary anchor \b for what you want to do:
>>> regex = r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?"
>>> re.findall(regex, "AbIA AoP U.S.A.")
['AbIA', 'AoP', 'U.S.A.']
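As a quick check of my own (not part of the answer), the same pattern also recovers every acronym from the question's test string:
>>> test_string = ("Department of Something is called DOS, "
...                "or DoS, or (DiS) or D.O.S. in United State of America, U.S.A./ USA")
>>> re.findall(r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?", test_string)
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']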

Related

Regex - question about finding every word in a string that begins with a letter

import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. However, I side-tracked a bit and wanted to see if I could get a list of all the words in the string passed to random_regex.findall() that begin with a letter, so I wrote \w for the regex pattern. For some reason my output only prints "R" and not the rest of the letters in the string. Would anyone be able to explain why, or tell me how to fix this problem?
import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
A regex find all should work here:
inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words) # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
^ anchors the match at the start of the string, so the pattern matches only the first character, 'R'. Use \w+ to match whole runs of word characters. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
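If, as the related question's title suggests, you want only the words that begin with a certain kind of letter, anchor on a word boundary rather than on the start of the string. A minimal sketch of my own, keeping only capitalized words:

import re

text = 'RoboCop eats baby food. BABY FOOD.'
# \b anchors at every word boundary, unlike ^ which anchors once at the start
print(re.findall(r'\b[A-Z]\w*', text))  # ['RoboCop', 'BABY', 'FOOD']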

Wildcard matching with looping re

I'm trying to improve the matching expression of this code so that it matches spaces before or after the string and also ignores the case. The goal is to output the shortened state abbreviation.
import re

s = "new South Wales "
for r in (("New South Wales", "NSW"), ("Victoria", "VIC"), ("Queensland", "QLD"),
          ("South Australia", "SA"), ("Western Australia", "WA"),
          ("Northern Territory", "NT"), ("Tasmania", "TAS"),
          ("Australian Capital Territory", "ACT")):
    s = s.replace(*r)
output = {'state': s}
print(output)
I've figured out the regex to do this:
(?i)(?<!\S)New South Wales(?!\S)
which will match with or without spaces on either side of string and also ignores case. Can anyone help me update my original code to include the new regex?
If I were you, I would just strip() the string before passing it in and use something like re.sub(), which we can tell to ignore the case using flags=re.IGNORECASE, like below.
import re

s = " new South Wales ".strip()
for r in (("New South Wales", "NSW"), ("Victoria", "VIC"), ("Queensland", "QLD"),
          ("South Australia", "SA"), ("Western Australia", "WA"),
          ("Northern Territory", "NT"), ("Tasmania", "TAS"),
          ("Australian Capital Territory", "ACT")):
    _regex = '{0}|{1}'.format(r[0], r[1])
    if re.match(_regex, s, flags=re.IGNORECASE):
        subbed_string = re.sub(r[0], r[1], s, flags=re.IGNORECASE)
        print({'state': subbed_string.upper()})
Additionally, I have added a check for a match before trying to substitute in the value; otherwise you could output the wrong result. For example, without the check, the ('Tasmania', 'TAS') iteration would still print {'state': 'NEW SOUTH WALES'} even though nothing matched.
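If you would rather keep the lookaround regex from the question, here is a minimal sketch of my own (variable names are mine) that builds that pattern for each state:

import re

states = (("New South Wales", "NSW"), ("Victoria", "VIC"), ("Queensland", "QLD"),
          ("South Australia", "SA"), ("Western Australia", "WA"),
          ("Northern Territory", "NT"), ("Tasmania", "TAS"),
          ("Australian Capital Territory", "ACT"))

s = " new South Wales "
for full_name, abbrev in states:
    # The question's pattern: case-insensitive, and the name must not be
    # glued to other non-whitespace characters on either side.
    pattern = r'(?i)(?<!\S)' + re.escape(full_name) + r'(?!\S)'
    if re.search(pattern, s):
        print({'state': abbrev})  # {'state': 'NSW'}
        break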

keep trailing punctuation in python nltk.word_tokenize

There's a ton available about removing punctuation, but I can't seem to find anything keeping it.
If I do:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']
the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']
I'd like this to always perform as the second case. For now, I'm hackishly doing:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")
since I feel pretty confident about throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?
It is a quirk of spelling that when a sentence ends with an abbreviated word, we write only one period, not two. NLTK's tokenizer doesn't "remove" it; it splits it off, because sentence structure ("a sentence must end with a period or other suitable punctuation") matters more to NLP tools than consistent representation of abbreviations. The tokenizer is smart enough to recognize most abbreviations, which is why it doesn't separate the period from L.P. mid-sentence.
Your solution with ||| results in inconsistent sentence structure, since you now have no sentence-final punctuation. A better solution would be to add the missing period only after abbreviations. Here's one way to do this, ugly but as reliable as the tokenizer's own abbreviation recognizer:
import nltk

test_str = "Some Co Inc. Other Co L.P."
toks = nltk.word_tokenize(test_str + " .")
if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
    pass  # keep the added period
else:
    toks = toks[:-1]
PS. The solution you have accepted will completely change the tokenization, leaving all punctuation attached to the adjacent word (along with other undesirable effects like introducing empty tokens). This is most likely not what you want.
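Wrapped into a small helper (my own sketch; it assumes NLTK's punkt tokenizer data is installed):

import nltk

def tokenize_keep_trailing_period(text):
    # Append a sentence-final period so that a trailing abbreviation is no
    # longer sentence-final, then drop the extra token unless the text
    # really did end in an abbreviation such as 'L.P.'.
    toks = nltk.word_tokenize(text + " .")
    if not (len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith(".")):
        toks = toks[:-1]
    return toks

print(tokenize_keep_trailing_period("Some Co Inc. Other Co L.P."))
# expected: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', '.']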
Could you use re?
import re
test_str = "Some Co Inc. Other Co L.P."
print(re.split(r'\s', test_str))
This will split the input string based on spacing, retaining your punctuation.

Tokenization which works with terms that contain whitespace in Python?

My standard approach to tokenize a text using a regex in Python is this:
>>> import re
>>> text = "Los Angeles is in California"
>>> tokens = re.findall(r'\w+', text)
>>> tokens
['Los', 'Angeles', 'is', 'in', 'California']
A problem arises if I want to find the name Los Angeles in the above text.
What is the best way to find a needle which contains whitespace in a haystack?
I am asking a general question, because the solution should also work for a case like United States of America and for needles which don't contain whitespace.
For example, a simple if "Los Angeles" in text check would not do, because if "for" in text would also return a match ("for" is a substring of "California"). I am looking for full-word matches only.
I suggest using a text parser like NLTK for such tasks. But for this case you can use the following regex:
>>> re.findall(r'\b([A-Z]\w+ [A-Z]\w+)|(\w+)\b',text)
[('Los Angeles', ''), ('', 'is'), ('', 'in'), ('', 'California')]
The regex r'([A-Z]\w+ [A-Z]\w+)|(\w+)' has two alternated groups: the first matches a pair of adjacent capitalized words, and the second matches a single word.
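Because the pattern alternates two groups, findall returns tuple pairs; a small list comprehension of my own flattens them back into plain tokens:
>>> [a or b for a, b in re.findall(r'\b([A-Z]\w+ [A-Z]\w+)|(\w+)\b', text)]
['Los Angeles', 'is', 'in', 'California']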
The solution turned out to be simple:
re.search(r'\b'+needle+r'\b', haystack)
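For example, wrapped in a helper with re.escape() added (my addition, in case the needle contains regex metacharacters):

import re

text = "Los Angeles is in California"

def contains_term(needle, haystack):
    # \b requires a word boundary on both sides, so substrings of longer
    # words (e.g. 'for' inside 'California') do not count as matches
    return re.search(r'\b' + re.escape(needle) + r'\b', haystack) is not None

print(contains_term("Los Angeles", text))  # True
print(contains_term("for", text))          # False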

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same
unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the set ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]. That's not what you want at all. So:

- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.
Try this:
import re

subject = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group(); captured name: match.group(1)
    print(match.group(1))
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
You can replace your pattern with:
regex = ur"\[P\]([\w\s]+)\[\/P\]"
Use this pattern:
pattern = r'\[P\].+?\[\/P\]'
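Note that this last pattern has no capturing group, so findall returns the full tagged spans rather than just the names:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\].+?\[\/P\]', line)
['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']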
