Program to grab abbreviations and definitions - trouble getting all lower case abbreviations - python

I have a program that grabs abbreviations (i.e., looks for words enclosed in parentheses) and then, based on the number of characters in the abbreviation, goes back that many words to build the definition. So far it works when the preceding words all start with capital letters, or when most of them do; in the latter case it skips lowercase words like "in" and moves on to the next word. My problem is the case where the corresponding words are all lowercase.
Current Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
Trials (IMMPACT). Some patient prefer the usual care (UC)
Desired Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
usual care (UC)
import re
s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing"""
allabbre = []
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    count=0
    # Walk backwards, counting only words that start with a capital letter.
    for k,i in enumerate(words[::-1]):
        if i[0].isupper():count+=1
        if count==size:break
    words=words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern='[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
        print(abbr_keywords)

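Walk backwards through the preceding words and match the first letter of each word against the letters of the abbreviation, last letter first. The flag is used for rare cases like All are Awesome Dudes (AAD): when the word right before the parenthesis is capitalized, first letters are compared case-sensitively, so the lowercase "are" is skipped; otherwise each first letter is uppercased before comparing.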
import re
s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing
"""
allabbre = []
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    count=size-1  # index of the abbreviation letter still to be matched
    flag=words[-1][0].isupper()  # True: compare first letters case-sensitively
    for k,i in enumerate(words[::-1]):
        first_letter=i[0] if flag else i[0].upper()
        if first_letter==abbr[count]:count-=1
        if count==-1:break
    words=words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern='[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
        print(abbr_keywords)
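The same backward letter-matching idea can be wrapped in a function that returns the results instead of printing them. This is only a sketch: the name extract_abbreviations and the decision to skip abbreviations with no preceding words are assumptions, not part of the answer above.

```python
import re

def extract_abbreviations(text):
    """Return 'definition (ABBR)' strings using backward first-letter matching."""
    results = []
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Skip parenthesised text without a capital letter, or with nothing before it.
        if not words or not re.search(r"[A-Z]", abbr):
            continue
        # Compare case-sensitively only when the word right before "(" is capitalised.
        case_sensitive = words[-1][0].isupper()
        remaining = len(abbr) - 1  # index of the abbreviation letter still to match
        for taken, word in enumerate(reversed(words), start=1):
            first = word[0] if case_sensitive else word[0].upper()
            if first == abbr[remaining]:
                remaining -= 1
            if remaining < 0:
                break
        entry = " ".join(words[-taken:]) + " (" + abbr + ")"
        if entry not in results:
            results.append(entry)
    return results

sample = (
    "Too many people, but not All Awesome Dudes (AAD) only care about the\n"
    "Initiative on Methods, Measurement, and Pain Assessment in Clinical\n"
    "Trials (IMMPACT). Some patient prefer the usual care (UC) approach of\n"
    "doing nothing"
)
results = extract_abbreviations(sample)
```

As in the original loop, if the letters never all match, the whole preceding text is returned as the definition, so the result is only as good as the assumption that each abbreviation letter starts a word.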

Related

Delete based on presence

I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters, cutting off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?
Here's another approach that loops through each sentence using a regex. We keep a line count and once we're in the last 200 characters we check for 'Bill' in the line. If found, we exclude from this line onward.
Hope the code is readable enough.
import re
def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        #Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy
stringy = remove_bill(stringy)
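For experimenting with the window size, the same idea can be sketched with the target and window as parameters. The function name truncate_at_late_match and the simplified sentence pattern (any run of text ending in ., ! or ?) are assumptions made for this illustration; it uses match offsets rather than a running character count.

```python
import re

def truncate_at_late_match(text, target, window=200):
    """Drop the first sentence that contains `target` inside the last `window`
    characters, plus everything after it. Return the text unchanged otherwise."""
    cutoff = len(text) - window
    # Each match is one sentence including its trailing whitespace.
    for sentence in re.finditer(r"[^.!?]*[.!?]+\s*", text):
        for hit in re.finditer(re.escape(target), sentence.group()):
            if sentence.start() + hit.start() >= cutoff:
                return text[:sentence.start()]
    return text

sample = "This is Bill here. Some filler sentence. This is Bill, signing off. Bye!"
trimmed = truncate_at_late_match(sample, "Bill", window=40)
untouched = truncate_at_late_match(sample, "Bill", window=10)
```

With a 40-character window the second "Bill" falls inside the window, so the text is cut before that sentence; with a 10-character window neither occurrence qualifies and the text comes back unchanged.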
Here is how you can use re:
import re
stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)
for i in range(len(l)-1,0,-1):
    if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
        stringy = ' '.join(l[:i])
print(stringy)

Retrieve definition for parenthesized abbreviation, based on letter count

I need to retrieve the definition of an acronym based on the number of letters enclosed in parentheses. For the data I'm dealing with, the number of letters in parentheses corresponds to the number of words to retrieve. I know this isn't a reliable method for getting abbreviations, but in my case it will be. For example:
String = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
Desired output: family health history (FHH), nurse practitioner (NP)
I know how to extract parentheses from a string, but after that I am stuck. Any help is appreciated.
import re
a = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely considered '
     'by a nurse practitioner (NP).')
x2 = re.findall(r'(\(.*?\))', a)
for x in x2:
    length = len(x)
    print(x, length)
Use the regex match to find the position of the start of the match. Then use Python string indexing to get the substring leading up to the start of the match. Split the substring into words and take the last n words, where n is the length of the abbreviation.
import re
s = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()[-size:]
    definition = " ".join(words)
    print(abbr, definition)
This prints:
FHH family health history
NP nurse practitioner
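If the pairs are needed later rather than just printed, the same steps can be collected into a dictionary. This is a sketch under the question's stated assumption that the number of letters equals the number of words; the function name and the restriction to letter-only abbreviations are choices made here, not part of the answer.

```python
import re

def last_n_words_definitions(text):
    """Map each parenthesised abbreviation to the n words before it,
    where n is the number of characters in the abbreviation."""
    found = {}
    for match in re.finditer(r"\(([A-Za-z]+)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        found[abbr] = " ".join(words[-len(abbr):])
    return found

s = ('Although family health history (FHH) is commonly accepted as an important '
     'risk factor for common, chronic diseases, it is rarely considered by a '
     'nurse practitioner (NP).')
definitions = last_n_words_definitions(s)
```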
Does this solve your problem?
a = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
splitstr=a.replace('.','').split(' ')
output=''
for i,word in enumerate(splitstr):
    if '(' in word:
        w=word.replace('(','').replace(')','').replace('.','')
        for n in range(len(w)+1):
            output=splitstr[i-n]+' '+output
print(output)
Actually, Keatinge beat me to it.
An idea, to use a recursive pattern with PyPI regex module.
\b[A-Za-z]+\s+(?R)?\(?[A-Z](?=[A-Z]*\))\)?
See this pcre demo at regex101
\b[A-Za-z]+\s+ matches a word boundary, one or more alpha, one or more white space
(?R)? recursive part: optionally paste the pattern from start
\(? need to make the parenthesis optional for recursion to fit in \)?
[A-Z](?=[A-Z]*\)) match one upper alpha if followed by closing ) with any A-Z in between
Does not check if the first word letter actually match the letter at position in the abbreviation.
Does not check for an opening parenthesis in front of the abbreviation. To check, add a variable length lookbehind. Change [A-Z](?=[A-Z]*\)) to (?<=\([A-Z]*)[A-Z](?=[A-Z]*\)).
Using re with list-comprehension
x_lst = [ str(len(i[1:-1])) for i in re.findall(r'(\(.*?\))', a) ]
[re.search( r'(\S+\s+){' + i + r'}\(.{' + i + r'}\)', a).group(0) for i in x_lst]
#['family health history (FHH)', 'nurse practitioner (NP)']
This solution isn't particularly clever. It simply searches for the acronyms and then builds up a pattern to extract the words ahead of each one:
import re
string = "Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."
definitions = []
for acronym in re.findall(r'\(([A-Z]+?)\)', string):
    length = len(acronym)
    match = re.search(r'(?:\w+\W+){' + str(length) + r'}\(' + acronym + r'\)', string)
    definitions.append(match.group(0))
print(", ".join(definitions))
OUTPUT
> python3 test.py
family health history (FHH), nurse practitioner (NP)
>

Count characters in string

So I'm trying to count the characters in anhCrawler and return the number of characters with and without spaces, along with the position of "DEATH STAR", and return it all in theReport. I can't get the numbers to count correctly either. Please help!
anhCrawler = """Episode IV, A NEW HOPE. It is a period of civil war. \
Rebel spaceships, striking from a hidden base, have won their first \
victory against the evil Galactic Empire. During the battle, Rebel \
spies managed to steal secret plans to the Empire's ultimate weapon, \
the DEATH STAR, an armored space station with enough power to destroy \
an entire planet. Pursued by the Empire's sinister agents, Princess Leia\
races home aboard her starship, custodian of the stolen plans that can \
save her people and restore freedom to the galaxy."""
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR occurs and starts at position {3}.
"""
def analyzeCrawler(thetext):
numchars = 0
nospacechars = 0
numspacechars = 0
anhCrawler = thetext
word = anhCrawler.split()
for char in word:
numchars = word[numchars]
if numchars == " ":
numspacechars += 1
anhCrawler = re.split(" ", anhCrawler)
for char in anhCrawler:
nospacechars += 1
numwords = len(anhCrawler)
pos = thetext.find("DEATH STAR")
char_len = len("DEATH STAR")
ds = thetext[261:271]
dspos = "[261:271]"
return theReport.format(numchars, nospacechars, numwords, dspos)
print analyzeCrawler(theReport)
You're overthinking this problem.
Number of chars in string (returns 520):
len(anhCrawler)
Number of non-whitespace characters in string (split removes all whitespace, and join glues the pieces back into a string with no whitespace) (returns 434):
len(''.join(anhCrawler.split()))
Finding the position of "DEATH STAR" (returns 261):
anhCrawler.find("DEATH STAR")
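A quick way to check these three expressions is to run them on a short string where the answers can be counted by hand. The text below is just a fragment chosen for illustration, and the variable names are arbitrary.

```python
text = "the DEATH STAR, an armored space station"

total_chars = len(text)                        # every character, spaces included
non_space_chars = len("".join(text.split()))   # split() drops all whitespace
word_count = len(text.split())                 # whitespace-separated tokens
position = text.find("DEATH STAR")             # 0-based index, -1 if absent

print(total_chars, non_space_chars, word_count, position)  # 40 34 7 4
```

Note that find returns a 0-based index and -1 when the phrase is missing, so a report should handle that case if the phrase is not guaranteed to be present.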
Here is a simplified version of your function:
import re
def analyzeCrawler2(thetext, text_to_search="DEATH STAR"):
    # Work on the argument rather than the global anhCrawler.
    numchars = len(thetext)
    nospacechars = len(re.sub(r"\s+", "", thetext))
    numwords = len(thetext.split())
    dspos = thetext.find(text_to_search)
    return theReport.format(numchars, nospacechars, numwords, dspos)
print(analyzeCrawler2(anhCrawler))
This text contains 520 characters (434 if you ignore spaces).
There are approximately 87 words in the text. The phrase
DEATH STAR occurs and starts at position 261.
I think the tricky part is removing the whitespace from the string to get the non-space character count. This can be done simply with a regular expression. The rest should be self-explanatory.
First off, you need to indent the code that's inside a function. Second... your code can be simplified to the following:
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR is the {3}th word and starts at the {4}th character.
"""
def analyzeCrawler(thetext):
    numchars = len(thetext)
    nospacechars = len(thetext.replace(' ', ''))
    numwords = len(thetext.split())
    word = 'DEATH STAR'
    charPosition = thetext.find(word)
    # 'DEATH STAR' spans two words, so split().index(word) would raise
    # ValueError; count the words that precede the phrase instead.
    wordPosition = len(thetext[:charPosition].split())
    return theReport.format(
        numchars, nospacechars, numwords, wordPosition, charPosition
    )
I modified the last two format arguments because it wasn't clear what you meant by dspos, although maybe it's obvious and I'm not seeing it. In any case, I included the word and char position instead. You can determine which one you really meant to include.

Regex to split up message_txt over 160 characters

I am trying to split message text for a messaging system up into at most 160 character long sequences that end in spaces, unless it is the very last sequence, then it can end in anything as long as it is equal to or less than 160 characters.
This re expression '.{1,160}\s' almost works; however, it cuts off the last word of a message, because generally the last character of a message is not a space.
I also tried '.{1,160}\s|.{1,160}' but this does not work because the final sequence is just the remaining text after the last space. Does anyone have an idea on how to do this?
EXAMPLE:
two_cities = ("It was the best of times, it was the worst of times, it was " +
"the age of wisdom, it was the age of foolishness, it was the " +
"epoch of belief, it was the epoch of incredulity, it was the " +
"season of Light, it was the season of Darkness, it was the " +
"spring of hope, it was the winter of despair, we had " +
"everything before us, we had nothing before us, we were all " +
"going direct to Heaven, we were all going direct the other " +
"way-- in short, the period was so far like the present period," +
" that some of its noisiest authorities insisted on its being " +
"received, for good or for evil, in the superlative degree of " +
"comparison only.")
chunks = re.findall('.{1,160}\s|.{1,160}', two_cities)
print(chunks)
will return
['It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of ',
'incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we ',
'had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, ',
'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison ',
'only.']
where the final element of the list should be
'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'
not 'only.'
Try this - .{1,160}(?:(?<=[ ])|$)
.{1,160} # 1 - 160 chars
(?:
(?<= [ ] ) # Lookbehind, must end with a space
| $ # or, be at End of String
)
Info -
By default, the engine will try to match 160 characters (greedily).
Then it checks the next part of the expression.
The lookbehind enforces the last character matched with .{1,160} is a space.
Or, if at the end of the string, no enforcement.
If the lookbehind fails, and not at the end of string, the engine will backtrack to 159 characters, then check again. This repeats until the assertion passes.
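To watch the backtracking on something small, the pattern can be tried with a much shorter limit. The helper name split_message and the limit parameter are assumptions added for this sketch; it also assumes the message contains no newlines (the dot does not match them) and no single word longer than the limit.

```python
import re

def split_message(text, limit=160):
    """Split text into chunks of at most `limit` characters that end in a
    space, except the last chunk, which may end at the end of the string."""
    pattern = r".{1,%d}(?:(?<=[ ])|$)" % limit
    return re.findall(pattern, text)

message = "aaa bbb ccc ddd"
print(split_message(message, limit=8))  # ['aaa bbb ', 'ccc ddd']
print(split_message(message, limit=7))  # ['aaa ', 'bbb ', 'ccc ddd']
```

With a limit of 7 the engine first grabs "aaa bbb", finds that it neither ends in a space nor sits at the end of the string, and backtracks to "aaa ". Because every chunk keeps its trailing space, joining the chunks reproduces the original message.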
You can avoid a regular expression here, since regular expressions can be inefficient.
I would recommend something like this:
chunks = []
current = ""
for word in two_cities.split(" "):
    piece = word + " "
    # Start a new chunk when this word would push it past 160 characters.
    if current and len(current) + len(piece) > 160:
        chunks.append(current)
        current = ""
    current += piece
if current:
    chunks.append(current)
print(chunks)
This creates a list of all the words, split on spaces.
If the word will fit onto the string, it will add it onto the string. When it cannot, it adds it to the list and starts a new string. At the end, you have a list of the results.
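If trailing spaces on each chunk are not required, the standard library's textwrap.wrap does a similar greedy word packing. This is a different technique from the loop described above, offered only as a sketch: wrap strips the whitespace at chunk boundaries, and by default it may also break after hyphens or inside words longer than the width. The sample text and the width of 8 are chosen just to keep the output short.

```python
import textwrap

message = "aaa bbb ccc ddd"

# Each returned chunk is at most `width` characters long.
chunks = textwrap.wrap(message, width=8)
print(chunks)  # ['aaa bbb', 'ccc ddd']
```

For the messaging case the call would use width=160; the chunks then need a space added back if they are ever rejoined.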

Python: Match strings for certain terms

I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client .PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist
for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice: learn regular expressions. They give you a lot of power for text processing.
But, to give you a working solution (and a starting point for further exploration), try this:
import re
re_offers = re.compile(r'''
\b # Word boundary
(?: # Non capturing parenthesis
deals? # Deal or deals
| # or ...
offers? # Offer or offers
|
discount
|
promotion
|
sale
|
rs\.? # rs or rs.
|
inr\d+ # INR then digits
|
\d+inr # Digits then INR
) # And group
\b # Word boundary
| # or ...
\b\d+% # Digits (1 or more) then percent
|
\$\d+\b # Dollar then digits (didn't care of thousand separator yet)
''',
re.I|re.X) # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this, you can use any:
if any(term in tweet for term in search_terms):
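As a fuller sketch of the same substring test, the helper below returns which terms matched, so the any check becomes a test for a non-empty list. The function name, the sample tweets, and this particular term list are made up for illustration. Plain substring matching also hits inside longer words, for example "sale" inside "wholesale".

```python
def matching_terms(tweet, search_terms):
    """Return the search terms that occur anywhere in the lower-cased tweet."""
    lowered = tweet.lower()
    return [term for term in search_terms if term in lowered]

search_terms = ["deal", "offer", "discount", "promotion", "sale", "%"]

hits = matching_terms("Flat 20% DISCOUNT on shoes", search_terms)
print(hits)        # ['discount', '%']
print(bool(hits))  # True, same result as any(term in tweet.lower() for term in search_terms)
```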
In your array of things to search for you don't have a comma between " offers " and "discount" which is causing them to be joined together.
Also when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"] but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
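Both problems are easy to reproduce in isolation. The short lists below are cut-down stand-ins for ar1, used only to show the effect.

```python
# Adjacent string literals with no comma between them are concatenated.
terms = [" offers " "discount", "promotion"]
print(len(terms))  # 2
print(terms[0])    # ' offers discount'

# split() removes the whitespace, so space-padded terms can never equal a token.
tokens = "i have a deal".split()
print(tokens)             # ['i', 'have', 'a', 'deal']
print(" deal " in tokens) # False
print("deal" in tokens)   # True
```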
However, you might want to avoid regular expressions and just use in instead (you will still need the changes I suggest above, though):
if x in y:
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        # find returns -1 when x is absent, so compare explicitly.
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
