How to extract acronyms and abbreviations from pandas dataframe? - python

I have a pandas dataframe with a column of text data. I want to extract all unique acronyms and abbreviations from that text column.
So far I have a function that extracts all acronyms and abbreviations from a given text.
import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)
        eaa[abbr] = definition
    return eaa
extract_acronyms_abbreviations(a)
{'FHH': 'family health history', 'NP': 'nurse practitioner'}
I want to apply/extract all unique acronyms and abbreviations from the text column.
Sample Data:
s = """The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People's Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). This trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI."""
k = """The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation."""
j = """The key global provider of sustainable packaging solutions has now taken a significant step towards reaching these ambitions by signing two 10-year virtual Power Purchase Agreements (VPPA) with global renewable energy developer BayWa r.e covering its operations in Europe. The agreements form the largest solar VPPA for the packaging industry in Europe, as well as the first major solar VPPA by a Finnish company."""
a = """Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."""
import pandas as pd
data = {"text":[s,k,j,a,s,k,j]}
df = pd.DataFrame(data)
Desired output
{'MSWC': 'Multilingual Spoken Words Corpus',
'DCAI': 'proponent of Data-Centric AI',
'VPPA': 'virtual Power Purchase Agreements',
'NP': 'nurse practitioner',
'FHH': 'family health history'}

Let's say df['text'] contains the text data you want to work with.
df["acronyms"] = df["text"].apply(extract_acronyms_abbreviations)
# This creates a new column containing the dictionary returned by your function.
Now create a master dict:
master_dict = dict()
for d in df["acronyms"].values:
    master_dict.update(d)
print(master_dict)
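A compact end-to-end sketch of the same idea, using toy data and deduplicating the texts first (the sample dataframe repeats rows, so each unique text only needs to be processed once):

```python
import re
import pandas as pd

def extract_acronyms_abbreviations(text):
    """Map each parenthesised acronym to the preceding len(acronym) words."""
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()[-len(abbr):]
        eaa[abbr] = " ".join(words)
    return eaa

df = pd.DataFrame({"text": [
    "Although family health history (FHH) is commonly accepted...",
    "rarely considered by a nurse practitioner (NP).",
    "Although family health history (FHH) is commonly accepted...",  # duplicate row
]})

# Process each unique text once, then merge the per-row dicts.
master_dict = {}
for d in df["text"].drop_duplicates().apply(extract_acronyms_abbreviations):
    master_dict.update(d)
print(master_dict)  # {'FHH': 'family health history', 'NP': 'nurse practitioner'}
```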

Related

Using regex to capture substring within a pandas df

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' statement following the capitalized name that ends with the second comma. I've tried isolating the rest of the string that follows the name with the code below, but it just doesn't seem to work; I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.
The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
precis name location
0 Appointment of... DAVID MERRIGAN Hammonds Plains, Nova Scotia
1 Appointment of... CARLA R. CONKIN Fort Steele, British Columbia
2 Appointment of... JUDY A. WHITE, Q.C., Conne River, Newfoundland and...
3 Appointment of... GRETA SITTICHINLI Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
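For reference, a self-contained sketch of the same idea with named capture groups (one sample row only), so `str.extract` labels the columns directly:

```python
import pandas as pd

# One sample row from the question; the pattern is the one given above,
# rewritten with named groups so str.extract names the output columns.
df = pd.DataFrame({"precis": [
    "Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, "
    "to be a member of the Inuvialuit Arbitration Board, to hold office "
    "during pleasure for a term of three years."
]})
rgx = r'(?:\w\s)+(?P<name>[A-Z\s\.,]+)(?:\sof\s)(?P<location>[A-Za-z\s]+,\s[A-Za-z\s]+)'
out = df["precis"].str.extract(rgx)
print(out)
```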
# Final Answer
import pandas as pd

data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")
# Split on 'of' at most twice; the part after the second 'of' holds the location.
data[['Field_Part1', 'Field_Part2', 'Field_Part3']] = data['Precis'].str.split('of', n=2, expand=True)
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1'] + ',' + data['Address_part2']
data.drop(['Field_Part1', 'Field_Part2', 'Field_Part3', 'Address_part1', 'Address_part2'], axis=1, inplace=True)
# Output Below
data
An easy way to understand it.
Thanks,
Leon

BB data extraction - limited set of data

I am trying to extract some data from the BB Finance API via RapidAPI and am having difficulty extracting the data I need.
I have saved responses in several formats (txt, csv, xlsx, json) and have tried to pull the data from them, but cannot find a way to get it.
This is an example:
"title": "Bloomberg Markets",
"modules": [
  {
    "id": "hero_1",
    "title": null,
    "skipDedup": false,
    "type": "single_story",
    "tracking": {
      "id": "36ba25c6-8f14-472a-8068-75b40b7ec2dc",
      "title": "Hero"
    },
    "stories": [
      {
        "id": "2022-02-15/stocks-set-to-rise-on-prospect-of-easing-tension-markets-wrap",
        "internalID": "R7D425T0AFB401",
        "title": "Rally Falters on Return of Russia, Inflation Risks: Markets Wrap",
        "summary": "",
        "autoGeneratedSummary": "The global stock rally stalled on Wednesday as traders struggled to evaluate the risk of geopolitical tension in Ukraine and the impact of rising inflation on central bank policies.",
        "abstract": [
          "Treasuries, gold steady; Italian bond yields at two-year high",
          "Biden cautious on Ukraine as Russia says some troops withdrawn"
I am interested in:
stories: title,
stories: summary,
stories: autoGeneratedSummary &
stories: abstract
but cannot find a way to just extract that info. From there I was thinking of bringing it into a dataframe so that I can work with it.
Is there a way to do so?
thanks
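Assuming the response parses as JSON, one way (a sketch, with a minimal stand-in payload since the full response isn't shown) is to walk modules -> stories and collect just those fields into a dataframe:

```python
import json
import pandas as pd

# Minimal stand-in for the API response (structure inferred from the dump above).
raw = json.dumps({
    "title": "Bloomberg Markets",
    "modules": [{
        "id": "hero_1",
        "stories": [{
            "title": "Rally Falters on Return of Russia, Inflation Risks: Markets Wrap",
            "summary": "",
            "autoGeneratedSummary": "The global stock rally stalled on Wednesday...",
            "abstract": [
                "Treasuries, gold steady; Italian bond yields at two-year high",
                "Biden cautious on Ukraine as Russia says some troops withdrawn",
            ],
        }],
    }],
})

payload = json.loads(raw)
rows = [
    {
        "title": story.get("title"),
        "summary": story.get("summary"),
        "autoGeneratedSummary": story.get("autoGeneratedSummary"),
        "abstract": " | ".join(story.get("abstract", [])),
    }
    for module in payload.get("modules", [])
    for story in module.get("stories", [])
]
df = pd.DataFrame(rows)
print(df)
```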

Print texts that have cosine similarity score less than 0.90

I want to create a deduplication process for my database.
I want to measure cosine similarity scores, using Python's scikit-learn library, between new texts and texts that are already in the database.
I want to add only documents that have a cosine similarity score of less than 0.90.
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
list_of_texts_in_database = ["More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
"UK ministers say sanctions could target companies and individuals linked to the Russian government.",
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
"He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
"Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
"The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
"A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
"That could lead to a major escalation in conflict."]
list_of_new_texts = ["This is a totaly new document that needs to be added into the database one way or another.",
"Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
"Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
"UK minister say sanctions could target companies and individuals linked to the Russian government.",
"That could lead to a major escalation in conflict."]
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words = None, ngram_range=(1, 1))
list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)
cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)
This code works well, but I do not know how to map the results (how to get the texts that have a similarity score of less than 0.90).
My suggestion would be as follows: only add those texts with a score less than (or equal to) 0.9.
import numpy as np
idx = np.where((cosineSimilarities <= 0.9).all(axis=1))
Then you have the indices of the new texts in list_of_new_texts that do not have a corresponding text with a score of > 0.9 in the already existing list list_of_texts_in_database.
To combine them, you can do the following (although somebody else might have a cleaner method):
print(
list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)
Output:
['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
'That could lead to a major escalation in conflict.',
'This is a totaly new document that needs to be added into the database one way or another.',
'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
'UK minister say sanctions could target companies and individuals linked to the Russian government.']
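The same filter can also be written as a row-wise max over the similarity matrix. A toy matrix stands in for cosineSimilarities here, since the idea is independent of the actual TF-IDF values:

```python
import numpy as np

new_texts = ["brand new doc", "near-duplicate of a database text", "another new doc"]
# Toy stand-in for the similarity matrix: rows = new texts, columns = database texts.
sims = np.array([[0.10, 0.20],
                 [0.95, 0.30],
                 [0.00, 0.50]])

# A new text is safe to add if its best match in the database is still <= 0.9.
keep = sims.max(axis=1) <= 0.9
to_add = [text for text, ok in zip(new_texts, keep) if ok]
print(to_add)  # ['brand new doc', 'another new doc']
```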
Why don't you work within a dataframe?
import pandas as pd
# Note: this pairs each new text with one database text (row-to-row, not all pairs),
# so the two columns must be the same length -- hence the [:5] slices.
d = {'old_text': list_of_texts_in_database[:5],
     'new_text': list_of_new_texts,
     'old_emb': list(list_of_texts_in_database_tfidf[:5]),
     'new_emb': list(list_of_new_texts_tfidf)}
df = pd.DataFrame(data=d)
df['score'] = df.apply(lambda row: cosine_similarity(row['old_emb'], row['new_emb'])[0][0], axis=1)
df = df.loc[df.score < 0.9]   # keep only the pairs below the 0.90 threshold
df.head()

How to break long strings in python?

I have a list of very long strings, and I want to know if there's any way to split each string after a certain number of words, like in the image at the bottom.
List = ["End extreme poverty in all forms by 2030",
        "End hunger, achieve food security and improved nutrition and promote sustainable agriculture",
        "Ensure healthy lives and promote well-being for all at all ages",
        "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all",
        "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all"]
print(List)
Out:
"End extreme poverty in all forms by 2030",
"End hunger, achieve food security and improved
nutrition and promote sustainable agriculture",
"Ensure healthy lives and promote well-being for all at all ages",
"Ensure inclusive and equitable quality education and promote
lifelong learning opportunities for all",
"Promote sustained, inclusive and sustainable economic growth,
full and productive employment and decent work for all"]
The final result should look roughly like the image.
You can use textwrap:
import textwrap
print(textwrap.fill(List[1], 50))
End hunger, achieve food security and improved
nutrition and promote sustainable agriculture
If you want to break it in terms of number of words, you can:
Split the string into a list of words:
my_list = my_string.split()
Then you can break it into K chunks using numpy:
import numpy as np
my_chunks = np.array_split(my_list , K) #K is number of chunks
Then, to get a chunk back into a string, you can do:
my_string = " ".join(my_chunks[0])
Or to get all of them to strings, you can do:
my_strings = list(map(" ".join, my_chunks))
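Putting the textwrap approach together for the whole list (width 50, matching the example above; only the first two strings are shown for brevity):

```python
import textwrap

goals = [
    "End extreme poverty in all forms by 2030",
    "End hunger, achieve food security and improved nutrition and promote sustainable agriculture",
]
# Wrap every string to lines of at most 50 characters.
wrapped = [textwrap.fill(s, width=50) for s in goals]
print(wrapped[1])
```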

Compensating for "variance" in a survey

The title for this one was quite tricky.
I'm trying to solve the following scenario.
Imagine a survey was sent out to XXXXX people, asking them what their favourite football club is.
From the responses, it's obvious that while many favour the same club, they all "expressed" it in different ways.
For example,
For Manchester United, some variations include...
Man U
Man Utd.
Manchester U
Manchester Utd
All are obviously the same club; however, with a simple technique of just looking for an exact string match, each would be counted as a separate result.
Now, to complicate the scenario further, let's say that because of the sheer volume of different clubs (e.g. Man City appearing as M. City, Manchester City, etc.), each plagued with this problem, it's impossible to manually "enter" these variances and use them to build a custom filter that converts Man U -> Manchester United, Man Utd. -> Manchester United, and so on. Instead we want to automate this filter, to look for the most likely match and convert the data accordingly.
I'm trying to do this in Python (from a .csv file), however I welcome any pseudo answers that outline a good approach to solving this.
Edit: Some additional information
This isn't working off a set list of clubs; the idea is to "cluster" the variants we have together.
The assumption is there are no spelling mistakes.
There is no assumed number of clubs.
And the survey list is long, long enough that it doesn't warrant doing this manually (1000s of responses).
Google Refine does just this, but I'll assume you want to roll your own.
Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.
You probably don't want to do it in a completely automated fashion. I'd do something like this:
# load corrections file, mapping user input -> output
# load survey
import difflib

possible_values = list(corrections.values())
for answer in survey:
    output = corrections.get(answer, None)
    if output is None:
        likely_outputs = difflib.get_close_matches(answer, possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)
save_corrections_as_csv(corrections)
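As a minimal illustration of difflib's fuzzy matching (the club list here is made up for the example, and the cutoff is loosened from the 0.6 default so that heavy abbreviations still match):

```python
import difflib

# Hypothetical normalised answers; get_close_matches returns the candidates
# ranked by similarity, best first.
clubs = ["manchester united", "manchester city", "arsenal", "tottenham hotspur"]
matches = difflib.get_close_matches("man utd", clubs, n=2, cutoff=0.3)
print(matches)
```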
Please edit your question with answers to the following:
You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?
How many clubs?
How many survey responses?
After doing very light normalisation (replace . by space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]) and counting, how many unique responses do you have?
Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?
You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?
What are your criteria for success in this exercise, and how will you measure it?
It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, then comparing the start of each word.
If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match the first n letters of any actual team, you could try lexicographical comparison against any of the returned strings which actually do match.
It's not perfect, but it should get you 99% of the way there.
import string

def words(s):
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    score, best = 0., ''
    for matchWord in matchWords:
        # Fraction of positions where the characters agree, biased toward longer words.
        matchScore = sum(w == m for w, m in zip(word, matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score, best = matchScore, matchWord
    return score, best

def bestMatchingSentence(wordList, matchSentences):
    score, best = 0., []
    for matchSentence in matchSentences:
        total, matched = 0., []
        for word in wordList:
            s, w = bestMatchingWord(word, matchSentence)
            total += s
            matched.append(w)
        if total > score:
            score, best = total, matched
    return score, best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd"
    )
    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers')
    )
    for d in data:
        print("{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList)))

if __name__ == "__main__":
    main()
Run on the sample data, this gets you:
Man U (1.9867767507647776, ['manchester', 'united'])
Man. Utd. (1.7448074166742613, ['manchester', 'utd'])
Manch Utd (1.9946817328797555, ['manchester', 'utd'])
Manchester U (1.989100008901989, ['manchester', 'united'])
Manchester Utd (1.9956787398647866, ['manchester', 'utd'])
