Using regex to capture substring within a pandas df - python

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase following the capitalized name that ends at the second comma. I've tried isolating the rest of the string that follows the name with the following code, but it doesn't seem to work; I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.

The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
               precis                  name                          location
0   Appointment of...        DAVID MERRIGAN      Hammonds Plains, Nova Scotia
1   Appointment of...       CARLA R. CONKIN     Fort Steele, British Columbia
2   Appointment of...  JUDY A. WHITE, Q.C.,  Conne River, Newfoundland and...
3   Appointment of...     GRETA SITTICHINLI     Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
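For example, here is a minimal, self-contained sketch of the same idea using named capture groups, so that str.extract names the new columns for you. The pattern below is an illustrative variant (not the one from the answer above) and assumes the column is called Precis:
import pandas as pd

# Two sample rows copied from the question; the column name 'Precis' is assumed.
df = pd.DataFrame({'Precis': [
    "Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member "
    "of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.",
    "Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman "
    "of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.",
]})

# Named groups become the new column names automatically.
rgx = r'Appointment of (?P<Name>[A-Z][A-Z\s.,]+?)\s+of\s+(?P<Location>[^,]+,[^,]+),\s+to be'
print(df['Precis'].str.extract(rgx))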

# Final Answer
import pandas as pd

data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")
# Split each precis on the first two occurrences of 'of'; the third part starts with the location.
data[['Field_Part1', 'Field_Part2', 'Field_Part3']] = data['Precis'].str.split('of', n=2, expand=True)
# The location is the first two comma-separated pieces of that third part.
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1'] + ',' + data['Address_part2']
data.drop(['Field_Part1', 'Field_Part2', 'Field_Part3', 'Address_part1', 'Address_part2'], axis=1, inplace=True)
# Output below
data
This is an easy way to understand it.
Thanks,
Leon

Related

Print texts that have cosine similarity score less than 0.90

I want to create a deduplication process for my database.
I want to measure cosine similarity scores with Python's sklearn library between new texts and texts that are already in the database.
I want to add only documents that have cosine similarity score less than 0.90.
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
list_of_texts_in_database = ["More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
"UK ministers say sanctions could target companies and individuals linked to the Russian government.",
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
"He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
"Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
"The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
"A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
"That could lead to a major escalation in conflict."]
list_of_new_texts = ["This is a totaly new document that needs to be added into the database one way or another.",
"Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
"Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
"UK minister say sanctions could target companies and individuals linked to the Russian government.",
"That could lead to a major escalation in conflict."]
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words = None, ngram_range=(1, 1))
list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)
cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)
This code works well, but I do not know how to map the results (how to get the texts that have a similarity score less than 0.90).
My suggestion would be as follows: you only add those texts with a score less than (or equal to) 0.9.
import numpy as np
idx = np.where((cosineSimilarities <= 0.9).all(axis=1))
Then you have the indices of the new texts in list_of_new_texts that do not have a corresponding text with a score of > 0.9 in the already existing list list_of_texts_in_database.
Combining them, you can do the following (although somebody else might have a cleaner method for this...):
print(
list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)
Output:
['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
'That could lead to a major escalation in conflict.',
'This is a totaly new document that needs to be added into the database one way or another.',
'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
'UK minister say sanctions could target companies and individuals linked to the Russian government.']
Why don't you work within a dataframe?
import pandas as pd
d = {'old_text':list_of_texts_in_database[:5], 'new_text':list_of_new_texts, 'old_emb': list_of_texts_in_database_tfidf[:5], 'new_emb': list_of_new_texts_tfidf}
df = pd.DataFrame(data=d)
df['score'] = df.apply(lambda row: cosine_similarity(row['old_emb'], row['new_emb'])[0][0], axis=1)
df = df.loc[df.score > 0.9, 'score']
df.head()

How to count unique words in python with function?

I would like to count unique words with a function. I define unique words as words that only appear once, which is why I used a set here. I've put the error below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw

unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
import re
import pandas as pd
from nltk.tokenize import word_tokenize

def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in the dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1
    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key]/total
    return word2count, corpus_text_train

test_dict, test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", which in the OP's case would count people and people, as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the , and ", you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to add more characters to be replaced.
s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
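As a side note on the TypeError in the question: re.findall expects a single string, while test_list_of_str returned by BOW_model_relative is a list of strings. A small sketch of two possible ways around that, reusing the split-based unique_words above:
# Count unique words across the whole corpus by joining the documents first...
print(unique_words(' '.join(test_list_of_str)))
# ...or count them per document.
print([unique_words(doc) for doc in test_list_of_str])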

The usage of radix sort

Let's say I have a file containing data on users and their favourite movies.
Ace: FANTASTIC FOUR, IRONMAN
Jane: EXOTIC WILDLIFE, TRANSFORMERS, NARNIA
Jack: IRONMAN, FANTASTIC FOUR
Based on that, the program I'm about to write returns the names of the users that like the same movies.
Since Ace and Jack like the same movies, they will be partners, hence the program would output:
Movies: FANTASTIC FOUR, IRONMAN
Partners: Ace, Jack
Jane would be exempted since she doesn't have anyone who shares the same interest in movies as her.
The problem I'm having now is figuring out how radix sort would help me achieve this; I've been thinking about it all day. I don't have much knowledge of radix sort, but I know that it compares elements one by one, and I'm terribly confused by cases such as FANTASTIC FOUR being listed first in Ace's data and second in Jack's data.
Would anyone kindly explain an algorithm I could understand to achieve this output?
Can you show us how you sort your lists? The quick and dirty code below gives me the same output for sorted Ace and Jack.
Ace = ["FANTASTIC FOUR", "IRONMAN"]
Jane = ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"]
Jack = ["IRONMAN", "FANTASTIC FOUR"]
sorted_Ace = sorted(Ace)
print (sorted_Ace)
sorted_Jack = sorted(Jack)
print (sorted_Jack)
You could start comparing elements one by one from here.
I made you a quick solution; it shows how you can proceed, but it's not optimized at all and not generalized.
Ace = ["FANTASTIC FOUR", "IRONMAN"]
Jane = ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"]
Jack = ["IRONMAN", "FANTASTIC FOUR"]

Movies = []
Partners = []

sorted_Ace = sorted(Ace)
sorted_Jane = sorted(Jane)
sorted_Jack = sorted(Jack)

for i in range(len(sorted_Ace)):
    if sorted_Ace[i] == sorted_Jack[i]:
        Movies.append(sorted_Ace[i])

if len(Movies) == len(sorted_Ace):
    Partners.append("Ace")
    Partners.append("Jack")

print(Movies)
print(Partners)
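To generalise this beyond two hard-coded users, one option (a sketch of the grouping step, not of radix sort itself) is to use each user's sorted movie list as a dictionary key, so that users with identical collections land in the same bucket:
from collections import defaultdict

# Hypothetical input structure: user -> list of favourite movies (data from the question).
favourites = {
    "Ace": ["FANTASTIC FOUR", "IRONMAN"],
    "Jane": ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"],
    "Jack": ["IRONMAN", "FANTASTIC FOUR"],
}

# Sorting each list gives an order-independent key, so Ace and Jack match
# even though their movies are listed in different orders.
groups = defaultdict(list)
for user, movies in favourites.items():
    groups[tuple(sorted(movies))].append(user)

for movies, users in groups.items():
    if len(users) > 1:  # only report users who share an identical movie list
        print("Movies:", ", ".join(movies))
        print("Partners:", ", ".join(users))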

How to speed up the sum of presence of keys in the series of documents? - Pandas, nltk

I have a dataframe column with documents like
38909 Hotel is an old style Red Roof and has not bee...
38913 I will never ever stay at this Hotel again. I ...
38914 After being on a bus for -- hours and finally ...
38918 We were excited about our stay at the Blu Aqua...
38922 This hotel has a great location if you want to...
Name: Description, dtype: object
I have a bag of words like keys = ['Hotel','old','finally'] but the actual length of keys = 44312
Currently I'm using
df.apply(lambda x : sum([i in x for i in keys ]))
Which gives the following output based on sample keys
38909 2
38913 2
38914 3
38918 0
38922 1
Name: Description, dtype: int64
When I apply this on actual data for just 100 rows timeit gives
1 loop, best of 3: 5.98 s per loop
and I have 50,000 rows. Is there a faster way of doing the same in nltk or pandas?
EDIT:
In case you are looking for the document array:
array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
"After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
"We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)
The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:
keyset = frozenset(keys)
df.apply(lambda x : len(keyset.intersection(x.split())))
Differences/limitations:
In your version a word is counted even if it is contained as a substring in a word in the document. For example, had your keys contained the word tyl, it would be counted due to occurrence of "style" in your first document.
My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the document (or postprocessing the result of the split()) with a function that removes the punctuation.
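For instance, a possible preprocessing step (a sketch; the translation-table idea is my assumption, not part of the answer above) that strips punctuation before splitting, so that "again." and "again" count as the same word:
import string

# Build a table that deletes all ASCII punctuation, then clean each document
# before splitting and checking membership against the key set.
table = str.maketrans('', '', string.punctuation)
keyset = frozenset(keys)  # 'keys' is the OP's list of words
df.apply(lambda doc: len(keyset.intersection(doc.translate(table).split())))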
It seems you can just use np.char.count -
[np.count_nonzero(np.char.count(i, keys)) for i in arr]
Might be better to feed a boolean array for counting -
[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
If you only need to check whether the values of the list are present:
from numpy.core.defchararray import find
v = df['col'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print (a)
[2 1 1 0 0]
Or:
df = pd.concat([df['col'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print (df)
38909 2
38913 1
38914 1
38918 0
38922 0
dtype: int64

Compensating for "variance" in a survey

The title for this one was quite tricky.
I'm trying to solve a scenario,
imagine a survey was sent out to XXXXX people, asking them what their favourite football club was.
From the responses, it's obvious that while many favour the same club, they all "expressed" it in different ways.
For example,
For Manchester United, some variations include...
Man U
Man Utd.
Manchester U
Manchester Utd
All are obviously the same club; however, using a simple technique of just trying to get an exact string match, each would be a separate result.
Now, to further complicate the scenario, let's say that because of the sheer volume of different clubs (e.g. Man City, as M. City, Manchester City, etc.), all plagued with this problem, it's impossible to manually "enter" these variances and use them to create a custom filter that converts all of Man U -> Manchester United, Man Utd. -> Manchester United, etc. Instead, we want to automate this filter to look for the most likely match and convert the data accordingly.
I'm trying to do this in Python (from a .csv file), but I welcome any pseudocode answers that outline a good approach to solving this.
Edit: Some additional information
This isn't working off a set list of clubs; the idea is to "cluster" the ones we have together.
The assumption is there are no spelling mistakes.
There is no assumed number of clubs.
And the survey list is long, long enough that it doesn't warrant doing this manually (1000s of queries).
Google Refine does just this, but I'll assume you want to roll your own.
Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.
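For instance, a quick illustration of difflib.get_close_matches (the club names and cutoff below are just examples):
import difflib

canonical = ["manchester united", "manchester city", "tottenham hotspur"]
# Returns the canonical names whose similarity ratio clears the cutoff, best first.
print(difflib.get_close_matches("man utd", canonical, n=1, cutoff=0.4))
# ['manchester united']  (the exact behaviour depends on the cutoff you choose)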
You probably don't want to do it in a completely automated fashion. I'd do something like this:
# load corrections file, mapping user input -> output
# load survey
import difflib

possible_values = list(corrections.values())
for answer in survey:
    output = corrections.get(answer, None)
    if output is None:
        likely_outputs = difflib.get_close_matches(answer, possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)
# save corrections as CSV
Please edit your question with answers to the following:
You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?
How many clubs?
How many survey responses?
After doing very light normalisation (replace . by a space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]; see the sketch after this list) and counting, how many unique responses do you have?
Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?
You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?
What are your criteria for success in this exercise, and how will you measure it?
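As an illustration of the light normalisation described in the list above (a sketch; the function name is mine):
import re

def normalise(response):
    # In order: '.' -> space, trim, collapse whitespace runs, lower-case.
    response = response.replace('.', ' ').strip()
    response = re.sub(r'\s+', ' ', response)
    return response.lower()

# e.g. both "Man Utd." and "Man  Utd" normalise to "man utd"; counting unique
# responses is then len({normalise(r) for r in responses}).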
It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, then comparing the start of each word.
If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match first-n-letters to any actual team, you could try lexicographical comparison against any of the returned strings which actually do match.
It's not perfect, but it should get you 99% of the way there.
import string

def words(s):
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    score, best = 0., ''
    for matchWord in matchWords:
        # Fraction of aligned characters that agree; the +0.01 avoids division by zero.
        matchScore = sum(w == m for w, m in zip(word, matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score, best = matchScore, matchWord
    return score, best

def bestMatchingSentence(wordList, matchSentences):
    # Score each candidate team name by summing the per-word best matches.
    score, best = 0., []
    for matchSentence in matchSentences:
        total, words = 0., []
        for word in wordList:
            s, w = bestMatchingWord(word, matchSentence)
            total += s
            words.append(w)
        if total > score:
            score, best = total, words
    return score, best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd"
    )
    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers')
    )
    for d in data:
        print("{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList)))

if __name__ == "__main__":
    main()
Run on the sample data, this gets you:
Man U (1.9867767507647776, ['manchester', 'united'])
Man. Utd. (1.7448074166742613, ['manchester', 'utd'])
Manch Utd (1.9946817328797555, ['manchester', 'utd'])
Manchester U (1.989100008901989, ['manchester', 'united'])
Manchester Utd (1.9956787398647866, ['manchester', 'utd'])
