I have the following code:
import nltk

# NB: a multi-line string needs triple quotes
page = '''
EDUCATION
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
'''

for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())
Returns:
EDUCATION ORGANIZATION
UniversityWon ORGANIZATION
Sixth PERSON
FormMathematics ORGANIZATION
Economics PERSON
FrenchUK GPE
London GPE
Which I'd like to be grouped into some data structure based on the entity label, maybe lists: ORGANIZATION = [EDUCATION, UniversityWon, FormMathematics], PERSON = [Sixth, Economics], GPE = [FrenchUK, London]
Or maybe a dictionary with the keys ORGANIZATION, PERSON, GPE, where the associated values are the lists above.
A dictionary makes more sense; perhaps something like this:
from collections import defaultdict

entities = defaultdict(list)
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            entities[chunk.label()].append(''.join(c[0] for c in chunk))
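Converting the defaultdict to a plain dict and printing it gives exactly the grouping described above:

print(dict(entities))
# {'ORGANIZATION': ['EDUCATION', 'UniversityWon', 'FormMathematics'],
#  'PERSON': ['Sixth', 'Economics'],
#  'GPE': ['FrenchUK', 'London']}

As an aside, run-together values like UniversityWon come from joining each chunk's tokens with the empty string; ' '.join(c[0] for c in chunk) would keep them separated (e.g. 'University Won').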
I have checked the other two answers to a similar question, but in vain.
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from nltk.tokenize import regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

wordnetLem = WordNetLemmatizer()
article = "Aziz Ismail Ansari[1] (/ənˈsɑːri/; born February 23, 1983) is an American actor, writer, producer, director, and comedian. He is known for his role as Tom Haverford on the NBC series Parks and Recreation (2009–2015) and as creator and star of the Netflix series Master of None (2015–) for which he won several acting and writing awards, including two Emmys and a Golden Globe, which was the first award received by an Asian American actor for acting on television.[2][3][4]Ansari began performing comedy in New York City while a student at New York University in 2000. He later co-created and starred in the MTV sketch comedy show Human Giant, after which he had acting roles in a number of feature films.As a stand-up comedian, Ansari released his first comedy special, Intimate Moments for a Sensual Evening, in January 2010 on Comedy Central Records. He continues to perform stand-up on tour and on Netflix. His first book, Modern Romance: An Investigation, was released in June 2015. He was included in the Time 100 list of most influential people in 2016.[5] In July 2019, Ansari released his fifth comedy special Aziz Ansari: Right Now, which was nominated for a Grammy Award for Best Comedy Album.[6]"
tokens = regexp_tokenize(article, r"\w+|\d+")
az_only_alpha = [t for t in tokens if t.isalpha()]
#below line will take some time to execute.
az_no_stop_words = [t for t in az_only_alpha if t not in stopwords.words('english')]
az_lemmetize = [wordnetLem.lemmatize(t) for t in az_no_stop_words]
dictionary = Dictionary([az_lemmetize])
x = [dictionary.doc2bow(doc) for doc in [az_lemmetize]]
cp = TfidfModel(x)
cp[x[0]]
The last line gives me an empty array, whereas I expect a list of tuples with ids and their weights.
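For what it's worth, the likely cause is that the corpus contains only one document: every term then appears in that single document, its IDF is zero, and gensim drops zero-weight entries, returning an empty list. A minimal sketch with two toy documents (made up for illustration) shows nonzero weights:

from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

# A term must be absent from at least one document
# for its idf (and hence its tf-idf weight) to be nonzero.
docs = [["cat", "dog", "bird"], ["cat", "fish"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
model = TfidfModel(corpus)
print(model[corpus[0]])  # nonzero weights for "dog" and "bird"; "cat" drops out

Splitting the article into multiple documents (e.g. sentences) before building the Dictionary would have the same effect.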
I am teaching myself Python and have completed a rudimentary text summarizer. I'm nearly happy with the summarized text but want to polish the final product a bit more.
The code performs some standard text processing correctly (tokenization, stopword removal, etc.). The code then scores each sentence based on a weighted word frequency. I am using the heapq.nlargest() method to return the top 7 sentences, which I feel does a good job on my sample text.
The issue I'm facing is that the top 7 sentences are returned sorted from highest score to lowest score. I understand why this is happening. I would prefer to maintain the same sentence order as in the original text. I've included the relevant bits of code and hope someone can guide me to a solution.
#remove all stopwords from text, build clean list of lower case words
clean_data = []
for word in tokens:
    if str(word).lower() not in stoplist:
        clean_data.append(word.lower())

#build dictionary of all words with frequency counts: {key:value = word:count}
word_frequencies = {}
for word in clean_data:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1
#print(word_frequencies.items())

#update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
#print(maximum_frequency)
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
#print(word_frequencies.items())

#iterate through each sentence and combine the weighted scores of the underlying words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
#print(sentence_scores.items())

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
I'm testing using the following article: https://www.bbc.com/news/world-australia-45674716
Current output: "Australia bank inquiry: 'They didn't care who they hurt'
The inquiry has also heard testimony about corporate fraud, bribery rings at banks, actions to deceive regulators and reckless practices. A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry. The royal commission came after a decade of scandalous behaviour in Australia's financial sector, the country's largest industry. "[The report] shines a very bright light on the poor behaviour of our financial sector," Treasurer Josh Frydenberg said. "When misconduct was revealed, it either went unpunished or the consequences did not meet the seriousness of what had been done," he said. The bank customers who lost everything
He also criticised what he called the inadequate actions of regulators for the banks and financial firms. It has also received more than 9,300 submissions of alleged misconduct by banks, financial advisers, pension funds and insurance companies."
As an example of the desired output: the third sentence above, "A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry." actually comes before "Australia bank inquiry: 'They didn't care who they hurt'" in the original article, and I would like the output to maintain that sentence order.
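One way to keep the article order, sketched against the variables already defined above (sentence_list and sentence_scores), is to remember each sentence's original index and re-sort the top 7 by it:

import heapq

# Pick the 7 highest-scoring sentences, then restore document order.
order = {sent: i for i, sent in enumerate(sentence_list)}
top = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
top.sort(key=order.get)  # back to original article order
summary = ' '.join(top)
print(summary)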
Got it working; leaving it here in case others are curious:

#iterate through each sentence and combine the weighted scores of the underlying words
sentence_scores = {}
cnt = 0
for sent in sentence_list:
    score = 0
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                score += word_frequencies[word]
    #store [score, original position] so we can sort on either
    sentence_scores[sent] = [score, cnt]
    cnt = cnt + 1

#Sort the dictionary by score in descending order and take the top 7 sentences
from operator import itemgetter
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse=True)[0:7])
#print(top7)

#Re-sort the top 7 by original position (the second entry of each value)
def Sort(sub_li):
    return sorted(sub_li, key=lambda v: v[1])

sentence_summary = Sort(top7.values())

#Put them into one string variable, in original order
summary = ""
for value in sentence_summary:
    for key in top7:
        if top7[key] == value:
            summary = summary + " " + key
print(summary.strip())
I have a pandas dataframe with quarterly firm observations and the respective speeches within each firm observation from different persons. As such, I have "common" variables like year, title, firm name, etc., and then per quarterly observation I have a variable allinfolistmain, which is stored as a list of lists within each observation, containing the name and the speech as separate list entries.
For instance, for one row of "allinfolistmain" the entry would look like this:
[[Mark Johnson, Hello], [Christina Brown, Have a good day], [Mark Johnson, You too], [Christina Brown, Thank you]]
The overall dataframe would look like this:
Index Year Title Firm allinfolistmain
0 2009 CC A 2009 A [[Mark Johnson, Hello], [Christina Brown, Have a good day], [Mark Johnson, You too], [Christina Brown, Thank you]]
1 2009 CC B 2009 B [[Lucas Bass, Hello], [Harm Brown, Have a good day], [Lucas Bass, You too], [Harm Brown, Thank you]]
2 2008 CC A 2008 A [[Mark Johnson, Nice to see you], [Christina Brown, You too], [Mark Johnson, Thanks], [Christina Brown, Bye]]
Now for each row/observation, I want to group the speeches (the list elements indexed 1) by name (the list elements indexed 0), so that each person's speeches are joined together into one string within the list, like below:
[[Mark Johnson, Hello You too], [Christina Brown, Have a good day Thank you]]
Could someone help me with the code for how I can go through each row and create such a new list? All suggestions are very much appreciated, as I am still at the beginning of coding and I could not solve this issue.
Thank you so much!
Julia
If I correctly understand your question and how you've created the dataframe, is this what you want to do? At the end a list is printed:

# a new dictionary of lists to collect all "speeches" values for each "name" key
nd = {}
for row in df['allinfolistmain']:        # for each row in the dataframe
    for n in row:                        # for each [name, speech] pair in the row
        try:
            nd[n[0]].append(n[1])        # if the key already exists, add the speech to its list
        except KeyError:                 # otherwise the key doesn't exist yet
            nd[n[0]] = [n[1]]            # add the key with the speech as its first entry

newlist = []                             # create a new list
for k, v in nd.items():                  # for each key, value in the new dictionary (Python 3: items(), not iteritems())
    newlist.append((k, ' '.join(v)))     # add a tuple of (name, all speeches joined into one string)
print(newlist)
output:
[('Christina Brown', 'Have a good day Thank you You too Bye'),
('Mark Johnson', 'Hello You too Nice to see you Thanks'),
('Lucas Bass', 'Hello You too'),
('Harm Brown', 'Have a good day Thank you')]
from collections import defaultdict

def g(L):
    res = defaultdict(list)
    for v, k in L:
        res[v].append(k)
    new = list({key: ' '.join(value) for key, value in res.items()}.items())
    return new

df.allinfolistmain.apply(g)
Single-list test:
L=[('Mark Johnson', 'Hello'), ('Christina Brown', 'Have a good day'), ('Mark Johnson', 'You too'), ('Christina Brown', 'Thank you')]
g(L)
Out[784]:
[('Mark Johnson', 'Hello You too'),
('Christina Brown', 'Have a good day Thank you')]
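To keep the grouped speeches alongside each observation, the result of apply can be assigned back to a new column (the column name here is just a suggestion):

df['groupedspeeches'] = df.allinfolistmain.apply(g)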
I have data in the following format:
Bxxxx, Mxxxx F Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Axxxx Brown Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Bxxxx Mobile AL (123) 555-8011 NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639 -99.053238
Axxxx, Rxxxx Lunsford Athens AL (123) 555-8119 NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision English 99.804501 -99.971283
Axxxx, Mxxxx Mobile AL (123) 555-5963 NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision English 99.68639 -99.053238
Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery English 99.50214 -99.75557
Axxxx, Lxxxx Birmingham AL (123) 555-4550 NCC Addictions and Dependency, Eating Disorders English 99.52029 -99.8115
Axxxx, Wxxxx Birmingham AL (123) 555-2328 NCC English 99.52029 -99.8115
Axxxx, Rxxxx Mobile AL (123) 555-9411 NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639 -99.053238
I need to extract only the person names. Ideally, I'd be able to use humanName to get a bunch of name objects with fields name.first, name.middle, name.last, name.title...
I've tried iterating through until I hit the first two consecutive capital letters representing the state, storing everything before them in a list, and then calling humanName, but that was a disaster. I don't want to continue with this method.
Is there a way to sense the starts and ends of words? That might be helpful...
Recommendations?
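For the name-parsing step itself, assuming "humanName" refers to the Python nameparser package, it already understands the "Last, First Middle" shape these rows start with (a minimal sketch):

from nameparser import HumanName

name = HumanName("Axxxx, Rxxxx Lunsford")
print(name.last)    # Axxxx
print(name.first)   # Rxxxx
print(name.middle)  # Lunsford

The hard part remains isolating the name portion from the rest of the line in the first place.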
Your best bet is to find a different data source. Seriously. This one is farked.
If you can't do that, then I would do some work like this:
1. Replace all double spaces with single spaces.
2. Split the line on spaces.
3. Take the last 2 items in the list. Those are lat and lng.
4. Looping backwards through the list, look up each item in a list of potential languages. If the lookup fails, you are done with languages.
5. Join the remaining list items back together with spaces.
6. In the line, find the first opening paren. Read about 13 or 14 characters in, replace all punctuation with empty strings, and reformat it as a normal phone number.
7. Split the remainder of the line after the phone number on commas.
8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to certifications. Otherwise, add it to areas of practice.
9. Going back to the index you found in step 6, get the line up until then. Split it on spaces, and take the last item. That's the state. All that's left is name and city!
10. Take the first 2 items in the space-split line. That's your best guess for the name, so far.
11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
13. In the US data file, split every line on tabs. Take the data at indexes 2 and 4, which are city name and state abbreviation, and insert each row, concatenated as abbreviation + ":" + city name (e.g. AK:Sand Point), into a new list.
14. Make a combination of all possible joins of the remaining items in your line, in the same format as in step 13. So you'd end up with AL:Brown Birmingham and AL:Birmingham for the 2nd line.
15. Loop through each combination and search for it in the list you created in step 13. If you find it, remove it from the split list.
16. Add all remaining items in the space-split list to the person's name.
17. If desired, split the name on the comma: index [0] is the last name, index [1] is all remaining names. Don't make any assumptions about middle names.
Just for giggles, I implemented this. Enjoy.
import itertools

# this list of languages could be longer and should be read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]
languages = [language.lower() for language in languages]

# Loop through US.txt and format it. Download from geonames.org.
cities = []
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        cities.append("{}:{}".format(line_split[4], line_split[2]))

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header
    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")

        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []
        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use the phone number as the special key to split on
        line = " ".join(line_split)
        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()
        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []
        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]
        line_split = line_split[2:]

        # Add initials
        if len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

        # Combo of all potential word combinations to see if we're dealing with a city or a name
        combos = [" ".join(combo) for combo in set(itertools.permutations(line_split))] + line_split
        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names
        last_name, first_name = [piece.strip() for piece in name.split(",")]
        print(first_name, last_name)
Not a code answer, but it looks like you could get most/all of the data you're after from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. Names are easy to get there.
Recently I got my hands on a research project that would greatly benefit from learning how to parse a string of biographical data on several individuals into a set of dictionaries, one for each individual.
The string contains break words, and I was hoping to create keys off of the break words and separate the dictionaries at the breaks between individuals. Here are two people I want to create two different dictionaries for within my data:
Bankers = [ ' Bakstansky, Peter; Senior Vice President, Federal
Reserve Bank of New York, in charge of public information
since 1976, when he joined the NY Fed as Vice President. Senior
Officer in charge of die Office of Regional and Community Affairs,
Ombudsman for the Bank and Senior Administrative Officer for Executive
Group, m zero children Educ City College of New York (Bachelor of
Business Administration, 1961); University of Illinois, Graduate
School, and New York University, Graduate School of Business. 1962-6:
Business and financial writer, New York, on American Banker, New
York-World Telegram & Sun, Neia York Herald Tribune (banking editor
1964-6). 1966-74: Chase Manhattan Bank: Manager of Public Relations,
based in Paris, 1966-71; Manager of Chase's European Marketing and
Planning, based in Brussels, 1971-2; Vice President and Director of
Public Relations, 1972-4.1974-76: Bache & Co., Vice President and
Director of Corporate Communications. Barron, Patrick K.; First Vice
President and < Operating Officer of the Federal Reserve Bank o
Atlanta since February 1996. Member of the Fed" Reserve Systems
Conference of first Vice Preside Vice chairman of the bank's
management Con and of the Discount Committee, m three child Educ
University of Miami (Bachelor's degree in Management); Harvard
Business School (Prog Management Development); Stonier Graduate Sr of
Banking, Rutgers University. 1967: Joined Fed Reserve Bank of Atlanta
in computer operations 1971: transferred to Miami Branch; 1974:
Assist: President; 1987: Senior Vice President.1988: re1- Atlanta as
Head of Corporate Services. Member Executive Committee of the Georgia
Council on Igmic Education; former vice diairman of Greater
ji§?Charnber of Commerce and the President'sof the University of
Miami; in Atlanta, former ||Mte vice chairman for the United Way of
Atlanta feiSinber of Leadership Atlanta. Member of the Council on
Economic Education. Interest. ' ]
So for example, in this data I have two people, Peter Bakstansky and Patrick K. Barron. I want to create a dictionary for each individual with these 4 keys: bankerjobs, number of children, education, and nonbankerjobs.
The text already contains break words: "m" marks the number of children and "Educ" marks education. Anything before "m" is bankerjobs, anything after the first "." following the Educ section is nonbankerjobs, and the keyword to break between individuals seems to be a "." followed by more than one space.
How can I create a dictionary for each of these two individuals with these 4 keys using regular expressions on these break words?
Specifically, what set of regexes could help me create a dictionary for these two individuals with these 4 keys (built on the break words specified above)?
A pattern I am thinking of would be something like this in Perl:
pattern = [r'(m/[ '(.*);(.*)m(.*)Educ(.*)/)']
but I'm not sure...
I'm thinking the code would be similar to this, but please correct it if I'm wrong:
import re

my_banker_parser = re.compile(r'somefancyregex')

def nested_dict_from_text(text):
    m = re.search(my_banker_parser, text)
    if not m:
        raise ValueError
    d = m.groupdict()
    return {"centralbanker": d}

result = nested_dict_from_text(Bankers[0])  # Bankers is the one-element list defined above
print(result)
My hope is to take this code and run it over the rest of the biographies for all individuals of interest.
Using named groups will probably be less brittle, since it doesn't depend on the pieces of data being in the same order in each biography. Something like this should work:
>>> import re
>>> regex = re.compile(r'(?P<foo>foo)|(?P<bar>bar)|(?P<baz>baz)')
>>> data = {}
>>> for match in regex.finditer('bar baz foo something'):
... data.update((k, v) for k, v in match.groupdict().items() if v is not None)
...
>>> data
{'baz': 'baz', 'foo': 'foo', 'bar': 'bar'}
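Adapting that idea to the break words described above, a single pattern with named groups might look like the following rough sketch (the exact pattern, and the assumption that a lone "m" and "Educ" reliably mark the fields, are guesses that will need tuning against the messy OCR text):

import re

# Rough sketch: split one biography on the described break words.
bio_re = re.compile(
    r'(?P<name>[^;]+);'            # everything up to the first semicolon: the name
    r'(?P<bankerjobs>.*?)\bm\b'    # before the lone "m": banker jobs
    r'(?P<children>.*?)\bEduc\b'   # between "m" and "Educ": number of children
    r'(?P<education>.*?\.)\s'      # up to the first "." after "Educ": education
    r'(?P<nonbankerjobs>.*)',      # the rest: non-banker jobs
    re.DOTALL,
)

def banker_dict(text):
    m = bio_re.search(text)
    return m.groupdict() if m else None

Splitting the full string into individual biographies first (e.g. on a "." followed by several spaces, as described above) and then applying banker_dict to each piece would give one dictionary per person.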