I have checked the other two answers to similar questions, but in vain.
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize
article = "Aziz Ismail Ansari[1] (/ənˈsɑːri/; born February 23, 1983) is an American actor, writer, producer, director, and comedian. He is known for his role as Tom Haverford on the NBC series Parks and Recreation (2009–2015) and as creator and star of the Netflix series Master of None (2015–) for which he won several acting and writing awards, including two Emmys and a Golden Globe, which was the first award received by an Asian American actor for acting on television.[2][3][4]Ansari began performing comedy in New York City while a student at New York University in 2000. He later co-created and starred in the MTV sketch comedy show Human Giant, after which he had acting roles in a number of feature films.As a stand-up comedian, Ansari released his first comedy special, Intimate Moments for a Sensual Evening, in January 2010 on Comedy Central Records. He continues to perform stand-up on tour and on Netflix. His first book, Modern Romance: An Investigation, was released in June 2015. He was included in the Time 100 list of most influential people in 2016.[5] In July 2019, Ansari released his fifth comedy special Aziz Ansari: Right Now, which was nominated for a Grammy Award for Best Comedy Album.[6]"
tokens = regexp_tokenize(article, r"\w+|\d+")
az_only_alpha = [t for t in tokens if t.isalpha()]
# Build the stop-word set once; calling stopwords.words() for every token is what made this slow.
stop_words = set(stopwords.words('english'))
az_no_stop_words = [t for t in az_only_alpha if t not in stop_words]
wordnetLem = WordNetLemmatizer()
az_lemmetize = [wordnetLem.lemmatize(t) for t in az_no_stop_words]
dictionary = Dictionary([az_lemmetize])
x = [dictionary.doc2bow(doc) for doc in [az_lemmetize]]
cp = TfidfModel(x)
cp[x[0]]
The last line gives me an empty list, whereas I expect a list of tuples with ids and their weights.
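A likely explanation, offered as a sketch rather than a confirmed diagnosis: the corpus holds only one document, so every term appears in all documents and its inverse document frequency is log2(1/1) = 0. As far as I know, gensim's default weighting is tf * log2(N / df) and zero-weight entries are dropped, which would leave an empty list. The plain-Python function below (tfidf is a hypothetical helper, not gensim's API, and it skips normalization) reproduces the effect:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Weight each term by tf * log2(N / df) and drop zero weights."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        weighted.append({t: w for t, w in weights.items() if w > 0})
    return weighted

# One document: every idf is log2(1/1) = 0, so nothing survives.
single = tfidf([["actor", "comedian", "actor"]])
# Two documents: only terms that are not shared get a non-zero weight.
pair = tfidf([["actor", "comedian", "actor"], ["actor", "writer"]])
print(single)
print(pair)
```

If this is the cause, passing at least two documents to Dictionary and TfidfModel should give non-empty results.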
I have the following code:
import nltk
page = '''
EDUCATION
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
'''
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())
Returns:
EDUCATION ORGANIZATION
UniversityWon ORGANIZATION
Sixth PERSON
FormMathematics ORGANIZATION
Economics PERSON
FrenchUK GPE
London GPE
I'd like these to be grouped into some data structure based on the entity label, maybe lists: ORGANIZATION=[EDUCATION,UniversityWon,FormMathematics] PERSON=[Sixth,Economics] GPE=[FrenchUK,London]
Or maybe a dictionary with the keys ORGANIZATION, PERSON and GPE, whose values are the lists above.
A dictionary makes more sense, perhaps something like this.
from collections import defaultdict
entities = defaultdict(list)
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            entities[chunk.label()].append(''.join(c[0] for c in chunk))
from datasets import load_dataset  # Huggingface
from transformers import BertTokenizer  # Huggingface

def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset
    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)
    Returns:
        dict: dataset once tokenized
    """
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]
    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}

if __name__ == "__main__":
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)
I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset created by that function will be reused for ML training. However, carrying that dictionary around during training will slow the training down a lot.
Be aware that the documents look similar to this:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
So in the dictionary, the keys are just strings, but the values are all lists of lists of strings. Instead of storing lists of lists of strings as values, I think it would be better to store some sort of list of objects. That should make the dictionary much lighter. How can I do that?
EDIT
For me, there is a difference between copying a list of lists of strings and copying a list of objects. Copying an object simply copies the reference, while copying a list of strings copies everything, so copying the reference is much faster. This is the point of this question.
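As a quick check of that assumption, the sketch below (plain Python, with made-up sample data and a hypothetical train function) suggests that a list of lists is already handed around by reference. Passing the dictionary to a function copies nothing; only an explicit copy.deepcopy duplicates the strings' containers.

```python
import copy

# Made-up sample in the same shape as the abstracts above.
abstracts = [["first sentence .", "second sentence ."], ["another abstract ."]]
data = {"train": abstracts}

def train(dataset):
    # The function receives a reference to the same dict; nothing is copied.
    return dataset["train"]

same = train(data)
shallow = copy.copy(abstracts)   # new outer list, inner lists are shared
deep = copy.deepcopy(abstracts)  # outer and inner lists are duplicated

print(same is abstracts)            # identity, not equality
print(shallow[0] is abstracts[0])
print(deep[0] is abstracts[0])
```

So wrapping the lists in custom objects would not, by itself, make passing the dictionary around any cheaper.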
I have a bs4 program that collects the descriptions of links. It first checks whether there are any meta description tags, and if there aren't any, it takes the description from the <p> tags.
This is the code:
from bs4 import BeautifulSoup
import requests

def find_title(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    with open('descrip.txt', 'a', encoding='utf-8') as f:
        description = (soup.find('meta', attrs={'name':'og:description'})
                       or soup.find('meta', attrs={'property':'description'})
                       or soup.find('meta', attrs={'name':'description'}))
        if description:
            desc = description["content"]
        else:
            desc = soup.find_all('p')[0].getText()
            lengths = len(desc)
            index = 0
            while lengths == 1:
                index = index + 1
                desc = soup.find_all('p')[index].getText()
                lengths = len(desc)
            if lengths > 300:
                desc = soup.find_all('p')[index].getText()[0:300]
            elif lengths < 300:
                desc = soup.find_all('p')[index].getText()[0:lengths]
        print(desc)
        f.write(desc + '\n')
find_title('https://en.wikipedia.org/wiki/Portal:The_arts')
find_title('https://en.wikipedia.org/wiki/Portal:Biography')
find_title('https://en.wikipedia.org/wiki/Portal:Geography')
find_title('https://en.wikipedia.org/wiki/November_15')
find_title('https://en.wikipedia.org/wiki/November_16')
find_title('https://en.wikipedia.org/wiki/Wikipedia:Selected_anniversaries/November')
find_title('https://lists.wikimedia.org/mailman/listinfo/daily-article-l')
find_title('https://en.wikipedia.org/wiki/List_of_days_of_the_year')
find_title('https://en.wikipedia.org/wiki/File:Proclama%C3%A7%C3%A3o_da_Rep%C3%BAblica_by_Benedito_Calixto_1893.jpg')
find_title('https://en.wikipedia.org/wiki/First_Brazilian_Republic')
find_title('https://en.wikipedia.org/wiki/Empire_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Pedro_II_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Benedito_Calixto')
find_title('https://en.wikipedia.org/wiki/Rio_de_Janeiro')
find_title('https://en.wikipedia.org/wiki/Deodoro_da_Fonseca')
But the output in descrip.txt has some weird indents, some descriptions run over multiple lines, and there are spaces between some of them.
This is the output:
The arts refers to the theory, human application and physical expression of creativity found in human cultures and societies through skills and imagination in order to produce objects, environments and experiences. Major constituents of the arts include visual arts (including architecture, ceramics,
A biography, or simply bio, is a detailed description of a person's life. It involves more than just the basic facts like education, work, relationships, and death; it portrays a person's experience of these life events. Unlike a profile or curriculum vitae (résumé), a biography presents a subject's
Geography (from Greek: γεωγραφία, geographia, literally "earth description") is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. The first person to use the word γεωγραφία was Eratosthenes (276–194 BC). Geography is an all-encompass
November 15 is the 319th day of the year (320th in leap years) in the Gregorian calendar. 46 days remain until the end of the year.
November 16 is the 320th day of the year (321st in leap years) in the Gregorian calendar. 45 days remain until the end of the year.
The arts refers to the theory, human application and physical expression of creativity found in human cultures and societies through skills and imagination in order to produce objects, environments and experiences. Major constituents of the arts include visual arts (including architecture, ceramics,
A biography, or simply bio, is a detailed description of a person's life. It involves more than just the basic facts like education, work, relationships, and death; it portrays a person's experience of these life events. Unlike a profile or curriculum vitae (résumé), a biography presents a subject's
Geography (from Greek: γεωγραφία, geographia, literally "earth description") is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. The first person to use the word γεωγραφία was Eratosthenes (276–194 BC). Geography is an all-encompass
November 15 is the 319th day of the year (320th in leap years) in the Gregorian calendar. 46 days remain until the end of the year.
November 16 is the 320th day of the year (321st in leap years) in the Gregorian calendar. 45 days remain until the end of the year.
Selected anniversaries / On this day archive
All · January · February · March · April · May · June · July · August · September · October · November · December
The sum of all human knowledge. Delivered to your inbox every day.
The following pages list the historical events, births, deaths, and holidays and observances of the specified day of the year:
Original file (5,799 × 3,574 pixels, file size: 15.11 MB, MIME type: image/jpeg)
The First Brazilian Republic or República Velha (Portuguese pronunciation: [ʁeˈpublikɐ ˈvɛʎɐ], "Old Republic"), officially the Republic of the United States of Brazil, refers to the period of Brazilian history from 1889 to 1930. The República Velha ended with the Brazilian Revolution of 1930 that installed Getúlio Vargas as a new president.
The Empire of Brazil was a 19th-century state that broadly comprised the territories which form modern Brazil and (until 1828) Uruguay. Its government was a representative parliamentary constitutional monarchy under the rule of Emperors Dom Pedro I and his son Dom Pedro II. A colony of the Kingdom of Portugal, Brazil became the seat of the Portuguese colonial Empire in 1808, when the Portuguese Prince regent, later King Dom João VI, fled from Napoleon's invasion of Portugal and established himself and his government in the Brazilian city of Rio de Janeiro. João VI later returned to Portugal, leaving his eldest son and heir, Pedro, to rule the Kingdom of Brazil as regent. On 7 September 1822, Pedro declared the independence of Brazil and, after waging a successful war against his father's kingdom, was acclaimed on 12 October as Pedro I, the first Emperor of Brazil. The new country was huge, sparsely populated and ethnically diverse.
Early life (1825–40)
Consolidation (1840–53)
Growth (1853–64)
Paraguayan War (1864–70)
Apogee (1870–81)
Decline and fall (1881–89)
Exile and death (1889–91)
Legacy
Benedito Calixto de Jesus (14 October 1853 – 31 May 1927) was a Brazilian painter.[1] His works usually depicted figures from Brazil and Brazilian culture, including a famous portrait of the bandeirante Domingos Jorge Velho in 1923,[2] and scenes from the coastline of São Paulo.[3] Unlike many artis
Rio de Janeiro (/ˈriːoʊ di ʒəˈnɛəroʊ, - deɪ -, - də -/; Portuguese: [ˈʁi.u d(ʒi) ʒɐˈne(j)ɾu] (listen);[3]), or simply Rio,[4] is anchor to the Rio de Janeiro metropolitan area and the second-most populous municipality in Brazil and the sixth-most populous in the Americas. Rio de Janeiro is the capit
Manuel Deodoro da Fonseca (Portuguese pronunciation: [mɐnuˈɛw deoˈdɔɾu da fõˈsekɐ]; 5 August 1827 – 23 August 1892) was a Brazilian politician and military officer who served as the first President of Brazil. He took office after heading a military coup that deposed Emperor Pedro II and proclaimed t
Is there any way to fix this problem?
Add strip=True to getText() (note: it's an alias of get_text()), and then add a space as a separator. For example:
get_text(strip=True, separator=' ')
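If stray newlines still end up inside a description, a complementary option is to normalize the string after extracting it. The sketch below uses only the standard library; clean_description and its limit parameter are hypothetical names, and the 300-character cap is taken from the question's code.

```python
def clean_description(text, limit=300):
    """Collapse every run of whitespace (newlines included) and cap the length."""
    collapsed = " ".join(text.split())
    # Slicing is safe for short strings, so no separate length check is needed.
    return collapsed[:limit]

# Made-up input resembling the multi-line output in the question.
raw = "  Early life\n(1825–40)\n\n  Consolidation   (1840–53) "
print(clean_description(raw))
```

Because slicing never raises on short strings, this would also replace the if/elif on lengths in find_title.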
I got the HTML from a website and converted it to text.
However, how can I clean the text so that I keep only the sentences?
For example, I want to remove all irrelevant information such as 1990, ... himself, 1987, the 59th ....
and keep the sentences:
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
and so on.
1990
... himself
1987
the 59th annual academy awards
(tv special)
jack
/
maverick
/
vincent lauria
(uncredited)
related videos
none
none
none
see all 35 videos »
#csm.csm_widget />
reality tv
the office
late night
sitcoms
music
rappers
action
religion
top paid
how much money does tom cruise make? (salary & net worth)
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
history
thomas cruise mapother iv a.k.a tom cruise was born in syracuse, new york to mother mary lee and father thomas cruise mapother iii. cruise’s mother was a special education teacher and father was an electrical engineer. tom cruise is basically of irish, german and english origin. cruise’s family had the male domination of his abusive father whom cruise had once described as the merchant of chaos. he was often bullied and beaten by his father and cruise called him a coward. a part of tom cruise’s childhood was spent in canada. however, when cruise was in the sixth grade, his mother left his father and brought cruise and his siblings back to america.
acting career
acting career of tom cruise started quite early but with a small role in the movie endless love (1981). however, he got his big break as a supporting actor in the movie taps later that year. in 1983, his movies risky business and all the right moves along with top gun in 1986 paved the path for tom cruise as an established actor and a superstar. after this there was no looking back and tom cruise went to star in many super-successful movies like cocktail, rain man, days of thunder, interview with the vampire.
then in 1996, he starred as a superspy ethan hunt in the very popular and blockbuster movie which went on to be a series, mission: impossible. that same year he also was seen in the lead role of the movie jerry maguire and won a golden globe for the same. in 1999, his supporting role in the movie magnolia again won him his second golden globe.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
net worth
tom cruise’s films have gained $7.3 million worldwide as of 2013. however, the net worth of the highest paid actor in hollywood is $270 million and he still gets paychecks from his previous movies.
154 magazine cover photos
|
none »
official sites:
facebook
|
official site
|
none
»
alternate names:
tomu kurûzu
height:
5' 7" (1.7 m)
none
did you know?
personal quote:
(1992 quote) i really enjoy talking to other actors and directors. sometimes, if i see their movies, i'll call them up or write them a note saying, "i enjoyed it," or asking, "how did you do that? how did you make that work?". i just saw
The HTML text is stored in a variable called text:
import re
sentence = re.sub(' ', '\n', text)
sentence = re.sub('none', '', sentence)
print(sentence)
The result: the sentences are destroyed.
ethan
hunt
/
ray
ferrier
(uncredited)
2006
the
late
late
show
with
craig
ferguson
(tv
series)
himself
-
episode
#2.140
(2006)
...
himself
(uncredited)
2006
getaway
(tv
series)
himself
-
seven
wonders
of
the
world
(2006)
...
himself
2006
cmt
insider
(tv
series)
himself
-
episode
dated
29
april
2006
(2006)
...
himself
2005-2006
corazón
de...
(tv
series)
himself
-
episode
dated
19
january
2006
(2006)
...
himself
-
episode
dated
15
november
2005
(2005)
...
himself
-
Try this:
^(\s*?\S*){5}$
The pattern is currently set to select any line that has five words or fewer. You can increase or decrease the number of words by changing the value in {5}.
Demo: https://regex101.com/r/z2qxrx/3
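The same idea can be sketched in Python without a regex, by counting the words on each line and discarding the short ones. This swaps the pattern above for a plain str.split count; keep_sentences and max_short_words are hypothetical names, and the sample is a few lines taken from the question.

```python
def keep_sentences(text, max_short_words=5):
    """Drop every line that contains max_short_words words or fewer."""
    kept = []
    for line in text.splitlines():
        if len(line.split()) > max_short_words:
            kept.append(line.strip())
    return "\n".join(kept)

sample = """1990
... himself
the 59th annual academy awards
(tv special)
he is also a film producer and owns a production company.
related videos"""
print(keep_sentences(sample))
```

Longer headings, such as the "how much money does tom cruise make?" line, have more than five words and would survive this filter, so the threshold may need tuning.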
Recently I got my hands on a research project that would greatly benefit from parsing a string of biographical data on several individuals into a set of dictionaries, one for each individual.
The string contains break words, and I was hoping to create keys from the break words and to separate the dictionaries at line breaks. Here are two people from my data for whom I want to create two different dictionaries:
Bankers = [ ' Bakstansky, Peter; Senior Vice President, Federal
Reserve Bank of New York, in charge of public information
since 1976, when he joined the NY Fed as Vice President. Senior
Officer in charge of die Office of Regional and Community Affairs,
Ombudsman for the Bank and Senior Administrative Officer for Executive
Group, m zero children Educ City College of New York (Bachelor of
Business Administration, 1961); University of Illinois, Graduate
School, and New York University, Graduate School of Business. 1962-6:
Business and financial writer, New York, on American Banker, New
York-World Telegram & Sun, Neia York Herald Tribune (banking editor
1964-6). 1966-74: Chase Manhattan Bank: Manager of Public Relations,
based in Paris, 1966-71; Manager of Chase's European Marketing and
Planning, based in Brussels, 1971-2; Vice President and Director of
Public Relations, 1972-4.1974-76: Bache & Co., Vice President and
Director of Corporate Communications. Barron, Patrick K.; First Vice
President and < Operating Officer of the Federal Reserve Bank o
Atlanta since February 1996. Member of the Fed" Reserve Systems
Conference of first Vice Preside Vice chairman of the bank's
management Con and of the Discount Committee, m three child Educ
University of Miami (Bachelor's degree in Management); Harvard
Business School (Prog Management Development); Stonier Graduate Sr of
Banking, Rutgers University. 1967: Joined Fed Reserve Bank of Atlanta
in computer operations 1971: transferred to Miami Branch; 1974:
Assist: President; 1987: Senior Vice President.1988: re1- Atlanta as
Head of Corporate Services. Member Executive Committee of the Georgia
Council on Igmic Education; former vice diairman of Greater
ji§?Charnber of Commerce and the President'sof the University of
Miami; in Atlanta, former ||Mte vice chairman for the United Way of
Atlanta feiSinber of Leadership Atlanta. Member of the Council on
Economic Education. Interest. ' ]
So for example, in this data I have two people - Peter Bakstansky and Patrick K. Barron. I want to create a dictionary for each individual with these 4 keys: bankerjobs, Number of children, Education, and nonbankerjobs.
The text already contains break words: "m" introduces the number of children and "Educ" introduces the education. Anything before "m" is bankerjobs, anything after the first "." following "Educ" is nonbankerjobs, and the break between individuals seems to be a "." followed by more than one space.
How can I create a dictionary for each of these two individuals with these 4 keys using regular expressions on these break words?
Specifically, what set of regexes could help me create a dictionary for these two individuals with these 4 keys (built on the break words specified above)?
A pattern I am thinking of would be something like this in Perl:
pattern = [r'(m/[ '(.*);(.*)m(.*)Educ(.*)/)']
but I'm not sure.
I'm thinking the code would be similar to this, but please correct it if I'm wrong:
import re
my_banker_parser = re.compile(r'somefancyregex')

def nested_dict_from_text(text):
    m = re.search(my_banker_parser, text)
    if not m:
        raise ValueError
    d = m.groupdict()
    return { "centralbanker": d }

# Bankers is a list holding one long string, so pass the string itself.
result = nested_dict_from_text(Bankers[0])
print(result)
My hope is to take this code and run it through the rest of the biographies for all the individuals of interest.
Using named groups will probably be less brittle, since it doesn't depend on the pieces of data being in the same order in each biography. Something like this should work:
>>> import re
>>> regex = re.compile(r'(?P<foo>foo)|(?P<bar>bar)|(?P<baz>baz)')
>>> data = {}
>>> for match in regex.finditer('bar baz foo something'):
... data.update((k, v) for k, v in match.groupdict().items() if v is not None)
...
>>> data
{'baz': 'baz', 'foo': 'foo', 'bar': 'bar'}
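Applied to the biographies, a minimal sketch might look like the following. It assumes a hand-cleaned sample (the text below is a hypothetical, shortened version of the question's data), that people are separated by a "." followed by two or more spaces, and that each entry follows the order name; bankerjobs, m N children Educ education. nonbankerjobs. The names BIO and parse_bios are made up for illustration, and real OCR output would likely need a more forgiving pattern.

```python
import re

# Hypothetical, hand-cleaned sample modelled on the question's data.
text = (
    "Bakstansky, Peter; Senior Vice President, Federal Reserve Bank of "
    "New York, m zero children Educ City College of New York (Bachelor "
    "of Business Administration, 1961). 1962-6: Business and financial "
    "writer, New York.  "
    "Barron, Patrick K.; First Vice President, Federal Reserve Bank of "
    "Atlanta, m three children Educ University of Miami (Bachelor's "
    "degree in Management). 1967: Joined Federal Reserve Bank of Atlanta."
)

BIO = re.compile(
    r"(?P<name>[^;]+);\s*"                      # everything before the first ";"
    r"(?P<bankerjobs>.*?),?\s+m\s+"             # up to the standalone break word "m"
    r"(?P<children>\w+)\s+child\w*\s+Educ\s+"   # e.g. "zero children Educ"
    r"(?P<education>.*?\.)\s+"                  # up to the first "." after Educ
    r"(?P<nonbankerjobs>.*)",                   # the rest of the entry
    re.DOTALL,
)

def parse_bios(raw):
    """Split on '.' followed by 2+ spaces, then parse each person."""
    people = []
    for chunk in re.split(r"(?<=\.)\s{2,}", raw.strip()):
        match = BIO.search(chunk)
        if match:
            people.append(match.groupdict())
    return people

people = parse_bios(text)
for person in people:
    print(person["name"], "|", person["children"], "|", person["education"])
```

Entries that do not match are skipped silently here; for a research data set it may be safer to collect them in a separate list and inspect them by hand.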