Using function objects to speed up the training - python

from datasets import load_dataset  # Hugging Face
from transformers import BertTokenizer  # Hugging Face
def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset
    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)
    Returns:
        dict: dataset once tokenized
    """
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]
    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}
if __name__ == "__main__":
    # load_data is a loading helper that is not shown in the question
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)
I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset it returns will be reused for ML training, and I expect that carrying such a large dictionary around during training will slow it down considerably.
Note that the documents look similar to this:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
So in the dictionary, the keys are just strings, but the values are all lists of lists of strings. Instead of storing lists of lists of strings as values, I think it would be better to store a list of function objects. That should make the dictionary much lighter. How can I do that?
EDIT
To me, there is a difference between copying a list of lists of strings and copying a list of objects: copying an object simply copies the reference, while copying a list of strings copies everything. Copying references is much faster, and that is the point of this question.
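For context, here is a minimal sketch of how Python treats these cases. Passing a list or dict to a function only passes a reference, a shallow copy duplicates the outer list only, and a deep copy duplicates everything. The sketch also shows one way to hold "function objects" that tokenize on demand. The names documents, train_step and encode are hypothetical, and encode is a stand-in for the real tokenizer call.

```python
import copy
from functools import partial

# Hypothetical stand-in data shaped like the question's documents.
documents = [["first sentence .", "second sentence ."], ["another document ."]]
dataset = {"train": documents}

def train_step(data):
    # The function receives a reference to the same dict; nothing is copied.
    return data["train"]

received = train_step(dataset)
shallow = list(documents)        # new outer list, same inner lists
deep = copy.deepcopy(documents)  # new outer list and new inner lists

def encode(document):
    # Stand-in for the real tokenizer call (assumption for illustration).
    return [sentence.split() for sentence in document]

# "Function objects": callables that only tokenize when they are invoked.
lazy_train = [partial(encode, document) for document in documents]
first_encoded = lazy_train[0]()

print(received is documents)       # same object, no copy was made
print(shallow[0] is documents[0])  # inner lists are shared
print(deep[0] is documents[0])     # inner lists were duplicated
print(first_encoded)
```

Whether the lazy variant helps depends on how often each document is read: it saves memory up front, but tokenizes again on every call unless the result is cached.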

Related

Identify and replace using regex some strings, stored within a list, within a string that may or may not contain them

import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
#example 2: almost the same as example 1, but some of the names are already encapsulated
# in the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that the names must meet certain conditions to be identified: they must be surrounded by whitespace (\s*the searched name\s*), or be at the beginning (?:(?<=\s)|^) and/or at the end of the input string.
A name may also be followed by a comma, for example "Ada White, Melissa and Louis went shopping", or spaces may be accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason, a name must also be found when it comes directly after one of [.,;].
Cases where the names should NOT be encapsulated, would be for example...
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
, since in these cases the name is followed or preceded by another word that should not be part of the name that is being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
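As one possible refinement, the sketch below wraps the same idea in a function. Two things are additions that are not part of the answer above: each name is passed through re.escape, and a second lookbehind (?<![\w-]) rejects names glued to a preceding word character or hyphen. The function name tag_names and the shortened name list are hypothetical.

```python
import re

def tag_names(text, names):
    # Longest names first, so "Ada White" wins over "Ada" or "White".
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(
        rf"(?<!\(PERS\))(?<![\w-])({alternatives})(?!['\w)-])"
    )
    # Wrap every accepted match in the ((PERS)name) structure.
    return pattern.sub(r"((PERS)\1)", text)

names = ["Ada White", "Melissa", "Louis", "Edd", "White", "Pasteur"]

comma_case = tag_names("Ada White,Melissa and Louis went shopping", names)
possessive_case = tag_names("the Edd's business", names)
hyphen_case = tag_names("Those White-spaces in that text are unnecessary", names)
tagged_case = tag_names("((PERS)Melissa) met Edd.", names)

print(comma_case)
print(possessive_case)
print(hyphen_case)
print(tagged_case)
```

The first call tags all three names even though a space is missing after the comma; the second and third leave the text unchanged, and the last one tags only the name that is not already encapsulated.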

Remove breaks in input strings

This is the output of a Selenium scrape:
['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
I have to pass this to an algorithm. As you can see, all of the items are properly enclosed in quotes, but the first one breaks onto a new line after "A Conversation with", and this is affecting my input. I tried removing whitespace, but that didn't work. Any help will be highly appreciated.
You can't type a string on multiple lines unless you're using triple quotation marks.
i.e. The first item (string) in your list is
['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru']
You should keep the string on one line as follows.
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru']
Or you could use a triple quotation, as I mentioned.
['''The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru''']
If you'd like to make your list readable and neat as well, I'd recommend breaking after the commas in the list, for example:
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru',
'Mindfulness Self-Care for Students of Color',
'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements',
'Interfaith Discussions',
'Mind the Gap',
'First-Year Arts Board Open House',
'Self-Care Night with CARE and BGLTQ+ Specialty Proctors',
'Drawing Plants & Flowers - Sold Out']
You could try
string_list = ['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
string_list = [s.replace("\n", " ") for s in string_list]  # avoid naming the loop variable str, which shadows the built-in
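A minimal sketch of the same cleanup on data as it might arrive at runtime, assuming the break in the first title is an embedded "\n" character (the list below is a shortened, hypothetical stand-in for the scraped one). Splitting on any whitespace and re-joining with single spaces also collapses stray tabs or doubled spaces.

```python
# Hypothetical stand-in for the scraped list; the "\n" is an assumption.
scraped = [
    "The Quest for Ethical Artificial Intelligence: A Conversation with\nTimnit Gebru",
    "Mind the Gap",
]

# split() with no arguments splits on any run of whitespace, newlines included.
cleaned = [" ".join(title.split()) for title in scraped]
print(cleaned)
```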

Gensim object to a readable result

I have checked the other two answers to similar questions, but in vain.
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize
article = "Aziz Ismail Ansari[1] (/ənˈsɑːri/; born February 23, 1983) is an American actor, writer, producer, director, and comedian. He is known for his role as Tom Haverford on the NBC series Parks and Recreation (2009–2015) and as creator and star of the Netflix series Master of None (2015–) for which he won several acting and writing awards, including two Emmys and a Golden Globe, which was the first award received by an Asian American actor for acting on television.[2][3][4]Ansari began performing comedy in New York City while a student at New York University in 2000. He later co-created and starred in the MTV sketch comedy show Human Giant, after which he had acting roles in a number of feature films.As a stand-up comedian, Ansari released his first comedy special, Intimate Moments for a Sensual Evening, in January 2010 on Comedy Central Records. He continues to perform stand-up on tour and on Netflix. His first book, Modern Romance: An Investigation, was released in June 2015. He was included in the Time 100 list of most influential people in 2016.[5] In July 2019, Ansari released his fifth comedy special Aziz Ansari: Right Now, which was nominated for a Grammy Award for Best Comedy Album.[6]"
tokens = regexp_tokenize(article, r"\w+|\d+")
az_only_alpha = [t for t in tokens if t.isalpha()]
#below line will take some time to execute.
az_no_stop_words = [t for t in az_only_alpha if t not in stopwords.words('english')]
wordnetLem = WordNetLemmatizer()
az_lemmetize = [wordnetLem.lemmatize(t) for t in az_no_stop_words]
dictionary = Dictionary([az_lemmetize])
x = [dictionary.doc2bow(doc) for doc in [az_lemmetize]]
cp = TfidfModel(x)
cp[x[0]]
The last line gives me an empty list, whereas I expect a list of tuples with ids and their weights.
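A likely cause, assuming gensim's default TF-IDF weighting of idf = log2(total_docs / doc_freq): the corpus here contains a single document, so every term has doc_freq = 1 and total_docs = 1, which gives idf = log2(1) = 0, and zero-weight terms are dropped from the result. The pure-Python sketch below reproduces only that arithmetic (without gensim's length normalisation); the function name tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(corpus):
    n_docs = len(corpus)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weighted = []
    for doc in corpus:
        counts = Counter(doc)
        weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
        # Keep only terms with a non-zero weight.
        weighted.append({t: w for t, w in weights.items() if w > 0})
    return weighted

single = tfidf([["actor", "comedian", "actor"]])
double = tfidf([["actor", "comedian", "actor"], ["comedian", "book"]])
print(single)
print(double)
```

With one document nothing survives; with two, only the terms that do not occur in every document keep a weight. Under that assumption, fitting the model on several documents should produce a non-empty result.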

Python deal with HTML

A Python beginner is struggling here and needs some help :-(
import urllib.request
from bs4 import BeautifulSoup
def getLinkinfo(endpoint):
    response = urllib.request.urlopen(endpoint)
    link_info = response.read().decode()
    return link_info
text = getLinkinfo('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
soup1 = BeautifulSoup(text, 'html.parser')
# class_='article' filters on the CSS class; 'class'=='article' just evaluates to False
k = soup1.find_all('div', class_='article')
Here I have already cut the main body I need to deal with, and one of the outputs is as shown below:
<div class="article"><h5>1. Let's resolve to reconnect, says Welby in new year message</h5>
<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>
<p class="metadata">Category: <span>UK-News</span></p>
<p class="snippet">The archbishop of Canterbury will urge people to make personal connections with others in 2020 to create a new unity in a divided society. In his new …</p></div>
My question is how I can get:
(1) the 'Title' between the <h5>, <a> tags
(2) the 'Category', which comes after <p class="metadata"> (there are two <p class="metadata"> elements; the one with the time is not needed)
(3) the 'Snippet', which comes after <p class="snippet">
Thanks in advance for the help. I feel that if I know how to deal with this example, I can process a lot more.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint as pp
def main(url):
    r = urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')
    goal = [(x.a.text, x.select("p")[1].text.split(' ', 1)[1], x.select_one('p.snippet').text)
            for x in soup.select('.article')]
    pp(goal)
main('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
Output:
[("Let's resolve to reconnect, says Welby in new year message",
'UK-News',
'The archbishop of Canterbury will urge people to make personal connections '
'with others in 2020 to create a new unity in a divided society. In his new '
'…'),
("Be honest. You're not going to read all those books on your holiday, are "
'you?',
'Books',
'Every year, about this time, my Instagram feed fills up with pictures of '
'books. They’re piled somewhere between five and ten inches high, sometimes '
'st …'),
("Mariah Carey's Twitter account hacked on New Year's Eve",
'Music',
'Mariah Carey’s Twitter account appeared to have been hacked late Tuesday '
'afternoon, sharing numerous racist slurs and comments with the singer’s '
'21.4 …'),
('The joy audit: how to have more fun in 2020',
'Life-and-Style',
'The last time I felt joy was at an event that would be many people’s vision '
'of hell: a drunken Taylor Swift club-night singalong in the early hours of '
'…'),
('Providence Lost by Paul Lay review – the rise and fall of Oliver Cromwell’s '
'Protectorate',
'Books',
'The only public execution of a British head of state occurred 371 years ago '
'outside the Banqueting House in Whitehall on 30 January 1649. It was a rad '
'…'),
('Zero-carbon electricity outstrips fossil fuels in Britain across 2019',
'Business',
'Summary: Zero-carbon energy became Britain’s largest electricity source in '
'2019, delivering nearly half the country’s electrical power and for the '
'first time o …'),
('The final sprint: will any of the Democratic candidates excite voters?',
'US-News',
'Democrats overwhelmingly agree that their top priority in 2020 is to remove '
'Donald Trump from office. But which of the many Democrats running for pres '
'…'),
('War epics, airmen and young Sopranos: essential films for 2020',
'Film',
'1917 An epic of Lean-ian proportions is delivered in this spectacular from '
'director and co-writer Sam Mendes, who has developed a real-life story of h '
'…'),
("Stashing your cash: the beginner's guide to saving",
'Life-and-Style',
'Much like going for a run or eating your greens, saving your cash offers '
'long-term benefits, but is not always appealing. And, let’s face it, there '
'ar …'),
("'I'm on the hunt for humour and hope': what will authors be reading in "
'2020?',
'Books',
'Matt Haig I have been very dark and gloomy with my reading habits this '
'year, perhaps in tune with the social mood. Like a pig sniffing for '
'truffles, I …'),
('Twenty athletes set to light up the Tokyo 2020 Olympics',
'Sport',
'Dina Asher-Smith Great Britain Athletics, 100m, 200m, 4x100m Seb Coe, who '
'knows a thing or two about winning Olympic titles, believes Asher-Smith '
'will …'),
('The most exciting movies of 2020 – horror',
'Film',
'The Grudge A belated English language reboot of Japanese classic Ju-On: The '
'Grudge (2002), this stars Andrea Riseborough and Demián Bichir as detectiv '
'…'),
('Diary of a Murderer by Kim Young-ha review – dark stories from South Korea',
'Books',
'Given that loss of memory has become a familiar device in fiction, and the '
'psychopath such a popular character archetype, we shouldn’t be too surprise '
'…'),
('US election, Brexit and China to sway the markets in 2020',
'Business',
'After profiting from strong markets in 2019, investors are expecting 2020 '
'to bring further rising asset prices and lively merger activity. But '
'Brexit, …'),
('TS Eliot’s intimate letters to confidante unveiled after 60 years',
'Books',
'A collection of more than 1,000 letters from the Nobel laureate TS Eliot to '
'his confidante and muse Emily Hale is unveiled this week, after having bee '
'…'),
('Clive Lewis calls for unity among Labour leadership hopefuls',
'Politics',
'Summary: The Labour leadership hopeful Clive Lewis has called for unity '
'among would-be candidates to succeed Jeremy Corbyn as they confront the '
'“cliff face” of …'),
("Visa applications: Home Office refuses to reveal 'high risk' countries",
'UK-News',
'Summary: Campaign groups have criticised the Home Office after it refused '
'to release details of which countries are deemed a “risk” in an algorithm '
'that filter …'),
('Victims of NYE Surrey road crash were BA cabin crew',
'UK-News',
'At least seven people have been killed across the UK in road traffic '
'collisions over the new year period. The deaths included three British '
'Airways ca …'),
('Man held on suspicion of double murder after bodies found in house',
'UK-News',
'Police have arrested a man on suspicion of murdering two people at a house '
'in the village of Duffield in Derbyshire. The murder investigation was laun '
'…'),
("Great expectations: 'The quest for perfection has cannibalised my identity'",
'Life-and-Style',
'“You need to practice self-compassion,” my psychologist says to me. This is '
'our sixth session and as per usual he is struggling to find a phrase, a po '
'…'),
('Anti-Islamic slogans spray-painted near mosque in Brixton',
'UK-News',
'Anti-Islamic slogans have been painted on a building close to a mosque and '
'cultural centre in south London, the Metropolitan police have said. Officer '
'…'),
('Michael van Gerwen 3-7 Peter Wright: PDC world darts championship final – '
'as it happened',
'Sport',
'Summary: That’s it for tonight’s blog, so I’ll leave you with a report from '
'Ally Pally. Thanks for your company, goodnight! There’s so much affection '
'for Peter …'),
('Greggs launches meatless steak bake to beef up its vegan range',
'Business',
'Greggs, the UK’s largest bakery chain, will end speculation about its hotly '
'anticipated new vegan snack by launching a meat-free version of its popula '
'…'),
('Woodford folk festival review – a much-needed moment of positivity and '
'reprieve',
'Music',
'If Woodford folk festival was in mourning this year, you wouldn’t have '
'known it. The death in May of festival elder and decade-long patron Bob '
'Hawke c …'),
('Household haze: how to reduce smoke in your home without an air purifier',
'Life-and-Style',
'Summary: On 1 January, Canberra experienced its worst air quality on '
'record. Smoke from Australia’s devastating bushfires has now blown as far '
'as Queenstown, N …'),
("Sadiq Khan pledges free London travel for disabled people's carers",
'Politics',
'Sadiq Khan has kickstarted his bid for a second term as London mayor by '
'pledging free travel on the city’s transport for anyone accompanying a '
'disable …'),
('‘Everyone thought I was mad’: how to make a life-changing decision – and '
'stick to it',
'Life-and-Style',
'Summary: When I was 26, I broke up with a long-term partner, got an '
'ill-advised facial piercing and changed careers – all in the space of a '
'month. What I learn …'),
('In the Line of Duty review – race-against-time cop thriller',
'Film',
'There’s a straight-to-video feel to this cop thriller, directed by action '
'veteran Steven C Miller, written by Jeremy Drysdale (who scripted the indie '
'…'),
("Manchester poet Tony Walsh performs tribute to children's hospital",
'UK-News',
'The performance poet Tony Walsh, whose ode to Manchester became a worldwide '
'hit after the Arena bomb, has written a moving tribute to Royal Manchester '
'…'),
('Can your phone keep you fit? Our writers try 10 big fitness apps – from '
'weightlifting to pilates',
'Life-and-Style',
'Centr Price £15.49 a month. What is it? A full-service experience from the '
'Hollywood star Chris Hemsworth: not just workouts, but a complete meal plan '
'…'),
('We Are from Jazz review – zany Russian musical comedy',
'Film',
'Only in a Woody Allen film will you hear quite as much Dixieland jazz as '
'this. Here is We Are from Jazz, or We Are Jazzmen, the zany jazz comedy '
'music …'),
('The Other Half of Augusta Hope by Joanna Glen review – high emotions',
'Books',
'Summary: Who is Augusta Hope’s “other half”? In Glen’s debut, shortlisted '
'for the Costa first novel prize, at first it’s Augusta’s twin sister, '
'although the di …'),
('Tara Erraught/James Baillieu review – quietly intense and simply exquisite',
'Music',
'Irish mezzo Tara Erraught’s latest Wigmore recital with her pianist James '
'Baillieu took place between Christmas and New Year, though her beautifully '
'c …'),
('Talking Horses: picking the five best races of the last decade',
'Sport',
'You might take the view that the end of the decade is actually a year away, '
'but at least it’s 10 years since I last did something like this. I’ve limi '
'…'),
('The Reality Bubble by Ziya Tong review – blind spots and hidden truths',
'Books',
'Publishing functions very much like the fashion world. Like a suddenly '
'ubiquitous cut of hem or style of trainer, a book comes along every few '
'seasons …'),
('Alleged drink-driver arrested on motorway had no front tyres',
'UK-News',
'An alleged drink-driver who was arrested on the motorway on New Year’s Day '
'had been driving without front tyres. The motorist was said to be nearly si '
'…'),
('MC Beaton, multimillion-selling author of Agatha Raisin novels, dies aged '
'83',
'Books',
'MC Beaton, the prolific creator of the much loved fictional detectives '
'Agatha Raisin and Hamish Macbeth, has died after a short illness at the age '
'of …'),
("First transgender Marvel superhero coming 'very soon'",
'Film',
'The first transgender character in a Marvel movie will probably appear in a '
'film released next year. Speaking at an event at the New York Film Academy '
'…'),
('The six-pack can wait: how to set fitness goals you will actually keep',
'Life-and-Style',
'Summary: Most of us have, at some point in our lives, looked in the mirror '
'and decided we need a radical image overhaul – especially in January. Then, '
'when we …'),
('Gold from Highlands mine to be made into Scottish jewellery',
'UK-News',
'A small goldmine in the Highlands plans to start producing gold in '
'commercial quantities for the first time after repeated delays. The mine at '
'Cononis …'),
('Tell us about your mixed-sex civil partnership plans',
'UK-News',
'The first mixed-sex couples have started to become civil partners in the '
'UK, following a landmark legal battle won by Rebecca Steinfeld and Charles '
'Ke …'),
('England ready for tortoise and hare race in second Test at Newlands',
'Sport',
'As Harold Macmillan is supposed to have explained, there are times when the '
'best‑laid plans disappear like melting snow in springtime and a whole new '
'…'),
("All Federico Fellini's films – ranked!",
'Film',
'20. The Voice of the Moon (1990) A gentle, episodic Fellini, with Roberto '
'Benigni playing Ivo, a madcap character who travels far and wide across the '
'…'),
('Isle of Wight’s rattling, rolling, charming ex-tube trains face end of the '
'line',
'UK-News',
'The train trip from Ryde Pier Head to Shanklin on the Isle of Wight in '
'carriages built 80 years ago for the small tunnels of certain London '
'Undergroun …'),
('Meaty by Samantha Irby review – scatological essays',
'Books',
'To call Samantha Irby’s book scatological would be an understatement. This '
'is a book about assholes – yes, the kind who cheats on you, or never calls, '
'…'),
("Call for more diverse Lake District sparks row over area's future",
'UK-News',
'The head of the Lake District national park authority (LDNPA) has been '
'accused of using the issue of diversity to push through commercial '
'development …'),
("Sharon Choi: how we fell for Bong Joon-ho's translator",
'Film',
'Just when you thought Bong Joon-ho – the affable maestro of Korean cinema '
'and now, with his class-conscious Cannes winner Parasite, champion of the '
'pe …'),
('Whitehall reforms may lead to discrimination, says union',
'Politics',
'Boris Johnson’s “seismic” Whitehall reforms, including regular exams for '
'senior civil servants, could lead to discrimination against staff on the '
'grou …'),
('The most exciting movies of 2020 – biopics',
'Film',
'Respect Having wiped away her Catstoddler snot, Jennifer Hudson gives her '
'pipes a wider airing in this Aretha Franklin biopic which – unlike other '
'mov …'),
('Tune-free pop and the new Katie Hopkins: our 2020 celebrity predictions',
'Life-and-Style',
'There are two ways to spend New Year’s Eve, as best as I can tell: you '
'either dirty the floor of a house party and spend the smallest of the small '
'hou …')]
Great question. Covers the basic common issues of finding selective text that many come across.
We can use BeautifulSoup's own find methods to get selective text, as below:
import requests
from bs4 import BeautifulSoup
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# 1. To get PARENT TEXT only and ignore CHILD TEXT
title = soup.find('h5')
titletextonly = title.find_all(text=True, recursive=False) # will give only parent text
#fulltitletext = title.find_all(text=True, recursive=True) # will give all text under it including child text
titletext = "".join(titletextonly)
##################################################
# 2. Category
p_elements = soup.find_all('p', class_='metadata')
for p_element in p_elements:
    p_with_span_element = p_element.find('span')
    if p_with_span_element: # the date line has no <span>, so skip it
        span_text = p_with_span_element.text # UK-News
##################################################
# 3. Snippet
p_snippet_text = soup.find('p', class_='snippet').text

Transforming a statement into an interrogative sentence with Python NLTK

I have thousands of sentences about events that happened in the past, e.g.
sentence1 = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
sentence2 = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
sentence3 = 'The Hindu Medang kingdom flourishes and declines.'
I want to transform them into questions of the form:
question1 = 'When were the Knights Templar founded to protect Christian pilgrims in Jerusalem?'
question2 = 'When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?'
question3 = 'When did the Hindu Medang kingdom flourish and decline?'
I realize that this is a complex problem and I am ok with a success rate of 80%.
As far as I understand from searches on the web, NLTK is the way to go for this kind of problem.
I started to try some things, but it is the first time I have used this library and I cannot get much further than this:
import nltk
question = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
tokens = nltk.word_tokenize(question)
tagged = nltk.pos_tag(tokens)
This sounds like a problem many people must have encountered and solved.
Any suggestions?
NLTK can definitely be the right tool to use here. But the quality of your tokenizer and pos-tagger output depends on your corpus and type of sentences. Also, there is usually not really an out-of-the-box solution to this (afaik), and it requires some tuning. If you don't have very much time to put into this, I doubt that your success rate will even reach 80%.
Having said that, here's a basic list-insertion-based example that may help you to capture and successfully convert some of your sentences.
import nltk
question_one = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
question_two = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
def modify(inputStr):
    # PunktWordTokenizer belongs to older NLTK releases; on recent ones,
    # nltk.word_tokenize(inputStr) is the usual substitute.
    tokens = nltk.PunktWordTokenizer().tokenize(inputStr)
    tagged = nltk.pos_tag(tokens)
    # indices of tokens tagged VBP (non-3rd-person present verbs such as 'are')
    auxiliary_verbs = [i for i, w in enumerate(tagged) if w[1] == 'VBP']
    if auxiliary_verbs:
        tagged.insert(0, tagged.pop(auxiliary_verbs[0]))
    else:
        tagged.insert(0, ('did', 'VBD'))
    tagged.insert(0, ('When', 'WRB'))
    return ' '.join([t[0] for t in tagged])
question_one = modify(question_one)
question_two = modify(question_two)
print(question_one)
print(question_two)
Output:
When are The Knights Templar founded to protect Christian pilgrims in Jerusalem.
When did Alfonso VI of Castile captures the Moorish Muslim city of Toledo , Spain.
As you can see, you'd still need to fix the casing ('The' is still uppercase), 'captures' is now in the wrong form, and you will want to expand the auxiliary_verbs types (probably 'VBP' alone is too limited). But it's a start. Hope this helps!
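The remaining fixes mentioned above could be sketched roughly as follows. To stay self-contained, this version works on hand-written (word, tag) pairs rather than on real nltk.pos_tag output, so the tags below are assumptions. The helper names base_form and to_when_question are hypothetical, and base_form is a crude suffix rule, not a lemmatizer.

```python
AUX_PAST = {"are": "were", "is": "was"}

def base_form(verb):
    # Crude de-inflection of a third-person singular verb (assumption).
    for suffix in ("shes", "ches", "sses", "xes", "zes"):
        if verb.endswith(suffix):
            return verb[:-2]
    if verb.endswith("ies"):
        return verb[:-3] + "y"
    if verb.endswith("s") and not verb.endswith("ss"):
        return verb[:-1]
    return verb

def to_when_question(tagged):
    tokens = list(tagged)
    if tokens and tokens[-1][0] == ".":
        tokens.pop()  # the final period becomes a question mark
    if tokens and tokens[0][1] == "DT":
        tokens[0] = (tokens[0][0].lower(), "DT")  # 'The' -> 'the'
    aux = next((i for i, (w, _) in enumerate(tokens) if w.lower() in AUX_PAST), None)
    if aux is not None:
        # Move the auxiliary to the front and put it in the past tense.
        word, _ = tokens.pop(aux)
        words = ["When", AUX_PAST[word.lower()]] + [w for w, _ in tokens]
    else:
        # No auxiliary: insert 'did' and reduce VBZ verbs to their base form.
        words = ["When", "did"] + [base_form(w) if t == "VBZ" else w for w, t in tokens]
    return " ".join(words).replace(" ,", ",") + "?"

tagged_one = [("The", "DT"), ("Knights", "NNPS"), ("Templar", "NNP"), ("are", "VBP"),
              ("founded", "VBN"), ("to", "TO"), ("protect", "VB"), ("Christian", "JJ"),
              ("pilgrims", "NNS"), ("in", "IN"), ("Jerusalem", "NNP"), (".", ".")]
tagged_three = [("The", "DT"), ("Hindu", "NNP"), ("Medang", "NNP"), ("kingdom", "NN"),
                ("flourishes", "VBZ"), ("and", "CC"), ("declines", "VBZ"), (".", ".")]

result_one = to_when_question(tagged_one)
result_three = to_when_question(tagged_three)
print(result_one)
print(result_three)
```

On real tagger output the VBZ tags will not always be this clean (a word like 'declines' may well be tagged as a noun), so the rules would still need tuning against your own sentences.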
