Remove breaks in input strings - python

This is a list scraped from a Selenium dump:
['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
I have to pass this list to an algorithm, but as you can see, although every item is properly enclosed in quotes, the first one breaks onto a new line after "Conversation with", and this is affecting my input. I tried removing whitespace, but that didn't work. Any help will be highly appreciated.

You can't type a string on multiple lines unless you're using triple quotation marks.
For example, the first item (string) in your list is
['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru']
You should keep the string on one line as follows.
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru']
Or you could use a triple quotation, as I mentioned.
['''The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru''']
If you'd like to keep your list readable and neat as well, I'd recommend breaking the lines after the commas in the list, for example:
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru',
'Mindfulness Self-Care for Students of Color',
'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements',
'Interfaith Discussions',
'Mind the Gap',
'First-Year Arts Board Open House',
'Self-Care Night with CARE and BGLTQ+ Specialty Proctors',
'Drawing Plants & Flowers - Sold Out']

You could try
string_list = ['The Quest for Ethical Artificial Intelligence: A Conversation with
Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
string_list = [s.replace("\n", " ") for s in string_list]  # `s` avoids shadowing the built-in `str`
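A minimal sketch of the same idea, assuming the line break is stored in the scraped string as a "\n" character (the two-item list here is shortened sample data). Splitting on any whitespace and rejoining with single spaces also collapses stray tabs or doubled spaces, which a plain replace would leave behind:

```python
# Sample data: the first title contains an embedded newline, as in the scrape.
titles = [
    'The Quest for Ethical Artificial Intelligence: A Conversation with\nTimnit Gebru',
    'Mind the Gap',
]

# str.split() with no argument splits on any run of whitespace (spaces,
# tabs, newlines); joining with ' ' rebuilds each title on a single line.
cleaned = [' '.join(title.split()) for title in titles]

print(cleaned)
```

If only newlines should be touched and other whitespace preserved, the replace-based version above is the narrower choice.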

Related

Identify and replace using regex some strings, stored within a list, within a string that may or may not contain them

import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
#Example 2 is almost the same; however, some of the names were already encapsulated
# under the ((PERS)name) structure, and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that names must meet certain conditions to be identified: they must be surrounded by whitespace (\s*the searched name\s*), or be at the beginning (?:(?<=\s)|^) and/or at the end of the input string.
It may also be the case that a name is followed by a comma, for example "Ada White, Melissa and Louis went shopping", or that spaces are accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason, a name must still be found when it comes directly after one of the characters [.,;].
Cases where the names should NOT be encapsulated would be, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
In these cases the name is followed or preceded by other characters that should not be part of the name being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names, which must not be encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
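One possible refinement, sketched here under assumptions rather than taken from the answer: escape each name with re.escape (in case a name ever contains regex metacharacters) and add a second lookbehind so a name glued to a preceding word character or hyphen is skipped. The helper name tag_names, the shortened name list, and the sample sentence are made up for illustration.

```python
import re

# Shortened, hypothetical subset of the name list; longest names first so
# that 'Thomas Edd' is tried before 'Thomas' or 'Edd'.
names = ['Thomas Edd', 'Melissa Clark', 'Edd Thomas', 'Thomas',
         'Melissa', 'Edd', 'Clark', 'White']
names.sort(key=len, reverse=True)

alternation = '|'.join(re.escape(name) for name in names)

# (?<!\(PERS\))  skip names that already sit right after "(PERS)"
# (?<![\w-])     skip names glued to a preceding word character or hyphen
# (?!['\w)-])    skip names followed by ', a word character, ")" or "-"
pat = re.compile(rf"(?<!\(PERS\))(?<![\w-])({alternation})(?!['\w)-])")

def tag_names(text):
    """Wrap each untagged name in ((PERS)...)."""
    return pat.sub(r'((PERS)\1)', text)

sample = ("((PERS)Melissa) met Thomas Edd,Edd Thomas and White "
          "at Edd's shop near the White-spaces.")
tagged = tag_names(sample)
print(tagged)
```

Running tag_names on its own output should leave the text unchanged, which is a quick way to check that already encapsulated names are not wrapped twice.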

How can I use regular expressions to extract all words with at least one digit in text with Python

I am new to regular expressions and I have a text as follows. How can I use a regex to extract all words that contain at least one digit? I'd really appreciate any help.
text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the
trustees finally suspended operations, the majority of students had joined the military, President Joseph
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops.
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army.
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged
campus. A petition to the federal war department for monetary compensation for campus damage done by the
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty
throughout the war. A Senate committee which considered the bill for damages also noted that East
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873,
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant
signed the new bill and the trustees accepted the funds the same day with an agreement to release the
government from all claims. (More than a century and a half later, a buried Union trench was located in
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''
You could use this pattern:
'\w*\d+\w*'
How does it work:
\w* matches 0 or more word characters (letters, digits or underscore; not spaces or punctuation)
\d+ matches 1 or more digits
\w* matches 0 or more word characters again
Using re.findall:
re.findall(r'\w*\d+\w*', text)
we get:
['1861',
'1862',
'1862',
'1863',
'29',
'1863',
'1865',
'1866',
'1873',
'18',
'500',
'22',
'1874',
'2019']
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S any non-whitespace character,
\d any digit,
+ one or more occurrences,
* zero or more occurrences
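To make the difference between the two suggested patterns concrete, here is a small comparison on a shortened sentence adapted from the question's text (the sample string and variable names are illustrative). The \w version splits "18,500" at the comma, while the \S version keeps it whole but also keeps surrounding punctuation, which can then be trimmed:

```python
import re
import string

sample = ("However in 1873, President Grant vetoed the bill that would have "
          "provided $18,500. On June 22, 1874, he signed the new bill.")

# \w stops at punctuation, so "$18,500." yields two separate matches.
word_matches = re.findall(r"\w*\d+\w*", sample)

# \S runs until whitespace, so punctuation stays attached to the match.
token_matches = re.findall(r"\S*\d+\S*", sample)

# Trim leading/trailing punctuation while keeping the inner comma.
trimmed = [token.strip(string.punctuation) for token in token_matches]

print(word_matches)   # ['1873', '18', '500', '22', '1874']
print(token_matches)  # ['1873,', '$18,500.', '22,', '1874,']
print(trimmed)        # ['1873', '18,500', '22', '1874']
```

Which variant fits depends on whether a figure such as 18,500 should count as one word or two.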

Using function objects to speed up the training

from datasets import load_dataset  # Hugging Face
from transformers import BertTokenizer  # Hugging Face
def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset
    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)
    Returns:
        dict: dataset once tokenized
    """
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]
    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}
if __name__ == "__main__":
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)
I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset created by that function will be reused for ML training. However, carrying that dictionary around during training will slow the training down a lot.
Be aware that the documents look similar to this:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
So in the dictionary, the keys are just strings, but the values are all lists of lists of strings. Instead of storing lists of lists of strings as values, I think it would be better to store something like a list of function objects. That would make the dictionary much lighter. How can I do that?
EDIT
For me, there is a difference between copying a list of lists of strings and copying a list of objects. Copying an object will simply copy the reference while copying a list of strings will copy everything. So it is much faster to copy the reference instead. This is the point of this question.
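As a side note on the copying concern, here is a small sketch of how Python handles a dictionary whose values are nested lists (the tiny data set and variable names are made up for illustration). Assigning or passing such a value hands over a reference to the same list; the strings are duplicated only when a deep copy is requested explicitly:

```python
import copy

# Tiny stand-in for the nested lists of sentences shown above.
docs = [["first sentence .", "second sentence ."], ["another document ."]]
data = {"train": docs}

alias = data["train"]        # plain assignment: same list object, no copy
shallow = dict(data)         # new dict, but its value is still the same list
deep = copy.deepcopy(data)   # only this duplicates the nested lists

print(alias is docs)             # True
print(shallow["train"] is docs)  # True
print(deep["train"] is docs)     # False
print(deep["train"] == docs)     # True (equal content, separate objects)
```

So the cost of "dragging" the dictionary around depends on whether something in the training pipeline makes explicit copies, not on whether the values are lists or other objects.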

Python deal with HTML

A Python beginner is struggling here and needs some help :-(
import urllib.request
from bs4 import BeautifulSoup
def getLinkinfo(endpoint):
    response = urllib.request.urlopen(endpoint)
    link_info = response.read().decode()
    return link_info
text = getLinkinfo('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
soup1 = BeautifulSoup(text, 'html.parser')
k = soup1.find_all('div', class_='article')  # all <div class="article"> elements
Here I have already cut the main body I need to deal with, and one of the outputs is as shown below:
<div class="article"><h5>1. Let's resolve to reconnect, says Welby in new year message</h5>
<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>
<p class="metadata">Category: <span>UK-News</span></p>
<p class="snippet">The archbishop of Canterbury will urge people to make personal connections with others in 2020 to create a new unity in a divided society. In his new …</p></div>
My question is how I can get:
(1) the 'Title' inside the <h5> and <a> tags
(2) the 'Category' inside <p class="metadata"> (there are two <p class="metadata"> elements; the one with the time is not needed)
(3) the 'Snippet' inside <p class="snippet">
Thanks in advance for the help. I feel that if I know how to deal with this example, I can process a lot more.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint as pp
def main(url):
    r = urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')
    goal = [(x.a.text, x.select("p")[1].text.split(' ', 1)[1], x.select_one('p.snippet').text)
            for x in soup.select('.article')]
    pp(goal)
main('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
Output:
[("Let's resolve to reconnect, says Welby in new year message",
'UK-News',
'The archbishop of Canterbury will urge people to make personal connections '
'with others in 2020 to create a new unity in a divided society. In his new '
'…'),
("Be honest. You're not going to read all those books on your holiday, are "
'you?',
'Books',
'Every year, about this time, my Instagram feed fills up with pictures of '
'books. They’re piled somewhere between five and ten inches high, sometimes '
'st …'),
("Mariah Carey's Twitter account hacked on New Year's Eve",
'Music',
'Mariah Carey’s Twitter account appeared to have been hacked late Tuesday '
'afternoon, sharing numerous racist slurs and comments with the singer’s '
'21.4 …'),
('The joy audit: how to have more fun in 2020',
'Life-and-Style',
'The last time I felt joy was at an event that would be many people’s vision '
'of hell: a drunken Taylor Swift club-night singalong in the early hours of '
'…'),
('Providence Lost by Paul Lay review – the rise and fall of Oliver Cromwell’s '
'Protectorate',
'Books',
'The only public execution of a British head of state occurred 371 years ago '
'outside the Banqueting House in Whitehall on 30 January 1649. It was a rad '
'…'),
('Zero-carbon electricity outstrips fossil fuels in Britain across 2019',
'Business',
'Summary: Zero-carbon energy became Britain’s largest electricity source in '
'2019, delivering nearly half the country’s electrical power and for the '
'first time o …'),
('The final sprint: will any of the Democratic candidates excite voters?',
'US-News',
'Democrats overwhelmingly agree that their top priority in 2020 is to remove '
'Donald Trump from office. But which of the many Democrats running for pres '
'…'),
('War epics, airmen and young Sopranos: essential films for 2020',
'Film',
'1917 An epic of Lean-ian proportions is delivered in this spectacular from '
'director and co-writer Sam Mendes, who has developed a real-life story of h '
'…'),
("Stashing your cash: the beginner's guide to saving",
'Life-and-Style',
'Much like going for a run or eating your greens, saving your cash offers '
'long-term benefits, but is not always appealing. And, let’s face it, there '
'ar …'),
("'I'm on the hunt for humour and hope': what will authors be reading in "
'2020?',
'Books',
'Matt Haig I have been very dark and gloomy with my reading habits this '
'year, perhaps in tune with the social mood. Like a pig sniffing for '
'truffles, I …'),
('Twenty athletes set to light up the Tokyo 2020 Olympics',
'Sport',
'Dina Asher-Smith Great Britain Athletics, 100m, 200m, 4x100m Seb Coe, who '
'knows a thing or two about winning Olympic titles, believes Asher-Smith '
'will …'),
('The most exciting movies of 2020 – horror',
'Film',
'The Grudge A belated English language reboot of Japanese classic Ju-On: The '
'Grudge (2002), this stars Andrea Riseborough and Demián Bichir as detectiv '
'…'),
('Diary of a Murderer by Kim Young-ha review – dark stories from South Korea',
'Books',
'Given that loss of memory has become a familiar device in fiction, and the '
'psychopath such a popular character archetype, we shouldn’t be too surprise '
'…'),
('US election, Brexit and China to sway the markets in 2020',
'Business',
'After profiting from strong markets in 2019, investors are expecting 2020 '
'to bring further rising asset prices and lively merger activity. But '
'Brexit, …'),
('TS Eliot’s intimate letters to confidante unveiled after 60 years',
'Books',
'A collection of more than 1,000 letters from the Nobel laureate TS Eliot to '
'his confidante and muse Emily Hale is unveiled this week, after having bee '
'…'),
('Clive Lewis calls for unity among Labour leadership hopefuls',
'Politics',
'Summary: The Labour leadership hopeful Clive Lewis has called for unity '
'among would-be candidates to succeed Jeremy Corbyn as they confront the '
'“cliff face” of …'),
("Visa applications: Home Office refuses to reveal 'high risk' countries",
'UK-News',
'Summary: Campaign groups have criticised the Home Office after it refused '
'to release details of which countries are deemed a “risk” in an algorithm '
'that filter …'),
('Victims of NYE Surrey road crash were BA cabin crew',
'UK-News',
'At least seven people have been killed across the UK in road traffic '
'collisions over the new year period. The deaths included three British '
'Airways ca …'),
('Man held on suspicion of double murder after bodies found in house',
'UK-News',
'Police have arrested a man on suspicion of murdering two people at a house '
'in the village of Duffield in Derbyshire. The murder investigation was laun '
'…'),
("Great expectations: 'The quest for perfection has cannibalised my identity'",
'Life-and-Style',
'“You need to practice self-compassion,” my psychologist says to me. This is '
'our sixth session and as per usual he is struggling to find a phrase, a po '
'…'),
('Anti-Islamic slogans spray-painted near mosque in Brixton',
'UK-News',
'Anti-Islamic slogans have been painted on a building close to a mosque and '
'cultural centre in south London, the Metropolitan police have said. Officer '
'…'),
('Michael van Gerwen 3-7 Peter Wright: PDC world darts championship final – '
'as it happened',
'Sport',
'Summary: That’s it for tonight’s blog, so I’ll leave you with a report from '
'Ally Pally. Thanks for your company, goodnight! There’s so much affection '
'for Peter …'),
('Greggs launches meatless steak bake to beef up its vegan range',
'Business',
'Greggs, the UK’s largest bakery chain, will end speculation about its hotly '
'anticipated new vegan snack by launching a meat-free version of its popula '
'…'),
('Woodford folk festival review – a much-needed moment of positivity and '
'reprieve',
'Music',
'If Woodford folk festival was in mourning this year, you wouldn’t have '
'known it. The death in May of festival elder and decade-long patron Bob '
'Hawke c …'),
('Household haze: how to reduce smoke in your home without an air purifier',
'Life-and-Style',
'Summary: On 1 January, Canberra experienced its worst air quality on '
'record. Smoke from Australia’s devastating bushfires has now blown as far '
'as Queenstown, N …'),
("Sadiq Khan pledges free London travel for disabled people's carers",
'Politics',
'Sadiq Khan has kickstarted his bid for a second term as London mayor by '
'pledging free travel on the city’s transport for anyone accompanying a '
'disable …'),
('‘Everyone thought I was mad’: how to make a life-changing decision – and '
'stick to it',
'Life-and-Style',
'Summary: When I was 26, I broke up with a long-term partner, got an '
'ill-advised facial piercing and changed careers – all in the space of a '
'month. What I learn …'),
('In the Line of Duty review – race-against-time cop thriller',
'Film',
'There’s a straight-to-video feel to this cop thriller, directed by action '
'veteran Steven C Miller, written by Jeremy Drysdale (who scripted the indie '
'…'),
("Manchester poet Tony Walsh performs tribute to children's hospital",
'UK-News',
'The performance poet Tony Walsh, whose ode to Manchester became a worldwide '
'hit after the Arena bomb, has written a moving tribute to Royal Manchester '
'…'),
('Can your phone keep you fit? Our writers try 10 big fitness apps – from '
'weightlifting to pilates',
'Life-and-Style',
'Centr Price £15.49 a month. What is it? A full-service experience from the '
'Hollywood star Chris Hemsworth: not just workouts, but a complete meal plan '
'…'),
('We Are from Jazz review – zany Russian musical comedy',
'Film',
'Only in a Woody Allen film will you hear quite as much Dixieland jazz as '
'this. Here is We Are from Jazz, or We Are Jazzmen, the zany jazz comedy '
'music …'),
('The Other Half of Augusta Hope by Joanna Glen review – high emotions',
'Books',
'Summary: Who is Augusta Hope’s “other half”? In Glen’s debut, shortlisted '
'for the Costa first novel prize, at first it’s Augusta’s twin sister, '
'although the di …'),
('Tara Erraught/James Baillieu review – quietly intense and simply exquisite',
'Music',
'Irish mezzo Tara Erraught’s latest Wigmore recital with her pianist James '
'Baillieu took place between Christmas and New Year, though her beautifully '
'c …'),
('Talking Horses: picking the five best races of the last decade',
'Sport',
'You might take the view that the end of the decade is actually a year away, '
'but at least it’s 10 years since I last did something like this. I’ve limi '
'…'),
('The Reality Bubble by Ziya Tong review – blind spots and hidden truths',
'Books',
'Publishing functions very much like the fashion world. Like a suddenly '
'ubiquitous cut of hem or style of trainer, a book comes along every few '
'seasons …'),
('Alleged drink-driver arrested on motorway had no front tyres',
'UK-News',
'An alleged drink-driver who was arrested on the motorway on New Year’s Day '
'had been driving without front tyres. The motorist was said to be nearly si '
'…'),
('MC Beaton, multimillion-selling author of Agatha Raisin novels, dies aged '
'83',
'Books',
'MC Beaton, the prolific creator of the much loved fictional detectives '
'Agatha Raisin and Hamish Macbeth, has died after a short illness at the age '
'of …'),
("First transgender Marvel superhero coming 'very soon'",
'Film',
'The first transgender character in a Marvel movie will probably appear in a '
'film released next year. Speaking at an event at the New York Film Academy '
'…'),
('The six-pack can wait: how to set fitness goals you will actually keep',
'Life-and-Style',
'Summary: Most of us have, at some point in our lives, looked in the mirror '
'and decided we need a radical image overhaul – especially in January. Then, '
'when we …'),
('Gold from Highlands mine to be made into Scottish jewellery',
'UK-News',
'A small goldmine in the Highlands plans to start producing gold in '
'commercial quantities for the first time after repeated delays. The mine at '
'Cononis …'),
('Tell us about your mixed-sex civil partnership plans',
'UK-News',
'The first mixed-sex couples have started to become civil partners in the '
'UK, following a landmark legal battle won by Rebecca Steinfeld and Charles '
'Ke …'),
('England ready for tortoise and hare race in second Test at Newlands',
'Sport',
'As Harold Macmillan is supposed to have explained, there are times when the '
'best‑laid plans disappear like melting snow in springtime and a whole new '
'…'),
("All Federico Fellini's films – ranked!",
'Film',
'20. The Voice of the Moon (1990) A gentle, episodic Fellini, with Roberto '
'Benigni playing Ivo, a madcap character who travels far and wide across the '
'…'),
('Isle of Wight’s rattling, rolling, charming ex-tube trains face end of the '
'line',
'UK-News',
'The train trip from Ryde Pier Head to Shanklin on the Isle of Wight in '
'carriages built 80 years ago for the small tunnels of certain London '
'Undergroun …'),
('Meaty by Samantha Irby review – scatological essays',
'Books',
'To call Samantha Irby’s book scatological would be an understatement. This '
'is a book about assholes – yes, the kind who cheats on you, or never calls, '
'…'),
("Call for more diverse Lake District sparks row over area's future",
'UK-News',
'The head of the Lake District national park authority (LDNPA) has been '
'accused of using the issue of diversity to push through commercial '
'development …'),
("Sharon Choi: how we fell for Bong Joon-ho's translator",
'Film',
'Just when you thought Bong Joon-ho – the affable maestro of Korean cinema '
'and now, with his class-conscious Cannes winner Parasite, champion of the '
'pe …'),
('Whitehall reforms may lead to discrimination, says union',
'Politics',
'Boris Johnson’s “seismic” Whitehall reforms, including regular exams for '
'senior civil servants, could lead to discrimination against staff on the '
'grou …'),
('The most exciting movies of 2020 – biopics',
'Film',
'Respect Having wiped away her Catstoddler snot, Jennifer Hudson gives her '
'pipes a wider airing in this Aretha Franklin biopic which – unlike other '
'mov …'),
('Tune-free pop and the new Katie Hopkins: our 2020 celebrity predictions',
'Life-and-Style',
'There are two ways to spend New Year’s Eve, as best as I can tell: you '
'either dirty the floor of a house party and spend the smallest of the small '
'hou …')]
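The category expression in the answer above relies on str.split with a maximum of one split. A minimal sketch of just that step, using the two metadata texts from the question's HTML as sample input (the variable names are illustrative):

```python
# Text of the two <p class="metadata"> elements from the sample article.
date_text = "Wed 1 Jan 2020 00:01 GMT"
category_text = "Category: UK-News"

# split(' ', 1) cuts at the first space only, so index 1 holds the rest.
category = category_text.split(' ', 1)[1]
print(category)  # UK-News

# The same call on the date line shows why the answer picks the second
# <p> (index 1) rather than the first one.
print(date_text.split(' ', 1)[1])  # 1 Jan 2020 00:01 GMT

# str.partition is an alternative that names the separator explicitly.
label, _, value = category_text.partition(': ')
print(label, value)  # Category UK-News
```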
Great question. Covers the basic common issues of finding selective text that many come across.
We can use BeautifulSoup's own find method to get selective text, as below:
import requests
from bs4 import BeautifulSoup
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# 1. To get PARENT TEXT only and ignore CHILD TEXT
title = soup.find('h5')
titletextonly = title.find_all(text=True, recursive=False) # will give only parent text
#fulltitletext = title.find_all(text=True, recursive=True) # will give all text under it including child text
titletext = "".join(titletextonly)
##################################################
# 2. Category
p_elements = soup.find_all('p', class_='metadata')
for p_element in p_elements:
    p_with_span_element = p_element.find('span')
    if p_with_span_element is not None: # the date line has no <span>
        span_text = p_with_span_element.text # UK-News
##################################################
# 3. Snippet
p_snippet_text = soup.find('p', class_='snippet').text

BeautifulSoup, trying to extract text from anchor tags that contain author names

I am trying to scrape some data from this books site. I need to extract the title and the author(s). I was able to extract the titles without much trouble. However, I am having trouble extracting the authors when there is more than one, since they appear on the same line and belong to separate anchor tags within an h4 header.
<h4>
"5
. "
The Elements of Style
" by "
William Strunk, Jr
", "
E. B. White
</h4>
This is what I tried:
titles = []
book_container = soup.find_all('li', class_='item pb-3 pt-3 border-bottom')
for container in book_container:
    # title
    title = container.h4.a.text
    titles.append(title)
    # author(s)
    author_s = container.h4.find_all('a')
    print('### SECOND FOR LOOP ###')
    for a in author_s:
        if a['href'].startswith('/authors/'):
            print(a.text)
I'd like to have two authors in a tuple.
You can extract all <a> links under <h4> (the tag that holds the title and authors). The first <a> tag is the title; the remaining <a> tags are the authors:
import requests
from bs4 import BeautifulSoup
url = 'https://thegreatestbooks.org/the-greatest-nonfiction-since/1900'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for item in soup.select('h4:has(>a)'):
    elements = [i.get_text(strip=True) for i in item.select('a')]
    title = elements[0]
    authors = elements[1:]
    print('{:<40} {}'.format(title, authors))
Prints:
The Diary of a Young Girl ['Anne Frank']
The Autobiography of Malcolm X ['Alex Haley']
Silent Spring ['Rachel Carson']
In Cold Blood ['Truman Capote']
The Elements of Style ['William Strunk, Jr', 'E. B. White']
The Double Helix: A Personal Account of the Discovery of the Structure of DNA ['James D. Watson']
Relativity ['Albert Einstein']
Look Homeward, Angel ['Thomas Wolfe']
Homage to Catalonia ['George Orwell']
Speak, Memory ['Vladimir Nabokov']
The General Theory of Employment, Interest and Money ['John Maynard Keynes']
The Second World War ['Winston Churchill']
The Education of Henry Adams ['Henry Adams']
Out of Africa ['Isak Dinesen']
The Structure of Scientific Revolutions ['Thomas Kuhn']
Dispatches ['Michael Herr']
The Gulag Archipelago ['Aleksandr Solzhenitsyn']
I Know Why the Caged Bird Sings ['Maya Angelou']
The Civil War ['Shelby Foote']
If This Is a Man ['Primo Levi']
Collected Essays of George Orwell ['George Orwell']
The Electric Kool-Aid Acid Test ['Tom Wolfe']
Civilization and Its Discontents ['Sigmund Freud']
The Death and Life of Great American Cities ['Jane Jacobs']
Selected Essays of T. S. Eliot ['T. S. Eliot']
A Room of One's Own ['Virginia Woolf']
The Right Stuff ['Tom Wolfe']
The Road to Serfdom ['Friedrich von Hayek']
R. E. Lee ['Douglas Southall Freeman']
The Varieties of Religious Experience ['Will James']
The Liberal Imagination ['Lionel Trilling']
Angela's Ashes: A Memoir ['Frank McCourt']
The Second Sex ['Simone de Beauvoir']
Mere Christianity ['C. S. Lewis']
Moveable Feast ['Ernest Hemingway']
The Autobiography of Alice B. Toklas ['Gertrude Stein']
The Origins of Totalitarianism ['Hannah Arendt']
Black Lamb and Grey Falcon ['Rebecca West']
Orthodoxy ['G. K. Chesterton']
Philosophical Investigations ['Ludwig Wittgenstein']
Night ['Elie Wiesel']
The Affluent Society ['John Kenneth Galbraith']
Mythology ['Edith Hamilton']
The Open Society ['Karl Popper']
The Color of Water: A Black Man's Tribute to His White Mother ['James McBride']
The Seven Storey Mountain ['Thomas Merton']
Hiroshima ['John Hersey']
Let Us Now Praise Famous Men ['James Agee']
Pragmatism ['Will James']
The Making of the Atomic Bomb ['Richard Rhodes']
This might not be the most pythonic way, but it's a workaround.
newlist = []
for a in author_s:
    if a['href'].startswith('/authors/'):
        if len(author_s)>2:
            newlist.append(a.text)
            print(tuple(newlist))
        else:
            print(a.text)
I'm using the fact that the variable author_s contains a list that we can check for additional names. More than 2 items in the list means more than one author. (Alternatively, you could check for a newline in the printed output.)
You will also notice that the printed output has two tuples. Always take the second one. Entries with one author remain the same. Since this page has no other entries with two authors, I couldn't check for complications.
Output:
[The Elements of Style, William Strunk, Jr, E. B. White]
### SECOND FOR LOOP ###
('William Strunk, Jr',)
('William Strunk, Jr', 'E. B. White')
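To end up with a single tuple per book instead of one tuple per loop step, the grouping can be separated from the scraping. The sketch below works on plain (text, href) pairs such as those read from the <a> tags; the helper name split_title_and_authors and the href values are hypothetical, chosen only to mirror the '/authors/' prefix used in the question:

```python
def split_title_and_authors(links):
    """Return (title, authors_tuple) from a list of (text, href) pairs."""
    title = None
    authors = []
    for text, href in links:
        if href.startswith('/authors/'):
            authors.append(text)   # every author link is collected
        elif title is None:
            title = text           # first non-author link is the title
    return title, tuple(authors)

# Hypothetical hrefs; only the '/authors/' prefix matters here.
links = [
    ("The Elements of Style", "/items/hypothetical-book"),
    ("William Strunk, Jr", "/authors/hypothetical-1"),
    ("E. B. White", "/authors/hypothetical-2"),
]

result = split_title_and_authors(links)
print(result)
# ('The Elements of Style', ('William Strunk, Jr', 'E. B. White'))
```

Building the tuple once after the loop avoids the intermediate one-element tuple seen in the output above.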
