Parsing HTML with Python - python
I'm a Python beginner and I'm stuck on this, so I'd appreciate some help.
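import urllib.request
from bs4 import BeautifulSoup

def getLinkinfo(endpoint):
    # Download the page and decode the bytes to text
    response = urllib.request.urlopen(endpoint)
    link_info = response.read().decode()
    return link_info

text = getLinkinfo('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
soup1 = BeautifulSoup(text, 'html.parser')
# Match every <div> whose CSS class is "article"
k = soup1.find_all('div', class_='article')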
This already isolates the article blocks I need to work with. One of the resulting elements looks like this:
<div class="article"><h5>1. Let's resolve to reconnect, says Welby in new year message</h5>
<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>
<p class="metadata">Category: <span>UK-News</span></p>
<p class="snippet">The archbishop of Canterbury will urge people to make personal connections with others in 2020 to create a new unity in a divided society. In his new …</p></div>
My question is how I can get:
(1) the 'Title' inside the <h5>/<a> tags
(2) the 'Category', which is inside a <p class="metadata"> (there are two <p class="metadata"> elements; the one with the time is not needed)
(3) the 'Snippet', which is inside <p class="snippet">
Thanks in advance for the help. I feel that if I know how to deal with this example, I can process a lot more.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint as pp


def main(url):
    r = urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')
    goal = [(x.a.text, x.select("p")[1].text.split(' ', 1)[1], x.select_one('p.snippet').text)
            for x in soup.select('.article')]
    pp(goal)


main('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
Output:
[("Let's resolve to reconnect, says Welby in new year message",
'UK-News',
'The archbishop of Canterbury will urge people to make personal connections '
'with others in 2020 to create a new unity in a divided society. In his new '
'…'),
("Be honest. You're not going to read all those books on your holiday, are "
'you?',
'Books',
'Every year, about this time, my Instagram feed fills up with pictures of '
'books. They’re piled somewhere between five and ten inches high, sometimes '
'st …'),
("Mariah Carey's Twitter account hacked on New Year's Eve",
'Music',
'Mariah Carey’s Twitter account appeared to have been hacked late Tuesday '
'afternoon, sharing numerous racist slurs and comments with the singer’s '
'21.4 …'),
('The joy audit: how to have more fun in 2020',
'Life-and-Style',
'The last time I felt joy was at an event that would be many people’s vision '
'of hell: a drunken Taylor Swift club-night singalong in the early hours of '
'…'),
('Providence Lost by Paul Lay review – the rise and fall of Oliver Cromwell’s '
'Protectorate',
'Books',
'The only public execution of a British head of state occurred 371 years ago '
'outside the Banqueting House in Whitehall on 30 January 1649. It was a rad '
'…'),
('Zero-carbon electricity outstrips fossil fuels in Britain across 2019',
'Business',
'Summary: Zero-carbon energy became Britain’s largest electricity source in '
'2019, delivering nearly half the country’s electrical power and for the '
'first time o …'),
('The final sprint: will any of the Democratic candidates excite voters?',
'US-News',
'Democrats overwhelmingly agree that their top priority in 2020 is to remove '
'Donald Trump from office. But which of the many Democrats running for pres '
'…'),
('War epics, airmen and young Sopranos: essential films for 2020',
'Film',
'1917 An epic of Lean-ian proportions is delivered in this spectacular from '
'director and co-writer Sam Mendes, who has developed a real-life story of h '
'…'),
("Stashing your cash: the beginner's guide to saving",
'Life-and-Style',
'Much like going for a run or eating your greens, saving your cash offers '
'long-term benefits, but is not always appealing. And, let’s face it, there '
'ar …'),
("'I'm on the hunt for humour and hope': what will authors be reading in "
'2020?',
'Books',
'Matt Haig I have been very dark and gloomy with my reading habits this '
'year, perhaps in tune with the social mood. Like a pig sniffing for '
'truffles, I …'),
('Twenty athletes set to light up the Tokyo 2020 Olympics',
'Sport',
'Dina Asher-Smith Great Britain Athletics, 100m, 200m, 4x100m Seb Coe, who '
'knows a thing or two about winning Olympic titles, believes Asher-Smith '
'will …'),
('The most exciting movies of 2020 – horror',
'Film',
'The Grudge A belated English language reboot of Japanese classic Ju-On: The '
'Grudge (2002), this stars Andrea Riseborough and Demián Bichir as detectiv '
'…'),
('Diary of a Murderer by Kim Young-ha review – dark stories from South Korea',
'Books',
'Given that loss of memory has become a familiar device in fiction, and the '
'psychopath such a popular character archetype, we shouldn’t be too surprise '
'…'),
('US election, Brexit and China to sway the markets in 2020',
'Business',
'After profiting from strong markets in 2019, investors are expecting 2020 '
'to bring further rising asset prices and lively merger activity. But '
'Brexit, …'),
('TS Eliot’s intimate letters to confidante unveiled after 60 years',
'Books',
'A collection of more than 1,000 letters from the Nobel laureate TS Eliot to '
'his confidante and muse Emily Hale is unveiled this week, after having bee '
'…'),
('Clive Lewis calls for unity among Labour leadership hopefuls',
'Politics',
'Summary: The Labour leadership hopeful Clive Lewis has called for unity '
'among would-be candidates to succeed Jeremy Corbyn as they confront the '
'“cliff face” of …'),
("Visa applications: Home Office refuses to reveal 'high risk' countries",
'UK-News',
'Summary: Campaign groups have criticised the Home Office after it refused '
'to release details of which countries are deemed a “risk” in an algorithm '
'that filter …'),
('Victims of NYE Surrey road crash were BA cabin crew',
'UK-News',
'At least seven people have been killed across the UK in road traffic '
'collisions over the new year period. The deaths included three British '
'Airways ca …'),
('Man held on suspicion of double murder after bodies found in house',
'UK-News',
'Police have arrested a man on suspicion of murdering two people at a house '
'in the village of Duffield in Derbyshire. The murder investigation was laun '
'…'),
("Great expectations: 'The quest for perfection has cannibalised my identity'",
'Life-and-Style',
'“You need to practice self-compassion,” my psychologist says to me. This is '
'our sixth session and as per usual he is struggling to find a phrase, a po '
'…'),
('Anti-Islamic slogans spray-painted near mosque in Brixton',
'UK-News',
'Anti-Islamic slogans have been painted on a building close to a mosque and '
'cultural centre in south London, the Metropolitan police have said. Officer '
'…'),
('Michael van Gerwen 3-7 Peter Wright: PDC world darts championship final – '
'as it happened',
'Sport',
'Summary: That’s it for tonight’s blog, so I’ll leave you with a report from '
'Ally Pally. Thanks for your company, goodnight! There’s so much affection '
'for Peter …'),
('Greggs launches meatless steak bake to beef up its vegan range',
'Business',
'Greggs, the UK’s largest bakery chain, will end speculation about its hotly '
'anticipated new vegan snack by launching a meat-free version of its popula '
'…'),
('Woodford folk festival review – a much-needed moment of positivity and '
'reprieve',
'Music',
'If Woodford folk festival was in mourning this year, you wouldn’t have '
'known it. The death in May of festival elder and decade-long patron Bob '
'Hawke c …'),
('Household haze: how to reduce smoke in your home without an air purifier',
'Life-and-Style',
'Summary: On 1 January, Canberra experienced its worst air quality on '
'record. Smoke from Australia’s devastating bushfires has now blown as far '
'as Queenstown, N …'),
("Sadiq Khan pledges free London travel for disabled people's carers",
'Politics',
'Sadiq Khan has kickstarted his bid for a second term as London mayor by '
'pledging free travel on the city’s transport for anyone accompanying a '
'disable …'),
('‘Everyone thought I was mad’: how to make a life-changing decision – and '
'stick to it',
'Life-and-Style',
'Summary: When I was 26, I broke up with a long-term partner, got an '
'ill-advised facial piercing and changed careers – all in the space of a '
'month. What I learn …'),
('In the Line of Duty review – race-against-time cop thriller',
'Film',
'There’s a straight-to-video feel to this cop thriller, directed by action '
'veteran Steven C Miller, written by Jeremy Drysdale (who scripted the indie '
'…'),
("Manchester poet Tony Walsh performs tribute to children's hospital",
'UK-News',
'The performance poet Tony Walsh, whose ode to Manchester became a worldwide '
'hit after the Arena bomb, has written a moving tribute to Royal Manchester '
'…'),
('Can your phone keep you fit? Our writers try 10 big fitness apps – from '
'weightlifting to pilates',
'Life-and-Style',
'Centr Price £15.49 a month. What is it? A full-service experience from the '
'Hollywood star Chris Hemsworth: not just workouts, but a complete meal plan '
'…'),
('We Are from Jazz review – zany Russian musical comedy',
'Film',
'Only in a Woody Allen film will you hear quite as much Dixieland jazz as '
'this. Here is We Are from Jazz, or We Are Jazzmen, the zany jazz comedy '
'music …'),
('The Other Half of Augusta Hope by Joanna Glen review – high emotions',
'Books',
'Summary: Who is Augusta Hope’s “other half”? In Glen’s debut, shortlisted '
'for the Costa first novel prize, at first it’s Augusta’s twin sister, '
'although the di …'),
('Tara Erraught/James Baillieu review – quietly intense and simply exquisite',
'Music',
'Irish mezzo Tara Erraught’s latest Wigmore recital with her pianist James '
'Baillieu took place between Christmas and New Year, though her beautifully '
'c …'),
('Talking Horses: picking the five best races of the last decade',
'Sport',
'You might take the view that the end of the decade is actually a year away, '
'but at least it’s 10 years since I last did something like this. I’ve limi '
'…'),
('The Reality Bubble by Ziya Tong review – blind spots and hidden truths',
'Books',
'Publishing functions very much like the fashion world. Like a suddenly '
'ubiquitous cut of hem or style of trainer, a book comes along every few '
'seasons …'),
('Alleged drink-driver arrested on motorway had no front tyres',
'UK-News',
'An alleged drink-driver who was arrested on the motorway on New Year’s Day '
'had been driving without front tyres. The motorist was said to be nearly si '
'…'),
('MC Beaton, multimillion-selling author of Agatha Raisin novels, dies aged '
'83',
'Books',
'MC Beaton, the prolific creator of the much loved fictional detectives '
'Agatha Raisin and Hamish Macbeth, has died after a short illness at the age '
'of …'),
("First transgender Marvel superhero coming 'very soon'",
'Film',
'The first transgender character in a Marvel movie will probably appear in a '
'film released next year. Speaking at an event at the New York Film Academy '
'…'),
('The six-pack can wait: how to set fitness goals you will actually keep',
'Life-and-Style',
'Summary: Most of us have, at some point in our lives, looked in the mirror '
'and decided we need a radical image overhaul – especially in January. Then, '
'when we …'),
('Gold from Highlands mine to be made into Scottish jewellery',
'UK-News',
'A small goldmine in the Highlands plans to start producing gold in '
'commercial quantities for the first time after repeated delays. The mine at '
'Cononis …'),
('Tell us about your mixed-sex civil partnership plans',
'UK-News',
'The first mixed-sex couples have started to become civil partners in the '
'UK, following a landmark legal battle won by Rebecca Steinfeld and Charles '
'Ke …'),
('England ready for tortoise and hare race in second Test at Newlands',
'Sport',
'As Harold Macmillan is supposed to have explained, there are times when the '
'best‑laid plans disappear like melting snow in springtime and a whole new '
'…'),
("All Federico Fellini's films – ranked!",
'Film',
'20. The Voice of the Moon (1990) A gentle, episodic Fellini, with Roberto '
'Benigni playing Ivo, a madcap character who travels far and wide across the '
'…'),
('Isle of Wight’s rattling, rolling, charming ex-tube trains face end of the '
'line',
'UK-News',
'The train trip from Ryde Pier Head to Shanklin on the Isle of Wight in '
'carriages built 80 years ago for the small tunnels of certain London '
'Undergroun …'),
('Meaty by Samantha Irby review – scatological essays',
'Books',
'To call Samantha Irby’s book scatological would be an understatement. This '
'is a book about assholes – yes, the kind who cheats on you, or never calls, '
'…'),
("Call for more diverse Lake District sparks row over area's future",
'UK-News',
'The head of the Lake District national park authority (LDNPA) has been '
'accused of using the issue of diversity to push through commercial '
'development …'),
("Sharon Choi: how we fell for Bong Joon-ho's translator",
'Film',
'Just when you thought Bong Joon-ho – the affable maestro of Korean cinema '
'and now, with his class-conscious Cannes winner Parasite, champion of the '
'pe …'),
('Whitehall reforms may lead to discrimination, says union',
'Politics',
'Boris Johnson’s “seismic” Whitehall reforms, including regular exams for '
'senior civil servants, could lead to discrimination against staff on the '
'grou …'),
('The most exciting movies of 2020 – biopics',
'Film',
'Respect Having wiped away her Catstoddler snot, Jennifer Hudson gives her '
'pipes a wider airing in this Aretha Franklin biopic which – unlike other '
'mov …'),
('Tune-free pop and the new Katie Hopkins: our 2020 celebrity predictions',
'Life-and-Style',
'There are two ways to spend New Year’s Eve, as best as I can tell: you '
'either dirty the floor of a house party and spend the smallest of the small '
'hou …')]
Great question. It covers a common problem: pulling selected pieces of text out of an element.
We can use BeautifulSoup's own find and find_all methods to get the text, as below:
import requests
from bs4 import BeautifulSoup
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# 1. Title: get the PARENT TEXT only and ignore CHILD TEXT
title = soup.find('h5')
titletextonly = title.find_all(text=True, recursive=False)  # text directly under <h5> only
# fulltitletext = title.find_all(text=True, recursive=True)  # all text under <h5>, including child tags
titletext = "".join(titletextonly)
##################################################
# 2. Category
p_elements = soup.find_all('p', class_='metadata')
for p_element in p_elements:
    p_with_span_element = p_element.find('span')
    if p_with_span_element is not None:  # the date paragraph has no <span>
        span_text = p_with_span_element.text  # UK-News
##################################################
# 3. Snippet
p_snippet_text = soup.find('p', class_='snippet').text
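For anyone who cannot install BeautifulSoup, here is a rough dependency-free sketch of the same three extractions using the standard library's html.parser instead. It is an alternative technique, not what either answer above uses. The class name ArticleParser and the SAMPLE string are made up for illustration; the sample is a shortened copy of the block in the question, and the <a href="#"> inside the <h5> is an assumption based on the question's wording. It also assumes there are no nested <div> elements inside an article.

```python
from html.parser import HTMLParser

# Shortened copy of the block from the question; the <a href="#"> is assumed.
SAMPLE = (
    '<div class="article"><h5>1. <a href="#">Let\'s resolve to reconnect</a></h5>'
    '<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>'
    '<p class="metadata">Category: <span>UK-News</span></p>'
    '<p class="snippet">The archbishop of Canterbury will urge people ...</p></div>'
)


class ArticleParser(HTMLParser):
    """Collect title, category and snippet for each <div class="article">."""

    def __init__(self):
        super().__init__()
        self.articles = []        # finished records
        self.current = None       # record currently being filled
        self.field = None         # field that incoming text belongs to
        self.field_tag = None     # tag whose end closes that field
        self.in_heading = False   # inside <h5>
        self.in_metadata = False  # inside <p class="metadata">

    def _open(self, field, tag):
        self.field, self.field_tag = field, tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "article" in classes:
            self.current = {"title": "", "category": "", "snippet": ""}
            return
        if self.current is None:
            return  # ignore everything outside an article block
        if tag == "h5":
            self.in_heading = True
        elif tag == "a" and self.in_heading:
            self._open("title", "a")
        elif tag == "p" and "metadata" in classes:
            self.in_metadata = True
        elif tag == "span" and self.in_metadata:
            self._open("category", "span")
        elif tag == "p" and "snippet" in classes:
            self._open("snippet", "p")

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == self.field_tag:
            self.field = self.field_tag = None
        if tag == "h5":
            self.in_heading = False
        elif tag == "p":
            self.in_metadata = False
        elif tag == "div":
            self.articles.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.current is not None and self.field:
            self.current[self.field] += data


parser = ArticleParser()
parser.feed(SAMPLE)
print(parser.articles)
```

The date paragraph is skipped automatically because it contains no <span>, which is the same idea as checking for the span in the category loop.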
Related
Joining output of iterative list in a while loop [duplicate]
I have a text file that I'm extracting text from using its punctuation and indentation patterns. The output should be a list of lists combining two lists, company_name and description:
[[company,description],[company,description]]
To do that I'm running a while loop nested within a for loop to extract the description for each company. Here's my code:
for line in file:
    if not re.search(r" ", line, re.MULTILINE):
        name = line.split(',', 1)[0]
        companies.append(name)
        print(companies)
        companies = []
    while re.search(r" ", line, re.MULTILINE):
        desc.append(line)
        print(desc)
        desc = []
        break
Sample from text file:
XYZ Group, a nearly nine-year-old, Copenhagen-based company that has built a dual-purpose platform, providing both accountancy software and a marketplace for small and medium businesses to find accountants, has landed $73 million in growth funding from a single investor, Lugard Road Capital. TechCrunch has more here.
Black Lake, a nearly five-year-old, China-based software platform for factory workers to log their daily tasks and managers to oversee the plant floor, recently raised $77 million in funding, including from Singapore’s sovereign wealth fund Temasek, which led the round, as well as China Renaissance and Lightspeed Venture Partners. The outfit has now raised more than $100 million altogether, including from from GGV...
That's the output:
['XYZ Group']
[' company that has built a dual-purpose platform, providing both']
[' accountancy software and a marketplace for small and medium']
[' businesses to find accountants, has landed 73 million in growth funding from a single investor,']
[' Lugard Road Capital TechCrunch has more']
[' here']
['Black Lake']
[' platform for factory workers to log their daily tasks and managers']
[' to oversee the plant floor, recently raised 77 million in funding,']
[' including from Singapore’s sovereign wealth fund Temasek,']
[' which led the round, as well as China']
[' Renaissance and Lightspeed Venture']
[' Partners The outfit has now raised more than 100']
[' million altogether, including from from GGV']
[' Capital, Bertelsmann Asia Investments,']
[' GSR Ventures, ZhenFund']
[' and others TechCrunch has more']
[' here']
The goal is to join the output of the desc list under each company name into one list.
Update
I put desc = [] outside of the while loop and I'm getting this:
['XYZ Group']
[' company that has built a dual-purpose platform, providing both']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,', ' Lugard Road Capital TechCrunch has more']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,', ' Lugard Road Capital TechCrunch has more', ' here']
I only need the last iteration though.
Assuming the text always follows a <company_name>, <description> pattern, a very simple approach is to use .split(). Simply split on the first , by limiting the number of splits with maxsplit=1 to get the name and full_description, which can be prettified afterwards:
text = "XYZ Group, a nearly nine-year-old, Copenhagen-based company that has built a dual-purpose platform, providing both accountancy software and a marketplace for small and medium businesses to find accountants, has landed $73 million in growth funding from a single investor, Lugard Road Capital. TechCrunch has more here."
name, full_description = text.split(',', 1)
description = [s.strip() for s in full_description.split(',')]
output = [name, description]
print(output)
Output:
['XYZ Group', ['a nearly nine-year-old', 'Copenhagen-based company that has built a dual-purpose platform', 'providing both accountancy software and a marketplace for small and medium businesses to find accountants', 'has landed $73 million in growth funding from a single investor', 'Lugard Road Capital. TechCrunch has more here.']]
Alternatively, you could also use .split(" ") to split on the multiple occurring spaces and ignore any commas.
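To get one [company, description] pair per entry straight from the file, a possible sketch is to build each record once and keep appending continuation lines to it, rather than printing and resetting a list inside the loop. This assumes, based on the printed output in the question, that an entry starts on an unindented line and that its wrapped lines are indented. The function name parse_companies and the sample lines are made up for illustration.

```python
def parse_companies(lines):
    """Return [[company, description], ...] from indentation-wrapped lines."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # Unindented line: a new entry starts here.
            name, _, rest = line.partition(",")
            records.append([name.strip(), rest.strip()])
        elif records:
            # Indented line: continuation of the current description.
            records[-1][1] = (records[-1][1] + " " + line.strip()).strip()
    return records


# Hypothetical, shortened sample in the assumed layout.
sample = [
    "XYZ Group, a nearly nine-year-old, Copenhagen-based\n",
    "    company that has built a dual-purpose platform.\n",
    "Black Lake, a nearly five-year-old, China-based software\n",
    "    platform for factory workers.\n",
]
print(parse_companies(sample))
```

With a real file you could pass the open file object directly, since iterating over it yields lines.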
How can I use regular expressions to extract all words with at least one digit in text with Python
I am new to regular expressions and I have a text as follows. How can I use the RegEx to extract all words with at least one digit in it? Really appreciate it. text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the trustees finally suspended operations, the majority of students had joined the military, President Joseph Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops. Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army. Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865 and the Union Army left campus, Thomas Humes was elected university president. The university reopened in 1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged campus. 
A petition to the federal war department for monetary compensation for campus damage done by the Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty throughout the war. A Senate committee which considered the bill for damages also noted that East Tennessee University was “particularly deserving of the favorable consideration of Congress” because it was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873, President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant signed the new bill and the trustees accepted the funds the same day with an agreement to release the government from all claims. (More than a century and a half later, a buried Union trench was located in 2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.) '''
You could use this pattern: r'\w*\d+\w*'
How does it work:
\w* matches 0 or more word characters (letters, digits or underscore, so no spaces or punctuation)
\d+ matches 1 or more digits
\w* matches 0 or more word characters again
Using re.findall:
re.findall(r'\w*\d+\w*', your_text)
we get:
['1861', '1862', '1862', '1863', '29', '1863', '1865', '1866', '1873', '18', '500', '22', '1874', '2019']
Note that $18,500 comes back as '18' and '500', because the comma is not a word character.
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S matches any character except whitespace, \d matches any digit, + means one or more occurrences, and * means zero or more occurrences.
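To see how the two patterns differ, here is a small comparison on a made-up sentence (the sample string is not from the question's text). The \w version stops at punctuation, while the \S version keeps everything up to the next whitespace.

```python
import re

# Hypothetical sample sentence for illustration.
sample = "In 1873 the bill granted $18,500 for room B12."

word_matches = re.findall(r"\w*\d+\w*", sample)
token_matches = re.findall(r"\S*\d+\S*", sample)

print(word_matches)   # ['1873', '18', '500', 'B12']
print(token_matches)  # ['1873', '$18,500', 'B12.']
```

Which one is preferable depends on whether an amount like $18,500 should count as one word and whether trailing punctuation matters.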
Remove breaks in input strings
This was scraped with Selenium:
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
I have to pass this to an algorithm, but as you can see, although every item is properly enclosed in quotes, the first string breaks onto a new line after "Conversation with", and this is affecting my input. I tried removing whitespace, but that didn't work. Any help will be highly appreciated.
You can't type a string on multiple lines unless you're using triple quotation marks. The first item (string) in your list is
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru']
You should keep the string on one line as follows.
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru']
Or you could use triple quotes, as I mentioned.
['''The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru''']
If you'd like to make your list readable and neat as well, I'd recommend breaking after the commas in the list, for example:
['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru',
 'Mindfulness Self-Care for Students of Color',
 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements',
 'Interfaith Discussions',
 'Mind the Gap',
 'First-Year Arts Board Open House',
 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors',
 'Drawing Plants & Flowers - Sold Out']
You could try
string_list = ['The Quest for Ethical Artificial Intelligence: A Conversation with Timnit Gebru', 'Mindfulness Self-Care for Students of Color', 'GPA: The Geopolitical landscape of the Olympic and Paralympic Movements', 'Interfaith Discussions', 'Mind the Gap', 'First-Year Arts Board Open House', 'Self-Care Night with CARE and BGLTQ+ Specialty Proctors', 'Drawing Plants & Flowers - Sold Out']
# Replace embedded newlines with spaces (the loop variable is s, since str would shadow the built-in)
string_list = [s.replace("\n", " ") for s in string_list]
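If the scraped strings may also contain carriage returns, tabs or runs of spaces, a slightly more general sketch is to split on any whitespace and re-join with single spaces. This assumes the break after "Conversation with" is a real newline character inside the string; the second item below is a made-up example of messy whitespace.

```python
# The first item mimics the broken title; the second is a hypothetical messy string.
titles = [
    "The Quest for Ethical Artificial Intelligence: A Conversation with\nTimnit Gebru",
    "  Mind the   Gap\r\n",
]

# str.split() with no arguments splits on any run of whitespace,
# so joining with " " collapses newlines, tabs and repeated spaces.
cleaned = [" ".join(title.split()) for title in titles]
print(cleaned)
```

Unlike replace("\n", " "), this also trims leading and trailing whitespace.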
Using function objects to speed up the training
from datasets import load_dataset  # Huggingface
from transformers import BertTokenizer  # Huggingface


def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset

    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)

    Returns:
        dict: dataset once tokenized
    """
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]
    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}


if __name__ == "__main__":
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)

I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset created by that function will be reused for ML training. However, dragging that dictionary around during training will slow the training down a lot.
Be aware that document is similar to [['eleven politicians from 7 parties made comments in letter to a newspaper .', "said dpp alison saunders had ` damaged public confidence ' in justice .", 'ms saunders ruled lord janner unfit to stand trial over child abuse claims .', 'the cps has pursued at least 19 suspected paedophiles with dementia .'], ['an increasing number of surveys claim to reveal what makes us happiest .', 'but are these generic lists really of any use to us ?', 'janet street-porter makes her own list - of things making her unhappy !'], ["author of ` into the wild ' spoke to five rape victims in missoula , montana .", "` missoula : rape and the justice system in a college town ' was released april 21 .", "three of five victims profiled in the book sat down with abc 's nightline wednesday night .", 'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football ' 'players .', "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .", 'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable ' 'cause .', 'mr krakauer wrote book after realizing close friend was a rape victim .'], ['tesco announced a record annual loss of £ 6.38 billion yesterday .', 'drop in sales , one-off costs and pensions blamed for financial loss .', 'supermarket giant now under pressure to close 200 stores nationwide .', 'here , retail industry veterans , plus mail writers , identify what went wrong .'], ..., ['snp leader said alex salmond did not field questions over his family .', "said she was not ` moaning ' but also attacked criticism of women 's looks .", 'she made the remarks in latest programme profiling the main party leaders .', 'ms sturgeon also revealed her tv habits and recent image makeover .', 'she said she relaxed by eating steak and chips on a saturday night .']] So in the dictionary, the keys are just 
strings, but the values are all list of lists of strings. Instead of having value = list of list of strings, I think it is better to create sort of a list of object function instead of having list of lists of strings. It will make the dictionary much lighter. How can I do that? EDIT For me, there is a difference between copying a list of lists of strings and copying a list of objects. Copying an object will simply copy the reference while copying a list of strings will copy everything. So it is much faster to copy the reference instead. This is the point of this question.
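On the copying point, it may help to check what Python actually does with a list of lists. The sketch below uses a tiny made-up documents list. Assigning a list or passing it to a function only passes a reference, a shallow copy duplicates just the outer list, and only copy.deepcopy duplicates the strings' containers as well.

```python
import copy

# Hypothetical stand-in for one split of the dataset.
documents = [["sentence one .", "sentence two ."], ["sentence three ."]]


def received_same_object(docs):
    # Passing a list to a function does not copy it.
    return docs is documents


alias = documents                # another name for the same list
shallow = list(documents)        # new outer list, same inner lists
deep = copy.deepcopy(documents)  # new outer list and new inner lists

print(received_same_object(documents))  # True
print(alias is documents)               # True
print(shallow[0] is documents[0])       # True
print(deep[0] is documents[0])          # False
```

So a dictionary whose values are lists of lists is not copied when it is handed to a training function unless the code copies it explicitly.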
Python, keep only sentence from html and remove all non-alphabet, number
I got html from the website and change it to txt. However, how to clean the txt so that i keep only the sentences in the txt. for example: I want to remove all irrelevent information such as 1990...himself,1987, the 59th .... keep the sentences: tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office. after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more. and so on. 1990 ... himself 1987 the 59th annual academy awards (tv special) jack / maverick / vincent lauria (uncredited) related videos none none none see all 35 videos » #csm.csm_widget /> reality tv the office late night sitcoms music rappers action religion top paid how much money does tom cruise make? (salary & net worth) tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office. history thomas cruise mapother iv a.k.a tom cruise was born in syracuse, new york to mother mary lee and father thomas cruise mapother iii. cruise’s mother was a special education teacher and father was an electrical engineer. tom cruise is basically of irish, german and english origin. cruise’s family had the male domination of his abusive father whom cruise had once described as the merchant of chaos. 
he was often bullied and beaten by his father and cruise called him a coward. a part of tom cruise’s childhood was spent in canada. however, when cruise was in the sixth grade, his mother left his father and brought cruise and his siblings back to america. acting career acting career of tom cruise started quite early but with a small role in the movie endless love (1981). however, he got his big break as a supporting actor in the movie taps later that year. in 1983, his movies risky business and all the right moves along with top gun in 1986 paved the path for tom cruise as an established actor and a superstar. after this there was no looking back and tom cruise went to star in many super-successful movies like cocktail, rain man, days of thunder, interview with the vampire. then in 1996, he starred as a superspy ethan hunt in the very popular and blockbuster movie which went on to be a series, mission: impossible. that same year he also was seen in the lead role of the movie jerry maguire and won a golden globe for the same. in 1999, his supporting role in the movie magnolia again won him his second golden globe. after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more. net worth tom cruise’s films have gained $7.3 million worldwide as of 2013. however, the net worth of the highest paid actor in hollywood is $270 million and he still gets paychecks from his previous movies. 154 magazine cover photos | none » official sites: facebook | official site | none » alternate names: tomu kurûzu height: 5' 7" (1.7 m) none did you know? personal quote: (1992 quote) i really enjoy talking to other actors and directors. sometimes, if i see their movies, i'll call them up or write them a note saying, "i enjoyed it," or asking, "how did you do that? how did you make that work?". 
i just saw html text is called: text sentence = re.sub(' ', '\n', text) sentence = re.sub('none', '', words) print sentence the result: the sentence is destroyed. ethan hunt / ray ferrier (uncredited) 2006 the late late show with craig ferguson (tv series) himself - episode #2.140 (2006) ... himself (uncredited) 2006 getaway (tv series) himself - seven wonders of the world (2006) ... himself 2006 cmt insider (tv series) himself - episode dated 29 april 2006 (2006) ... himself 2005-2006 corazón de... (tv series) himself - episode dated 19 january 2006 (2006) ... himself - episode dated 15 november 2005 (2005) ... himself -
Try this:
^(\s*?\S*){5}$
The pattern is currently set to select any line that has five words or fewer. You can increase or decrease the number of words by changing the value in {5}.
Demo: https://regex101.com/r/z2qxrx/3
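If you would rather not rely on a regex, the same threshold can be applied in plain Python by counting the words on each line. This is a sketch of a different technique from the pattern above, and it assumes the text has already been split into lines (for example one line per HTML block). The function name keep_sentences and the sample text are made up for illustration.

```python
def keep_sentences(text, min_words=6):
    """Keep only lines with at least min_words whitespace-separated words."""
    kept = []
    for line in text.splitlines():
        if len(line.split()) >= min_words:
            kept.append(line.strip())
    return "\n".join(kept)


# Hypothetical sample built from fragments of the question's text.
sample = (
    "1990 ... himself\n"
    "tom cruise is an american actor who has starred in many blockbuster movies.\n"
    "related videos none none\n"
)
print(keep_sentences(sample))
```

Here min_words=6 drops every line with five words or fewer, which matches the {5} setting; raise it if short credits lines still slip through.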