Split sentences in Python to not exceed a number of characters - python

I have a string that contains sentences. If this string contains more characters than a given number, I'd like to split it into several strings, each with fewer than the maximum number of characters but still containing only full sentences.
I did the below, which seems to run okay, but I'm not sure whether I will run into bugs when putting this in production. Does the below look okay?
from nltk.tokenize import sent_tokenize
my_text = "President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. 
Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. 
But it is critical,” Markel said. President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. 
He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 
20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. But it is critical,” Markel said."
sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)

To better explain my point about the problem with the second if block, expressed in the comments, see the following example.
We want strings with a maximum length of 15 characters, i.e. the 5120 in your code becomes 16 in this example. As you can see, the first 3 items in the list add up to 5 + 6 + 4 = 15, so the first shortened_sentence should consist of the first 3 items in the list. It does not, because the logic of the second if is incorrect.
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijk', 'lmnopqr']
[11, 7]
Compare it with
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(word) for word in sentences])
sentences_split = []
shortened_sentence = ""
for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
sentences_split.append(shortened_sentence)
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijklmno', 'pqr']
[15, 3]
Finally, if you are not sure "if I will experience bugs putting this in production", write tests, a lot of tests. That is what tests are for: to help minimise bugs in production.
Also, note that the second snippet is just a sample implementation; there are other possible implementations.
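As an illustration of the "write tests" advice, here is a minimal sketch of the same idea wrapped in a function, with a few assertions. The name chunk_sentences and the max_len and sep parameters are made up for this example, and keeping an over-long sentence as its own chunk is an assumption about what you want, not something stated in the question.

```python
def chunk_sentences(sentences, max_len, sep=" "):
    """Group whole sentences into chunks of at most max_len characters.

    Hypothetical helper for illustration. Assumption: a single sentence
    longer than max_len is kept whole as its own (over-long) chunk.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        # Build the candidate chunk, adding the separator only between sentences.
        candidate = sentence if not current else current + sep + sentence
        if len(candidate) <= max_len:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk (if any)
            # and start a new one with this sentence.
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# A few tests, including edge cases the original loop does not cover.
assert chunk_sentences(['abcde', 'fghijk', 'lmno', 'pqr'], 15, sep="") == ['abcdefghijklmno', 'pqr']
assert chunk_sentences([], 15) == []
assert chunk_sentences(['x' * 20, 'ab'], 15) == ['x' * 20, 'ab']
```

With nltk, you would pass sent_tokenize(my_text) as the sentences argument. Joining with a space (the default sep) also avoids gluing sentences together as "end.Start".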

Related

Context-Based Paragraph Splitting of an Article or long Text

I want to chunk a long text into paragraphs that are context-based. Right now I'm just splitting the text into sentences, grouping them into chunks of up to 250 words, and calling each chunk a paragraph. That's obviously not a good way to make a paragraph: it's "dumb", information that goes over 250 words gets left out of the chunk, and the result isn't really a paragraph, just all the sentences that fit before the 250 words are filled up. So I want to make the paragraph splitting context-based, so that it's "smart" and each chunk is actually a paragraph.
The code below is what I have now:
import re
newtext = '''
Kendrick Lamar Duckworth is an American rapper, songwriter, and record producer. He is often cited as one of the most influential rappers of his generation. Aside from his solo career, he is also a member of the hip hop supergroup Black Hippy alongside his former Top Dawg Entertainment (TDE) labelmates Ab-Soul, Jay Rock, and Schoolboy Q. Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name K.Dot, releasing a mixtape titled Y.H.N.I.C. (Hub City Threat Minor of the Year) that garnered local attention and led to his signing with indie record label TDE. He began to gain recognition in 2010 after his first retail release, Overly Dedicated. The following year, he independently released his first studio album, Section.80, which included his debut single "HiiiPoWeR". By that time, he had amassed a large online following and collaborated with several prominent rappers. He subsequently secured a record deal with Dr. Dre's Aftermath Entertainment, under the aegis of Interscope Records. Lamar's major-label debut album, Good Kid, M.A.A.D City, was released in 2012, garnering him widespread critical recognition and mainstream success. His third album To Pimp a Butterfly (2015), which incorporated elements of funk, soul, jazz, and spoken word, predominantly centred around the Black-American experience. It became his first number-one album on the US Billboard 200 and was an enormous critical success. His fourth album, Damn (2017), saw continued acclaim, becoming the first non-classical and non-jazz album to be awarded the Pulitzer Prize for Music. It also yielded his first number-one single, "Humble", on the US Billboard Hot 100. Lamar curated the soundtrack to the superhero film Black Panther (2018) and in 2022, released his fifth and last album with TDE, Mr. Morale & the Big Steppers, which received critical acclaim. 
Lamar has certified sales of over 70 million records in the United States alone, and all of his albums have been certified platinum or higher by the Recording Industry Association of America (RIAA). He has received several accolades in his career, including 14 Grammy Awards, two American Music Awards, six Billboard Music Awards, 11 MTV Video Music Awards, a Pulitzer Prize, a Brit Award, and an Academy Award nomination. In 2012, MTV named him the Hottest MC in the Game on their annual list. Time named him one of the 100 most influential people in the world in 2016. In 2015, he received the California State Senate's Generational Icon Award. Three of his studio albums were included on Rolling Stone's 2020 list of the 500 Greatest Albums of All Time. Kendrick Lamar Duckworth was born in Compton, California on June 17, 1987, the son of a couple from Chicago. Although not in a gang himself, he grew up around gang members, with his closest friends being Westside Piru Bloods and his father, Kenny Duckworth, being a Gangster Disciple. His first name was given to him by his mother in honor of singer-songwriter Eddie Kendricks of The Temptations. He grew up on welfare and in Section 8 housing. In 1995, at the age of eight, Lamar witnessed his idols Tupac Shakur and Dr. Dre filming the music video for their hit single "California Love", which proved to be a significant moment in his life. As a child, Lamar attended McNair Elementary and Vanguard Learning Center in the Compton Unified School District. He has admitted to being quiet and shy in school, his mother even confirming he was a "loner" until the age of seven. As a teenager, he graduated from Centennial High School in Compton, where he was a straight-A student. Kendrick Lamar has stated that Tupac Shakur, the Notorious B.I.G., Jay-Z, Nas and Eminem are his top five favorite rappers. Tupac Shakur is his biggest influence, and has influenced his music as well as his day-to-day lifestyle. 
In a 2011 interview with Rolling Stone, Lamar mentioned Mos Def and Snoop Dogg as rappers that he listened to and took influence from during his early years. He also cites now late rapper DMX as an influence: "[DMX] really [got me started] on music," explained Lamar in an interview with Philadelphia's Power 99. "That first album [It's Dark and Hell Is Hot] is classic, [so he had an influence on me]." He has also stated Eazy-E as an influence in a post by Complex saying: "I Wouldn't Be Here Today If It Wasn't for Eazy-E." In a September 2012 interview, Lamar stated rapper Eminem "influenced a lot of my style" and has since credited Eminem for his own aggression, on records such as "Backseat Freestyle". Lamar also gave Lil Wayne's work in Hot Boys credit for influencing his style and praised his longevity. He has said that he also grew up listening to Rakim, Dr. Dre, and Tha Dogg Pound. In January 2013, when asked to name three rappers that have played a role in his style, Lamar said: "It's probably more of a west coast influence. A little bit of Kurupt, [Tupac], with some of the content of Ice Cube." In a November 2013 interview with GQ, when asked "The Four MC's That Made Kendrick Lamar?", he answered Tupac Shakur, Dr. Dre, Snoop Dogg and Mobb Deep, namely Prodigy. Lamar professed to having been influenced by jazz trumpeter Miles Davis and Parliament-Funkadelic during the recording of To Pimp a Butterfly. Lamar has been branded as the "new king of hip hop" numerous times. Forbes said, on Lamar's placement as hip hop's "king", "Kendrick Lamar may or may not be the greatest rapper alive right now. He is certainly in the very short lists of artists in the conversation." Lamar frequently refers to himself as the "greatest rapper alive" and once called himself "The King of New York." On the topic of his music genre, Lamar has said: "You really can't categorize my music, it's human music." Lamar's projects are usually concept albums. 
Critics found Good Kid, M.A.A.D City heavily influenced by West Coast hip hop and 90s gangsta rap. His third studio album, To Pimp a Butterfly, incorporates elements of funk, jazz, soul and spoken word poetry. Called a "radio-friendly but overtly political rapper" by Pitchfork, Lamar has been a branded "master of storytelling" and his lyrics have been described as "katana-blade sharp" and his flow limber and dexterous. Lamar's writing usually includes references to racism, black empowerment and social injustice, being compared to a State of Union address by The Guardian. His writing has also been called "confessional" and controversial. The New York Times has called Lamar's musical style anti-flamboyant, interior and complex and labelled him as a technical rapper. Billboard described his lyricism as "Shakespearean".
'''
#1100 words
regex = r'([A-z][^.!?]*[.!?]*"?)'
for sens in re.findall(regex, newtext):
    newtext = newtext.replace(f'{sens}', f'{sens}<eos>')
sentencess = newtext.split('<eos>')
sentences = sentencess
current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= 250:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        chunks.append(sentence.split(" "))
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
print(chunks) # printing all split "paragraphs"
print(chunks[0]) # printing 1st "paragraph"
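One very rough way to make the split depend on content as well as on the word budget is to start a new paragraph whenever two neighbouring sentences share almost no vocabulary. The sketch below is only that: a crude heuristic using the standard library, not an established method. All names in it (context_paragraphs, max_words, min_similarity) and the 0.1 threshold are assumptions made for this example, and the naive sentence splitter will break on abbreviations such as "Dr." or "M.A.A.D".

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def word_set(sentence):
    # Lowercased set of the words in a sentence.
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    # Share of words two sentences have in common (0.0 to 1.0).
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def context_paragraphs(text, max_words=250, min_similarity=0.1):
    """Group sentences into paragraphs.

    A new paragraph starts when the word budget would be exceeded or when
    the next sentence shares too little vocabulary with the previous one.
    """
    paragraphs = []
    current = []
    current_words = 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if current:
            too_long = current_words + n_words > max_words
            topic_shift = jaccard(word_set(current[-1]), word_set(sentence)) < min_similarity
            if too_long or topic_shift:
                paragraphs.append(" ".join(current))
                current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = ("Cats chase mice. Cats also chase birds. "
          "Stock markets fell today. Markets fell sharply overnight.")
for paragraph in context_paragraphs(sample):
    print(paragraph)
```

On real prose, common words like "the" and "he" inflate the overlap, so removing stop words or comparing sentence embeddings instead of raw word sets would likely be needed before the splits look sensible.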

Why does my program output multiple times?

My Python program works fine; however, it keeps printing the answer at least 5 times over, and I am racking my brain as to why. Any ideas?
Text = """The University of Wisconsin–Milwaukee is a public urban research university in Milwaukee, Wisconsin. It is the largest university in the Milwaukee metropolitan area and a member of the University of Wisconsin System. It is also one of the two doctoral degree-granting public universities and the second largest university in Wisconsin. The university consists of 14 schools and colleges, including the only graduate school of freshwater science in the U.S., the first CEPH accredited dedicated school of public health in Wisconsin, and the state"s only school of architecture. As of the 2015–2016 school year, the University of Wisconsin–Milwaukee had an enrollment of 27,156, with 1,604 faculty members, offering 191 degree programs, including 94 bachelor's, 64 master's and 33 doctorate degrees. The university is classified among "R1: Doctoral Universities – Highest research activity". In 2018, the university had a research expenditure of $55 million. The university's athletic teams are the Panthers. A total of 15 Panther athletic teams compete in NCAA Division I. Panthers have won the James J. McCafferty Trophy as the Horizon League's all-sports champions seven times since 2000. They have earned 133 Horizon League titles and made 40 NCAA tournament appearances as of 2016."""
for punc in "–.,\n":
    Text=Text.replace(punc," ")
Text = Text.lower()
word_list = Text.split()
dict = {}
for word in word_list:
    dict[word] = dict.get(word, 0) + 1
    word_freq = []
    for key, value in sorted(dict.items()):
        if value > 5:
            print(key, value)
You have an indentation issue that leads to the nested for loop. Fix the code into:
Text = """The University of Wisconsin–Milwaukee is a public urban research university in Milwaukee, Wisconsin. It is the largest university in the Milwaukee metropolitan area and a member of the University of Wisconsin System. It is also one of the two doctoral degree-granting public universities and the second largest university in Wisconsin. The university consists of 14 schools and colleges, including the only graduate school of freshwater science in the U.S., the first CEPH accredited dedicated school of public health in Wisconsin, and the state"s only school of architecture. As of the 2015–2016 school year, the University of Wisconsin–Milwaukee had an enrollment of 27,156, with 1,604 faculty members, offering 191 degree programs, including 94 bachelor's, 64 master's and 33 doctorate degrees. The university is classified among "R1: Doctoral Universities – Highest research activity". In 2018, the university had a research expenditure of $55 million. The university's athletic teams are the Panthers. A total of 15 Panther athletic teams compete in NCAA Division I. Panthers have won the James J. McCafferty Trophy as the Horizon League's all-sports champions seven times since 2000. They have earned 133 Horizon League titles and made 40 NCAA tournament appearances as of 2016."""
for punc in "–.,\n":
    Text=Text.replace(punc," ")
Text = Text.lower()
word_list = Text.split()
freq = {}
for word in word_list:
    freq[word] = freq.get(word, 0) + 1
for key, value in sorted(freq.items()):
    if value > 5:
        print(key, value)
Since the loops are nested, the print(key, value) line gets called whenever the outer loop moves on to the next word. As your freq dictionary grows, the same qualifying entries are printed again on every iteration, leading to redundant output.
You probably don't want that; you want to print the freq dictionary only ONCE, after the previous for loop has finished collecting the frequency of each word. Separating the loops achieves this: the second loop runs only after the first one has finished.
Edit: Another thing pointed out by #random-davis is that you don't want to use a built-in name like dict as your variable name, because doing so shadows the built-in type. Change it to freq, or dictionary, or something else.
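For comparison, the same counting can be sketched with collections.Counter from the standard library, which removes the manual dictionary bookkeeping. The function name frequent_words and its min_count and strip_chars parameters are made up for this example.

```python
from collections import Counter


def frequent_words(text, min_count=5, strip_chars="–.,\n"):
    """Return sorted (word, count) pairs for words occurring more than min_count times."""
    # Replace the same punctuation characters as in the question with spaces.
    for ch in strip_chars:
        text = text.replace(ch, " ")
    counts = Counter(text.lower().split())
    # Filter once, after counting has finished, so nothing is reported twice.
    return sorted((word, count) for word, count in counts.items() if count > min_count)


sample = "the cat and the dog and the bird. The end, the END"
for word, count in frequent_words(sample, min_count=2):
    print(word, count)
```

Because the filtering happens in a single expression after the Counter is built, there is no inner loop to nest by accident.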

How to improve nltk human name identifier?

I am trying to extract human names from text.
Does anyone have a method that they would recommend?
The code below also picks up a lot of other data that isn't a name. Why is this happening? I have tried many things to extract the human names from the file, but there is always some error. I want to extract the human names from each sentence, match each name against my database, and then link the sentence to that name, but I have not been able to achieve this.
import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet
person_list = []
person_names=person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
# print (person_list)
text = """
Pooja Hegde Says Instagram Was Hacked After Meme On Samantha Ruth Prabhu Were Posted <title-break> Pooja Hegde shared this image.On Monday night, Pooja tweeted that her Instagram had been hacked and was in the process of being retrieved.— Pooja Hegde (#hegdepooja) May 27, 2020Spent the last hour stressing about the safety of my Instagram account.— Pooja Hegde (#hegdepooja) May 27, 2020Pooja Hegde's feed has been cleaned up now - the most recent Instagram post is now one on pet food that she posted three days ago.Pooja Hegde works mainly in the Telugu film industry and has also appeared in Hindi movies like Mohenjo Daro and Housefull 4.
T-Series' Hanuman Chalisa Crosses 1 Billion Views On Youtube <title-break> Hariharan in a still from Hanuman Chalisa.The recitation of Hanuman Chalisa is known to ward off evil and danger and empowers one with the strength and courage to face any problem head-on.T-Series' Hanuman Chalisa, sung by veteran singer Hariharan, which prominently features T-Series founder Gulshan Kumar, becomes the go-to rendition of this holy recital, crossing the highly coveted 1 billion mark on YouTube.Says T-Series head Bhushan Kumar, "People turn to and recite the Hanuman Chalisa during their low phase.We, at T-Series have always focused, supported and brought to audiences devotional music right from the time of the inception of the company."
Henry Cavill May Return As Superman And Twitter Is Losing Its Mind <title-break> Henry Cavill in Batman v Superman.While we await the official confirmation, Twitter is busy rejoicing over the reports and can't wait to see Henry Cavill in the Superman cape once again.Henry Cavill made his first appearance as Superman in 2013's Man Of Steel.Before Henry Cavill, the Superman cape has been worn by actors such as Christopher Reeve, Brandon Routh and Tom Welling.DC Universe's films aside, Henry Cavill famously starred in Mission: Impossible - Fallout alongside franchise veteran Tom Cruise.
Pooja Batra's Throwback Pic Of Nawab Shah Proposing To Her Is Pure Gold <title-break> On Thursday, the 43-year-old actress shared a set of pictures with Nawab Shah and it has our heart.In the pictures, Nawab can be seen kneeling in front of Pooja and holding her hand, while Pooja can be seen smiling with all her heart.Take a look:Pooja and Nawab Shah never fail to paint Instagram red with their loved-up posts for each other."Nawab Shah and I are following the government guidelines to minimise the spread and impact of COVID-19 and encourage all to stay home.On the other hand, Nawab Shah has starred in films such as Don 2, Tiger Zinda Hai, Panipat, Luck and Dilwale.
Shraddha Kapoor And Brother Siddhanth Found A New "Adventure" To Go On <title-break> (courtesy: shraddhakapoor)Highlights Shraddha and Siddhanth shared identical photos on Instagram"Make sure you are wearing a mask," wrote Siddhanth"Stay safe and all," he addedShraddha Kapoor and her brother Siddhanth found a fun way to spend quality time together.Sharing a picture on her Instagram profile on Thursday, Shraddha described it as "groceries adventure with my bhaiya Siddhanth Kapoor."Siddhanth Kapoor, in his caption, wrote: "It's a lot of fun when you go grocery shopping with your sibling."What fun yaar.... Should do this every day (kidding)," Siddhanth Kapoor commented on his sister's post.Siddhanth Kapoor has starred in films like Shootout at Wadala, Jazbaa, Ugly, Bombairiya and Paltan among others.
Reese Witherspoon And Son Are "Dreaming" Of Travelling To India <title-break> Reese Witherspoon shared this image.On Thursday, Reese shared a picture on her Instagram profile where the mother-son duo can be seen engrossed in reading an activity book on India.In the picture, Tennessee can be seen holding a pen while Reese can be seen pointing to a page with 'India' written on it.Sharing the picture, Reese wrote, "Dreaming of the places we will go."Reese Witherspoon has a body of work that includes films such as Election, Cruel Intentions, Legally Blonde, Sweet Home Alabama, Walk The Line and Wild.
Amitabh Bachchan, As He Once Was Vs As He Is Now. See His Post <title-break> In a then vs now mood, Amitabh Bachchan shared a picture collage of himself from his 1976 film Kabhie Kabhie, and a snippet of his look from his upcoming film Gulabo Sitabo.Sharing the picture, Big B wrote about the evergreen song Kabhi Kabhi Mere Dil Mein from Kabhie Kabhie."Srinagar, Kashmir.. Kabhie Kabhie.. writing the verse for the song Kabhi Kabhi Mere Dil Mein Khayaal Aata Hai," he wrote.Sahir Ludhianvi received the Filmfare Award for the best lyricist for Kabhi Kabhi Mere Dil Mein.Kabhie Kabhie also starred Rishi Kapoor and Neetu Kapoor in pivotal roles.
When Varun Dhawan And Kiara Advani Danced To Sun Saathiya <title-break> (courtesy: YouTube)Highlights Varun posted a video on his YouTube accountThe song originally featured Varun and Shraddha Kapoor"I hope you guys like this video," wrote Varun DhawanYou might have seen Varun Dhawan and Kiara Advani dancing together to the song First Class from the 2019 period drama Kalank but have you seen them grooving together to ABCD 2's Sun Saathiya?On Wednesday, Varun posted a video from his and Kiara's dance rehearsal session on his official YouTube channel.Here's a throwback dance cover video of my rehearsal with Kiara Advani.Kiara Advani was last seen in Netflix's Guilty.Kiara Advani made her Bollywood debut with the 2014 film Fugly.
Gabriella Demetriades Is "Back To The City." See Her Post <title-break> Gabriella Demetriades shared this photo.In the pictures, Gabriella can be seen sporting an all-white outfit while clicking mirror selfies in her apartment.Sharing the pictures, Gabriella captioned her post with these words: "Back to the city and my spot."Take a look:Earlier in the day, Gabriella Demetriades shared stunning pictures of herself chilling at the farmhouse.Gabriella Demetriades and Arjun Rampal started dating in 2018 after meeting through common friends.
Can You Spot Sonam And Arjun Kapoor In This Priceless Throwback Pic? <title-break> Aadar Jain shared this photo.On Wednesday, Aadar shared a photo from his childhood featuring Sonam Kapoor, her cousin Arjun Kapoor and others.The throwback photo was also re-shared by Sonam and Arjun Kapoor on their respective social media profile.On the other hand, Sonam Kapoor's last project remains the 2019 romance-drama The Zoya Factor.Sonam Kapoor's brother Harshvardhan Kapoor made his debut in Bollywood as a lead actor with Rakeysh Omprakash Mehra's Mirzya.
When Kriti Sanon's Sister Nupur Gave Her A Haircut At Home. So, How Did She Do? <title-break> (Image courtesy: kritisanon)Highlights Kriti Sanon posted a video from her haircut session on Instagram"The short hair looks so cute," commented a fan"Thank you, Nupur Sanon for such a refreshing cut," wrote Kriti SanonLeave it to Kriti and Nupur Sanon to set sister goals for every occasion.A few days ago, Kriti posted a video of Nupur giving her a haircut at home.Sharing the video on Instagram, Kriti wrote: "Baal baal bach gaye... Watch it till the end to see for yourself!Kriti Sanon made her Bollywood debut with the 2014 film Heropanti, co-starring Tiger Shroff.Kriti's sister Nupur Sanon is a singer, who made her acting debut opposite Akshay Kumar in B Praak's music video, titled Filhall.
Viral: Kareena Kapoor Asking "What Is The Meaning" While Filming Dabangg 2 Song Fevicol Se <title-break> Kareena Kapoor in a still from the clip.(Image courtesy: kareena.arabfc)Highlights Kareena Kapoor featured in a special dance number in 'Dabangg 2'A throwback video of hers shooting the track Fevicol Se is trending"What is the meaning of that..patrol se?"The clip is actually a BTS video of Kareena shooting the film's track Fevicol Se and it is going crazy viral on social media.The peppy track featured Kareena dancing with Salman Khan, who played the lead in all the three parts of the Dabangg series.Check out the viral clip, shared by several fan-pages dedicated to Kareena Kapoor, here:The first part of the Dabangg series featured Malaika Arora in the song Munni Badnaam Hui.
A Thoughtful Ranveer Singh, Described Best In Two Pics <title-break> Ranveer Singh shared this photo (courtesy ranveersingh)Highlights Ranveer shared two new photos on InstagramBoth the pics are close-up shots of the actorHe shared the pics without captionsRanveer Singh, who loves to interact with his fans on Instagram, updated his feed with two new photos recently.In one of the photos, Ranveer can be seen enjoying sea vibes on what appears to be a beach.The actor paid a tribute of sorts to Sylvester Stallone aka Rocky Balboa of the Rocky series of films, with this cool selfie.Ranveer Singh, who loves being extra, also posted this photo of himself as Joe Exotic of Netflix crime series Tiger King.Ranveer Singh was last seen in Gully Boy.
Amitabh Bachchan's Amar Akbar Anthony "Crosses Collections Of Baahubali 2," Allowing For Inflation <title-break> Sharing a black and white photo from the film's sets, Big B wrote: "Shweta and Abhishek visited me on sets of Amar Akbar Anthony.On Twitter, Amitabh Bachchan shared a few more Amar Akbar Anthony memories:T 3544 -43 YEARS .... 'Amar Akbar Anthony' is estimated to have made Rs 7.25 crore in those days.In March, Amitabh Bachchan shared this priceless gem from the film's mahurat with Dharmendra holding the clapperboard.pic.twitter.com/wKpMBIrubZ — Amitabh Bachchan (#SrBachchan) March 2, 2020Amitabh Bachchan will next be seen in Gulabo Sitabo, which releases on Amazon Prime on June 12.
"A Year Since You Left": Ajay Devgn Posts Tribute To Father Veeru <title-break> Ajay Devgn shared this video.(courtesy ajaydevgn)Highlights Ajay shared a video collage featuring Veeru Devgan"Your presence is reassuring," Ajay addedAbhishek Bachchan commented with a folded hands emoticonIn a moving tribute to father Veeru Devgan, Ajay Devgn shared a video collage featuring himself with Veeru Devgan on his Instagram profile.In the monochrome video collage, Ajay Devgn and Veeru Devgan can be seen posing for the camera on various occasions.Remembering Veeru Devgan on his first death anniversary, Ajay Devgn wrote a heartwarming post dedicated to him.Last year, on Teacher's day, Ajay Devgn remembered Veeru Devgan for the "invaluable life lessons."
Namrata Shirodkar's Throwback Pic Of Mahesh Babu And Son Gautham Is "Just Too Adorable" <title-break> The actress, who is rummaging through the dust-caked family albums in lockdown, just dropped another adorable throwback picture on her Instagram profile.Adding to her series "memory therapy," Namrata shared a picture featuring Mahesh Babu and their son Gautham and it is a blast from the past.In the picture, Mahesh Babu can be seen holding a pint-sized Gautham as they pose for the camera together.She wrote, "Mahesh Babu, do you even remember where this was?Take a look:Namrata Shirodkar keeps treating her Instafam to adorable throwback pictures featuring Mahesh Babu, son Gautham and daughter Sitara.
Karan Johar Thinks Rani Mukerji Is A "Magician." Here's Why <title-break> Karan Johar with Rani Mukerji.Sharing a picture of the drool-worthy cake, Karan Johar wrote a thank you note for Rani.Rani Mukerji and Karan Johar share a great rapport with each other.Karan Johar was the only Bollywood celebrity to be present at the wedding ceremony of Rani Mukerji in Italy - the actress got married to Aditya Chopra in 2014.Rani has worked with Karan Johar in films such as Kuch Kuch Hota Hai, Kabhi Khushi Kabhie Gham and Kabhi Alvida Naa Kehna.
Viral: The Internet Struck Gold With This Pic Of Shah Rukh Khan As A Teen <title-break> Shah Rukh Khan, who began his career as an actor with Fauji in 1988, attended St. Columba's School in New Delhi.In the photo, Shah Rukh can be seen posing with who seem to be his classmates.After passing out from school, Shah Rukh graduated from Delhi's Hansraj college.A year before his Bollywood debut, Shah Rukh got married to Gauri Khan.Shah Rukh Khan, who was last seen in Zero, has a few production projects lined-up.
Priyanka Chopra's "Zoom Meeting Lewk" Is Pretty Much The Same As Ours <title-break> And while switching between chilling on the couch and attending work meetings via Zoom, Priyanka has discovered the art of a quick look change.The busy actress shared glimpses of her "Zoom meeting lewk" on Instagram recently and all we can say is - been there, done that.Priyanka shared two photos in which she can be seen sporting a white blazer over a peach top and a pair of white pyjamas.For accessories, Priyanka was wearing flip flops, to go with her pyjama/zoom meeting "lewk".She recently shared this "Expectation vs reality" post, which will make you roll on the floor laughing.
"
"""
names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break
print(person_names)
The output of the code
['Samantha Ruth Prabhu Were', 'Pooja Hegde', 'Mohenjo Daro', 'Gulshan Kumar', 'Bhushan Kumar', 'Twitter Is', 'Brandon Routh', 'Pooja Batra', 'Her Is Pure Gold', 'Shraddha Kapoor', 'Highlights Shraddha', 'Siddhanth Kapoor', 'Son Are', 'Legally Blonde', 'Amitabh Bachchan', 'Kabhie Kabhie', 'Kabhi Kabhi Mere Dil Mein', 'Kabhi Kabhi Mere Dil', 'Rishi Kapoor', 'Neetu Kapoor', 'Varun Dhawan', 'Highlights Varun', 'Varun DhawanYou', 'Kiara Advani', 'Gabriella Demetriades', 'Arjun Rampal', 'Arjun Kapoor', 'Sonam Kapoor', 'Harshvardhan Kapoor', 'Rakeysh Omprakash Mehra', 'Sister Nupur Gave', 'Nupur Sanon', 'Akshay Kumar', 'Kareena Kapoor', 'Fevicol Se', 'Malaika Arora', 'Munni Badnaam Hui', 'Ranveer Singh', 'Sylvester Stallone', 'Joe Exotic', 'Amar Akbar Anthony', 'Amazon Prime', 'Father Veeru', 'Veeru Devgan', 'Ajay Devgn', 'Namrata Shirodkar', 'Son Gautham', 'Karan Johar', 'Aditya Chopra', 'Kuch Kuch Hota Hai', 'Kabhi Khushi Kabhie Gham', 'Kabhi Alvida Naa Kehna', 'Shah Rukh Khan', 'Shah Rukh', 'Priyanka Chopra']

How to count unique words in python with function?

I would like to count unique words with a function. By unique words I mean words that appear only once, which is why I used set here. The error is shown below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall('\w+', corpus_text_train)
    uw = len(set(words))
    return uw
unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)): #iterate over the rows in dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]','',corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words=word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word]=1
            else:
                word2count[word]+=1
    total=0
    for key in word2count.keys():
        total+=word2count[key]
    for key in word2count.keys():
        word2count[key]=word2count[key]/total
    return word2count,corpus_text_train
test_dict,test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", so in OP's case people and people, would be counted as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the , and ", you could do the following:
text ='aa, aa bb cc'
def unique_words(text):
    words = text.replace('"','').replace(',', '').split()
    unique = list(set(words))
    return len(unique)
unique_words(text)
# out
3
There are numerous ways to add more characters to be replaced.
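One such way, as a minimal sketch: rather than chaining a replace() call per character, build a translation table once and strip every ASCII punctuation character in a single pass. The helper name _STRIP_PUNCT is made up for this illustration, and string.punctuation covers ASCII only, so typographic quotes such as “ and ” would still need to be added by hand.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# (_STRIP_PUNCT is a hypothetical helper name, not part of the original answer.)
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def unique_words(text):
    """Count distinct whitespace-separated words after removing punctuation."""
    words = text.translate(_STRIP_PUNCT).split()
    return len(set(words))

print(unique_words('aa, aa "bb" cc; cc.'))  # 3
```

Because the table is built once at import time, adding another character to strip is a one-line change to the third argument of str.maketrans.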
s='aa aa bb cc'
def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return(len(set(splitted)))
unique_words(s)
Out[14]: 3
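Two points from the question may be worth spelling out. First, the TypeError appears to come from passing test_list_of_str, which is a list of strings, to re.findall, which expects a single string. Second, len(set(words)) counts distinct words, which is not the same as words that appear only once. The sketch below handles both; the function names and the sample list are assumptions made for illustration, not part of the original code.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count word occurrences across one string or an iterable of strings."""
    if isinstance(texts, str):
        texts = [texts]  # allow a single string as well as a list
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

def distinct_words(texts):
    """Number of different words (what len(set(words)) computes)."""
    return len(word_counts(texts))

def words_appearing_once(texts):
    """Number of words whose total count is exactly 1."""
    return sum(1 for n in word_counts(texts).values() if n == 1)

sample = ["aa aa bb cc", "cc dd"]  # hypothetical stand-in for test_list_of_str
print(distinct_words(sample))        # 4
print(words_appearing_once(sample))  # 2
```

With the original function, joining the list first (for example ' '.join(test_list_of_str)) would also avoid the error, at the cost of building one large string.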

How to arrange the statement accordingly in dictionary?

I have a function which returns commentary statements. The problem is that they contain <br/> tags, and I want the text after each of these to start on a new line.
from pycricbuzz import Cricbuzz
c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if(match['mchstate'] != 'nextlive'):
        col= (c.commentary(match['id']))
        for my_str in col['commentary']:
            current_game3[ "commentary2"] = my_str
            commentary1.append(current_game3)
            current_game3 = {}
print(commentary1)
When I print this, I get the output below:
{'commentary2': 'Preview by Tristan Lavalette<br/><br/>The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.<br/><br/>In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.<br/><br/>Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.<br/><br/>Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.<br/><br/>Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. 
Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.<br/><br/>If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.<br/><br/>Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.<br/><br/>New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonisingly lost consecutive matches.<br/><br/>Despite their struggles, New Zealand know one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.<br/><br/>With all to play for, the stage is set for a memorably entertaining finish for this inaugural tri-series tournament.<br/><br/>When: Wednesday, February 21, 2018; 7PM local, 11.30AM IST<br/><br/>Where: Eden Park, Auckland<br/><br/>What to expect: There is a chance of showers intervening. 
Once again, there should be plenty of runs on offer on the small ground although the pitch is tipped to produce some turn.<br/><br/>Team News<br/><br/>New Zealand: Despite agonisingly losing their last couple of games, New Zealand are set to stick with the same line-up.<br/><br/>Probable XI: Martin Guptill, Colin Munro, Kane Williamson (c), Colin de Grandhomme, Mark Chapman, Ross Taylor, Tim Seifert (wk), Mitchell Santner, Tim Southee, Ish Sodhi, Trent Boult<br/><br/>Australia: Zampa could be in line to play with the pitch possibly providing some turn. However, a red hot Australia may not want to disturb a winning combination.<br/><br/>Probable XI: David Warner, D\'Arcy Short, Chris Lynn, Glenn Maxwell, Aaron Finch, Marcus Stoinis, Alex Carey (wk), Ashton Agar, Kane Richardson, Andrew Tye, Billy Stanlake<br/><br/>Did you know<br/><br/>- Australia\'s greatest winning streak in T20Is is their six straight victories at the 2010 World T20 before losing the final to England<br/><br/>- David Warner has won 8 of 9 as T20 captain. The best record overall - minimum 10 matches - is Pakistan\'s Sarfraz Ahmed\'s 14 wins from 17 matches<br/><br/>- New Zealand have lost their last four T20I matches at Eden Park<br/><br/>What they said<br/><br/>"We\'ve had three pretty close T20 games, Australia batting exceptionally well at Eden Park and chasing down a score that was pretty formidable. But you\'ve got to be in the final and give yourself a chance" - Mike Hesson, the New Zealand coach.<br/><br/>"You\'ve just got to find a way to get one or two wickets in the first six (overs), it\'s as simple as that" - David Warner, the Australia captain, said about bowling at the tiny Eden Park.'},
I want to arrange it like this:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.
Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.
Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.
If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.
Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.
New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonizingly lost consecutive matches.
Despite their struggles, New Zealand knows one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.
Assuming you want to print each commentary dictionary in the commentary1 list, you want to replace the
print(commentary1)
line with
print("\n".join([" ".join(i.values()).replace("<br/><br/>", "\n") for i in commentary1]))
That takes every dictionary in the commentary1 list, joins its values with a space, replaces each <br/><br/> pair with \n, and then joins the resulting strings with newlines.
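To try that one-liner without calling the Cricbuzz API, here is a small self-contained sketch. The two sample entries are shortened stand-ins for real commentary, and format_commentary is a name made up for this example.

```python
# Hypothetical stand-in for the list built from c.commentary(...) in the question.
commentary1 = [
    {"commentary2": "Preview by Tristan Lavalette<br/><br/>Where: Eden Park, Auckland"},
    {"commentary2": "Team News<br/><br/>Did you know"},
]

def format_commentary(entries):
    """Join each dict's values with a space, turn <br/><br/> into a newline,
    then join all entries with newlines."""
    return "\n".join(
        " ".join(entry.values()).replace("<br/><br/>", "\n") for entry in entries
    )

print(format_commentary(commentary1))
```

Note that only the exact pair <br/><br/> is replaced here, so a single <br/> on its own would be left in the text.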
Use this:
from pycricbuzz import Cricbuzz
c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col= (c.commentary(match['id']))
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str.replace('<br/>', '\n')
            commentary1.append(current_game3)
            current_game3 = {}
for comment in commentary1:
    print(comment['commentary2'])
Partial Output:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is
set to finish with a bang at the tiny Eden Park on Wednesday (February
21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday
and produced a run-fest with the rampaging Australia successfully
chasing down a record target of 244. The unbeaten Australia head into
the final as favourites after a dazzling campaign from their new look
side brimming with in-form Big Bash League players and headed by
skipper David Warner, whose inventive captaincy has been
inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1
T20 ranking having started the tournament a lowly No.7. A victory
would be their sixth straight in the format equalling their best ever
streak.
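Replacing every <br/> with '\n' turns each <br/><br/> pair into two newlines, so paragraphs end up separated by a blank line. If one newline per break is wanted, or if the feed might mix spellings such as <br>, <br/> and </br> (an assumption; the sample output only shows <br/>), a regular expression can collapse any run of such tags into a single newline. This is a sketch only, and br_to_newlines is a hypothetical helper name.

```python
import re

# Matches a run of <br>, <br/>, <br /> or the malformed </br> tags
# (case-insensitive), together with any whitespace following each tag.
_BR_RUN = re.compile(r"(?:<\s*/?\s*br\s*/?\s*>\s*)+", re.IGNORECASE)

def br_to_newlines(text):
    """Collapse each run of br-style tags into one newline."""
    return _BR_RUN.sub("\n", text).strip()

print(br_to_newlines("Team News<br/><br/>Did you know<br>What they said"))
```

In the loop above, my_str.replace('<br/>', '\n') could then be swapped for br_to_newlines(my_str).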
