How to count unique words in python with function?

I would like to count unique words with a function. By unique words I mean words that appear only once, which is why I used set here. The error is shown below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall('\w+', corpus_text_train)
    uw = len(set(words))
    return uw
unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)): #iterate over the rows in dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]','',corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words=word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word]=1
            else:
                word2count[word]+=1
    total=0
    for key in word2count.keys():
        total+=word2count[key]
    for key in word2count.keys():
        word2count[key]=word2count[key]/total
    return word2count,corpus_text_train
test_dict,test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”

The answer from Kohelet ignores characters such as , and ", so in the OP's case people and people, would be counted as two different words. To count only actual words you need to strip the unwanted characters first. To remove , and " you could do the following:
text = 'aa, aa bb cc'
def unique_words(text):
    words = text.replace('"','').replace(',', '').split()
    unique = list(set(words))
    return len(unique)
unique_words(text)
# out
3
There are numerous ways to extend the set of characters to be replaced.
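One such way, as a minimal sketch: build a translation table that deletes every character in string.punctuation, so individual replace calls are not needed. Note two assumptions: string.punctuation covers ASCII punctuation only (typographic quotes such as “ and ” in the OP's CSV would need to be added by hand), and apostrophes and hyphens are deleted as well, which may or may not be what you want.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def unique_words(text):
    # Remove punctuation, split on whitespace, count distinct tokens.
    words = text.translate(STRIP_PUNCT).split()
    return len(set(words))

print(unique_words('aa, aa "bb" cc'))  # 3
```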

s='aa aa bb cc'
def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return(len(set(splitted)))
unique_words(s)
Out[14]: 3
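Neither snippet above addresses the TypeError itself. re.findall expects a single string, but test_list_of_str is the list of strings returned by BOW_model_relative. A minimal sketch that accepts either a string or a list of strings is shown below; the function name count_words and the lower-casing step are my own assumptions. It returns both the number of distinct words (what len(set(...)) gives) and the number of words that occur exactly once, since the question uses "unique" in the second sense.

```python
import re
from collections import Counter

def count_words(texts):
    """Return (distinct, once_only) for a string or a list of strings."""
    if isinstance(texts, str):
        texts = [texts]
    counts = Counter()
    for text in texts:
        # \w+ skips punctuation; lower() merges "The" and "the".
        counts.update(re.findall(r"\w+", text.lower()))
    distinct = len(counts)
    once_only = sum(1 for n in counts.values() if n == 1)
    return distinct, once_only

print(count_words(["aa, aa bb cc", "bb dd"]))  # (4, 2)
```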

Related

Using regex to capture substring within a pandas df

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase that follows the capitalized name and ends at the second comma. I've tried isolating the rest of the string that follows the name with the code below, but it doesn't work: I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.

The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
   precis             name                  location
0  Appointment of...  DAVID MERRIGAN        Hammonds Plains, Nova Scotia
1  Appointment of...  CARLA R. CONKIN       Fort Steele, British Columbia
2  Appointment of...  JUDY A. WHITE, Q.C.,  Conne River, Newfoundland and...
3  Appointment of...  GRETA SITTICHINLI     Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
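As background on the -1 results in the question: Series.str.find does a literal substring search, like Python's own str.find, so a regex passed to it is looked up character for character and never found. Below is a small standard-library sketch of the extraction with named groups. The pattern is my own, not the one from the answer above, and it assumes every row follows the layout "Appointment of NAME of PLACE, REGION, ..."; rows that deviate return None.

```python
import re

# Assumed layout: "Appointment of <NAME>[,] of <place>, <region>, ..."
APPOINTMENT = re.compile(
    r"^Appointment of (?P<name>.+?),? of (?P<location>[^,]+, [^,]+),"
)

def parse_appointment(precis):
    """Return (name, location), or None if the text does not fit the layout."""
    match = APPOINTMENT.match(precis)
    if match is None:
        return None
    return match.group("name"), match.group("location")

rows = [
    "Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, "
    "to be a member of the Inuvialuit Arbitration Board.",
    "Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland "
    "and Labrador, to be Chairman of the Inuvialuit Arbitration Board.",
]
for row in rows:
    print(parse_appointment(row))
```

Because the groups are named, the same pattern should also work with Series.str.extract, which uses group names as column names.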

# Final Answer
import pandas as pd
data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")
# Split on the first two ' of ' separators: prefix, name, remainder.
# n is passed by keyword because newer pandas versions reject it positionally.
data[['Field_Part1','Field_Part2','Field_Part3']] = data['Precis'].str.split(' of ', n=2, expand=True)
# The location is the first two comma-separated pieces of the remainder.
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1']+','+data['Address_part2']
# Drop the helper columns.
data.drop(['Field_Part1','Field_Part2','Field_Part3','Address_part1','Address_part2'],axis=1,inplace=True)
# Output Below
data
This is an easy way to understand it.
Thanks
Leon

Context-Based Paragraph Splitting of an Article or long Text

I want to chunk long text into context-based paragraphs. Right now I just split the text into sentences, group them into chunks of up to 250 words, and call each chunk a paragraph. That is obviously not a good way to build a paragraph: it is "dumb", related information beyond the 250-word limit gets cut off, and the result isn't really a paragraph, just whatever sentences fit before the limit is reached. So I want context-based paragraph splitting that is "smart" and produces actual paragraphs.
The code below is what I have now:
import re
newtext = '''
Kendrick Lamar Duckworth is an American rapper, songwriter, and record producer. He is often cited as one of the most influential rappers of his generation. Aside from his solo career, he is also a member of the hip hop supergroup Black Hippy alongside his former Top Dawg Entertainment (TDE) labelmates Ab-Soul, Jay Rock, and Schoolboy Q. Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name K.Dot, releasing a mixtape titled Y.H.N.I.C. (Hub City Threat Minor of the Year) that garnered local attention and led to his signing with indie record label TDE. He began to gain recognition in 2010 after his first retail release, Overly Dedicated. The following year, he independently released his first studio album, Section.80, which included his debut single "HiiiPoWeR". By that time, he had amassed a large online following and collaborated with several prominent rappers. He subsequently secured a record deal with Dr. Dre's Aftermath Entertainment, under the aegis of Interscope Records. Lamar's major-label debut album, Good Kid, M.A.A.D City, was released in 2012, garnering him widespread critical recognition and mainstream success. His third album To Pimp a Butterfly (2015), which incorporated elements of funk, soul, jazz, and spoken word, predominantly centred around the Black-American experience. It became his first number-one album on the US Billboard 200 and was an enormous critical success. His fourth album, Damn (2017), saw continued acclaim, becoming the first non-classical and non-jazz album to be awarded the Pulitzer Prize for Music. It also yielded his first number-one single, "Humble", on the US Billboard Hot 100. Lamar curated the soundtrack to the superhero film Black Panther (2018) and in 2022, released his fifth and last album with TDE, Mr. Morale & the Big Steppers, which received critical acclaim. 
Lamar has certified sales of over 70 million records in the United States alone, and all of his albums have been certified platinum or higher by the Recording Industry Association of America (RIAA). He has received several accolades in his career, including 14 Grammy Awards, two American Music Awards, six Billboard Music Awards, 11 MTV Video Music Awards, a Pulitzer Prize, a Brit Award, and an Academy Award nomination. In 2012, MTV named him the Hottest MC in the Game on their annual list. Time named him one of the 100 most influential people in the world in 2016. In 2015, he received the California State Senate's Generational Icon Award. Three of his studio albums were included on Rolling Stone's 2020 list of the 500 Greatest Albums of All Time. Kendrick Lamar Duckworth was born in Compton, California on June 17, 1987, the son of a couple from Chicago. Although not in a gang himself, he grew up around gang members, with his closest friends being Westside Piru Bloods and his father, Kenny Duckworth, being a Gangster Disciple. His first name was given to him by his mother in honor of singer-songwriter Eddie Kendricks of The Temptations. He grew up on welfare and in Section 8 housing. In 1995, at the age of eight, Lamar witnessed his idols Tupac Shakur and Dr. Dre filming the music video for their hit single "California Love", which proved to be a significant moment in his life. As a child, Lamar attended McNair Elementary and Vanguard Learning Center in the Compton Unified School District. He has admitted to being quiet and shy in school, his mother even confirming he was a "loner" until the age of seven. As a teenager, he graduated from Centennial High School in Compton, where he was a straight-A student. Kendrick Lamar has stated that Tupac Shakur, the Notorious B.I.G., Jay-Z, Nas and Eminem are his top five favorite rappers. Tupac Shakur is his biggest influence, and has influenced his music as well as his day-to-day lifestyle. 
In a 2011 interview with Rolling Stone, Lamar mentioned Mos Def and Snoop Dogg as rappers that he listened to and took influence from during his early years. He also cites now late rapper DMX as an influence: "[DMX] really [got me started] on music," explained Lamar in an interview with Philadelphia's Power 99. "That first album [It's Dark and Hell Is Hot] is classic, [so he had an influence on me]." He has also stated Eazy-E as an influence in a post by Complex saying: "I Wouldn't Be Here Today If It Wasn't for Eazy-E." In a September 2012 interview, Lamar stated rapper Eminem "influenced a lot of my style" and has since credited Eminem for his own aggression, on records such as "Backseat Freestyle". Lamar also gave Lil Wayne's work in Hot Boys credit for influencing his style and praised his longevity. He has said that he also grew up listening to Rakim, Dr. Dre, and Tha Dogg Pound. In January 2013, when asked to name three rappers that have played a role in his style, Lamar said: "It's probably more of a west coast influence. A little bit of Kurupt, [Tupac], with some of the content of Ice Cube." In a November 2013 interview with GQ, when asked "The Four MC's That Made Kendrick Lamar?", he answered Tupac Shakur, Dr. Dre, Snoop Dogg and Mobb Deep, namely Prodigy. Lamar professed to having been influenced by jazz trumpeter Miles Davis and Parliament-Funkadelic during the recording of To Pimp a Butterfly. Lamar has been branded as the "new king of hip hop" numerous times. Forbes said, on Lamar's placement as hip hop's "king", "Kendrick Lamar may or may not be the greatest rapper alive right now. He is certainly in the very short lists of artists in the conversation." Lamar frequently refers to himself as the "greatest rapper alive" and once called himself "The King of New York." On the topic of his music genre, Lamar has said: "You really can't categorize my music, it's human music." Lamar's projects are usually concept albums. 
Critics found Good Kid, M.A.A.D City heavily influenced by West Coast hip hop and 90s gangsta rap. His third studio album, To Pimp a Butterfly, incorporates elements of funk, jazz, soul and spoken word poetry. Called a "radio-friendly but overtly political rapper" by Pitchfork, Lamar has been a branded "master of storytelling" and his lyrics have been described as "katana-blade sharp" and his flow limber and dexterous. Lamar's writing usually includes references to racism, black empowerment and social injustice, being compared to a State of Union address by The Guardian. His writing has also been called "confessional" and controversial. The New York Times has called Lamar's musical style anti-flamboyant, interior and complex and labelled him as a technical rapper. Billboard described his lyricism as "Shakespearean".
'''
#1100 words
regex = r'([A-z][^.!?]*[.!?]*"?)'
for sens in re.findall(regex, newtext):
    newtext = newtext.replace(f'{sens}', f'{sens}<eos>')
sentencess = newtext.split('<eos>')
sentences = sentencess
current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= 250:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        chunks.append(sentence.split(" "))
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
print(chunks) # printing all split "paragraphs"
print(chunks[0]) # printing 1st "paragraph"
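No answer is recorded for this question here. As one possible starting point, the sketch below replaces the fixed 250-word cut with a crude lexical heuristic: a new chunk starts when a sentence shares too few content words with the chunk built so far, with the word limit kept only as an upper bound. Everything in it is an assumption of mine (the function name, the "longer than three letters" content-word rule, the 0.2 threshold). Sentence embeddings would likely give better boundaries than raw word overlap.

```python
import re

def split_by_topic_shift(sentences, threshold=0.2, max_words=250):
    """Group sentences into chunks, breaking where word overlap drops."""
    def content_words(sentence):
        # Crude stand-in for stop-word removal: keep words longer than 3 letters.
        return {w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3}

    chunks, current, vocab, n_words = [], [], set(), 0
    for sentence in sentences:
        words = content_words(sentence)
        length = len(sentence.split())
        if current:
            # Fraction of this sentence's content words already seen in the chunk.
            overlap = len(words & vocab) / len(words) if words else 1.0
            if overlap < threshold or n_words + length > max_words:
                chunks.append(" ".join(current))
                current, vocab, n_words = [], set(), 0
        current.append(sentence)
        vocab |= words
        n_words += length
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = [
    "Cats purr when happy.",
    "Happy cats sleep often.",
    "Stock markets fell sharply.",
    "Markets recovered later.",
]
print(split_by_topic_shift(demo))
```

On the demo input this yields two chunks, one about cats and one about markets. On real prose the threshold needs tuning.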

Print texts that have cosine similarity score less than 0.90

I want to create a deduplication process for my database.
I want to measure cosine similarity scores, using Python's scikit-learn library, between new texts and texts that are already in the database.
I want to add only documents that have a cosine similarity score less than 0.90.
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
list_of_texts_in_database = ["More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
"UK ministers say sanctions could target companies and individuals linked to the Russian government.",
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
"He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
"Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
"The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
"A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
"That could lead to a major escalation in conflict."]
list_of_new_texts = ["This is a totaly new document that needs to be added into the database one way or another.",
"Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
"Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
"UK minister say sanctions could target companies and individuals linked to the Russian government.",
"That could lead to a major escalation in conflict."]
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words = None, ngram_range=(1, 1))
list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)
cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)
This code works well, but I do not know how to map the results back to the texts (i.e. how to get the texts that have a similarity score less than 0.90).

My suggestion would be as follows. You only add those texts with a score less than (or equal to) 0.9.
import numpy as np
idx = np.where((cosineSimilarities <= 0.9).all(axis=1))
Then you have the indices of the new texts in list_of_new_texts that do not have a corresponding text with a score of > 0.9 in the already existing list list_of_texts_in_database.
Combining them you can do as follows (although somebody else might have a cleaner method for this...)
print(
    list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)
Output:
['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
'That could lead to a major escalation in conflict.',
'This is a totaly new document that needs to be added into the database one way or another.',
'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
'UK minister say sanctions could target companies and individuals linked to the Russian government.']
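The same mapping can be sketched without NumPy indexing, which may make the logic easier to follow. Each row of cosineSimilarities belongs to one new text and each column to one database text, so a new text is novel when the largest value in its row is below the threshold. The function name and the score values below are made up for illustration, and the comparison is strict (< 0.9), as in the question.

```python
def select_novel_texts(new_texts, similarity_rows, threshold=0.9):
    """Return (text, best_score) for new texts with no close match."""
    kept = []
    for text, row in zip(new_texts, similarity_rows):
        best = max(row)  # highest similarity to any existing text
        if best < threshold:
            kept.append((text, best))
    return kept

# Hypothetical scores: 3 new texts compared against 2 existing texts.
new_texts = ["brand new", "near duplicate", "exact duplicate"]
scores = [[0.10, 0.20], [0.95, 0.30], [1.00, 0.00]]
print(select_novel_texts(new_texts, scores))  # [('brand new', 0.2)]
```

With the code from the question, the call would be select_novel_texts(list_of_new_texts, cosineSimilarities).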

Why don't you work within a DataFrame?
import pandas as pd
d = {'old_text':list_of_texts_in_database[:5], 'new_text':list_of_new_texts, 'old_emb': list_of_texts_in_database_tfidf[:5], 'new_emb': list_of_new_texts_tfidf}
df = pd.DataFrame(data=d)
df['score'] = df.apply(lambda row: cosine_similarity(row['old_emb'], row['new_emb'])[0][0], axis=1)
df = df.loc[df.score > 0.9, 'score']
df.head()

Split sentences in Python to not exceed a number of characters

I have a string that contains sentences. If this string contains more characters than a given number, I'd like to split it into several strings, each with fewer than the maximum number of characters but still containing only full sentences.
I did the below, which seems to run okay, but I'm not sure whether I will run into bugs when putting this in production. Does the below look okay?
from nltk.tokenize import sent_tokenize
my_text = "President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. 
Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. 
But it is critical,” Markel said. President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. 
He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 
20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. But it is critical,” Markel said."
sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)

To better explain my point about the problem with the second if block, expressed in the comments, see the following example.
We want strings of at most 15 characters, so the limit (5120 in your code) is 16 here. As you can see, the first 3 items in the list add up to 5 + 6 + 4 = 15, so the first shortened_sentence should consist of the first 3 items in the list. But it does not, because the logic of the second if is incorrect: it adds len(sentence) a second time, after the sentence has already been appended.
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijk', 'lmnopqr']
[11, 7]
Compare it with
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(word) for word in sentences])
sentences_split = []
shortened_sentence = ""
for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
sentences_split.append(shortened_sentence)
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijklmno', 'pqr']
[15, 3]
Finally, if you are not sure whether you "will experience bugs putting this in production", write tests, a lot of tests. That's what tests are for: to help minimise bugs in production.
Also, note that the second snippet is just a sample implementation; other implementations are possible.
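As a rough sketch of what such tests could look like, the second snippet can be wrapped in a function and checked with a few assertions. The name split_into_chunks and the max_len parameter are made up for this illustration and are not part of the original code. The sketch also assumes that a single sentence longer than max_len is kept as its own oversized chunk rather than being cut.

```python
def split_into_chunks(sentences, max_len):
    """Greedily pack consecutive sentences into chunks of at most max_len characters.

    Assumption for this sketch: a sentence that is longer than max_len on its
    own is kept whole as a separate (oversized) chunk.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_len:
            # The sentence still fits into the current chunk.
            current += sentence
        else:
            # Close the current chunk (if any) and start a new one.
            if current:
                chunks.append(current)
            current = sentence
    # Do not forget the last chunk after the loop ends.
    if current:
        chunks.append(current)
    return chunks


# A few simple checks, using the same sample data as above.
sample = ['abcde', 'fghijk', 'lmno', 'pqr']
assert split_into_chunks(sample, 15) == ['abcdefghijklmno', 'pqr']
assert split_into_chunks([], 15) == []
assert [len(chunk) for chunk in split_into_chunks(sample, 15)] == [15, 3]
```

With the real text, the call would presumably be split_into_chunks(sent_tokenize(my_text), 5119), since the original condition is a strict < 5120.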

How to arrange the statement accordingly in dictionary?

I have a function which returns commentary statements. The problem is that they contain <br> and </br> tags; I want each statement to appear on a new line instead.
from pycricbuzz import Cricbuzz
c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if(match['mchstate'] != 'nextlive'):
        col= (c.commentary(match['id']))
        for my_str in col['commentary']:
            current_game3[ "commentary2"] = my_str
            commentary1.append(current_game3)
            current_game3 = {}
print(commentary1)
When I print this, I get the output below:
{'commentary2': 'Preview by Tristan Lavalette<br/><br/>The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.<br/><br/>In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.<br/><br/>Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.<br/><br/>Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.<br/><br/>Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. 
Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.<br/><br/>If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.<br/><br/>Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.<br/><br/>New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonisingly lost consecutive matches.<br/><br/>Despite their struggles, New Zealand know one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.<br/><br/>With all to play for, the stage is set for a memorably entertaining finish for this inaugural tri-series tournament.<br/><br/>When: Wednesday, February 21, 2018; 7PM local, 11.30AM IST<br/><br/>Where: Eden Park, Auckland<br/><br/>What to expect: There is a chance of showers intervening. 
Once again, there should be plenty of runs on offer on the small ground although the pitch is tipped to produce some turn.<br/><br/>Team News<br/><br/>New Zealand: Despite agonisingly losing their last couple of games, New Zealand are set to stick with the same line-up.<br/><br/>Probable XI: Martin Guptill, Colin Munro, Kane Williamson (c), Colin de Grandhomme, Mark Chapman, Ross Taylor, Tim Seifert (wk), Mitchell Santner, Tim Southee, Ish Sodhi, Trent Boult<br/><br/>Australia: Zampa could be in line to play with the pitch possibly providing some turn. However, a red hot Australia may not want to disturb a winning combination.<br/><br/>Probable XI: David Warner, D\'Arcy Short, Chris Lynn, Glenn Maxwell, Aaron Finch, Marcus Stoinis, Alex Carey (wk), Ashton Agar, Kane Richardson, Andrew Tye, Billy Stanlake<br/><br/>Did you know<br/><br/>- Australia\'s greatest winning streak in T20Is is their six straight victories at the 2010 World T20 before losing the final to England<br/><br/>- David Warner has won 8 of 9 as T20 captain. The best record overall - minimum 10 matches - is Pakistan\'s Sarfraz Ahmed\'s 14 wins from 17 matches<br/><br/>- New Zealand have lost their last four T20I matches at Eden Park<br/><br/>What they said<br/><br/>"We\'ve had three pretty close T20 games, Australia batting exceptionally well at Eden Park and chasing down a score that was pretty formidable. But you\'ve got to be in the final and give yourself a chance" - Mike Hesson, the New Zealand coach.<br/><br/>"You\'ve just got to find a way to get one or two wickets in the first six (overs), it\'s as simple as that" - David Warner, the Australia captain, said about bowling at the tiny Eden Park.'},
I want to arrange it like this:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.
Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.
Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.
If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.
Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.
New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonizingly lost consecutive matches.
Despite their struggles, New Zealand knows one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.

Assuming you want to print each commentary dictionary in the commentary1 list, replace the
print(commentary1)
line with
print("\n".join([" ".join(i.values()).replace("<br/><br/>", "\n") for i in commentary1]))
That takes every dictionary in the commentary1 list, joins its values with a space, replaces each <br/><br/> pair with \n, and then joins the resulting strings with newlines.
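To try the one-liner without calling the Cricbuzz API, here is a small sketch that runs the same expression on made-up data. The two sample dictionaries and the helper name format_commentary are assumptions for illustration only; real commentary strings are much longer.

```python
# Hypothetical stand-in for the list built from c.commentary(...).
commentary1 = [
    {"commentary2": "First line<br/><br/>Second line"},
    {"commentary2": "Third line"},
]


def format_commentary(entries):
    """Join each dict's values with a space and turn <br/><br/> pairs into newlines."""
    return "\n".join(
        " ".join(entry.values()).replace("<br/><br/>", "\n") for entry in entries
    )


formatted = format_commentary(commentary1)
print(formatted)
# First line
# Second line
# Third line
```

Note that only the exact pair <br/><br/> is replaced here; a single <br/> would be left in the text.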

Use this:
from pycricbuzz import Cricbuzz
c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col= (c.commentary(match['id']))
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str.replace('<br/>', '\n')
            commentary1.append(current_game3)
            current_game3 = {}
for comment in commentary1:
    print(comment['commentary2'])
Partial Output:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is
set to finish with a bang at the tiny Eden Park on Wednesday (February
21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday
and produced a run-fest with the rampaging Australia successfully
chasing down a record target of 244. The unbeaten Australia head into
the final as favourites after a dazzling campaign from their new look
side brimming with in-form Big Bash League players and headed by
skipper David Warner, whose inventive captaincy has been
inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1
T20 ranking having started the tournament a lowly No.7. A victory
would be their sixth straight in the format equalling their best ever
streak.
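The question mentions <br> and </br> tags, while the sample data contains <br/>. With a plain replace('<br/>', '\n'), the other spellings are left untouched and each <br/><br/> pair becomes two newlines. If one newline per run of tags is wanted, a regular expression is one possible alternative. This is only a sketch: the helper name br_to_newlines is made up, and it assumes the only markup in the text is line-break tags.

```python
import re

# Matches one or more consecutive <br>, <br/>, <br /> or </br> tags,
# including any whitespace that follows each tag.
BR_RUN = re.compile(r"(?:<\s*/?\s*br\s*/?\s*>\s*)+", re.IGNORECASE)


def br_to_newlines(text):
    """Replace each run of line-break tags with a single newline."""
    return BR_RUN.sub("\n", text).strip()


print(br_to_newlines("Preview by Tristan Lavalette<br/><br/>The Twenty20 tri-series decider"))
# Preview by Tristan Lavalette
# The Twenty20 tri-series decider
```

In the loop above, my_str.replace('<br/>', '\n') could then be swapped for br_to_newlines(my_str).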
