Related
I want to chunk long text into context-based paragraphs. Right now I split the text into sentences and cut a new chunk every 250 words, calling that a paragraph. That is obviously a bad way to build a paragraph: it's "dumb", related information that crosses the 250-word boundary gets separated, and the result isn't really a paragraph at all, just however many sentences fit before the 250 words are used up. I want context-based paragraph splitting instead, so the chunking is "smart" and each chunk is an actual paragraph.
The code below is what i have now:
import re
newtext = '''
Kendrick Lamar Duckworth is an American rapper, songwriter, and record producer. He is often cited as one of the most influential rappers of his generation. Aside from his solo career, he is also a member of the hip hop supergroup Black Hippy alongside his former Top Dawg Entertainment (TDE) labelmates Ab-Soul, Jay Rock, and Schoolboy Q. Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name K.Dot, releasing a mixtape titled Y.H.N.I.C. (Hub City Threat Minor of the Year) that garnered local attention and led to his signing with indie record label TDE. He began to gain recognition in 2010 after his first retail release, Overly Dedicated. The following year, he independently released his first studio album, Section.80, which included his debut single "HiiiPoWeR". By that time, he had amassed a large online following and collaborated with several prominent rappers. He subsequently secured a record deal with Dr. Dre's Aftermath Entertainment, under the aegis of Interscope Records. Lamar's major-label debut album, Good Kid, M.A.A.D City, was released in 2012, garnering him widespread critical recognition and mainstream success. His third album To Pimp a Butterfly (2015), which incorporated elements of funk, soul, jazz, and spoken word, predominantly centred around the Black-American experience. It became his first number-one album on the US Billboard 200 and was an enormous critical success. His fourth album, Damn (2017), saw continued acclaim, becoming the first non-classical and non-jazz album to be awarded the Pulitzer Prize for Music. It also yielded his first number-one single, "Humble", on the US Billboard Hot 100. Lamar curated the soundtrack to the superhero film Black Panther (2018) and in 2022, released his fifth and last album with TDE, Mr. Morale & the Big Steppers, which received critical acclaim. Lamar has certified sales of over 70 million records in the United States alone, and all of his albums have been certified platinum or higher by the Recording Industry Association of America (RIAA). He has received several accolades in his career, including 14 Grammy Awards, two American Music Awards, six Billboard Music Awards, 11 MTV Video Music Awards, a Pulitzer Prize, a Brit Award, and an Academy Award nomination. In 2012, MTV named him the Hottest MC in the Game on their annual list. Time named him one of the 100 most influential people in the world in 2016. In 2015, he received the California State Senate's Generational Icon Award. Three of his studio albums were included on Rolling Stone's 2020 list of the 500 Greatest Albums of All Time. Kendrick Lamar Duckworth was born in Compton, California on June 17, 1987, the son of a couple from Chicago. Although not in a gang himself, he grew up around gang members, with his closest friends being Westside Piru Bloods and his father, Kenny Duckworth, being a Gangster Disciple. His first name was given to him by his mother in honor of singer-songwriter Eddie Kendricks of The Temptations. He grew up on welfare and in Section 8 housing. In 1995, at the age of eight, Lamar witnessed his idols Tupac Shakur and Dr. Dre filming the music video for their hit single "California Love", which proved to be a significant moment in his life. As a child, Lamar attended McNair Elementary and Vanguard Learning Center in the Compton Unified School District. He has admitted to being quiet and shy in school, his mother even confirming he was a "loner" until the age of seven. 
As a teenager, he graduated from Centennial High School in Compton, where he was a straight-A student. Kendrick Lamar has stated that Tupac Shakur, the Notorious B.I.G., Jay-Z, Nas and Eminem are his top five favorite rappers. Tupac Shakur is his biggest influence, and has influenced his music as well as his day-to-day lifestyle. In a 2011 interview with Rolling Stone, Lamar mentioned Mos Def and Snoop Dogg as rappers that he listened to and took influence from during his early years. He also cites now late rapper DMX as an influence: "[DMX] really [got me started] on music," explained Lamar in an interview with Philadelphia's Power 99. "That first album [It's Dark and Hell Is Hot] is classic, [so he had an influence on me]." He has also stated Eazy-E as an influence in a post by Complex saying: "I Wouldn't Be Here Today If It Wasn't for Eazy-E." In a September 2012 interview, Lamar stated rapper Eminem "influenced a lot of my style" and has since credited Eminem for his own aggression, on records such as "Backseat Freestyle". Lamar also gave Lil Wayne's work in Hot Boys credit for influencing his style and praised his longevity. He has said that he also grew up listening to Rakim, Dr. Dre, and Tha Dogg Pound. In January 2013, when asked to name three rappers that have played a role in his style, Lamar said: "It's probably more of a west coast influence. A little bit of Kurupt, [Tupac], with some of the content of Ice Cube." In a November 2013 interview with GQ, when asked "The Four MC's That Made Kendrick Lamar?", he answered Tupac Shakur, Dr. Dre, Snoop Dogg and Mobb Deep, namely Prodigy. Lamar professed to having been influenced by jazz trumpeter Miles Davis and Parliament-Funkadelic during the recording of To Pimp a Butterfly. Lamar has been branded as the "new king of hip hop" numerous times. Forbes said, on Lamar's placement as hip hop's "king", "Kendrick Lamar may or may not be the greatest rapper alive right now. He is certainly in the very short lists of artists in the conversation." Lamar frequently refers to himself as the "greatest rapper alive" and once called himself "The King of New York." On the topic of his music genre, Lamar has said: "You really can't categorize my music, it's human music." Lamar's projects are usually concept albums. Critics found Good Kid, M.A.A.D City heavily influenced by West Coast hip hop and 90s gangsta rap. His third studio album, To Pimp a Butterfly, incorporates elements of funk, jazz, soul and spoken word poetry. Called a "radio-friendly but overtly political rapper" by Pitchfork, Lamar has been a branded "master of storytelling" and his lyrics have been described as "katana-blade sharp" and his flow limber and dexterous. Lamar's writing usually includes references to racism, black empowerment and social injustice, being compared to a State of Union address by The Guardian. His writing has also been called "confessional" and controversial. The New York Times has called Lamar's musical style anti-flamboyant, interior and complex and labelled him as a technical rapper. Billboard described his lyricism as "Shakespearean".
'''
#1100 words
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'  # [A-Za-z] instead of the original [A-z], which also matches [, \, ], ^, _ and `
for sens in re.findall(regex, newtext):
    newtext = newtext.replace(sens, f'{sens}<eos>')
sentences = newtext.split('<eos>')
current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= 250:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        chunks.append(sentence.split(" "))
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
print(chunks)     # printing all split "paragraphs"
print(chunks[0])  # printing the first "paragraph"
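One common way to make the split "context-based" (a sketch of my own, not from the question) is to embed each sentence and start a new paragraph wherever the similarity between neighbouring sentences drops. Below is a minimal version using scikit-learn's TF-IDF vectors; the 0.05 threshold is an arbitrary starting point that would need tuning:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def split_by_context(sentences, threshold=0.05):
    tfidf = TfidfVectorizer().fit_transform(sentences)  # one vector per sentence
    paragraphs, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(tfidf[i - 1], tfidf[i])[0][0]
        if sim < threshold:  # little word overlap suggests a topic shift
            paragraphs.append(" ".join(current))
            current = []
        current.append(sentences[i])
    paragraphs.append(" ".join(current))
    return paragraphs

paragraphs = split_by_context([s for s in sentences if s.strip()])

A proper sentence-embedding model (e.g. sentence-transformers) captures context better than TF-IDF word overlap, but the splitting logic stays the same.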
I have a text and I would like to insert certain tags at specific positions in it. To do this, I cut the text into a list of characters (not words). It mostly works, but the tag I insert ends up cutting another word in half.
My input (the numbers don't line up because the real text is much longer, but it gives you the idea):
{"text": "The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.", "label": [[328,347,"Article 3 - Violated"],[2269,2323,"Article 3 - Violated"],[2791,2843,"Article 3 - Violated"],[2947,2988,"Article 3 - Violated"],[3099,3110,"Article 3 - Violated"],[3603,3615,"Article 3 - Violated"],[3702,3756,"Article 3 - Violated"],[4793,4923,"Article 3 - Violated"],[5185,5196,"Article 3 - Violated"],[8111,8198,"Article 3 - Respected"],[8510,8521,"Article 3 - Respected"],[8575,8601,"Article 3 - Respected"],[8965,9009,"Article 3 - Respected"],
And I would like to have this:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard</Article 3 - Violated>. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
But this is what I get instead; the tag cuts a word in half:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable stan <Article 3 - Violated/>dard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
My code:
text = list(texte["text"].strip())
label = texte["label"]
for i in label:
    debut = i[0]
    fin = i[1]
    nom = i[2]
    for element in range(len(text)):
        if element == debut:
            text.insert(element, "<" + nom + ">")
        if element == fin:
            a = element + 1
            text.insert(element + 1, "<" + nom + "/>")
string = ""
for element in text:
    string += element
print(string)
Your approach seems a bit odd: (1) why are you making a character list out of the string? (2) The loop for element in range(len(text)): ... seems completely unnecessary; why are you not using debut and fin directly?
Problem with your approach: by inserting items into the list text, the position numbers in the label lists become invalid.
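A quick illustration of how one insert shifts every later position (a toy example, not from the question):

text = list("standard")
text.insert(0, "<tag>")   # every original character moves one slot to the right
print(text.index("d"))    # 5, although "d" was at index 4 before the insert

So after the first tag is inserted, the debut and fin values of the remaining label lists point at the wrong characters, which is why a tag lands in the middle of a word.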
Here's an alternative approach. I'm using the following data as a sample:
texte = {
    "text": "The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.",
    "label": [[262, 375, "Article 3 - Violated"], [637, 695, "Article 3 - Violated"]]
}
The numbers in texte["label"] mark the start and end of the following two passages:
The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.
The sanitary conditions were below any reasonable standard.
The first number in a label list is the position of the start of the passage; the second number is the first position after the last character of the passage. At least that's my assumption; I haven't seen any related information in the question.
Now this
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
results in the following new_text:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. <Article 3 - Violated>The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.<Article 3 - Violated/> There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard.<Article 3 - Violated/> In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
If the second number in a label list is the position of the last character of the passage (instead of the position of the first character after the passage), then the following should produce the same new_text:
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    fin += 1
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
You can use the .replace method:
string.replace(oldvalue, newvalue, count)
In your case you can replace the string "applicant's" with:
text.replace("applicant's", "Name", count)
where count is the number of occurrences you want to replace. You can find more info here:
https://www.geeksforgeeks.org/python-string-replace/
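For instance, replacing only the first occurrence (an invented example, not from the question):

text = "the applicant's cells and the applicant's eyesight"
print(text.replace("applicant's", "Name", 1))
# the Name cells and the applicant's eyesight

Note that str.replace matches by value, not by position, so it cannot target the character offsets from the label lists the way the slicing answer above does.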
I want to create a deduplication process for my database.
I want to measure cosine similarity scores, using Python's sklearn library, between new texts and texts that are already in the database.
I only want to add documents that have a cosine similarity score of less than 0.90.
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
list_of_texts_in_database = [
    "More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
    "UK ministers say sanctions could target companies and individuals linked to the Russian government.",
    "Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
    "He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
    "Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
    "The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
    "A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
    "That could lead to a major escalation in conflict."]
list_of_new_texts = [
    "This is a totaly new document that needs to be added into the database one way or another.",
    "Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
    "Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
    "UK minister say sanctions could target companies and individuals linked to the Russian government.",
    "That could lead to a major escalation in conflict."]
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words=None, ngram_range=(1, 1))
list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)
cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)
This code works well, but I don't know how to map the results back to the texts (i.e., how to get the texts whose similarity score is less than 0.90).
My suggestion would be as follows: you only add those texts with a score less than (or equal to) 0.9.
import numpy as np
idx = np.where((cosineSimilarities <= 0.9).all(axis=1))
Then idx holds the indices of the new texts in list_of_new_texts that do not have a corresponding text with a score > 0.9 in the already existing list list_of_texts_in_database.
To combine them you can do the following (although somebody else might have a cleaner method for this...):
print(
    list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)
Output:
['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
'That could lead to a major escalation in conflict.',
'This is a totaly new document that needs to be added into the database one way or another.',
'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
'UK minister say sanctions could target companies and individuals linked to the Russian government.']
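If you also want to see which existing text each rejected new text collided with, here is a small extension of the above (my addition, not part of the original answer):

best = cosineSimilarities.argmax(axis=1)  # closest database text for each new text
for i, j in enumerate(best):
    if cosineSimilarities[i, j] > 0.9:
        print(f"new text {i} duplicates database text {j} "
              f"(score {cosineSimilarities[i, j]:.2f})")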
Why don't you work within a DataFrame?
import pandas as pd
d = {'old_text': list_of_texts_in_database[:5], 'new_text': list_of_new_texts,
     'old_emb': list_of_texts_in_database_tfidf[:5], 'new_emb': list_of_new_texts_tfidf}
df = pd.DataFrame(data=d)
df['score'] = df.apply(lambda row: cosine_similarity(row['old_emb'], row['new_emb'])[0][0], axis=1)
df = df.loc[df.score > 0.9, 'score']
df.head()
I have a string that contains sentences. If this string contains more characters than a given number, I'd like to split it into several strings, each with fewer than the maximum number of characters but still containing full sentences.
I did the below, which seems to run okay, but I'm not sure whether I will experience bugs putting this in production. Does the below look okay?
from nltk.tokenize import sent_tokenize
my_text = "President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 
3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. But it is critical,” Markel said. President Donald Trump announced Friday that he has tested positive for Covid-19, and he isn’t the first sitting president to contract a highly contagious and potentially deadly virus in the middle of a pandemic.Former President Woodrow Wilson became ill with the 1918 flu when he was in Paris in April 1919 organizing a peace treaty and the League of Nations following World War I.Wilson wasn’t a healthy man and “always frail,” said Howard Markel, a physician and medical historian at the University of Michigan. He would go on to have symptoms such as headache, high fever, cough and runny nose, Markel said. Many of Wilson’s aides would also contract the flu, including his chief of staff, he added.Trump tweeted overnight that he and first lady Melania Trump tested positive for the coronavirus after the White House confirmed that aide Hope Hicks had tested positive and had some symptoms.Trump was experiencing “mild symptoms” after testing positive for the coronavirus, White House chief of staff Mark Meadows confirmed to reporters Friday morning. The announcement came hours after the administration confirmed that White House aide Hope Hicks tested positive for the virus.For Wilson, the virus “took its toll on him,” Markel said. “That can have neurologic and long-term complications. And he was already at the time traveling and living on a train and giving five to 10 speeches a day. That’s not healthy.”When he got back to the United States, Wilson went on a whistle-stop tour to get the League of Nations ratified, which ultimately failed, Markel said. While on his tour, Wilson became thinner, paler and more frail, Markel would write in a column. He lost his appetite, his asthma grew worse and he complained of unrelenting headaches, he added. He would later have a bad stroke.“His wife basically took over the presidency after that,” he added.Many infectious disease experts and medical historians have drawn other parallels between 1918 and today. Schools and businesses were also closed and infected people were quarantined a century ago. People were also resistant to wearing face masks, calling them dirt traps and some clipped holes so they could smoke cigars.Several U.S. cities implemented mandates, describing them as a symbol of “wartime patriotism.” In San Francisco, then-Mayor James Rolph said, ”[C]onscience, patriotism and self-protection demand immediate and rigid compliance,” according to influenzaarchive.org, which is authored by Markel. But some people refused to comply or take them seriously, Markel said.“One woman, a downtown attorney, argued to Mayor Rolph that the mask ordinance was ‘absolutely unconstitutional’ because it was not legally enacted, and that as a result, every police officer who had arrested a mask scofflaw was personally liable,” according to influenzaarchive.org.As with Trump, some reports and historians have suggested that Wilson downplayed the severity of the virus. 
But Markel said that is a “wrong and a false trope of popular history.”The federal government played a very small role in American public health during that era, he said. Unlike today, there was no CDC or national public health department. The Food and Drug Administration existed, but it consisted of a very small group of men.“It was primarily a city and state role, and those agencies were hardly downplaying it,” Markel said.Unlike today, Wilson did not get sick during his reelection, Markel said. He said the public needs to know “how healthy or how not healthy” Trump is before the election on Nov. 3.“When you’re voting for a president now, you really are potentially voting for the vice president,” he said. “Because what if Trump gets sick and gets incapacitated or worse between Election Day and Jan. 20 because of Covid? Well then the elected vice president becomes president.”“The importance of him being clear, open and honest — or his doctors — with his health conditions is something I’m skeptical we’ll see. But it is critical,” Markel said."
sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)
To better explain my point about the problem with the second if block, expressed in the comments, see the following example.
We want strings of max length 15, i.e. the analogue of 5120 in this case is 16. As you can see, the first three items in the list are 5 + 6 + 4 = 15 characters, so the first shortened_sentence should consist of the first three items in the list, but it does not, because the logic of the second if is incorrect.
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijk', 'lmnopqr']
[11, 7]
Compare it with
sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(word) for word in sentences])
sentences_split = []
shortened_sentence = ""
for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
sentences_split.append(shortened_sentence)
print(sentences_split)
print([len(sentence) for sentence in sentences_split])
output
[5, 6, 4, 3]
['abcdefghijklmno', 'pqr']
[15, 3]
Finally, if you are not sure whether "I will experience bugs putting this in production", write tests, a lot of tests. That's what tests are for: to help minimise bugs in production.
Also, note that the second snippet is just a sample implementation; there are other possible implementations.
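For instance, if the second snippet is wrapped in a function (chunk_sentences is a name I am making up here), a couple of sketch tests could pin down the intended behaviour:

def chunk_sentences(sentences, max_len=16):
    # same logic as the second snippet above, just wrapped for testing
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) < max_len:
            current += sentence
        else:
            chunks.append(current)
            current = sentence
    chunks.append(current)
    return chunks

def test_no_chunk_exceeds_max_len():
    assert all(len(c) < 16 for c in chunk_sentences(['abcde', 'fghijk', 'lmno', 'pqr']))

def test_no_text_is_lost():
    sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
    assert "".join(chunk_sentences(sentences)) == "".join(sentences)

A test would also expose the edge case where the very first sentence is already longer than max_len: the implementation above then emits an empty first chunk.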
I would like to count unique words with a function. The unique words I want to count are words that appear only once, which is why I used a set here. I put the error below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw
unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in the dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+', corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1
    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key] / total
    return word2count, corpus_text_train
test_dict, test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", which in the OP's case would count people and people, (note the trailing comma) as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the , and ", you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to do this; for example, simply splitting on whitespace:
s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
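As for the TypeError itself: BOW_model_relative returns corpus_text_train as a list of strings, while re.findall expects a single string, which is exactly what the error message says. A minimal sketch of a fix (joining the list first) that also counts words appearing exactly once, since that was the stated definition of "unique":

import re
from collections import Counter

def unique_words(corpus_text_train):
    # re.findall needs one string; join the list returned by BOW_model_relative
    if isinstance(corpus_text_train, list):
        corpus_text_train = ' '.join(corpus_text_train)
    words = re.findall(r'\w+', corpus_text_train)
    distinct = len(set(words))  # number of different words
    # len(set(...)) counts distinct words; words occurring exactly once need a frequency count
    once = sum(1 for c in Counter(words).values() if c == 1)
    return distinct, once

print(unique_words(['aa aa bb', 'cc']))  # (3, 2)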