I have a text and I would like to insert certain words at specific positions in it. To do this, I split the text into characters (not words). This mostly works, but the word I insert ends up in the middle of another word.
My input (the offsets do not match this excerpt because the real text is much longer, but it gives you the idea):
{"text": "The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.", "label":[[328,347,"Article 3 - Violated"],[2269,2323,"Article 3 - Violated"],[2791,2843,"Article 3 - Violated"],[2947,2988,"Article 3 - Violated"],[3099,3110,"Article 3 - Violated"],[3603,3615,"Article 3 - Violated"],[3702,3756,"Article 3 - Violated"],[4793,4923,"Article 3 - Violated"],[5185,5196,"Article 3 - Violated"],[8111,8198,"Article 3 - Respected"],[8510,8521,"Article 3 - Respected"],[8575,8601,"Article 3 - Respected"],[8965,9009,"Article 3 - Respected"],
And I would like to have this:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard</Article 3 - Violated>. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
but I get this. It cuts the words.
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable stan <Article 3 - Violated/>dard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
My code:
text = list(texte["text"].strip())
label = texte["label"]
for i in label:
    debut = i[0]
    fin = i[1]
    nom = i[2]
    for element in range(len(text)):
        if element == debut:
            text.insert(element, "<" + nom + ">")
        if element == fin:
            a = element + 1
            text.insert(element + 1, "<" + nom + "/>")
string = ""
for element in text:
    string += element
print(string)
Your approach seems a bit odd: (1) Why are you making a character list out of the string? (2) The looping here for element in range(len(text)): ... seems completely unnecessary, why are you not directly using debut and fin?
The problem with your approach: every item you insert into the list text shifts all later items by one position, so the position numbers in the label-lists become invalid.
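A tiny illustration of that index shift (the six-letter string is made up for the example): once a tag has been inserted as a list item, every later character sits one index further along than its original offset.

```python
chars = list("abcdef")

# "e" starts out at index 4.
original_index = chars.index("e")

# Insert an opening tag before "c"; the tag occupies one list slot.
chars.insert(2, "<X>")

# Every character after the tag has moved one position to the right.
shifted_index = chars.index("e")
print(original_index, shifted_index)
```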
Here's an alternative approach. I'm using the following data as a sample:
texte = {
"text": "The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.",
"label": [[262, 375, "Article 3 - Violated"], [637, 695, "Article 3 - Violated"]]
}
The numbers in texte["label"] mark the start and end of the following two passages:
The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.
The sanitary conditions were below any reasonable standard.
The first number in a label-list is the position of the start of the passage, the second number is the first position after the last character of the passage. But I'm not sure about that, I haven't seen any related information in the question.
Now this
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
results in the following new_text:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. <Article 3 - Violated>The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.<Article 3 - Violated/> There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard.<Article 3 - Violated/> In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
If the second number in a label-list is the position of the last character of the passage (instead of the position of the first character after the passage), then the following should produce the same new_text:
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    fin += 1
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
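Both loops rely on the label-lists being sorted by start position and not overlapping. If the order is not guaranteed, one possible variant (a sketch, not part of the original answer) is to process the spans from the end of the text backwards, so that earlier offsets are never shifted. The helper name tag_passages and the short sample string are assumptions made for this example, and the end offset is taken to be exclusive.

```python
def tag_passages(text, labels):
    """Wrap each [start, end, name] span in <name>...<name/> tags.

    Assumes `end` is exclusive and that the spans do not overlap.
    """
    # Go from the last span to the first so earlier offsets stay valid.
    for debut, fin, nom in sorted(labels, reverse=True):
        text = (text[:debut] + "<" + nom + ">" + text[debut:fin]
                + "<" + nom + "/>" + text[fin:])
    return text


sample = "aaa bbb ccc"
# The labels are deliberately given out of order.
tagged = tag_passages(sample, [[8, 11, "Y"], [4, 7, "X"]])
print(tagged)
```

Because each insertion only touches text at or after its own start offset, spans that come earlier in the text keep their original positions.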
You can use the str.replace method:
string.replace(oldvalue, newvalue, count)
In your case you can replace the "applicant's" string with:
text.replace("applicant's", "Name", count_that_times_you_want_to_replace)
You can find more info here:
https://www.geeksforgeeks.org/python-string-replace/
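A minimal illustration of the count argument (the sample sentence is made up for this example). Note that str.replace returns a new string and matches by content rather than by position, so on its own it does not insert tags at given offsets.

```python
text = "the applicant's cells; the applicant's eyesight"

# Replace only the first occurrence; the original string is left unchanged.
replaced_once = text.replace("applicant's", "Name", 1)
print(replaced_once)
```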
I want to chunk a long text into context-based paragraphs. Right now I split the text into sentences, group them into chunks of at most 250 words, and call each chunk a paragraph. That is obviously not a good way to build a paragraph: it is "dumb", information that runs past the 250-word limit gets cut off, and the result isn't really a paragraph, just all the sentences that fit before the 250 words are filled up. So I want context-based paragraph splitting that is "smart" and produces actual paragraphs.
The code below is what I have now:
import re
newtext = '''
Kendrick Lamar Duckworth is an American rapper, songwriter, and record producer. He is often cited as one of the most influential rappers of his generation. Aside from his solo career, he is also a member of the hip hop supergroup Black Hippy alongside his former Top Dawg Entertainment (TDE) labelmates Ab-Soul, Jay Rock, and Schoolboy Q. Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name K.Dot, releasing a mixtape titled Y.H.N.I.C. (Hub City Threat Minor of the Year) that garnered local attention and led to his signing with indie record label TDE. He began to gain recognition in 2010 after his first retail release, Overly Dedicated. The following year, he independently released his first studio album, Section.80, which included his debut single "HiiiPoWeR". By that time, he had amassed a large online following and collaborated with several prominent rappers. He subsequently secured a record deal with Dr. Dre's Aftermath Entertainment, under the aegis of Interscope Records. Lamar's major-label debut album, Good Kid, M.A.A.D City, was released in 2012, garnering him widespread critical recognition and mainstream success. His third album To Pimp a Butterfly (2015), which incorporated elements of funk, soul, jazz, and spoken word, predominantly centred around the Black-American experience. It became his first number-one album on the US Billboard 200 and was an enormous critical success. His fourth album, Damn (2017), saw continued acclaim, becoming the first non-classical and non-jazz album to be awarded the Pulitzer Prize for Music. It also yielded his first number-one single, "Humble", on the US Billboard Hot 100. Lamar curated the soundtrack to the superhero film Black Panther (2018) and in 2022, released his fifth and last album with TDE, Mr. Morale & the Big Steppers, which received critical acclaim. 
Lamar has certified sales of over 70 million records in the United States alone, and all of his albums have been certified platinum or higher by the Recording Industry Association of America (RIAA). He has received several accolades in his career, including 14 Grammy Awards, two American Music Awards, six Billboard Music Awards, 11 MTV Video Music Awards, a Pulitzer Prize, a Brit Award, and an Academy Award nomination. In 2012, MTV named him the Hottest MC in the Game on their annual list. Time named him one of the 100 most influential people in the world in 2016. In 2015, he received the California State Senate's Generational Icon Award. Three of his studio albums were included on Rolling Stone's 2020 list of the 500 Greatest Albums of All Time. Kendrick Lamar Duckworth was born in Compton, California on June 17, 1987, the son of a couple from Chicago. Although not in a gang himself, he grew up around gang members, with his closest friends being Westside Piru Bloods and his father, Kenny Duckworth, being a Gangster Disciple. His first name was given to him by his mother in honor of singer-songwriter Eddie Kendricks of The Temptations. He grew up on welfare and in Section 8 housing. In 1995, at the age of eight, Lamar witnessed his idols Tupac Shakur and Dr. Dre filming the music video for their hit single "California Love", which proved to be a significant moment in his life. As a child, Lamar attended McNair Elementary and Vanguard Learning Center in the Compton Unified School District. He has admitted to being quiet and shy in school, his mother even confirming he was a "loner" until the age of seven. As a teenager, he graduated from Centennial High School in Compton, where he was a straight-A student. Kendrick Lamar has stated that Tupac Shakur, the Notorious B.I.G., Jay-Z, Nas and Eminem are his top five favorite rappers. Tupac Shakur is his biggest influence, and has influenced his music as well as his day-to-day lifestyle. 
In a 2011 interview with Rolling Stone, Lamar mentioned Mos Def and Snoop Dogg as rappers that he listened to and took influence from during his early years. He also cites now late rapper DMX as an influence: "[DMX] really [got me started] on music," explained Lamar in an interview with Philadelphia's Power 99. "That first album [It's Dark and Hell Is Hot] is classic, [so he had an influence on me]." He has also stated Eazy-E as an influence in a post by Complex saying: "I Wouldn't Be Here Today If It Wasn't for Eazy-E." In a September 2012 interview, Lamar stated rapper Eminem "influenced a lot of my style" and has since credited Eminem for his own aggression, on records such as "Backseat Freestyle". Lamar also gave Lil Wayne's work in Hot Boys credit for influencing his style and praised his longevity. He has said that he also grew up listening to Rakim, Dr. Dre, and Tha Dogg Pound. In January 2013, when asked to name three rappers that have played a role in his style, Lamar said: "It's probably more of a west coast influence. A little bit of Kurupt, [Tupac], with some of the content of Ice Cube." In a November 2013 interview with GQ, when asked "The Four MC's That Made Kendrick Lamar?", he answered Tupac Shakur, Dr. Dre, Snoop Dogg and Mobb Deep, namely Prodigy. Lamar professed to having been influenced by jazz trumpeter Miles Davis and Parliament-Funkadelic during the recording of To Pimp a Butterfly. Lamar has been branded as the "new king of hip hop" numerous times. Forbes said, on Lamar's placement as hip hop's "king", "Kendrick Lamar may or may not be the greatest rapper alive right now. He is certainly in the very short lists of artists in the conversation." Lamar frequently refers to himself as the "greatest rapper alive" and once called himself "The King of New York." On the topic of his music genre, Lamar has said: "You really can't categorize my music, it's human music." Lamar's projects are usually concept albums. 
Critics found Good Kid, M.A.A.D City heavily influenced by West Coast hip hop and 90s gangsta rap. His third studio album, To Pimp a Butterfly, incorporates elements of funk, jazz, soul and spoken word poetry. Called a "radio-friendly but overtly political rapper" by Pitchfork, Lamar has been a branded "master of storytelling" and his lyrics have been described as "katana-blade sharp" and his flow limber and dexterous. Lamar's writing usually includes references to racism, black empowerment and social injustice, being compared to a State of Union address by The Guardian. His writing has also been called "confessional" and controversial. The New York Times has called Lamar's musical style anti-flamboyant, interior and complex and labelled him as a technical rapper. Billboard described his lyricism as "Shakespearean".
'''
#1100 words
# [A-Za-z] instead of [A-z], which also matches the characters between 'Z' and 'a'
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'
for sens in re.findall(regex, newtext):
    newtext = newtext.replace(f'{sens}', f'{sens}<eos>')
sentencess = newtext.split('<eos>')
sentences = sentencess
current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= 250:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        chunks.append(sentence.split(" "))
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
print(chunks)  # printing all split "paragraphs"
print(chunks[0])  # printing 1st "paragraph"
My program raises an error. What did I do wrong?
Error:
wordDict[word]+=1, KeyError: 'Mental'
My program:
import os

def func():
    os.chdir("C:\\Users\\dinah\\OneDrive\\Desktop\\nlpPair")
    with open("covid1.dat", "r") as rfile:
        d1 = rfile.read()
    bow1 = d1.split(" ")
    print("\n\t\t ---ni split words from d1 --- \n")
    print(bow1)
    wordSet = {'COVID-19', 'mental', 'symptom', 'pandemic', 'infection'}
    wordDict = dict.fromkeys(wordSet, 0)
    print(wordDict)
    for word in bow1:
        wordDict[word] += 1
    print(wordDict)
The content of covid1.dat file:
Mental health symptoms among American veterans during the COVID-19 Pandemic.
We examined the symptom trajectories of posttraumatic stress disorder (PTSD), depression, and anxiety among 1,230 American veterans assessed online one month prior to the COVID-19 outbreak in the United States (February 2020) through the next year (August 2020, November 2020, February 2021). Veterans slightly increased mental health symptoms over time and those with pre-pandemic alcohol and cannabis use disorders reported greater symptoms compared to those without. Women and racial/ethnic minority veterans reported greater symptoms pre-pandemic but less steep increases over time compared to men and white veterans. Findings point to the continued need for mental health care efforts with veterans.
You are trying to increment an entry in the dictionary that does not exist.
wordDict[word]+=1 expands to wordDict[word] = wordDict[word] + 1. wordDict only contains the five keys from wordSet, and keys are case-sensitive, so the first word in the file, 'Mental', is not among them and looking it up raises KeyError.
To fix it, you can use a defaultdict or, more simply, the default-value parameter of dict.get (note that this counts every word in the file, not only those in wordSet):
wordDict[word] = wordDict.get(word, 0) + 1
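If the goal is to count only the words from wordSet, regardless of case, one possible sketch is the following. It assumes lowercase keys (so 'COVID-19' becomes 'covid-19') and strips a few punctuation characters from each token; the names word_set and sample are made up for the example, and sample is the first line of covid1.dat.

```python
from collections import Counter

# Assumption: keys are stored in lowercase so the comparison is case-insensitive.
word_set = {"covid-19", "mental", "symptom", "pandemic", "infection"}
sample = "Mental health symptoms among American veterans during the COVID-19 Pandemic."

# Lowercase each token and strip surrounding punctuation such as the final ".".
tokens = [w.strip('.,()"').lower() for w in sample.split()]

# Count only tokens that are in word_set; a Counter returns 0 for missing keys.
counts = Counter(t for t in tokens if t in word_set)
word_dict = {w: counts[w] for w in word_set}
print(word_dict)
```

The token "symptoms" does not match the key "symptom"; matching plural forms would need stemming or lemmatization, which this sketch leaves out.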
I want to detect several types of flag patterns in stock market data; the first one I want to identify is the bull flag pattern. I have tried some formulas, but they all missed the point and gave me a lot of stock names that did not have the pattern.
In my most recent attempt I
find the continuous rise and then check that the following values stay around the mean of that rise.
I'm also wondering whether plotting this data with matplotlib or plotly and then applying machine learning to it would be a solution.
The code to get the data is below:
from pprint import pprint
from nsepy import get_history
from datetime import date
from datetime import datetime, timedelta
import matplotlib
from nsetools import Nse

nse = Nse()
old_date = date.today() - timedelta(days=30)
for stock in nse.get_stock_codes():
    print("Stock", stock)
    stock_data = get_history(symbol=stock,
                             start=date(old_date.year, old_date.month, old_date.day),
                             end=date(datetime.now().year, datetime.now().month, datetime.now().day))
Any help will be useful. Thanks in advance.
Bull flag pattern matcher for your pattern day trading code
Pseudocode:
Get the difference between the max() and min() close price over the last n=20 time periods, here called flag_width. Get the difference between the max() and min() close price over the last m=30 time periods, here called poll_height (the height of the flagpole).
When the relative gain of poll_height over flag_width exceeds some large threshold (e.g. thresh=0.90), you can say there is a tight flag in the last n=20 time periods and a tall flagpole in periods -30 to -20.
Another name for this could be "price data elbow" or "hockey stick shape".
MACD does a kind of 12,26 variation on this approach, but using 9-, 12- and 26-day exponential moving averages.
Code gist:
# movingMax returns the maximum price over the last t=20 time periods
highest_close_of_flag = movingMax(closeVector, 20);
# movingMin returns the minimum price over the last t=20 time periods
lowest_close_of_flag = movingMin(closeVector, 20);
highest_close_of_poll = movingMax(closeVector, 30);
lowest_close_of_poll = movingMin(closeVector, 30);
# We want the poll to be much longer than the flag is wide.
flag_width = highest_close_of_flag - lowest_close_of_flag;
poll_height = highest_close_of_poll - lowest_close_of_poll;
# ((new-old)/old) yields the relative gain.
bull_flag = (poll_height - flag_width) ./ flag_width;
# Filter out bull flags whose flagpole tops go too high over the flapping flag
bull_flag -= (highest_close_of_poll - highest_close_of_flag) ./ highest_close_of_flag;
thresh = 0.9;
bull_flag_trigger = (bull_flag > thresh);
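A rough Python translation of the gist for the most recent bar only, offered as a sketch. The function name bull_flag_score and the synthetic price series are assumptions made for this example. It assumes at least 30 closes are available and that the flag is not perfectly flat, since a flat flag would make the division fail.

```python
def bull_flag_score(closes, flag_len=20, pole_len=30):
    """Score the latest bars: a taller pole and a tighter flag score higher."""
    flag = closes[-flag_len:]
    pole = closes[-pole_len:]
    flag_width = max(flag) - min(flag)
    poll_height = max(pole) - min(pole)
    # Relative gain of the pole height over the flag width.
    score = (poll_height - flag_width) / flag_width
    # Penalise poles whose top sits far above the top of the flag.
    score -= (max(pole) - max(flag)) / max(flag)
    return score


# Synthetic series: ten bars rising from 100 to 118, then twenty bars near 120.
closes = [100 + 2 * i for i in range(10)] + [120 + (i % 2) for i in range(20)]
score = bull_flag_score(closes)
trigger = score > 0.9  # same threshold as in the gist
print(score, trigger)
```

To scan a whole price history, the same computation can be applied to each trailing window in turn.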
bull_flag_trigger interpretation
A whale (relatively speaking) bought a green candle flagpole by aggressively hitting bids, or covering a prior naked short position via market options from timeperiod -30 to -20. The fish fought over the new higher price in a narrow horizontal band from timeperiod -20 to 0 at the top of the range.
The bull flag pattern is one of the most popular false flags precisely because it works so well: the other side of your trade may have spent money to paint that structure so that you would happily scoop up their unwanted scrip at a much higher price, and now you're bag-holding scrip at an unrealistic price that nobody wants to buy. You are playing a game of chess or checkers against very strong AI designed by corporations who pull trillions of dollars per year out of this constant-sum game, and your losses are their gains.
Drawbacks to this approach:
This approach doesn't take into account other desirable properties of a bull flag, such as straightness of the poll, high volume in the poll or flag, gain in trading range in the poll, the squareness/triangularness/resonant sloped-ness of the flag, or 10 other variations on what causes people with lots of money to day-trade on the appearance of this arbitrary structure. This is financial advice: you will lose all of your money to other people who can write better AI code in skyscrapers and who get $1*10^9 annual pay packages in exchange for isolated alpha with math proofs and code demos.
The bull flag is a pattern that becomes visible once the chart is plotted; it is hard to find in the raw data.
In my opinion, finding the pattern in an image with deep learning (object detection) is a better choice. That way you can also find other types of patterns, such as the bearish flag.
I would like to count unique words with a function. By unique words I mean words that appear only once, which is why I used set here. The error is shown below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw

unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1
    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key] / total
    return word2count, corpus_text_train

test_dict, test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", so in OP's case people and people, would be counted as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove , and ", you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
Further characters to strip can be added in the same way, by chaining more replace calls.
s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
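Neither snippet addresses the TypeError itself. It most likely arises because test_list_of_str is a list of strings, while re.findall expects a single string. A possible sketch that accepts a list and reports both the number of distinct words and the words that occur exactly once (the definition given in the question) could look like this; the function name word_counts and the two sample strings are made up for the example.

```python
import re
from collections import Counter


def word_counts(docs):
    """Count lowercase word tokens across a list of strings."""
    counts = Counter()
    for doc in docs:
        # Tokenise each string separately instead of passing the whole list.
        counts.update(re.findall(r"\w+", doc.lower()))
    return counts


docs = ["aa, aa bb cc", "bb dd"]
counts = word_counts(docs)
distinct = len(counts)                                       # number of different words
once_only = sorted(w for w, n in counts.items() if n == 1)   # words seen exactly once
print(distinct, once_only)
```

len(set(words)) gives the number of distinct words, which is not the same as the number of words that appear only once; the Counter makes both available.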
How can I calculate correlation between classes of the texts?
E.g., I have 3 texts:
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
"Chennai super kings returns"]
subjects = ["final", "Crowned",
"returns"]
So, each text has a label (class). So, it is close to the text classification problem. But I need to calculate the measure of "difference".
I can count Tfidf and get the matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
         "Chennai super kings returns"]
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
res = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())
(values rounded to four decimals)
           2018    champions  chennai  crowned  final   ipl     kings   returns  super   the     won
"final"    0.3334  0.0000     0.2589   0.0000   0.4384  0.3334  0.2589  0.0000   0.2589  0.4384  0.4384
"Crowned"  0.3710  0.4878     0.2881   0.4878   0.0000  0.3710  0.2881  0.0000   0.2881  0.0000  0.0000
"returns"  0.0000  0.0000     0.4129   0.0000   0.0000  0.0000  0.4129  0.6990   0.4129  0.0000  0.0000
I need to get a score which will tell me:
- how much the subject "final" is close to "Crowned".
What metric should I use?
////////////////////////////////////////////////////////////////
Suppose you have 5 texts:
After school, Kamal took the girls to the old house. It was very old and very dirty too. There was rubbish everywhere. The windows were broken and the walls were damp. It was scary. (1)
Amy didn’t like it. There were paintings of zombies and skeletons on the walls. “We’re going to take photos for the school art competition,” said Kamal. Amy didn’t like it but she didn’t say anything. (2)
“Where’s Grant?” asked Tara. “Er, he’s buying more paint.” Kamal looked away quickly. Tara thought he looked suspicious. “It’s getting dark, can we go now?” said Amy. She didn’t like zombies. (3)
Then, they heard a loud noise coming from a cupboard in the corner of the room. “What’s that?” Amy was frightened. “I didn’t hear anything,” said Kamal. Something was making strange noises. (4)
“What do you mean? There’s nothing there!” Kamal was trying not to smile. Suddenly the door opened with a bang and a zombie appeared, shouting and moving its arms. Amy screamed and covered her eyes. (5)
Each text has labels:
1st text - school, house, scary
2nd text - zombies, paint
3rd text - zombies, dark, paint
4th text - noise, frightened
5th text - zombie, screamed
The 1st task is to find the correlation between the texts. It seems #MarkH has already pointed me in the right direction (cosine similarity).
The 2nd task is to find the correlation between the labels. You can see that almost all texts carry a "zombie" label. Also, the 2nd and the 3rd texts share two labels: "zombies" and "paint".
Suppose we have 10,000 texts. What is the chance that these labels describe the same thing, so that we can delete one label (paint) and keep only the other (zombie)? It is like a contribution to the variation.
Does removing some labels affect the result much? Can we remove or merge some labels?
I think you can use cosine similarity which is quite common for this kind of task.
from sklearn.metrics.pairwise import cosine_similarity
msgs_CosSim = pd.DataFrame(cosine_similarity(features, features))
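For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch of that formula follows, applied to the "final" and "Crowned" rows of the TF-IDF matrix from the question. The values are rounded to four decimals here, and the function name cosine is an assumption for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rows of the TF-IDF matrix from the question, rounded to four decimals.
final = [0.3334, 0.0, 0.2589, 0.0, 0.4384, 0.3334, 0.2589, 0.0, 0.2589, 0.4384, 0.4384]
crowned = [0.3710, 0.4878, 0.2881, 0.4878, 0.0, 0.3710, 0.2881, 0.0, 0.2881, 0.0, 0.0]

sim = cosine(final, crowned)
print(round(sim, 2))
```

A value near 1 means the two texts use much the same weighted vocabulary, and a value near 0 means they share almost none.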
Correlation measures how close the features are to each other, but you say you want to compute it for the class labels. That doesn't make sense, because if two texts have the same features they must have the same class label. Please share the ultimate problem you are trying to solve.