How can I calculate the correlation between classes of texts?
E.g., I have 3 texts:
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
"Chennai super kings returns"]
subjects = ["final", "Crowned", "returns"]
So, each text has a label (class), which makes this close to a text classification problem. But I need to calculate a measure of "difference".
I can compute TF-IDF and get the matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
"Chennai super kings returns"]
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
res = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())
            2018  champions  chennai  crowned  final    ipl  kings  returns  super    the    won
"final"    0.333      0.000    0.259    0.000  0.438  0.333  0.259    0.000  0.259  0.438  0.438
"Crowned"  0.371      0.488    0.288    0.488  0.000  0.371  0.288    0.000  0.288  0.000  0.000
"returns"  0.000      0.000    0.413    0.000  0.000  0.000  0.413    0.699  0.413  0.000  0.000
I need to get a score which will tell me:
- how close the subject "final" is to "Crowned".
What metric should I use?
////////////////////////////////////////////////////////////////
Suppose you have 5 texts:
After school, Kamal took the girls to the old house. It was very old and very dirty too. There was rubbish everywhere. The windows were broken and the walls were damp. It was scary. (1)
Amy didn’t like it. There were paintings of zombies and skeletons on the walls. “We’re going to take photos for the school art competition,” said Kamal. Amy didn’t like it but she didn’t say anything. (2)
“Where’s Grant?” asked Tara. “Er, he’s buying more paint.” Kamal looked away quickly. Tara thought he looked suspicious. “It’s getting dark, can we go now?” said Amy. She didn’t like zombies. (3)
Then, they heard a loud noise coming from a cupboard in the corner of the room. “What’s that?” Amy was frightened. “I didn’t hear anything,” said Kamal. Something was making strange noises. (4)
“What do you mean? There’s nothing there!” Kamal was trying not to smile. Suddenly the door opened with a bang and a zombie appeared, shouting and moving its arms. Amy screamed and covered her eyes. (5)
Each text has labels:
1st text - school, house, scary
2nd text - zombies, paint
3rd text - zombies, dark, paint
4th text - noise, frightened
5th text - zombie, screamed
The 1st task is to find the correlation between texts. It seems #MarkH has already pointed me in the right direction (cosine similarity).
The 2nd task is to find the correlation between labels. You can see that almost all labels are "zombie". Also, the 2nd and 3rd sentences have 2 identical labels: "zombies, paint".
Suppose we have 10,000 texts. What is the chance that these labels describe the same thing, so that we can delete one of the labels (paint) and use only one (zombie)? So, it's like a contribution to the variation.
Does it affect the result too much if we remove some labels? Can we remove/merge some labels?
I think you can use cosine similarity, which is quite common for this kind of task.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

msgs_CosSim = pd.DataFrame(cosine_similarity(features, features))
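Putting this together with the question's data gives a self-contained sketch; labeling the rows and columns with the subjects list (my addition, not in the original snippet) makes the pairwise scores easy to read off:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

texts = ["Chennai Super Kings won the final 2018 IPL",
         "Chennai Super Kings Crowned IPL 2018 Champions",
         "Chennai super kings returns"]
subjects = ["final", "Crowned", "returns"]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)

# Entry (i, j) is the cosine similarity between text i and text j,
# i.e. how close subject i is to subject j in TF-IDF space.
sim = pd.DataFrame(cosine_similarity(features),
                   index=subjects, columns=subjects)
print(sim)
```

The score for ("final", "Crowned") is the single number the question asks for: 1.0 means identical direction in TF-IDF space, 0.0 means no shared vocabulary.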
The concept of correlation finds the closeness between features, but you are saying you want to do it for the class labels, which doesn't make sense: if the features are the same, then they must have the same class label. Please share the ultimate problem you are trying to solve.
////////////////////////////////////////////////////////////////
I have a text and I would like to be able to add certain words at specific positions in it. To do this, I cut my text into letters (not words). I can do the work, but the problem is that the word I want to add cuts off another word.
My input (the numbers are not good because the text is much longer, but this way you get an idea):
{"text":The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection. ,"label":[[328,347,"Article 3 - Violated"],[2269,2323,"Article 3 - Violated"],[2791,2843,"Article 3 - Violated"],[2947,2988,"Article 3 - Violated"],[3099,3110,"Article 3 - Violated"],[3603,3615,"Article 3 - Violated"],[3702,3756,"Article 3 - Violated"],[4793,4923,"Article 3 - Violated"],[5185,5196,"Article 3 - Violated"],[8111,8198,"Article 3 - Respected"],[8510,8521,"Article 3 - Respected"],[8575,8601,"Article 3 - Respected"],[8965,9009,"Article 3 - Respected"],
And I would like to have this:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard</Article 3 - Violated>. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
but I get this. It cuts the words.
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable stan <Article 3 - Violated/>dard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
My code:
text = list(texte["text"].strip())
label = texte["label"]
for i in label:
    debut = i[0]
    fin = i[1]
    nom = i[2]
    for element in range(len(text)):
        if element == debut:
            text.insert(element, "<" + nom + ">")
        if element == fin:
            a = element + 1
            text.insert(element + 1, "<" + nom + "/>")
string = ""
for element in text:
    string += element
print(string)
Your approach seems a bit odd: (1) Why are you making a character list out of the string? (2) The loop for element in range(len(text)): ... seems completely unnecessary; why are you not directly using debut and fin?
Problem with your approach: by inserting items into the list text, the position numbers in the label-lists become invalid.
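A tiny illustration of why the positions go stale (toy string, not the OP's data): after one insert, every later index points one position too early.

```python
chars = list("abcdef")          # ['a', 'b', 'c', 'd', 'e', 'f']
chars.insert(2, "<tag>")        # insert before original position 2

# The tag landed where we wanted, but every character after it
# has shifted right by one, so the original positions are now wrong.
print(chars)                    # ['a', 'b', '<tag>', 'c', 'd', 'e', 'f']
print(chars[3])                 # 'c' -- it used to be at index 2
```

With several labels, each insert shifts all later positions further, which is exactly why the closing tag lands mid-word in the OP's output.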
Here's an alternative approach. I'm using the following data as a sample:
texte = {
"text": "The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker. There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. The sanitary conditions were below any reasonable standard. In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.",
"label": [[262, 375, "Article 3 - Violated"], [637, 695, "Article 3 - Violated"]]
}
The numbers in texte["label"] mark the start and end of the following two passages:
The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.
The sanitary conditions were below any reasonable standard.
The first number in a label-list is the position of the start of the passage, the second number is the first position after the last character of the passage. But I'm not sure about that, I haven't seen any related information in the question.
Now this
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
results in the following new_text:
The applicant's cells were overcrowded. The detainees had to take turns to sleep because there was usually one sleeping place for two to three of them. There was almost no light in the cells because of the metal shutters on the windows, as well as no fresh air. <Article 3 - Violated>The lack of air was aggravated by the detainees' smoking and the applicant, a non-smoker, became a passive smoker.<Article 3 - Violated/> There was one hour of daily exercise. The applicant's eyesight deteriorated and he developed respiratory problems. In summer the average air temperature was around thirty degrees which, combined with the high humidity level, caused skin conditions to develop. <Article 3 - Violated>The sanitary conditions were below any reasonable standard.<Article 3 - Violated/> In particular, the cells were supplied with water for only one or two hours a day and on some days there was no water supply at all. The lack of water caused intestinal problems and in 1999 the administration had to announce quarantine in that connection.
If the second number in a label-list is the position of the last character of the passage (instead of the position of the first character after the passage), then the following should produce the same new_text:
text = texte["text"]
new_text = ""
last_fin = 0
for debut, fin, nom in texte["label"]:
    fin += 1
    new_text += text[last_fin:debut] + "<" + nom + ">" + text[debut:fin] + "<" + nom + "/>"
    last_fin = fin
new_text += text[last_fin:]
You can use the .replace method:
string.replace(oldvalue, newvalue, count)
In your case you can replace the "applicant's" string with:
text.replace("applicant's", "Name", count_that_times_you_want_to_replace)
You can find more info here:
https://www.geeksforgeeks.org/python-string-replace/
I would like to count unique words with a function. The unique words I want to count are words that appear only once; that's why I used a set here. I put the error below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw
unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
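This TypeError typically means re.findall received a list rather than a single string (the bag-of-words model below returns corpus_text_train as a list of document strings). A minimal sketch of a fix, assuming you want distinct words across all documents, is to join the list into one string first:

```python
import re

def unique_words(docs):
    # re.findall needs a string; if we got a list of document
    # strings, join them into one text before matching.
    if isinstance(docs, list):
        docs = " ".join(docs)
    words = re.findall(r"\w+", docs)
    return len(set(words))

print(unique_words(["hello world", "hello again"]))  # 3 distinct words
```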
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1
    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key]/total
    return word2count, corpus_text_train
test_dict,test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", which in the OP's case would count people and people, as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the , and ", you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to add more characters to be replaced.
s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
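One caveat about both snippets above: set(...) counts distinct words, while the question defines unique words as those appearing exactly once. A sketch with collections.Counter for that reading (the function name is mine):

```python
from collections import Counter

def words_appearing_once(text):
    # Count each word, then keep only the words seen exactly once.
    counts = Counter(text.split())
    return [w for w, c in counts.items() if c == 1]

print(words_appearing_once("aa aa bb cc"))  # ['bb', 'cc']
```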
I have a function which gives statements of commentary. The problem is that they contain <br> and </br> tags; I want to arrange these on new lines.
from pycricbuzz import Cricbuzz

c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col = c.commentary(match['id'])
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str
            commentary1.append(current_game3)
            current_game3 = {}
print(commentary1)
when I print this I get output as below
{'commentary2': 'Preview by Tristan Lavalette<br/><br/>The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.<br/><br/>In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.<br/><br/>Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.<br/><br/>Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.<br/><br/>Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. 
Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.<br/><br/>If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.<br/><br/>Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.<br/><br/>New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonisingly lost consecutive matches.<br/><br/>Despite their struggles, New Zealand know one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.<br/><br/>With all to play for, the stage is set for a memorably entertaining finish for this inaugural tri-series tournament.<br/><br/>When: Wednesday, February 21, 2018; 7PM local, 11.30AM IST<br/><br/>Where: Eden Park, Auckland<br/><br/>What to expect: There is a chance of showers intervening. 
Once again, there should be plenty of runs on offer on the small ground although the pitch is tipped to produce some turn.<br/><br/>Team News<br/><br/>New Zealand: Despite agonisingly losing their last couple of games, New Zealand are set to stick with the same line-up.<br/><br/>Probable XI: Martin Guptill, Colin Munro, Kane Williamson (c), Colin de Grandhomme, Mark Chapman, Ross Taylor, Tim Seifert (wk), Mitchell Santner, Tim Southee, Ish Sodhi, Trent Boult<br/><br/>Australia: Zampa could be in line to play with the pitch possibly providing some turn. However, a red hot Australia may not want to disturb a winning combination.<br/><br/>Probable XI: David Warner, D\'Arcy Short, Chris Lynn, Glenn Maxwell, Aaron Finch, Marcus Stoinis, Alex Carey (wk), Ashton Agar, Kane Richardson, Andrew Tye, Billy Stanlake<br/><br/>Did you know<br/><br/>- Australia\'s greatest winning streak in T20Is is their six straight victories at the 2010 World T20 before losing the final to England<br/><br/>- David Warner has won 8 of 9 as T20 captain. The best record overall - minimum 10 matches - is Pakistan\'s Sarfraz Ahmed\'s 14 wins from 17 matches<br/><br/>- New Zealand have lost their last four T20I matches at Eden Park<br/><br/>What they said<br/><br/>"We\'ve had three pretty close T20 games, Australia batting exceptionally well at Eden Park and chasing down a score that was pretty formidable. But you\'ve got to be in the final and give yourself a chance" - Mike Hesson, the New Zealand coach.<br/><br/>"You\'ve just got to find a way to get one or two wickets in the first six (overs), it\'s as simple as that" - David Warner, the Australia captain, said about bowling at the tiny Eden Park.'},
I want to arrange like this
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.
Australia's hard-hitting batting has relished chasing in every match and New Zealand's brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia's powerful batting will be confident no matter the situation of the match.
Of course, the beleaguered bowlers aren't quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.
Australia's attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday's encounter - which is set to be helpful for spin.
If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia's fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.
Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.
New Zealand's eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonizingly lost consecutive matches.
Despite their struggles, New Zealand knows one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.
Assuming you want to print each commentary dictionary in the commentary1 list, you want to replace the
print(commentary1)
line with
print("\n".join([" ".join(i.values()).replace("<br/><br/>", "\n") for i in commentary1]))
That will take all the dictionaries in the commentary1 list, take all of their values, join them with a space, replace the <br/><br/> tags with \n, then join the results with newlines.
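To see the expression in isolation, here is a sketch with made-up dictionaries (the data is hypothetical, not real pycricbuzz output):

```python
# Two fake commentary dictionaries in the same shape as the question's.
commentary1 = [{"commentary2": "First line<br/><br/>Second line"},
               {"commentary2": "Another comment"}]

# Join each dict's values with spaces, turn <br/><br/> into newlines,
# then join the per-dict strings with newlines.
flattened = "\n".join(" ".join(i.values()).replace("<br/><br/>", "\n")
                      for i in commentary1)
print(flattened)
```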
Use this:
from pycricbuzz import Cricbuzz

c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col = c.commentary(match['id'])
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str.replace('<br/>', '\n')
            commentary1.append(current_game3)
            current_game3 = {}

for comment in commentary1:
    print(comment['commentary2'])
Partial Output:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is
set to finish with a bang at the tiny Eden Park on Wednesday (February
21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday
and produced a run-fest with the rampaging Australia successfully
chasing down a record target of 244. The unbeaten Australia head into
the final as favourites after a dazzling campaign from their new look
side brimming with in-form Big Bash League players and headed by
skipper David Warner, whose inventive captaincy has been
inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1
T20 ranking having started the tournament a lowly No.7. A victory
would be their sixth straight in the format equalling their best ever
streak.
I have a dataframe column with documents like
38909 Hotel is an old style Red Roof and has not bee...
38913 I will never ever stay at this Hotel again. I ...
38914 After being on a bus for -- hours and finally ...
38918 We were excited about our stay at the Blu Aqua...
38922 This hotel has a great location if you want to...
Name: Description, dtype: object
I have a bag of words like keys = ['Hotel','old','finally'] but the actual length of keys = 44312
Currently I'm using
df.apply(lambda x : sum([i in x for i in keys ]))
Which gives the following output based on sample keys
38909 2
38913 2
38914 3
38918 0
38922 1
Name: Description, dtype: int64
When I apply this on actual data for just 100 rows timeit gives
1 loop, best of 3: 5.98 s per loop
and I have 50000 rows. Is there a faster way of doing the same in nltk or pandas?
EDIT:
In case you are looking for the document array:
array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
"After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
"We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)
The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:
keyset = frozenset(keys)
df.apply(lambda x : len(keyset.intersection(x.split())))
Differences/limitation:
In your version a word is counted even if it is contained as a substring of a word in the document. For example, had your keys contained the word tyl, it would be counted due to the occurrence of "style" in your first document.
My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the document (or postprocessing the result of the split()) with a function that removes the punctuation.
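One way to sketch that preprocessing, assuming str.translate with string.punctuation is acceptable for your documents (the doc and keys below are abbreviated from the question):

```python
import string

keyset = frozenset(['Hotel', 'old', 'finally'])
doc = "After being on a bus for -- hours and finally arriving at the Hotel, again."

# Strip punctuation so tokens like "Hotel," match the key "Hotel".
table = str.maketrans("", "", string.punctuation)
tokens = doc.translate(table).split()
matched = len(keyset.intersection(tokens))
print(matched)  # 2 ("finally" and "Hotel")
```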
It seems you can just use np.char.count -
[np.count_nonzero(np.char.count(i, keys)) for i in arr]
Might be better to feed a boolean array for counting -
[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
If you need to check only whether the values of the list are present:
from numpy.core.defchararray import find
v = df['col'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print (a)
[2 1 1 0 0]
Or:
df = pd.concat([df['col'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print (df)
38909 2
38913 1
38914 1
38918 0
38922 0
dtype: int64
I'm trying to come up with a parser for football plays. I use the term "natural language" here very loosely, so please bear with me, as I know little to nothing about this field.
Here are some examples of what I'm working with
(Format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):
04:39|4th and 20#NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10#NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5#NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10#NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6#NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10#NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9#NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
As of now, I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down&distance, offensive team), along with some scripts that go and get this data and sanitize it into the format seen above. A single line gets turned into a "Play" object to be stored into a database.
The tough part here (for me at least) is parsing the description of the play. Here is some information that I would like to extract from that string:
Example string:
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
The logic that I had for my initial parser went something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
# Who scored? off or def?
# TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
# def or off?
# turnover?
# INT, fumble, to on downs?
# off play makers
# def play makers
The descriptions can get pretty hairy (multiple fumbles & recoveries with penalties, etc) and I was wondering if I could take advantage of some NLP modules out there. Chances are I'm going to spend a few days on a dumb/static state-machine like parser instead but if anyone has suggestions on how to approach it using NLP techniques I'd like to hear about them.
I think pyparsing would be very useful here.
Your input text looks very regular (unlike real natural language), and pyparsing is great at this kind of thing. You should have a look at it.
For example to parse the following sentences:
Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.
You would define a parser for such sentences with something like this (using pyparsing's Literal, oneOf and Word; check the docs for details):
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')
action = oneOf("punts rush").setResultsName('action') + Optional(Literal("to the") + oneOf("left right"))
distance = Word(nums).setResultsName('distance') + Literal("yards")
pattern = name + action + Literal("for") + distance + Literal("to") + Optional(Literal("the")) + Word(alphas + nums)
And pyparsing would break strings using this pattern. It will also return a dictionary with the items name, action and distance - extracted from the sentence.
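A self-contained, runnable sketch along those lines (grammar and result names are illustrative, not a complete play grammar):

```python
from pyparsing import Group, Literal, Optional, Word, alphas, nums, oneOf

name = Group(Word(alphas) + Word(alphas))("name")
action = oneOf("punts rush")("action") + Optional(Literal("to the") + oneOf("left right")("direction"))
distance = Word(nums)("distance") + Literal("yards")
pattern = name + action + Literal("for") + distance + Literal("to") + Optional(Literal("the")) + Word(alphas + nums)("spot")

for line in ["Mat McBriar punts for 32 yards to NYJ14.",
             "Mark Sanchez rush to the right for 3 yards to the NYJ24."]:
    r = pattern.parseString(line)
    print(r["action"], r["distance"], r["spot"])
```

This prints the action, yardage and field spot for each sentence; each additional play type (pass, kickoff, penalty, ...) would need its own alternative added to the grammar.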
I imagine pyparsing would work pretty well, but rule-based systems are pretty brittle. So, if you go beyond football, you might run into some trouble.
I think a better solution for this case would be a part-of-speech tagger and a lexicon (read: dictionary) of player names, positions and other sports terminology. Dump it into your favorite machine learning tool, figure out good features and I think it'd do pretty well.
NLTK is a good place to start for NLP. Unfortunately, the field isn't very developed, and there isn't a tool out there that's like bam, problem solved, easy cheesy.
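For what it's worth, the lexicon part of this suggestion can be prototyped without any NLP machinery at all. A toy sketch (the player and action lists below are made-up examples; a real system would load them from rosters and a verb list):

```python
# Toy lexicons keyed off the sample plays in the question.
players = {"Mark Sanchez", "Shonn Greene", "Mike Jenkins"}
actions = {"pass", "rush", "punts"}

def extract(desc):
    """Return (players mentioned, actions mentioned) for one play description."""
    found_players = sorted(p for p in players if p in desc)
    # Pad with spaces so we match whole words only, not substrings:
    found_actions = sorted(a for a in actions if f" {a} " in f" {desc} ")
    return found_players, found_actions

print(extract("Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."))
```

A tagger and real lexicons would make this robust to unseen names, but the shape of the solution stays the same: match known entities first, then classify the play from the remaining verbs.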