I'm trying to come up with a parser for football plays. I use the term "natural language" here very loosely so please bear with me as I know little to nothing about this field.
Here are some examples of what I'm working with
(Format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):
04:39|4th and 20#NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10#NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5#NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10#NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6#NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10#NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9#NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
As of now, I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down&distance, offensive team), along with some scripts that fetch this data and sanitize it into the format seen above. A single line gets turned into a "Play" object to be stored in a database.
The tough part here (for me at least) is parsing the description of the play. Here is some information that I would like to extract from that string:
Example string:
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
The logic that I had for my initial parser went something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
# Who scored? off or def?
# TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
# def or off?
# turnover?
# INT, fumble, to on downs?
# off play makers
# def play makers
The descriptions can get pretty hairy (multiple fumbles & recoveries with penalties, etc) and I was wondering if I could take advantage of some NLP modules out there. Chances are I'm going to spend a few days on a dumb/static state-machine like parser instead but if anyone has suggestions on how to approach it using NLP techniques I'd like to hear about them.
I think pyparsing would be very useful here.
Your input text looks very regular (unlike real natural language), and pyparsing is great at this kind of thing. You should have a look at it.
For example to parse the following sentences:
Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.
You would define a grammar with something like this (check the docs for exact syntax):
from pyparsing import Group, Optional, Suppress, Word, alphas, alphanums, nums, oneOf
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')
action = oneOf("punts rush").setResultsName('action') + Optional(Suppress("to") + Suppress("the") + oneOf("left right"))
distance = Word(nums).setResultsName("distance") + oneOf("yard yards")
pattern = name + action + Suppress("for") + distance + Suppress("to") + Optional(Suppress("the")) + Word(alphanums)
pyparsing would then break strings apart using this pattern. The parse results also let you look up the items name, action and distance by name, as extracted from the sentence.
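For comparison, here is a minimal sketch of the same idea using only the standard-library re module rather than pyparsing, so it runs without installing anything. The pattern, the parse_play helper and the returned keys are assumptions made for illustration. It only covers the simple rush/pass/punt sentences shown in the question, not fumbles, penalties or returns.

```python
import re

# Two capitalised words, e.g. "Mark Sanchez" or "LaDainian Tomlinson".
NAME = r"[A-Z][A-Za-z'-]+ [A-Z][A-Za-z'-]+"

# Hypothetical pattern for the simple sentences in the question.
PLAY = re.compile(
    rf"(?P<player>{NAME}) (?P<action>punts|rush|pass)"
    rf"(?: (?:up|to) the (?P<direction>middle|left|right))?"
    rf"(?: to (?P<target>{NAME}))?"
    rf" for (?P<yards>\d+) yards? to (?:the )?(?P<spot>[A-Z]*\d+)"
)
TACKLE = re.compile(rf"Tackled by (?P<tackler>{NAME})")


def parse_play(description):
    """Return a dict of extracted fields, or None if the sentence is not recognised."""
    m = PLAY.match(description)
    if m is None:
        return None
    playmakers = [m["player"]]
    if m["target"]:
        playmakers.append(m["target"])
    tackle = TACKLE.search(description)
    if tackle:
        playmakers.append(tackle["tackler"])
    return {
        "passing": m["action"] == "pass",
        "rushing": m["action"] == "rush",
        "punt": m["action"] == "punts",
        "direction": m["direction"],
        "yardage_diff": int(m["yards"]),
        "spot": m["spot"],
        "fumble": "FUMBLE" in description,
        "playmakers": playmakers,
    }


print(parse_play("Mark Sanchez pass to the left to Shonn Greene for 7 yards "
                 "to the NYJ44. Tackled by Mike Jenkins."))
```

Anything the pattern does not recognise comes back as None, which makes it easy to log the hairy descriptions and extend the grammar one case at a time.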
I imagine pyparsing would work pretty well, but rule-based systems are pretty brittle. So, if you go beyond football, you might run into some trouble.
I think a better solution for this case would be a part of speech tagger and a lexicon (read dictionary) of player names, positions and other sport terminology. Dump it into your favorite machine learning tool, figure out good features and I think it'd do pretty well.
NLTK is a good place to start for NLP. Unfortunately, the field isn't very developed and there isn't a tool out there that's like bam, problem solved, easy cheesy.
How can I calculate correlation between classes of the texts?
E.g., I have 3 texts:
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
"Chennai super kings returns"]
subjects = ["final", "Crowned",
"returns"]
So, each text has a label (class). So, it is close to the text classification problem. But I need to calculate the measure of "difference".
I can count Tfidf and get the matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
"Chennai super kings returns"]
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
res = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())
(values rounded to four decimal places)
           2018    champions  chennai  crowned  final   ipl     kings   returns  super   the     won
"final"    0.3334  0.0000     0.2589   0.0000   0.4384  0.3334  0.2589  0.0000   0.2589  0.4384  0.4384
"Crowned"  0.3710  0.4878     0.2881   0.4878   0.0000  0.3710  0.2881  0.0000   0.2881  0.0000  0.0000
"returns"  0.0000  0.0000     0.4129   0.0000   0.0000  0.0000  0.4129  0.6990   0.4129  0.0000  0.0000
I need to get a score which will tell me:
- how much the subject "final" is close to "Crowned".
What metric should I use?
////////////////////////////////////////////////////////////////
Suppose you have 5 texts:
After school, Kamal took the girls to the old house. It was very old and very dirty too. There was rubbish everywhere. The windows were broken and the walls were damp. It was scary. (1)
Amy didn’t like it. There were paintings of zombies and skeletons on the walls. “We’re going to take photos for the school art competition,” said Kamal. Amy didn’t like it but she didn’t say anything. (2)
“Where’s Grant?” asked Tara. “Er, he’s buying more paint.” Kamal looked away quickly. Tara thought he looked suspicious. “It’s getting dark, can we go now?” said Amy. She didn’t like zombies. (3)
Then, they heard a loud noise coming from a cupboard in the corner of the room. “What’s that?” Amy was frightened. “I didn’t hear anything,” said Kamal. Something was making strange noises. (4)
“What do you mean? There’s nothing there!” Kamal was trying not to smile. Suddenly the door opened with a bang and a zombie appeared, shouting and moving its arms. Amy screamed and covered her eyes. (5)
Each text has labels:
1st text - school, house, scary
2nd text - zombies, paint
3rd text - zombies, dark, paint
4th text - noise, frightened
5th text - zombie, screamed
The 1st task is to find the correlation between texts. It seems MarkH has already pointed me in the right direction (cosine similarity).
The 2nd task is to find the correlation between labels. You can see that "zombie" appears in most of the label sets. Also, the 3rd text and the 2nd text share 2 labels: "zombies, paint".
Suppose we have 10000 texts. What is the chance that these labels describe the same thing, so that we can delete one label (paint) and use only one (zombie)? So it's like a contribution to the variation.
Does it matter much if we remove some labels? Can we remove or merge some labels?
I think you can use cosine similarity which is quite common for this kind of task.
from sklearn.metrics.pairwise import cosine_similarity
msgs_CosSim = pd.DataFrame(cosine_similarity(features, features))
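To make the metric concrete, here is a small sketch that computes cosine similarity by hand with only the standard library. The cosine helper is a hypothetical name, and the vectors are the TF-IDF rows from the question rounded to four decimals. It should agree closely with what sklearn's cosine_similarity returns for the same rows.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# TF-IDF rows copied from the question, rounded to four decimals.
rows = {
    "final":   [0.3334, 0.0, 0.2589, 0.0, 0.4384, 0.3334, 0.2589, 0.0, 0.2589, 0.4384, 0.4384],
    "Crowned": [0.3710, 0.4878, 0.2881, 0.4878, 0.0, 0.3710, 0.2881, 0.0, 0.2881, 0.0, 0.0],
    "returns": [0.0, 0.0, 0.4129, 0.0, 0.0, 0.0, 0.4129, 0.6990, 0.4129, 0.0, 0.0],
}

for a, b in combinations(rows, 2):
    print(a, b, round(cosine(rows[a], rows[b]), 3))
```

A value near 1 means the two texts use the same words with similar weights, and a value near 0 means they share almost no vocabulary.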
The concept of correlation measures the closeness between features, but you are saying you want to do it for the class labels. That doesn't make sense, because if the features are the same then they must have the same class label. Please share the ultimate problem you are trying to solve.
It's supposed to count the number of lines and the number of characters in each line.
I cannot add any more variables to my display function, as the professor said to use two.
I've gotten it to count the characters in the line correctly if I change zXX = zXX + 1 to zXX = w, but if I do that it will not count the number of lines. If someone could help, I'd greatly appreciate it.
Currently I have:
def display(x, y):
    y = str(y)
    varx = str(len(y))
    vary = y + "#" + x + "#" + varx
    return vary.rjust(3)

def main():
    script = '''Grandson|Cough, cough, cough. Cough, cough, cough. {Grandson is on the bed, playing video game.} Mother|{Enters.} Hi, honey. Grandson|Hi, Mom. Mother|{Kisses son and feels his forehead.} You feeling any better? Grandson|A little bit. Mother|Guess what? Grandson|What? Mother|Your Grandfather's here. {Opens curtains.} Grandson|Mom, can't you tell him I'm sick? Mother|You're sick? That's why he's here. Grandson|He'll pinch my cheek. I hate that. Mother|Maybe he won't. Grandfather|{Entering with a flourish.} Heyyyy!! How's the sickie? Heh? {Pinches boy's cheek. Boy looks at mother accusingly.} Mother|I think I'll leave you two pals alone. {Exits.} Grandfather|I brought you a special present. Grandson|What is it? Grandfather|Open it up. Grandson|{Opens the package. Disappointed.} A book? Grandfather|That's right. When I was your age, television was called books. And this is a special book. It was the book my father used to read to me when I was sick, {takes book} and I used to read it to your father. And today I'm gonna read it to you. Grandson|Has it got any sports in it? Grandfather|Are you kidding? Fencing, fighting, torture, revenge, giants, monsters, chases, escapes, true love, miracles... Grandson|Doesn't sound too bad. I'll try to stay awake. {Turns off TV.} Grandfather|Oh, well, thank you very much, very nice of you. Your vote of confidence is overwhelming. All right. {Puts glasses on.} The Princess Bride, by S. Morgenstern. Chapter One. Buttercup was raised on a small farm in the country of Florin.'''
    zX = script.split('\n')
    #print(zX)
    zXX = 0
    for w in zX:
        #print(zX[zXX:(zXX + 1)])
        zXXXX = w.split('|')
        #print(zXXXX)
        zXXX = zXXXX[0].upper() + " " + zXXXX[1]
        #print(zXXX)
        zXX = zXX + 1
        print(display(zXXX, zXX))

main()
The output is:
1#GRANDSON Cough, cough, cough. Cough, cough, cough. {Grandson is on the bed, playing video game.}#1
2#MOTHER {Enters.} Hi, honey.#1
3#GRANDSON Hi, Mom.#1
4#MOTHER {Kisses son and feels his forehead.} You feeling any better?#1
5#GRANDSON A little bit.#1
6#MOTHER Guess what?#1
7#GRANDSON What?#1
8#MOTHER Your Grandfather's here. {Opens curtains.}#1
9#GRANDSON Mom, can't you tell him I'm sick?#1
10#MOTHER You're sick? That's why he's here.#2
11#GRANDSON He'll pinch my cheek. I hate that.#2
12#MOTHER Maybe he won't.#2
13#GRANDFATHER {Entering with a flourish.} Heyyyy!! How's the sickie? Heh?
{Pinches boy's cheek. Boy looks at mother accusingly.}#2
14#MOTHER I think I'll leave you two pals alone. {Exits.}#2
15#GRANDFATHER I brought you a special present.#2
16#GRANDSON What is it?#2
17#GRANDFATHER Open it up.#2
18#GRANDSON {Opens the package. Disappointed.} A book?#2
19#GRANDFATHER That's right. When I was your age, television was called books. And this is a special book. It was the book my father used to read to me when I was sick, {takes book} and I used to read it to your father. And today I'm gonna read it to you.#2
20#GRANDSON Has it got any sports in it?#2
21#GRANDFATHER Are you kidding? Fencing, fighting, torture, revenge, giants, monsters, chases, escapes, true love, miracles...#2
22#GRANDSON Doesn't sound too bad. I'll try to stay awake. {Turns off TV.}#2
23#GRANDFATHER Oh, well, thank you very much, very nice of you. Your vote of confidence is overwhelming. All right. {Puts glasses on.} The Princess Bride, by S. Morgenstern. Chapter One. Buttercup was raised on a small farm in the country of Florin.#2
Once I got past the formatting, and variable names, and that you didn't include the line breaks in 'script', I'm pretty sure that the issue is that you have:
varx = str(len(y))
Instead of:
varx = str(len(x))
Though that is based on my interpretation of what you provided.
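A minimal sketch of the suggested fix, assuming the goal is "line number#text#character count". The two-line sample list is made up for illustration, and enumerate is used here in place of the manual zXX counter.

```python
def display(x, y):
    y = str(y)
    varx = str(len(x))  # length of the text, not of the line number
    vary = y + "#" + x + "#" + varx
    return vary.rjust(3)


# Hypothetical two-line sample in the same "Speaker|text" format.
sample = ["Grandson|Hi, Mom.", "Mother|Guess what?"]
for number, line in enumerate(sample, start=1):
    speaker, text = line.split("|", 1)
    print(display(speaker.upper() + " " + text, number))
```

The display function still uses only the two variables varx and vary, so it stays within the constraint mentioned in the question.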
I'm a beginner in Python and I don't know what the following line is used for. Is it a Boolean value? (A for and an if in one statement?) If anybody knows, please explain. Thanks.
taken_match = [couple for couple in tentative_engagements if woman in couple]
###
for woman in preferred_rankings_men[man]:
    #Boolean for whether woman is taken or not
    taken_match = [couple for couple in tentative_engagements if woman in couple]
    if (len(taken_match) == 0):
        #tentatively engage the man and woman
        tentative_engagements.append([man, woman])
        free_men.remove(man)
        print('%s is no longer a free man and is now tentatively engaged to %s'%(man, woman))
        break
    elif (len(taken_match) > 0):
        ...
Python has some pretty sweet syntax to create lists quickly. What you're seeing here is list comprehension-
taken_match = [couple for couple in tentative_engagements if woman in couple]
taken_match will be a list of all the couples where the woman is in the couple- basically, this filters out all the couples where the woman is NOT in the couple.
If we were to write this out without list comprehension:
taken_match = []
for couple in tentative_engagements:
    if woman in couple:
        taken_match.append(couple)
As you can see.. list comprehension is way cooler :)
After that line, you're checking if the length of taken_match is 0. If it is, no couples were found with that woman in them, so we add an engagement between the man and the woman, and then move on. If you have any other lines you didn't understand, feel free to ask!
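Here is a small runnable sketch of the equivalence described above. The names in the sample data are made up. The any() line at the end is an optional alternative, not something from the original code, for when only a yes/no answer is needed.

```python
# Hypothetical sample data: each couple is a [man, woman] pair.
tentative_engagements = [["Bob", "Alice"], ["Carl", "Dana"]]
woman = "Alice"

# List comprehension: keep only the couples that contain the woman.
taken_match = [couple for couple in tentative_engagements if woman in couple]

# The same filter written as an explicit loop.
taken_match_loop = []
for couple in tentative_engagements:
    if woman in couple:
        taken_match_loop.append(couple)

# If only a true/false answer is needed, any() avoids building a list.
is_taken = any(woman in couple for couple in tentative_engagements)

print(taken_match)
print(taken_match == taken_match_loop)
print(is_taken)
```

So taken_match itself is a list, not a Boolean. The code in the question turns it into a yes/no decision by checking its length.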
New programmer, working in python 2.7.
With this code, I get a syntax error on the line 'if partychoice = R', stating that the '=' is invalid syntax. Why won't it let me use the variable like this?
Also, I'm sure there are tons of other errors, but I have to start somewhere.
print "Welcome to 'Pass the Bill', where you will take on the role of a professional lobbyist trying to guide a bill through Congress and get it signed into law by the President!"
start = raw_input(str ('Press Y to continue:'))
print 'Great, lets get started!'
partychoice = raw_input (str("Would you like to be a Republican or Democrat? Type 'R' for Republican or 'D' for Democrat."))
if partychoice = R:
    print 'Ah, the Grand old Party of Lincoln and Reagan. Great Choice!'
    replegchoice = raw_input (str("What type of bill what you like congress to introduce? Restrictions on abortions, lower income taxes, easier access to automatic weapons, private health plans, or stricter immigration laws? ( A = abortion restrictions, L = lower taxes, AW = automatic weapons, H = private health plans, S = stricter immigration laws'))
    if replegchoice = A or a
        print 'A controversial choice, despite support of most Republicans, you are sure to face opposition from Democrats in Congress!'
    if replegchoice = L or l
        print 'A popular idea, Republicans in Congress are sure to support this idea, as will many American voters!'
    if replegchoice = AW, aw, Aw, or AW
        print 'Rural, midwest, and small town voters will love this, as will most Republicans in Congress. Democrats and voters in urban cities will surely be less supportive.'
    if replegchoice = H or h
        print 'Eimination of Medicare, Medicaid, and Obamacare! Republicans generally like the idea of making each person responsible for paying their own health care costs'
    if replegchoice = S or s
        print 'a popular idea supported by president Trump, this is sure face strong opposition from democrats and many voters.'
Thanks everyone.
Change the line to
if partychoice == 'R':
First, you need to use two '=' characters. One '=' assigns to a variable, two compare for equality.
Second, you want to compare the variable partychoice to the string "R", so you need quotes. Without quotes, Python treats R as the name of another variable, which doesn't exist here.
You need to replace "=" with "==" in the statement:
if partychoice = R:
"=" is an assignment operator
"==" is an equality operator
e.g.
#assign something to a variable
x = 5
print x
>>5
#compare for equality
y = 6
if y == 6:
    print y
else:
    print "y is not 6"
>>6
Be sure to use informative titles in future posts, something that relates to the question you're asking.
In conditionals you need to use the comparison operator ==, not the assignment operator =.
Check this out: https://www.tutorialspoint.com/python/python_basic_operators.htm
Hope that helps!
There are several issues with the code:
You are using the assignment operator, =, instead of the comparison operator, ==.
You need to put quotes around strings when comparing them. Currently you are comparing against the variable R instead of the letter 'R'.
I removed the superfluous str() conversion inside the raw_input call. It isn't needed, as "" already defines a string.
You should simply call the .lower() method on the results; that way you only have to check the lower-case version of the string. It will save you a lot of time.
print "Welcome to 'Pass the Bill', where you will take on the role of a professional lobbyist trying to guide a bill through Congress and get it signed into law by the President!"
start = raw_input('Press Y to continue:')
print 'Great, lets get started!'
partychoice = raw_input("Would you like to be a Republican or Democrat? Type 'R' for Republican or 'D' for Democrat.").lower()
if partychoice == 'r':
    print 'Ah, the Grand old Party of Lincoln and Reagan. Great Choice!'
    # .lower() here too, so that 'A' and 'a' are treated the same
    replegchoice = raw_input("What type of bill what you like congress to introduce? Restrictions on abortions, lower income taxes, easier access to automatic weapons, private health plans, or stricter immigration laws? ( A = abortion restrictions, L = lower taxes, AW = automatic weapons, H = private health plans, S = stricter immigration laws)").lower()
    if replegchoice == 'a':
        print 'A controversial choice, despite support of most Republicans, you are sure to face opposition from Democrats in Congress!'
    if replegchoice == 'l':
        print 'A popular idea, Republicans in Congress are sure to support this idea, as will many American voters!'
    if replegchoice == 'aw':
        print 'Rural, midwest, and small town voters will love this, as will most Republicans in Congress. Democrats and voters in urban cities will surely be less supportive.'
    if replegchoice == 'h':
        print 'Eimination of Medicare, Medicaid, and Obamacare! Republicans generally like the idea of making each person responsible for paying their own health care costs'
    if replegchoice == 's':
        print 'a popular idea supported by president Trump, this is sure face strong opposition from democrats and many voters.'
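One more pitfall from the original code is worth a sketch: a test like replegchoice == 'A' or 'a' does not check both spellings. The example below is written for Python 3, not the 2.7 used in the question, so that it runs on current interpreters. The BILLS table and the pick_bill helper are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table keyed by the lower-cased answer codes.
BILLS = {
    "a": "abortion restrictions",
    "l": "lower taxes",
    "aw": "automatic weapons",
    "h": "private health plans",
    "s": "stricter immigration laws",
}


def pick_bill(raw_answer):
    """Return the bill for a typed answer, or None if the code is unknown."""
    return BILLS.get(raw_answer.strip().lower())


# The pitfall: this parses as ("zzz" == "A") or "a", and a non-empty string
# is always truthy, so the condition holds for any input.
always_true = bool("zzz" == "A" or "a")

print(pick_bill("AW"))
print(pick_bill("zzz"))
print(always_true)
```

Normalising the input once and looking it up in a dictionary replaces the chain of if statements and handles 'AW', 'aw' and 'Aw' alike.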
The title for this one was quite tricky.
I'm trying to solve a scenario,
Imagine a survey was sent out to XXXXX people, asking them what their favourite football club was.
From the responses, it's obvious that while many people favour the same club, they all "expressed" it in different ways.
For example,
For Manchester United, some variations include...
Man U
Man Utd.
Man Utd.
Manchester U
Manchester Utd
All are obviously the same club; however, with a simple technique such as exact string matching, each would be a separate result.
Now, to further complicate the scenario, let's say that because of the sheer volume of different clubs (e.g. Man City as M. City, Manchester City, etc.), all plagued by the same problem, it's impossible to manually "enter" these variants and use them to create a custom filter that converts Man U -> Manchester United, Man Utd. -> Manchester United, etc. Instead we want to automate this filter, to look for the most likely match and convert the data accordingly.
I'm trying to do this in Python (from a .csv file), but I welcome any pseudocode answers that outline a good approach to solving this.
Edit: Some additional information
This isn't working off a set list of clubs, the idea is to "cluster" the ones we have together.
The assumption is there are no spelling mistakes.
There is no assumed number of clubs.
And the survey list is long. Long enough that it doesn't warrant doing this manually (1000s of queries).
Google Refine does just this, but I'll assume you want to roll your own.
Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.
You probably don't want to do it in a completely automated fashion. I'd do something like this:
# load corrections file, mapping user input -> output
# load survey
import difflib
possible_values = list(corrections.values())
for answer in survey:
    output = corrections.get(answer, None)
    if output is None:
        likely_outputs = difflib.get_close_matches(answer, possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)
# save corrections as CSV
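To show what difflib.get_close_matches returns in practice, here is a small self-contained sketch. The canonical list and the suggest helper are assumptions for illustration. The cutoff parameter is difflib's own similarity threshold (default 0.6), and very short abbreviations such as "Man U" may fall below it, which is one reason to keep a human in the loop.

```python
import difflib

# Hypothetical list of already-accepted club names.
canonical = ["manchester united", "manchester city", "arsenal"]


def suggest(answer, choices, cutoff=0.6):
    """Return up to three close matches for answer, best match first."""
    return difflib.get_close_matches(answer.lower(), choices, n=3, cutoff=cutoff)


for raw in ["Manchester Utd", "Man U", "Arsenal"]:
    print(raw, "->", suggest(raw, canonical))
```

Lowering the cutoff catches more abbreviations but also produces more false suggestions, so the threshold is worth tuning against a sample of the real survey data.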
Please edit your question with answers to the following:
You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?
How many clubs?
How many survey responses?
After doing very light normalisation (replace . by space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]) and counting, how many unique responses do you have?
Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?
You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?
What are your criteria for success in this exercise, and how will you measure it?
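The light normalisation described in the list above can be sketched as follows. The normalise function name is an assumption. The sample responses are the variations from the question, and Counter then shows how many unique responses remain.

```python
import re
from collections import Counter


def normalise(response):
    """Apply the light normalisation steps in the order given above."""
    text = response.replace(".", " ")   # replace . by space
    text = text.strip()                 # strip leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.lower()                 # convert to lower case


responses = ["Man U", "Man Utd.", "Man Utd.", "Manchester U", "Manchester Utd"]
counts = Counter(normalise(r) for r in responses)

print(len(counts), "unique responses")
for name, n in counts.most_common():
    print(n, name)
```

The number of unique normalised responses gives a quick sense of how much manual or fuzzy matching work remains.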
It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, then comparing the start of each word.
If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match first-n-letters to any actual team, you could try lexicographical comparison against any of the returned strings which actually do match.
It's not perfect, but it should get you 99% of the way there.
import string
def words(s):
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    score,best = 0., ''
    for matchWord in matchWords:
        matchScore = sum(w==m for w,m in zip(word,matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score,best = matchScore,matchWord
    return score,best

def bestMatchingSentence(wordList, matchSentences):
    score,best = 0., []
    for matchSentence in matchSentences:
        total,words = 0., []
        for word in wordList:
            s,w = bestMatchingWord(word,matchSentence)
            total += s
            words.append(w)
        if total > score:
            score,best = total,words
    return score,best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd"
    )
    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers')
    )
    for d in data:
        print "{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList))

if __name__=="__main__":
    main()
Running this on the sample data gets you:
Man U (1.9867767507647776, ['manchester', 'united'])
Man. Utd. (1.7448074166742613, ['manchester', 'utd'])
Manch Utd (1.9946817328797555, ['manchester', 'utd'])
Manchester U (1.989100008901989, ['manchester', 'united'])
Manchester Utd (1.9956787398647866, ['manchester', 'utd'])