extract relationships using NLTK - python

This is a follow-up to my earlier question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I am getting an error from the nltk.sem.extract_rels call:
AttributeError: 'Tree' object has no attribute 'text'
Here is the complete code:
import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)
# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]
# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.show_raw_rtuple(rel))
This example is very similar to the one given in the book, but the book's example uses prepared 'parsed docs', which appear out of nowhere, and I don't know where to find their object type. I scoured through the git libraries as well. Any help is appreciated.
My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.

It looks like, to be a "Parsed Doc", an object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works:
import nltk
import re
IN = re.compile (r'.*\bin\b(?!\b.+ing)')
class doc():
    pass
# placeholder headline: as noted above, both a headline and a text member are expected
doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']
for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print(nltk.sem.relextract.show_raw_rtuple(rel))
When run this provides the output:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
Obviously you wouldn't actually code it like this, but it provides a working example of the data format expected by extract_rels. You just need to work out the preprocessing steps that get your data into that format.

Here is the source code of nltk.sem.extract_rels function :
def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
So if you pass the corpus parameter as 'ieer', the nltk.sem.extract_rels function expects the doc parameter to be an IEERDocument object. You should pass corpus as 'ace' or simply omit it (the default is 'ace'). In that case it expects a list of chunk trees (that's what you wanted). I modified the code as below:
import nltk
import re
from nltk.sem import extract_rels,rtuple
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')
for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7)  # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))
And it gives the result :
[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']

This is an NLTK version problem. Your code should work in NLTK 2.x,
but for NLTK 3 you should write it like this:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
See also the related question "NLTK Example for Relation Extraction Does not work".


How to get n-grams from a column in pandas dataframe

I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function
def n_grams(lines , min_length=2, max_length=4):
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)
but it does not work since I got None as output.
Can you please tell me what is wrong in the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid, I think you wanted to call collections.deque()
I think there are better ways to fix your code than using the collections library. Two of them are as follows:
You might fix your function using list comprehension:
def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]
    return ngrams
Or you might use nltk which supports tokenization and n-grams natively.
from nltk import ngrams
from nltk.tokenize import word_tokenize
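def n_grams(lines, min_length=2, max_length=4):
    tokens = word_tokenize(lines)
    # Store the result under a different name so it does not shadow the imported
    # ngrams() function, and turn each generator into a list.
    result = {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
    return result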
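If the goal is frequency counts rather than raw lists, the remark about the empty Counter() objects can be made concrete with the standard library alone. This is only a sketch: it assumes whitespace tokenization and lowercasing, and the name count_ngrams and the sample rows are made up for illustration.

```python
from collections import Counter


def count_ngrams(lines, min_length=2, max_length=4):
    """Count n-grams of every length in [min_length, max_length] across lines."""
    counts = {n: Counter() for n in range(min_length, max_length + 1)}
    for line in lines:
        tokens = line.lower().split()  # assumption: whitespace tokens, case-folded
        for n in counts:
            # Every window of n consecutive tokens is one n-gram.
            counts[n].update(
                tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
            )
    return counts


# Hypothetical rows standing in for the values of the text column.
rows = [
    "for each topic we will explore the words",
    "for each document we create a dictionary",
]
counts = count_ngrams(rows)
print(counts[2].most_common(1))  # [(('for', 'each'), 2)]
```

Since iterating over a pandas Series yields its values, the column itself could probably be passed in place of rows, as long as every cell is a string.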

Built-in function to get the frequency of one word with spaCy?

I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging etc... SpaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like is the spaCy equivalent of this NLTK idiom:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come by the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
    return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
The Python standard library includes collections.Counter for this kind of purpose. You have not said whether an approach like this suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright just to give some time reference, I have used this script,
import random, timeit
from collections import Counter
def loadWords():
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")
def countWords():
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])
print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
foo: 9998872
Still I am not sure if this is enough for you.
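The script above reads the whole file into memory before counting. If that becomes a problem for a very large corpus, one possible variation is to feed the Counter chunk by chunk. The sketch below is an assumption-laden illustration, not a benchmarked solution: count_words_stream and the chunk size are made-up names and values, and words are taken to be whitespace-separated.

```python
import io
from collections import Counter


def count_words_stream(handle, chunk_size=1 << 20):
    """Count whitespace-separated words without loading the whole file."""
    counts = Counter()
    leftover = ""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        # If the chunk ends mid-word, hold that fragment back for the next round.
        if words and not chunk[-1].isspace():
            leftover = words.pop()
        else:
            leftover = ""
        counts.update(words)
    if leftover:
        counts[leftover] += 1
    return counts


# A tiny in-memory stand-in for the corpus file, with a deliberately small chunk size.
sample = io.StringIO("foo bar foo life foo ")
print(count_words_stream(sample, chunk_size=4)["foo"])  # 3
```

With a real file, the handle would come from open('corpora.txt', 'r') as in the script above.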
Updating with this answer as this is the page I found when searching for an answer for this specific problem. I find that this is an easier solution than the ones provided before and that only uses spaCy.
As you mentioned, the spaCy Doc object has the built-in method Doc.count_by. From what I understand of your question, it does what you ask for, but it is not obvious.
It counts the occurrences of a given attribute and returns a dictionary whose keys are the attribute's hashes (as integers) and whose values are the counts.
First of all we need to import ORTH from spacy.attrs. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA, just import the attribute you wish to use.
If we look at the dictionary we will see that it contains the hash for the lexeme and the word count.
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for a word by looking up its hash in the vocab's string store, nlp.vocab.strings; the same store also maps a word back to its hash.
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
    return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier
get_word_count('apple', count_dict)
We get:
2

How to check two POS tags are in the same category in NLTK?

Like the title says, how can I check two POS tags are in the same category?
For example,
go -> VB
goes -> VBZ
These two words are both verbs. Or,
bag -> NN
bags -> NNS
These two are both nouns.
So my question is that whether there exists any function in NLTK to check if two given tags are in the same category?
Let's take the simple case first: Your corpus is tagged with the Brown tagset (that's what it looks like), and you'd be happy with the simple tags defined in the nltk's "universal" tagset: ., ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, VERB, X, where the dot stands for "punctuation". In this case, simply load the nltk's map and use it with your data:
import nltk
tagmap = nltk.tag.mapping.tagset_mapping("en-brown", "universal")
if tagmap[tag1] == tagmap[tag2]:
    print("The two words have the same part of speech")
If that's not your use case, you'll need to manually decide on a mapping from each individual tag to the simplified category you want to assign it to. If you are working with the Brown corpus tagset, you can see the tags and their meanings here, or from within Python by calling nltk.help.brown_tagset().
Study your tags and define a dictionary that maps each POS tag to your chosen category; people sometimes find it useful to just group Brown corpus tags by their first two letters, putting together "NN", "NN$", "NNS-HL", etc. You could create this particular mapping automatically like this:
from nltk.corpus import brown
alltags = set(t for w, t in brown.tagged_words())
tagmap = dict((t, t[:2]) for t in alltags)  # map each full tag to its first two letters
Then you can customize this map according to your needs; e.g., to put all punctuation tags together in the category ".":
for tag in tagmap:
    if not tag.isalpha():
        tagmap[tag] = "."
Once your tagmap is to your liking, use it like the one I imported from the nltk.
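To see the idea without loading a corpus, here is a minimal sketch of the same two-letter grouping wrapped in a comparison function. The helper names coarse_tag and same_category are made up for this illustration, and the grouping rule is deliberately crude, so treat it as a starting point rather than a complete mapping.

```python
def coarse_tag(tag):
    """Map a fine-grained tag to a coarse category (illustrative rule only)."""
    if not tag[:1].isalpha():
        return "."   # lump punctuation-like tags together
    return tag[:2]   # e.g. VB, VBZ, VBD -> VB; NN, NNS -> NN


def same_category(tag1, tag2):
    """Return True if both tags fall into the same coarse category."""
    return coarse_tag(tag1) == coarse_tag(tag2)


print(same_category("VB", "VBZ"))  # True
print(same_category("NN", "NNS"))  # True
print(same_category("NN", "VB"))   # False
```

A dictionary-based tagmap remains the better choice once you need exceptions to the two-letter rule.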
Finally, you might find it convenient to retag your entire corpus in one go, so that you can simply compare the assigned tags. If corpus is a list of tagged sentences in the format of the nltk's <corpus>.tagged_sents() command (so not a corpus reader object), you can retag everything like this:
newcorpus = []
for sent in corpus:
    newcorpus.append([(w, tagmap[t]) for w, t in sent])
Not sure if this is what you are looking for, but you can tag with a universal tagset:
from pprint import pprint
from collections import defaultdict
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
s = "I go. He goes. This bag is brown. These bags are brown."
d = defaultdict(list)
for sent in sent_tokenize(s):
    text = word_tokenize(sent)
    for value, tag in pos_tag(text, tagset='universal'):
        d[tag].append(value)  # collect the words under their universal tag
pprint(dict(d))
{'.': ['.', '.', '.', '.'],
'ADJ': ['brown'],
'DET': ['This', 'These'],
'NOUN': ['bag', 'bags'],
'PRON': ['I', 'He'],
'VERB': ['go', 'goes', 'is', 'brown', 'are']}
Note how bag and bags fall into NOUN category and go and goes fall into VERB.

Replacement by synsets in Python pattern package

My goal is to create a system that can take any random text, extract sentences, remove punctuation, and then, on one of the bare sentences, randomly replace NN- or VB-tagged words with a meronym, holonym, or synonym, or with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract, say, a hyponym from this djidja variable, it proves impossible, since it is a Synset object and I can't manipulate it in any way.
Any idea how to extract the actual word reported in the hyponyms list (e.g. print(djidja[2]) displays Synset(u'bowler'), so how do I extract only 'bowler' from this)?
Recall that a synset is just a list of words marked as synonyms. Given a synset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0]  # a word can belong to many synsets; just use the first one for the sake of argument
s.synonyms
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernyms and hyponyms:
s.hypernyms()
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
s.hypernyms()[0].synonyms
Out[17]: [u'canine', u'canid']

Extract headings from a MS Word document in Python

I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for win32, but I don't know which method to use; the help documentation for Python for Windows does not seem to list the functions of the Word object. Take the following code as an example:
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
doc = word.ActiveDocument
How can I find all the functions of the word object? I didn't find anything useful in the help documentation.
The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:
for word in doc.Words:
    print(word)
And you would get all of the words. Each of those word items would be a Word object (reference here), so you could access those properties during iteration. In your case, here is how you would get the style:
for word in doc.Words:
    print(word.Style)
On a sample doc with a single Heading 1 and normal text, this prints:
Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:
from itertools import groupby
import win32com.client as win32
# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
doc = word.ActiveDocument
# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str().
# There was some other interesting behavior, but I have zero
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(word) for word in grp_wrds))
This outputs:
Heading 1 Here is some text
No header
If you replace the join with a list comprehension, you get the below (where you can see the newlines):
Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
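The grouping step can be tried without Word or COM at all. The sketch below uses a hypothetical list of (word text, style name) pairs in place of doc.Words, so the data and variable names are assumptions for illustration only. It shows why the sequence is left unsorted: groupby merges only adjacent items with equal keys, so each heading stays a separate run.

```python
from itertools import groupby

# Hypothetical stand-in for (word text, style name) pairs read from a document.
words = [
    ("Here ", "Heading 1"), ("is ", "Heading 1"), ("some ", "Heading 1"),
    ("text", "Heading 1"),
    ("No ", "Normal"), ("header", "Normal"),
    ("Second ", "Heading 1"), ("title", "Heading 1"),
]

# Join each run of consecutive words that share a style name.
runs = [
    (style, "".join(text for text, _ in group))
    for style, group in groupby(words, key=lambda pair: pair[1])
]

# Keep only the runs whose style name marks them as headings.
headings = [text for style, text in runs if style.startswith("Heading")]
print(headings)  # ['Here is some text', 'Second title']
```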
Convert the Word file to .docx and use the python-docx module:
from docx import Document
file = 'test.docx'
document = Document(file)
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)  # the heading's text
You can also use the Google Drive SDK to convert the Word document to something more useful, like HTML, where you can easily extract the headers.

