I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However, in addition to the single words and their frequencies, I would like to keep meaningful 'word groups' and count them as well.
For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately, but rather retain the word group in my dictionary if it frequently occurs together. What is the most efficient method to do that? (I think I won't need to go beyond word groups containing 2 or 3 words.) And: at which point should I do the stopword removal?
Here is an example:
text = ('As a Data Scientist, you will focus on machine '
        'learning and Natural Language Processing')
The dictionary I would like to have is:
dict = ['data scientist', 'machine learning', 'natural language processing',
        'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
        'language', 'processing']
Sounds like what you want to do is use collocations from nltk.
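For example, a minimal sketch of scoring bigram collocations with nltk's BigramCollocationFinder (assuming `text` holds your concatenated job descriptions; the frequency filter and the PMI measure are just illustrative choices):
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(text.lower())

# Score candidate bigrams; frequently co-occurring pairs such as
# ('machine', 'learning') should rank near the top.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams seen only once
print(finder.nbest(bigram_measures.pmi, 10))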
Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.
import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)
tokens = nltk.word_tokenize(text.lower())
# Scan text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)
If you have huge amounts of text, you could check whether you'll get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing small ngram sizes). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of returned results.
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])
As for stopword removal: If you remove them, you'll produce ngrams where there were none before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide if this is something you want, or not. If it were me I would remove punctuation before the keyword scan, but leave the stopwords in place.
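A small sketch of that order of operations, assuming `text` from above; only the token list changes, and the keyword scan stays the same:
import nltk

# Keep the stopwords, but drop punctuation-only tokens so they cannot create artificial ngrams.
tokens = [tok for tok in nltk.word_tokenize(text.lower())
          if any(ch.isalnum() for ch in tok)]

# ...then run the same ngram/keyword scan as above on this token list.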
Thanks @Batman, I played around a bit with collocations and ended up only needing a couple of lines of code. (Obviously 'meaningful text' should be a lot longer to find actual collocations.)
meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)
Related
I am stemming a list of words and making a dataframe from it. The original data is as follows:
import pandas as pd

original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text': [original]})
The functions I've used for lemmatisation and stemming are:
import gensim
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')  # assumed; the original snippet used `stemmer` without defining it

# Lemmatize, then stem.
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result
The output will come from running df['text'].map(preprocess)[0] for which I get:
['man',
'fli',
'airplan',
'die',
'air',
'crash',
'wife',
'die',
'coupl',
'week',
'ago']
I wonder how I can map the output back to the original tokens? For instance, I have 'die', which came from both 'died' and 'dies'.
Stemming destroys information in the original corpus, by non-reversibly turning multiple tokens into some shared 'stem' form.
If you want the original text, you need to retain it yourself.
But also, note: many algorithms working on large amounts of data, like word2vec under ideal conditions, don't necessarily need or even benefit from stemming. You want to have vectors for all the words in the original text – not just the stems – and with enough data, the related forms of a word will get similar vectors. (Indeed, they'll even differ in useful ways, with all 'past' or 'adverbial' or whatever variants sharing a similar directional skew.)
So only do it if you're sure it's helping, within your corpus limits and goals.
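As a rough illustration of the word2vec point (a sketch only; `corpus` is a hypothetical list of raw texts, and the parameter names assume gensim 4.x):
from gensim.models import Word2Vec

# Train on unstemmed, lowercased tokens; with enough data, inflected forms
# such as 'dies' and 'died' should end up with similar (but not identical) vectors.
tokenized_corpus = [doc.lower().split() for doc in corpus]
model = Word2Vec(tokenized_corpus, vector_size=100, window=5, min_count=5, epochs=10)

print(model.wv.most_similar('died', topn=5))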
You could return the mapping relationship as the result and perform post-processing later.
def preprocess(text):
    lemma_mapping = {}
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping
Or store it as a by-product.
from collections import defaultdict

lemma_mapping = defaultdict(str)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
I have two lists of names (strings) that look like this:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
analysts = ['Justin Post', 'Some Dude', 'Some Chick']
I need to find where those names occur in a list of strings that looks like this:
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores.
The reason I need to do this is so that I can concatenate the conversation strings together (which are separated by the names). How would I go about doing this efficiently?
I looked at some similar questions and tried the solutions to no avail, such as this:
if any(x in str for x in executives):
    print('yes')
and this ...
match = next((x for x in executives if x in str), False)
match
I'm not sure if that is what you are looking for:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]
result = [s for s in text if any(ex in s for ex in executives)]
print(result)
output:
['Brian Olsavsky - Amazon.com']
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]
executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']
As an addition, if you need the exact location, you could use this:
print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
this outputs
[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
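Since the stated goal was to concatenate the conversation strings between name lines, here is a hedged sketch that builds on the matching above; it assumes every speaker line has the 'Name - Company' shape seen in the sample data:
# Group the transcript into (speaker, concatenated remarks) pairs.
conversations = []
current_speaker, current_lines = None, []
for line in str:                      # `str` is the list of transcript lines above
    if ' - ' in line:                 # assumed marker of a 'Name - Company' speaker line
        if current_speaker is not None:
            conversations.append((current_speaker, ' '.join(current_lines)))
        current_speaker, current_lines = line.split(' - ')[0], []
    else:
        current_lines.append(line)
if current_speaker is not None:
    conversations.append((current_speaker, ' '.join(current_lines)))

print(conversations)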
TLDR
This answer focuses on efficiency. Use the other answers if efficiency is not a key issue. If it is, build a dict from the corpus you are searching in, then use that dict to find what you are looking for.
#import stuff we need later
import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt
Create example corpus
First we create a list of strings which we will search in.
Create random words, by which I mean random sequences of characters, with lengths drawn from a Poisson distribution, using this function:
def poissonlength_words(lam_word):
    # Generate a "word": a random string whose length is drawn from a Poisson distribution.
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])
(lam_word being the parameter of the Poisson distribution.)
Let's create number_of_sentences variable-length sentences from these words (by sentence I mean a list of the randomly generated words joined by spaces).
The length of the sentences can also be drawn from a Poisson distribution.
lam_word = 5
lam_sentence = 1000
number_of_sentences = 10000

sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]
sentences[0] now will start like this:
tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...
Let's also create names, which we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second element) will be m characters long, and both will consist of random characters:
def bigramgen(n, m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)]) + ' ' + \
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])
The task
Let's say we want to find sentences where bigrams such as ab c appear. We don't want to find dab c or ab cd, only sentences where ab c stands alone.
To test how fast a method is, let's find an ever-increasing number of bigrams, and measure the elapsed time. The number of bigrams we search for can be, for example:
number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
Brute force method
Just loop through each bigram, loop through each sentence, use in to find matches. Meanwhile, measure elapsed time with time.time().
bruteforcetime = []

for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2, 1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        # The core of the brute force method starts here.
        reslist = []
        for sentencei, sentence in enumerate(sentences):
            if ' ' + bigram + ' ' in sentence:
                reslist.append([bigram, sentencei])
        # And ends here.
    end = time.time()
    bruteforcetime.append(end - start)
bruteforcetime will hold the number of seconds necessary to find 10, 30, 50 ... bigrams.
Warning: this might take a long time for a high number of bigrams.
The sort your stuff to make it quicker method
Let's create an empty set for every word appearing in any of the sentences (using dict comprehension):
worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
To each of these sets, add the index of each word in which it appears:
for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)
Note that we only do this once, no matter how many bigrams we search later.
Using this dictionary, we can search for the sentences where each part of the bigram appears. This is very fast, since a dict lookup is very fast. We then take the intersection of these sets. When we are searching for ab c, we will have a set of sentence indexes where ab and c both appear.
for bigram in bigrams:
    reslist = []
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
Let's put the whole thing together, and measure the time elapsed:
logtime = []

for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2, 1) for _ in range(number_of_bigrams)]
    start_time = time.time()

    worddict = {word: set() for sentence in sentences for word in sentence.split(' ')}
    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)

    for bigram in bigrams:
        reslist = []
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])

    end_time = time.time()
    logtime.append(end_time - start_time)
Warning: this might take a long time for a high number of bigrams, but less time than the brute force method.
Results
We can plot how much time each method took.
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Or, plotting the y axis on a log scale:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Giving us the plots (not reproduced here).
Building the worddict dictionary takes a lot of time, which is a disadvantage when searching for a small number of names. There is a point, however, where the corpus is big enough and the number of names we are searching for is high enough that this upfront cost is repaid by the speed of the dictionary lookups, compared to the brute force method. If those conditions are satisfied, I recommend using this method.
(Notebook available here.)
I have a list of sentences as below.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
I also have a set of selected concepts.
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
Now I want to extract the concepts in selected_concepts from the sentences, in the order in which they appear in each sentence.
That is, my output should be as follows:
output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]
I could extract the concepts in the sentences as follows.
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)
However, I am having trouble organising the extracted concepts according to sentence order. Is there an easy way of doing this in Python?
One way to do it is to use the .find() method to find the position of the substring and then sort by that value. For example:
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
You could use .find() and .insert() instead.
Something like:
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert(pos, item)
    output.append(sentence_tokens)
The only problem would be overlap in the selected_concepts. For example, 'databases process' and 'process'. In this case, they would end up in the opposite of the order in which they appear in selected_concepts. You could potentially fix this with the following:
output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    sentence_tokens = []
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert((selected_concepts_multiplier * pos) + k, item)
    output.append(sentence_tokens)
There is a built-in operator called in; it can check whether one string occurs inside another.
sentences = [
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]
selected_concepts = [
'machine learning',
'patterns',
'data mining',
'methods','database systems',
'interdisciplinary subfield','knowledege discovery',
'databases process',
'information',
'process'
]
output = []  # prepare the output
for s in sentences:  # now let's check each sentence
    output.append(list())  # add a list to output, so it becomes a list of lists
    for c in selected_concepts:  # check all selected_concepts
        if c in s:  # if a selected concept occurs in the sentence
            output[-1].append(c)  # then add it to the last list in output
print(output)
You can use the fact that regular expressions search text in order, left to right, and disallow overlaps:
import re
concept_re = re.compile(r'\b(?:' +
                        '|'.join(re.escape(concept) for concept in selected_concepts) + r')\b')
output = [match
          for sentence in sentences for match in concept_re.findall(sentence)]
output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']
This should also be faster than searching for the concepts individually, since the algorithm the regexp engine uses is more efficient for this, as well as being completely implemented in low-level code.
There is one difference though - if a concept repeats itself within one sentence, your code will only give one appearance per sentence, while this code outputs them all. If this is a meaningful difference, it is rather easy to dedupe a list.
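If you do need to dedupe, a short sketch (Python 3.7+, where dicts preserve insertion order):
# dict.fromkeys drops later duplicates while keeping the first-seen order.
deduped = list(dict.fromkeys(output))
print(deduped)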
Here I used the simple re.findall method: if the pattern is matched in the string, re.findall returns the matched pattern; otherwise it returns an empty list. Based on that, I wrote this code:
import re
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
output = []
for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)

print output
Output:
[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]
Using ngrams in Python, my aim is to find verbs and their corresponding adverbs in an input text.
What I have done:
Input text:""He is talking weirdly. A horse can run fast. A big tree is there. The sun is beautiful. The place is well decorated.They are talking weirdly. She runs fast. She is talking greatly.Jack runs slow.""
Code:
finder2 = BigramCollocationFinder.from_words(wrd for (wrd, tags) in posTagged if tags in ('VBG', 'RB', 'VBN'))
scored = finder2.score_ngrams(bigram_measures.raw_freq)
print sorted(finder2.nbest(bigram_measures.raw_freq, 5))
From my code, I got the output:
[('talking', 'greatly'), ('talking', 'weirdly'), ('weirdly', 'talking'),('runs','fast'),('runs','slow')]
which is the list of verbs and their corresponding adverbs.
What I am looking for:
I want to figure out each verb and all of its corresponding adverbs. For example: ('talking': 'greatly', 'weirdly'), ('runs': 'fast', 'slow'), etc.
You already have a list of all verb-adverb bigrams, so you're just asking how to consolidate them into a dictionary that gives all adverbs for each verb. But first let's re-create your bigrams in a more direct way:
pairs = list()
for (w1, tag1), (w2, tag2) in nltk.bigrams(posTagged):
    if tag1.startswith("VB") and tag2 == "RB":
        pairs.append((w1, w2))
Now for your question: We'll build a dictionary with the adverbs that follow each verb. I'll store the adverbs in a set, not a list, to get rid of duplications.
from collections import defaultdict

consolidated = defaultdict(set)
for verb, adverb in pairs:
    consolidated[verb].add(adverb)
The defaultdict provides an empty set for verbs that haven't been seen before, so we don't need to check by hand.
Depending on the details of your assignment, you might also want to case-fold and lemmatize your verbs so that the adverbs from "Driving recklessly" and "I drove carefully" are recorded together:
wnl = nltk.stem.WordNetLemmatizer()
...
for verb, adverb in pairs:
    verb = wnl.lemmatize(verb.lower(), "v")
    consolidated[verb].add(adverb)
I think you are losing information you will need for this. You need to retain the part-of-speech data somehow, so that bigrams like ('weirdly', 'talking') can be processed in the correct manner.
It may be that the bigram finder can accept the tagged word tuples (I'm not familiar with nltk). Or, you may have to resort to creating an external index. If so, something like this might work:
part_of_speech = {word: tag for word, tag in posTagged}
best_bigrams = finder2.nbest(... as you like it ...)
verb_first_bigrams = [b if part_of_speech[b[1]] == 'RB' else (b[1], b[0]) for b in best_bigrams]
Then, with the verbs in front, you can transform it into a dictionary or list-of-lists or whatever:
adverbs_for = {}
for verb, adverb in verb_first_bigrams:
    if verb not in adverbs_for:
        adverbs_for[verb] = [adverb]
    else:
        adverbs_for[verb].append(adverb)
Is there a more efficient way of doing this?
My code reads a text file and extracts all Nouns.
import nltk

File = open(fileName)                  # open file
lines = File.read()                    # read all lines
sentences = nltk.sent_tokenize(lines)  # tokenize sentences
nouns = []                             # empty array to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)
How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?
Thanks in advance!
If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:
>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
import nltk
lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print nouns
>>> ['lines', 'string', 'words']
Useful tip: building a list with a list comprehension is often faster than adding elements one at a time with the .insert() or .append() method inside a for loop.
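A quick way to check that claim for yourself with timeit (a sketch; the loop body is arbitrary):
import timeit

append_loop = """
result = []
for i in range(1000):
    result.append(i * i)
"""
comprehension = "result = [i * i for i in range(1000)]"

# Time each approach over many runs; the comprehension is usually faster.
print(timeit.timeit(append_loop, number=10_000))
print(timeit.timeit(comprehension, number=10_000))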
You can achieve good results using nltk, Textblob, SpaCy or any of the many other libraries out there. These libraries will all do the job but with different degrees of efficiency.
import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
On my Windows 10 HP laptop (Intel i5, 2 cores / 4 logical processors, 8 GB RAM), running in a Jupyter notebook, I ran some comparisons, and here are the results.
For TextBlob:
%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 8.01 ms #average over 20 iterations
For nltk:
%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 7.09 ms #average over 20 iterations
For spacy:
%%time
print([ent.text for ent in nlp(txt) if ent.pos_ == 'NOUN'])
And the output is
>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 30.19 ms #average over 20 iterations
It seems nltk and TextBlob are reasonably faster, and this is to be expected since they store nothing else about the input text, txt. SpaCy is way slower. One more thing: SpaCy missed the noun 'NLP', while nltk and TextBlob got it. I would shoot for nltk or TextBlob unless there is something else I wish to extract from the input txt.
Check out a quick start into spacy here.
Check out some basics about TextBlob here. Check out nltk HowTos here
import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if(pos[:2] == 'NN')]
print (nouns)
Just simplified it a bit more.
I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to do much better than the nested loops you already have.
Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it returns a list of lists of tagged words too.
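A small sketch of what that might look like for the code in the question (the tokenization step is unchanged):
import nltk

sentences = nltk.sent_tokenize(lines)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# pos_tag_sents tags every sentence in one call and returns a list of lists of (word, tag) pairs.
tagged_sentences = nltk.pos_tag_sents(tokenized_sentences)

nouns = [word
         for tagged in tagged_sentences
         for word, pos in tagged
         if pos in ('NN', 'NNP', 'NNS', 'NNPS')]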
Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.
The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.
In short, your loops are fine. There is a thing or two in your code that you could clean up (e.g., the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.