how to extract the contextual words of a token in python - python

Actually i want to extract the contextual words of a specific word. For this purpose i can use the n-gram in python but the draw back of this is that it slides the window by one but i only need the contextual words of a specific word. E.g. my file is like this
IL-2
gene
expression
and
NF-kappa
B
activation
through
CD28
requires
reactive
oxygen
production
by
5-lipoxygenase
.
mean each token on every line. now i want to extract the surrounding words of each e.g. through and requires are the surrounding words of "CD28". I write a python code but did not worked and generating an error of ValueError: list.index(x): x not in list.
My code is
import re;
import nltk;
file=open("C:/Python26/test.txt");
contents= file.read()
tokens = nltk.word_tokenize(contents)
f=open("trigram.txt",'w');
for l in tokens:
print tokens[l],tokens[l+1]
f.close();

First of all, list.index(x) : Return the index in the list of the first item whose value is x.
>>> ["foo", "bar", "baz"].index('bar')
1
In your code, the variable 'word' is populated using range of integers not by actual contents. so we can't directly use 'word' in the list.index() function.
>>> print lines.index(1)
ValueError: 1 is not in list
change your code like this :
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.close()

I dont really understood what you want to do, but, I'll do my best.
If you want to process words with python there is a library called NLTK which means Natural Language Toolkit.
You may need to tokenize a sentence or a document.
import nltk
def tokenize_query(query):
return nltk.word_tokenize(query)
f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)
We can also read a file one line at a time using a for loop:
f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
print(line.strip())
r means 'read' and U means 'universal', if you are wondering.
strip() is just cutting '\n' from the text.
The context may be provided by wordnet and all its functions.
I guess you should use synsets with the word's pos (part of speech).
A synset is sort of a synonyms list in a semantic way.
NLTK can provide you some others nice features like sentiment analysis and similarity between synsets.

file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.write("\n")
f.close()

This code also gives the same result
import nltk;
from nltk.util import ngrams
from nltk import word_tokenize
file = open("C:/Python26/tokens.txt");
contents=file.read();
tokens = nltk.word_tokenize(contents);
f_tri = open("trigram.txt",'w');
trigram = ngrams(tokens,3)
for t in trigram:
f_tri.write(str(t)+"\n")
f_tri.close()

Related

How to Tokenize block of text as one token in python?

Recently I am working on a genome data set which consists of many blocks of genomes. On previous works on natural language processing, I have used sent_tokenize and word_tokenize from nltk to tokenize the sentences and words. But when I use these functions on genome data set, it is not able to tokenize the genomes correctly. The text below shows some part of the genome data set.
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
aatgttttatataaattgcagtatgtgtcacccaaaatagcaaaccccat
aaccaaccagattattatgatacataatgcttatatgaaactaagacatt
tcgcaacatttattttaggtatataaatacatttattgaaggaattgata
tatgccagtaaaatggtgtatttttaatttctttcaataaaaacataatt
gacattatataaaaatgaattataaaactctaagcggtggatcactcggc
tcatgggtcgatgaagaacgcagcaaactgtgcgtcatcgtgtgaactgc
aggacacatgaacatcgacattttgaacgcatatcgcagtccatgctgtt
atgtactttaattaattttatagtgctgcttggactacatatggttgagg
gttgtaagactatgctaattaagttgcttataaatttttataagcatatg
gtatattattggataaatataataatttttattcataatattaaaaaata
aatgaaaaacattatctcacatttgaatgt
>NR_004047 1
atattcaggttcatcgggcttaacctctaagcagtttcacgtactgttta
actctctattcagagttcttttcaactttccctcacggtacttgtttact
atcggtctcatggttatatttagtgtttagatggagtttaccacccactt
agtgctgcactatcaagcaacactgactctttggaaacatcatctagtaa
tcattaacgttatacgggcctggcaccctctatgggtaaatggcctcatt
taagaaggacttaaatcgctaatttctcatactagaatattgacgctcca
tacactgcatctcacatttgccatatagacaaagtgacttagtgctgaac
tgtcttctttacggtcgccgctactaagaaaatccttggtagttactttt
cctcccctaattaatatgcttaaattcagggggtagtcccatatgagttg
>NR_004052 1
When the tokenizer of ntlk is applied on this dataset, each line of text (for example tattattatacacaatcccggggcgttctatatagttatgtataatgtat ) becomes one token which is not correct. and a block of sequences should be considered as one token. For example in this case contents between >NR_004049 1 and >NR_004048 1 should be consider as one token:
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
So each block starting with special words such as >NR_004049 1 until the next special character should be considered as one token. The problem here is tokenizing this kind of data set and i dont have any idea how can i correctly tokenize them.
I really appreciate answers which helps me to solve this.
Update:
One way to solve this problem is to append al lines within each block, and then using the nltk tokenizer. for example This means that to append all lines between >NR_004049 1 and >NR_004048 1 to make one string from several lines, so the nltk tokenizer will consider it as one token. Can any one help me how can i append lines within each block?
You just need to concatenate the lines between two ids apparently. There should be no need for nltk or any tokenizer, just a bit of programming ;)
patterns = {}
with open('data', "r") as f:
id = None
current = ""
for line0 in f:
line= line0.rstrip()
if line[0] == '>' : # new pattern
if len(current)>0:
# print("adding "+id+" "+current)
patterns[id] = current
current = ""
# to find the next id:
tokens = line.split(" ")
id = tokens[0][1:]
else: # continuing pattern
current = current + line
if len(current)>0:
patterns[id] = current
# print("adding "+id+" "+current)
# do whatever with the patterns:
for id, pattern in patterns.items():
print(f"{id}\t{pattern}")

Calculate a measure between keywords and each word of a textfile

I have two .txt files, one that contains 200.000 words and the second contains 100 keywords( one each line). I want to calculate the cosine similarity between each of the 100 keywords and each word of my 200.000 words , and display for every keyword the 50 words with the highest score.
Here's what I did, note that Bertclient is what i'm using to extract vectors :
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
for keyword in keyword_file:
vector_key = bc.encode([keyword])
for w in words:
vector_word = bc.encode([w])
cosine_lib = cosine_similarity(vector_key,vector_word)
print (cosine_lib)
This keeps running but it doesn't stop. Any idea how I can correct this ?
I know nothing of Bert...but there's something fishy with the import and run. I don't think you have it installed correctly or something. I tried to pip install it and just run this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the dox for bert and see if something else needs to be done.
On your code, it is generally better do do ALL of the reading first, then the processing, so import both lists first, separately, check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. You realize now that you are converting each word in words EVERY TIME for each keyword, which is not necessary and probably really slow. I would recommend you either use a dictionary to pair the word with the encoding or make a list of tuples with the (word, encoding) in it if you are more comfortable with that.
Comment me back if that doesn't makes sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similar to what you want to do. There are a lot of options for how you can hold results, etc. depending on your needs, but this should get you started with "fake bert"
from operator import itemgetter
# fake bert ... just return something like length
def bert(word):
return len(word)
# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
return abs(x-y)
# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
words = textfile.read().split()
# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
keywords = keyword_file.read().split()
# encode the words and put result in dictionary
encoded_words = {}
for word in words:
encoded_words[word] = bert(word)
encoded_keywords = {}
for word in keywords:
encoded_keywords[word] = bert(word)
# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
result = [] # make a new result set for each pass
for kword in encoded_keywords.keys():
similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
# stuff the answer into a tuple that can be sorted
result.append((word, kword, similarity))
result.sort(key=itemgetter(2))
print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')

Tokenization and lemmatization for TF-IDF use for bunch of txt files using NLTK library

Doing the text analysis of italian text (tokenization, lemmalization) for future use of TF-IDF technics and constructing clusters based on that. For preprocessing I use NLTK and for one text file everything is working fine:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()
with open('3003.txt', 'r' , encoding="latin-1") as myfile:
data=myfile.read()
word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
But the question is that I need to perform the following to bunch of .txt files in the folder. For that I'm trying to use possibilities of PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'reports/'
newcorpus = PlaintextCorpusReader(corpusdir, '.txt')
Basically I can not just apply newcorpus into the previous functions because it's an object and not a string. So my questions are:
How should the functions look like (or how should I change the existing ones for a distinct file) for doing tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())
How would the TF-IDF approach (standard sklearn approach of vectorizer = TfidfVectorizer() will look like in PlaintextCorpusReader()
Many Thanks!
I think your question can be answered by reading: this question, this another one and [TfidfVectorizer docs][3]. For completeness, I wrapped the answers below:
First, you want to get the files ids, by the first question you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second quetion you can retrieve documents' words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []
for id_ in ids:
# Get words
doc_words.append(newcorpus.words(id_))
# Get sentences
doc_sents.append(newcorpus.sents(id_))
# Get paragraph
doc_paras.append(newcorpus.paras(id_))
Now, on the ith position of doc_words, doc_sents and doc_paras you have all words, sentences and paragraphs (respectively) for every document in the corpus.
For tf-idf you probably just want the words. Since TfidfVectorizer.fit's method gets an iterable which yields str, unicode or file objects, you need to either transform your documents (array of tokenized words) into a single string, or use a similar approach to this one. The latter solution uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidVectorizer and use PlaintextCorpusReader simply for file reading.

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
for line in f:
if line in strings:
strings[line] += 1
else:
strings[line] = 1
for w in sorted(strings, key=strings.get, reverse=True):
print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only python token you want to keep together is the object.attr construct then all the tokens you are interested would fit into the regular expression
\w+\.?\w*
Which basically means "one or more alphanumeric characters (including _) optionally followed by a . and then some more characters"
note that this would also match number literals like 42 or 7.6 but that would be easy enough to filter out afterwards.
then you can use collections.Counter to do the actual counting for you:
import collections
import re
pattern = re.compile(r"\w+\.?\w*")
#here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
for token, count in tokens.most_common(5): #show only the top 5
print(token, count)
Running python version 3.6.0a1 the output is this:
self 226
def 173
return 170
self.data 129
if 102
which makes sense for the collections module since it is full of classes that use self and define methods, it also shows that it does capture self.data which fits the construct you are interested in.

n-grams with Naive Bayes classifier

Im new to python and need help!
i was practicing with python NLTK text classification.
Here is the code example i am practicing on
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
Ive tried this one
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict
train_samples = {}
with file ('positive.txt', 'rt') as f:
for line in f.readlines():
train_samples[line]='pos'
with file ('negative.txt', 'rt') as d:
for line in d.readlines():
train_samples[line]='neg'
f=open("test.txt", "r")
test_samples=f.readlines()
def bigramReturner(text):
tweetString = text.lower()
bigramFeatureVector = {}
for item in bigrams(tweetString.split()):
bigramFeatureVector.append(' '.join(item))
return bigramFeatureVector
def get_labeled_features(samples):
word_freqs = {}
for text, label in train_samples.items():
tokens = text.split()
for token in tokens:
if token not in word_freqs:
word_freqs[token] = {'pos': 0, 'neg': 0}
word_freqs[token][label] += 1
return word_freqs
def get_label_probdist(labeled_features):
label_fd = FreqDist()
for item,counts in labeled_features.items():
for label in ['neg','pos']:
if counts[label] > 0:
label_fd.inc(label)
label_probdist = ELEProbDist(label_fd)
return label_probdist
def get_feature_probdist(labeled_features):
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
num_samples = len(train_samples) / 2
for token, counts in labeled_features.items():
for label in ['neg','pos']:
feature_freqdist[label, token].inc(True, count=counts[label])
feature_freqdist[label, token].inc(None, num_samples - counts[label])
feature_values[token].add(None)
feature_values[token].add(True)
for item in feature_freqdist.items():
print item[0],item[1]
feature_probdist = {}
for ((label, fname), freqdist) in feature_freqdist.items():
probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
feature_probdist[label,fname] = probdist
return feature_probdist
labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
for sample in test_samples:
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but getting this error, why?
Traceback (most recent call last):
File "C:\python\naive_test.py", line 76, in <module>
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
File "C:\python\naive_test.py", line 23, in bigramReturner
bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows the exact same principals as a unigram feature vector. So, just like the tutorial you mentioned you will have to check if a bigram feature is present in any of the documents you will use.
As for the bigram features and how to extract them, I have written the code bellow for it. You can simply adopt them to change the variable "tweets" in the tutorial.
import nltk
text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams (text.split()): print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this would be helpful enough. Otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis some researchers tend to tokenize the words and remove the punctuation and some others don't. From experince I know that if you don't remove punctuations, Naive bayes works almost the same, however an SVM would have a decreased accuracy rate. You might need to play around with this stuff and decide what works better on your dataset.
Edit 1:
There is a book named "Natural language processing with Python" which I can recommend it to you. It contains examples of bigrams as well as some exercises. However, I think you can even solve this case without it. The idea behind selecting bigrams a features is that we want to know the probabilty that word A would appear in our corpus followed by the word B. So, for example in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words while the word bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. So the code function bellow puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner (tweetString):
tweetString = tweetString.lower()
tweetString = removePunctuation (tweetString)
bigramFeatureVector = []
for item in nltk.bigrams(tweetString.split()):
bigramFeatureVector.append(' '.join(item))
return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.

Categories

Resources