How to tokenize a block of text as one token in Python?

I am currently working on a genome data set that consists of many blocks of genome sequences. In previous work on natural language processing I used sent_tokenize and word_tokenize from nltk to tokenize sentences and words, but when I use these functions on the genome data set they are not able to tokenize the genomes correctly. The text below shows part of the genome data set.
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
aatgttttatataaattgcagtatgtgtcacccaaaatagcaaaccccat
aaccaaccagattattatgatacataatgcttatatgaaactaagacatt
tcgcaacatttattttaggtatataaatacatttattgaaggaattgata
tatgccagtaaaatggtgtatttttaatttctttcaataaaaacataatt
gacattatataaaaatgaattataaaactctaagcggtggatcactcggc
tcatgggtcgatgaagaacgcagcaaactgtgcgtcatcgtgtgaactgc
aggacacatgaacatcgacattttgaacgcatatcgcagtccatgctgtt
atgtactttaattaattttatagtgctgcttggactacatatggttgagg
gttgtaagactatgctaattaagttgcttataaatttttataagcatatg
gtatattattggataaatataataatttttattcataatattaaaaaata
aatgaaaaacattatctcacatttgaatgt
>NR_004047 1
atattcaggttcatcgggcttaacctctaagcagtttcacgtactgttta
actctctattcagagttcttttcaactttccctcacggtacttgtttact
atcggtctcatggttatatttagtgtttagatggagtttaccacccactt
agtgctgcactatcaagcaacactgactctttggaaacatcatctagtaa
tcattaacgttatacgggcctggcaccctctatgggtaaatggcctcatt
taagaaggacttaaatcgctaatttctcatactagaatattgacgctcca
tacactgcatctcacatttgccatatagacaaagtgacttagtgctgaac
tgtcttctttacggtcgccgctactaagaaaatccttggtagttactttt
cctcccctaattaatatgcttaaattcagggggtagtcccatatgagttg
>NR_004052 1
When the nltk tokenizer is applied to this data set, each line of text (for example tattattatacacaatcccggggcgttctatatagttatgtataatgtat) becomes one token, which is not correct; a whole block of sequences should be considered as one token. For example, in this case the contents between >NR_004049 1 and >NR_004048 1 should be considered one token:
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
So each block starting with a special header such as >NR_004049 1, up to the next such header, should be considered one token. The problem is tokenizing this kind of data set, and I have no idea how to do it correctly.
I would really appreciate any answers that help me solve this.
Update:
One way to solve this problem is to concatenate all the lines within each block and then apply the nltk tokenizer. That is, appending all the lines between >NR_004049 1 and >NR_004048 1 turns several lines into one string, so the nltk tokenizer will treat it as one token. Can anyone help me with how to concatenate the lines within each block?

You just need to concatenate the lines between two ids apparently. There should be no need for nltk or any tokenizer, just a bit of programming ;)
patterns = {}
with open('data', "r") as f:
    id = None
    current = ""
    for line0 in f:
        line = line0.rstrip()
        if line.startswith('>'):  # new pattern
            if len(current) > 0:
                # print("adding " + id + " " + current)
                patterns[id] = current
                current = ""
            # to find the next id:
            tokens = line.split(" ")
            id = tokens[0][1:]
        else:  # continuing pattern
            current = current + line
    if len(current) > 0:
        patterns[id] = current
        # print("adding " + id + " " + current)

# do whatever with the patterns:
for id, pattern in patterns.items():
    print(f"{id}\t{pattern}")

Related

Calculate a measure between keywords and each word of a textfile

I have two .txt files: one contains 200,000 words and the other contains 100 keywords (one per line). I want to calculate the cosine similarity between each of the 100 keywords and each of my 200,000 words, and display for every keyword the 50 words with the highest score.
Here's what I did; note that BertClient is what I'm using to extract vectors:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword])
        for w in words:
            vector_word = bc.encode([w])
            cosine_lib = cosine_similarity(vector_key, vector_word)
            print(cosine_lib)
This keeps running and never stops. Any idea how I can correct this?
I know nothing of Bert... but there's something fishy with the import and run. I don't think you have it installed correctly, or something. I tried to pip install it and just ran this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the docs for bert and see if something else needs to be done.
On your code: it is generally better to do ALL of the reading first, then the processing, so import both lists first, separately, and check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. Notice that you are currently encoding every word in words EVERY TIME for each keyword, which is not necessary and probably really slow. I would recommend you either use a dictionary to pair each word with its encoding, or make a list of (word, encoding) tuples if you are more comfortable with that.
Comment me back if that doesn't make sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similarly to what you want to do. There are a lot of options for how you can hold results, etc., depending on your needs, but this should get you started with "fake bert":
from operator import itemgetter

# fake bert ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x-y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
    result = []  # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
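Once the real BertClient is up, the same "encode once, compare later" idea can be written with batch calls, so the 200,000 words are encoded a single time instead of once per keyword. This is only a rough sketch under that assumption (bc.encode takes a list of strings and returns a 2-D array; the file names are the ones from the question):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

with open("./words.txt", "r", encoding="utf8") as textfile:
    words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding="utf8") as keyword_file:
    keywords = [line.strip() for line in keyword_file if line.strip()]

# Encode each list once (for very large word lists this may need chunking)
word_vecs = bc.encode(words)
keyword_vecs = bc.encode(keywords)

# One similarity matrix: rows are keywords, columns are words
sims = cosine_similarity(keyword_vecs, word_vecs)

# For each keyword, print the 50 most similar words
for i, keyword in enumerate(keywords):
    top = np.argsort(sims[i])[::-1][:50]
    print(keyword, [words[j] for j in top])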

Linear search to find spelling errors in Python

I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re

# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split

# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()

print("---Linear Search---")

# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")

# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # if no match was found print the word being checked
        if not i <= len(dict_array):
            print(each_word)

# Closes the first chapter file
chapter_1.close()
Something like this should do it (pseudo code):
sampleDict = {}
For each word in AliceInWonderLand200.txt:
    sampleDict[word] = True

actualWords = {}
For each word in dictionary.txt:
    actualWords[word] = True

For each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the value of the dictionary in the sample isn't important. This should get you going, though
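Since the value stored in those dicts never matters, here is a minimal runnable sketch of the same idea using sets, assuming the two files from the lab are in the working directory and that upper-casing both sides is an acceptable way to ignore case:
import re

def words_in(path):
    # Pull out alphabetic words (plus apostrophes) and upper-case them
    with open(path) as f:
        return set(re.findall(r"[A-Za-z']+", f.read().upper()))

dictionary_words = words_in("dictionary.txt")
sample_words = words_in("AliceInWonderLand200.txt")

# Words from the chapter that never appear in the dictionary
for word in sorted(sample_words - dictionary_words):
    print(word)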

python spambot featureset list size

Novice coder here, trying to sort out issues I've found with a simple spam detection Python script from YouTube.
Naive Bayes cannot be applied because the featureset list isn't being generated correctly. I know the problematic step is
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
Could someone help me understand why that line is failing to generate anything?
def email_features(sent):
    features = {}
    wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
    for word in wordtokens:
        if word not in commonwords:
            features[word] = True
    return features

hamtexts = []
spamtexts = []
for infile in glob.glob(os.path.join('ham/','*.txt')):
    text_file = open(infile,"r")
    hamtexts.append(text_file.read())
    text_file.close()
for infile in glob.glob(os.path.join('spam/','*.txt')):
    text_file = open(infile,"r")
    spamtexts.append(text_file.read())
    text_file.close()
mixedemails = ([(email,'spam') for email in spamtexts] + [(email,'ham') for email in hamtexts])
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
I converted your problem into a minimal, runnable example:
commonwords = []

def lemmatize(word):
    return word

def word_tokenize(text):
    return text.split(" ")

def email_features(sent):
    wordtokens = [lemmatize(word.lower()) for word in word_tokenize(sent)]
    features = dict((word, True) for word in wordtokens if word not in commonwords)
    return features

hamtexts = ["hello test", "test123 blabla"]
spamtexts = ["buy this", "buy that"]
mixedemails = [(email,'spam') for email in spamtexts] + [(email,'ham') for email in hamtexts]
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
print len(mixedemails), len(featuresets)
Executing that example prints 4 4 on the console. Therefore, most of your code seems to work, and the exact cause of the error cannot be determined from what you posted. I would suggest you look at the following points to track down the bug:
Maybe your spam and ham files are not being read properly (e.g. your path might be wrong). To rule that out, add print hamtexts, spamtexts before mixedemails = .... Both variables should contain non-empty lists of strings.
Maybe your implementation of word_tokenize() always returns an empty list. Add print sent, wordtokens after wordtokens = [...] in email_features() to make sure that sent contains a string and that it gets correctly converted to a list of tokens.
Maybe commonwords contains every single word from your ham and spam emails. To make sure that this is not the case, add the previous print sent, wordtokens before the loop in email_features() and print features after the loop. All three variables should (usually) be non-empty.
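As a concrete version of the first point, it is also worth checking that the glob patterns match any files at all before reading them. A tiny sketch, keeping the Python 2 print style used above and the directory names from the question:
import glob
import os

ham_files = glob.glob(os.path.join('ham/', '*.txt'))
spam_files = glob.glob(os.path.join('spam/', '*.txt'))
# Both counts should be greater than zero; if not, the relative paths are wrong
print len(ham_files), len(spam_files)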

how to extract the contextual words of a token in python

I want to extract the contextual words of a specific word. For this purpose I could use n-grams in Python, but the drawback is that an n-gram slides the window by one position each time, whereas I only need the contextual words of a specific word. E.g. my file is like this
IL-2
gene
expression
and
NF-kappa
B
activation
through
CD28
requires
reactive
oxygen
production
by
5-lipoxygenase
.
meaning one token on every line. Now I want to extract the surrounding words of each token, e.g. through and requires are the surrounding words of "CD28". I wrote some Python code but it did not work, producing the error ValueError: list.index(x): x not in list.
My code is
import re;
import nltk;
file=open("C:/Python26/test.txt");
contents= file.read()
tokens = nltk.word_tokenize(contents)
f=open("trigram.txt",'w');
for l in tokens:
    print tokens[l],tokens[l+1]
f.close();
First of all, list.index(x) : Return the index in the list of the first item whose value is x.
>>> ["foo", "bar", "baz"].index('bar')
1
In the code below, the variable word is populated from a range of integers, not from the actual file contents, so we can't use word directly with the list.index() function.
>>> print lines.index(1)
ValueError: 1 is not in list
Change your code like this:
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.close()
I don't really understand what you want to do, but I'll do my best.
If you want to process words with Python, there is a library called NLTK, which stands for Natural Language Toolkit.
You may need to tokenize a sentence or a document.
import nltk

def tokenize_query(query):
    return nltk.word_tokenize(query)

f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)
We can also read a file one line at a time using a for loop:
f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
    print(line.strip())
r means 'read' and U means 'universal', if you are wondering.
strip() is just cutting '\n' from the text.
The context may be provided by wordnet and all its functions.
I guess you should use synsets with the word's pos (part of speech).
A synset is sort of a synonyms list in a semantic way.
NLTK can provide you some others nice features like sentiment analysis and similarity between synsets.
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.write("\n")
f.close()
This code also gives the same result
import nltk;
from nltk.util import ngrams
from nltk import word_tokenize

file = open("C:/Python26/tokens.txt");
contents=file.read();
tokens = nltk.word_tokenize(contents);
f_tri = open("trigram.txt",'w');
trigram = ngrams(tokens,3)
for t in trigram:
    f_tri.write(str(t)+"\n")
f_tri.close()
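If only the neighbours of one particular token are needed (e.g. CD28), rather than every trigram, a plain positional lookup over the token list is enough. A small sketch along those lines, assuming the input file has one token per line as shown in the question:
target = "CD28"

with open("C:/Python26/test.txt") as f:
    tokens = [line.strip() for line in f if line.strip()]

# Print the word before and after every occurrence of the target token
for i, tok in enumerate(tokens):
    if tok == target:
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        print(left, target, right)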

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract id (the file number that I extracted from my document), sentence id (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on its own row) to the table and assign the sentence IDs as shown above?
This is my code:
import glob;
import re;
import json

org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);

for name in files:
    fileA = open(name,'r');
    for line in fileA:
        if line.find(fileNo)!= -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
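As a hedged illustration of that point about searching the whole file as one string, the header fields could be pulled out with a regex instead of a line loop. The exact header layout is an assumption based on the line[14:] slicing in the question (e.g. a line such as "File        : a9001755"), so the patterns may need adjusting; the print style follows the Python 2 code above:
import glob
import re

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    with open(filename, 'r') as fh:
        text = fh.read()
    # Assumed header lines like "File        : a9001755" and "NSF Org     : ..."
    file_match = re.search(r'^File\s*:\s*(\S+)', text, re.MULTILINE)
    org_match = re.search(r'^NSF Org\s*:\s*(.+)$', text, re.MULTILINE)
    absID = file_match.group(1) if file_match else None
    nsfOrg = org_match.group(1).split() if org_match else None
    print absID, nsfOrg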
