This is how I applied a dictionary for stemming. My dictionary (d) is imported and is now in this format: d = {'nada.*': 'nadas', 'mila.*': 'milas'}
I wrote this code to stem tokens, but it runs TOO SLOW, so I stopped it before it finished. I guess the problem is that the dict is large and there is a large number of tokens.
So, how can I implement my stem dictionary so that the code runs at a reasonable speed?
I tried to find a method in the nltk package to apply a custom dict, but I didn't find one.
import re
import nltk

#import stem dict
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[key.replace("\n", "")] = val.replace("\n", "")
#define tokenizer
def custom_tokenizer(text):
    #split into tokens
    tokens = nltk.tokenize.word_tokenize(text)
    #stemmer
    for i, token in enumerate(tokens):
        for key, val in d.items():
            if re.match(key, token):
                tokens[i] = val
                break
    return tokens
Dictionary sample:
bank.{1}$:banka
intes.{1}$:intesa
intes.{1}$:intesa
intez.{1}$:intesa
intezin.*:intesa
banke:banka
banaka:banka
bankama:banka
post_text sample:
post_text = [
'Banca intesa #nocnamora',
'Banca intesa',
'banka haosa i neorganizovanosti!',
'Cucanje u banci umesto setnje posle rucka.',
"Lovin' it #intesa'"
]
Notice that while the keys in your stem dict are regexes, they all start with a short string of some specific characters. Let's say the minimum length of specific characters is 3. Then, construct a dict like this:
d = {
    'ban': [('bank.$', 'banka'),
            ('banke', 'banka'),
            ('banaka', 'banka'),
            ('bankama', 'banka'),
            ],
    'int': [('inte[sz].$', 'intesa'),
            ('intezin.*', 'intesa'),
            ],
}
Of course, you should re.compile() all those patterns at the beginning.
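One way to build such a prefix index from the flat stem dict loaded earlier, compiling the patterns up front (a sketch; the three-character prefix length and the name prefix_index are my assumptions):
from collections import defaultdict
import re

prefix_index = defaultdict(list)
for pattern, stem in d.items():  # d is the flat stem dict read from "Stem rečnik.txt"
    prefix_index[pattern[:3]].append((re.compile(pattern), stem))
d = dict(prefix_index)  # the tokenizer below then looks patterns up by token[:3]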
Then you can do a cheaper, three-character lookup in this dict:
def custom_tokenizer(text):
    tokens = nltk.tokenize.word_tokenize(text)
    for i, token in enumerate(tokens):
        for key, val in d.get(token[:3], []):
            if re.match(key, token):
                tokens[i] = val
                break
    return tokens
Now instead of checking all 500 stems, you only need to check the few that start with the right prefix.
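For example, applied to the post_text sample above (lowercasing each post first, since the sample patterns are lowercase; that pre-processing step is my assumption, not something stated in the question):
stemmed = [custom_tokenizer(post.lower()) for post in post_text]
# e.g. the token 'banci' from the fourth post matches the bank pattern and is stemmed to 'banka'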
I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching (4 mismatches max) for keywords, stored in a dictionary d (example just below; the value is always a list of two, and I need to keep track of which one was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each).
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, as multiple keywords may be found in one string/line, but I can skip the second element of the list associated with a keyword if the first one is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []} #to store compiled regexes
for k, v in d.items(): #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}')) #compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}')) #compile second

#Search keywords
with open(args.file) as f: #open file with strings
    line = f.readline() #first line/string
    while line: #go through each line
        for k, v in compd.items(): #for each keyword (ID, regexes)
            for val in [v[0], v[1]]: #for each elt of list
                found = val.search(line) #regex search
                if found != None: #if match
                    print("Keyword " + k + " found as " + found[0]) #print match
                    if val == v[0]: #if 1st elt of list
                        break #don't search 2nd
        line = f.readline() #next line
I have tested the script using the testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings and six keywords. There will be cases where I have 2 million strings, which should take 12-16 minutes, and I would like to have this reduced if possible.
Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:
patterns = []
for k, v in d.items(): #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(?:' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
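For the two example keywords, the resulting pattern string comes out as:
(?b)(?:(?P<kw1>AGCTCGATGTATGGGTATATGATCTTGAC|GTCAAGATCATATACCCATACATCGAGCT)|(?P<kw2>GGTCAGGTCAGTACGGTACGATCGATTTCGA|TCGAAATCGATCGTACCGTACTGACCTGACC)){s<=4}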
With this, the program can check for which named group was matched in order to get the keyword. You can replace the loop over all the regexes in compd with loops over matches in a line (in case there is more than 1 match) and dictionary items in each match (which could be implemented as a comprehension):
for matched in reSeqs.finditer(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError: # no named group matched
        pass
(Note that you don't need to call readline on a file object to loop over lines; instead, you can loop over the file object directly: for line in f:.)
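Putting this together, the line-by-line search could look roughly like this (a sketch rather than a drop-in replacement; it assumes the reSeqs pattern built above):
with open(args.file) as f:
    for line in f: # iterate over the file object directly
        for matched in reSeqs.finditer(line): # every fuzzy match on this line
            keyword = next(kw for kw, val in matched.groupdict().items() if val)
            print("Keyword " + keyword + " found as " + matched[0])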
If you need further optimizations, have memory to burn and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension over matches is more performant:
with open(args.file) as f:
    contents = f.read() # slurp entire file
matches = [
    {val: kw for kw, val in found.groupdict().items() if val}
    for found in reSeqs.finditer(contents)
]
This solution doesn't distinguish between repetitions of a given sequence; in particular, repetitions on a single line are lost. You could merge entries having the same keys into lists, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:
matches = [
    {(val, found.span()): kw for kw, val in found.groupdict().items() if val}
    for found in reSeqs.finditer(contents)
]
To merge:
results = {}
for match in matches:
    results.update(match)
# or:
results = {k: v for d in matches for k, v in d.items()}
If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:
# implementation of `chunks` left as an exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
            (val, found.span()): kw for kw, val in found.groupdict().items() if val
        })
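Since `chunks` is left as an exercise above, here is one possible line-based version (purely illustrative; the buffering strategy is my assumption, chosen so that no line, and therefore no keyword, is split across chunk boundaries):
def chunks(file, size=2**12):
    # accumulate whole lines until roughly `size` characters, then yield the block
    buf, buf_len = [], 0
    for line in file:
        buf.append(line)
        buf_len += len(line)
        if buf_len >= size:
            yield ''.join(buf)
            buf, buf_len = [], 0
    if buf:
        yield ''.join(buf)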
I'm very new to Python and I'm stuck on a task. First I read a file containing a number of fasta sequences with sequence names into a dictionary, then managed to select only the ones I want, based on substrings included in the keys, which are defined in the list "flu_genes".
Now I'm trying to reorder the items in this dictionary based on the order of substrings defined in the list "flu_genes". I'm completely stuck; I found a way of reordering based on the key order in a list, BUT that is not my case, as the order is defined not by the keys themselves but by a substring within the keys.
I should also add that in this case the substring is at the end, in the format "_GENE"; however, it could also be in the middle of the string with the same format, perhaps "GENE", so I'd rather not rely on code that only finds the substring at the end of the string.
I hope this is clear enough and thanks in advance for any help!
"full_genome.fasta"
>A/influenza/1/1_NA
atgcg
>A/influenza/1/1_NP
ctgat
>A/influenza/1/1_FluB
agcta
>A/influenza/1/1_HA
tgcat
>A/influenza/1/1_FluC
agagt
>A/influenza/1/1_M
tatag
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            key = line[1:]
        else:
            if key in consensus:
                consensus[key] += line
            else:
                consensus[key] = line
flu_fas = {key: val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
>>>Dictionary after removal of keys : {'>A/influenza/1/1_NA': 'atgcg', '>A/influenza/1/1_NP': 'ctgat', '>A/influenza/1/1_HA': 'tgcat', '>A/influenza/1/1_M': 'tatag'}
#reordering by key order (not going to work!) as in: https://try2explore.com/questions/12586065
reordered_dict = {k: flu_fas[k] for k in flu_genes}
A dictionary is fundamentally unordered, but since Python 3.7 it is guaranteed to remember its insertion order, and you're not going to change anything later, so you can do what you're doing.
The problem is, of course, that you're not working with the actual keys. So let's just set up a list of the keys, and sort that according to your criteria. Then you can do the other thing you did, except using the actual keys.
flu_genes = ['_HA', '_NP', '_NA', '_M']

def get_gene_index(k):
    for index, gene in enumerate(flu_genes):
        if k.endswith(gene):
            return index
    raise ValueError('I thought you removed those already')

reordered_keys = sorted(flu_fas.keys(), key=get_gene_index)
reordered_dict = {k: flu_fas[k] for k in reordered_keys}

for k, v in reordered_dict.items():
    print(k, v)
A/influenza/1/1_HA tgcat
A/influenza/1/1_NP ctgat
A/influenza/1/1_NA atgcg
A/influenza/1/1_M tatag
Normally, I wouldn't do an n-squared sort, but I'm assuming the number of lines in the data file is much larger than the number of flu_genes, making the latter essentially a fixed constant.
This may or may not be the best data structure for your application, but I'll leave that to code review.
It's because you are trying to reorder it with non-existent dictionary keys. Your keys are
['>A/influenza/1/1_NA', '>A/influenza/1/1_NP', '>A/influenza/1/1_HA', '>A/influenza/1/1_M']
which doesn't match the list
['_HA', '_NP', '_NA', '_M']
You first need to transform them to make them match, and since we know the pattern is at the end of the string, starting with an underscore, we can split at underscores and take the last piece.
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            sequence = line
            gene = line.split('_')[-1]
            key = f"_{gene}"
        else:
            consensus[key] = {
                'sequence': sequence,
                'data': line
            }
flu_fas = {key: val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
reordered_dict = {k: flu_fas[k] for k in flu_genes}
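With the sample full_genome.fasta from the question, reordered_dict then comes out in gene order (shown here only as a sanity check):
print(reordered_dict)
# {'_HA': {'sequence': '>A/influenza/1/1_HA', 'data': 'tgcat'},
#  '_NP': {'sequence': '>A/influenza/1/1_NP', 'data': 'ctgat'},
#  '_NA': {'sequence': '>A/influenza/1/1_NA', 'data': 'atgcg'},
#  '_M': {'sequence': '>A/influenza/1/1_M', 'data': 'tatag'}}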
I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key : values for word embeddings, and (2) my pandas dataframe with my strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
import numpy as np

embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000 # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000 # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
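As a small variation (a matter of style rather than additional speed), the try/except could be replaced with dict.get, keeping the same 400000 sentinel for unknown words:
for word in doc:
    idx = word_to_index.get(word, 400000) # 400000 = sentinel index for unknown words
    if idx == 400000:
        no_matches.append(word)
    ints.append(idx)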
In the line you marked as a problem, you are first creating a list from the keys and then looking up the word in that list. You're doing this inside the loop, so the first thing you could do is move that logic to the top of the block (outside the loop) to avoid repeated processing; second, all this searching is now done on a list, not a dictionary.
Why not create another dictionary like this at the top of the file:
reverse_lookup = {word: index for index, word in enumerate(embeddings_dictionary.keys())}
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        no_matches.append(word)
I must write a dictionary. This is my first time doing it and I can't wrap my head around it. The first 5 characters of each line should be the key and the rest the value.
for i in verseny:
    if i not in eredmeny:
        eredmeny[i] = 1
    else:
        eredmeny[i] += 1
YS869 CCCADCADBCBCCB is a line from the homework. YS869 should be the key and CCCADCADBCBCCB should be the value.
The problem is that I can't store them in a dictionary. I'm grinding gears here but getting nowhere.
Assuming that eredmeny is your dictionary name and that verseny is the list that contains your key and value strings, this should do it:
verseny = ['YS869 CCCADCADBCBCCB', 'CS769 CCCADCADBCBCCB', 'BS869 CCCADCADBCBCCB']
eredmeny = {}
for i in verseny:
    key, value = i.split(' ', 1)
    if key not in eredmeny:
        eredmeny[key] = [value]  # store values in a list so repeated keys can be collected
    else:
        eredmeny[key].append(value)
I'm not really understanding the question well, but an easy way to do the task at hand would be taking each line as a string and then using split():
line = 'YS869 CCCADCADBCBCCB'
words = line.split()
d = {words[0]: words[1]}
print(d)  # {'YS869': 'CCCADCADBCBCCB'}
This is a basic skill in Python; I hope you can refer to the existing materials. For your example, the following demonstration is given:
line = 'YS869 CCCADCADBCBCCB'
dictionary = {}
dictionary[line[:5]] = line[6:]
print(dictionary) # {'YS869': 'CCCADCADBCBCCB'}
I wrote a function nextw(fname, enc) that, given a book in .txt format, returns a dictionary with a word as the key and the adjacent word as the value.
For example, if my book has three occurrences of 'go', one 'go on' and two 'go out', then looking up dictionary['go'] should give ['on', 'out'] without repetitions. Unfortunately it doesn't work, or rather it works but only keeps the last adjacent word: with my book it returns just 'on' as a string, which I've checked and is indeed the word adjacent to the last 'go'. How can I make it work as intended? Here's the code:
def nextw(fname, enc):
    with open(fname, encoding=enc) as f:
        d = {}
        data = f.read()
        #removes non-alphabetical characters from the book#
        for char in data:
            if not char.isalpha():
                data = data.replace(char, ' ')
        #converts the book into lower-case and splits it into a list of words#
        data = data.lower()
        data = data.split()
        #iterates over words#
        for index in range(len(data)-1):
            searched = data[index]
            adjacent = data[index+1]
            d[searched] = adjacent
        return d
I think your problem lies here: d[searched] = adjacent. Each assignment overwrites whatever was stored before, which is why only the last adjacent word survives. You need something like:
if searched not in d:
    d[searched] = list()
d[searched].append(adjacent)
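If repetitions should be avoided, as the expected output ['on', 'out'] in the question suggests, one option (a sketch) is to check membership before appending:
if searched not in d:
    d[searched] = []
if adjacent not in d[searched]:
    d[searched].append(adjacent)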