Is there a faster way to look up dictionary indices? - python

I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key: value pairs for word embeddings, and (2) my pandas dataframe with strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed, or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
import numpy as np

embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)

You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
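As a small follow-up, a roughly equivalent sketch replaces the try/except with dict.get() and the same 400000 sentinel (word_to_index, no_matches and glove_tokenized_data as defined above):

for doc in df['body'][:5]:
    ints = []
    for word in doc.split():
        idx = word_to_index.get(word, 400000)  # 400000 = "unknown" sentinel from the question
        if idx == 400000:
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)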

In the line you marked as a problem, you first build a list from the keys and then search that list for the word. You do this inside the loop, so the first improvement is to move that work to the top of the block (outside the loop) to avoid repeating it; the second issue is that the search itself is a linear scan over a list, not a constant-time dictionary lookup.
Why not create another dictionary like this on top of the file:
reverse_lookup = {word: index for index, word in enumerate(embeddings_dictionary.keys())}
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        no_matches.append(word)

Related

How to speed up this word-tuple finding algorithm?

I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file along with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs, and counts how many times each pair was seen in the corpus. I have managed to write the code below:
tuples = [[]]  # array for word tuples to be stored in
data = []  # array for tuple frequencies to be stored in
data.append(0)  # tuples array starts with an empty element at the beginning for some reason.
# Adding zero to the beginning of the frequency array levels the indexes of the two arrays
with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split()  # split line into an array of words
        tupleIndex = 0
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
            if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before
                data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
            else:
                tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])  # if the word pair was never seen before,
                data.append(1)  # add the pair to the list and set its frequency to 1
        # print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write tuples to txt file
    for pair in tuples:
        if (len(pair) > 0):  # if tuple is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines between the two data blocks
    # write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to work well for the first couple of thousand lines. Then things start to slow down because the tuple list keeps growing and I have to search the whole list to check whether the next word pair has been seen before. I managed to process 50k lines in 30 minutes, but I have much bigger corpora with millions of lines. Is there a way to store and search the word pairs more efficiently? A matrix would probably be a lot faster, but my unique word count is about 300,000 words, which means I would have to create a 300k*300k matrix of integers. Even after taking advantage of symmetric matrices, it would require a lot more memory than I have.
I tried using memmap from numpy to store the matrix on disk rather than in memory, but it required about 500 GB of free disk space.
Then I studied sparse matrices and found out that I can store just the non-zero values and their corresponding row and column numbers, which is what I did in my code.
Right now this model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train with bigger corpora to get better results. What can I do to make this word-pair finding code more efficient?
Thanks.
Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {}  # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split()  # split line into an array of words
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
            if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
            else:
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1  # if the word pair was never seen before,
                # add the pair to the dict and set its frequency to 1
        # print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)
end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
As I understand you, you are trying to build a Markov model, using frequencies of n-grams (word tuples of length n). Maybe just try out a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1: {ID_word1: x1, ... ID_wordk: y1}, ... ID_wordk: {ID_word1: xn, ... ID_wordk: yn}}.
This would mean that you only have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you do not have to search a growing list of tuples. x and y represent the occurrence counts, which you should increment when encountering a tuple. (Never use the built-in function count()!)
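A minimal sketch of that nested-dictionary idea, assuming the markovData.txt file from the question (words or IDs separated by spaces), might look like:

from collections import defaultdict

pair_counts = defaultdict(dict)  # first word -> {second word: count}
with open("markovData.txt") as f:
    for line in f:
        words = line.split()
        for first, second in zip(words, words[1:]):
            inner = pair_counts[first]
            inner[second] = inner.get(second, 0) + 1

# Looking up a pair is now two constant-time dict accesses,
# e.g. pair_counts["the"].get("cat", 0), instead of a list scan.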
I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary but counts the occurrences of a key entry. You could use this by simply incrementing a word pair as you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs and count each one
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can construct the list of word pairs as you already do and simply pass it to a Counter at the end to compute the frequencies:
word_counts = Counter(word_tuple_list)
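For example, a minimal sketch of that variant, again assuming the markovData.txt format from the question:

from collections import Counter

pairs = []
with open("markovData.txt") as f:
    for line in f:
        words = line.split()
        pairs.extend(zip(words, words[1:]))  # adjacent word pairs of the line

word_counts = Counter(pairs)
print(word_counts.most_common(10))  # ten most frequent pairs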

Python 3x : Dictionary function that returns adjacent words

I wrote a function nextw(fname, enc) that, from a book in .txt format, returns a dictionary with a word as key and the adjacent word as value.
For example, if my book has three 'go', one 'go on' and two 'go out', then if I search dictionary['go'] my output should be ['on', 'out'] without repetitions. Unfortunately it doesn't work, or rather it only keeps the last adjacent word: with my book it returns just 'on' as a string, which I've checked is indeed the word adjacent to the last 'go'. How can I make it work as intended? Here's the code:
def nextw(fname, enc):
    with open(fname, encoding=enc) as f:
        d = {}
        data = f.read()
        # removes non-alphabetical characters from the book
        for char in data:
            if not char.isalpha():
                data = data.replace(char, ' ')
        # converts the book into lower-case and splits it into a list of words
        data = data.lower()
        data = data.split()
        # iterates on words
        for index in range(len(data) - 1):
            searched = data[index]
            adjacent = data[index + 1]
            d[searched] = adjacent
        return d
I think your problem lies here: d[searched] = adjacent overwrites the previous value on every iteration, so only the last adjacent word survives. You need something like:
if searched not in d:
    d[searched] = list()
d[searched].append(adjacent)
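If you also want to drop the repetitions the question mentions, one possible sketch (an alternative, not the answer above) stores a set per key, for example with collections.defaultdict:

from collections import defaultdict

def nextw(fname, enc):
    with open(fname, encoding=enc) as f:
        data = f.read()
    # keep only alphabetical characters, as in the question
    data = ''.join(ch if ch.isalpha() else ' ' for ch in data)
    d = defaultdict(set)  # a set per key drops duplicate adjacent words
    words = data.lower().split()
    for searched, adjacent in zip(words, words[1:]):
        d[searched].add(adjacent)
    return d

With the example from the question, d['go'] would then contain something like {'on', 'out'}.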

Need faster code for replacing strings with values from dictionary

This is how I applied a dictionary for stemming. My dictionary (d) is imported and is now in this format: d = {'nada.*': 'nadas', 'mila.*': 'milas'}
I wrote the code below to stem tokens, but it runs TOO SLOW, so I stopped it before it finished. I guess the problem is that the dict is large and there is a large number of tokens.
So, how can I implement my stem dictionary so that the code can run at a normal speed?
I tried to find a method in the nltk package for applying a custom dict, but I didn't find one.
import re
import nltk

# import stem dict
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[key.replace("\n", "")] = val.replace("\n", "")

# define tokenizer
def custom_tokenizer(text):
    # split on whitespace
    tokens = nltk.tokenize.word_tokenize(text)
    # stemmer
    for i, token in enumerate(tokens):
        for key, val in d.items():
            if re.match(key, token):
                tokens[i] = val
                break
    return tokens
Dictionary sample:
bank.{1}$:banka
intes.{1}$:intesa
intes.{1}$:intesa
intez.{1}$:intesa
intezin.*:intesa
banke:banka
banaka:banka
bankama:banka
post_text sample:
post_text = [
    'Banca intesa #nocnamora',
    'Banca intesa',
    'banka haosa i neorganizovanosti!',
    'Cucanje u banci umesto setnje posle rucka.',
    "Lovin' it #intesa'"
]
Notice that while the keys in your stem dict are regexes, they all start with a short string of some specific characters. Let's say the minimum length of specific characters is 3. Then, construct a dict like this:
d = {
    'ban': [('bank.$', 'banka'),
            ('banke', 'banka'),
            ('banaka', 'banka'),
            ('bankama', 'banka'),
            ],
    'int': [('inte[sz].$', 'intesa'),
            ('intezin.*', 'intesa'),
            ],
}
Of course, you should re.compile() all those patterns at the beginning.
Then you can do a cheaper, three-character lookup in this dict:
def custom_tokenizer(text):
    tokens = nltk.tokenize.word_tokenize(text)
    for i, token in enumerate(tokens):
        for key, val in d.get(token[:3], []):
            if re.match(key, token):
                tokens[i] = val
                break
    return tokens
Now instead of checking all 500 stems, you only need to check the few that start with the right prefix.
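One possible way to build that prefix-keyed dict with pre-compiled patterns, assuming the 'pattern:replacement' file format from the question, might be sketched like this:

import re

d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        key, val = line.split(":")
        # group patterns by their first three literal characters
        d.setdefault(key[:3], []).append((re.compile(key), val))

Inside the tokenizer you could then call key.match(token) on the compiled pattern instead of re.match(key, token).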

Looping through a dictionary in python

I am creating a main function which loops through a dictionary that has one key for all the values associated with it. I am having trouble because I cannot get the dictionary to be all lowercase; I have tried using .lower() but to no avail. Also, the program should look at the words of the sentence, determine whether it has seen more of those words in sentences that the user has previously labeled "happy", "sad", or "neutral" (based on the three dictionaries), and make a guess as to which label to apply to the sentence.
An example output would look like this:
Sentence: i started screaming incoherently about 15 mins ago, this is B's attempt to calm me down.
0 appear in happy
0 appear in neutral
0 appear in sad
I think this is sad.
You think this is: sad
Okay! Updating.
CODE:
import csv

def read_csv(filename, col_list):
    """This function expects the name of a CSV file and a list of strings
    representing a subset of the headers of the columns in the file, and
    returns a dictionary of the data in those columns, as described below."""
    with open(filename, 'r') as f:
        # Better convert reader to a list (items represent every row)
        reader = list(csv.DictReader(f))
        dict1 = {}
        for col in col_list:
            dict1[col] = []
            # Going over every row of the file
            for row in reader:
                # Append the row item of this key to the list
                dict1[col].append(row[col])
        return dict1

def main():
    dictx = read_csv('words.csv', ['happy'])
    dicty = read_csv('words.csv', ['sad'])
    dictz = read_csv('words.csv', ['neutral'])
    dictxcounter = 0
    dictycounter = 0
    dictzcounter = 0
    a = str(raw_input("Sentence: ")).split(' ')
    for word in a:
        for keys in dictx['happy']:
            if word == keys:
                dictxcounter = dictxcounter + 1
        for values in dicty['sad']:
            if word == values:
                dictycounter = dictycounter + 1
        for words in dictz['neutral']:
            if word == words:
                dictzcounter = dictzcounter + 1
    print dictxcounter
    print dictycounter
    print dictzcounter
Remove this line from your code:
dict1 = dict((k, v.lower()) for k,v in col_list)
It overwrites the dictionary that you built in the loop.
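If the goal is still case-insensitive matching, a possible sketch (hypothetical, not part of the answer above) is to lowercase both the stored values and the input words instead:

# inside read_csv(): lowercase each value as it is stored
for row in reader:
    dict1[col].append(row[col].lower())

# inside main(): lowercase the input sentence before splitting
a = str(raw_input("Sentence: ")).lower().split(' ')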

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I am starting out on a Python task and I am facing a problem while using gensim. I am trying to load files from my disk and process them (split them and lowercase them).
The code I have is below:
import glob
import os
from gensim import corpora

dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)
The list (dictionary_arr) contains all the words across all the files; I then use gensim's corpora.Dictionary to process the list. However, I get this error:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
I can't understand what the problem is. A little guidance would be appreciated.
In dictionary.py, the __init__ function is:
def __init__(self, documents=None):
    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared
    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix
    if documents is not None:
        self.add_documents(documents)
The function add_documents builds a dictionary from a collection of documents, where each document is a list of tokens:
def add_documents(self, documents):
    for docno, document in enumerate(documents):
        if docno % 10000 == 0:
            logger.info("adding document #%i to %s" % (docno, self))
        _ = self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
    logger.info("built %s from %i documents (total %i corpus positions)" %
                (self, self.num_docs, self.num_pos))
So, if you initialize Dictionary this way, you must pass a collection of documents, not a single document. For example,
dic = corpora.Dictionary([a.split()])
is OK.
Dictionary needs tokenized strings (a list of token lists) as its input:
dataset = ['driving car ',
           'drive car carefully',
           'student and university']

# be sure to split each sentence before feeding it into Dictionary
dataset = [d.split() for d in dataset]
vocab = Dictionary(dataset)
Hello everyone, I ran into the same problem. This is what worked for me:
#Tokenize the sentence into words
tokens = [word for word in sentence.split()]
#Create dictionary
dictionary = corpora.Dictionary([tokens])
print(dictionary)
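Applied to the code in the original question (path as defined there), the same fix would look roughly like this, with one token list per file instead of one flat word list:

import glob
import os
from gensim import corpora

documents = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        # one document = one list of tokens, so Dictionary receives a list of lists
        documents.append(myfile.read().lower().split())
dictionary = corpora.Dictionary(documents)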
