I've implemented an inverted index in Python. It is essentially a dictionary whose keys are the words in the corpus and whose values are lists of tuples, each pairing a document the word occurs in with its BM25 score:
{
"love": [(doc1, 12), (doc3, 7.9), (doc5, 6.5)],
"hate": [(doc2, 8.7), (doc4, 3.2)]
}
However, when I process a query, I find it hard to benefit from the efficiency of the inverted index, because I must iterate over all the words in the query in a for loop. Within this loop, I must further loop over the documents each word links to and maintain a global score table for all documents.
I don't think this is optimal. Any ideas for speeding it up? I think a batch dictionary that accepts multiple keys and returns multiple values in parallel would help.
It should be more efficient to represent the inverted index as a matrix, in particular a sparse matrix, where the rows are the words in your corpus and the columns are the documents.
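To make the matrix view concrete, here is a minimal plain-Python sketch, not a tuned implementation. It stores each word's row sparsely as a `{document: score}` dict and scores a query by summing the rows of its words, which is what multiplying a sparse word-by-document matrix with a 0/1 query vector would compute. The names `index` and `score_query`, the string document ids, and the extra `doc1` entry under "hate" are assumptions made for illustration. A sparse-matrix library would replace the inner loops with one vectorized operation.

```python
from collections import defaultdict
import heapq

# Sparse "word x document" matrix: one row per word, only non-zero cells stored.
# Document ids are plain strings here (an assumption for the example).
index = {
    "love": {"doc1": 12.0, "doc3": 7.9, "doc5": 6.5},
    "hate": {"doc2": 8.7, "doc4": 3.2, "doc1": 1.5},
}

def score_query(index, query_words, top_k=3):
    """Sum the rows of the query words and return the top_k (doc, score) pairs."""
    scores = defaultdict(float)
    for word in query_words:
        # Words missing from the index contribute an empty row.
        for doc, score in index.get(word, {}).items():
            scores[doc] += score
    # nlargest avoids sorting the whole score table when top_k is small.
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])

ranking = score_query(index, ["love", "hate", "unknown"])
```

Only documents that contain at least one query word ever enter the score table, so the work is proportional to the total length of the posting lists touched, not to the size of the corpus.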
I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array and the lengths of the individual arrays within it are unknown until the user has input the information. I would like to sort the individual array pieces based on a value associated with them. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by a list of possible components, each having a "score" associated with it that is the likelihood that this component is causing the failure. My goal is to reorder the array so that the components and their scores are in descending order by score, i.e., each component and its score need to be moved together. The problem, like I said, is that I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. The failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out based on all the possible scenarios. Some possible scenarios I have thought of:
The array is already in order (move to the next failure symptom)
The first component is correct, but the ones after may not be. Or the first two are correct, but the ones after may not be, etc...
The array is completely backwards in order
The array only contains 1 component, therefore there is no need to sort
The array is in some random order, so some positions for some components may already be in the correct spot while some others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!
Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms; you also want to keep the key/value pairs together.
The second problem is that the keys are strings with numbers in them. Simple string comparison wouldn't work, because strings are compared character by character, so "test9" > "test11" would be true (the comparison stops at the first differing character, and "9" > "1").
The simplest solution I figured out would be the following:
# get the failure id of one list
def failureId(value):
    return int(value[0].replace("failure", ""))

# get the id of one component
def componentId(value):
    return int(value.replace("component", ""))

# sort one failure list in place using bubble sort
def sortFailure(failure):
    # one pass per component (only the keys at odd indices, ignoring the values)
    for i in range(1, len(failure), 2):
        # each pass bubbles the largest remaining id to the end,
        # so the compared range shrinks by one pair per pass
        for j in range(1, len(failure) - 1 - i, 2):
            # comparing the component ids
            if componentId(failure[j]) > componentId(failure[j + 2]):
                # swapping keys and values
                failure[j], failure[j + 2] = failure[j + 2], failure[j]
                failure[j + 1], failure[j + 3] = failure[j + 3], failure[j + 1]

# sorting the full list
def sortData(data):
    # sorting the failures using the default sort algorithm
    data.sort(key=failureId)
    # sorting each failure's own list of components
    for failure in data:
        sortFailure(failure)

data = [['failure2', 'component2', 0.15, 'component1', 0.85],
        ['failure3', 'component1', 0.95],
        ['failure1', 'component1', 0.05, 'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions are required to get the numbers (the ids) from the strings, as mentioned above. The third function, sortFailure, uses bubble sort to sort the array. It uses a step of 2 in the range function because we want to skip the values for each component. If two components are in the wrong order, we swap both the keys and their values. In the sortData function we use the built-in sort method for lists to sort the whole list (by failure id). Then we take each sublist and sort it using sortFailure.
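Note that the code above orders components by id, while the question asks for descending order by score. A minimal sketch of the score-based ordering follows, assuming every failure list has the layout from the example (symptom first, then alternating component and score). The helper name `sort_by_score` is made up for illustration. It pairs each component with its score, lets the built-in sort do the work, and flattens the pairs again, so none of the scenarios listed in the question needs special handling.

```python
def sort_by_score(failure):
    """Return a new list with (component, score) pairs ordered by score, highest first."""
    symptom, rest = failure[0], failure[1:]
    # Pair each component with the score that follows it.
    pairs = list(zip(rest[0::2], rest[1::2]))
    # list.sort() is stable, so components with equal scores keep their input order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    # Flatten the pairs back into the original layout.
    return [symptom] + [item for pair in pairs for item in pair]

data = [
    ['failure2', 'component2', 0.15, 'component1', 0.85],
    ['failure3', 'component1', 0.95],
    ['failure1', 'component1', 0.05, 'component3', 0.8,
     'component2', 0.1, 'component4', 0.05],
]
sorted_data = [sort_by_score(failure) for failure in data]
```

A list with a single component, an already sorted list, and a fully reversed list all go through the same three steps.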
I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time on my algorithm
2) My next problem, which is that one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
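The escalation scheme described above can be sketched in plain Python as follows. This is only an illustration of the control flow, not the pipeline's code: `escalating_sampling` and the `is_conclusive` callback are hypothetical names, and the callback stands in for the unspecified condition. The with/without replacement rule mirrors the one used in the function shown later in the question.

```python
import random

def escalating_sampling(nulls, is_conclusive,
                        sizes=(1_000, 10_000, 100_000, 1_000_000),
                        rng=random):
    """Sample nulls at increasing sizes until is_conclusive(sample) is true.

    `is_conclusive` is a hypothetical callback standing in for the
    condition mentioned in the text. Returns (sample, size_used).
    """
    sample, used = [], None
    for size in sizes:
        used = size
        if size >= len(nulls):
            sample = rng.choices(nulls, k=size)  # sample with replacement
        else:
            sample = rng.sample(nulls, size)     # sample without replacement
        if is_conclusive(sample):
            break  # no need to escalate further for this variant
    return sample, used

nulls = [i / 5000 for i in range(5000)]
sample, used = escalating_sampling(
    nulls, lambda s: len(s) >= 10_000, rng=random.Random(0))
```

Running the escalation per variant means that only the variants whose condition fails pay for the larger sample sizes.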
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
Basically, all this does is call my random_sampling_for_variant function on each variant in my variant list and put the two outputs from that function into a list of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations >= number of nulls, sample with replacement
    replace = num_permutations >= len(permuted_null_variant_table)
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return temp_names_df, temp_p_values_df
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows in there determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints and all my floats into 16-bit floats. I've dropped the columns I don't need, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is 3/5 that of the loop. Any feedback is appreciated.
I have thousands of tables, each of which contains hundreds of words with their corresponding scores in the second column, and I need to calculate the correlation of each pair of tables.
So, I started by reading each table and converting it to a dictionary; each word is a dictionary key, and its score is the value.
Now it is time to calculate the correlations. I should mention that the dictionaries do not necessarily all have the same keys; some have more, some fewer. Each dictionary should be expanded according to its pair: if one dictionary has a key that does not exist in the other, the other should be updated with that key and a value of 0, and only then should the correlation coefficient be calculated.
example:
dict1 = {'car': 0.1, 'dog':0.3, 'tiger':0.5, 'lion': 0.1, 'fish':0.2}
dict2 = {'goat':0.3, 'fish':0.3, 'shark':0.4, 'dog':0.3}
so the two dictionaries should end up looking like:
dict1.comparable = {'car': 0.1, 'goat': 0.0, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2, 'shark': 0.0}
dict2.comparable = {'car': 0.0, 'goat': 0.3, 'dog': 0.3, 'fish': 0.3, 'shark': 0.4, 'tiger': 0, 'lion': 0}
and then the correlation of their values should be calculated.
I would appreciate advice on how to calculate the similarity/correlation of dictionaries based on their values efficiently.
UPDATE
Here is a post which explains how to compute the correlation coefficient technically.
Here is the simplest version:
import numpy
numpy.corrcoef(list1, list2)[0, 1]
but it only works on lists. Basically, I am after calculating the correlation coefficient of two dictionaries with respect to their keys in an efficient manner (with as little expanding and sorting of keys as possible).
import numpy

# dict key views support set operations; on Python 2.7 use viewkeys() instead of keys()
keys = list(dict1.keys() | dict2.keys())
numpy.corrcoef(
    [dict1.get(x, 0) for x in keys],
    [dict2.get(x, 0) for x in keys])[0, 1]
First you get all the keys. There is no need to sort them, but de-duplication is needed. Storing them as a list lets you iterate over them in the same order later.
Then you can create the 2 lists that numpy requires.
Don't add zeros to the dictionary. Those are just bloat, and they drop out when the similarity is calculated. Leaving out zeros will already save you some, if not a lot of, time.
Then, to calculate the similarity, start with the shorter of the two dictionaries. For each key in the shorter one, check whether the key is in the longer one. That also saves a lot of time, because looping over a dict with N items takes O(N) time, while checking whether an item is in the larger dict takes only O(1) time on average.
Don't create the intermediate dictionaries if they are only needed to calculate similarity. It wastes time and memory.
To eventually calculate similarity, you can try the cosine metric, Euclidean distance, or something else, depending on your needs.
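As a sketch of that approach, here is cosine similarity computed directly on the two dictionaries without padding either of them with zeros. Note that this is the cosine metric, not the Pearson correlation asked about in the question, and the function name `cosine_similarity` and the zero-vector convention are choices made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {key: value} dicts."""
    # Iterate over the smaller dict and probe the larger one (O(1) per lookup).
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot = sum(value * large[key] for key, value in small.items() if key in large)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

dict1 = {'car': 0.1, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2}
dict2 = {'goat': 0.3, 'fish': 0.3, 'shark': 0.4, 'dog': 0.3}
similarity = cosine_similarity(dict1, dict2)
```

Only the shared keys ('dog' and 'fish' here) contribute to the dot product, and each norm depends on a single dictionary, so with thousands of tables the norms can be computed once per table and reused for every pair.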
So for fun, I decided to revisit an old college assignment in which a ciphertext of about 75 characters was given, along with a crib: the message was signed with three letters (my teacher's initials).
What I've done:
1) Hemmed down the results to those that have part or all of the crib in them.
2) Then I started doing some letter frequency analysis on the smaller subset of results from (1).
Now the task boils down to writing some language recognition software, but there are a few issues to deal with first. I chose to brute force all the rotor settings (type, initial pos), so the resulting entries with part or all of the crib in them still have some letters swapped from the plugboard.
I know my next move should be to make two matrices and digest a corpus where in the first matrix, I would just do a tally, so if the first letter was an A, in the first matrix, I would be at row 0, and the column I would increase would be the letter directly following the A, say it was a B. Then I would move over to the B and see that the next letter is a U so I would go to row B and increase column U's entry. After digesting a whole corpus, I would put probabilities into the second matrix.
Using the second matrix, I could assign score values to entire sentences and have a means of scoring the outputs and further hemming down the results so finding the message should be easy as finding a pin in a MUCH smaller haystack.
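The two-matrix plan can be sketched as below, using the ord-subtraction indexing that the question goes on to consider. This is a minimal illustration under a few assumptions that are not in the original: the function names are made up, non-letters are dropped (so letters on either side of a space count as adjacent), add-one smoothing keeps unseen letter pairs from producing log(0), and sentences are scored by summing log probabilities.

```python
import math
import string

ALPHABET = string.ascii_uppercase
SIZE = len(ALPHABET)

def bigram_counts(corpus):
    """First matrix: counts[i][j] = times letter j directly follows letter i."""
    counts = [[0] * SIZE for _ in range(SIZE)]
    # Non-letters are dropped, so letters across a space count as adjacent.
    letters = [c for c in corpus.upper() if c in ALPHABET]
    for first, second in zip(letters, letters[1:]):
        counts[ord(first) - ord('A')][ord(second) - ord('A')] += 1
    return counts

def log_probabilities(counts, smoothing=1.0):
    """Second matrix: log P(second | first), with add-one smoothing (an assumption)."""
    table = []
    for row in counts:
        total = sum(row) + smoothing * SIZE
        table.append([math.log((n + smoothing) / total) for n in row])
    return table

def score(text, table):
    """Sum of bigram log probabilities; higher means more corpus-like."""
    letters = [c for c in text.upper() if c in ALPHABET]
    return sum(table[ord(a) - ord('A')][ord(b) - ord('A')]
               for a, b in zip(letters, letters[1:]))

corpus = "the quick brown fox jumps over the lazy dog " * 3
table = log_probabilities(bigram_counts(corpus))
```

Candidates of equal length can then be ranked by `score(candidate, table)`; for candidates of different lengths, dividing by the number of letter pairs gives a fairer comparison.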
Now I'm doing this in python and I wanted to know if it is better to cast chars to ints, do a subtraction of the smallest char 'A' and then use that as my index, or if I should use a dict and every letter would correspond to an int value and so finding the indices for the location in my matrices would look something like LetterTally[dict['A']][dict['B']].
The cast subtraction method would look like this:
firstChar = 'A'
secondChar = 'B'
LetterTally[ord(firstChar) - ord('A')][ord(secondChar) - ord('A')]
Of these two different methods, which is going to be faster?
Instead of building a matrix, did you consider having a dict of dicts so that you can do the lookup (LetterTally['A']['B']) directly?
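A minimal sketch of the dict-of-dicts idea, using `collections.defaultdict` and `collections.Counter` so that missing entries read as 0. The function name `bigram_tally` and the tiny sample corpus are assumptions for illustration. Which variant is faster for a given workload is easiest to settle by measuring both with `timeit`.

```python
from collections import Counter, defaultdict

def bigram_tally(corpus):
    """Nested mapping: tally['A']['B'] = times 'B' directly follows 'A'."""
    tally = defaultdict(Counter)
    # Non-letters are dropped, so letters across a space count as adjacent.
    letters = [c for c in corpus.upper() if c.isalpha()]
    for first, second in zip(letters, letters[1:]):
        tally[first][second] += 1
    return tally

LetterTally = bigram_tally("ABU ABA")
ab_count = LetterTally['A']['B']
```

One caveat with this sketch: reading `LetterTally[x]` for a first letter that never occurred inserts an empty `Counter` for it, which is harmless for tallying but worth knowing when iterating over the keys later.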
I'm implementing feature vectors as bit maps for documents in a corpus. I already have the vocabulary for the entire corpus (as a list/set) and a list of the terms in each document.
For example, if the corpus vocabulary is ['a', 'b', 'c', 'd'] and the terms in document d1 are ['a', 'b', 'd', 'd'], the feature vector for d1 should be [1, 1, 0, 2].
To generate the feature vector, I'd iterate over the corpus vocabulary and check if each term is in the list of document terms, then set the bit in the correct position in the document's feature vector.
What would be the most efficient way to implement this? Here are some things I've considered:
Using a set would make checking vocab membership very efficient but sets have no ordering, and the feature vector bits need to be in the order of the sorted corpus vocabulary.
Using a dict for the corpus vocab (mapping each vocab term to an arbitrary value, like 1) would allow iteration over sorted(dict.keys()) so I could keep track of the index. However, I'd have the space overhead of dict.values().
Using a sorted(list) would be inefficient to check membership.
What would StackOverflow suggest?
I think the most efficient way is to loop over each document's terms, get the position of the term in the (sorted) corpus and set the bit accordingly.
The sorted list of corpus terms can be stored as a dictionary with a term -> index mapping (basically an inverted index).
You can create it like so:
corpus = dict(((term, index) for index, term in enumerate(sorted(all_words))))
For each document you'd have to generate a list of 0's as feature vector:
num_words = len(corpus)
fvs = [[0]*num_words for _ in docs]
Then building the feature vectors would be:
for i, doc_terms in enumerate(docs):
    fv = fvs[i]
    for term in doc_terms:
        fv[corpus[term]] += 1
There is no overhead in testing membership, you just have to loop over all terms of all documents.
That all said, depending on the size of the corpus, you should have a look at numpy and scipy. It is likely that you will run into memory problems and scipy provides special datatypes for sparse matrices (instead of using a list of lists) which can save a lot of memory.
You can use the same approach as shown above, but instead of adding numbers to list elements, you add them to matrix elements (e.g. the rows will be the documents and the columns the terms of the corpus).
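As a sketch of the sparse variant, the loop below collects only the non-zero cells as three parallel lists instead of filling a dense list of lists. The helper name `sparse_triplets`, the second document, and the decision to skip out-of-vocabulary terms are assumptions for this example. The lists are in coordinate form, which is the layout `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(docs), len(corpus)))` accepts if SciPy is available.

```python
from collections import Counter

def sparse_triplets(docs, vocab_index):
    """Return (rows, cols, data) for the non-zero cells of the document-term matrix."""
    rows, cols, data = [], [], []
    for row, doc_terms in enumerate(docs):
        # Count each in-vocabulary term once per document; unknown terms are skipped.
        counts = Counter(t for t in doc_terms if t in vocab_index)
        for term, count in sorted(counts.items()):
            rows.append(row)                # document index
            cols.append(vocab_index[term])  # position of the term in the sorted vocabulary
            data.append(count)              # term frequency in this document
    return rows, cols, data

all_words = ['a', 'b', 'c', 'd']
corpus = {term: index for index, term in enumerate(sorted(all_words))}
docs = [['a', 'b', 'd', 'd'], ['c', 'x', 'c']]
rows, cols, data = sparse_triplets(docs, corpus)
```

Memory use grows with the number of distinct terms per document, not with the size of the vocabulary, which is the saving the sparse datatypes are after.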
You can also make use of some matrix operations provided by numpy if you want to apply local or global weighting schemes.
I hope this gets you started :)