Fastest Search Through Random Nested Data - python

[
    [
        [2, 33, 64, 276, 1],
        [234, 5, 234, 7, 34, 36, 7, 2],
        []
    ],
    [
        [2, 4, 5]
    ],
    ...
    etc.
]
I'm not looking for an exact solution to this, as the structure above is just an example. I'm trying to search for an ID that can be nested several levels deep within a group of IDs ordered randomly.
Currently I'm just doing a linear search, which takes a few minutes to return a result when each of the deepest levels has a couple hundred IDs. I was wondering if anyone could suggest a faster algorithm for searching through multiple levels of random data? I am doing this in Python, if that matters.
Note: The IDs are always at the deepest level and the number of levels is consistent for each branch down. Not sure if that matters or not.
Also, to clarify: the data points are unique and cannot be repeated. My example has some repeats because I was just smashing the keyboard.

The fastest search through unordered random data is linear. Even pretending your data isn't nested, it's still random, so flattening it won't help.
To decrease the time complexity, you can increase the space complexity -- keep a dict containing IDs as keys and whatever information you want (possibly a list of indices pointing to the list containing the ID at each level), and update it every time you create/update/delete an element.
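For illustration, here is a minimal sketch of that index-dict idea, assuming the structure is a fixed three levels deep as in the example (the function and variable names are hypothetical):
def build_index(nested):
    """Map each ID to the (i, j, k) position where it lives."""
    index = {}
    for i, group in enumerate(nested):
        for j, sublist in enumerate(group):
            for k, id_ in enumerate(sublist):
                index[id_] = (i, j, k)
    return index

data = [
    [[2, 33, 64, 276, 1], [234, 5, 7, 34, 36], []],
    [[8, 4, 9]],
]
index = build_index(data)   # build once: O(total number of IDs)
print(index.get(276))       # (0, 0, 3) -- O(1) per lookup afterwards
print(index.get(999))       # None -- not present
Keeping the dict up to date whenever an ID is added, moved, or removed is the price you pay for the O(1) lookups.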

Related

How to translate lists into dictionaries or numpy arrays preserving structural information and improving efficiency?

I am building a database of concepts and neighbour concepts from a certain set of text files (arbitral awards). So far, I use lists, as I believe (not an expert here) it is computationally the simplest and most efficient way to store and retrieve the information I'm using. The structure of those lists is:
memory = [
    [tag1, number1,
        [
            [tag11, number11],
            [tag12, number12],
            ...
        ]
    ],
    ...
]
When fed a string (usually the length of a paragraph or a single page), my script will look for the tag and then for the sub-tags of every tag. Basically:
for tag in memory:
    if tag[0] in text:
        tag[1] += 1
    else:
        for subtag in tag[2]:
            if subtag[0] in text:
                subtag[1] += 1
                tag[1] += 1
There are some extra rules to break the search and to avoid repetitions, but you can imagine they don't solve the 'for in for in for' loop problem! (accounting for the 'in text' part)
My purpose is to build a semantic structure by enumeration, by relating tags with sub tags. For example: 'Greetings', with: 'Hi', 'Hello', 'Good Morning', and so on. The numbers say how often a tag is found in the text files I use and the sub-tags numbers say how often a certain sub tag is related with its parent tag.
My problem is that, after a few months using this structure, I have enough tags and sub-tags to make my script run slowly every time it has to search for every tag and every sub-tag within a given string, and I'm looking for options to solve this problem.
One option I have is to migrate to numpy arrays, but I wouldn't know where to start to be sure I will gain efficiency and not just translate my problem into a fancier structure. I am familiar with matrix multiplication and tensor products with numpy, which seem to be applicable here with some sort of convolutional algorithm (just guessing from what I've done before), but I'm not sure whether they work as well with strings as I've seen them work with numbers, because I would be multiplying small matrices (the tag list) with large matrices (the strings), and I have been told that numpy is more useful for large-to-large multiplications.
The other option is to use dictionaries, at least as an intermediate step, though I sincerely don't see how that would make things go faster; they were suggested by an engineer (I'm a lawyer, so... no idea). The problem with the latter is how to keep track of occurrences of keys. I can see how I can translate my lists into dictionaries, but even though I can think of ways to translate and update the occurrence information, I just feel it won't be more efficient, as the same for loops would have to be executed.
One option would be to set the occurrences as the first value of every key... but again, I'm not sure how that would make my script go faster.
I get that the increasing amount of computation won't just go away, but as an amateur-intermediate programming enthusiast, I understand that my method can be sensibly improved by going further than simple for loops.
Therefore, as a non-expert, I would appreciate some information, guidance or just references about how to translate my lists into dictionaries or numpy arrays in a way that also helps me loop over tags and sub-tags faster than the basic for loop.
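For what it's worth, here is a minimal sketch of how the list could be mirrored as nested dicts while still keeping the occurrence counts (the tags and the helper name are made up); by itself this only changes the bookkeeping, not the number of substring checks:
# Each tag maps to its own count plus a dict of sub-tag counts.
memory = {
    "greetings": {
        "count": 0,
        "subtags": {"hi": 0, "hello": 0, "good morning": 0},
    },
}

def update_counts(memory, text):
    """Increment tag/sub-tag counts found in the given text."""
    text = text.lower()
    for tag, info in memory.items():
        if tag in text:
            info["count"] += 1
        else:
            for subtag in info["subtags"]:
                if subtag in text:
                    info["subtags"][subtag] += 1
                    info["count"] += 1

update_counts(memory, "Hello and good morning to the tribunal.")
print(memory["greetings"]["count"])    # 2
print(memory["greetings"]["subtags"])  # {'hi': 0, 'hello': 1, 'good morning': 1}
The real gains usually come from reducing the number of substring checks themselves, for example by tokenizing each text once and testing membership against a set of tags rather than scanning the full string for every tag.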

Best strategy for storing timed data and plotting afterwards?

I'm trying to learn Python by converting a crude bash script of mine.
It runs every 5 minutes, and basically does the following:
Loops line-by-line through a .csv file, grabs first element ($URL),
Fetches $URL with wget and extracts $price from the page,
Adds $price to the end of the same line in the .csv file,
Continues looping for the remaining lines.
I use it to keep track of products prices on eBay and similar websites. It notifies me when a lower price is found and plots graphs with the product's price history.
It's simple enough that I could just replicate the algorithm in Python; however, as I'm trying to learn it, it seems there are several types of objects (lists, dicts, etc.) that could do the storing much more efficiently. My plan is to use pickle or even a simple DB solution (like dataset) from the beginning, instead of messing around with .csv files and extracting the data via sketchy string manipulations.
One of the improvements I would also like to make is store the absolute time of each fetch alongside its price, so I can plot a "true timed" graph, instead of assuming each cycle is 5 minutes away from each other (which it never is).
Therefore, my question sums to...
Assuming I need to work with the following data structure:
List of Products, each with its respective
URL
And its list of Time<->Prices pairs
What would be the best strategy in Python for doing so?
Should I use dictionaries, lists, sets or maybe even creating a custom class for products?
Firstly, if you plan on accessing the data structure that holds the URL-(time, Price) data for specific URLs, use a dictionary, since URLs are unique (the URL will be the key in the dictionary).
Otherwise you can keep a list of (URL, (time, Price)) tuples.
Secondly, use a list of (time, Price) tuples, since you don't need to sort them (they will already be sorted by the order in which you insert them).
{} - Dictionary
[] - List
() - tuple
Option 1:
[(URL, [(time, Price)])]
Option 2:
{URL: [(time, Price)]}
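A minimal sketch of Option 2 as a plain dict keyed by URL (the URLs and prices below are made up):
import time

# URL -> list of (timestamp, price) pairs, appended on every fetch.
prices = {}

def record_price(prices, url, price):
    """Append the current (timestamp, price) pair for a URL."""
    prices.setdefault(url, []).append((time.time(), price))

record_price(prices, "https://www.example.com/item/123", 19.99)
record_price(prices, "https://www.example.com/item/123", 18.49)

# Each URL's history stays in insertion (i.e. chronological) order,
# ready to be unpacked for plotting.
timestamps, price_history = zip(*prices["https://www.example.com/item/123"])
Pickling this dict between runs, or later swapping the inner lists for rows in a small database, doesn't change the shape of the structure.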

keep keys of different heaps updated when storing links to the same objects

While designing an algorithm in Python I'm trying to maintain one invariant, but I don't know if that is even possible. It's part of an MST algorithm.
I have some "wanted nodes". They are wanted by one or more clusters, which are implemented as lists of nodes. If I get a node that is wanted by one cluster, it gets placed into that cluster. However, if more than one cluster wants it, all those clusters get merged and then the node gets placed in the resulting cluster.
My goal
I am trying to get the biggest cluster of the list of "wanting clusters" in constant time, as if I had a max-heap and I could use the updated size of each cluster as the key.
What I am doing so far
The structure that I am using right now is a dict, where the keys are the nodes and the values are lists of the clusters that want the node at that key. This way, if I get a node I can check in constant time whether some cluster wants it, and if there are any, I loop through the list of clusters to find the biggest one. Once I finish the loop, I merge the clusters by updating the information in all the smaller clusters. This way I get a total merging time of O(n log n) instead of O(n²).
Question
I was wondering if I could use something like a heap to store in my dict as the value, but I don't know how that heap would be updated with the current size of each cluster. Is it possible to do something like that by using pointers and possibly another dict storing the size of each cluster?
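For illustration only, here is a minimal sketch of caching cluster sizes in a separate dict so the biggest wanting cluster is found without re-counting (the data and names are hypothetical, and it deliberately sidesteps the harder issue of other dict entries still pointing at merged-away clusters):
clusters = {"A": [1, 2, 3], "B": [4], "C": [5, 6]}
sizes = {cid: len(members) for cid, members in clusters.items()}
wanted = {7: ["A", "B", "C"]}   # node 7 is wanted by clusters A, B and C

def place_node(node):
    wanting = wanted.pop(node, [])
    if not wanting:
        return None
    # Pick the biggest cluster by its cached size: no re-counting needed.
    target = max(wanting, key=sizes.__getitem__)
    for cid in wanting:
        if cid != target:
            clusters[target].extend(clusters.pop(cid))  # merge smaller into bigger
            sizes[target] += sizes.pop(cid)
    clusters[target].append(node)
    sizes[target] += 1
    return target

print(place_node(7))   # 'A', the largest of the wanting clusters
print(sizes)           # {'A': 7}
Merging the smaller clusters into the largest one is the same "union by size" idea that keeps the total merging cost at O(n log n).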

Intersection of Two Lists Of Strings

I had an interview question along these lines:
Given two lists of unordered customers, return a list of the intersection of the two lists. That is, return a list of the customers that appear in both lists.
Some things I established:
Assume each customer has a unique name
If the name is the same in both lists, it's the same customer
The names are of the form first name last name
There's no trickery of II's, Jr's, weird characters, etc.
I think the point was to find an efficient algorithm/use of data structures to do this as efficiently as possible.
My progress went like this:
Read one list in to memory, then read the other list one item at a time to see if there is a match
Alphabetize both lists then start at the top of one list and see if each item appears in the other list
Put both lists into ordered lists, then use the shorter list to check item by item (that way, if one list has 2 items, you only check those 2 items)
Put one list into a hash, and check for the existence of keys from the other list
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Any other tricks to do this efficiently?
Side note, this question was in python, and I just read about sets, which seem to do this as efficiently as possible. Any idea what the data structure/algorithm of sets is?
It really doesn't matter how it's implemented... but I believe it is implemented in C, so it is faster and better. set([1,2,3,4,5,6]).intersection([1,2,5,9]) is likely what they wanted.
In Python, readability counts for a lot! And set operations in Python are used extensively and are well vetted...
That said, another Pythonic way of doing it would be
list_new = [itm for itm in listA if itm in listB]
or
list_new = filter(lambda itm: itm in listB, listA)
Basically I believe they were testing whether you were familiar with Python, not whether you could implement the algorithm, since they asked a question that is so well suited to Python.
Put one list into a bloom filter and use that to filter the second list.
Put the filtered second list into a bloom filter and use that to filter the first list.
Sort the two lists and find the intersection by one of the methods above.
The benefit of this approach (besides letting you use a semi-obscure data structure correctly in an interview) is that it doesn't require any O(n) storage until after you have (with high probability) reduced the problem size.
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Maybe they would just keep asking that until you run out of answers.
http://code.google.com/p/python-bloom-filter/ is a python implementation of bloom filters.
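For illustration, here is a minimal hand-rolled sketch of the filtering idea above; a real library such as the one linked would be preferable, and the bit-array size and hash scheme here are arbitrary:
import hashlib

class BloomFilter:
    """A tiny, illustrative Bloom filter: k hash positions over a bit array."""
    def __init__(self, size=10000, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))

list_a = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]
list_b = ["Grace Hopper", "Donald Knuth", "Ada Lovelace"]

bloom_a = BloomFilter()
for name in list_a:
    bloom_a.add(name)

# Names from list_b that are *probably* in list_a; an exact pass over the
# (now much smaller) candidate list removes any false positives.
candidates = [name for name in list_b if name in bloom_a]
print(sorted(set(candidates) & set(list_a)))   # ['Ada Lovelace', 'Grace Hopper']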

How do you efficiently do bulk index lookups?

I have these entity kinds:
Molecule
Atom
MoleculeAtom
Given a list(molecule_ids) whose length is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hundreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.
Both of these bulk lookups need to happen really fast. Right now I'm doing something like:
atom_ids_by_molecule_id = {}
for molecule_id in molecule_ids:
    moleculeatoms = MoleculeAtom.all().filter('molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
    ]
Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.
Ideas:
Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?
Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.
Of course, you'll need to denormalize the other way around too (an entity with the atom ID as its key name and a list of molecule IDs as its value) and will need to update everything carefully when atoms or molecules are altered, added, or deleted. If you need that to be transactional (multiple such modifications potentially in play at the same time), you need to arrange ancestor relationships... but I don't see how to do it for both molecules and atoms at the same time, so maybe that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.
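A minimal sketch of that denormalization, using the legacy google.appengine.ext.db API from the question (the kind and property names here are hypothetical):
from google.appengine.ext import db

# Denormalized index entity: the key name is the molecule ID and the value
# is the list of atom IDs (a mirror-image AtomMoleculeIndex kind would cover
# the other direction).
class MoleculeAtomIndex(db.Model):
    atom_ids = db.ListProperty(int)

def atoms_by_molecule(molecule_ids):
    """One batched db.get over precomputed index entities instead of N filtered queries."""
    keys = [db.Key.from_path('MoleculeAtomIndex', str(mid)) for mid in molecule_ids]
    entities = db.get(keys)   # a single batch round trip to the datastore
    return {
        mid: (entity.atom_ids if entity is not None else [])
        for mid, entity in zip(molecule_ids, entities)
    }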
