Python Dict vs List for adding unique element only

In order to achieve an iterable of unique elements, is [2] acceptable?
# [1]
if element not in list:
    list.append(element)
# [2]
dict[element] = None # value doesn't matter

Use a set as your data structure.
A list is not good performance-wise: checking whether an element is in a list takes linear time, so the longer the list, the slower it gets.
A set has constant-time lookups. A dictionary does too, but you don't need key-value pairs, so it's more elegant to do:
s = set()
s.add(element)
than
s = {}
s[element] = None
Plus you get all the nice set operations, like union, intersection, etc. See the documentation.
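For example, a minimal sketch of collecting unique elements and then using a couple of the set operations mentioned above (the sample values are arbitrary):
seen = set()
for element in [3, 1, 3, 2, 1]:
    seen.add(element)    # duplicates are silently ignored
print(seen)              # {1, 2, 3} (iteration order is arbitrary)
print(seen & {1, 2, 9})  # intersection: {1, 2}
print(seen | {9})        # union: {1, 2, 3, 9}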

Related

How to implement dicts / sets opposed to a list search, to increase speed

I am making a program that has to search through very long lists, and I have seen people suggesting that using sets and dicts speeds it up massively. However, I am at a loss as to how to make it work within my code. Currently, the program does this:
indexes = []
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted.index(term))
The code searches through the hex_crypted list, which contains 1,000,000+ terms, finds the index of each term, and then appends it to the 'indexes' list.
I simply need to speed this process. Thanks for any help.
You want to build a lookup table so you don't need to repeatedly loop over hex_crypted. Then you can simply look up each term in the table.
print("Collecting indexes...")
lookup = {term: index for (index, term) in enumerate(hex_crypted)}
indexes = [lookup[term] for term in sliced_5]
The fastest method, if you only need membership tests, is to call set() on the list to turn it into a set, but I don't think that is what you want to do in this case.
hex_crypted_set = set(hex_crypted)
If you need to keep that index for some reason, you'll want to instead build a dictionary first.
hex_crypted_dict = {}
for index, term in enumerate(hex_crypted):
    hex_crypted_dict[term] = index
And then to get that index you just search the dict:
indexes = []
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])
You end up with the same indexes into the original long list, but you iterate that long list only once, which is much better for performance than iterating it once per lookup.
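As a rough, hypothetical comparison of the two approaches (the data below is made up purely for illustration; actual timings depend entirely on your data):
import timeit

hex_crypted = [format(i, "x") for i in range(50000)]
sliced_5 = hex_crypted[::50]

def list_way():
    # O(n) scan of hex_crypted for every term
    return [hex_crypted.index(t) for t in sliced_5]

lookup = {term: i for i, term in enumerate(hex_crypted)}

def dict_way():
    # O(1) lookup per term after a single pass to build the dict
    return [lookup[t] for t in sliced_5]

print(timeit.timeit(list_way, number=1))
print(timeit.timeit(dict_way, number=1))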
The first step is to generate a dict, for example:
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
Then your code changes to
indexes = []
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])

sort set into list in python in order it was first appended

I want to loop through my set in a particular order. I know sets are unordered and that I will have to sort the set into a list; I have done so by sorting it numerically, but I now want to sort it in the order the numbers were first added to the set. Is this possible after the set has been created, or will it require me to use the numbers to populate something other than a set? The result needs to keep the order in which the numbers were first added to the set.
The easiest way, short of augmenting the inserted elements with some sort of ordering, is probably to use the keys of an OrderedDict to store and retrieve the elements of your set, with dummy values mapped to these keys.
from collections import OrderedDict
seq = [6, 7, 4, 3, 2, 1, 5, 0]
my_set = OrderedDict()
for elt in seq:
    my_set[elt] = True
You can now iterate over or retrieve the keys in the order you inserted them. You get the same properties as a set, i.e. uniqueness, fast insertion, retrieval, and containment checks; what you don't get are the specific set operations like symmetric difference, etc.
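For example, continuing the snippet above, iteration follows insertion order and membership tests stay fast:
for elt in my_set:  # keys come back in the order they were added
    print(elt)      # 6, 7, 4, 3, 2, 1, 5, 0
print(4 in my_set)  # True, constant time on average
print(9 in my_set)  # False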
No, a set is not ordered. There is no way of finding out in which order the elements were appended to the set.
However, you could use a list for that. Every time you add an element to the set append it to the list as well and then you know the order.
But beware: a set only contains each element once, whereas a list can contain the same element multiple times. I am not sure how this affects your use of this feature.
You could work around it by only appending to the list if the element is not yet in the list.
s = set()
l = []
elem = 1
if elem not in l:
    l.append(elem)
    s.add(elem)
print(s)
print(l)

Text search elements in a big python list

With a list that looks something like:
cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM","769P_KIDNEY","786O_KIDNEY"]
With my dabbling in regular expressions, I can't figure out a compelling way to search individual strings in a list besides looping through each element and performing the search.
How can I retrieve indices containing "KIDNEY" in an efficient way (since I have a list of length thousands)?
Make a list comprehension:
[line for line in cell_lines if "KIDNEY" in line]
This is O(n), since we check every item in the list for whether it contains KIDNEY.
If you would need to make similar queries like this often, you should probably think about reorganizing your data and have a dictionary grouped by categories like KIDNEY:
{
    "KIDNEY": ["769P_KIDNEY", "786O_KIDNEY"],
    "NERVOUS_SYSTEM": ["LN18_CENTRAL_NERVOUS_SYSTEM"]
}
In this case, every "by category" lookup would take "constant" time.
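As a sketch, one possible way to build such a grouping is below; it assumes the category is everything after the first underscore (so LN18_CENTRAL_NERVOUS_SYSTEM would land under CENTRAL_NERVOUS_SYSTEM), which you would adjust to your actual naming scheme:
from collections import defaultdict

cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM", "769P_KIDNEY", "786O_KIDNEY"]

by_category = defaultdict(list)
for line in cell_lines:
    sample, _, category = line.partition("_")
    by_category[category].append(line)

print(by_category["KIDNEY"])  # ['769P_KIDNEY', '786O_KIDNEY']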
You can use a set instead of a list, since a set performs lookups in constant time. Alternatively, if you need to keep a sorted list, the bisect module gives you logarithmic-time membership tests:
from bisect import bisect_left
def bi_contains(lst, item):
    """Efficient `item in lst` for sorted lists."""
    # If item is larger than the last element, it's not in the list, but bisect
    # would return len(lst) as the insertion index, so check that first. Otherwise,
    # if the item is in the list it has to be at index bisect_left(lst, item).
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
Slightly modifying the above code will give you pretty good efficiency.
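A small usage sketch, assuming you keep a sorted copy of the list around (cell_lines is the sample list from the question):
cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM", "769P_KIDNEY", "786O_KIDNEY"]
cell_lines_sorted = sorted(cell_lines)
print(bi_contains(cell_lines_sorted, "769P_KIDNEY"))  # True
print(bi_contains(cell_lines_sorted, "786O_LIVER"))   # False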
Here's a list of the data structures available in Python along with the time complexities.
https://wiki.python.org/moin/TimeComplexity

Access array based on number of named key

json_data = {"fruits": ["apple", "banana", "orange"],"vegetables":["tomatoe", "cucumber", "potato"]}
How do I access my array numerically without having to include a numeric key?
ex:
json_data[0][0] #result should equal "apple"
You can't. The outer container is an unordered dictionary, not a list, so an index of 0 is meaningless. If you have some way of ordering the keys, you could use dict.keys() to build a list and index that. The problem is that keys() can come back in any order, so you'd still need some other ordering principle.
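For example, if alphabetical key order happens to be acceptable for your data (an assumption, not something the JSON guarantees), that could look like:
keys = sorted(json_data.keys())  # ['fruits', 'vegetables']
print(json_data[keys[0]][0])     # 'apple'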
json_data[list(json_data.keys())[0]][0]
This is how to do it, but it is extremely wrong, ugly and unpythonic, and you should probably be looking for another way to do this.
Starting from the inside: json_data.keys() returns all the keys.
list() turns those keys into a list, and the [0] after it accesses the zeroth item in that list.
json_data[...] around that indexes the dict by that key.
The [0] after it accesses the zeroth item in the returned list.
Also, it is not guaranteed to work 100% of the time, because json_data.keys() is not guaranteed to always return the keys in the same order.

What is the fastest way to add data to a list without duplication in python (2.5)

I have about half a million items that need to be placed in a list. I can't have duplicates, and if an item is already there I need to get its index. So far I have
if Item in List:
    ItemNumber = List.index(Item)
else:
    List.append(Item)
    ItemNumber = List.index(Item)
The problem is that as the list grows it gets progressively slower until at some point it just isn't worth doing. I am limited to python 2.5 because it is an embedded system.
You can use a set (in CPython since version 2.4) to efficiently look up duplicate values. If you really need an indexed system as well, you can use both a set and list.
Doing your lookups using a set will remove the overhead of if Item in List, but not that of List.index(Item)
Please note ItemNumber=List.index(Item) will be very inefficient to do after List.append(Item). You know the length of the list, so your index can be retrieved with ItemNumber = len(List)-1.
To completely remove the overhead of List.index (because that method will search through the list - very inefficient on larger sets), you can use a dict mapping Items back to their index.
I might rewrite it as follows:
# earlier in the program, NOT inside the loop
Dup = {}
# inside your loop to add items:
if Item in Dup:
    ItemNumber = Dup[Item]
else:
    List.append(Item)
    Dup[Item] = ItemNumber = len(List)-1
If you really need to keep the data in an array, I'd use a separate dictionary to keep track of duplicates. This requires twice as much memory, but won't slow down significantly.
existing = dict()
if Item in existing:
    ItemNumber = existing[Item]
else:
    ItemNumber = existing[Item] = len(List)
    List.append(Item)
However, if you don't need to save the order of items you should just use a set instead. This will take almost as little space as a list, yet will be as fast as a dictionary.
Items = set()
# ...
Items.add(Item) # will do nothing if Item is already added
Both of these require that your object is hashable. In Python, most types are hashable unless they are a container whose contents can be modified. For example: lists are not hashable because you can modify their contents, but tuples are hashable because you cannot.
If you were trying to store values that aren't hashable, there isn't a fast general solution.
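A quick illustration of the hashability point:
s = set()
s.add((1, 2))      # fine: tuples are immutable and hashable
try:
    s.add([1, 2])  # lists are mutable, so this raises a TypeError
except TypeError as err:
    print(err)     # unhashable type: 'list'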
You can improve the check a lot:
check = set(List)
for Item in NewList:
    if Item in check:
        ItemNumber = List.index(Item)
    else:
        ItemNumber = len(List)
        List.append(Item)
        check.add(Item)  # keep the set in sync so a repeat in NewList isn't appended twice
Or, even better, if order is not important you can do this:
oldlist = set(List)
addlist = set(AddList)
newlist = list(oldlist | addlist)
And if you need to loop over the items that were duplicated:
for item in (oldlist & addlist):
    pass  # do stuff
