Populate dictionary from list in loop - python

I have the following code that works fine and I was wondering how to implement the same logic using list comprehension.
def get_features(document, feature_space):
    features = {}
    for w in feature_space:
        features[w] = (w in document)
    return features
Also, am I going to get any performance improvement by using a list comprehension?
The thing is that both feature_space and document are relatively big, and this will run for many iterations.
Edit: Sorry for not making it clear at first, both feature_space and document are lists.
document is a list of words (a word may exist more than once!)
feature_space is a list of labels (features)

Like this, with a dict comprehension:
def get_features(document, feature_space):
    return {w: (w in document) for w in feature_space}
The features[key] = value expression becomes the key: value part at the start, and the rest of the for loop(s) and any if statements follow in nesting order.
Yes, this will give you a performance boost, because you've now removed the repeated features local name lookups and the dict.__setitem__ calls.
Note that you need to make sure that document is a data structure that has fast membership tests. If it is a list, convert it to a set() first, for example, to ensure that membership tests take O(1) (constant) time, not the O(n) linear time of a list:
def get_features(document, feature_space):
    document = set(document)
    return {w: (w in document) for w in feature_space}
With a set, this is now an O(K) loop instead of an O(K*N) loop (where N is the size of document and K the size of feature_space).
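For illustration, a quick usage example with made-up inputs (any small word and label lists will do):

    words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    labels = ['cat', 'dog', 'mat']
    print(get_features(words, labels))
    # {'cat': True, 'dog': False, 'mat': True}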


Is there a better way than a while loop to perform this function?

I was attempting some Python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the i-th dish being of type D[i]. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes = 0
    pastDishes = []
    i = 0
    while (i < N):
        if (D[i] not in pastDishes):
            numDishes += 1
            pastDishes.append(D[i])
            if len(pastDishes) > K:
                pastDishes.pop(0)
        i += 1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was neat and quick, but I have now found a module with a tool that makes this much faster. It's from collections, just as deque is, but it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes = lastMod = 0
    pastDishes = [0] * K
    for Dval in D:
        if Dval in pastDishes: continue
        pastDishes[lastMod] = Dval
        numDishes, lastMod = numDishes + 1, (lastMod + 1) % K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter

def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount = lastMod = 0
    pastDishes = [0] * K
    eatenCounts = Counter({0: K})
    for Dval in D:
        if Dval in eatenCounts: continue
        eatCount += 1
        eatenCounts[Dval] += 1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1: eatenCounts.pop(val)
        else: eatenCounts[val] -= 1
        pastDishes[lastMod] = Dval
        lastMod = (lastMod + 1) % K
    return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some explanation of what I am doing:
Typically, while loops are marginally faster than for loops; however, since I would need to access the value at an index multiple times if I used one, I believe a for loop is actually better in this situation. You can see I also initialised the list to the max size it needs to be and am writing over the values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by @outis, another small improvement was made in my code by using the modulo operator in conjunction with the variable, which removes the need for an additional if statement.
The Counter is essentially a special dict object that holds a hashable as the key and an int as the value. I use the fact that lastMod is an index to what would normally be accessed through list.pop(0) to reach the object that needs to be either removed or decremented in the counter.
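A minimal illustration of the Counter semantics used above (the values here are made up):

    from collections import Counter

    c = Counter({0: 3})   # seed with key 0 mapped to count 3
    c[7] += 1             # missing keys default to 0, so this sets c[7] to 1
    print(7 in c)         # True -- membership testing is an O(1) dict lookup
    c.pop(7)              # remove the key entirely once its count would hit 0
    print(7 in c)         # False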
Note that it is not considered 'pythonic' to assign multiple variables on one line; however, I believe it adds a slight performance boost, which is why I have done it. This can be argued, though; see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
This seems to call for an ordered set, which you have to shrink to meet a capacity restriction of K.
To meet that, if the capacity is exceeded (len(ordered_set) > K), we have to remove the first n items, where n = len(ordered_set) - K. Ideally the removal performs in O(1).
However, since removal from a set happens in unordered fashion, we first transform it into a list: a list containing the unique elements in their order of appearance in the original sequence.
From that ordered list we can then remove the first n elements.
For example: the function lru returns the least-recently-used items for a sequence seq limited by capacity-limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but using the preserved insertion order of dicts since 3.7:
using OrderedDict explicitly
from collections import OrderedDict
def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i: 0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple of potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
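In keeping with the no-full-solutions intent, here is only the record-keeping structure for the second approach; the names are illustrative, not part of any given solution:

    from collections import defaultdict

    K = 3  # hypothetical window size
    # dish type -> plate count at which it was last eaten; the -K - 1 default
    # guarantees the first occurrence of any dish type is always eaten
    last_eaten = defaultdict(lambda: -K - 1)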
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()               # the last K dishes eaten, in order
    cnt = 0
    dish_counter = Counter()  # dish type -> occurrences within the window q
    for d in D:
        if dish_counter[d] == 0:  # not among the last K eaten, so eat it
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
            if len(q) == K + 1:   # window exceeded: drop the oldest dish
                remove = q.popleft()
                dish_counter[remove] -= 1
    return cnt

How to implement dicts/sets as opposed to a list search, to increase speed

I am making a program that has to search through very long lists, and I have seen people suggesting that using sets and dicts speeds it up massively. However, I am at a loss as to how to make it work within my code. Currently, the program does this:
indexes = []
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted.index(term))
The code searches through the hex_crypted list, which contains 1,000,000+ terms, finds the index of the term, and then appends it to the indexes list.
I simply need to speed this process. Thanks for any help.
You want to build a lookup table so you don't need to repeatedly loop over hex_crypted. Then you can simply look up each term in the table.
print("Collecting indexes...")
lookup = {term: index for (index, term) in enumerate(hex_crypted)}
indexes = [lookup[term] for term in sliced_5]
The fastest method, if you have a list, is to call set() on the list to turn it into a set, but I don't think that is what you want to do in this case.
hex_crypted_set = set(hex_crypted)
If you need to keep that index for some reason, you'll want to instead build a dictionary first.
hex_crypted_dict = {}
for index, term in enumerate(hex_crypted):
    hex_crypted_dict[term] = index
And then to get that index you just search the dict:
indexes = []
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])
You will end up with the appropriate indexes, which correspond to the original long list, and you only iterate that long list one time, which gives much better performance than iterating it for every lookup.
The first step is to generate a dict, for example:
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
Then your code changes to:
indexes = []
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])

What's the most efficient way to perform a multiple match lookup in a python dictionary?

I'm looking to maximally optimize the runtime for this chunk of code:
aDictionary= {"key":["value", "value2", ...
rests = \
    list(map((lambda key: Resp(key=key)),
             [key for key, values in
              aDictionary.items() if (test1 in values or test2 in values)]))
Using Python 3; willing to throw as much memory at it as possible.
I'm considering putting the two dictionary lookups on separate processes for a speedup (does that make sense?). Any other optimization ideas are welcome.
values can definitely be sorted and turned into a set; it is precomputed, very very large.
always len(values) >>>> len(tests), though they're both growing over time
len(tests) grows very very slowly, and has new values for each execution
currently looking at strings (considering doing a string->integer mapping)
For starters, there is no reason to use map when you are already using a list comprehension, so you can remove that, as well as the outer list call:
rests = [Resp(key=key) for key, values in aDictionary.items()
         if (test1 in values or test2 in values)]
A second possible optimization might be to turn each list of values into a set. That would take time up front, but it would change your lookups (the in tests) from linear time to constant time. You might need to create a separate helper function for that. Something like:
def anyIn(checking, checkingAgainst):
    checkingAgainst = set(checkingAgainst)
    for val in checking:
        if val in checkingAgainst:
            return True
    return False
Then you could change the end of your list comprehension to read
...if anyIn([test1, test2], values)]
But again, this would probably only be worth it if you had more than two values you were checking, or if the list of values in values is very long.
If tests are sufficiently numerous, it will surely pay off to switch to set operations:
tests = set([test1, test2, ...])
resps = map(Resp, (k for k, values in dic.items() if not tests.isdisjoint(values)))
# resps is a lazy iterable, not a list, and it uses a
# generator inside, thus saving the overhead of building
# the inner list.
Turning the dict values into sets would not gain anything, as the conversion would be O(N) (with N being the combined size of all values lists), while the above disjoint operation only iterates each values list until it encounters one of the tests, with O(1) lookups.
map can be more performant than a comprehension if you do not have to use a lambda, e.g. if key could be used as the first positional argument in Resp's __init__, but certainly not with the lambda (see Python List Comprehension Vs. Map). Otherwise, a generator or comprehension will be better:
resps = (Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values))
#resps = [Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values)]

How to compute the frequency of each word in a large word list faster with Python, as a dictionary

There is a very long word list; the length of the list is about 360,000. I want to get each word's frequency and make it a dictionary.
For example:
{'I': 50, 'good': 30,.......}
Since the word list is large, I found it takes a lot of time to compute. Do you have a faster method to accomplish this?
My code, so far, is the following:
dict_pronoun = dict([(i, lst_all_tweet_noun.count(i)) for i in
                     lst_all_tweet_noun])
sorted(dict_pronoun)
You are doing several things wrong here:
You are building a huge list first, then turning that list object into a dictionary. There is no need for the [..] list comprehension; just dropping the [ and ] would turn it into a much more memory-efficient generator expression.
You are using dict() with a loop instead of a {keyexpr: valueexpr for ... in ...} dictionary comprehension; the latter would avoid a generator expression altogether and go straight to building a dictionary.
You are using list.count(); this does a full scan of the list for every element, turning a linear counting pass into an O(N**2) quadratic problem. You could simply increment an integer in the dictionary each time you find the key is already present, and set the value to 1 otherwise, but there are better options (see below).
The sorted() call is busy-work; it returns a sorted list of keys that is then discarded again. Dictionaries are not sortable, and sorted() does not produce a dictionary again at any rate.
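For reference, the manual dict-counting approach just mentioned would look like this (though the Counter shown next is the better option):

    counts = {}
    for word in lst_all_tweet_noun:
        counts[word] = counts.get(word, 0) + 1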
Use a collections.Counter() object here to do your counting; it uses a linear scan:
from collections import Counter
dict_pronoun = Counter(lst_all_tweet_noun)
A Counter has a Counter.most_common() method which will efficiently give you output sorted by counts, which is what I suspect you wanted to achieve with the sorted() call.
For example, to get the top K elements (where K is smaller than N, the size of the dictionary), a heapq is used to get you those elements in O(NlogK) time (avoiding a full O(NlogN) sort).
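For illustration, a small made-up example of both calls:

    from collections import Counter

    counts = Counter(['I', 'good', 'I', 'day', 'good', 'I'])
    print(counts)                 # Counter({'I': 3, 'good': 2, 'day': 1})
    print(counts.most_common(2))  # [('I', 3), ('good', 2)]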

python nested generator objects content

I have a problem with Python.
I'm trying to understand what information is stored in an object that I discovered to be a generator.
I don't know anything about Python, but I have to understand how this code works in order to convert it to Java.
The code is the following:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    pairs = [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]
    return pairs
def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    productw = 1
    for w in words:
        productw = productw * Pw(w)
    return productw
While I understood how the functions Pwords and splits work (the function Pw(w) simply gets a value from a matrix), I'm still trying to understand how the candidates object in the segment function is built and what it contains,
as well as how the max() function analyzes this object.
I hope that someone can help me, because I didn't find any feasible way here to print this object.
Thanks a lot to everybody.
Mauro.
A generator is quite a simple abstraction. It behaves like a single-use custom iterator.
gen = (f(x) for x in data)
means that gen is an iterator whose each successive value equals f(x), where x is the corresponding value from data.
A nested generator is similar to a list comprehension, with small differences:
it is single use
it doesn't create whole sequence
code runs only during iterations
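A minimal demonstration of these properties (with made-up data):

    gen = (x * x for x in [1, 2, 3])
    print(next(gen))   # 1 -- values are produced lazily, one at a time
    print(list(gen))   # [4, 9] -- consuming the rest exhausts the generator
    print(list(gen))   # [] -- a generator is single-use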
For easier debugging, you can try to replace the nested generator with a list comprehension:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = [[first] + segment(rem) for first, rem in splits(text)]
    return max(candidates, key=Pwords)
