I have a dictionary like
{'A': 0, 'B': 1, 'C': 2, 'D': 3, etc}
How can I remove elements from this dictionary without creating gaps in values, in case the dictionary is not ordered?
An example:
I have a big matrix, where rows represent words, and columns represent documents where these words are encountered. I store the words and their corresponding indices as a dictionary. E.g. for this matrix
2 0 0
1 0 3
0 5 1
4 1 2
the dictionary would look like:
words = {'apple': 0, 'orange': 1, 'banana': 2, 'pear': 3}
If I remove the words 'apple' and 'banana', the matrix would contain only two rows. So the value of 'orange' in the dictionary should now equal 0 and not 1, and the value of 'pear' should be 1 instead of 3.
In Python 3.6+ dictionaries are ordered, so I can just write something like this to reassign the values:
i = 0
for k in words:
    words[k] = i  # assign through the key; rebinding v would leave the dict unchanged
    i += 1
or, alternatively
words = dict(zip(words.keys(), range(0, matrix.shape[0])))
I think this is far from the most efficient way to change the values, and it wouldn't work with unordered dictionaries. How can I do this efficiently? Is there an easy way to reassign the values if the dictionary is not ordered?
Turn the dict into a sorted list and then build a new dict without the words you want to remove:
import itertools
to_remove = {'apple', 'banana'}
# Step 1: sort the words
ordered_words = [None] * len(words)
for word, index in words.items():
    ordered_words[index] = word
# ordered_words: ['apple', 'orange', 'banana', 'pear']
# Step 2: Remove unwanted words and create a new dict
counter = itertools.count()
words = {word: next(counter) for word in ordered_words if word not in to_remove}
# result: {'orange': 0, 'pear': 1}
This has a runtime of O(n) because manually ordering the list with indexing operations is a linear operation, as opposed to sorted which would be O(n log n).
See also the documentation for itertools.count and next.
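If the matrix rows have to be dropped as well, the same idea can also report which old row indices survive. This is a minimal sketch, assuming the values are exactly 0..n-1 with no gaps; `remove_and_reindex` and `kept_rows` are hypothetical names, not part of the answer above:

```python
def remove_and_reindex(words, to_remove):
    """Return (new_words, kept_rows) after dropping the words in to_remove.

    Assumes the values of `words` are exactly 0..len(words)-1.
    """
    # Place every word at its row index: O(n), no sorting needed.
    ordered = [None] * len(words)
    for word, index in words.items():
        ordered[index] = word

    new_words = {}
    kept_rows = []  # old row indices that survive, in order
    for old_index, word in enumerate(ordered):
        if word not in to_remove:
            new_words[word] = len(new_words)
            kept_rows.append(old_index)
    return new_words, kept_rows


words = {'pear': 3, 'apple': 0, 'banana': 2, 'orange': 1}
new_words, kept_rows = remove_and_reindex(words, {'apple', 'banana'})
print(new_words)  # {'orange': 0, 'pear': 1}
print(kept_rows)  # [1, 3]
```

If the matrix is a NumPy array, `matrix[kept_rows]` would then select the remaining rows.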
You could always keep an inverted dictionary that maps indices to words, and use that as a reference for keeping the order of the original dictionary. Then you could remove the words, and rebuild the dictionary again:
words = {'apple': 0, 'orange': 1, 'banana': 2, 'pear': 3}
# reverse dict for index -> word mappings
inverted = {i: word for word, i in words.items()}
remove = {'apple', 'banana'}
# sort/remove the words
new_words = [inverted[i] for i in range(len(inverted)) if inverted[i] not in remove]
# rebuild new dictionary
new_dict = {word: i for i, word in enumerate(new_words)}
print(new_dict)
Which outputs:
{'orange': 0, 'pear': 1}
Note: Like the accepted answer, this is also O(n).
You can use your existing logic, using a representation of the dictionary that is sorted:
import operator
words = {'apple': 0, 'orange': 1, 'banana': 2, 'pear': 3}
sorted_words = sorted(words.items(), key=operator.itemgetter(1))
for i, (k, v) in enumerate(sorted_words):
    words[k] = i
Initially we have:
words = {'apple': 0, 'orange': 1, 'banana': 2, 'pear': 3}
To reorder from minimum to maximum, you may use sorted and dictionary comprehension:
std = sorted(words, key=lambda x: words[x])
newwords = {word:std.index(word) for word in std}
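Note that std.index(word) scans the list again for every word, so the comprehension above is quadratic in the number of words. A sketch of the same renumbering in a single pass with enumerate, shown on a dictionary that already has gaps (the sample data is made up for illustration):

```python
words = {'pear': 3, 'orange': 1}  # 'apple' and 'banana' already removed

# Sort the keys by their current value, then number them in one pass.
std = sorted(words, key=words.get)
newwords = {word: i for i, word in enumerate(std)}
print(newwords)  # {'orange': 0, 'pear': 1}
```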
You are using the wrong tool (dict) for the job; you should use a list:
class vocabulary:
    def __init__(self, *words):
        self.words = list(words)
    def __getitem__(self, key):
        try:
            return self.words.index(key)
        except ValueError:
            # raise instead of printing, so callers never get None back as an index
            raise KeyError(key + " is not in vocabulary")
    def remove(self, word):
        if isinstance(word, int):
            del self.words[word]
            return
        return self.remove(self[word])
words = vocabulary("apple", "banana", "orange")
print(words["banana"])  # outputs 1
words.remove("apple")
print(words["banana"])  # outputs 0
A note on complexity
I had several comments mentioning that a dict is more efficient because its lookup time is O(1) and the lookup time of a list is O(n).
This is simply not true in this case.
The O(1) lookup of a hash table (dict in Python) is an average-case figure: it assumes a well-distributed hash function and a table that is built once and then queried many times.
That figure does not account for discarding the entire dictionary and regenerating it every time you remove an item, as some of the other answers suggest.
With that rebuild included, the list implementation and the dict implementation have the same worst-case complexity of O(n).
Yet the list implementation could be optimised with two lines of Python (bisect) to have a worst-case complexity of O(log(n)).
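The bisect remark is not spelled out above. One possible reading, as a sketch only: if the words are kept in alphabetical order (an assumption, since it means row i of the matrix must correspond to the i-th word alphabetically), the index of a word can be found by binary search. `SortedVocabulary` is a hypothetical name:

```python
import bisect


class SortedVocabulary:
    """Vocabulary kept in sorted order so lookups can use binary search."""

    def __init__(self, *words):
        self.words = sorted(set(words))

    def __getitem__(self, word):
        # bisect_left returns the insertion point; check that it is a real hit.
        i = bisect.bisect_left(self.words, word)
        if i == len(self.words) or self.words[i] != word:
            raise KeyError(word)
        return i

    def remove(self, word):
        # The lookup is O(log n); deleting from a list still shifts items.
        del self.words[self[word]]


vocab = SortedVocabulary("apple", "banana", "orange")
print(vocab["banana"])  # 1
vocab.remove("apple")
print(vocab["banana"])  # 0
```

Only the lookup becomes logarithmic in this sketch; the deletion itself still moves the later list elements.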
Related
I'd like to write a function that will take one argument (a text file) to use its contents as keys and assign values to the keys. But I'd like the values to go from 1 to n:
{'A': 1, 'B': 2, 'C': 3, 'D': 4... }.
I tried to write something like this:
Base code which kind of works:
filename = 'words.txt'
with open(filename, 'r') as f:
    text = f.read()
ready_text = text.split()
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    for item in lst:
        if item not in dictionary:
            dictionary[item] = 1
        else:
            dictionary[item] += 1
    return dictionary
print(create_dict(ready_text))
The output: {'A': 1, 'B': 1, 'C': 1, 'D': 1... }.
Attempt to make the thing work:
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    values = list(range(100)) # values
    for item in lst:
        if item not in dictionary:
            for value in values:
                dictionary[item] = values[value]
        else:
            dictionary[item] = values[value]
    return dictionary
The output: {'A': 99, 'B': 99, 'C': 99, 'D': 99... }.
My attempt doesn't work. It gives all the keys 99 as their value.
Bonus question: How can I optimize my code and make it look more elegant/cleaner?
Thank you in advance.
You can use dict comprehension with enumerate (note the start parameter):
words.txt:
colorless green ideas sleep furiously
Code:
with open('words.txt', 'r') as f:
    words = f.read().split()
dct = {word: i for i, word in enumerate(words, start=1)}
print(dct)
# {'colorless': 1, 'green': 2, 'ideas': 3, 'sleep': 4, 'furiously': 5}
Note that "to be or not to be" will result in {'to': 5, 'be': 6, 'or': 3, 'not': 4}, which is perhaps not what you want. Keeping only one entry for a repeated word is not a quirk of this algorithm; it is inevitable as long as you use a dict, because keys are unique.
Your program sends a list of strings to create_dict. For each string in the list, if that string is not in the dictionary, then the dictionary value for that key is set to 1. If that string has been encountered before, then the value of that key is increased by 1. So, since every key is being set to 1, then that must mean there are no repeat keys anywhere, meaning you're sending a list of unique strings.
So, in order to have the numerical values increase with each new key, you just have to increment some number during your loop:
num = 0
for item in lst:
    num += 1
    dictionary[item] = num
There's an easier way to loop through both numbers and list items at the same time, via enumerate():
for num, item in enumerate(lst, start=1): # start at 1 and not 0
    dictionary[item] = num
You can use this code. If an item occurs in lst more than once, it keeps the index assigned at its first occurrence:
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    idx = 1
    for item in lst:
        if item not in dictionary:
            dictionary[item] = idx
            idx += 1
    return dictionary
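The same first-occurrence numbering can also be written without an explicit counter. A sketch, assuming Python 3.7+ (where dicts keep insertion order): dict.fromkeys drops the repeats and enumerate supplies the numbers.

```python
def create_dict(lst):
    """Map each distinct item to its 1-based rank of first appearance."""
    # dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+).
    unique = dict.fromkeys(lst)
    return {item: idx for idx, item in enumerate(unique, start=1)}


print(create_dict("to be or not to be".split()))
# {'to': 1, 'be': 2, 'or': 3, 'not': 4}
```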
I have a list of alphanumeric data,
my_list = ["A1B2244", "B3H7654", "A1O6541", "J4777"]
I need to divide each word in dict form like
{"A1": ["B2244", "O6541"], "B3": ["H7654"], "J4": ["777"]}
Could you please let me know the easiest way to do this in Python?
You can use itertools.groupby to group the elements of your list by their first two characters, then pass the result to the dict constructor. groupby only groups adjacent elements, which is why the list is sorted first. The values here keep the two-character prefix; slice each string with s[2:] to drop it.
>>> from itertools import groupby
>>> dict((k, list(g)) for k, g in groupby(sorted(my_list), key=lambda x: x[:2]))
{'A1': ['A1B2244', 'A1O6541'], 'B3': ['B3H7654'], 'J4': ['J4777']}
list = ['A1B2244', 'B3H7654', 'A1O6541', 'J4777']
#first initialize lists based 2 first elements
d= {i[:2]:[] for i in list}
#loop to add items by key
[d.get(i[:2]).append(i[2:]) for i in list]
print(d)
output:
{'A1': ['B2244', 'O6541'], 'J4': ['777'], 'B3': ['H7654']}
my_list = ['A1B2244', 'B3H7654', 'A1O6541', 'J4777']
my_dict={i[:2]:i[2:] for i in my_list}
Edit:
Sorry I didn't notice the replication in your output. Others have short solutions, but a pure pythonic way is:
my_list = ['A1B2244', 'B3H7654', 'A1O6541', 'J4777']
my_dict={}
for i in my_list:
    if i[:2] in my_dict:
        my_dict[i[:2]].append(i[2:])
    else:
        my_dict[i[:2]] = [i[2:]]
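The membership check in that loop can be left to collections.defaultdict, which creates the empty list the first time a key is seen. A short sketch using the same sample data (`groups` is just an illustrative name):

```python
from collections import defaultdict

my_list = ['A1B2244', 'B3H7654', 'A1O6541', 'J4777']

groups = defaultdict(list)  # missing keys start out as an empty list
for item in my_list:
    groups[item[:2]].append(item[2:])

my_dict = dict(groups)  # back to a plain dict, so later lookups raise KeyError again
print(my_dict)
# {'A1': ['B2244', 'O6541'], 'B3': ['H7654'], 'J4': ['777']}
```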
Just to add to the first answer by Willian Vieira, I thought it would be helpful to know the output print(d) right after d= {i[:2]:[] for i in list}, which is:
{'A1': [], 'B3': [], 'J4': []}
This clarifies the first line of the two-line solution: the keys are made from the first two characters of each element of the list (without duplicates), and the values are initialized as empty lists.
As per your comment to the question, rule to split is: split after the first digit. So, you can search for the index of the first digit, split and add to the dict. I ignored an input with no digits.
def first_index_of_digit(st):
    for i in range(len(st)):
        if st[i].isdigit():
            return i
    return -1
my_list = ["A1B2244", "B3H7654", "A1O6541", "J4777"]
dd = dict()
for item in my_list:
    i = first_index_of_digit(item)
    if (i == -1):
        continue
    k, v = item[:i+1], item[i+1:]
    if (dd.get(k, 0) == 0):
        dd[k] = list()
    dd[k].append(v)
print(dd)
# {'A1': ['B2244', 'O6541'], 'B3': ['H7654'], 'J4': ['777']}
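The "split after the first digit" rule can also be expressed as a regular expression. A sketch under the same assumption that inputs without any digit are skipped; `split_after_first_digit` and `PATTERN` are hypothetical names:

```python
import re

# Zero or more non-digits followed by the first digit, then the rest.
PATTERN = re.compile(r"(\D*\d)(.*)", re.DOTALL)


def split_after_first_digit(items):
    groups = {}
    for item in items:
        match = PATTERN.match(item)
        if match is None:  # no digit at all: skip the item
            continue
        key, rest = match.groups()
        groups.setdefault(key, []).append(rest)
    return groups


print(split_after_first_digit(["A1B2244", "B3H7654", "A1O6541", "J4777", "XYZ"]))
# {'A1': ['B2244', 'O6541'], 'B3': ['H7654'], 'J4': ['777']}
```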
I am working on a function
def common_words(dictionary, N):
    if len(dictionary) > N:
        max(dictionary, key=dictionary.get)
Description of the function is:
The first parameter is the dictionary of word counts and the second is
a positive integer N. This function should update the dictionary so
that it includes the most common (highest frequency words). At most N
words should be included in the dictionary. If including all words
with some word count would result in a dictionary with more than N
words, then none of the words with that word count should be included.
(i.e., in the case of a tie for the N+1st most common word, omit all
of the words in the tie.)
So I know that I need to get the N items with the highest values, but I am not sure how to do that. I also know that once I have N items, I need to pop out any whose values are tied at the cutoff.
For example, given
k = {'a':5, 'b':4, 'c':4, 'd':1}
then
common_words(k, 2)
should modify k so that it becomes {'a':5}.
Here's my algorithm for this problem.
Extract the data from the dictionary into a list and sort it in descending order on the dictionary values.
Clear the original dictionary.
Group the sorted data into groups that have the same value.
Re-populate the dictionary with all the (key, value) pairs from each group in the sorted list if that will keep the total dictionary size <= N. If adding a group would make the total dictionary size > N, then return.
The grouping operation can be easily done using the standard itertools.groupby function.
To perform the sorting and grouping we need an appropriate key function, as described in the groupby, list and sorted docs. Since we need the second item of each tuple we could use
def keyfunc(t):
    return t[1]
or
keyfunc = lambda t: t[1]
but it's more efficient to use operator.itemgetter.
from operator import itemgetter
from itertools import groupby
def common_words(d, n):
    keyfunc = itemgetter(1)
    lst = sorted(d.items(), key=keyfunc, reverse=True)
    d.clear()
    for _, g in groupby(lst, key=keyfunc):
        g = list(g)
        if len(d) + len(g) <= n:
            d.update(g)
        else:
            break
# test
data = {'a':5, 'b':4, 'c':4, 'd':1}
common_words(data, 4)
print(data)
common_words(data, 2)
print(data)
output
{'c': 4, 'd': 1, 'b': 4, 'a': 5}
{'a': 5}
My algorithm is as follows.
First, build a list of (key, value) tuples from the dictionary, sorted by value from largest to smallest.
Then, while item[N-1] has the same value as item[N], decrement N so that every word tied with the (N+1)-th word is dropped (indices start at 0, hence the -1).
Finally, convert the first N tuples back into a dict; use an OrderedDict here if you want to retain the item order.
If the dictionary has no more than N items, it is returned as it is.
def common_words(dictionary, N):
    if len(dictionary) > N:
        tmp = [(k, dictionary[k]) for k in sorted(dictionary, key=dictionary.get, reverse=True)]
        # step back past every word tied with the (N+1)-th most common one
        while N > 0 and tmp[N-1][1] == tmp[N][1]:
            N -= 1
        return dict(tmp[:N])
        # return [i[0] for i in tmp[:N]]  # use this line instead of the one above to get only the keys, as the title asks
    else:
        return dictionary
        # return dictionary.keys()  # use this line instead of the one above to get only the keys
>>> common_words({'a':5, 'b':4, 'c':4, 'd':1}, 2)
{'a': 5}
If the input dictionary should be modified in place, with the function returning None, it can be changed as below:
def common_words(dictionary, N):
    if len(dictionary) > N:
        tmp = [(k, dictionary[k]) for k in sorted(dictionary, key=dictionary.get, reverse=True)]
        # step back past every word tied with the (N+1)-th most common one
        while N > 0 and tmp[N-1][1] == tmp[N][1]:
            N -= 1
        # remove everything past the first N entries from the original dict
        for i in tmp[N:]:
            dictionary.pop(i[0])
>>> k = {'a':5, 'b':4, 'c':4, 'd':1}
>>> common_words(k, 2)
>>> k
{'a': 5}
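A tie at the cutoff can involve more than two words, so it is worth testing with such an input. As an alternative sketch (not taken from either answer), the rule can be restated as: keep only the words whose count is strictly greater than the count of the (N+1)-th most common word. The name `cutoff` is illustrative:

```python
def common_words(dictionary, n):
    """Keep at most n of the most frequent words, in place; drop boundary ties."""
    if len(dictionary) <= n:
        return
    counts = sorted(dictionary.values(), reverse=True)
    cutoff = counts[n]  # count of the (n+1)-th most common word
    # Everything tied with or below the cutoff has to go.
    for word in [w for w, c in dictionary.items() if c <= cutoff]:
        del dictionary[word]


k = {'a': 5, 'b': 4, 'c': 4, 'd': 4}
common_words(k, 3)
print(k)  # {'a': 5}
```

The list of words to delete is built before deleting, because a dict must not change size while it is being iterated.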
I have not found a solution to my question, yet I hope it's trivial.
I have two dictionaries:
dictA:
contains the order number of a word in a text as key: word as value
e.g.
{0:'Roses',1:'are',2:'red'...12:'blue'}
dictB:
contains counts of those words in the text
e.g.
{'Roses':2,'are':4,'blue':1}
I want to replace each value in dictA with the count stored in dictB under that word, using 0 when the word is missing from dictB.
So output should look like:
{0:2,1:4,2:0...12:1}
Is there a way of doing this, preferably without defining my own functions?
Use a dictionary comprehension and apply the get method of dict B to return 0 for items that are not found in B:
>>> A = {0:'Roses',1:'are',2:'red', 12:'blue'}
>>> B = {'Roses':2,'are':4,'blue':1}
>>> {k: B.get(v, 0) for k, v in A.items()}
{0: 2, 1: 4, 2: 0, 12: 1}
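If B happens to be a collections.Counter rather than a plain dict (an assumption; the question does not say how the counts were produced), missing words already count as 0 and plain indexing works:

```python
from collections import Counter

A = {0: 'Roses', 1: 'are', 2: 'red', 12: 'blue'}
B = Counter({'Roses': 2, 'are': 4, 'blue': 1})

# A Counter returns 0 for missing keys instead of raising KeyError.
result = {k: B[v] for k, v in A.items()}
print(result)  # {0: 2, 1: 4, 2: 0, 12: 1}
```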
I construct a dictionary from an Excel sheet and end up with something like:
d = {('a','b','c'): val1, ('a','d'): val2}
The tuples I use as keys contain a handful of values. The goal is to get a list of the values that occur more than a certain number of times.
I've tried two solutions, both of which take entirely too long.
Attempt 1, simple list comprehension filter:
keyList = []
for k in d.keys():
    keyList.extend(list(k))
# The script makes it to here before hanging
commonkeylist = [key for key in keyList if keyList.count(key) > 5]
This takes forever since list.count() traverses the list on each iteration of the comprehension.
Attempt 2, create a count dictionary
keyList = []
keydict = {}
for k in d.keys():
    keyList.extend(list(k))
# The script makes it to here before hanging
for k in keyList:
    if k in keydict.keys():
        keydict[k] += 1
    else:
        keydict[k] = 1
commonkeylist = [k for k in keyList if keydict[k] > 50]
I thought this would be faster since we only traverse all of keyList a handful of times, but it still hangs the script.
What other steps can I take to improve the efficiency of this operation?
Use collections.Counter() and a generator expression:
from collections import Counter
counts = Counter(item for key in d for item in key)
commonkeylist = [item for item, count in counts.most_common() if count > 50]
where iterating over the dictionary directly yields the keys without creating an intermediary list object.
Demo with a lower count filter:
>>> from collections import Counter
>>> d = {('a','b','c'): 'val1', ('a','d'): 'val2'}
>>> counts = Counter(item for key in d for item in key)
>>> counts
Counter({'a': 2, 'c': 1, 'b': 1, 'd': 1})
>>> [item for item, count in counts.most_common() if count > 1]
['a']
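The same counting can be wrapped in a small function with the threshold as a parameter. A sketch, assuming every key of d is a tuple; `common_items` is a hypothetical name, and itertools.chain.from_iterable is used here as an equivalent way to flatten the key tuples:

```python
from collections import Counter
from itertools import chain


def common_items(d, threshold):
    """Return the tuple elements that occur more than `threshold` times."""
    # chain.from_iterable flattens the key tuples lazily.
    counts = Counter(chain.from_iterable(d))
    return [item for item, count in counts.items() if count > threshold]


d = {('a', 'b', 'c'): 'val1', ('a', 'd'): 'val2', ('b', 'a'): 'val3'}
print(common_items(d, 1))  # ['a', 'b']
```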
I thought this would be faster since we only traverse all of keyList a
handful of times, but it still hangs the script.
That's because you're still doing an O(n) search: in Python 2, keydict.keys() builds a list, so the membership test scans it. Replace this:
for k in keyList:
    if k in keydict.keys():
with this:
for k in keyList:
    if k in keydict:
and see if that helps your 2nd attempt perform better.
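Putting that fix together, a plain-dict version of the second attempt might look like the sketch below. It counts with dict.get and filters the counts rather than the flattened list, so each item appears only once in the result; `count_items` is a hypothetical name and the threshold of 1 is only for the demo data:

```python
def count_items(d):
    """Count how often each element appears across the tuple keys of d."""
    keydict = {}
    for key in d:
        for item in key:
            # dict.get is a single hash lookup; no scan over the keys.
            keydict[item] = keydict.get(item, 0) + 1
    return keydict


d = {('a', 'b', 'c'): 'val1', ('a', 'd'): 'val2'}
keydict = count_items(d)
# Filter the counts, not the flattened list, so each item appears once.
commonkeylist = [k for k, n in keydict.items() if n > 1]
print(commonkeylist)  # ['a']
```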