fast way to find occurrences in a list in python - python

I have a set of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc which has a structure like:
[ ['hello', 'world', 'i', 'am'],
['hello', 'stackoverflow', 'i', 'am'],
['hello', 'world', 'i', 'am', 'mr'],
['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]
and h_unique as:
('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
what I want is to find the occurrences of the unique words in the tokenized documents list.
So far I came up with this code but this seems to be VERY slow. Is there any efficient way to do this?
term_id = []
for term in h_unique:
    print term
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])
In my case I have a document list of 7000 documents, structured like:
[ [doc1], [doc2], [doc3], ..... ]

It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?
unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass
This runs through all the documents only once.
From your above example, unique_dict would be:
{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}
(Of course assuming the typo 'pycahrm' in your example was deliberate)
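A compact variant of the single-pass idea above, sketched for Python 3. It assumes only document ids are wanted (one entry per document, as the asker describes in a comment), and the helper name build_index is made up for illustration:

```python
from collections import defaultdict

def build_index(docs, vocabulary):
    """Map each vocabulary term to the ids of the documents containing it."""
    vocab = set(vocabulary)              # O(1) average membership tests
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        # set(doc) collapses repeated words, so each doc id is added once per term
        for term in set(doc) & vocab:
            index[term].append(doc_id)
    # Terms that never occur still get an empty list
    return {term: index.get(term, []) for term in vocab}

docs = [['hello', 'world', 'i', 'am'],
        ['hello', 'stackoverflow', 'i', 'am'],
        ['hello', 'world', 'i', 'am', 'mr'],
        ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

index = build_index(docs, h_unique)
print(index['world'])    # [0, 2]
print(index['pycharm'])  # [] (the sample data spells it 'pycahrm')
```

Each document is visited once, so the work grows with the total number of tokens rather than with (number of unique words) x (number of documents).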

term_id.append([doc_id for t in doc if t == term])
This will not append one doc_id for each matching term; it will append an entire list of potentially many identical values of doc_id. Surely you did not mean to do this.
Based on your sample code, term_id ends up as this:
[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]
Is this really what you intended?

If I understood correctly, and based on your comment to the question where you say
yes because a single term may appear in multiple docs like in the above case for hello the result is [0,1, 2, 3] and for world it is [0, 2]
it looks like what you wanna do is: For each of the words in the h_unique list (which, as mentioned, should be a set, or keys in a dict, which both have a search access of O(1)), go through all the lists contained in the h_tokenized_doc variable and find the indexes in which of those lists the word appears.
IF that's actually what you want to do, you could do something like the following:
#!/usr/bin/env python
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']
# Initialize a dict with empty lists as the value and the items
# in h_unique the keys
results = {k: [] for k in h_unique}
for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)
print results
Which outputs:
{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3],
'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2],
'hello': [0, 1, 2, 3]}
The idea is using the h_unique list as keys in a dictionary (the results = {k: [] for k in h_unique} part).
Keys in dictionaries have the advantage of a constant lookup time, which makes the results[k] access cheap. Note that the if k in line: test itself searches a list, so it takes O(n) in the length of that document; turning each document into a set would make that check constant time as well. If the word (the key k) appears in the list, the index of that list within the matrix is appended to the dictionary of results.
I'm not certain this is what you want to achieve, though.
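To make the membership test itself constant time, each document can be converted to a set once up front. This is only a sketch for Python 3; the name doc_sets is an assumption and not part of the answer above:

```python
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Build each document's set once; `term in doc_set` is then O(1) on average
doc_sets = [set(doc) for doc in h_tokenized_doc]

results = {
    term: [i for i, doc_set in enumerate(doc_sets) if term in doc_set]
    for term in h_unique
}
print(results['stackoverflow'])  # [1, 3]
```

The loop structure is the same as before; only the cost of each `in` check changes.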

You can optimize your code to do the trick by:
Using just a single for loop
Using generators and dictionaries for constant lookup time, as suggested previously. Generator expressions produce their values on the fly, so no intermediate list has to be built
In [75]: h_tokenized_doc = [ ['hello', 'world', 'i', 'am'],
    ...:                     ['hello', 'stackoverflow', 'i', 'am'],
    ...:                     ['hello', 'world', 'i', 'am', 'mr'],
    ...:                     ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]
In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
In [77]: term_id = {k: [] for k in h_unique}
In [78]: for term in h_unique:
    ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i])
which yields the output
{'am': [0, 1, 2, 3],
'hello': [0, 1, 2, 3],
'i': [0, 1, 2, 3],
'mr': [2],
'pycharm': [],
'stackoverflow': [1, 3],
'world': [0, 2]}
A more descriptive solution would be
In [79]: for term in h_unique:
    ...:     term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])
In [80]: term_id
Out[80]:
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
'mr': [(2, 4)],
'pycharm': [],
'stackoverflow': [(1, 1), (3, 1)],
'world': [(0, 1), (2, 1)]}
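One caveat with list.index(): it returns only the first position of a word, so a word repeated inside one document is reported once. A sketch that records every position, using made-up toy data rather than the asker's documents:

```python
# Toy data (not from the question): 'a' occurs twice in the first document
docs = [['a', 'b', 'a'], ['b', 'a']]
terms = ('a', 'b', 'c')

positions = {term: [] for term in terms}
for doc_id, doc in enumerate(docs):
    for pos, word in enumerate(doc):
        if word in positions:            # dict key lookup, O(1) on average
            positions[word].append((doc_id, pos))

print(positions['a'])  # [(0, 0), (0, 2), (1, 1)]
print(positions['c'])  # []
```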

Related

Build a dictionary with the words of a sentence as keys and the number of position of the words from 1 as values in python

I expect this output from the code below:
{'Tell': 1, 'a': 2, 'little': 3, 'more': 4, 'about': 5, 'yourself': 6, 'as': 7, 'a': 8, 'developer': 9}
But I get this output:
{'Tell': 1, 'a': 8, 'little': 3, 'more': 4, 'about': 5, 'yourself': 6, 'as': 7, 'developer': 9}
This is the code:
sentence = 'Tell a little more about yourself as a developer'
list_words = sentence.split()
d = {word: i for i, word in enumerate(list_words, 1)}
print(d)
What do you think is the problem? What is the code that gives the output I want?
You cannot have two identical keys in a dictionary so it is impossible to get your expected result where 'a' is present twice (once for 'a':2 and again for 'a':8).
Your output data structure could be a list of tuples instead of a dictionary:
r = [(word,i) for i,word in enumerate(list_words,1)]
[('Tell', 1), ('a', 2), ('little', 3), ('more', 4), ('about', 5),
('yourself', 6), ('as', 7), ('a', 8), ('developer', 9)]
Or, it could be a dictionary with a list of positions for each word:
d = dict()
for i,word in enumerate(list_words,1):
    d.setdefault(word,[]).append(i)
{'Tell': [1], 'a': [2, 8], 'little': [3], 'more': [4],
'about': [5], 'yourself': [6], 'as': [7], 'developer': [9]}
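The duplicate-key behaviour described above can be checked directly. A small sketch (Python 3.7+, where dicts keep insertion order):

```python
# A duplicate key in a literal is legal, but the last value wins
d = {'a': 2, 'a': 8}
print(d)       # {'a': 8}
print(len(d))  # 1

# The same thing happens in a comprehension: 'a' keeps its first slot
# in the ordering, but its value is overwritten by the later position
positions = {word: i for i, word in enumerate(['a', 'b', 'a'], 1)}
print(positions)  # {'a': 3, 'b': 2}
```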
You need to access the Index of the list to get the order of the words in your sentence.
sentence = 'Tell a little more about yourself as a developer'
list_words = sentence.split()
words = [(value, index+1) for index, value in enumerate(list_words)]
print(words)
#output
[('Tell', 1), ('a', 2), ('little', 3), ('more', 4), ('about', 5), ('yourself', 6), ('as', 7), ('a', 8), ('developer', 9)]
Your requested output is a dictionary, but in a specific order. Python dictionaries don't support duplicate keys (a, a), which creates problems with getting this output.
sentence = 'Tell a little more about yourself as a developer'
list_words = sentence.split()
words = [(value, index+1) for index, value in enumerate(list_words)]
dict_words = {}
for item in words:
    # Key on the (unique) position so the duplicate word 'a' is not overwritten
    dict_words.update({item[1]: item[0]})
print(dict_words)
#output
{1: 'Tell', 2: 'a', 3: 'little', 4: 'more', 5: 'about', 6: 'yourself', 7: 'as', 8: 'a', 9: 'developer'}
sentence = 'Tell a little more about yourself as a developer'
list_words = sentence.split()
uniquewords = list(set(list_words))
# Count how many times each unique word occurs
d = {i: 0 for i in uniquewords}
for i in list_words:
    for j in uniquewords:
        if i == j:
            d[j] += 1
print(d)
Maybe you printed the index of the letters instead of the index of the words.
You can try:
sentence = 'Tell a little more about yourself as a developer'
words_list = sentence.split()
words_dictionary = dict()
for word in words_list:
    words_dictionary[word] = words_list.index(word) + 1
print(words_dictionary)
#output :
# {'Tell': 1, 'a': 2, 'little': 3, 'more': 4, 'about': 5, 'yourself': 6, 'as': 7, 'developer': 9}
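Calling words_list.index(word) rescans the list for every word. If the first position of each word is what is wanted, a single pass with setdefault gives the same mapping; this is a sketch, and the name first_position is an assumption:

```python
sentence = 'Tell a little more about yourself as a developer'

first_position = {}
for i, word in enumerate(sentence.split(), 1):
    # setdefault only stores the value if the key is not present yet,
    # so the second 'a' (position 8) does not overwrite position 2
    first_position.setdefault(word, i)

print(first_position['a'])          # 2
print(first_position['developer'])  # 9
```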

numpy prints list() around each list

I have an array of list containing lemmatized words. When I print many of them at once, this is the output:
print(data[:3])
[list(['#', 'switchfoot', 'http', ':', '//twitpic.com/2y1zl', '-', 'Awww', ',', 'that', "'s", 'a', 'bummer', '.', 'You', 'shoulda', 'got', 'David', 'Carr', 'of', 'Third', 'Day', 'to', 'do', 'it', '.', ';', 'D'])
list(['is', 'upset', 'that', 'he', 'ca', "n't", 'update', 'his', 'Facebook', 'by', 'texting', 'it', '...', 'and', 'might', 'cry', 'a', 'a', 'result', 'School', 'today', 'also', '.', 'Blah', '!'])
list(['#', 'Kenichan', 'I', 'dived', 'many', 'time', 'for', 'the', 'ball', '.', 'Managed', 'to', 'save', '50', '%', 'The', 'rest', 'go', 'out', 'of', 'bound'])]
I tried many things to get rid of it, but nothing worked. However, when I tried:
a = [[i for i in range(5)] for _ in range(5)]
print(np.array(a))
the output is not with list() around every list:
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
Does it mean they are different lists? Does it just happen with lists of strings? How can I get rid of it, if that is necessary of course? Thanks for your time.
You could potentially loop through the 3 lists and print them with the * symbol in front.
for i in data[:3]:
    print(*i)
In normal cases this would remove the brackets and commas of the list and just print it with spaces. I must admit though that I do not understand how you got your output, so this is just my 2 cents. Hope it helps :)
print(data[:3].tolist())
Convert the array to a list. This will use the list of list display, as opposed to an array of lists.
But as hashed out in the comments, there is a significant difference between an array of lists, and a 2d array.
Including list in the display is a relatively recent change in numpy. It was added, I think, to clarify the underlying nature of the elements of an object dtype array.
Consider, for example, an array with a variety of element types:
In [532]: x=np.empty(5,object)
In [533]: x[0]=[1,2,3]; x[1]=(1,2,3); x[2]=np.array([1,2,3]); x[3]=np.matrix([1,2,3]); x[4]={0:1}
In [534]: x
Out[534]:
array([list([1, 2, 3]),
(1, 2, 3),
array([1, 2, 3]),
matrix([[1, 2, 3]]),
{0: 1}],
dtype=object)
I tweaked the layout for clarity. But note that without the words, the list and array elements would look a lot alike.
Converting the array to a list, we get the default formatting of a list:
In [537]: x.tolist()
Out[537]: [[1, 2, 3], (1, 2, 3), array([1, 2, 3]), matrix([[1, 2, 3]]), {0: 1}]
The elements of the array and list are the same.

How to return a list as a dictionary, with the placement of the lists' variables as the dictionary's values?

For example, in a race, I have a list of runners and their names in a list ordered from their places, such as ['Bob', 'Charlie', 'Sarah', 'Alex', 'Bob']
I want to create a dictionary with this list such as
{'Bob': [0, 4], 'Charlie': [1], 'Sarah': [2], 'Alex': [3]}
If you only need to create a dictionary with the list variables as the dictionary keys and the positions of the lists' variables as the dictionary values, how would you do so?
[A, B, C, A] -> {A: [0, 3] B: [1], C:[2]}
(I'm having trouble figuring this out.)
Thank you. Sorry for the changed output. Thank you very much!
You can use enumerate(). This will iterate through the list, providing you with both the current element and that element's index.
my_list = ['Bob', 'Charlie', 'Sarah']
my_dict = {}
for index, name in enumerate(my_list):
    my_dict[name] = index
EDIT: Since the OP has changed.
To get exactly what you requested, you could use a defaultdict. This will create a dict and you specify what you want the default values to be. So if you go to access a key that does not yet exist, an empty list will automatically be added as the value. This way you can do the following:
from collections import defaultdict
my_list = ['Bob', 'Charlie', 'Sarah', 'Bob']
my_dict = defaultdict(list)
for index, name in enumerate(my_list):
    my_dict[name].append(index)
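The "automatically added" behaviour has a side effect worth knowing: merely reading a missing key inserts it. A short sketch of that, using made-up keys:

```python
from collections import defaultdict

d = defaultdict(list)
d['Bob'].append(0)     # missing key: an empty list is created, then appended to
_ = d['Nobody']        # just reading a missing key also inserts it

print(dict(d))         # {'Bob': [0], 'Nobody': []}
print(d.get('Alex'))   # None; .get() does not call the default factory
print('Alex' in d)     # False
```

If stray empty entries are a problem, use `in` or `.get()` for lookups, or convert with `dict(d)` once the data has been collected.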
You can use enumerate() and itertools.groupby():
>>> from itertools import groupby
>>> from operator import itemgetter
>>> your_list=['A','B','C','C','A','A','B','D']
>>> l=[(j,i) for i,j in enumerate(your_list,1)]
>>> l
[('A', 1), ('B', 2), ('C', 3), ('C', 4), ('A', 5), ('A', 6), ('B', 7), ('D', 8)]
>>> g=[list(g) for k, g in groupby(sorted(l),itemgetter(0))]
>>> g
[[('A', 1), ('A', 5), ('A', 6)], [('B', 2), ('B', 7)], [('C', 3), ('C', 4)], [('D', 8)]]
>>> z=[list(zip(*i)) for i in g]
>>> z
[[('A', 'A', 'A'), (1, 5, 6)], [('B', 'B'), (2, 7)], [('C', 'C'), (3, 4)], [('D',), (8,)]]
>>> {i[0]:j for i,j in z}
{'A': (1, 5, 6), 'C': (3, 4), 'B': (2, 7), 'D': (8,)}
How about a simple loop to get the desired result:
x = ['Bob', 'Charlie', 'Sarah', 'Alex', 'Bob']
y = {}
for i, name in enumerate(x):
    if name in y.keys():
        y[name].append(i)
    else:
        y[name] = [i]

One-step initialization of defaultdict that appends to list?

It would be convenient if a defaultdict could be initialized along the following lines
d = defaultdict(list, (('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2),
('b', 3)))
to produce
defaultdict(<type 'list'>, {'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]})
Instead, I get
defaultdict(<type 'list'>, {'a': 2, 'c': 3, 'b': 3, 'd': 4})
To get what I need, I end up having to do this:
d = defaultdict(list)
for x, y in (('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2), ('b', 3)):
    d[x].append(y)
This is IMO one step more than should be necessary, am I missing something here?
What you're apparently missing is that defaultdict is a straightforward (not especially "magical") subclass of dict. All the first argument does is provide a factory function for missing keys. When you initialize a defaultdict, you're initializing a dict.
If you want to produce
defaultdict(<type 'list'>, {'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]})
you should be initializing it the way you would initialize any other dict whose values are lists:
d = defaultdict(list, (('a', [1, 2]), ('b', [2, 3]), ('c', [3]), ('d', [4])))
If your initial data has to be in the form of tuples whose 2nd element is always an integer, then just go with the for loop. You call it one extra step; I call it the clear and obvious way to do it.
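A sketch of both points for Python 3.7+: everything after the factory is handed to dict(), so repeated keys simply overwrite, and the loop can be wrapped in a small helper if it is needed often. The name group_pairs is hypothetical:

```python
from collections import defaultdict

pairs = (('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2), ('b', 3))

# Arguments after the factory initialize the dict part: later pairs overwrite
plain = defaultdict(list, pairs)
print(dict(plain))   # {'a': 2, 'b': 3, 'c': 3, 'd': 4}

def group_pairs(pairs):
    """Hypothetical helper: collect the values of repeated keys into lists."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

print(dict(group_pairs(pairs)))   # {'a': [1, 2], 'b': [2, 3], 'c': [3], 'd': [4]}
```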
The behavior you describe would not be consistent with defaultdict's other behaviors. It seems like what you want is a FooDict such that
>>> f = FooDict()
>>> f['a'] = 1
>>> f['a'] = 2
>>> f['a']
[1, 2]
We can do that, but not with defaultdict; let's call it AppendDict
import collections
try:
    from collections.abc import MutableMapping  # Python 3.3+
except ImportError:
    from collections import MutableMapping  # Python 2
class AppendDict(MutableMapping):
    def __init__(self, container=list, append=None, pairs=()):
        self.container = collections.defaultdict(container)
        self.append = append or list.append
        for key, value in pairs:
            self[key] = value
    # Assigning to a key appends to that key's container instead of replacing it
    def __setitem__(self, key, value):
        self.append(self.container[key], value)
    def __getitem__(self, key): return self.container[key]
    def __delitem__(self, key): del self.container[key]
    def __iter__(self): return iter(self.container)
    def __len__(self): return len(self.container)
Sorting and itertools.groupby go a long way:
>>> import itertools
>>> from collections import defaultdict
>>> L = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2), ('b', 3)]
>>> L.sort(key=lambda t:t[0])
>>> d = defaultdict(list, [(tup[0], [t[1] for t in tup[1]]) for tup in itertools.groupby(L, key=lambda t: t[0])])
>>> d
defaultdict(<type 'list'>, {'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]})
To make this more of a one-liner:
L = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2), ('b', 3)]
d = defaultdict(list, [(tup[0], [t[1] for t in tup[1]]) for tup in itertools.groupby(sorted(L, key=operator.itemgetter(0)), key=lambda t: t[0])])
Hope this helps
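The same sort-then-group idea reads a little more easily as a dict comprehension. A sketch for Python 3; the variable names are assumptions:

```python
from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2), ('b', 3)]

# groupby only merges ADJACENT items with equal keys, hence the sort first.
# sorted() is stable, so values keep their original relative order per key.
by_key = sorted(pairs, key=itemgetter(0))
grouped = {key: [value for _, value in group]
           for key, group in groupby(by_key, key=itemgetter(0))}

print(grouped)  # {'a': [1, 2], 'b': [2, 3], 'c': [3], 'd': [4]}
```

Sorting costs O(n log n), whereas the plain for loop with a defaultdict is a single O(n) pass.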
I think most of this is a lot of smoke and mirrors to avoid a simple for loop:
di={}
for k,v in [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2),('b', 3)]:
    di.setdefault(k,[]).append(v)
# di={'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]}
If your goal is one line and you want abusive syntax that I cannot at all endorse or support you can use a side effect comprehension:
>>> li=[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 2),('b', 3)]
>>> di={};{di.setdefault(k[0],[]).append(k[1]) for k in li}
set([None])
>>> di
{'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]}
If you really want to go overboard into the unreadable:
>>> {k1:[e for _,e in v1] for k1,v1 in {k:filter(lambda x: x[0]==k,li) for k,v in li}.items()}
{'a': [1, 2], 'c': [3], 'b': [2, 3], 'd': [4]}
You don't want to do that. Use the for loop Luke!
>>> kvs = [(1,2), (2,3), (1,3)]
>>> reduce(
... lambda d,(k,v): d[k].append(v) or d,
... kvs,
... defaultdict(list))
defaultdict(<type 'list'>, {1: [2, 3], 2: [3]})
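The reduce snippet above relies on Python 2 only features: reduce as a builtin and tuple unpacking in the lambda's parameter list. A rough Python 3 equivalent, offered as a sketch rather than a recommendation:

```python
from collections import defaultdict
from functools import reduce

kvs = [(1, 2), (2, 3), (1, 3)]

# list.append returns None, so `... or d` hands the accumulator back to reduce
result = reduce(
    lambda d, kv: d[kv[0]].append(kv[1]) or d,
    kvs,
    defaultdict(list),
)
print(dict(result))  # {1: [2, 3], 2: [3]}
```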

create dictionary from list - in sequence

I would like to create a dictionary from list
>>> list=['a',1,'b',2,'c',3,'d',4]
>>> print list
['a', 1, 'b', 2, 'c', 3, 'd', 4]
I use dict() to produce dictionary from list
but the result is not in sequence as expected.
>>> d = dict(list[i:i+2] for i in range(0, len(list),2))
>>> print d
{'a': 1, 'c': 3, 'b': 2, 'd': 4}
I expect the result to be in sequence as the list.
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
Can you guys please help advise?
Dictionaries don't have any order (before Python 3.7, where insertion order became guaranteed); use collections.OrderedDict if you want the order to be preserved. And instead of using indices use an iterator.
>>> from collections import OrderedDict
>>> lis = ['a', 1, 'b', 2, 'c', 3, 'd', 4]
>>> it = iter(lis)
>>> OrderedDict((k, next(it)) for k in it)
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
A dictionary is an unordered data structure. To preserve order use collections.OrderedDict:
>>> lst = ['a',1,'b',2,'c',3,'d',4]
>>> from collections import OrderedDict
>>> OrderedDict(lst[i:i+2] for i in range(0, len(lst),2))
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
You could use the grouper recipe: zip(*[iterable]*n) to collect the items into groups of n:
In [5]: items = ['a',1,'b',2,'c',3,'d',4]
In [6]: items = iter(items)
In [7]: dict(zip(*[items]*2))
Out[7]: {'a': 1, 'b': 2, 'c': 3, 'd': 4}
PS. Never name a variable list, since it shadows the builtin (type) of the same name.
The grouper recipe is easy to use, but a little harder to explain.
Items in a dict are unordered. So if you want the dict items in a certain order, use a collections.OrderedDict (as falsetru already pointed out):
In [13]: collections.OrderedDict(zip(*[items]*2))
Out[13]: OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
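Why the grouper recipe works, sketched step by step for Python 3.7+: [it] * 2 is a list holding the same iterator twice, so zip alternately pulls a key and a value from one shared stream.

```python
items = ['a', 1, 'b', 2, 'c', 3, 'd', 4]
it = iter(items)

# zip(*[it] * 2) is the same call as zip(it, it): both arguments are ONE
# iterator, so each output tuple consumes two consecutive items
pairs = list(zip(it, it))
print(pairs)     # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]

d = dict(pairs)
print(list(d))   # ['a', 'b', 'c', 'd'] (insertion order is kept in 3.7+)
```

With an odd number of items, zip stops at the shortest input, so the trailing item is silently dropped.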
