Python inverted index issue: "ValueError: too many values to unpack" - python

I'm trying to build an inverted index, but I keep getting the same error:
ValueError: too many values to unpack
It occurs in this part of my code:
def inverted_index(self, text):
    terms = self.getTerms(text)
    inverted = {}
    for index, word in terms:
        locations = inverted.setdefault(word)
        locations.append(index)
    return inverted
More specifically, it occurs on the line "for index, word in terms:".
If I print terms, I get a list of words:
['12th', 'comput', 'olympiad', 'ciao', 'chiamo', 'alberto', 'lancellotti', 'scrivendo', 'primo', 'articolo', 'piccolo', 'motor', 'ricerca', 'manual']
Any ideas?

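Python is complaining because each iteration over terms yields only one thing, the word itself. Python then tries to unpack that string character by character into index and word, which fails for any word longer than two characters. What you want is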
for index, word in enumerate(terms):
That will return both the index for the item and the item itself.

for index, word in enumerate(terms):
enumerate will give you the index of each item in the list along with the item itself:
In [5]: for index, word in enumerate(terms):
   ...:     print(index,word)
(0, '12th')
(1, 'comput')
(2, 'olympiad')
(3, 'ciao')
(4, 'chiamo')
(5, 'alberto')
(6, 'lancellotti')
(7, 'scrivendo')
(8, 'primo')
(9, 'articolo')
(10, 'piccolo')
(11, 'motor')
(12, 'ricerca')
(13, 'manual')
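Putting the enumerate suggestion back into the question's function could look like the sketch below. It is written as a standalone function that takes the list of terms directly, which is an assumption made for illustration. It also passes an empty list as the default to setdefault. That is an extra change not mentioned in the answers, but without it setdefault(word) stores and returns None, and the append call would fail.

```python
def inverted_index(terms):
    """Map each word to the list of positions where it occurs in terms."""
    inverted = {}
    for index, word in enumerate(terms):
        # setdefault returns the existing list for word, or stores and
        # returns a new empty list the first time the word is seen.
        inverted.setdefault(word, []).append(index)
    return inverted


terms = ['ciao', 'chiamo', 'alberto', 'ciao']
print(inverted_index(terms))
```

A word that appears more than once, such as 'ciao' here, ends up with every position where it occurs.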

Related

IndexError: list index out of range positive check in Tuple

Hi, I am getting this error when checking that all the values in the tuples are positive:
my_tuple = []
a=(5,6)
b=(8,9)
#print("The list is :")
my_tuple.append((a,b))
#print(my_tuple)
my_result = [sub for sub in my_tuple[0] if all(element >= 0 for element in sub)]
print(len(my_result))
#print("The result is :")
#print(my_result)
If you uncomment all of the commented-out lines, your code runs fine with the following output:
The list is :
[((5, 6), (8, 9))]
2
The result is :
[(5, 6), (8, 9)]

Tokenize concatenated characters based on given dictionary

I would like to tokenize concatenated characters based on a given dictionary and output the tokenized words found. For example, I have the following:
dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'
Output should be something like the following:
[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
I would like to get a list of tuples as output. The first element of each tuple is the word found in the dictionary, the second element is the character offset, and the third element is the length of the word found. If characters are not found in the dictionary, we chunk them together into one word, e.g. padthai in the case above. If multiple words from the dictionary match, we select the longest one (yakkin instead of yak and kin).
My current implementation is below. It starts at index 0 and then loops through the characters (it doesn't work yet).
import numpy as np
def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars/3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)], (start, start + r), len(chars[start:start+r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words
tokenize(chars, dictionary) # give only [('yakkin', (0, 6), 6)]
I am having a hard time wrapping my head around this problem. Please feel free to comment or suggest!
It may look a bit nasty, but it works:
def tokenize(string, dictionary):
    # sorting dictionary words by length
    # because we need to find the longest word if possible
    # like "yakkin" instead of "yak"
    sorted_dictionary = sorted(dictionary,
                               key=len,
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens
gives output
>>> tokenize('yakkinpadthaikhaikoo', dictionary)
[('yakkin', (0, 6), 6),
('padthai', (6, 13), 7),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
>>> tokenize('lolyakhaiyakkinpadthaikhaikoolol', dictionary)
[('lol', (0, 3), 3),
('yak', (3, 6), 3),
('hai', (6, 9), 3),
('yakkin', (9, 15), 6),
('padthai', (15, 22), 7),
('khai', (22, 26), 4),
('koo', (26, 29), 3),
('lol', (29, 32), 3)]
You can use find() to find the starting index of the word, and the length of the word is known thanks to len(). Iterate through each word in the dictionary, and your list is complete!
def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0
        # skips words that appear in longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue
        while index < len(chars):
            index = chars.find(word, index) # start search from index
            if index == -1: # find() returns -1 if not found
                break
            # Append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len+index), word_len))
            index += word_len
    return tokens
Then we can run it for the following output:
>>> tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])
skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
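Note that this last output does not contain the unknown chunk 'padthai' that the question asked for. As an alternative sketch, not taken from either answer, the same longest-match idea can be expressed with the standard re module. The words are joined into one alternation with the longest first, and the gaps between matches are emitted as unknown tokens. The function name and the handling of an empty dictionary are assumptions made for illustration.

```python
import re


def tokenize(chars, dictionary):
    """Split chars into dictionary words, chunking unknown runs together."""
    # Longest words first: a regex alternation takes the first alternative
    # that matches at a position, so "yakkin" is preferred over "yak".
    words = sorted((w for w in dictionary if w), key=len, reverse=True)
    if not words:
        # Assumption: with no usable words, the whole input is one token.
        return [(chars, (0, len(chars)), len(chars))] if chars else []
    pattern = re.compile("|".join(re.escape(word) for word in words))

    tokens = []
    position = 0
    for match in pattern.finditer(chars):
        start, end = match.span()
        if start > position:
            # Characters between two dictionary words form one unknown token.
            tokens.append((chars[position:start], (position, start), start - position))
        tokens.append((match.group(), (start, end), end - start))
        position = end
    if position < len(chars):
        # Trailing characters after the last dictionary word.
        tokens.append((chars[position:], (position, len(chars)), len(chars) - position))
    return tokens


dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
print(tokenize('yakkinpadthaikhaikoo', dictionary))
```

Escaping each word with re.escape keeps dictionary entries that contain regex metacharacters from being interpreted as patterns.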

Sort a list of tuples in consecutive order

I want to sort a list of tuples in a consecutive order, so the first element of each tuple is equal to the last element of the previous one.
For example:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
I have developed a search like this:
output=[]
given = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
t = given[0][0]
for i in range(len(given)):
    # search tuples starting with element t
    output += [e for e in given if e[0] == t]
    t = output[-1][-1] # Get the next element to search
print(output)
Is there a pythonic way to achieve such order?
And a way to do it "in-place" (with only a list)?
In my problem, the input can always be reordered in a circular way using all the tuples, so it does not matter which element is chosen first.
Assuming your tuples in the list will be circular, you may use dict to achieve it within complexity of O(n) as:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
input_dict = dict(input) # Convert list of `tuples` to dict
elem = input[0][0] # start point in the new list
new_list = [] # List of tuples for holding the values in required order
for _ in range(len(input)):
    new_list.append((elem, input_dict[elem]))
    elem = input_dict[elem]
    if elem not in input_dict:
        # Raise exception in case list of tuples is not circular
        raise Exception('key {} not found in dict'.format(elem))
The final value held by new_list will be:
>>> new_list
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
If you are not afraid to waste some memory, you could create a dictionary start_dict containing the start integers as keys and the tuples as values and do something like this:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_dict = {item[0]: item for item in tpl}
start = tpl[0][0]
res = []
while start_dict:
    item = start_dict[start]
    del start_dict[start]
    res.append(item)
    start = item[-1]
print(res)
If two tuples start with the same number, you will lose one of them. If the tuples do not form a single chain, the lookup start_dict[start] will raise a KeyError before the dictionary is empty.
But maybe this is something to build on.
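As a sketch of how the dictionary approach might be hardened, the function below checks both problem cases explicitly: two tuples sharing a start value, and a chain that cannot be continued. The function name chain_tuples and the choice of ValueError are assumptions made for illustration, not part of the answer above.

```python
def chain_tuples(pairs):
    """Order pairs so each tuple starts where the previous one ended."""
    if not pairs:
        return []
    start_dict = {}
    for pair in pairs:
        if pair[0] in start_dict:
            # Two tuples with the same start value would overwrite each other.
            raise ValueError("duplicate start value: {}".format(pair[0]))
        start_dict[pair[0]] = pair

    result = []
    start = pairs[0][0]
    while start_dict:
        if start not in start_dict:
            # Tuples remain, but none of them continues the chain.
            raise ValueError("chain is broken at {}".format(start))
        item = start_dict.pop(start)
        result.append(item)
        start = item[-1]
    return result


print(chain_tuples([(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]))
```

Each tuple is popped from the dictionary exactly once, so the loop runs at most len(pairs) times and always terminates.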
There are actually many open questions about what output you intend and what should happen if the input list has a structure that makes the task impossible.
Assume you have an input of pairs where each number appears exactly twice. We can then treat the input as a graph where the numbers are nodes and each pair is an edge. As far as I understand your question, you assume that this graph is cyclic and looks like this:
10 - 7 - 13 - 4 - 9 - 10 (same 10 as at the beginning)
This shows that you can reduce the list that stores the graph to [10, 7, 13, 4, 9]. Here is the script that sorts the input list:
# input
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# sorting and archiving
first = input[0][0]
last = input[0][1]
output_in_place = [first, last]
while last != first:
    for item in input:
        if item[0] == last:
            last = item[1]
            if last != first:
                output_in_place.append(last)
print(output_in_place)
# output
output = []
for i in range(len(output_in_place) - 1):
    output.append((output_in_place[i], output_in_place[i+1]))
output.append((output_in_place[-1], output_in_place[0]))
print(output)
I would first create a dictionary of the form
{first_value: [list of tuples with that first value], ...}
Then work from there:
from collections import defaultdict
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
chosen_tuples = input[:1] # Start from the first
first_values = defaultdict(list) # Missing keys start as empty lists
for tup in input[1:]:
    first_values[tup[0]].append(tup)
while first_values: # Loop will end when all lists are removed
    value = chosen_tuples[-1][1] # Second item of last tuple
    tuples_with_that_value = first_values[value]
    chosen_tuples.append(tuples_with_that_value.pop())
    if not tuples_with_that_value:
        del first_values[value] # List empty, remove it
You can try this:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [input[0]] # output contains the first element of input
temp = input[1:] # temp contains the rest of elements in input
while temp:
    item = [i for i in temp if i[0] == output[-1][1]].pop() # We compare each element with output[-1]
    output.append(item) # We add the right item to output
    temp.remove(item) # We remove each handled element from temp
Output:
>>> output
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
Here is a robust solution using the sorted function and a custom key function:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
def consec_sort(lst):
    def key(x):
        nonlocal index
        if index <= lower_index:
            index += 1
            return -1
        return abs(x[0] - lst[index - 1][1])
    for lower_index in range(len(lst) - 2):
        index = 0
        lst = sorted(lst, key=key)
    return lst
output = consec_sort(input)
print(output)
The original list is not modified. Note that sorted is called 3 times for your input list of length 5. In each call, one additional tuple is placed correctly. The first tuple keeps its original position.
I have used the nonlocal keyword, meaning that this code is for Python 3 only (one could use global instead to make it legal Python 2 code).
My two cents:
def match_tuples(input):
    # making a copy so as not to modify the original list
    tuples = input[:] # [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
    last_elem = tuples.pop(0) # (10,7)
    # { "first tuple's element": "index in list"}
    indexes = {tup[0]: i for i, tup in enumerate(tuples)} # {9: 3, 4: 0, 13: 1, 7: 2}
    yield last_elem # yields the first element
    for i in range(len(tuples)):
        # find where in the list the tuple is whose first element matches the last element of the previous tuple
        list_index = indexes.get(last_elem[1])
        last_elem = tuples[list_index] # just get that tuple
        yield last_elem
Output:
input = [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
print(list(match_tuples(input)))
# output: [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
This is a variant (less efficient than the dictionary version) where the list is changed in-place:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
for i in range(1, len(tpl)-1): # iterate over the indices of the list
    item = tpl[i-1] # the tuple that is already in its final position
    for j, next_item in enumerate(tpl[i:]): # find the next item
                                            # in the remaining list
        if next_item[0] == item[1]:
            next_index = i + j
            break
    tpl[i], tpl[next_index] = tpl[next_index], tpl[i] # now swap the items
Here is a more efficient version of the same idea:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_index = {item[0]: i for i, item in enumerate(tpl)}
item = tpl[0]
next_index = start_index[item[-1]]
for i in range(1, len(tpl)-1):
    tpl[i], tpl[next_index] = tpl[next_index], tpl[i]
    # need to update the start indices:
    start_index[tpl[next_index][0]] = next_index
    start_index[tpl[i][0]] = i
    next_index = start_index[tpl[i][-1]]
print(tpl)
The list is changed in-place; the dictionary only contains the starting values of the tuples and their index in the list.
To get a O(n) algorithm one needs to make sure that one doesn't do a double-loop over the array. One way to do this is by keeping already processed values in some sort of lookup-table (a dict would be a good choice).
For example something like this (I hope the inline comments explain the functionality well). This modifies the list in-place and should avoid unnecessary (even implicit) looping over the list:
inp = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# A dictionary containing processed elements, first element is
# the key and the value represents the tuple. This is used to
# avoid the double loop
seen = {}
# The second value of the first tuple. This must match the first
# item of the next tuple
current = inp[0][1]
# Iteration to insert the next element
for insert_idx in range(1, len(inp)):
    # print('insert', insert_idx, seen)
    # If the next value was already found no need to search, just
    # pop it from the seen dictionary and continue with the next loop
    if current in seen:
        item = seen.pop(current)
        inp[insert_idx] = item
        current = item[1]
        continue
    # Search the list until the next value is found, saving all
    # other items in the dictionary so we avoid unnecessary iterations
    # over the list.
    for search_idx in range(insert_idx, len(inp)):
        # print('search', search_idx, inp[search_idx])
        item = inp[search_idx]
        first, second = item
        if first == current:
            # Found the next tuple, break out of the inner loop!
            inp[insert_idx] = item
            current = second
            break
        else:
            seen[first] = item

Merging overlapping items in a list

My goal is to merge overlapping tuples in the example list below.
If an item falls within the range of the next, the two tuples will have to be merged. The resulting tuple is one that covers the range of the two items (minimum to maximum values). For instance; [(1,6),(2,5)] will result in [(1,6)], as [2,5] falls within the range of [(1,6)]
mylist=[(1, 1), (1, 6), (2, 5), (4, 4), (9, 10)]
My attempt:
c=[]
t2=[]
for i, x in enumerate(mylist):
    w=x,mylist[i-1]
    if x[0]-mylist[i-1][1]<=1:
        d=min([x[0] for x in w]),max([x[1] for x in w])
        c.append(d)
for i, x in enumerate(set(c)):
    t=x,c[i-1]
    if x[0]-c[i-1][1]<=1:
        t1=min([x[0] for x in t]),max([x[1] for x in t])
        t2.append(t1)
print sorted(set(t2))
Derived Output:
[(1, 6), (1, 10)]
Desired output:
[(1, 6), (9, 10)]
Any suggestions on how to get the desired output (in fewer lines if possible)? Thanks.
Based on the answer from @Valera, a Python implementation:
mylist=[(1, 6), (2, 5), (1, 1), (3, 7), (4, 4), (9, 10)]
result = []
for item in sorted(mylist):
    result = result or [item]
    if item[0] > result[-1][1]:
        result.append(item)
    else:
        old = result[-1]
        result[-1] = (old[0], max(old[1], item[1]))
print(result) # [(1, 7), (9, 10)]
You can solve this problem in O(n log n).
First, sort your intervals by their starting points. After that, create a new stack and, for each interval, do the following:
If the stack is empty, just push the current interval.
If it is not, check whether the interval on top of the stack overlaps with your current interval. If it does, pop it, merge it with your current interval, and push the result back. If it doesn't, just push your current interval. After you have checked all the intervals, your stack will contain all the merged intervals.
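The stack procedure described above can be sketched as a small function. The name merge_intervals is an assumption made for illustration, and two intervals are treated as overlapping when the next start is less than or equal to the end of the interval on top of the stack.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) tuples using a stack."""
    stack = []
    # Sorting by start point is the O(n log n) step.
    for start, end in sorted(intervals):
        if stack and start <= stack[-1][1]:
            # Overlaps the interval on top of the stack: pop, merge, push back.
            top_start, top_end = stack.pop()
            stack.append((top_start, max(top_end, end)))
        else:
            # Empty stack or no overlap: push the current interval.
            stack.append((start, end))
    return stack


mylist = [(1, 1), (1, 6), (2, 5), (4, 4), (9, 10)]
print(merge_intervals(mylist))  # [(1, 6), (9, 10)]
```

Because the intervals are sorted first, only the top of the stack ever needs to be compared with the current interval.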

Separating sets using tuples

In the list of tuples called mixed_sets, three separate sets exist. Each set contains tuples with values that intersect. A tuple from one set will not intersect with a tuple from another set.
I've come up with the following code to sort out the sets. I found that the python set functionality was limited when tuples are involved. It would be nice if the set intersection operation could look into each tuple index and not stop at the enclosing tuple object.
Here's the code:
mixed_sets= [(1,15),(2,22),(2,23),(3,13),(3,15),
(3,17),(4,22),(4,23),(5,15),(5,17),
(6,21),(6,22),(6,23),(7,15),(8,12),
(8,15),(9,19),(9,20),(10,19),(10,20),
(11,14),(11,16),(11,18),(11,19)]
def sort_sets(a_set):
    idx= 0
    idx2=0
    while len(mixed_sets) > idx and len(a_set) > idx2:
        if a_set[idx2][0] == mixed_sets[idx][0] or a_set[idx2][1] == mixed_sets[idx][1]:
            a_set.append(mixed_sets[idx])
            mixed_sets.pop(idx)
            idx=0
        else:
            idx+=1
            if idx == len(mixed_sets):
                idx2+=1
                idx=0
    a_set.pop(0) #remove first item; duplicate
    print a_set, 'a returned set'
    return a_set
sorted_sets=[]
for new_set in mixed_sets:
    sorted_sets.append(sort_sets([new_set]))
print mixed_sets #Now empty.
OUTPUT:
[(1, 15), (3, 15), (5, 15), (7, 15), (8, 15), (3, 13), (3, 17), (5, 17), (8, 12)] a returned set
[(2, 22), (2, 23), (4, 23), (6, 23), (4, 22), (6, 22), (6, 21)] a returned set
[(9, 19), (10, 19), (10, 20), (11, 19), (9, 20), (11, 14), (11, 16), (11, 18)] a returned set
Now this doesn't look like the most pythonic way of doing this task. This code is intended for large lists of tuples (approx 2E6) and I felt the program would run quicker if it didn't have to check tuples already sorted. Therefore I used pop() to shrink the mixed_sets list. I found using pop() made list comprehensions, for loops or any iterators problematic, so I've used the while loop instead.
It does work, but is there a more pythonic way of carrying out this task that doesn't use while loops and the idx and idx2 counters?
Probably you can increase the speed by first computing a set of all the first elements in the tuples in the mixed_sets, and a set of all the second elements. Then in your iteration you can check if the first or the second element is in one of these sets, and find the correct complete tuple using binary search.
Actually you'd need multi-sets, which you can simulate using dictionaries.
Something like this (currently not tested):
from collections import defaultdict
# define the mixed_sets list.
mixed_sets.sort()
first_els = defaultdict(int)
second_els = defaultdict(int)
for first,second in mixed_sets:
    first_els[first] += 1
    second_els[second] += 1
def sort_sets(a_set):
    index= 0
    while mixed_sets and len(a_set) > index:
        first, second = a_set[index]
        if first in first_els or second in second_els:
            if first in first_els:
                element = find_tuple(mixed_sets, first, index=0)
                first_els[first] -= 1
                if first_els[first] <= 0:
                    del first_els[first]
            else:
                element = find_tuple(mixed_sets, second, index=1)
                second_els[second] -= 1
                if second_els[second] <= 0:
                    del second_els[second]
            a_set.append(element)
            mixed_sets.remove(element)
        index += 1
    a_set.pop(0) #remove first item; duplicate
    print a_set, 'a returned set'
    return a_set
Where "find_tuple(mixed_sets, first, index=0,1)" return the tuple belonging to mixed_sets that has "first" at the given index.
Probably you'll have to duplicate also mixed_sets and order one of the copies by the first element and the other one by the second element.
Or maybe you could play with dictionaries again. Adding to the values in "first_els" and "second_els" also a sorted list of tuples.
I don't know how the performances will scale, but I think that if the data is in the order of 2 millions you shouldn't have too much to worry about.
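One way the "play with dictionaries" idea might look is sketched below. This swaps in a different technique from the untested code above: it indexes the tuples by first and by second value, then collects each group with a breadth-first search, without any idx counters. The function name separate_sets is an assumption made for illustration, and duplicate tuples in the input would be collapsed into one.

```python
from collections import defaultdict, deque


def separate_sets(pairs):
    """Group tuples that are linked by a shared first or second value."""
    by_first = defaultdict(list)
    by_second = defaultdict(list)
    for pair in pairs:
        by_first[pair[0]].append(pair)
        by_second[pair[1]].append(pair)

    seen = set()
    groups = []
    for pair in pairs:
        if pair in seen:
            continue
        seen.add(pair)
        group = []
        queue = deque([pair])
        while queue:
            current = queue.popleft()
            group.append(current)
            # pop() hands out each index list only once, so every tuple is
            # examined a bounded number of times.
            neighbours = by_first.pop(current[0], []) + by_second.pop(current[1], [])
            for other in neighbours:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(sorted(group))
    return groups


mixed_sets = [(1, 15), (2, 22), (2, 23), (3, 13), (3, 15),
              (3, 17), (4, 22), (4, 23), (5, 15), (5, 17),
              (6, 21), (6, 22), (6, 23), (7, 15), (8, 12),
              (8, 15), (9, 19), (9, 20), (10, 19), (10, 20),
              (11, 14), (11, 16), (11, 18), (11, 19)]
for group in separate_sets(mixed_sets):
    print(group)
```

Unlike the version in the question, this leaves mixed_sets itself unchanged and returns the groups as new lists.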
