Tokenize concatenated characters based on given dictionary - python

I would like to tokenize concatenated characters based on a given dictionary and output the tokenized words found. For example, I have the following:
dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'
Output should be something like the following:
[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
I would like to get a list of tuples as output. The first element of each tuple is the word found in the dictionary, the second is its character offsets, and the third is the length of the word. If characters are not found in the dictionary, they are chunked together into one word, e.g. padthai in the case above. If multiple dictionary words match, the longest one is selected (yakkin instead of yak and kin).
My current implementation is below. It starts at index 0 and then loops through the characters (it doesn't work yet).
import numpy as np
def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars/3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)], (start, start + r), len(chars[start:start+r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words
tokenize(chars, dictionary) # give only [('yakkin', (0, 6), 6)]
I'm having a hard time wrapping my head around this problem. Please feel free to comment or suggest!

it can look a bit nasty, but it works
def tokenize(string, dictionary):
    # sorting dictionary words by length
    # because we need to find longest word if its possible
    # like "yakkin" instead of "yak"
    sorted_dictionary = sorted(dictionary,
                               key=lambda word: len(word),
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens
gives output
>>>tokenize('yakkinpadthaikhaikoo', dictionary)
[('yakkin', (0, 6), 6),
('padthai', (6, 13), 7),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
>>>tokenize('lolyakhaiyakkinpadthaikhaikoolol', dictionary)
[('lol', (0, 3), 3),
('yak', (3, 6), 3),
('hai', (6, 9), 3),
('yakkin', (9, 15), 6),
('padthai', (15, 22), 7),
('khai', (22, 26), 4),
('koo', (26, 29), 3),
('lol', (29, 32), 3)]

You can use find() to find the starting index of the word, and the length of the word is known thanks to len(). Iterate through each word in the dictionary, and your list is complete!
def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0
        # skips words that appear in longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue
        while index < len(chars):
            index = chars.find(word, index) # start search from index
            if index == -1: # find() returns -1 if not found
                break
            # Append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len+index), word_len))
            index += word_len
    return tokens
Then we can run it for the following output. Note that the unmatched run padthai is not reported, since only dictionary words are searched for:
>>>tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])
skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
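For comparison, here is a minimal sketch of the greedy longest-match idea in a single left-to-right pass: at each position it tries the longest candidate slice first, and characters that match nothing are collected into one unknown token. The name `tokenize_longest` is illustrative and not taken from the answers above, and the sketch assumes the dictionary contains only non-empty strings.

```python
def tokenize_longest(chars, dictionary):
    """Greedy longest-match tokenizer; unmatched runs become one token."""
    words = set(dictionary)                      # fast membership tests
    max_len = max(map(len, words), default=0)    # longest candidate to try
    tokens = []
    unknown_start = None                         # start of a pending unmatched run
    i = 0
    while i < len(chars):
        # Try the longest possible slice first, then shorter ones.
        match = next((chars[i:i + n]
                      for n in range(min(max_len, len(chars) - i), 0, -1)
                      if chars[i:i + n] in words), None)
        if match is None:
            if unknown_start is None:
                unknown_start = i
            i += 1
            continue
        if unknown_start is not None:
            # Flush the unmatched run that ended just before this match.
            tokens.append((chars[unknown_start:i], (unknown_start, i), i - unknown_start))
            unknown_start = None
        tokens.append((match, (i, i + len(match)), len(match)))
        i += len(match)
    if unknown_start is not None:
        # Flush a trailing unmatched run.
        tokens.append((chars[unknown_start:], (unknown_start, len(chars)),
                       len(chars) - unknown_start))
    return tokens


print(tokenize_longest('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo']))
```

Storing the words in a set and bounding the slice length by the longest word keeps the work per position proportional to that length rather than to the size of the dictionary.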

Related

How do I compare pairs within a list in Python?

I'm trying to loop through a concatenated list of two lists that is essentially a bag of words. An example output is [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5), ..., ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1), ... ]
Having gathered all the text inputs from .txt files, I've removed the stop words and used stemming to remove duplicates arising from tenses.
The next step I want to take is to loop through the list and find the differences in the number of appearances of a given word. I would want 'brexit', 'say' and 'uk' to be flagged as significant words, with either the two numbers of appearances or just the difference. The start of my code (partly Python, partly pseudocode) is below.
def findSimilarities (word, count):
    for (word, count) in biasDict:
        if word == word and count != count:
            print (word, count - count)
        elif word ==word and count == count:
            del (word, count)
        (word, count)++
Any advice on how to approach this and edit the code to work? If it would be better, I can have the words come from two separate lists (which is how they are created; I concatenated them after they were created).
Many thanks.
This would be an option. It is not efficient, but the output is as desired, assuming you want to delete words with the same count (as shown in your code). If you want to keep the entries, just skip the biasDict.remove() part (commented out below).
If you're just interested in the words that occur twice with a different count, you could append the tuples to a new list instead of printing the difference, and return the new list afterwards.
import numpy as np
def findSimilarities (biasDict):
    similarities = {}
    #remove_later = []
    for i in range(0, len(biasDict)):
        word, count = biasDict[i][0], biasDict[i][1]
        for c in range(0, len(biasDict)):
            word_compare, count_compare = biasDict[c][0], biasDict[c][1]
            if c==i:
                pass #Same entry
            elif word == word_compare and count != count_compare:
                delta = count - count_compare
                if word not in similarities and delta != 0:
                    similarities[word] = np.abs(delta)
            #elif word == word_compare and count == count_compare and (word, count) not in remove_later:
            #    remove_later.append((word, count))
    #for entry in remove_later:
    #    biasDict.remove(entry)
    return similarities
biasDict = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5), ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
print(findSimilarities(biasDict))
Output:
{'brexit': 24, 'say': 17, 'uk': 6}
The idea of combining occurrences seems fine to me. Here is my implementation. Any comments or optimizations are appreciated.
def merge_list(words_count_list):
    updated_list = list()
    words_list = list()
    for i in range(len(words_count_list)):
        word = words_count_list[i][0]
        count = words_count_list[i][1]
        if word not in words_list:
            words_list.append(word)
            for j in range(i+1,len(words_count_list),1):
                if word == words_count_list[j][0]:
                    count += words_count_list[j][1]
            updated_list.append((word,count))
    return updated_list
print(merge_list([('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5),
('brexit', 35), ('say', 28),('may', 5), ('uk', 1)]))
output:
[('brexit', 46), ('say', 39), ('uk', 8), ('eu', 6), ('deal', 5), ('may', 10)]
Now, you can specify a threshold on the word count, sort by the count, then remove the most significant words.
Assuming you have two lists of the words, then you can do
# Converts a list of tuples to a dictionary.
# [('a', 1), ('b', 2)] => {'a': 1, 'b': 2}
def tupleListToDict(list):
    dictobj = {}
    for item in list:
        dictobj[item[0]] = item[1]
    return dictobj
def findSimilarities(list1, list2):
    dict1 = tupleListToDict(list1)
    dict2 = tupleListToDict(list2)
    dict3 = {} # To store the difference
    # Find occurrence of key in 2nd dict; if found, calculate the difference
    for key, value in dict1.items():
        if key in dict2.keys():
            dict3[key] = abs(value - dict2[key])
    return dict3
Example output
list1 = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5)]
list2 = [('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
print(findSimilarities(list1, list2))
{'brexit': 24, 'say': 17, 'uk': 6, 'may': 0}
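If you would rather see both counts for each flagged word, and skip words whose counts are equal, the same dictionary idea can be written more compactly. This is only a sketch under the assumption that each word appears at most once per list; `count_differences` is a hypothetical name, not something from the answers above.

```python
def count_differences(counts_a, counts_b):
    """Return {word: (count_a, count_b)} for shared words whose counts differ."""
    a, b = dict(counts_a), dict(counts_b)
    shared = a.keys() & b.keys()          # words present in both lists
    return {w: (a[w], b[w]) for w in sorted(shared) if a[w] != b[w]}


list1 = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5)]
list2 = [('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
print(count_differences(list1, list2))
# {'brexit': (11, 35), 'say': (11, 28), 'uk': (7, 1)}
```

Here 'may' is dropped because its counts match, and 'eu' and 'deal' are dropped because they occur in only one list.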

Lower elements in list given a certain position

I must lowercase letters in a list if they occupy certain positions, which were computed by a previous function. The function I must write is lower_words.
I'm having an issue: every time I lower an element the row is repeated.
I don't need to use the list "words" for this. I just left it there so you could better understand what the function does/must do. Can someone help me?
words= ["PATO", "GATO", "BOI", "CAO"]
grid1= ["PIGATOS",
"ANRBKFD",
"TMCAOXA",
"OOBBYQU",
"MACOUIV",
"EEJMIWL"]
positions_words_occupy = ((0, 0), (1, 0), (2, 0), (3, 0), (0, 2), (0, 3), (0, 4), (0, 5), (3, 2), (4, 3), (5, 4), (2, 2), (2, 3), (2, 4)) #these are the positions the words occupy. I have determined these positions with a previous function. first is the line, second the column
def lower_words(grid, positions_words_occupy):
    new= []
    for position in positions_words_occupy:
        line= position[0]
        column= position[1]
        row= grid[line]
        element= row[column]
        new.append(row.replace(element, element.lower()))
    return new
Expected output:
['pIgatoS', 'aNRBKFD', 'tMcaoXA', 'oObBYQU', 'MACoUIV', 'EEJMiWL']
Actual output:
['pIGATOS', 'aNRBKFD', 'tMCAOXA', 'ooBBYQU', 'PIgATOS', 'PIGaTOS', 'PIGAtOS', 'PIGAToS', 'OObbYQU', 'MACoUIV', 'EEJMiWL', 'TMcAOXA', 'TMCaOXa', 'TMCAoXA']
Changing the perspective, you can see it lowers the words I have in the list words:
['pIgatoS',
'aNRBKFD',
'tMcaoXA',
'oObBYQU',
'MACoUIV',
'EEJMiWL']
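You are very close! You're actually appending to your list new every time you replace a letter. That is why you are getting so many values in your list. Note also that row.replace(element, ...) lowercases every occurrence of that letter in the row, not just the one at the given column, which is why 'ooBBYQU' shows up in your output.
Another way to write the code is to create a copy of the grid and overwrite the corresponding row every time you lowercase a letter. Here is a new function implementing these small changes: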
def lower_words(grid, positions_words_occupy):
    new = grid.copy()  # copy the argument rather than the global grid1
    for position in positions_words_occupy:
        line = position[0]
        column = position[1]
        row = new[line]
        element = row[column]
        # rebuild the row with only the character at this column lowercased
        new_word = row[:column] + element.lower() + row[column+1:]
        new[line] = new_word
    return new
Output running lower_words(grid1, positions_words_occupy):
['pIgatoS', 'aNRBKFD', 'tMcaoXA', 'oObBYQU', 'MACoUIV', 'EEJMiWL']
I would first collect your grid positions in a collections.defaultdict of sets, then rebuild the strings with lowercase letters if their positions exist in these sets.
Demo:
from collections import defaultdict
grid1 = ["PIGATOS", "ANRBKFD", "TMCAOXA", "OOBBYQU", "MACOUIV", "EEJMIWL"]
positions_words_occupy = (
    (0, 0),
    (1, 0),
    (2, 0),
    (3, 0),
    (0, 2),
    (0, 3),
    (0, 4),
    (0, 5),
    (3, 2),
    (4, 3),
    (5, 4),
    (2, 2),
    (2, 3),
    (2, 4),
)
d = defaultdict(set)
for grid, pos in positions_words_occupy:
    d[grid].add(pos)
result = []
for grid, pos in d.items():
    result.append(
        "".join(x.lower() if i in pos else x for i, x in enumerate(grid1[grid]))
    )
print(result)
Output:
['pIgatoS', 'aNRBKFD', 'tMcaoXA', 'oObBYQU', 'MACoUIV', 'EEJMiWL']
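One thing to watch with the approach above is that a row with no occupied positions never enters the dictionary and would be missing from the result. A possible variation, sketched here under the assumption that positions are (line, column) pairs as in the question, walks every row of the grid and checks each coordinate against a single set:

```python
def lower_words(grid, positions):
    """Return a new grid with the characters at (line, column) positions lowercased."""
    occupied = set(positions)             # fast membership tests
    return [
        "".join(ch.lower() if (line, col) in occupied else ch
                for col, ch in enumerate(row))
        for line, row in enumerate(grid)
    ]


grid1 = ["PIGATOS", "ANRBKFD", "TMCAOXA", "OOBBYQU", "MACOUIV", "EEJMIWL"]
positions_words_occupy = ((0, 0), (1, 0), (2, 0), (3, 0), (0, 2), (0, 3), (0, 4),
                          (0, 5), (3, 2), (4, 3), (5, 4), (2, 2), (2, 3), (2, 4))
print(lower_words(grid1, positions_words_occupy))
# ['pIgatoS', 'aNRBKFD', 'tMcaoXA', 'oObBYQU', 'MACoUIV', 'EEJMiWL']
```

The input grid is left unchanged, and rows without any listed position are returned as they are.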

Sort a list of tuples in consecutive order

I want to sort a list of tuples in a consecutive order, so the first element of each tuple is equal to the last element of the previous one.
For example:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
I have developed a search like this:
output=[]
given = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
t = given[0][0]
for i in range(len(given)):
    # search tuples starting with element t
    output += [e for e in given if e[0] == t]
    t = output[-1][-1] # Get the next element to search
print(output)
Is there a pythonic way to achieve such order?
And a way to do it "in-place" (with only a list)?
In my problem, the input can be reordered in a circular way using all the tuples, so it does not matter which element is chosen first.
Assuming the tuples in your list form a cycle, you can use a dict to achieve this in O(n):
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
input_dict = dict(input) # Convert list of `tuples` to dict
elem = input[0][0] # start point in the new list
new_list = [] # List of tuples for holding the values in required order
for _ in range(len(input)):
    new_list.append((elem, input_dict[elem]))
    elem = input_dict[elem]
    if elem not in input_dict:
        # Raise exception in case list of tuples is not circular
        raise Exception('key {} not found in dict'.format(elem))
The final value held by new_list will be:
>>> new_list
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
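The same dictionary lookup can be wrapped in a function that also reports inputs that do not form a single cycle. This is a sketch only: `chain_tuples` is a hypothetical name, and it assumes every first element is unique and hashable.

```python
def chain_tuples(pairs):
    """Order pairs so each first element equals the previous second element."""
    if not pairs:
        return []
    successor = dict(pairs)               # first element -> second element
    if len(successor) != len(pairs):
        raise ValueError("two pairs share the same first element")
    ordered = []
    elem = pairs[0][0]
    for _ in range(len(pairs)):
        if elem not in successor:
            # Either no pair starts with elem, or that pair was already used.
            raise ValueError("chain breaks at {!r}".format(elem))
        nxt = successor.pop(elem)         # consume each pair exactly once
        ordered.append((elem, nxt))
        elem = nxt
    return ordered


print(chain_tuples([(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]))
# [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
```

Popping each entry as it is used means an input made of two separate cycles raises an error instead of repeating tuples.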
If you are not afraid to waste some memory, you could create a dictionary start_dict containing the start integers as keys and the tuples as values, and do something like this:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_dict = {item[0]: item for item in tpl}
start = tpl[0][0]
res = []
while start_dict:
    item = start_dict[start]
    del start_dict[start]
    res.append(item)
    start = item[-1]
print(res)
If two tuples start with the same number, you will lose one of them. If the chain closes before all the start numbers are used, the lookup start_dict[start] fails with a KeyError instead of the loop finishing cleanly.
But maybe this is something to build on.
Actually, there are many open questions about what output you intend and what should happen if the input list does not have a valid structure.
Assume the input is a list of pairs where each number appears exactly twice. We can then treat the input as a graph where the numbers are nodes and each pair is an edge. As far as I understand your question, you assume that this graph is a cycle that looks like this:
10 - 7 - 13 - 4 - 9 - 10 (same 10 as at the beginning)
This shows that the list can be reduced to [10, 7, 13, 4, 9] to store the graph. Here is a script that sorts the input list:
# input
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# sorting and archiving
first = input[0][0]
last = input[0][1]
output_in_place = [first, last]
while last != first:
    for item in input:
        if item[0] == last:
            last = item[1]
            if last != first:
                output_in_place.append(last)
print(output_in_place)
# output
output = []
for i in range(len(output_in_place) - 1):
    output.append((output_in_place[i], output_in_place[i+1]))
output.append((output_in_place[-1], output_in_place[0]))
print(output)
I would first create a dictionary of the form
{first_value: [list of tuples with that first value], ...}
Then work from there:
from collections import defaultdict
chosen_tuples = input[:1] # Start from the first
first_values = defaultdict(list) # Missing keys start as an empty list
for tup in input[1:]:
    first_values[tup[0]].append(tup)
while first_values: # Loop will end when all lists are removed
    value = chosen_tuples[-1][1] # Second item of last tuple
    tuples_with_that_value = first_values[value]
    chosen_tuples.append(tuples_with_that_value.pop())
    if not tuples_with_that_value:
        del first_values[value] # List empty, remove it
You can try this:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [input[0]] # output contains the first element of input
temp = input[1:] # temp contains the rest of elements in input
while temp:
    item = [i for i in temp if i[0] == output[-1][1]].pop() # We compare each element with output[-1]
    output.append(item) # We add the right item to output
    temp.remove(item) # We remove each handled element from temp
Output:
>>> output
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
Here is a robust solution using the sorted function and a custom key function:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
def consec_sort(lst):
    def key(x):
        nonlocal index
        if index <= lower_index:
            index += 1
            return -1
        return abs(x[0] - lst[index - 1][1])
    for lower_index in range(len(lst) - 2):
        index = 0
        lst = sorted(lst, key=key)
    return lst
output = consec_sort(input)
print(output)
The original list is not modified. Note that sorted is called 3 times for your input list of length 5. In each call, one additional tuple is placed correctly. The first tuple keeps its original position.
I have used the nonlocal keyword, meaning that this code is for Python 3 only (in Python 2 one would need a workaround, such as keeping index in a module-level global declared in both functions).
My two cents:
def match_tuples(input):
    # making a copy to not mess up with the original one
    tuples = input[:] # [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
    last_elem = tuples.pop(0) # (10,7)
    # { "first tuple's element": "index in list"}
    indexes = {tup[0]: i for i, tup in enumerate(tuples)} # {9: 3, 4: 0, 13: 1, 7: 2}
    yield last_elem # yields the first element
    for i in range(len(tuples)):
        # get the index of the tuple whose first element matches the last element of the previous tuple
        list_index = indexes.get(last_elem[1])
        last_elem = tuples[list_index] # just get that tuple
        yield last_elem
Output:
input = [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
print(list(match_tuples(input)))
# output: [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
This is a variant (less efficient than the dictionary version) where the list is changed in-place:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
for i in range(1, len(tpl)-1): # iterate over the indices of the list
    item = tpl[i-1] # the tuple already in place just before position i
    for j, next_item in enumerate(tpl[i:]): # find the next item
                                            # in the remaining list
        if next_item[0] == item[1]:
            next_index = i + j
            break
    tpl[i], tpl[next_index] = tpl[next_index], tpl[i] # now swap the items
here is a more efficient version of the same idea:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_index = {item[0]: i for i, item in enumerate(tpl)}
item = tpl[0]
next_index = start_index[item[-1]]
for i in range(1, len(tpl)-1):
    tpl[i], tpl[next_index] = tpl[next_index], tpl[i]
    # need to update the start indices:
    start_index[tpl[next_index][0]] = next_index
    start_index[tpl[i][0]] = i
    next_index = start_index[tpl[i][-1]]
print(tpl)
the list is changed in-place; the dictionary only contains the starting values of the tuples and their index in the list.
To get a O(n) algorithm one needs to make sure that one doesn't do a double-loop over the array. One way to do this is by keeping already processed values in some sort of lookup-table (a dict would be a good choice).
For example something like this (I hope the inline comments explain the functionality well). This modifies the list in-place and should avoid unnecessary (even implicit) looping over the list:
inp = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# A dictionary containing processed elements, first element is
# the key and the value represents the tuple. This is used to
# avoid the double loop
seen = {}
# The second value of the first tuple. This must match the first
# item of the next tuple
current = inp[0][1]
# Iteration to insert the next element
for insert_idx in range(1, len(inp)):
    # print('insert', insert_idx, seen)
    # If the next value was already found no need to search, just
    # pop it from the seen dictionary and continue with the next loop
    if current in seen:
        item = seen.pop(current)
        inp[insert_idx] = item
        current = item[1]
        continue
    # Search the list until the next value is found saving all
    # other items in the dictionary so we avoid to do unnecessary iterations
    # over the list.
    for search_idx in range(insert_idx, len(inp)):
        # print('search', search_idx, inp[search_idx])
        item = inp[search_idx]
        first, second = item
        if first == current:
            # Found the next tuple, break out of the inner loop!
            inp[insert_idx] = item
            current = second
            break
        else:
            seen[first] = item

Group continuous numbers in a tuple with tolerance range

If I have a list of tuples of numbers:
locSet = [(62.5, 121.0), (62.50000762939453, 121.00001525878906), (63.0, 121.0),(63.000003814697266, 121.00001525878906), (144.0, 41.5)]
I want to group them with a tolerance range of +/- 3.
aFunc(locSet)
which returns
[(62.5, 121.0), (144.0, 41.5)]
I have seen "Identify groups of continuous numbers in a list", but that is for continuous integers.
If I have understood correctly, you are looking for the tuples whose values differ by an absolute amount that is in the tolerance range: [0, 1, 2, 3]
Assuming this, my solution returns a list of lists, where every inner list contains tuples that satisfy the condition.
def aFunc(locSet):
    # Sort the list.
    locSet = sorted(locSet,key=lambda x: x[0]+x[1])
    toleranceRange = 3
    resultLst = []
    for i in range(len(locSet)):
        sum1 = locSet[i][0] + locSet[i][1]
        tempLst = [locSet[i]]
        for j in range(i+1,len(locSet)):
            sum2 = locSet[j][0] + locSet[j][1]
            if (abs(sum1-sum2) in range(toleranceRange+1)):
                tempLst.append(locSet[j])
        if (len(tempLst) > 1):
            for lst in resultLst:
                if (list(set(tempLst) - set(lst)) == []):
                    # This solution is part of a previous solution.
                    # Doesn't include it.
                    break
            else:
                # Valid solution.
                resultLst.append(tempLst)
    return resultLst
Here are two usage examples:
locSet1 = [(62.5, 121.0), (62.50000762939453, 121.00001525878906), (63.0, 121.0),(63.000003814697266, 121.00001525878906), (144.0, 41.5)]
locSet2 = [(10, 20), (12, 20), (13, 20), (14, 20)]
print(aFunc(locSet1))
[[(62.5, 121.0), (144.0, 41.5)]]
print(aFunc(locSet2))
[[(10, 20), (12, 20), (13, 20)], [(12, 20), (13, 20), (14, 20)]]
I hope to have been of help.
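Another possible reading of the question is that points lying close together should collapse into one representative, which is what the expected output [(62.5, 121.0), (144.0, 41.5)] suggests. The sketch below is one way to do that. It assumes "within tolerance" means both coordinates differ by at most the tolerance from an already kept point, and `group_with_tolerance` is a hypothetical name.

```python
def group_with_tolerance(points, tolerance=3):
    """Keep the first point of every cluster; drop points close to a kept one."""
    representatives = []
    for x, y in points:
        close = any(abs(x - rx) <= tolerance and abs(y - ry) <= tolerance
                    for rx, ry in representatives)
        if not close:
            representatives.append((x, y))
    return representatives


locSet = [(62.5, 121.0), (62.50000762939453, 121.00001525878906), (63.0, 121.0),
          (63.000003814697266, 121.00001525878906), (144.0, 41.5)]
print(group_with_tolerance(locSet))
# [(62.5, 121.0), (144.0, 41.5)]
```

This is a greedy pass, so the result depends on the order of the input: the first point seen in each neighbourhood is the one that is kept.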

Python inverted index issue: "ValueError: too many values to unpack"

I'm trying to build an inverted index, but I keep getting the same error:
ValueError: too many values to unpack
It occurs in this part of my code:
def inverted_index(self, text):
    terms = self.getTerms(text)
    inverted = {}
    for index, word in terms:
        locations = inverted.setdefault(word)
        locations.append(index)
    return inverted
To be more specific, it occurs on the line "for index, word in terms:".
If I print terms, I get a list of words:
['12th', 'comput', 'olympiad', 'ciao', 'chiamo', 'alberto', 'lancellotti', 'scrivendo', 'primo', 'articolo', 'piccolo', 'motor', 'ricerca', 'manual']
Any ideas?
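Python is complaining because iterating over terms yields only one thing per step, the word itself, which it then tries to unpack into index and word; a string longer than two characters has too many values for that. What you want is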
for index, word in enumerate(terms):
That will return both the index for the item and the item itself.
for index, word in enumerate(terms):
enumerate will give you the index of each item in the list:
In [5]: for index, word in enumerate(terms):
            print(index,word)
(0, '12th')
(1, 'comput')
(2, 'olympiad')
(3, 'ciao')
(4, 'chiamo')
(5, 'alberto')
(6, 'lancellotti')
(7, 'scrivendo')
(8, 'primo')
(9, 'articolo')
(10, 'piccolo')
(11, 'motor')
(12, 'ricerca')
(13, 'manual')
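Once the unpacking is fixed, the next line in the question's code would fail as well: inverted.setdefault(word) without a second argument stores and returns None, so locations.append(index) raises an AttributeError. A minimal sketch of the whole function, written as a standalone function that takes the list of terms directly (an assumption, since the original is a method that calls self.getTerms):

```python
def inverted_index(terms):
    """Map each term to the list of positions where it occurs."""
    inverted = {}
    for index, word in enumerate(terms):
        # setdefault needs an explicit default; without one it stores None.
        inverted.setdefault(word, []).append(index)
    return inverted


print(inverted_index(['ciao', 'motor', 'ciao']))
# {'ciao': [0, 2], 'motor': [1]}
```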
