finding gappy sublists within a larger list - python

Let's say I have a list like this:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
Now I have a list like this:
['she', 'is', 'student']
I want to query the larger list with this one, and return all the lists that contain the words within the query list in the same order. There might be gaps, but the order should be the same. How can I do that? I tried using the in operator but I don't get the desired output.

If all that you care about is that the words appear in order somewhere in the list, you can use a collections.deque of the query and popleft as you iterate through the list; if the deque is emptied, you have found a valid match:
from collections import deque

def find_gappy(arr, m):
    dq = deque(m)
    for word in arr:
        if word == dq[0]:
            dq.popleft()
            if not dq:
                return True
    return False
By comparing each word in arr with the first element of dq, we know that when we find a match, it has been found in the correct order, and then we popleft, so we now are comparing with the next element in the deque.
To filter your initial list, you can use a simple list comprehension that filters based on the result of find_gappy:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
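A compact alternative worth knowing (a sketch added here for comparison, not part of the answer above) is the iterator idiom: a membership test against an iterator consumes it up to and including the match, so the query words are forced to appear in order.
def find_gappy_iter(arr, m):
    # each 'word in it' test advances the iterator past the match,
    # so later query words can only be found after earlier ones
    it = iter(arr)
    return all(word in it for word in m)
It can be dropped into the same list comprehension in place of find_gappy.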

You can compare two lists with a function like this one. It loops through your shorter list, and every time it finds the next word in the longer list, it cuts off the beginning of the longer list at that point. If it can't find the word, it returns False.
def is_sub_sequence(long_list, short_list):
    for word in short_list:
        if word in long_list:
            i = long_list.index(word)
            long_list = long_list[i+1:]
        else:
            return False
    return True
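For example, a quick walk-through on two of the sublists (hypothetical calls, not part of the answer's code):
print(is_sub_sequence(['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'student']))
# True  -- 'she' is found at 0, then 'is', then 'student' in what remains
print(is_sub_sequence(['i', 'am', 'a', 'teacher'], ['she', 'is', 'student']))
# False -- 'she' is never found, so the function returns False immediately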
Now that you have a function that tells you whether a list is of the desired type, you can filter out all the lists you need from the 'list of lists' using a list comprehension like the following:
a = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
b = ['she', 'is', 'student']
filtered = [x for x in a if is_sub_sequence(x,b)]
The list filtered will include only the lists of the desired type.

Related

Find words that do not match a list in a list

I would like to find the words in a master list that do not match words in another list.
Code is:
master = ['This', 'is', 'a', 'pond', 'full', 'of', 'good', 'words']
dontfindme = ['po', 'go', 'a']
Expected result is:
['This', 'is', 'full', 'of', 'words']
Can do:
list(set(master).difference(set([m for m in master for df in dontfindme if df in m])))
...but it screws up the order.
Is there a better way using just list comprehension?
master = ['This', 'is', 'a', 'pond', 'full', 'of', 'good', 'words']
dontfindme = ['po', 'go', 'a']
result = [x for x in master if all(item not in x for item in dontfindme)]
print(result)
Gives:
['This', 'is', 'full', 'of', 'words']
You can use the Python built-in filter() method.
filter(function, iterable)
Construct an iterator from those elements of iterable for which function returns true. iterable may be either a sequence, a container which supports iteration, or an iterator. If function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.
Note that filter(function, iterable) is equivalent to the generator expression (item for item in iterable if function(item)) if function is not None and (item for item in iterable if item) if function is None.
def _filter():
    master = ['This', 'is', 'a', 'pond', 'full', 'of', 'good', 'words']
    dontfindme = ['po', 'go', 'a']
    return list(filter(lambda x: all([item not in x for item in dontfindme]), master))

if __name__ == '__main__':
    print(_filter())
Output:
['This', 'is', 'full', 'of', 'words']
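Per the equivalence quoted above, the same filtering can also be written directly as a generator expression (a small sketch using the same data):
master = ['This', 'is', 'a', 'pond', 'full', 'of', 'good', 'words']
dontfindme = ['po', 'go', 'a']
result = list(x for x in master if all(item not in x for item in dontfindme))
print(result)  # ['This', 'is', 'full', 'of', 'words']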

Is there a better way to tokenize some strings?

I was trying to write code for tokenizing strings in Python for some NLP and came up with this code:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension. For example, if I want the word 'Batman', I would have to type s[0][0][2] instead of s[0][2], so I changed the code to:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = (line.split())
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could work with a single loop, because the dataset I will be importing will be pretty large, and a complexity of n would be a lot better than n^2. So, is there a better way to do this / a way to do this with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s = []
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index a, like this:
L = []
If you then append the result of the split to that, like L.append([1]), you end up with a list inside that list: [[1]]
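A minimal illustration of the difference (a hypothetical snippet, not from the question):
s = []
s.append([])                          # s is [[]]
s[0].append('I am Batman.'.split())   # appending a list into that inner list -> [[['I', 'am', 'Batman.']]]

t = []
t.append('I am Batman.'.split())      # appending the split result directly -> [['I', 'am', 'Batman.']]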
You should call split() on every string in the loop.
Example with list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
['I', 'loved', 'the', 'tea.'],
['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
See this:
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]
# split() splits on whitespace by default and returns a list
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

Nested List Iteration

I was attempting some preprocessing on a nested list before attempting a small word2vec model and encountered an issue, as follows:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
So the output above is a nested list, and I intend to remove the stopwords, e.g. 'is', 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sublist after removing 'is' in each sublist, instead of iterating through it entirely.
What is the reasoning behind this? The index? If so, how can I resolve it, assuming I'd like to retain the nested structure?
All your code is correct except for a minor change: use [:] to iterate over a copy of the list's contents, so you avoid modifying the list you are iterating over. Specifically, you create a copy of a list as lst_copy = lst[:]; this is one way to copy among several others. When you iterate through the original list and remove items from it at the same time, the loop's internal position counter gets out of step with the shifted elements, which causes the problem you observe.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]:  # <--- iterate over a copy of the sublist created with [:]
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
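A minimal illustration of the skipping behaviour described above (a hypothetical snippet, not from the question):
words = ['is', 'a', 'brave']
for w in words:          # the loop advances by position
    if w in ('is', 'a'):
        words.remove(w)  # removal shifts the rest left, so the next item is skipped
print(words)             # ['a', 'brave'] -- 'a' survives because it was skipped over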
Maybe you can define a custom method to reject elements matching a certain condition, similar to the itertools helpers (for example itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the method in place, you can use it this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
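For what it's worth, the standard library already has an equivalent of reject_if: itertools.filterfalse keeps the elements for which the predicate is false. A sketch using the same corpus and stopwords as above:
from itertools import filterfalse

stopwords = ['is', 'and', 'a']
[list(filterfalse(lambda x: x in stopwords, ary)) for ary in corpus]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]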
nested = [input()]
nested = [i.split() for i in nested]

finding gappy sublists within a certain range

Recently I asked a question here where I wanted to find sublists within a larger list. I have a similar but slightly different question. Suppose I have this list:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
and I want to query it using matches = ['she', 'is', 'student'], with the intention of retrieving from the queried list all the sublists that contain the elements of matches in the same order. The only difference from the question in the link is that I want to add a range parameter to the find_gappy function so that it refrains from retrieving sublists in which the gap(s) between elements exceed the specified range. For instance, in the example above, I would like a function like this:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches, range=2)]
which would return:
[['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
The last list doesn't show up since, in the sentence she is a very very exceptionally good student, the gap between 'is' and 'student' exceeds the range limit.
How can I write such a function?
Here is one way that also takes the order of the items in the match list into consideration:
In [102]: def find_gappy(all_sets, matches, gap_range=2):
     ...:     zip_m = list(zip(matches, matches[1:]))
     ...:     for lst in all_sets:
     ...:         indices = {j: i for i, j in enumerate(lst)}
     ...:         try:
     ...:             if all(0 <= indices[j] - indices[i] - 1 <= gap_range for i, j in zip_m):
     ...:                 yield lst
     ...:         except KeyError:
     ...:             pass
     ...:
Demo:
In [110]: lst = [['she', 'is', 'a', 'student'],
     ...:        ['student', 'she', 'is', 'a', 'lawer'],  # for order check
     ...:        ['she', 'is', 'a', 'great', 'student'],
     ...:        ['i', 'am', 'a', 'teacher'],
     ...:        ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]

In [111]: list(find_gappy(lst, ['she', 'is', 'student'], gap_range=2))
Out[111]: [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
If there are duplicate words in your sublists, you can use a defaultdict() to keep track of all the indices and use itertools.product to compare the gap over all ordered index pairs for each pair of query words.
In [9]: from collections import defaultdict

In [10]: from itertools import product

In [11]: def find_gappy(all_sets, matches, gap_range=2):
    ...:     zip_m = list(zip(matches, matches[1:]))
    ...:     for lst in all_sets:
    ...:         indices = defaultdict(list)
    ...:         for i, j in enumerate(lst):
    ...:             indices[j].append(i)
    ...:         try:
    ...:             # k runs over indices of the earlier word, v over indices of the later word
    ...:             if all(any(0 <= v - k - 1 <= gap_range for k, v in product(indices[i], indices[j])) for i, j in zip_m):
    ...:                 yield lst
    ...:         except KeyError:
    ...:             pass
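A quick check of the duplicate-handling version, with an extra 'is' inserted into one sublist (hypothetical data, not from the question):
data = [['she', 'is', 'very', 'is', 'a', 'student'],
        ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
print(list(find_gappy(data, ['she', 'is', 'student'], gap_range=2)))
# [['she', 'is', 'very', 'is', 'a', 'student']] -- the second 'is' is close enough to 'student'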
The technique in the linked question is decent enough; you just need to add gap counting along the way and, since you don't want a global count, reset the counter whenever you encounter a match. Something like:
import collections

def find_gappy(source, matches, max_gap=-1):
    matches = collections.deque(matches)
    counter = max_gap  # initialize as -1 if you want to begin counting AFTER the first match
    for word in source:
        if word == matches[0]:
            counter = max_gap  # or remove this for global gap counting
            matches.popleft()
            if not matches:
                return True
        else:
            counter -= 1
            if counter == -1:
                return False
    return False
data = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
matches = ['she', 'is', 'student']
x = [i for i in data if find_gappy(i, matches, 2)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
As a bonus, you can use it like the original function: the gap counting is applied only if you pass a non-negative number as max_gap.
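For instance, called with the default max_gap=-1 the counter never hits -1, so gap counting is effectively disabled and the function behaves like the original subsequence check (a small sketch reusing the data and matches defined above):
x = [i for i in data if find_gappy(i, matches)]
# [['she', 'is', 'a', 'student'],
#  ['she', 'is', 'a', 'great', 'student'],
#  ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]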

Splitting python lists

I'm a newbie. I've written a tokenize function which basically takes in a txt file that consists of sentences and splits them on whitespace and punctuation. The thing is, it gives me output with sublists nested within a parent list.
My code:
import re

def tokenize(document):
    file = open("document.txt")
    text = file.read()
    hey = text.lower()
    words = re.split(r'\s\s+', hey)
    print [re.findall(r'\w+', b) for b in words]
My output:
[['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden'], ['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Desired Output:
['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden']['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']
How do I remove the parent list in my output? What changes do I need to make in my code in order to remove the outer list brackets? I want them as individual lists.
A function in Python can only return one value. If you want to return two things (for example, in your case, there are two lists of words) you have to return an object that can hold both, such as a list, a tuple, or a dictionary.
Do not confuse how you want to print the output with what object is returned.
To simply print the lists:
for b in words:
    print(re.findall(r'\w+', b))
If you do this, then your method doesn't return anything (it actually returns None).
To return both the lists:
return [re.findall(r'\w+', b) for b in words]
Then call your method like this:
word_lists = tokenize(document)
for word_list in word_lists:
    print(word_list)
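Putting the pieces together, here is a sketch of a corrected tokenize that returns the lists instead of printing them; it keeps the same regex-based splitting as the question, but uses the document argument rather than the hard-coded filename (an assumption on my part):
import re

def tokenize(document):
    # read the file, lowercase it, split into sentences on runs of whitespace,
    # then split each sentence into word tokens
    with open(document) as f:
        text = f.read().lower()
    sentences = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', sentence) for sentence in sentences]

for word_list in tokenize("document.txt"):
    print(word_list)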
This should work:
print ','.join(str(re.findall(r'\w+', b)) for b in words)
I have an example which I guess is not much different from the problem you have, where I only take a certain part of the list:
>>> a = [['sa', 'bbb', 'ccc'], ['dad', 'des', 'kkk']]
>>>
>>> print a[0], a[1]
['sa', 'bbb', 'ccc'] ['dad', 'des', 'kkk']
>>>
