I was attempting some preprocessing on a nested list before training a small word2vec model and ran into the following issue:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
The output above is a nested list, and I intended to remove the stopwords, e.g. 'is' and 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sub-list after removing 'is' from each sub-list, instead of iterating through it entirely.
What is the reason for this? Is it an indexing issue? If so, how can I resolve it while retaining the nested structure?
All your code is correct except for one minor change: use [:] to iterate over a copy of the list, so you avoid mutating the list you are looping over. Specifically, you create a copy of a list as lst_copy = lst[:]. This is one way to copy among several others (see here for comprehensive ways). When you iterate through the original list and remove items from it at the same time, the iterator's internal index gets out of step: removing 'is' shifts the following words one position to the left, the iterator then advances past the shifted 'a', and 'a' is never checked.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]:  # <--- iterate over a copy of the list made with [:]
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
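If you don't need to modify the lists in place, a list comprehension sidesteps the problem entirely, since it builds new lists instead of removing items from the lists being iterated. A minimal sketch, assuming the stopword set is just the two words from the question:
stopwords = {'is', 'a'}  # assumed stopword set for illustration
cleaned = [[word for word in sentence if word not in stopwords] for sentence in corpus]
# [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]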
Maybe you can define a custom method to reject elements matching a certain condition, similar to the generator functions in itertools (for example, itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the method in place, you can use it this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
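For what it's worth, the standard library already ships this helper as itertools.filterfalse, which yields exactly the elements for which the predicate is false:
from itertools import filterfalse

stopwords = ['is', 'and', 'a']
[list(filterfalse(lambda x: x in stopwords, ary)) for ary in corpus]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]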
I am interested in finding the same words in two lists. I have two lists of words in text_list, and I have also stemmed the words.
text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
So I need this output:
same_words= ['a', 'sentence', 'interest']
You need to apply stemming to both lists. There are discrepancies, for example 'interesting' vs. 'interest', and if you apply stemming to only words_list, then 'sentence' becomes 'sentenc' and no longer matches the unstemmed 'sentence' in text_list. Therefore, apply the stemmer to both lists and then find the common elements:
from nltk.stem import PorterStemmer
text_list = [['i', 'am', 'interest','for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
ps = PorterStemmer()
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem,i)) for i in text_list]
answer = []
for i in text_list:
    answer.append(list(set(words_list).intersection(set(i))))
output = sum(answer, [])
print(output)
>>> ['interest', 'a', 'sentenc']
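A side note on sum(answer, []): it flattens by repeated list concatenation, re-allocating at each step. itertools.chain.from_iterable is the more conventional flattener and gives the same elements (the order within each sub-list comes from a set, so it can vary between runs):
from itertools import chain

output = list(chain.from_iterable(answer))
print(output)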
There is a package called fuzzywuzzy which allows you to approximately match strings from one list against strings from another list.
First of all, you will need to flatten your nested list to a list/set of unique strings.
from itertools import chain
newset = set(chain(*text_list))
{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}
Next, we import the fuzz module from the fuzzywuzzy package.
from fuzzywuzzy import fuzz
result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]
[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]
As described here, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives the percentage of matching characters between the two elements. You can remove the max to see the full list of scores. (Some of the letters in 'for' also occur in 'word', which is why it appears in this tuple list too, with a 57% match. You can later use a for loop and a percentage tolerance to remove the matches that fall below that tolerance.)
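For example, a minimal sketch of that tolerance filter, where the threshold of 70 is an arbitrary choice for illustration:
threshold = 70  # hypothetical cutoff; tune for your data
filtered = [(score, word) for score, word in result if score >= threshold]
# [(100, 'a'), (100, 'sentence'), (84, 'interest')]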
Finally, you will use map to get your desired output.
similarity_score, fuzzy_match = map(list,zip(*result))
fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']
Extra
If your input is not standard ASCII, you can pass another argument to fuzz.token_set_ratio:
a = ['У', 'вас', 'є', 'чашка', 'кави?']
b = ['ви']
[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
I have a list of lists that I would like to iterate over using a for loop, creating a new list with only the unique words. This is similar to a question asked previously, but I could not get the solution to work for a list within a list.
For example, the nested list is as follows:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
The desired output would be a single list:
List_Unique = [['is','and','so','he','his','run']]
I have tried the following two variations of code, but the output of all of them is a list of repeats:
unique_redundant = []
for i in redundant_search:
    redundant_i = [j for j in i if not i in unique_redundant]
    unique_redundant.append(redundant_i)
unique_redundant
unique_redundant = []
for list in redundant_search:
    for j in list:
        redundant_j = [i for i in j if not i in unique_redundant]
        unique_redundant.append(length_j)
unique_redundant
Example output for the two (incorrect) variations above (I ran the code on my real data set and it gave repeating lists-within-lists of the same pair of words; these aren't the actual two words, just an example):
List_Unique = [['is','and'],['is','and'],['is','and']]
I'd suggest using set().union() in this way:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
set().union(*ListofList)
# => {'run', 'and', 'so', 'is', 'his', 'he'}
Explanation
It works like the following:
test_set = set().union([1])
print(test_set)
# => {1}
The asterisk operator before the list (*ListofList) unpacks the list:
lst = [[1], [2], [3]]
print(lst) #=> [[1], [2], [3]]
print(*lst) #=> [1] [2] [3]
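If you need a list rather than a set as the final result, wrap the union in list() (the element order is arbitrary, since sets are unordered):
unique_words = list(set().union(*ListofList))
# e.g. ['run', 'and', 'so', 'is', 'his', 'he']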
First flatten the list with itertools.chain, then use set to return the unique elements and pass that into a list:
from itertools import chain
if __name__ == '__main__':
    print(list(set(chain(*ListofList))))
Use itertools.chain to flatten the list and dict.fromkeys to keep the unique values in order:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
from itertools import chain
List_Unique = [list(dict.fromkeys(chain.from_iterable(ListofList)))]
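On Python 3.7+, where plain dicts preserve insertion order, this yields the unique words in their original order, wrapped in an outer list as in the desired output:
print(List_Unique)
# [['is', 'and', 'so', 'he', 'his', 'run']]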
Just index into the nested list with the help of a while loop and collect all the values into a new list while cnt < len(ListofList):
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
list_new = []
cnt = 0
while cnt < len(ListofList):
    for i in ListofList[cnt]:
        if i in list_new:
            continue
        else:
            list_new.append(i)
    cnt += 1
print(list_new)
OUTPUT
['is', 'and', 'so', 'he', 'his', 'run']
flat_list = [item for sublist in ListofList for item in sublist]

# use this if order should not change
List_Unique = []
for item in flat_list:
    if item not in List_Unique:
        List_Unique.append(item)

# use this if order is not an issue
# List_Unique = list(set(flat_list))
You could try this:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
uniqueItems = []
for firstList in ListofList:
    for item in firstList:
        if item not in uniqueItems:
            uniqueItems.append(item)
print(uniqueItems)
It uses a nested for loop to access each item and check whether it is in uniqueItems.
Using the basic set concept: a set consists of unique elements.
lst = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
new_list = []
for x in lst:
    for y in set(x):
        new_list.append(y)
print(list(set(new_list)))
['run', 'and', 'is', 'so', 'he', 'his']
I was trying to write some code for tokenizing strings in Python for NLP and came up with this:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension; for example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could work with a single loop. The dataset I will be importing is pretty large, and O(n) complexity would be a lot better than O(n^2). So, is there a better way to do this, or a way to do it with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s = []
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index 'a', like this:
L = []
If you append the result of the split to that, like L.append([1]), you end up with a list inside this list: [[1]].
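If you ever want the elements themselves rather than a nested list, list.extend is the counterpart that splices them in; a quick illustration:
L = []
L.append([1, 2])  # L is now [[1, 2]] - the whole list goes in as one element
M = []
M.extend([1, 2])  # M is now [1, 2] - the elements are added individually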
You should use split() on every string in the loop.
Example with list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
['I', 'loved', 'the', 'tea.'],
['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
See this:
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]
# split() by default splits on whitespace and returns a list
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
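Note that split() with no argument also collapses runs of whitespace, which split(' ') does not; a quick comparison:
'I  am  Batman.'.split()     # ['I', 'am', 'Batman.']
'I  am  Batman.'.split(' ')  # ['I', '', 'am', '', 'Batman.']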
Let's say I have a list like this:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
Now I have a list like this:
['she', 'is', 'student']
I want to query the larger list with this one, and return all the lists that contain the words within the query list in the same order. There might be gaps, but the order should be the same. How can I do that? I tried using the in operator but I don't get the desired output.
If all you care about is that the words appear in order somewhere in the array, you can use a collections.deque and popleft to iterate through the list; if the deque is emptied, you have found a valid match:
from collections import deque
def find_gappy(arr, m):
    dq = deque(m)
    for word in arr:
        if word == dq[0]:
            dq.popleft()
            if not dq:
                return True
    return False
By comparing each word in arr with the first element of dq, we know that when we find a match, it has been found in the correct order, and then we popleft, so we now are comparing with the next element in the deque.
To filter your initial list, you can use a simple list comprehension that filters based on the result of find_gappy:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
You can compare two lists with a function like this one. The way it works is that it loops through your shorter list, and every time it finds the next word in the long list, it cuts off the first part of the longer list at that point. If it can't find the word, it returns False.
def is_sub_sequence(long_list, short_list):
    for word in short_list:
        if word in long_list:
            i = long_list.index(word)
            long_list = long_list[i+1:]
        else:
            return False
    return True
Now that you have a function to tell you whether a list is of the desired type, you can filter out all the lists you need from the 'list of lists' using a list comprehension like the following:
a = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
b = ['she', 'is', 'student']
filtered = [x for x in a if is_sub_sequence(x,b)]
The list filtered will include only the lists of the desired type.
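A compact alternative subsequence test uses a single iterator; the in operator consumes the iterator as it searches, so each word can only be found after the previous one:
def is_sub_sequence(long_list, short_list):
    it = iter(long_list)
    return all(word in it for word in short_list)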
I'm a newbie. I've written a tokenize function which takes in a txt file that consists of sentences and splits them based on whitespace and punctuation. The thing is, it gives me an output with sublists inside a parent list.
My code:
import re

def tokenize(document):
    file = open("document.txt")
    text = file.read()
    hey = text.lower()
    words = re.split(r'\s\s+', hey)
    print [re.findall(r'\w+', b) for b in words]
My output:
[['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden'], ['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Desired Output:
['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden']['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']
How do I remove the parent list from my output? What changes do I need to make to my code in order to remove the outer list brackets? I want them as individual lists.
A function in Python can only return one value. If you want to return two things (for example, in your case, there are two lists of words), you have to return an object that can hold both, such as a list, a tuple, or a dictionary.
Do not confuse how you want to print the output vs. what is the object returned.
To simply print the lists:
for b in words:
    print(re.findall(r'\w+', b))
If you do this, then your method doesn't return anything (it actually returns None).
To return both the lists:
return [re.findall(r'\w+', b) for b in words]
Then call your method like this:
word_lists = tokenize(document)
for word_list in word_lists:
    print(word_list)
this should work; note that join needs strings, so each word list is converted to its string form first:
print(''.join(str(re.findall(r'\w+', b)) for b in words))
I have an example which, I guess, is not much different from the problem you have, where I only take a certain part of the list:
>>> a = [['sa', 'bbb', 'ccc'], ['dad', 'des', 'kkk']]
>>>
>>> print a[0], a[1]
['sa', 'bbb', 'ccc'] ['dad', 'des', 'kkk']
>>>