Splitting Python lists

I'm a newbie. I've written a tokenize function which takes in a txt file that consists of sentences and splits them based on whitespace and punctuation. The problem is that it gives me an output with sublists nested inside a parent list.
My code:
import re

def tokenize(document):
    file = open("document.txt")
    text = file.read()
    hey = text.lower()
    words = re.split(r'\s\s+', hey)
    print([re.findall(r'\w+', b) for b in words])
My output:
[['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden'], ['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Desired Output:
['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden']['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']
How do I remove the parent list from my output? What changes do I need to make in my code in order to remove the outer list brackets?

I want them as individual lists
A function in Python can only return one value. If you want to return two things (in your case, two lists of words) you have to return a single object that can hold both, such as a list, a tuple, or a dictionary.
Do not confuse how you want to print the output with what object is returned.
To simply print the lists:
for b in words:
    print(re.findall(r'\w+', b))
If you do this, then your method doesn't return anything (it actually returns None).
To return both the lists:
return [re.findall(r'\w+', b) for b in words]
Then call your method like this:
word_lists = tokenize(document)
for word_list in word_lists:
    print(word_list)
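If you know the file always yields exactly two sentence blocks, as in the question, you could also unpack the returned list into two names. This is just an illustrative sketch, not part of the original answer:
first, second = tokenize(document)  # assumes tokenize returns exactly two word lists
print(first)   # ['what', 's', 'did', ...]
print(second)  # ['his', 'dad', 'was', ...]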

This should also work; join needs strings, so convert each word list with str() first:
print(','.join(str(re.findall(r'\w+', b)) for b in words))

I have an example which I guess is not much different from the problem you have, where I only take a certain part of the list:
>>> a = [['sa', 'bbb', 'ccc'], ['dad', 'des', 'kkk']]
>>>
>>> print(a[0], a[1])
['sa', 'bbb', 'ccc'] ['dad', 'des', 'kkk']
>>>

Related

Stemmer function that takes a string and returns the stems of each word in a list

I am trying to create a function which takes a string as input and returns a list containing the stem of each word in the string. The problem is that, because of the nested for loop, the words in the string are appended multiple times to the list. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on the text 'I have a dog that is barking', this is the output:
['I', 'I', 'I', 'have', 'have', 'have', 'a', 'a', 'a', 'dog', 'dog', 'dog', 'that', 'that', 'that', 'is', 'is', 'is', 'barking', 'barking', 'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the fixed syntax for looping over a list, too.
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
Demo: https://ideone.com/a7FqBp
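A quick sanity check of this version on the question's sentence (my own illustration; run it locally to confirm):
print(stemmer('I have a dog that is barking'))
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark']
print(stemmer('use sparingly'))
# ['use', 'spar']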

How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings; each string is about 10 sentences. I am hoping to find all words in each string that begin with a capital letter, preferably excluding the first word of each sentence. I am using re.findall to do this. When I manually set the string variable I have no trouble doing this; however, when I try to use a for loop over each entry in my list I get a different output.
for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
    print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
    print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way to do that is to write a for loop which checks whether the first letter of an element of the list is capitalized. If it is, it is appended to the output list.
output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)
We can also use a list comprehension and do that in one line, again checking whether the first letter of an element is capitalized.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
You want to treat each sentence as an element of a list, so here is the solution. We iterate over list_3, then iterate over every word using the split() function, and then check whether the word is capitalized. If it is, it is added to the output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Python's regex engine performs the following operations:
(?m)        : set multiline mode so that ^ and $ match at the beginning and end of a line
(?<!^)      : negative lookbehind asserts the current location is not at the beginning of a line
(?<![.?!] ) : negative lookbehind asserts the current location is not preceded by '.', '?' or '!' followed by a space
[A-Z]       : match an uppercase letter
[A-Za-z]*   : match zero or more additional letters
If sentences can be separated by one or two spaces, also insert the negative lookbehind with two spaces, (?<![.?!]  ), after (?<![.?!] ).
If the PyPI regex module were used, one could instead use the variable-length lookbehind (?<![.?!] +).
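As a small illustration of how the pattern behaves (my own example text, not from the original answer):
import re

pattern = r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
text = "Remember the Common App? Ask Monica Lannom. She founded Campus Ventures."
print(re.findall(pattern, text))
# ['Common', 'App', 'Monica', 'Lannom', 'Campus', 'Ventures']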
As I understand it, you have a list like this:
list_3 = [
'First sentence. Another Sentence',
'And yet one another. Sentence',
]
You are iterating over the list, but every iteration overwrites the test variable, so you get an incorrect result. You either have to accumulate the results in an additional variable or print them right away on every iteration:
acc = []
for item in list_3:
    acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
    print(re.findall(regexp, item))
As for a regexp that ignores the first word of the sentence, you can use:
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not at the beginning of the string
(?<!\.) - not directly after a dot, i.e. not the first word of a sentence
\s+ - one or more whitespace characters before the word
You'll receive words potentially prefixed by whitespace, so here's the final example:
acc = []
for item in list_3:
    words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
    acc.extend(words)
print(acc)
As I really like regexes, try this one:
#!/bin/python3
import re

PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')

all_sentences = [
    "My House! is small",
    "Does Annie like Cats???"
]

def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)

upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

Is there a better way to tokenize some strings?

I was trying to write some code to tokenize strings in Python for some NLP and came up with this code:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension. For example, if I want the word 'Batman', I would have to type s[0][0][2] instead of s[0][2], so I changed the code to:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling that this could work with a single loop, because the dataset that I will be importing is pretty large and a complexity of n would be a lot better than n^2. So, is there a better way to do this / a way to do this with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s=[]
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index a, like this:
L = []
If you append the result of the split to that, like L.append([1]), you end up with a list inside this list: [[1]]
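To see the nesting difference concretely, here is a small illustrative snippet (not from the original answer):
L = []
L.append([1])   # appends the list as a single element -> [[1]]
M = []
M.extend([1])   # adds the list's elements individually -> [1]
print(L, M)     # [[1]] [1]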
You should use split() on every string in the loop.
Example with list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
['I', 'loved', 'the', 'tea.'],
['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
See this:
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]
# split() by default splits on whitespace and gives the output as a list
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

Nested List Iteration

I was attempting some preprocessing on a nested list before attempting a small word2vec model and encountered an issue, as follows:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
So the output above is a nested list, and I intended to remove the stopwords, e.g. 'is', 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sub-list after removing 'is' from each sub-list instead of iterating through it entirely.
What is the reason for this? Indexing? If so, how can I resolve it, assuming I'd like to retain the nested structure?
All your code is correct except for one minor change: use [:] to iterate over a copy of the list's contents rather than mutating the list you are iterating over. Specifically, you create a copy of a list as lst_copy = lst[:] (one of several ways to copy a list). When you iterate through the original list and modify it by removing items, the internal index skips over elements, which creates the problem you observe.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]:  # <--- iterate over a copy of the sub-list created with [:]
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
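For intuition about why items get skipped when you remove during iteration, here is a minimal illustration (my own example, not from the original answer):
lst = ['is', 'a', 'x']
for item in lst:
    if item in ('is', 'a'):
        lst.remove(item)
print(lst)  # ['a', 'x'] -- 'a' survives because the iterator stepped past it after 'is' was removed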
Maybe you can define a custom method to reject elements matching a certain condition, similar to itertools (for example, itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the method in place, you can use it this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
nested = [input()]
nested = [i.split() for i in nested]

finding gappy sublists within a larger list

Let's say I have a list like this:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
Now I have a list like this:
['she', 'is', 'student']
I want to query the larger list with this one, and return all the lists that contain the words within the query list in the same order. There might be gaps, but the order should be the same. How can I do that? I tried using the in operator but I don't get the desired output.
If all that you care about is that the words appear in order somewhere in the array, you can use a collections.deque and popleft to iterate through the list; if the deque is emptied, you have found a valid match:
from collections import deque

def find_gappy(arr, m):
    dq = deque(m)
    for word in arr:
        if word == dq[0]:
            dq.popleft()
            if not dq:
                return True
    return False
By comparing each word in arr with the first element of dq, we know that when we find a match it has been found in the correct order; we then popleft, so we are now comparing against the next element in the deque.
To filter your initial list, you can use a simple list comprehension that filters based on the result of find_gappy:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
You can compare two lists with a function like this one. The way it works is that it loops through your shorter list, and every time it finds the next word in the long list, it cuts off the first part of the longer list at that point. If it can't find the word it returns False.
def is_sub_sequence(long_list, short_list):
    for word in short_list:
        if word in long_list:
            i = long_list.index(word)
            long_list = long_list[i+1:]
        else:
            return False
    return True
Now that you have a function to tell you whether a list is of the desired type, you can filter all the lists you need from the 'list of lists' using a list comprehension like the following:
a = [['she', 'is', 'a', 'student'],
     ['she', 'is', 'a', 'lawer'],
     ['she', 'is', 'a', 'great', 'student'],
     ['i', 'am', 'a', 'teacher'],
     ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
b = ['she', 'is', 'student']
filtered = [x for x in a if is_sub_sequence(x,b)]
The list filtered will include only the lists of the desired type.
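As a quick sanity check (my own note; this matches the result of the deque approach above), printing filtered should give:
print(filtered)
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]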
