Convert a list of string sentences to words - python

I'm trying to essentially take a list of strings containing sentences such as:
sentence = ['Here is an example of what I am working with', 'But I need to change the format', 'to something more useable']
and convert it into the following:
word_list = ['Here', 'is', 'an', 'example', 'of', 'what', 'I', 'am',
'working', 'with', 'But', 'I', 'need', 'to', 'change', 'the', 'format',
'to', 'something', 'more', 'useable']
I tried using this:
for item in sentence:
    for word in item:
        word_list.append(word)
I thought it would take each string and append each item of that string to word_list, however the output is something along the lines of:
word_list = ['H', 'e', 'r', 'e', ' ', 'i', 's' .....etc]
I know I am making a stupid mistake but I can't figure out why, can anyone help?

You need str.split() to split each string into words:
word_list = [word for line in sentence for word in line.split()]

Just .split and .join:
word_list = ' '.join(sentence).split(' ')

You haven't told it how to distinguish a word. By default, iterating through a string simply iterates through the characters.
You can use .split(' ') to split a string by spaces. So this would work:
for item in sentence:
    for word in item.split(' '):
        word_list.append(word)
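As a small illustration of the point above (the sample string is made up for this sketch), here is what iterating over a string yields, and how .split(' ') differs from .split() with no argument when the text contains consecutive spaces:

```python
# Hypothetical sample with two spaces after "to".
text = "to  something more"

# Iterating over a string walks through its characters.
chars = [c for c in "to"]

# split(' ') cuts at every single space, so a double space leaves an empty string.
by_space = text.split(' ')

# split() with no argument cuts on runs of whitespace and drops empty pieces.
by_whitespace = text.split()

print(chars)          # ['t', 'o']
print(by_space)       # ['to', '', 'something', 'more']
print(by_whitespace)  # ['to', 'something', 'more']
```

If the input may contain tabs, newlines or repeated spaces, the argument-free form is usually the safer choice.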

for item in sentence:
    for word in item.split():
        word_list.append(word)

Split the sentences into words (sentence is a list, so join it into one string first):
print(' '.join(sentence).split())

Related

Stemmer function that takes a string and returns the stems of each word in a list

I am trying to create this function which takes a string as input and returns a list containing the stem of each word in the string. The problem is, that using a nested for loop, the words in the string are appended multiple times in the list. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on this text ('I have a dog that is barking') this is the output:
['I',
'I',
'I',
'have',
'have',
'have',
'a',
'a',
'a',
'dog',
'dog',
'dog',
'that',
'that',
'that',
'is',
'is',
'is',
'barking',
'barking',
'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the fixed syntax for looping over a list, too.
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
Demo: https://ideone.com/a7FqBp
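If stacked suffixes should not be stripped, one possible variant removes at most one suffix per word. This is only a sketch: the name stem_once, the suffix order, and the rule that a word is never reduced to an empty string are assumptions, not something the question specifies.

```python
def stem_once(text, suffixes=('ing', 'ed', 'ly')):
    """Strip at most one suffix per word (the first one in `suffixes` that matches)."""
    stems = []
    for word in text.split():
        for suffix in suffixes:
            # Assumption: keep words that consist only of the suffix, e.g. "ed".
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[:-len(suffix)]
                break  # stop after the first matching suffix
        # Append exactly once per word, outside the suffix loop.
        stems.append(word)
    return stems


stems = stem_once("I have a dog that is barking sparingly")
print(stems)
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark', 'sparing']
```

The append sits outside the inner loop, which is what prevents the duplicated entries seen in the question.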

MapReduce WordCount with prop nouns

I'm trying to make a MapReduce WordCount that takes a large article and counts proper nouns.
Here are the requirements:
Starts with a capital letter and has never been found in the text with a small letter
Has length between 2 and 7 letters
Sort in descending order
Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks? What's the right way to construct mapper and reducer?
import sys
import re
for line in sys.stdin:
    article_id, text = line.strip().split('\t', 1)
    text = re.sub('\W', ' ', text).split(' ')
    for word in text:
        if len(word) >= 2 and len(word) < 7:
            key = "".join(sorted(word.lower()))
            print("{}\t{}\t{}".format(key, word.lower(), 1))
If you are only looking for words, since you already imported re, you can use re.compile (one solution):
re.compile(r'\w+').findall(text)
This way you remove all the punctuation in the string, keeping only letters, digits and underscores.
If you take the string below :
text = "Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks"
you quickly obtain :
liste = ['Looks', 'like', 'a', 'typical', 'WordCount', 'mapreduce', 'but', 'I', 'couldn', 't', 'do', 'this', 'How', 'to', 'get', 'rid', 'of', 'all', 'the', 'punctuation', 'marks']
On which you can run your for loop in the same way.
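To make the three requirements from the question concrete, here is a single-process sketch of the counting logic. It is not an actual MapReduce job; in a streaming job the same filtering would be divided between a mapper and a reducer. The function name proper_noun_counts, the inclusive reading of "between 2 and 7 letters", and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def proper_noun_counts(text, min_len=2, max_len=7):
    # Letters only, so punctuation and digits never end up inside a word.
    words = re.findall(r'[A-Za-z]+', text)
    # Lowercased forms of every word that appears starting with a small letter.
    seen_lowercase = {w.lower() for w in words if w[0].islower()}
    counts = Counter(
        w for w in words
        if w[0].isupper()
        and min_len <= len(w) <= max_len       # assumed inclusive bounds
        and w.lower() not in seen_lowercase    # never seen with a small letter
    )
    # Descending by count; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = "The dog saw Yale. Yale is old. the Dog ran to Boston."
result = proper_noun_counts(sample)
print(result)
# [('Yale', 2), ('Boston', 1)]
```

"The" and "Dog" are dropped because "the" and "dog" also occur in lowercase in the sample.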

How can I split a txt file into a list by word but including commas on the elements

I have a big txt file and I want to split it into a list where every word is an element of the list. I want the commas to be included in the elements, like the example.
txt file
Hi, my name is Mick and I want to split this with commas included, like this.
list ['Hi,','my','name','is','Mick' etc. ]
Thank you very much for the help
Just use str.split() without any argument; it splits on runs of whitespace.
value = 'Hi, my name is Mick and I want to split this with commas included, like this.'
res = value.split()
print(res) # ['Hi,', 'my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'included,', 'like', 'this.']
If you instead want to drop the elements that contain a comma, filter them out:
res = [r for r in value.split() if ',' not in r]
print(res) # ['my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'like', 'this.']
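Since the question is about a file rather than a string, the same split() call can be applied line by line. This is a minimal sketch: the helper name read_words is hypothetical, and a temporary file stands in for the real txt file.

```python
import os
import tempfile


def read_words(path):
    """Return every whitespace-separated token in the file, punctuation included."""
    words = []
    with open(path, encoding='utf-8') as f:
        for line in f:  # one line at a time, which suits a big file
            words.extend(line.split())
    return words


# Demo only: write a small temporary file in place of the real one.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Hi, my name is Mick\n"
              "and I want to split this with commas included, like this.\n")
    path = tmp.name

words = read_words(path)
os.remove(path)
print(words[:5])  # ['Hi,', 'my', 'name', 'is', 'Mick']
```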

How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings, each string is about 10 sentences. I am hoping to find all words from each string that begin with a capital letter, preferably after the first word in the sentence. I am using re.findall to do this. When I manually set the string = '' I have no trouble doing this, however when I try to use a for loop to loop over each entry in my list I get a different output.
for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way to do that is to write a for loop which checks whether the first letter of an element of the list is capitalized. If it is, the element is appended to the output list.
output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)
We can also use a list comprehension and do that in one line. Again, we check whether the first letter of an element is a capital letter.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
You want to place the sentence as an element of a list, so here is the solution. We iterate over list_3, then iterate over every word by using the split() function. We then check whether the word is capitalized. If it is, it is added to the output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
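One thing to be aware of with the split() approach is that punctuation stays attached to the words ('FAFSA,' rather than 'FAFSA'). A possible refinement, sketched here with a hypothetical helper name and a shortened sample, strips punctuation from both ends of each token before testing it:

```python
import string


def capitalized_words(texts):
    """Collect words starting with an uppercase letter, without edge punctuation."""
    found = []
    for text in texts:
        for token in text.split():
            # Remove punctuation at the start and end only; inner characters stay.
            word = token.strip(string.punctuation)
            if word and word[0].isupper():
                found.append(word)
    return found


sample = ["Remember your college application process? "
          "The tedious Common App applications, ACT/SAT, FAFSA, etc."]
caps = capitalized_words(sample)
print(caps)
# ['Remember', 'The', 'Common', 'App', 'ACT/SAT', 'FAFSA']
```

The `if word` guard also avoids an IndexError for tokens that consist of punctuation only.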
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Python's regex engine performs the following operations.
(?m)         : set multiline mode so that ^ and $ match the beginning
               and the end of a line
(?<!^)       : negative lookbehind asserts current location is not
               at the beginning of a line
(?<![.?!] )  : negative lookbehind asserts current location is not
               preceded by '.', '?' or '!', followed by a space
[A-Z]        : match an uppercase letter
[A-Za-z]*    : match 0+ letters
If sentences can be separated by one or two spaces, insert the negative lookbehind (?<![.?!]  ) (with two spaces before the closing parenthesis) after (?<![.?!] ).
If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +)
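As a quick check of the single-space pattern with the standard re module, here is a sketch that applies it to each string in a list and keeps one result list per string. The two sample strings are invented for the demonstration.

```python
import re

# The pattern from above: a capitalized word that is neither at the start of a
# line nor directly after '.', '?' or '!' plus one space.
PATTERN = re.compile(r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*')

# Hypothetical sample data.
list_3 = [
    "Remember the Common App? Do you remember Yale. It was Monica.",
    "She met Lannom in Chicago! Then Boston.",
]

# One list of matches per input string.
per_item = [PATTERN.findall(item) for item in list_3]
print(per_item)
# [['Common', 'App', 'Yale', 'Monica'], ['Lannom', 'Chicago', 'Boston']]
```

Building the results with a comprehension keeps every string's matches instead of overwriting a single variable on each pass of the loop.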
As I understand it, you have a list like this:
list_3 = [
    'First sentence. Another Sentence',
    'And yet one another. Sentence',
]
You are iterating over the list, but every iteration overwrites the test variable, so you only see the result for the last item. You either have to accumulate the results in an additional variable or print them right away on every iteration:
acc = []
for item in list_3:
    acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
    print(re.findall(regexp, item))
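As for a regexp that ignores the first word in each sentence, you can use
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not at the beginning of the string
(?<!\.) - not directly after a dot, so the first word after a dot is skipped
\s+ - one or more whitespace characters before the word
[A-Z]\w+ - an uppercase letter followed by one or more word characters (so a single-letter word such as 'I' is not matched)
The matched words are prefixed by whitespace, so here's the final example: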
acc = []
for item in list_3:
    words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
    acc.extend(words)
print(acc)
As I really like regexes, try this one:
#!/bin/python3
import re
PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')
all_sentences = [
    "My House! is small",
    "Does Annie like Cats???"
]
def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)
upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

Python's alphabetical sort file error

I'm new to Python and here is my question:
Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
This is the file:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Desired Output:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
This is my code:
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
#loop through the text to get the lines
for line in fh:
    line = line.rstrip()
    #loop through the line to get the words
    for word in line:
        words = line.split()
        #if a word is not in the empty list, append it
        if not word in lst: lst.append(word)
lst.sort()
print lst
My output:
[' ', 'A', 'B', 'I', 'J', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y']
If you could tell me what is wrong (to get only the first letters of the words with a space in the beginning instead of the whole words), it would be great..
Note: I want the code using these instructions, not other advanced instructions (to keep my learning sequence)
Thank you
You should be calling split() on the line, and then looping over the resulting words rather than over the characters of the line.
for line in fh:
    result = line.split()  # this returns a list of words
    for word in result:
        # add the word only if it is not already in your list of words
        if word not in lst:
            lst.append(word)
lst.sort()
Let's take the code line-by-line
for line in fh:
    line = line.rstrip()
So now our first line contains "But soft what light through yonder window breaks", and it's a string.
    for word in line:
Ah, but now we've said, "Let's iterate over line (a string) and let word be each part of it when we go through the loop." But line is a complete string! When you iterate over a string like this you get one letter at a time, which is what you're seeing in your results.
Instead, drop that for loop and just split the line as you were doing before:
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
Try this:
with open('test.txt') as f:
    words = []
    for line in f:
        if line:
            words.extend(line.split())
print(sorted(set(words)))
Output:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
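The output above lists every capitalized word before every lowercase one because Python's default string sort compares code points, and uppercase letters come first. As a side note, here is a small sketch of how a set removes duplicates and how a key function changes the order; the two lines are taken from the file in the question, while the variable names are made up.

```python
lines = [
    "But soft what light through yonder window breaks",
    "It is the east and Juliet is the sun",
]

# A set keeps each word once, so no "already in the list" check is needed.
unique = set()
for line in lines:
    unique.update(line.split())

# Default sort: uppercase letters sort before lowercase ones.
default_order = sorted(unique)

# Case-insensitive sort: compare the lowercased form of each word.
caseless_order = sorted(unique, key=str.lower)

print(default_order[:4])   # ['But', 'It', 'Juliet', 'and']
print(caseless_order[:4])  # ['and', 'breaks', 'But', 'east']
```

The default order is the one the exercise expects; the key=str.lower variant is shown only for comparison.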
The one small thing you are missing is this line:
    for word in line.split():  # split() gives you a list of words separated by whitespace
By doing this you turn line into a list of words.
Right now it is a simple string, i.e. a sequence of characters. Try printing word in your example, and then again after using line.split(); you will get a better idea.
