MapReduce WordCount with proper nouns - python

I'm trying to write a MapReduce WordCount that takes a large article and counts proper nouns.
Here are the requirements:
Starts with a capital letter and has never been found in the text with a small letter
Has length between 2 and 7 letters
Sort in descending order
Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks? What's the right way to construct mapper and reducer?
import sys
import re
for line in sys.stdin:
    article_id, text = line.strip().split('\t', 1)
    text = re.sub(r'\W', ' ', text).split(' ')
    for word in text:
        if len(word) >= 2 and len(word) < 7:
            key = "".join(sorted(word.lower()))
            print("{}\t{}\t{}".format(key, word.lower(), 1))

If you are only looking for words, since you already imported re, you can use re.compile (one possible solution):
re.compile(r'\w+').findall(text)
This drops all the punctuation in the string, keeping only runs of letters, digits, and underscores.
If you take the string below:
text = "Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks"
you quickly obtain:
liste = ['Looks', 'like', 'a', 'typical', 'WordCount', 'mapreduce', 'but', 'I', 'couldn', 't', 'do', 'this', 'How', 'to', 'get', 'rid', 'of', 'all', 'the', 'punctuation', 'marks']
You can then run your for loop over this list in the same way.
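As a rough sketch of how the mapper and reducer could fit together, the code below simulates both stages in memory rather than through Hadoop streaming. The names WORD_RE, mapper, reducer, and sample are hypothetical, the input format "article_id<TAB>text" is taken from the question, and treating any word whose first letter is uppercase as "capitalized" is an assumption about what the requirements mean.

```python
import re
from collections import defaultdict

# Letters only, so punctuation and digits never reach the counts.
WORD_RE = re.compile(r"[A-Za-z]+")


def mapper(lines):
    """Yield (lowercased word, starts_with_capital) for each 2-7 letter word."""
    for line in lines:
        # Assumed input format: "article_id<TAB>text", as in the question.
        _, _, text = line.rstrip("\n").partition("\t")
        for word in WORD_RE.findall(text):
            if 2 <= len(word) <= 7:
                yield word.lower(), word[0].isupper()


def reducer(pairs):
    """Keep words never seen in lowercase and sort them by count, descending."""
    capitalized = defaultdict(int)
    seen_lowercase = set()
    for key, starts_with_capital in pairs:
        if starts_with_capital:
            capitalized[key] += 1
        else:
            seen_lowercase.add(key)
    proper = [
        (key.capitalize(), count)
        for key, count in capitalized.items()
        if key not in seen_lowercase
    ]
    # Sort by count descending, then alphabetically to break ties.
    return sorted(proper, key=lambda item: (-item[1], item[0]))


sample = ["1\tParis is nice. The city of Paris. Nice is in France, the south."]
result = reducer(mapper(sample))
print(result)
```

In a real streaming job the mapper would print tab-separated lines and the reducer would read sorted lines from sys.stdin; the filtering logic would stay the same.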

Related

Stemmer function that takes a string and returns the stems of each word in a list

I am trying to create this function which takes a string as input and returns a list containing the stem of each word in the string. The problem is, that using a nested for loop, the words in the string are appended multiple times in the list. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on this text ('I have a dog that is barking') this is the output:
['I',
'I',
'I',
'have',
'have',
'have',
'a',
'a',
'a',
'dog',
'dog',
'dog',
'that',
'that',
'that',
'is',
'is',
'is',
'barking',
'barking',
'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the fixed syntax for looping over a list, too.
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
Demo: https://ideone.com/a7FqBp
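If the intent was instead to strip at most one suffix and keep the question's fallback of truncating long words to 8 characters, a for/else loop is one way to append exactly once per word. This is only a sketch of that alternative reading; the max_len parameter and the variable names are assumptions, not part of the original code.

```python
def stemmer(text, suffixes=("ed", "ly", "ing"), max_len=8):
    """Return one stem per word: strip the first matching suffix, else truncate."""
    stems = []
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                stems.append(word[:-len(suffix)])
                break  # stop after the first suffix that matches
        else:
            # No suffix matched: keep the word, cut to max_len characters.
            stems.append(word[:max_len])
    return stems


stems = stemmer("I have a dog that is barking")
print(stems)
```

The else branch of the inner loop runs only when no break happened, so every word contributes a single entry.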

Print the dictionary value and key(index) in python

I am trying to write code that reads a line from the user, splits it, and feeds it to a majestic dictionary named counts. All is well until we ask her majesty for some data. I want the output in a format where the word is printed first and the number of times it occurs is printed next to it. Below is the code I managed to write.
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word,0) + 1
for wording in counts:
    print('trying',counts[wording], '' )
When it executes its output is unforgivable.
Words: ['You', 'will', 'always', 'only', 'get', 'an', 'indent', 'error', 'if', 'there', 'is', 'actually', 'an', 'indent', 'error.', 'Double', 'check', 'that', 'your', 'final', 'line', 'is', 'indented', 'the', 'same', 'was', 'as', 'the', 'other', 'lines', '--', 'either', 'with', 'spaces', 'or', 'with', 'tabs.', 'Most', 'likely,', 'some', 'of', 'the', 'lines', 'had', 'spaces', '(or', 'tabs)', 'and', 'the', 'other', 'line', 'had', 'tabs', '(or', 'spaces).']
Counting:
trying 1
trying 1
trying 1
trying 1
trying 1
trying 2
trying 2
trying 1
trying 1
trying 1
trying 2
trying 1
trying 1
trying 1
trying 1
trying 1
trying 1
trying 1
trying 2
It just prints trying and the number of times the word is repeated, without the word itself (I think it is called the index in a dictionary; correct me if I am wrong).
Thank you.
Please help me, and when replying to this question please keep in mind that I am a newbie, both to Python and to Stack Overflow.
Nowhere in your code do you attempt to print the word. How did you expect it to appear in the output? If you want the word, put it in the list of things to print:
print(wording, counts[wording])
For more education, look up the collections module and use its Counter class.
from collections import Counter
counts = Counter(words)
will do all of your word counts for you.
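A minimal sketch of the Counter approach, using a made-up input line instead of input() so that it runs unattended:

```python
from collections import Counter

# Hypothetical sample line standing in for the user's input.
line = "the cat and the hat and the bat"
counts = Counter(line.split())

# most_common() yields (word, count) pairs, highest count first.
for word, count in counts.most_common():
    print(word, count)
```

Each printed row shows the word followed by its count, which is the format the question asks for.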
I'm confused as to why you print trying.
Try this instead.
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word,0) + 1
for wording in counts:
    print(wording,counts[wording], '' )
You should use counts.items() to iterate over the key and value of counts as follows:
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word,0) + 1
for word, count in counts.items(): # notice this!
    print(f'trying {word} {count}')
Also notice that you can use an f-string when printing.
The code you have iterates over the dictionary keys and prints only the count in the dictionary. You would want to do something like this:
for word, count in counts.items():
    print('trying', word, count)
You might also want to use
from collections import defaultdict
counts = defaultdict(lambda: 0)
so that adding to the dictionary becomes as simple as
counts[word] += 1

How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings; each string is about 10 sentences long. I am hoping to find all words in each string that begin with a capital letter, preferably excluding the first word of each sentence. I am using re.findall to do this. When I manually pick the string I have no trouble doing this, but when I use a for loop to go over each entry in my list I get a different output.
for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way to do that is to write a for loop that checks whether the first letter of each element of the list is capitalized. If it is, the element is appended to the output list.
output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)
We can also use a list comprehension and do that in one line. It performs the same check on the first letter of each element.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
Since each element of the list is a whole block of sentences, here is a solution for that case. We iterate over list_3, then over every word by using the split() function, and check whether the word is capitalized. If it is, it is added to the output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Python's regex engine performs the following operations.
(?m)         : set multiline mode so that ^ and $ match the beginning
               and the end of a line
(?<!^)       : negative lookbehind asserts current location is not
               at the beginning of a line
(?<![.?!] )  : negative lookbehind asserts current location is not
               preceded by '.', '?' or '!', followed by a space
[A-Z]        : match an uppercase letter
[A-Za-z]*    : match 0+ letters
If sentences can be separated by one or two spaces, insert the negative lookbehind (?<![.?!]  ) (with two spaces) after (?<![.?!] ).
If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +)
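To see how that pattern behaves across several list entries, here is a small sketch that collects one result list per string. The sample strings and the names PATTERN and per_item are made up for illustration, and sentences are assumed to be separated by exactly one space.

```python
import re

# The regular expression from the answer above, compiled once.
PATTERN = re.compile(r"(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*")

# Hypothetical stand-in for the question's list_3.
list_3 = [
    "Remember the Common App? Do you know Yale?",
    "One founder, Monica Lannom, started Campus Ventures.",
]

# Build one list of matches per entry instead of overwriting a single variable.
per_item = [PATTERN.findall(text) for text in list_3]
for words in per_item:
    print(words)
```

Words that open a line or follow sentence-ending punctuation ("Remember", "Do", "One") are skipped, while capitalized words inside a sentence are kept.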
As I understand it, you have a list like this:
list_3 = [
    'First sentence. Another Sentence',
    'And yet one another. Sentence',
]
You are iterating over the list, but every iteration overwrites the test variable, so only the last result survives. You either have to accumulate the results in an additional variable or print them right away on every iteration:
acc = []
for item in list_3:
    acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
    print(re.findall(regexp, item))
As for a regexp that ignores the first word in the sentence, you can use
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not at the beginning of the string
(?<!\.) - not directly after a dot, so the first word after a dot is skipped
\s+ - one or more whitespace characters before the word.
The matches come back prefixed with whitespace, so here's the final example:
acc = []
for item in list_3:
    words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
    acc.extend(words)
print(acc)
As I really like regexes, try this one:
#!/bin/python3
import re
PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')
all_sentences = [
    "My House! is small",
    "Does Annie like Cats???"
]
def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)
upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

Searching for a specific list of words in a text using a regular expression gives an incorrect result

import re
complex_sen_count = 0
sen = '5th grade. Very easy to read. Easily understood by an average 11-year-old student.'
search_list = [',', 'after', 'although', 'as', 'because',
               'before', 'even though', 'if', 'since', 'though',
               'unless', 'until', 'when', 'whenever', 'whereas',
               'wherever','while']
s = sen.split('. ')
for n in s:
    print(n)
    if re.compile('|'.join(search_list),re.IGNORECASE).search(n):
        complex_sen_count+=1
print("value: ",complex_sen_count)
The value should be 0 because none of the search_list words appear in the string, but complex_sen_count is still incremented.
Output is:
5th grade
Very easy to read
Easily understood by an average 11-year-old student.
value: 2
Expected output: 0
Please help.
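There are exactly 2 matches, both on the substring "as": once inside "easy" and once inside "Easily" (the search ignores case):
'5th grade. Very easy to read. Easily understood by an average 11-year-old student.'
To search for a whole word, require whitespace before and after it, e.g. \sas\s (\s matches one whitespace character).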
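One alternative to surrounding whitespace is the word-boundary anchor \b, which also matches a word at the start or end of a sentence. The sketch below swaps in that technique; the names words, pattern, and count_complex are hypothetical, and the comma is handled as its own alternative because \b does not apply to punctuation.

```python
import re

search_list = [',', 'after', 'although', 'as', 'because',
               'before', 'even though', 'if', 'since', 'though',
               'unless', 'until', 'when', 'whenever', 'whereas',
               'wherever', 'while']

# Every entry except the comma is a word or phrase that needs boundaries.
words = [w for w in search_list if w != ',']
pattern = re.compile(
    r",|\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
    re.IGNORECASE,
)


def count_complex(text):
    """Count the '. '-separated sentences containing a comma or a listed word."""
    return sum(1 for sentence in text.split('. ') if pattern.search(sentence))


sen = '5th grade. Very easy to read. Easily understood by an average 11-year-old student.'
print("value:", count_complex(sen))
```

Compiling the pattern once outside the loop also avoids rebuilding it for every sentence.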

Convert a list of string sentences to words

I'm essentially trying to take a list of strings containing sentences such as:
sentence = ['Here is an example of what I am working with', 'But I need to change the format', 'to something more useable']
and convert it into the following:
word_list = ['Here', 'is', 'an', 'example', 'of', 'what', 'I', 'am',
             'working', 'with', 'But', 'I', 'need', 'to', 'change', 'the', 'format',
             'to', 'something', 'more', 'useable']
I tried using this:
for item in sentence:
    for word in item:
        word_list.append(word)
I thought it would take each string and append each item of that string to word_list, however the output is something along the lines of:
word_list = ['H', 'e', 'r', 'e', ' ', 'i', 's' .....etc]
I know I am making a stupid mistake but I can't figure out why, can anyone help?
You need str.split() to split each string into words:
word_list = [word for line in sentence for word in line.split()]
Just .split and .join:
word_list = ' '.join(sentence).split(' ')
You haven't told it how to distinguish a word. By default, iterating through a string simply iterates through the characters.
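You can use .split(' ') to split a string by spaces. So this would work:
word_list = []
for item in sentence:
    for word in item.split(' '):
        word_list.append(word)
Alternatively, call split() with no argument to split on any run of whitespace:
word_list = []
for item in sentence:
    for word in item.split():
        word_list.append(word)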
Join the sentences and split the result into words (sentence is a list, so it has no rsplit method of its own):
print(' '.join(sentence).split())
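The answers above mix split() and split(' '), which differ when the text contains consecutive spaces. A short sketch of the difference, using a made-up list with a double space in it (the variable names are illustrative only):

```python
# Hypothetical input; note the two spaces between "is" and "an".
sentence = ['Here is  an example', 'to something more useable']

# split() with no argument collapses runs of whitespace.
flat_default = [word for line in sentence for word in line.split()]

# split(' ') splits on every single space and can yield empty strings.
flat_space = [word for line in sentence for word in line.split(' ')]

print(flat_default)
print(flat_space)
```

For text typed by people, split() without an argument is usually the safer choice because it never produces empty entries.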
