I am looking to change my MapReduce code that finds words in a text with the same vowels. For example:
‘hEllo’ and ‘pOle’ both contain exactly one e and exactly one o. The order of the vowels and the case of the original input word do not matter.
Imagine the following example:
hEllo moose
pOle cccttt.ggg
We would end up with the following output:
:1
eo:2
eoo:1
The map code that I have so far is:
import sys
import re

line = sys.stdin.readline()
pattern = re.compile("[a,e,i,o,u]+")
while line:
    for char in pattern.findall(line):
        print(char+"\t"+"1")
    line = sys.stdin.readline()
and the reducer code:
import sys

current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
When I run this MapReduce code in Hadoop I get the following output:
a 1
e 4
o 1
Standard version
I'm not sure whether there's a way to do this with a regex (there probably is), but here's a non-regex approach that gets the desired result. It is a rough draft that works for the case outlined in the question and could likely be improved. Let me know if I need to clarify anything.
from collections import defaultdict

def vowel_count_map(sentence: str):
    """
    Return a map of vowel sequence to frequency per word in sentence.
    """
    lowercased = sentence.lower()
    count = defaultdict(int)
    for word in lowercased.split(' '):
        vowel_seq = _count_vowels(word)
        count[vowel_seq] += 1
    return count

def _count_vowels(word, vowels='aeiou') -> str:
    """Return an alphabetized sequence of vowels found in word."""
    count = defaultdict(int)  # init counter
    for char in word:
        # Technically this `if` condition is not needed, and can be
        # omitted (not sure how it would affect performance?)
        if char in vowels:
            count[char] += 1
    # For example, this uses the same logic as below:
    # 'a' * 3 = 'aaa'
    return ''.join([vowel * count[vowel] for vowel in vowels])
Shorter version
Here's a one-liner version of the same function that I think is pretty cool, at the cost of being a little harder to understand:
from collections import Counter
vowel_count_map = lambda sentence: Counter([''.join([v * word.count(v) for v in 'aeiou']) for word in sentence.lower().split(' ')])
Usage
I put together some sample strings we can use for test data, and pass as inputs to either version of the vowel_count_map function above. The first approach above returns a defaultdict, the second one returns a Counter. They are essentially both dict-like objects, so you can iterate over their key-value pairs as usual.
a_string = "hEllo moose pOle cccttt.ggg"
b_string = "testuueaaxyzioabceezu actionable conciliatory"
print(vowel_count_map(a_string))
# defaultdict(<class 'int'>, {'eo': 2, 'eoo': 1, '': 1})
print(vowel_count_map(b_string))
# defaultdict(<class 'int'>, {'aaaeeeeiouuu': 1, 'aaeio': 1, 'aiioo': 1})
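To tie this back to Hadoop Streaming, the same per-word signature could serve as the mapper key. The following is only a minimal sketch, not a tested Hadoop job: `vowel_key` and `map_lines` are hypothetical names, words are assumed to be whitespace-separated, and a real mapper would iterate over `sys.stdin` instead of the sample list.

```python
VOWELS = "aeiou"

def vowel_key(word):
    """Return the word's vowels in alphabetical order, e.g. 'moose' -> 'eoo'."""
    lowered = word.lower()
    return "".join(v * lowered.count(v) for v in VOWELS)

def map_lines(lines):
    """Yield one 'key<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield vowel_key(word) + "\t1"

# Sample input from the question; a streaming mapper would pass sys.stdin here.
sample = ["hEllo moose", "pOle cccttt.ggg"]
for record in map_lines(sample):
    print(record)
```

Words without vowels produce an empty key. A reducer that calls `line.strip()` before splitting on the tab would drop that record's leading tab, and a truthiness test such as `if current_word:` would skip the empty key, so stripping only the trailing newline and comparing against `None` may be needed there.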
Related
I want to find the word with the most repeated letters in an input sentence.
I know how to find the most repeated letter in the sentence, but I can't work out how to print the word that contains it.
For example:
this is an elementary test example
should print
elementary
def most_repeating_word(strg):
    words =strg.split()
    for words1 in words:
        dict1 = {}
        max_repeat_count = 0
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result
You are resetting the max_repeat_count variable to 0 for each word. Move that line up in your code, above the first for loop, like this:
def most_repeating_word(strg):
    words =strg.split()
    max_repeat_count = 0
    for words1 in words:
        dict1 = {}
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result
Hope this helps
Use a regex instead. It is simple and easy. Iteration is an expensive operation compared to regular expressions.
Please refer to the solution for your problem in this post:
Count repeated letters in a string
Interesting exercise! +1 for using Counter(). Here's my suggestion also making use of max() and its key argument, and the * unpacking operator.
For a final solution note that this (and the other proposed solutions to the question) don't currently consider case, other possible characters (digits, symbols etc) or whether more than one word will have the maximum letter count, or if a word will have more than one letter with the maximum letter count.
from collections import Counter

def most_repeating_word(strg):
    # Create list of word tuples: (word, max_letter, max_count)
    counters = [ (word, *max(Counter(word).items(), key=lambda item: item[1]))
                 for word in strg.split() ]
    max_word, max_letter, max_count = max(counters, key=lambda item: item[2])
    return max_word
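The caveats mentioned above (case, non-letter characters, ties between words) could be handled along these lines. This is only a sketch under assumptions: `most_repeating_words` is a hypothetical name, case is ignored, only alphabetic characters are counted, and every word that ties for the maximum is returned.

```python
from collections import Counter

def most_repeating_words(sentence):
    """Return (words, count): all words tied for the highest single-letter repeat count."""
    best_count = 0
    best_words = []
    for word in sentence.split():
        # Count letters only, ignoring case, digits and punctuation.
        letters = Counter(ch for ch in word.lower() if ch.isalpha())
        if not letters:
            continue
        top = max(letters.values())
        if top > best_count:
            best_count, best_words = top, [word]
        elif top == best_count:
            best_words.append(word)
    return best_words, best_count

print(most_repeating_words("this is an elementary test example"))
# (['elementary'], 3)
```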
word="SBDDUKRWZHUYLRVLIPVVFYFKMSVLVEQTHRUOFHPOALGXCNLXXGUQHQVXMRGVQTBEYVEGMFD"
def most_repeating_word(strg):
    counts={}  # renamed from dict to avoid shadowing the built-in
    max_repeat_count = 0
    for word in strg:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
        if counts[word]> max_repeat_count:
            max_repeat_count = counts[word]
    result={}
    for word, value in counts.items():
        if value==max_repeat_count:
            result[word]=value
    return result
print(most_repeating_word(word))
I have a program that counts the words in a text file. Now I want to restrict the counter to strings with more than x characters.
from collections import Counter
input = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'
Counter = {}
words = {}
with open(input,'r', encoding='utf-8-sig') as fh:
    for line in fh:
        word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
        for word in word_list:
            if word not in Counter:
                Counter[word] = 1
            else:
                Counter[word] = Counter[word] + 1
N = 20
top_words = Counter(Counter).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))
I tried the re code, but it did not work.
re.sub(r'\b\w{1,3}\b')
I don't know how to implement it.
In the end I would like output that ignores all the short words like and, you, be, etc.
You could do this more simply with:
for word in word_list:
    if len(word) < 5: # check the length of each word is less than 5 for example
        continue # this skips the counter portion and jumps to next word in word_list
    elif word not in Counter:
        Counter[word] = 1
    else:
        Counter[word] = Counter[word] + 1
Few notes.
1) You import a Counter but don't use it properly (you do a Counter = {} thus overwriting the import).
from collections import Counter
2) Instead of chaining several replaces, use a list comprehension with a set; it's faster and makes only one pass over the line (two, counting the join) instead of several:
sentence = ''.join([char for char in line if char not in {'.', ',', "'"}])
word_list = sentence.split()
3) Use the counter and list comp for length:
c = Counter(word for word in word_list if len(word) > 3)
That's it.
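Putting the three notes together, the whole counting step might look like the sketch below. The function name `count_long_words`, the `min_len` parameter and the sample lines are made up for illustration; in the real script the lines would come from the opened file.

```python
from collections import Counter

PUNCTUATION = {'.', ',', "'"}

def count_long_words(lines, min_len=4):
    """Count lowercase words of at least min_len characters across lines."""
    counts = Counter()
    for line in lines:
        # Drop the unwanted punctuation characters in a single pass.
        cleaned = ''.join(ch for ch in line if ch not in PUNCTUATION)
        counts.update(w for w in cleaned.lower().split() if len(w) >= min_len)
    return counts

# Hypothetical sample input standing in for the lines of the text file.
sample = ["You and the climate, the climate and you.", "Climate models' output."]
for word, frequency in count_long_words(sample).most_common(2):
    print("%s %d" % (word, frequency))
```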
Counter already does what you want. You can feed it an iterable and it will count the items.
https://docs.python.org/2/library/collections.html#counter-objects
You can use the filter function too https://docs.python.org/3.7/library/functions.html#filter
It could look like this:
counted = Counter(filter(lambda x: len(x) >= 5, words))
Define a function lineStats() that takes one parameter:
1. paragraph, a string of words and white spaces
The function returns a list containing the number of vowels in each line.
for example,
t="Apple\npear and kiwi"
print(lineStats(t))
[2,5]
This is what I have. I've gotten the output to be 7 but not to be able to make it 2,5. I tried to make a counter for each line but that didn't work, any suggestions?
def lineStats(paragraph):
    vowels = "AEIOUaeiou"
    for line in paragraph:
        for word in line:
            for letter in word:
                if letter in vowels:
                    counter +=1
                else:
                    continue
    return counter
t = "Apple\npear and kiwi"
print(lineStats(t))
Here's an adaptation of your current code:
def lineStats(paragraph):
    vowels = "AEIOUaeiou"
    counter = []
    current_line_count = 0
    newline = "\n"
    for letter in paragraph:
        if letter in vowels:
            current_line_count += 1
        elif letter == newline:
            counter.append(current_line_count)
            current_line_count = 0
    counter.append(current_line_count)
    return counter
t="Apple\npear and kiwi"
def lineStats(p):
    #find vowels in each line and sum the occurrences using map
    # list() keeps the result a list on Python 3, where map returns an iterator
    return list(map(sum, [[1 for e in f if e in "AEIOUaeiou"] for f in p.split('\n')]))
lineStats(t)
Out[601]: [2, 5]
Try this
Create this function
def temp(x):
    return sum(v for k, v in Counter(x).items() if k.lower() in 'aeiuo')
Now
from collections import Counter
print [temp(x) for x in lines.split('\n')]
This changes as little of your code as necessary; other answers offer improvements.
def lineStats(paragraph):
    counter = []
    vowels = "AEIOUaeiou"
    lines = paragraph.split('\n')
    for line in lines:
        count = 0
        for word in line:
            for letter in word:
                if letter in vowels:
                    count +=1
                else:
                    continue
        counter.append(count)
    return counter
t = "Apple\npear and kiwi"
print(lineStats(t)) # [2, 5]
The problem asks for a list of counts, so counter becomes a list that we append to, storing the vowel count for each line. That may be the only major change your code needs to produce the required output.
However, the paragraph contains newlines ('\n'), so we str.split() it into individual lines before entering the for-loop. This yields a separate count for each line instead of the single total you were getting.
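For comparison, the same split-then-count idea can be written more compactly. This is a sketch rather than a replacement for the assignment's required `lineStats`; the name `line_vowel_counts` is made up here.

```python
def line_vowel_counts(paragraph):
    """Return a list with the number of vowels on each line of paragraph."""
    vowels = set("aeiouAEIOU")
    # Each `ch in vowels` test is True (1) or False (0), so sum() counts the vowels.
    return [sum(ch in vowels for ch in line) for line in paragraph.split("\n")]

print(line_vowel_counts("Apple\npear and kiwi"))  # [2, 5]
```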
I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
The above problem can be solved easily using Python's collections module.
The solution is below.
from collections import Counter
data_set = "Welcome to the world of Geeks " \
    "This portal has been created to provide well written well " \
    "thought and well explained solutions for selected questions " \
    "If you like Geeks for Geeks and would like to contribute " \
    "here is your chance You can write article and mail your article " \
    " to contribute at geeksforgeeks org See your article appearing on " \
    "the Geeks for Geeks main page and help thousands of other Geeks. "
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
#print(Counters)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0              # Initialize the count to zero.
    for word in words:     # Iterate over the words.
        if word == unique: # Is this word equal to the current unique?
            count += 1     # If so, increment the count
    counts.append((count, unique))

counts.sort()              # Sorting the list puts the lowest counts first.
counts.reverse()           # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
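Another dictionary-free option, which is a different technique from the trie mentioned above, is to sort the word list so that equal words sit next to each other and then count the length of each run. The sketch below assumes the words are already split and lowercased; `top_words_sorted` is a hypothetical name.

```python
def top_words_sorted(words, n=10):
    """Count words without a dict by sorting and counting runs of equal words."""
    counts = []  # list of (count, word) tuples
    previous = None
    run_length = 0
    for word in sorted(words):
        if word == previous:
            run_length += 1
        else:
            # A new word starts: record the finished run, if there was one.
            if previous is not None:
                counts.append((run_length, previous))
            previous = word
            run_length = 1
    if previous is not None:
        counts.append((run_length, previous))
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda pair: (-pair[0], pair[1]))
    return counts[:n]

sample = "the cat and the hat and the bat".split()
print(top_words_sorted(sample, 3))
# [(3, 'the'), (2, 'and'), (1, 'bat')]
```

Sorting costs O(n log n) comparisons, compared with the nested loops above, which compare every unique word against every word.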
from string import punctuation #you will need it to strip the punctuation
import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower() #the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k,]
        #now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
#and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.
word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1
for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, takes the 10 top items,
    print "%s: %s - %s"%(i+1,word,word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces reach the counter, though it doesn't matter much here.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
import collections
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        # dict.get returns 0 for unseen elements, so no explicit membership test is needed
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
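As a rough illustration of that clean-up step, the sketch below strips surrounding punctuation and casefolds each word before tallying. The helper name `normalize` and the sample text are made up for this example; `counter` repeats the function-style counter described above.

```python
import string

def counter(iterable):
    """Tally how many times each element occurs."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

def normalize(words):
    """Strip surrounding punctuation, casefold, and drop empty results."""
    for word in words:
        cleaned = word.strip(string.punctuation).casefold()
        if cleaned:
            yield cleaned

text = "Someword, someword! Another... word?"
counts = counter(normalize(text.split()))
# Sort by frequency, highest first, and keep at most ten entries.
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
print(top)
```

Note that `str.strip` only removes characters at the ends of a word, so an apostrophe inside a word such as "Alice's" is left in place.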
You can also do it with pandas DataFrames and get the result in a convenient form, as a table of words and their frequencies ordered by count.
import pandas as pn  # pn is the alias this answer uses for pandas

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i+=1
    res = words_df_unique.sort_values('count', ascending = False)
    return(res)
To do the same operation on a column of a pandas DataFrame, you can use Counter from collections:
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
Here is the code for my function:
def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """
    VOWELS = 'aeiou'
    ratios = []
    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))
    return ratios
Right now, I'm working with a list of over 87,000 words and this algorithm is obviously extremely slow.
Is there a better way to do this?
EDIT:
I tested the algorithms @ExP provided with the following class:
import time

class vowelProportions(object):
    """
    A series of methods that all calculate the vowel/word length ratio
    in a list of words.
    """
    WORDLIST_FILENAME = "words_short.txt"

    def __init__(self):
        self.wordList = self.buildWordList()
        print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
        print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
        print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
        print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

    def buildWordList(self):
        inFile = open(self.WORDLIST_FILENAME, 'r', 0)
        wordList = []
        for line in inFile:
            wordList.append(line.strip().lower())
        return wordList

    def cvpOriginal(self, wordList):
        """ My original, slow algorithm"""
        VOWELS = 'aeiou'
        ratios = []
        for word in wordList:
            numVowels = 0
            for char in word:
                if char in VOWELS:
                    numVowels += 1
            ratios.append(numVowels/float(len(word)))
        return ratios

    def cvpGenerator(self, wordList):
        """ Using a generator expression """
        return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

    def cvpCount(self, wordList):
        """ Using str.count() """
        return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

    def cvpTranslate(self, wordList):
        """ Using str.translate() """
        # The deletion string must list every consonant, including v and w.
        return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

    def timeFunc(self, func, *args):
        start = time.clock()
        func(*args)
        return time.clock() - start

    def calcMeanTime(self, numTrials, func, *args):
        times = [self.timeFunc(func, *args) for x in range(numTrials)]
        return sum(times)/len(times)
The output was (for a list of 200 words):
Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548
Surprisingly, Generator and Count were even slower than the original (please let me know if my implementation was incorrect).
I would like to test @John's solution, but don't know anything about trees.
Since you're just concerned with the ratio of vowels to letters in each word, you could first replace all of the vowels with a. Now you can try a couple of things that might be faster:
You're testing for one letter instead of five at each step. That's bound to be faster.
You might be able to sort the whole list and search for the points where you go from vowel (now represented categorically as a) to non-vowel. This is a tree structure. The number of letters in the word is the level of the tree. The number of vowels is the number of left branches.
You should optimize the innermost loop.
I'm pretty sure there are several alternative approaches. Here is what I can come up with right now. I'm not sure how they will compare in speed (with respect to each other and to your solution).
Using a generator expression:
numVowels = sum(x in 'aeiou' for x in word)
Using str.count():
numVowels = sum(word.count(x) for x in 'aeiou')
Using str.translate() (assuming there are no capital letters or special symbols):
numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))
With all of these, you can even write the whole function in a single line without list.append().
I would be curious to know which turns out to be the fastest.
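One way to compare the three approaches is a small timeit harness such as the sketch below. It assumes Python 3 (where the deletion table for `str.translate` is built with `str.maketrans`) and lowercase ASCII words; the function names and the sample word list are made up for illustration, and the timings will vary by machine and Python version.

```python
import timeit

VOWELS = "aeiou"
# Deletion table for str.translate: removes every lowercase consonant.
DELETE_CONSONANTS = str.maketrans("", "", "bcdfghjklmnpqrstvwxyz")

def by_generator(word):
    """Count vowels with a generator expression."""
    return sum(ch in VOWELS for ch in word)

def by_count(word):
    """Count vowels with one str.count() call per vowel."""
    return sum(word.count(v) for v in VOWELS)

def by_translate(word):
    """Count vowels by deleting consonants and measuring what is left."""
    return len(word.translate(DELETE_CONSONANTS))

# Hypothetical sample data standing in for the real word list.
words = ["supercalifragilisticexpialidocious", "rhythm", "queue", "vowel"] * 50

for func in (by_generator, by_count, by_translate):
    seconds = timeit.timeit(lambda: [func(w) for w in words], number=200)
    print(func.__name__, round(seconds, 4))
```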
Use a regular expression to match the list of vowels and count the number of matches.
>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
for word in wordlist:
    numVowels = 0
    for letter in VOWELS:
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))
Less decision making should mean less time, and this also relies on built-ins, which I believe run faster.
import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676
Here's how to calculate it with a single command line on Linux:
cat wordlist.txt | tr -d aeiouAEIOU | paste - wordlist.txt | gawk 'BEGIN { FS="\t" } { RATIO = length($1)/ length($2); print $2, RATIO }'
Output:
aa 0
ab 0.5
abs 0.666667
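Note: each line in wordlist.txt contains one word, and an empty line will produce a division-by-zero error. The ratio printed is the share of non-vowel characters (the vowels are deleted before the length is measured), so subtract it from 1 to get the vowel proportion.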