This is a question from the NLTK book, but I got stuck. Does anyone know how to write this as a nested list comprehension?
>>> words = ['attribution', 'confabulation', 'elocution',
... 'sequoia', 'tenacious', 'unidirectional']
>>> vsequences = set()
>>> for word in words:
...     vowels = []
...     for char in word:
...         if char in 'aeiou':
...             vowels.append(char)
...     vsequences.add(''.join(vowels))
>>> sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
You can do it like this:
In [75]: ["".join([char for char in word if char in 'aeiou']) for word in words]
Out[75]: ['aiuio', 'oauaio', 'eouio', 'euoia', 'eaiou', 'uiieioa']
If you need the result de-duplicated and sorted:
sorted(set(["".join([char for char in word if char in 'aeiou']) for word in words]))
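A set comprehension gets to the same sorted result without building an intermediate list. This is a minimal sketch rather than the book's intended answer; the function name vowel_sequences is made up for illustration.

```python
def vowel_sequences(words):
    """Return the sorted, de-duplicated vowel sequences of the given words."""
    # Inner generator: keep only the vowels of one word, in order.
    # Outer set comprehension: one joined string per word, duplicates dropped.
    return sorted({''.join(char for char in word if char in 'aeiou')
                   for word in words})


words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']
print(vowel_sequences(words))
```

The inner expression is a generator, so no temporary list of characters is created for each word.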
Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case; for example, Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
The only problem I am having is getting the code to count "The" and "the" as the same thing. The code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1: userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w, 0) + 1
lst = list()
for k, v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
I need "the" and "The" to be counted as one single word.
We start by getting the words in a list and converting them all to lowercase. You can disregard punctuation by replacing each punctuation character in the string with a space:
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc, ' ')
# split() with no argument also drops the empty strings left by repeated spaces
words = s.split()
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
#{'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#'words': 2, 'are': 2, 'there': 2}
You can use collections.Counter and re like this:
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuations (you can add more here)
punctuationsToBeremoved = r",|\n|\?"
#to make all of them in lower case
sentence = sentence.lower()
#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))
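Putting the pieces from the answers above together for the original task (case-insensitive counts, no punctuation, most frequent first), one possible sketch follows. The helper name count_words and the sample text are assumptions made for illustration; reading the file is left out, and the text would come from something like f.read() on the opened file.

```python
import string
from collections import Counter


def count_words(text):
    """Count words case-insensitively, treating ASCII punctuation as spaces."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return Counter(text.lower().translate(table).split())


counts = count_words("The cat saw the dog. The dog ran!")
# most_common() yields (word, count) pairs from most to least frequent.
for word, count in counts.most_common():
    print(word, count)
```

Because apostrophes count as punctuation here, a word like "don't" is split into "don" and "t"; whether that is acceptable depends on the input.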
I'm trying to process a list of words and return a new list containing only the unique words. My definite loop works, but it only prints the words all together instead of one per line. Can anyone help me out? This is probably a simple question, but I am very new to Python. Thank you!
def get_unique_words(allWords):
    uniqueWords = []
    for word in allWords:
        if word not in uniqueWords:
            uniqueWords.append(word)
        else:
            uniqueWords.remove(word)
    return uniqueWords
You can use str.join:
>>> all_words = ['two', 'two', 'one', 'uno']
>>> print('\n'.join(get_unique_words(all_words)))
one
uno
Or a plain for loop:
>>> for word in get_unique_words(all_words):
...     print(word)
...
one
uno
However, your method won't work for odd counts:
>>> get_unique_words(['three', 'three', 'three'])
['three']
If your goal is to get all words that appear exactly once, here's a shorter method that works using collections.Counter:
from collections import Counter
def get_unique_words(all_words):
    return [word for word, count in Counter(all_words).items() if count == 1]
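To see why the odd-count case matters, the sketch below puts the append/remove loop from the question next to the Counter version. The names toggle_words and words_appearing_once are hypothetical, chosen only for this comparison.

```python
from collections import Counter


def toggle_words(all_words):
    """The question's loop: a word is added on odd sightings, removed on even ones."""
    result = []
    for word in all_words:
        if word not in result:
            result.append(word)
        else:
            result.remove(word)
    return result


def words_appearing_once(all_words):
    """Keep only the words whose total count is exactly one."""
    return [word for word, count in Counter(all_words).items() if count == 1]


sample = ['three', 'three', 'three', 'one']
print(toggle_words(sample))          # 'three' survives because it occurs 3 times
print(words_appearing_once(sample))  # only 'one' occurs exactly once
```

The two functions agree whenever no word occurs more than twice, which is why the problem is easy to overlook with small test lists.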
This code may help. As I understood your question, you want to print the unique words one per line:
allWords = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords = []
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
for i in uniqueWords:
    print(i)
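If the order of the words is not important, I recommend creating a set to store the unique words. Note that a set keeps every distinct word once, whereas the loop above drops a word again on its second occurrence: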
uniqueWords = set(allWords)
As you can see by running the code below, the set can be much faster, though the difference depends on the original list of words:
import timeit
setup = """
word_list = [str(x) for x in range(1000, 2000)]
allWords = []
for word in word_list:
    allWords.append(word)
    allWords.append(word)
"""
smt1 = "unique = set(allWords)"
smt2 = """
uniqueWords = []
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
"""
print("SET:", timeit.timeit(smt1, setup, number=1000))
print("LOOP:", timeit.timeit(smt2, setup, number=1000))
OUTPUT:
SET: 0.03147706200002176
LOOP: 0.12346845000001849
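If the original order does matter, a set is not enough, because it does not preserve order. One option, sketched here, is dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed since Python 3.7). The function name distinct_in_order is an assumption for illustration.

```python
def distinct_in_order(all_words):
    """Return each distinct word once, in order of first appearance."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(all_words))


allWords = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
for word in distinct_in_order(allWords):
    print(word)
```

Like set(allWords), this keeps every distinct word once, which is not the same as keeping only the words that occur exactly once.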
Maybe this fits your idea:
allWords = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords = dict()
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.update({word: 1})
    else:
        uniqueWords[word] += 1
for k, v in uniqueWords.items():
    if v == 1:
        print(k)
Prints:
distance
yes
I wanted to know how to iterate through a string word by word.
string = "this is a string"
for word in string:
    print(word)
The above gives an output:
t
h
i
s
i
s
a
s
t
r
i
n
g
But I am looking for the following output:
this
is
a
string
When you write
for word in string:
you are not iterating through the words in the string; you are iterating through its characters. To iterate through the words, first split the string into words using str.split(), and then iterate over the result. For example:
my_string = "this is a string"
for word in my_string.split():
    print(word)
Note that str.split(), called without arguments, splits on any run of whitespace (spaces, tabs, newlines, etc.).
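The difference between split() and split(' ') is easy to miss, so here is a small sketch of both; the sample string is made up for illustration.

```python
text = "this  is\ta string\n"

# No argument: split on any run of whitespace and drop empty strings.
default_split = text.split()

# Explicit single space: split on each space only, so the double space
# yields an empty string and the tab and newline stay inside the pieces.
space_split = text.split(' ')

for word in default_split:
    print(word)
```

For iterating word by word, the no-argument form is usually the one you want.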
This is one way to do it:
string = "this is a string"
ssplit = string.split()
for word in ssplit:
    print(word)
Output:
this
is
a
string
for word in string.split():
    print(word)
Using nltk:
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize("This is a string.")
# word_tokenize expects a single string, so tokenize each sentence separately
words_in_each_sentence = [word_tokenize(sentence) for sentence in sentences]
You can use TweetTokenizer to parse casual text with emoticons and the like.
One way to do this is with a dictionary. The problem with the code above is that it iterates over each letter in the string instead of each word. To solve this, first turn the string into a list with the split() method, and then count each item in that list. The code below returns the number of times each word appears in the string, in the form of a dictionary.
s = input('Enter a string to see if strings are repeated: ')
d = dict()
p = s.split()
for word in p:
    if word not in d:
        d[word] = 1
    else:
        d[word] += 1
print(d)
s = 'hi how are you'
l = s.split()  # split() already returns a list, so no map() is needed
print(l)
Output: ['hi', 'how', 'are', 'you']
You can also try this method:
sentence_1 = "This is a string"
words = sentence_1.split()  # avoid naming the variable "list", which shadows the built-in
for i in words:
    print(i)
If I have the following list
vowels = ["a","e","i","o","u"]
and another list
words = ["happiness", "yellow"]
how do I count the number of vowels in each word, i.e. happiness = 3, yellow=2?
Using list comprehension:
>>> vowels = ["a","e","i","o","u"]
>>> words = ["happiness", "yellow"]
>>> [sum(c in vowels for c in word) for word in words]
[3, 2]
If you want a mapping between the words and their vowel counts, use a dictionary comprehension:
>>> {word: sum(c in vowels for c in word) for word in words}
{'happiness': 3, 'yellow': 2}
Converting vowels to a set will make the membership test more efficient.
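As a sketch of that suggestion, the version below stores the vowels in a set and also lowercases each word, so uppercase vowels are counted too. The lowercasing and the helper name count_vowels are additions for illustration, not part of the original answer.

```python
VOWELS = set("aeiou")  # set membership tests take constant time on average


def count_vowels(word):
    """Count the vowels in a word, ignoring case."""
    # Each comparison is True or False, and sum() treats True as 1.
    return sum(char in VOWELS for char in word.lower())


words = ["happiness", "yellow"]
counts = {word: count_vowels(word) for word in words}
print(counts)
```

With only five vowels the speed difference from a list is small; the set mostly pays off when the collection being searched is large.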
data = [0] * len(words)  # Initialize the list of counts
for index, word in enumerate(words):  # Iterate through the list of words
    for letter in word:
        if letter in vowels:  # Check whether the letter is a vowel
            data[index] += 1
print(data)
data now contains the number of vowels for each word, at the same index as in the words list. Cheers! :)
Say I have the following dictionary:
d = {"word1":0, "word2":0}
For this regex I need to verify that a word in the string isn't a key in that dictionary.
Is it possible to set a variable to anything not in a dictionary, for the purposes of a regex?
Forget about regex in this case:
test = "word1 word2 word3"  # your string
words = test.split(' ')  # words in your string
d = {"word1": 0, "word2": 0}  # your dict (named d to avoid shadowing the built-in dict)
for word in words:
    if word in d:
        print(word, "is a key in dict")
    else:
        print(word, "isn't a key in dict")
>>> d = {"foo":0, "spam":0}
>>> test = "This is a string with many words, including foo and bar"
>>> any(word in d for word in test.split())
True
If punctuation is a problem (for example, "This is foo." would not find foo with this approach), and since you said all your words are alphanumeric, you could also use
>>> import re
>>> test = "This is foo."
>>> any(word in d for word in re.findall("[A-Za-z0-9]+", test))
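If the goal is to list the words that are not keys, rather than just test whether any word is a key, the same tokenization can feed a list comprehension. This is a sketch under that assumption; the function name words_not_in_dict is hypothetical.

```python
import re


def words_not_in_dict(d, text):
    """Return the alphanumeric words in text that are not keys of d."""
    # findall pulls out runs of letters and digits, so trailing
    # punctuation such as the period in "foo." is ignored.
    return [word for word in re.findall(r"[A-Za-z0-9]+", text)
            if word not in d]


d = {"foo": 0, "spam": 0}
print(words_not_in_dict(d, "This is foo."))
```

The lookup is case-sensitive, so "Foo" would be reported as missing from a dictionary that only has "foo" as a key.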