I am trying to change the words that are nouns in a text to "noun".
I am having trouble. Here is what I have so far.
def noun(file):
    for word in file:
        for ch in word:
            if ch[-1:-3] == "ion" or ch[-1:-3] == "ism" or ch[-1:-3] == "ity":
                word = "noun"
            if file(word-1) == "the" and (file(word+1)=="of" or file(word+1) == "on"
                word = "noun"
                # words that appear after the
    return outfile
Any ideas?
Your slices are empty:
>>> 'somethingion'[-1:-3]
''
because the endpoint lies before the start. You could just use [-3:] here:
>>> 'somethingion'[-3:]
'ion'
But you'd be better off using str.endswith() instead:
ch.endswith(("ion", "ism", "ity"))
The function will return True if the string ends with any of the 3 given strings.
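For example, testing whole words against the tuple of suffixes:
>>> 'revolution'.endswith(("ion", "ism", "ity"))
True
>>> 'dog'.endswith(("ion", "ism", "ity"))
False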
Note that ch is not actually a word; if word is a string, then for ch in word iterates over individual characters, and those are never going to end in 3-character suffixes, being only one character long themselves.
Your attempts to look at the next and previous words are also going to fail; you cannot use a list or file object as a callable, let alone use file(word - 1) as a meaningful expression (a string - 1 fails, as well as file(...)).
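If you wanted to keep that approach, you'd first have to split the text into a list of words and use an index to reach the neighbours; a minimal sketch (where text stands in for your input string):
words = text.split()
for i, word in enumerate(words):
    prev_word = words[i - 1] if i > 0 else ""
    next_word = words[i + 1] if i < len(words) - 1 else ""
    # a word sitting between "the" and "of"/"on" gets replaced
    if prev_word == "the" and next_word in ("of", "on"):
        words[i] = "noun"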
Instead of looping over the 'word', you could use a regular expression here:
import re
nouns = re.compile(r'(?<=\bthe\b)(\s*\w+(?:ion|ism|ity)\s*)(?=\b(?:of|on)\b)')
some_text = nouns.sub(' noun ', some_text)
This looks for words ending in your three substrings, but only if preceded by the and followed by of or on and replaces those with noun.
Demo:
>>> import re
>>> nouns = re.compile(r'(?<=\bthe\b)(\s*\w+(?:ion|ism|ity)\s*)(?=\b(?:of|on)\b)')
>>> nouns.sub(' noun ', 'the scion on the prism of doom')
'the noun on the noun of doom'
I have a list of tweets and I have to count the instances of each word and turn that into a dictionary. But I also have to remove certain characters, ignore the newline ('\n') character, and make all characters uppercase.
This is my code but somehow some of the characters that I want to remove are still in the output. I don't know if I missed something here.
Note: "tweet_texts" is the name of the list of tweets.
words_dict = {} # where I store the words
remove_chars = "&$#[].,'#()-\"!?’_" # characters to be removed

tweet_texts = [t.upper() for t in tweet_texts]
tweet_texts = [t.replace('\n','') for t in tweet_texts]
for chars in remove_chars:
    tweet_texts = [t.replace(chars,'') for t in tweet_texts]

for texts in tweet_texts:
    words = texts.split()
    for word in words:
        if word in words_dict:
            words_dict[word] += 1
        else:
            words_dict[word] = 1

print(words_dict)
>>> {'RT': 53, '1969ENIGMA:': 1, 'SONA': 60, '“WALANG': 1, 'SUSTANSYA”:': 1} # this isn't the whole output; the actual output is really long, so I cut it
Looking at your example output, I can see the character “, which looks a lot like ", but is not in your list of characters to remove.
print('"' == '“') # False
print(ord('"')) # 34
print(ord('“')) # 8220
Perhaps you could try using a regular expression to keep only word and whitespace characters, like this:
import re
from collections import Counter
clean_tweets = [re.sub(r"[^\w\s]", "", tweet) for tweet in tweet_texts]
words_dict = Counter()
for tweet in clean_tweets:
    words_dict.update(tweet.split())
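Since words_dict is now a Counter, you also get the most frequent words for free:
print(words_dict.most_common(10)) # ten most common words and their counts
One caveat: \w includes the underscore, so unlike your remove_chars string this does not strip _.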
I am trying to solve the question
Implement the mapper, mapFileToCount, which takes a string (text from a file) and returns the number of capitalized words in that string. A word is defined as a series
of characters separated from other words by either a space or a newline. A word is capitalized if its first letter is capitalized (A vs a).
and my Python code currently reads
def mapFileToCount(s):
    lines = (str(s)).splitlines()
    words = (str(lines)).split(" ")
    up = 0
    for word in words:
        if word[0].isupper() == True:
            up = up + 1
    return up
However, I keep getting the error IndexError: string index out of range.
Please help.
Currently, given Hi huy \n hi you there:
lines will be ['Hi huy ', ' hi you there']
words will be ["['Hi", 'huy', "',", "'", 'hi', 'you', "there']"], because you split the string representation str(lines) instead of the lines themselves.
I'd suggest you split on any whitespace at once with words = re.split(r"\s+", s).
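For example:
>>> import re
>>> re.split(r"\s+", "Hi huy \n hi you there")
['Hi', 'huy', 'hi', 'you', 'there']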
Then the IndexError shows up in cases like Hi where are you__ (_ is a space): splitting leaves an empty string at the end, and you can't access the first char of an empty string. So just add a condition to the if:
if word and word[0].isupper():
Here if word skips zero-length words (an empty string is falsy), and word[0].isupper() is your actual test.
import re

def mapFileToCount(s):
    words = re.split(r"\s+", s)
    up = 0
    for word in words:
        if word and word[0].isupper():
            up = up + 1
    return up
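For example:
>>> mapFileToCount("Hi huy\nhi You there")
2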
The string index out of range error means that the index you are trying to access does not exist in the string: you are asking for a character at a position the string does not have.
In your code it's word[0] on an empty string.
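You can reproduce it with an empty string:
>>> ''[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range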
I have created a function that returns a boolean indicating whether an argument contains every letter in the string.ascii_lowercase string (a pangram).
Within the code, I am using a for loop to test membership of whitespace and punctuation with the string module constants string.whitespace and string.punctuation.
When testing the for loop, the special characters in the string.punctuation portion seem not to be matching the special characters produced by the for loop.
Please explain why string.punctuation is not working as planned.
import string

def ispanagram(text, alphabet = string.ascii_lowercase):
    """Return boolean indicating if argument contains every letter in the ascii alphabet"""
    alphabet_list = list(alphabet)
    letter_set = sorted(set(text.lower()))
    for char in letter_set:
        if char in string.whitespace or char in string.punctuation:
            letter_set.remove(char)
    return letter_set == alphabet_list

ispanagram("The quick brown !fox jumps over the lazy dog")
The main issue is that you're modifying letter_set while iterating over it, which does not work as expected.
To fix it, iterate over a copy:
for char in letter_set[:]:
Let me know if this helps.
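Putting it together, your function with only that change applied:
import string

def ispanagram(text, alphabet = string.ascii_lowercase):
    alphabet_list = list(alphabet)
    letter_set = sorted(set(text.lower()))
    for char in letter_set[:]: # iterate over a copy so removing items is safe
        if char in string.whitespace or char in string.punctuation:
            letter_set.remove(char)
    return letter_set == alphabet_list

print(ispanagram("The quick brown !fox jumps over the lazy dog")) # True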
import string
import re

def ispanagram(text, alphabet = string.ascii_lowercase):
    """Return boolean indicating if argument contains every letter in the ascii alphabet"""
    alphabet_list = list(alphabet)
    # just remove all the special characters including space
    text_only_chars = re.sub(r"[-()\"#/#;:<>{}`+=~|.!?, ]", "", text)
    letter_set = sorted(set(text_only_chars.lower()))
    return letter_set == alphabet_list

print(ispanagram("The quick brown !fox jumps over the lazy dog"))
#### Output ####
True
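As an aside, a plain set-containment check avoids stripping characters altogether; a minimal sketch of that alternative (is_pangram is just a new name for this variant):
import string

def is_pangram(text, alphabet = string.ascii_lowercase):
    # every letter of the alphabet must appear somewhere in the text
    return set(alphabet) <= set(text.lower())

print(is_pangram("The quick brown !fox jumps over the lazy dog")) # True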
I have a text where all the words are tagged with "parts of speech" tags. example of the text here:
What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT
I need to find all the occurrences where there is a /PUNCT followed by either NOUN, PRON or PROPN - and also count which one occurs the most often.
So one of the answers would appear like this:
?/PUNCT What/NOUN or ./PUNCT What/NOUN
Further on, the word "Deal" appears 6 times in my text, and I need to show this by code.
I am not allowed to use NLTK, only collections.
I have tried several different things, but I don't really know what to do here. I think I need to use defaultdict, and then somehow do a while loop that gives me back a list with the right connectives.
Here is a test program that does what you want.
It first splits the long string on spaces ' ', which creates a list of word/class elements. The for loop then checks whether the combination of PUNCT followed by NOUN, PRON, or PROPN occurs and saves each such pair to a list.
The code is as follows:
from collections import Counter

string = "What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT"
words = string.split(' ')

found = []
for n, (first, second) in enumerate(zip(words[:-1], words[1:])):
    first_class = first.split('/')[1]
    second_class = second.split('/')[1]
    if first_class == 'PUNCT' and second_class in ["NOUN", "PRON", "PROPN"]:
        print(f"Found occurrence at data list index {n} and {n+1} with {first_class}, {second_class}")
        found.append(f'{words[n]} {words[n+1]}')
To count the words:
words_only = [i.split('/')[0] for i in words]
word_counts = Counter(words_only).most_common()
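Since the question also asks which combination occurs most often, the same Counter approach works on the found pairs themselves:
pair_counts = Counter(found).most_common() # most frequent '?/PUNCT What/NOUN'-style pairs first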
I have written a really good program that uses text files as word banks for generating sentences from sentence skeletons. An example:
The skeleton
"The noun is good at verbing nouns"
can be made into a sentence by searching a word bank of nouns and verbs to replace "noun" and "verb" in the skeleton. I would like to get a result like
"The dog is good at fetching sticks"
Unfortunately, the handy replace() method was designed with speed, not custom functions, in mind. I made methods that accomplish the task of selecting random words from the right banks, but doing something like skeleton = skeleton.replace('noun', getNoun('file.txt')) replaces ALL instances of 'noun' with the result of a single call to getNoun(), instead of calling it for each replacement. So the sentences look like
"The dog is good at fetching dogs"
How might I work around this feature of replace() and make my method get called for each replacement? A minimal version of my code is below.
import random

def getRandomLine(rsv):
    # parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    f = open(rsv, 'r') # file handle in read mode
    n = int(f.readline()) # number of lines in file
    n = random.randint(1, n) # line number chosen to use
    s = "" # string to hold data
    for x in range(1, n):
        s = f.readline()
    s = s.replace("\n", "")
    return s

def makeSentence(rsv):
    # parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    pattern = getRandomLine(rsv) # get a random pattern from file
    # replace word tags with random words from matching files
    pattern = pattern.replace('noun', getRandomLine('noun.txt'))
    pattern = pattern.replace('verb', getRandomLine('verb.txt'))
    return str(pattern)

def main():
    result = makeSentence('pattern.txt')
    print(result)

main()
The re module's re.sub function does the job str.replace does, but with far more abilities. In particular, it offers the ability to pass a function for the replacement, rather than a string. The function is called once for each match with a match object as an argument and must return the string that will replace the match:
import re
pattern = re.sub('noun', lambda match: getRandomLine('noun.txt'), pattern)
The benefit here is added flexibility. The downside is that if you don't know regexes, the fact that re.sub interprets the pattern 'noun' as a regular expression may cause surprises. For example,
>>> re.sub('Aw, man...', 'Match found.', 'Aw, manatee.')
'Match found.e.'
If you don't know regexes, you may want to use re.escape to create a regex that will match the raw text you're searching for even if the text contains regex metacharacters:
>>> re.sub(re.escape('Aw, man...'), 'Match found.', 'Aw, manatee.')
'Aw, manatee.'
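To see that the replacement function really is called once per match:
>>> import re
>>> re.sub('noun', lambda match: match.group(0).upper(), 'a noun and another noun')
'a NOUN and another NOUN'
Each occurrence triggers a separate call, which is why getRandomLine('noun.txt') can return a different word every time.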
I don't know if you are asking to edit your code or to write new code, so I wrote new code:
import random

verbs = open('verb.txt').read().split()
nouns = open('noun.txt').read().split()

def makeSentence(sent):
    sent = sent.split()
    for k in range(0, len(sent)):
        if sent[k] == 'noun':
            sent[k] = random.choice(nouns)
        elif sent[k] == 'nouns':
            sent[k] = random.choice(nouns)+'s'
        elif sent[k] == 'verbing':
            sent[k] = random.choice(verbs)
    return ' '.join(sent)

var = raw_input('Enter: ') # Python 2; use input() in Python 3
print makeSentence(var)
This runs as:
$ python make.py
Enter: the noun is good at verbing nouns
the mouse is good at eating cats