Find term with multiple words from dictionary - python

I'm working on a project where I read words from a file and then check whether each one is in a dictionary. I'm not sure my syntax is right, because the code falls through to the final else branch and reports that "does not work" is not in the dictionary.
Does it have anything to do with the spaces in between the words?
# test for term with multiple words -- does not work: -3
if 'does not work' in dictionary:
    expected_value3 = str(-3)
    actual_value3 = dictionary['does not work']
    if actual_value3 == expected_value3:
        print "---------------------------------"
        print "words with spaces passes| word: does not work"
    else:
        print "---------------------------------"
        print "words with spaces FALSE| word: does not work"
else:
    print "---------------------------------"
    print "does not work not in dictionary"

To handle phrases, you can't split each line of the dictionary file (format: word, then definition) on spaces, because the word itself may consist of several words separated by spaces. You need a separator character that never appears in the word or the definition, for instance a tab \t or a pipe |, and then build your dictionary like so:
d = {}
with open('dict.txt') as df:
    for line in df:
        # split on the first tab only and drop the trailing newline
        word, definition = line.rstrip('\n').split('\t', 1)
        d[word] = definition
Otherwise, with your current code
for line in scores_file:
    sentiments = line.split()
    dictionary[sentiments[0]] = sentiments[1]
the list built inside your loop looks like
sentiments = ['does','not','work','the','operation',...]
and you end up setting
dictionary['does'] = 'not'
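As a rough end-to-end illustration of the tab-separated approach, the sketch below loads phrases and scores and then looks up a multi-word phrase. The in-memory sample_file and its contents are assumptions made only for this example; they stand in for the real scores file.

```python
import io

# Hypothetical stand-in for the scores file: one "phrase<TAB>score" pair per line.
sample_file = io.StringIO("does not work\t-3\ngreat\t3\n")

dictionary = {}
for line in sample_file:
    line = line.rstrip("\n")
    if not line:
        continue  # skip blank lines
    # Split on the first tab only, so spaces inside the phrase are preserved.
    phrase, score = line.split("\t", 1)
    dictionary[phrase] = score

found = "does not work" in dictionary
print(found, dictionary.get("does not work"))
```

Because the whole phrase is stored as a single key, the membership test from the question succeeds without any change to the lookup code.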

Related

How to remove all instances of a character from a list of strings?

I have a list of tweets and I have to count the instances of each word and turn that into a dictionary. But I also have to remove certain characters, ignore the newline ('\n') character, and make all characters uppercase.
This is my code but somehow some of the characters that I want to remove are still in the output. I don't know if I missed something here.
Note: "tweet_texts" is the name of the list of tweets.
words_dict = {}  # where I store the words
remove_chars = "&$#[].,'#()-\"!?’_"  # characters to be removed
tweet_texts = [t.upper() for t in tweet_texts]
tweet_texts = [t.replace('\n','') for t in tweet_texts]
for chars in remove_chars:
    tweet_texts = [t.replace(chars,'') for t in tweet_texts]
for texts in tweet_texts:
    words = texts.split()
    for word in words:
        if word in words_dict:
            words_dict[word] += 1
        else:
            words_dict[word] = 1
print(words_dict)
>>> {'RT': 53, '1969ENIGMA:': 1, 'SONA': 60,“WALANG': 1, 'SUSTANSYA”:': 1} #this isn't the whole output, the actual output is really long so I cut it
Looking at your example output, I can see the character “, which looks a lot like ", but is not in your list of characters to remove.
print('"' == '“') # False
print(ord('"')) # 34
print(ord('“')) # 8220
Perhaps you could try using a regular expression that keeps only word and whitespace characters, like this:
import re
from collections import Counter
clean_tweets = [re.sub(r"[^\w\s]", "", tweet) for tweet in tweet_texts]
words_dict = Counter()
for tweet in clean_tweets:
    words_dict.update(tweet.split())
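One detail worth noting: \w also matches the underscore, so the pattern [^\w\s] on its own leaves _ in place, although the question lists it among the characters to remove. The sketch below is one possible variation that strips underscores as well; the sample tweets are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample tweets, standing in for the asker's tweet_texts list.
tweet_texts = ["RT @user: “Walang sustansya” daw ang SONA!\n", "sona_2019 #SONA"]

# [^\w\s] removes punctuation (curly quotes included); "|_" also removes underscores.
pattern = re.compile(r"[^\w\s]|_")
clean_tweets = [pattern.sub("", tweet).upper() for tweet in tweet_texts]

words_dict = Counter()
for tweet in clean_tweets:
    words_dict.update(tweet.split())  # split() also discards the newline

print(words_dict)
```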

How to print a text without substring in Python

I want to search for a word in the text and then print the text without that word. For example, given the text "I was with my friend", I want the result to be "I with my friend". I have done the following so far:
text=re.compile("[^was]")
val = "I was with my friend"
if text.search(val):
    print text.search(val) #in this line it is obvious wrong
else:
    print 'no'
val = "I was with my friend"
print val.replace("was ", "")
Output:
I with my friend
If you want to remove what you've found using a regular expression:
match = text.search(val)
if match is not None:
    print val.replace(match.group(0), "")
(However, if you were searching for the word was, then your pattern is wrong: [^was] is a character class that matches any single character other than w, a, or s.)
Substitute an empty string if matched.
text=re.compile(r"was")
val = "I was with my friend"
if text.search(val):
    print text.sub('',val)
else:
    print 'no'
Or you can split on the match and join the pieces again.
if text.search(val):
    print(''.join(text.split(val)))
Maybe something like this:
print val[:val.index('was')] + val[val.index('was ') + 4:]
This example assumes that the word is was, but you can define a variable and use that instead:
search_word = 'was'
print val[:val.index(search_word)] + val[val.index(search_word) + len(search_word) + 1:]
Also, this only handles the first occurrence of search_word and doesn't check whether the text contains the word at all.
To check whether a string contains a substring, simply do
if 'was' in 'i was with my friend':
    print ...
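The approaches above match the characters was wherever they occur, including inside longer words, and some of them leave a double space behind. A possible refinement, sketched here with the hypothetical variable name search_word, is to anchor the pattern on word boundaries and consume the whitespace that follows the word:

```python
import re

val = "I was with my friend"
search_word = "was"  # hypothetical name for the word to drop

# \b anchors the match to whole words; \s* also consumes the spaces that follow.
pattern = re.compile(r"\b" + re.escape(search_word) + r"\b\s*")
result = pattern.sub("", val)
print(result)
```

re.escape is used so that a search word containing regex metacharacters is still treated literally.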

How do I output the acronym on one line

I am following the hands-on Python tutorials from Loyola University, and for one exercise I am supposed to get a phrase from the user, capitalize the first letter of each word, and print the acronym on one line.
I have figured out how to print the acronym but I can't figure out how to print all the letters on one line.
letters = []
line = input('?:')
letters.append(line)
for l in line.split():
    print(l[0].upper())
Pass end='' to your print function to suppress the newline character, viz:
for l in line.split():
    print(l[0].upper(), end='')
print()
Your question would be better if you shared the code you are using so far; I'm just guessing that you have saved the capital letters into a list.
You want the string method .join(), which is called on a separator string (the part before the .) and joins a list of items with that separator between them. For an acronym you'd want an empty string as the separator,
e.g.
l = ['A','A','R','P']
acronym = ''.join(l)
print(acronym)
You could create a string variable at the beginning: string = "".
Then, instead of calling print(l[0].upper()), append to the string with string += l[0].upper().
Lastly, print(string).
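The join-based idea can also be written as a single expression. The following is only a sketch: the function name acronym and the hard-coded sample phrase are assumptions for illustration, with the phrase standing in for the value returned by input('?:').

```python
def acronym(phrase):
    """Return the upper-cased first letter of every word in phrase."""
    return "".join(word[0].upper() for word in phrase.split())

line = "portable network graphics"  # stands in for input('?:')
result = acronym(line)
print(result)
```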

Trying to use output of one function to influence the next function to count words in text file

I'm trying to use one function to count the number of words in a text file, after first "cleaning" the text so that it contains only letters and single spaces. So I have a first function, which cleans up the text, and a second function that returns the length of the result of the first one (the cleaned text). Here is the first of those two functions.
def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha
So words is the text file cleaned up without double spaces, hyphens, or line feeds.
Then I take out all numbers and return the cleaned-up onlyAlpha text.
Now if I put return len(onlyAlpha.split()) instead of just return onlyAlpha, it gives me the correct number of words in the file (I know because I have the answer). But if I do it this way and try to split it into two functions, the word count comes out wrong. Here's what I'm talking about (this is my word-counting function):
def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())
I define newWords in main() as newWords = cleanUpWords(harper). harper is a variable that holds the result of another function that reads the file (beside the point).
def main():
    harper = readFile("Harper's Speech.txt") #readFile function reads
    newWords = cleanUpWords(harper)
    print(numWords(harper), "Words.")
Given all of this, please tell me why it gives a different answer if I split it into two functions.
For reference, here is the version that counts the words correctly but doesn't separate the cleaning from the counting. numWords now both cleans and counts, which isn't preferred.
def numWords(file):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return len(onlyAlpha.split())
def main():
    harper = readFile("Harper's Speech.txt")
    print(numWords(harper), "Words.")
Hope I gave enough info.
The problem is quite simple: you split it into two functions, but you completely ignore the result of the first function and instead count the words before the cleanup!
Change your main function to this, then it should work.
def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numWords(newWords), "Words.") # use newWords here!
Also, your cleanUpWords function could be improved a bit. It can still leave double or triple spaces in the text, and you could also make it a bit shorter. Either, you could use regular expressions:
import re
def cleanUpWords(string):
    # raw strings keep the backslash in \s from being treated as a string escape
    only_alpha = re.sub(r"[^a-zA-Z]", " ", string)
    single_spaces = re.sub(r"\s+", " ", only_alpha)
    return single_spaces
Or you could first filter out all the illegal characters, and then split the words and join them back together with a single space.
def cleanUpWords(string):
    only_alpha = ''.join(c for c in string if c.isalpha() or c == ' ')
    single_spaces = ' '.join(only_alpha.split())
    return single_spaces
Example, for which your original function would leave some double spaces:
>>> s = "text   with triple   spaces and other \n sorts \t of strange ,.-#+ stuff and 123 numbers"
>>> cleanUpWords(s)
'text with triple spaces and other sorts of strange stuff and numbers'
(Of course, if you intend to split the words anyway, double spaces are not a problem.)
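To see why the argument passed to the counting function matters, here is a small self-contained sketch of the two-step pipeline. It uses the regular-expression variant of the cleanup; the snake_case names and the sample text are assumptions for this example and stand in for the contents of the speech file.

```python
import re

def clean_up_words(text):
    """Keep letters only and collapse every run of whitespace to one space."""
    only_alpha = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", only_alpha).strip()

def num_words(cleaned_text):
    """Count the words in text that has already been cleaned."""
    return len(cleaned_text.split())

raw = "state-of-the-art results\nin 2 tests"  # hypothetical stand-in for the file contents
cleaned = clean_up_words(raw)

# Counting the raw text treats the hyphenated term as one word and counts "2";
# counting the cleaned text does neither, so the two totals differ.
print(num_words(raw), num_words(cleaned))
```

Passing raw instead of cleaned reproduces the kind of discrepancy described in the question.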

How to search for words containing certain letters in a txt file with Python?

Look at the code below. It finds the letter 'b' in the text file and prints all the words containing the letter 'b', right?
x = open("text file", "r")
for line in x:
    if "b" and in line: print line
searchfile.close()
Now here is my problem. I would like to search with not only one, but several letters.
For example, a and b both have to be in the same word.
And then print the list of words containing both letters.
And I'd like to have the user decide what the letters should be.
How do I do that?
Now I've come up with something new after reading an answer.
x = open("text file", "r")
for line in x:
    if "b" in line and "c" in line and "r" in line: print line
Would this work instead?
And how do I make the user enter the letters?
No, your code (apart from the fact that it's syntactically incorrect) will print every line that has a "b", not the words.
To do what you want, we need more information about the text file. Supposing words are separated by whitespace, you could do something like this to keep the words that contain both letters:
with open("file", "r") as x:
    words = [w for w in x.read().split() if "a" in w and "b" in w]
You could use sets for this:
letters = set(('l','e'))
for line in open('file'):
    if letters <= set(line):
        print line
In the above, letters <= set(line) tests whether every element of letters is present in the set consisting of the unique characters of line. Note that this checks whole lines, not individual words.
First you need to split the contents of the file into a list of words. To do this you need to split on line breaks and on spaces, possibly hyphens too; I don't really know. You might want to use re.split depending on how complicated the requirements are. But for this example let's just go:
words = []
with open('file.txt', 'r') as f:
    for line in f:
        words += line.split()  # split() with no argument also drops the newline
Now it will help efficiency if we only have to scan each word once, and presumably you only want a word to appear once in the final list anyway, so we convert this list to a set
words = set(words)
Then to get only those selected_words containing all of the letters in some other iterable letters:
selected_words = [word for word in words if
                  all(letter in word for letter in letters)]
I think that should work. Any thoughts on efficiency? I don't know the details of how those list comprehensions run.
x = open("text file", "r")
letters = raw_input('Enter the letters to match') # "ro" would match "copper" and "word"
letters = letters.lower()
for line in x:
    for word in line.split():
        if all(l in word.lower() for l in letters): # could optimize with sets if needed
            print word
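Combining the set idea with per-word matching, a Python 3 version might look like the sketch below. The function name words_with_letters and the sample text are hypothetical; in a real script the text would be read from the file and the letters would come from user input.

```python
def words_with_letters(text, letters):
    """Return the sorted unique words in text that contain every letter in letters."""
    wanted = set(letters.lower())
    # wanted <= set(word) is True when every wanted letter occurs in the word
    return sorted({w for w in text.lower().split() if wanted <= set(w)})

sample_text = "The copper wire was a word of caution\nfor the crab and the cab"
# In a real script the letters might come from input("Enter the letters to match: ")
matches = words_with_letters(sample_text, "ro")
print(matches)
```

Using a set comprehension removes duplicate words, and sorting the result makes the output order predictable.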
