Ignore punctuation and case when comparing two strings in Python - python

I have a two dimensional array called "beats" with a bunch of data. In the second column of the array, there is a list of words in alphabetical order.
I also have a sentence called "words" which was originally a string, which I've turned into an array.
I need to check if one of the words in "words" matches any of the words in the second column of the array "beats". If a match is found, the program changes the matched word in the sentence "words" to "match" and then returns the words as a string. This is the code I'm using:
i = 0
while i < len(words):
    n = 0
    while n < len(beats):
        if words[i] == beats[n][1]:
            words[i] = "match"
        n = n + 1
    i = i + 1
mystring = ' '.join(words)
return mystring
So if I have the sentence:
"Money is the last money."
And "money" is in the second column of the array "beats", the result would be:
"match is the last match."
But since there's a period after "money", the program doesn't consider it a match.
Is there a way to ignore punctuation when comparing the two strings? I don't want to strip the sentence of punctuation because I want the punctuation to be intact when I return the string once my program's done replacing the matches.

You can create a new string that has the properties you want, and then compare with the new string(s). This will strip everything but numbers, letters, and spaces while making all letters lowercase.
''.join([letter.lower() for letter in ' '.join(words) if letter.isalnum() or letter == ' '])
To strip everything but letters from a string you can do something like:
from string import ascii_letters
''.join([letter for letter in word if letter in ascii_letters])
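To connect this back to the question, here is a minimal sketch of comparing a cleaned copy of each word while leaving the original punctuation in place. The function name replace_matches and the sample data are hypothetical; the only thing taken from the question is that the word to match sits in the second column of each row of beats.

```python
import string

def replace_matches(sentence, beats, replacement="match"):
    # Lower-cased lookup set built from the second column of beats
    targets = {row[1].lower() for row in beats}
    out = []
    for token in sentence.split():
        # Compare a copy with surrounding punctuation stripped off
        core = token.strip(string.punctuation)
        if core and core.lower() in targets:
            # Swap only the word itself, so "money." becomes "match."
            token = token.replace(core, replacement, 1)
        out.append(token)
    return " ".join(out)

print(replace_matches("Money is the last money.", [[0, "money"], [1, "nonsense"]]))
# match is the last match.
```

The comparison uses the stripped, lower-cased copy, but the replacement is applied to the original token, which is why the trailing period survives.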

You could use a regex:
import re
st = "Money is the last money."
words = st.split()
beats = ['money', 'nonsense']
for i, word in enumerate(words):
    if word == 'match':
        continue
    for tgt in beats:
        # \b anchors the pattern at word boundaries; re.I ignores case
        word = re.sub(r'\b{}\b'.format(re.escape(tgt)), 'match', word, flags=re.I)
    words[i] = word
print(' '.join(words))
prints
match is the last match.
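The same idea can probably be expressed as a single substitution over the whole sentence by joining all targets into one alternation. This is only a sketch; replace_all is a hypothetical helper name, and it assumes the targets are plain words rather than regex patterns.

```python
import re

def replace_all(sentence, targets, replacement="match"):
    if not targets:
        # An empty alternation would match at every word boundary
        return sentence
    # Escape each target and join them as alternatives: \b(?:money|nonsense)\b
    alternation = "|".join(re.escape(t) for t in targets)
    pattern = re.compile(r"\b(?:{})\b".format(alternation), flags=re.IGNORECASE)
    return pattern.sub(replacement, sentence)

print(replace_all("Money is the last money.", ["money", "nonsense"]))
# match is the last match.
```

Because the substitution runs on the unsplit sentence, the original spacing and punctuation are left untouched.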

If it is only the full stop that you are worried about, you can add another if case to match that too. Similarly, you can add custom handling if your cases are limited. Otherwise, regex is the way to go.
words = "Money is the last money. This money is another money."
words = words.split()
i = 0
while i < len(words):
    if words[i].lower() == "money":
        words[i] = "match"
    if words[i].lower() == "money" + '.':
        words[i] = "match."
    i = i + 1
mystring = ' '.join(words)
print(mystring)
Output:
match is the last match. This match is another match.

Related

Move the initial letter to the end of the word and add 'arg', keeping punctuation at the end of the word

I need to take the initial letter of every word, move it to the end of the word, and add 'arg'. I tried the following:
def pirate(str):
    list_str = str.split(' ')
    print(list_str)
    new_str = ''
    for lstr in list_str:
        first_element = lstr[0]
        second_element = lstr[1:]
        new_str += second_element + first_element + 'arg' + ' '
    return new_str
print(pirate('Hello! how are, you!!'))
The expected output is: elloHarg! owharg reaarg, ouyarg!!
However, I am getting following output: ello!Harg owharg re,aarg ou!!yarg
How can I make it work for the following use case?
Punctuation should remain at the end of the word even after translation. Assume punctuation only appears at the end of a word. The punctuation marks to be considered are .,:;?! and there may be several in a row (e.g. yes!!).
Here is a short and efficient solution using a regex:
import re
re.sub(r'(\w)(\w+)', r'\2\1arg', 'Hello! how are, you!!')
This is literally: replace each single letter followed by more letters by the more letters first, then the single letter and 'arg'
Output:
'elloHarg! owharg reaarg, ouyarg!!'
As a function:
def pirate(s):
    return re.sub(r'(\w)(\w+)', r'\2\1arg', s)
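One thing worth noting: (\w)(\w+) needs at least two word characters, so a one-letter word such as "I" is left unchanged. If one-letter words should get 'arg' as well (an assumption, since the question does not say), a variant with \w* might look like this:

```python
import re

def pirate(s):
    # (\w*) may match nothing, so single-letter words are transformed too
    return re.sub(r'(\w)(\w*)', r'\2\1arg', s)

print(pirate('Hello! how are, you!!'))  # elloHarg! owharg reaarg, ouyarg!!
print(pirate('I am'))                   # Iarg maarg
```

Keep in mind that \w also matches digits and underscores, so tokens like "R2D2" are treated as words by both versions.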

Append last letter in a string to another string

I am constructing a chatbot that rhymes in Python. Is it possible to identify the last vowel (and all the letters after that vowel) in a random word and then append those letters to another string without having to go through all the possible letters one by one (like in the following example)?
lastLetters = ''  # String we want to append the letters to
if user_answer.endswith("a"):
    lastLetters += "a"
elif user_answer.endswith("b"):
    lastLetters += "b"
For example, if the word was "right", we'd want to get "ight".
You need to find the last index of a vowel, for that you could do something like this (a bit fancy):
s = input("Enter the word: ") # You can do this to get user input
last_index = len(s) - next((i for i, e in enumerate(reversed(s), 1) if e in "aeiou"), -1)
result = s[last_index:]
print(result)
Output
ight
An alternative using regex:
import re
s = "right"
last_index = -1
match = re.search("[aeiou][^aeiou]*$", s)
if match:
    last_index = match.start()
result = s[last_index:]
print(result)
The pattern [aeiou][^aeiou]*$ means: match a vowel followed by any number of characters that are not vowels ([^aeiou] means "not a vowel"; the ^ sign inside brackets means negation in regex) until the end of the string. So basically, it matches the last vowel and everything after it. Notice this assumes a string composed only of lowercase consonants and vowels.
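Wrapped in a function, the regex approach might look like the sketch below. The name rhyme_tail is hypothetical, and two choices here are assumptions rather than part of the answer above: uppercase vowels are listed explicitly in the pattern, and a word without vowels yields an empty string.

```python
import re

# Last vowel plus everything after it, up to the end of the string
TAIL = re.compile(r"[aeiouAEIOU][^aeiouAEIOU]*$")

def rhyme_tail(word):
    match = TAIL.search(word)
    # Assumption: return an empty string when the word has no vowel
    return match.group(0) if match else ""

print(rhyme_tail("right"))   # ight
print(rhyme_tail("banana"))  # a
print(rhyme_tail("rhythm"))  # (empty line: no vowel found)
```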

String Index Out of Range Issue - Python

I am trying to make a lossy text compression program that removes all vowels from the input, except for if the vowel is the first letter of a word. I keep getting this "string index out of range" error on line 6. Please help!
text = str(input('Message: '))
text = (' ' + text)
for i in range(0, len(text)):
    i = i + 1
    if str(text[i-1]) != ' ': #LINE 6
        text = text.replace('a', '')
        text = text.replace('e', '')
        text = text.replace('i', '')
        text = text.replace('o', '')
        text = text.replace('u', '')
print(text)
As busybear notes, the loop isn't necessary: your replacements don't depend on i.
Here's how I'd do it:
def strip_vowels(s):  # Remove all vowels from a string
    for v in 'aeiou':
        s = s.replace(v, '')
    return s
def compress_word(s):
    if not s:
        return ''  # Needed to avoid an out-of-range error on the empty string
    return s[0] + strip_vowels(s[1:])  # Strip vowels from all but the first letter
def compress_text(s):  # Apply to each word
    words = s.split(' ')
    new_words = [compress_word(w) for w in words]
    return ' '.join(new_words)
When you replace letters with a blank, your word gets shorter. So what was originally len(text) is going to be out of bounds if you remove any letters. Do note however, replace is replacing all occurrences within your string, so a loop isn't even necessary.
An alternative that keeps the loop is to keep track of the indices of the letters to remove while going through the loop, and then remove them after the loop is complete.
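A rough sketch of that index-tracking idea follows. The function name compress and the rule that a word starts at index 0 or right after a space are assumptions made for illustration.

```python
def compress(text):
    # Pass 1: record the positions of vowels that do not start a word.
    # The string is not modified here, so the indices stay valid.
    to_drop = set()
    for i, ch in enumerate(text):
        starts_word = (i == 0) or text[i - 1] == ' '
        if ch in 'aeiou' and not starts_word:
            to_drop.add(i)
    # Pass 2: rebuild the string, skipping the recorded positions.
    return ''.join(ch for i, ch in enumerate(text) if i not in to_drop)

print(compress("hello apple"))  # hll appl
```

Since nothing is removed while the loop is running, the index never runs past the end of the string.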
Shortening your string by replacing any char with "" means that if you remove a character, the len(text) used in your iterator is longer than the actual string length. There are plenty of alternative solutions. For example:
text_list = list(text)
for i in range(1, len(text_list)):
    if text_list[i] in "aeiou":
        text_list[i] = ""
text = "".join(text_list)
By turning your string into a list of its composite characters, you can remove characters but maintain the list length (since empty elements are allowed) then rejoin them.
Be sure to account for special cases, such as len(text)<2.

Acronym input text and reverse it

My task is to turn the input text into an acronym and reverse it. Each word should be more than 3 characters long and must not contain symbols such as ,!'?. For example, given the sentence "That was quite easy?", the function should return EQT.
I have done so far:
def acr(message):
    words = message.split()
    if check_length(words) is False:
        return "the input long!"
    else:
        first_letters = []
        for word in words:
            first_letters.append(word[0])
        result = "".join(first_letters)
        return reverse(result.upper())
def check(word):
    if len(word) > 3:
        return False
def check_length(words):
    if len(words) > 50:
        return False
def rev(message):
    reversed_message = message[::-1]
    return reversed_message
I have problems with the check function. How do I correctly check the length of the words and the symbols?
A bit hacky in the sense that a comma is technically a special character (but you want the 'e' from easy), but this works perfectly for your example. Set up the "if" statement in the "for word in words" section.
def acronymize(message):
    """Turn the input text into the acronym and reverse it, if the text is not too long."""
    words = message.split()
    if check_message_length(words) is False:
        return "Sorry, the input's just too long!"
    else:
        first_letters = []
        for word in words:
            if len(word) > 3 and word.isalnum() == True or (len(word) > 4 and ',' in word):  # satisfies all conditions. Allows commas, but no other special characters.
                first_letters.append(word[0])
        result = "".join(first_letters)
        return reverse(result.upper())
Basically, the 'if' condition now reads: if the word has more than 3 characters AND is alphanumeric, it satisfies all the conditions; OTHERWISE (OR), if the word has more than 4 characters and contains a comma (the comma adds one character to the word, hence len(word) > 4), it still satisfies the conditions. In either case, its first letter is added to the first_letters list.
Otherwise, ignore the word.
This way you don't even have to set up a check_word function.
This spits out the answer
'EQT'
A couple more examples from my code:
Input: Holy cow, does this really work??
Output: 'RTDH'
** Note that it did NOT include the word 'cow' because it did not have more than 3 letters.
Input: Holy cows, this DOES work!!
Output: 'DTCH'
** Note, now the term 'cows' gets counted because it has more than 3 letters.
You can similarly add any exceptions that you want (!, ? and .) using the 'or' format:
Ex: or (len(word) > 4 and '!' in word) or (len(word) > 4 and '?' in word)
The only assumption made for this is that the sentence is grammatically correct (as in, it won't have exclamation marks followed by commas).
It can be further cleaned up by making a list of the special characters that you would allow and passing that list into the or clause.
Hope that helps!
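The "list of allowed special characters" idea could be sketched roughly as follows. This is an illustration rather than the code used above: ALLOWED_TRAILING and min_len are hypothetical names, the length check from the question is left out, and the sketch strips trailing punctuation before testing the word, so a word like "work??" is counted here even though the version above skips it.

```python
ALLOWED_TRAILING = ",.!?"  # assumption: punctuation tolerated at the end of a word

def acronymize(message, min_len=4):
    letters = []
    for word in message.split():
        # Drop tolerated trailing punctuation before checking the word itself
        core = word.rstrip(ALLOWED_TRAILING)
        if len(core) >= min_len and core.isalnum():
            letters.append(core[0])
    # Reverse the collected first letters and upper-case the result
    return "".join(letters)[::-1].upper()

print(acronymize("That was quite easy?"))  # EQT
```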
re.findall(r'(\w)\w{3,}', sentence) finds the first letter of every word that is at least four letters long:
''.join(reversed(re.findall(r'(\w)\w{3,}', sentence))).upper()
If you want to ignore words that are followed by non-word characters, use (\w)\w{3,},?(?:$|\s), which also explicitly allows a trailing comma:
''.join(reversed(re.findall(r'(\w)\w{3,},?(?:$|\s)', sentence))).upper()
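To see how the two patterns differ, here is a small runnable comparison on the sentence from the question. The variable names loose and strict are only labels for this example.

```python
import re

sentence = "That was quite easy?"

# First pattern: any run of four or more word characters counts
loose = ''.join(reversed(re.findall(r'(\w)\w{3,}', sentence))).upper()

# Second pattern: the word must be followed by whitespace or the end of
# the string (optionally after a comma), so "easy?" is skipped
strict = ''.join(reversed(re.findall(r'(\w)\w{3,},?(?:$|\s)', sentence))).upper()

print(loose)   # EQT
print(strict)  # QT
```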

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a small function whose body is a single expression: it cannot contain statements such as assignments, and it simply returns the value of that expression.
So get_alphabetical_characters is a function which, as its name suggests, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
And the some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is step through every character in the given word and put it into the list if it's alphabetical; if it isn't, it puts a blank string in its place.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluated by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimals, we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos].isspace() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char for pos, char in enumerate(fname) if include_char(pos, fname)).split()
What we're doing here is that the include_char function checks whether a character is "alphanumeric" (i.e. a letter or a digit), is whitespace (so that the words stay separated), or is a period whose preceding character is numeric. We use this function to keep only the characters we want, join them into a single string, and then separate that string into a list of strings using the str.split method.
This program may help you:
# Characters that should not be treated as part of a word
char2remove = (".", ",", ";", "!", "?", "*", ":")
# Read a string from the user
string = raw_input("Enter your string: ")
# Make all the letters lower-case
string = string.lower()
# Replace the special characters with white-space
for char in char2remove:
    string = string.replace(char, " ")
# Extract all the words in the new string (may contain repeats)
words = string.split(" ")
# Count whole words; the dictionary keeps one entry per distinct word
to_count = dict()
for word in words:
    to_count[word] = to_count.get(word, 0) + 1
# Print the count of each word
for word in to_count:
    # Skip empty strings left over from consecutive spaces
    if word.isalpha():
        print word, to_count[word]
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter characters (which gets rid of the char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
# sub() returns a new string, so the result has to be assigned back
your_str = regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word] = to_count.get(word, 0) + 1
for word in to_count:
    if word.isalpha():
        print word, to_count[word]
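For Python 3, a shorter take on the same idea might combine re.findall with collections.Counter. This is only a sketch: count_words is a hypothetical name, and it assumes a word is a run of the letters a-z, so something like "don't" would be split into "don" and "t".

```python
import re
from collections import Counter

def count_words(text):
    # Lower-case first, then pull out runs of letters; punctuation and
    # stray symbols such as "*" never reach the counter
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

counts = count_words("Hello, I am Bob. Hello to Bob *")
for word, count in counts.most_common():
    print(count, word)
```

Counter.most_common() lists the entries from the highest count to the lowest, which matches the ordering asked for in the question.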
