Removing special symbols from a Python string - python

I am trying to remove all kinds of special symbols from each word in the given string sen, but I can't figure out a way to do it properly in Python.
import string
def LongestWord(sen):
    maxlen = 0
    count = 0
    words = sen.split()
    for word in words:
        ''.join(e for e in word if e.isalnum())
        if maxlen < len(word):
            maxlen = len(word)
            sen = words[count]
        count = count +1
    return sen
# keep this function call here
print LongestWord(raw_input())
For the following string :
"a beautiful sentence^&!"
I get this as the output : sentence^&!
Please help me figure out how to remove these special symbols and punctuation marks.

Have a look at Python String join() Method.
This method returns a string, which is the concatenation of the strings in the sequence seq. The separator between elements is the string providing this method.
In short, you need to save what it returns in a variable.
def LongestWord(sen):
    words = sen.split()
    answer_string = ''
    for word in words:
        answer_string += ''.join(e for e in word if e.isalnum())
    return answer_string
print(LongestWord("a beautiful sentence^&!"))
Output:
abeautifulsentence

You just need to store the result of ''.join(...) in a variable:
word = ''.join(e for e in word if e.isalnum())
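Putting the two answers together, here is a minimal sketch of a function that cleans each word and then returns the longest one, which is what the question's LongestWord appears to be aiming for. The name longest_word and the choice to return an empty string for empty input are assumptions made for illustration, not part of the original code.

```python
def longest_word(sentence):
    # Strip every non-alphanumeric character from each word,
    # keeping the cleaned result instead of discarding it.
    cleaned = ["".join(ch for ch in word if ch.isalnum())
               for word in sentence.split()]
    # max() returns the first longest word; default covers empty input.
    return max(cleaned, key=len, default="")

print(longest_word("a beautiful sentence^&!"))  # beautiful
```

Here "sentence^&!" is cleaned to "sentence" before the lengths are compared, so the punctuation no longer inflates its length.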

Related

How to solve the string indices must be integers problem in a for loop for capitalizing every word in a string

I hope everyone is safe.
I am trying to go over a string and capitalize the first letter of every word in it.
I know I can use .title() but
a) I want to figure out how to use capitalize or something else in this case (the basics), and
b) The strings in the tests have some words containing an apostrophe ('), which confuses .title() into capitalizing the letter after the apostrophe.
def to_jaden_case(string):
    appended_string = ''
    word = len(string.split())
    for word in string:
        new_word = string[word].capitalize()
        appended_string +=str(new_word)
    return appended_string
The problem is the interpreter gives me "TypeError: string indices must be integers" even tho I have an integer input in 'word'. Any help?
thanks!
You are doing some strange things in the code.
First, you split the string just to count the number of words, but don't store it to manipulate the words after that.
Second, when iterating a string with a for in, what you get are the characters of the string, not the words.
I have made a small snippet to help you do what you desire:
def first_letter_of_word_upper(string, exclusions=["a", "the"]):
    words = string.split()
    for i, w in enumerate(words):
        if w not in exclusions:
            words[i] = w[0].upper() + w[1:]
    return " ".join(words)
test = first_letter_of_word_upper("miguel angelo santos bicudo")
test2 = first_letter_of_word_upper("doing a bunch of things", ["a", "of"])
print(test)
print(test2)
Notes:
I assigned the value of the string splitting to a variable to use it in the loop
As a bonus, I included a list to allow you exclude words that you don't want to capitalize.
I use the same array of split words to build the result, and then join based on that array. This is an efficient way to do it.
Also, I show some useful Python tricks... first is enumerate(iterable) that returns tuples (i, j) where i is the positional index, and j is the value at that position. Second, I use w[1:] to get a substring of the current word that starts at character index 1 and goes all the way to the end of the string. Ah, and also the usage of optional parameters in the list of arguments of the function... really useful things to learn! If you didn't know them already. =)
You have a logical error in your code:
The result of word = len(string.split()) is never used, and there is also an issue in the for loop logic.
Try this:
def to_jaden_case(string):
    appended_string = ''
    word_list = string.split()
    for i in range(len(word_list)):
        new_word = word_list[i].capitalize()
        appended_string += str(new_word) + " "
    return appended_string
from re import findall
def capitalize_words(string):
    words = findall(r'\w+[\']*\w+', string)
    for word in words:
        string = string.replace(word, word.capitalize())
    return string
This just grabs all the words in the string, then replaces them in the original string. The characters inside the [ ] will be included in the word as well.
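One thing to watch with the findall/replace approach is that str.replace changes every occurrence of the matched text, including occurrences inside longer words, and that the pattern needs at least two characters to match. As a hedged alternative, the sketch below uses re.sub with a replacement function so that each match is capitalized in place. The pattern and the WORD name are assumptions chosen for illustration; it only treats ASCII letters and apostrophes as word characters.

```python
import re

# A word is a run of letters, optionally followed by apostrophe + letters
# (e.g. "aren't"). This pattern is an illustrative assumption.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def capitalize_words(text):
    # The function passed to sub() receives each match object and
    # returns the replacement for exactly that match.
    return WORD.sub(lambda match: match.group(0).capitalize(), text)

print(capitalize_words("they're here, aren't they?"))
```

Because each match is rewritten where it was found, punctuation and spacing in the original string are left untouched.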
You are using a string as an index into another string: word is a string, and you access it with string[word], which causes the error.
def to_jaden_case(string):
    appended_string = ''
    for word in string.split():
        new_word = word.capitalize()
        appended_string += new_word
    return appended_string
Simple solution using map()
def to_jaden_case(string):
    return ' '.join(map(str.capitalize, string.split()))
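To see why splitting on whitespace avoids the apostrophe problem mentioned in the question, here is a small comparison of .title() with a per-word capitalize. The sample sentence is made up for illustration. Note that str.capitalize also lowercases the rest of each word, which may or may not be what you want.

```python
def to_jaden_case(text):
    # Capitalize each whitespace-separated word, then rejoin with single spaces.
    return " ".join(word.capitalize() for word in text.split())

sample = "how can mirrors be real if our eyes aren't real"
print(sample.title())         # ... Aren'T Real  (letter after ' is capitalized)
print(to_jaden_case(sample))  # ... Aren't Real
```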
In for word in string: word will iterate over the characters in string. What you want to do is something like this:
def to_jaden_case(string):
    appended_string = ''
    splitted_string = string.split()
    for word in splitted_string:
        new_word = word.capitalize()
        appended_string += new_word
    return appended_string
The output for to_jaden_case("abc def ghi") is now "AbcDefGhi", which is CamelCase. I suppose you actually want this: "Abc Def Ghi". To achieve that, you must do:
def to_jaden_case(string):
    appended_string = ''
    splitted_string = string.split()
    for word in splitted_string:
        new_word = word.capitalize()
        appended_string += new_word + " "
    return appended_string[:-1] # removes the last space.
Look, in your code word is a character of string, not an index, so you can't use string[word]. You can correct this by modifying your loop or by using word instead of string[word].
So your rectified code will be:
def to_jaden_case(string):
    appended_string = ''
    for word in range(len(string)):
        new_word = string[word].capitalize()
        appended_string +=str(new_word)
    return appended_string
Here I changed the third line from for word in string to for word in range(len(string)), which gives you the index of each character so you can use it. Note that this loop capitalizes each character individually rather than each word.
I also removed the split line, because its result was never used.

String Index Out of Range Issue - Python

I am trying to make a lossy text compression program that removes all vowels from the input, except for if the vowel is the first letter of a word. I keep getting this "string index out of range" error on line 6. Please help!
text = str(input('Message: '))
text = (' ' + text)
for i in range(0, len(text)):
    i = i + 1
    if str(text[i-1]) != ' ': #LINE 6
        text = text.replace('a', '')
        text = text.replace('e', '')
        text = text.replace('i', '')
        text = text.replace('o', '')
        text = text.replace('u', '')
print(text)
As busybear notes, the loop isn't necessary: your replacements don't depend on i.
Here's how I'd do it:
def strip_vowels(s):  # Remove all vowels from a string
    for v in 'aeiou':
        s = s.replace(v, '')
    return s
def compress_word(s):
    if not s:
        return ''  # Needed to avoid an out-of-range error on the empty string
    return s[0] + strip_vowels(s[1:])  # Strip vowels from all but the first letter
def compress_text(s):  # Apply to each word
    words = s.split(' ')
    new_words = [compress_word(w) for w in words]
    return ' '.join(new_words)
When you replace letters with a blank, your word gets shorter. So what was originally len(text) is going to be out of bounds if you remove any letters. Do note however, replace is replacing all occurrences within your string, so a loop isn't even necessary.
An alternative to use the loop is to just keep track of the index of letters to replace while going through the loop, then replace after the loop is complete.
Shortening your string length by replacing any char with "" means that if you remove a character, len(text) used in your iterator is longer than the actual string length. There are plenty of alternative solutions. for example,
text_list = list(text)
for i in range(1, len(text_list)):
    if text_list[i] in "aeiou":
        text_list[i] = ""
text = "".join(text_list)
By turning your string into a list of its composite characters, you can remove characters but maintain the list length (since empty elements are allowed) then rejoin them.
Be sure to account for special cases, such as len(text)<2.
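The list-of-characters idea can be extended to keep a vowel whenever it starts a word, which is what the question asks for. The following is only a sketch: the compress name and the rule that a word starts at index 0 or right after whitespace are assumptions made for illustration.

```python
def compress(text):
    chars = list(text)
    for i, ch in enumerate(chars):
        # A character starts a word if it is first in the text
        # or directly follows whitespace in the ORIGINAL string.
        starts_word = i == 0 or text[i - 1].isspace()
        if ch.lower() in "aeiou" and not starts_word:
            chars[i] = ""  # blank it out; the list length never changes
    return "".join(chars)

print(compress("an apple is on our table"))  # an appl is on or tbl
```

Because the loop indexes the original string and only blanks out list entries, the indices stay valid and the empty-string case needs no special handling.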

Split a string, loop through it character by character, and replace specific ones?

I'm working on an assignment and have gotten stuck on a particular task. I need to write two functions that do similar things. The first needs to correct capitalization at the beginning of a sentence, and count when this is done. I've tried the below code:
def fix_capitalization(usrStr):
    count = 0
    fixStr = usrStr.split('.')
    for sentence in fixStr:
        if sentence[0].islower():
            sentence[0].upper()
            count += 1
    print('Number of letters capitalized: %d' % count)
    print('Edited text: %s' % fixStr)
But I receive an "Index out of range" error and am not sure why. Shouldn't sentence[0] simply reference the first character in that particular string in the list?
I also need to replace certain characters with others, as shown below:
def replace_punctuation(usrStr):
    s = list(usrStr)
    exclamationCount = 0
    semicolonCount = 0
    for sentence in s:
        for i in sentence:
            if i == '!':
                sentence[i] = '.'
                exclamationCount += 1
            if i == ';':
                sentence[i] = ','
                semicolonCount += 1
    newStr = ''.join(s)
    print(newStr)
    print(semicolonCount)
    print(exclamationCount)
But I'm struggling to figure out how to actually do the replacing once the character is found. Where am I going wrong here?
Thank you in advance for any help!
I would use str.capitalize over str.upper on one character. It also works correctly on empty strings. The other major improvement would be to use enumerate to also track the index as you iterate over the list:
def fix_capitalization(s):
    sentences = [sentence.strip() for sentence in s.split('.')]
    count = 0
    for index, sentence in enumerate(sentences):
        capitalized = sentence.capitalize()
        if capitalized != sentence:
            count += 1
            sentences[index] = capitalized
    result = '. '.join(sentences)
    return result, count
You can take a similar approach to replacing punctuation:
replacements = {'!': '.', ';': ','}
def replace_punctuation(s):
    l = list(s)
    counts = dict.fromkeys(replacements, 0)
    for index, item in enumerate(l):
        if item in replacements:
            l[index] = replacements[item]
            counts[item] += 1
    print("Replacement counts:")
    for k, v in counts.items():
        print("{} {:>5}".format(k, v))
    return ''.join(l)
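One caveat with str.capitalize is that it also lowercases the rest of the string, so "hello Bob" becomes "Hello bob". If that matters, a variant that only touches the first letter of each sentence might look like the sketch below. The function name is reused from the question, and the decision to preserve the original spacing after each period is an assumption.

```python
def fix_capitalization(text):
    sentences = text.split('.')
    count = 0
    for index, sentence in enumerate(sentences):
        stripped = sentence.lstrip()
        # Empty pieces (e.g. after a trailing period) are skipped safely.
        if stripped and stripped[0].islower():
            # Keep the leading whitespace, uppercase only the first letter.
            lead = sentence[:len(sentence) - len(stripped)]
            sentences[index] = lead + stripped[0].upper() + stripped[1:]
            count += 1
    return '.'.join(sentences), count

print(fix_capitalization("hello Bob. how are you. Fine."))
# ('Hello Bob. How are you. Fine.', 2)
```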
There are better ways to do these things but I'll try to change your code minimally so you will learn something.
The first function's issue is that when you split a sentence like "Hello." there will be two elements in your fixStr list, the last of which is an empty string, so the first index of that empty string is out of range. Fix it by doing this:
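def fix_capitalization(usrStr):
    count = 0
    fixStr = usrStr.split('.')
    for i, sentence in enumerate(fixStr):
        # changed line: skip empty strings so that sentence[0] always exists
        if sentence != "" and sentence[0].islower():
            # strings are immutable, so store the capitalized copy back in the list
            fixStr[i] = sentence[0].upper() + sentence[1:]
            count += 1
    print('Number of letters capitalized: %d' % count)
    print('Edited text: %s' % fixStr)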
In the second snippet, when you pass a string to list() you get a list of the characters of that string. So all you need to do is iterate over the elements of the list, replace them, and then build the new string from the result.
def replace_punctuation(usrStr):
    newStr = ""
    s = list(usrStr)
    exclamationCount = 0
    semicolonCount = 0
    for c in s:
        if c == '!':
            c = '.'
            exclamationCount += 1
        if c == ';':
            c = ','
            semicolonCount += 1
        newStr = newStr + c
    print(newStr)
    print(semicolonCount)
    print(exclamationCount)
Hope I helped!
Python has a nice built-in method for this:
# s is a single string, strings is a list of strings
for s in strings:
    new_s = s.replace('!', '.').replace(';', ',')
You can write a one-liner to get a new list:
new_list = [s.replace('!', '.').replace(';', ',') for s in strings]
You could also go for the split/join method:
new_s = '.'.join(s.split('!'))
new_s = ','.join(new_s.split(';'))
To count capitalized letters you could do:
result = len([cap for cap in s if cap.isupper()])
And to capitalize the words, just use:
s.capitalize()
Hope this works out for you
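If the replacement counts are needed as well, one possible sketch is to count the characters with str.count first and then do all the replacements in a single pass with str.translate. Returning the counts as a dict instead of printing them is an assumption made here to keep the function easy to test.

```python
def replace_punctuation(text):
    # Count before replacing, since the characters disappear afterwards.
    counts = {"!": text.count("!"), ";": text.count(";")}
    # str.maketrans builds a mapping from each character to its replacement.
    table = str.maketrans({"!": ".", ";": ","})
    return text.translate(table), counts

new_text, counts = replace_punctuation("Hi! Wait; what!")
print(new_text)  # Hi. Wait, what.
print(counts)    # {'!': 2, ';': 1}
```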

Python strings :Punctuation fix please

The program correctly identifies the words regardless of punctuation. I am having trouble integrating this into spam_indicator(text).
def spam_indicator(text):
    text=text.split()
    w=0
    s=0
    words=[]
    for char in string.punctuation:
        text = text.replace(char, '')
    return word
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w=w+1
        if word.lower() in SPAM_WORDS:
            s=s+1
    return float("{:.2f}".format(s/w))
The second block is wrong. I am trying to remove punctuations to run the function.
Try removing the punctuation first, then split the text into words.
def spam_indicator(text):
    for char in string.punctuation:
        text = text.replace(char, ' ') # N.B. replace with ' ', not ''
    text = text.split()
    w = 0
    s = 0
    words = []
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w=w+1
        if word.lower() in SPAM_WORDS:
            s=s+1
    return float("{:.2f}".format(s/w))
There are many improvements that could be made to your code.
Use a set for words rather than a list. Since a set can not contain duplicates you don't need to check whether you've already seen the word before adding it to the set.
Use str.translate() to remove the punctuation. You want to replace punctuation with whitespace so that the split() will split the text into words.
Use round() instead of converting to a string then to a float.
Here is an example:
import string
def spam_indicator(text):
    trans_table = {ord(c): ' ' for c in string.punctuation}
    text = text.translate(trans_table).lower()
    text = text.split()
    word_count = 0
    spam_count = 0
    words = set()
    for word in text:
        if word not in SPAM_WORDS:
            words.add(word)
            word_count += 1
        else:
            spam_count += 1
    return round(spam_count / word_count, 2)
You need to take care not to divide by 0 if there are no non-spam words. Anyway, I'm not sure what you want as the spam indicator value. Perhaps it should be the number of spam words divided by the total number of words (both spam and non-spam) to make it a value between 0 and 1?
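As a sketch of that last suggestion, the version below divides the number of spam words by the total number of words and returns 0.0 for text with no words at all. The contents of SPAM_WORDS and the choice to count every occurrence (rather than unique words) are assumptions, since the question does not define either.

```python
import string

# Hypothetical spam vocabulary; the real SPAM_WORDS is not shown in the question.
SPAM_WORDS = {"free", "winner", "prize"}

def spam_indicator(text, spam_words=SPAM_WORDS):
    # Replace punctuation with spaces so "prize,winner" still splits in two.
    table = str.maketrans({c: " " for c in string.punctuation})
    words = text.translate(table).lower().split()
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    spam_count = sum(1 for word in words if word in spam_words)
    return round(spam_count / len(words), 2)

print(spam_indicator("Free prize!!! Claim your FREE prize, winner."))  # 0.71
```

With this definition the result always lies between 0 and 1.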

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a small anonymous function whose body is a single expression: it cannot contain statements such as assignments, it just returns the value of that expression.
So get_alphabetical_characters is a function that, as its suggestive name indicates, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
And the some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is step through every character in the given word and put it into the list if it's alphabetical; if it isn't, it puts a blank string in its place.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluated by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimals, we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos].isspace() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char for pos, char in enumerate(fname) if include_char(pos, fname)).split()
What we're doing here is that the include_char function checks whether the character at a given position is alphanumeric (i.e. a letter or a digit), is whitespace (so that words stay separated), or is a period whose preceding character is a digit. We keep only the characters that pass this check, join them into a single string, and then separate that into a list of strings using the str.split method.
This program may help you (Python 2):
# I created a list of characters that I don't want
# to be considered part of words!
char2remove = (".",",",";","!","?","*",":")
# Receive a string from the user.
string = raw_input("Enter your string: ")
# Make all the letters lower-case
string = string.lower()
# Replace the special characters with white-space.
for char in char2remove:
    string = string.replace(char," ")
# Extract all the words in the new string (has repeats)
words = string.split(" ")
# Create a dictionary to remove repeats
to_count = dict()
for word in words:
    to_count[word]=0
# Count the word repeats.
for word in to_count:
    # skip the empty strings left over from consecutive spaces
    if word.isalpha():
        # count whole words in the list; string.count(word) would also match substrings
        print word, words.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter characters (which gets rid of the char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
# sub() returns a new string, so the result must be assigned back
your_str = regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word]=0
for word in to_count:
    if word.isalpha():
        # count whole words rather than substrings of the full string
        print word, words.count(word)
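For Python 3, the same idea can be sketched with re.findall and collections.Counter, which handles the counting and the sorting by frequency. The count_words name and the choice to treat letters, digits, and apostrophes as word characters are assumptions for illustration.

```python
import re
from collections import Counter

def count_words(text):
    # Lowercase first so "Hello" and "hello" share one count,
    # then keep only runs of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)

counts = count_words("Hello, I am Bob. Hello to Bob *")
# most_common() yields (word, count) pairs from most to least frequent.
for word, count in counts.most_common():
    print(count, word)
```

For the example line from the question this gives a count of 2 for hello and bob and 1 for i, am, and to; the lone * is dropped because it contains no word characters.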
