string.punctuation not working as expected in Python

I was trying to create a program that removes all sorts of punctuation from a given input sentence. The code looked somewhat like this:
from string import punctuation
sent = str(input())
def rempunc(string):
    for i in string:
        word =''
        list = [0]
        if i in punctuation:
            x = string.index(i)
            word += string[list[-1]:x]+' '
            list.append(x)
    list_2 = word.split(' ')
    return list_2
print(rempunc(sent))
However the output is coming out as follows:
This state ment has # 1 ! punc.
['This', 'state', 'ment', 'has', '#', '1', '!', 'punc', '']
Why isn't the punctuation being removed entirely? Am I missing something in the code?
I tried changing x with x-1 in line 7 but it did not help. Now I'm stuck and don't know what else to try.

Repeated string slicing isn't necessary here.
I would suggest using filter() to filter out the undesired characters for each word, and then reading that result into a list comprehension. From there, you can use a second filter() operation to remove the empty strings:
from string import punctuation
def remove_punctuation(s):
    cleaned_words = [''.join(filter(lambda x: x not in punctuation, word))
                     for word in s.split()]
    return list(filter(lambda x: x != "", cleaned_words))
print(remove_punctuation(input()))
This outputs:
['This', 'state', 'ment', 'has', '1', 'punc']
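As an alternative to the two filter() passes, the same result can probably be reached with str.translate. This is only a sketch: the function name remove_punctuation_translate and the module-level table _STRIP_PUNCT are made up for illustration, and it assumes that only the ASCII characters in string.punctuation need to go.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(s):
    # Delete the punctuation first, then split on whitespace.
    # split() with no arguments already discards empty strings.
    return s.translate(_STRIP_PUNCT).split()

print(remove_punctuation_translate("This state ment has # 1 ! punc."))
# ['This', 'state', 'ment', 'has', '1', 'punc']
```

Because split() is called without arguments, tokens that consisted only of punctuation (such as # and !) leave no empty strings behind.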

Related

Replacing character in string doesn't do anything

I have a list like this,
['Therefore', 'allowance' ,'(#)', 't(o)o', 'perfectly', 'gentleman', '(##)' ,'su(p)posing', 'man', 'his', 'now']
Expected output:
['Therefore', 'allowance' ,'(#)', 'too', 'perfectly', 'gentleman', '(##)' ,'supposing', 'man', 'his', 'now']
Removing the brackets is easy by using .replace(), but I don't want to remove the brackets from strings (#) and (##).
my code:
ch = "()"
for w in li:
    if w in ["(#)", "(##)"]:
        print(w)
    else:
        for c in ch:
            w.replace(c, "")
        print(w)
but this doesn't remove the brackets from the words.
You can use re.sub. In particular, note that it can take a function as the repl parameter. The function takes a match object and returns the desired replacement based on the information the match object holds (e.g., m.group(1)).
import re
lst = ['Therefore', 'allowance', '(#)', 't(o)o', 'perfectly', 'gentleman', '(##)', 'su(p)posing', 'man', 'his', 'now']
def remove_paren(m):
    return m.group(0) if m.group(1) in ('#', '##') else m.group(1)
output = [re.sub(r"\((.*?)\)", remove_paren, word) for word in lst]
print(output) # ['Therefore', 'allowance', '(#)', 'too', 'perfectly', 'gentleman', '(##)', 'supposing', 'man', 'his', 'now']
def removeparanthesis(s):
    a=''
    for i in s:
        if i not in '()':
            a+=i
    return a
a = ['Therefore', 'allowance' , '(#)' , 't(o)o' , 'perfectly' , 'gentleman' , '(##)' , 'su(p)posing', 'man', 'his', 'now']
b=[]
for i in a:
    if i == '(#)' or i == '(##)':
        b.append(i)
    else:
        b.append(removeparanthesis(i))
print(b)
# I just created a function that removes the parentheses from every word except '(#)' and '(##)'
Give this a try!
Here I define a second, empty list and loop over the original list, appending each word to the new list after cleaning it.
There are two loops. The inner one goes through each character of a word and skips any ( or ), appending everything else to the new word.
To keep (#) and (##), we skip the inner loop for them, but still append them to the new list unchanged.
li = ["Therefore", "allowance", "(#)", "t(o)o" , "perfectly", "gentleman", "(##)", "su(p)posing", "man", "his", "now"]
new_li = []
for index, w in enumerate(li):
    if w in ["(#)", "(##)"]:
        new_li.append(w)
        continue
    new_word = ""
    for c in w:
        if c == "(" or c == ")":
            continue
        new_word = new_word + c
    new_li.append(new_word)
print(new_li)

Remove a specifc repeated word using python regex? [duplicate]

I have a string like :
'hi', 'what', 'are', 'are', 'what', 'hi'
I want to remove a specific repeated word. For example:
'hi', 'what', 'are', 'are', 'what'
Here, I am just removing the repeated word of hi, and keeping rest of the repeated words.
How to do this using regex?
Regex is used for text search. You have structured data, so this is unnecessary.
def remove_all_but_first(iterable, removeword='hi'):
    remove = False
    for word in iterable:
        if word == removeword:
            if remove:
                continue
            else:
                remove = True
        yield word
Note that this will return an iterator, not a list. Cast the result to list if you need it to remain a list.
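A short usage sketch of that generator, repeated here in slightly condensed form so the snippet runs on its own. The flag is renamed to seen for readability, and the variable names words and result are just for illustration.

```python
def remove_all_but_first(iterable, removeword='hi'):
    seen = False  # becomes True once the first occurrence has been yielded
    for word in iterable:
        if word == removeword:
            if seen:
                continue  # drop every later occurrence
            seen = True
        yield word

words = ['hi', 'what', 'are', 'are', 'what', 'hi']
result = list(remove_all_but_first(words))  # materialize the iterator
print(result)
# ['hi', 'what', 'are', 'are', 'what']
```

Other repeated words such as 'are' and 'what' pass through untouched because only removeword is tracked.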
You can do this:
import re
s = "['hi', 'what', 'are', 'are', 'what', 'hi']"
# convert the string to a list: drop the first and last characters, then remove quotes and spaces
s = s[1:-1].replace("'", '').replace(' ', '').split(',')
remove = 'hi'
# store the index of the first occurrence so that we can re-insert the word after removing all occurrences
firstIndex = s.index(remove)
# regex to remove all occurrences of the word
# (note: this also strips 'hi' from inside longer words such as 'this')
regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
op = regex.sub("", '|'.join(s)).split('|')
# clean up the list by removing empty items
while "" in op:
    op.remove("")
# re-insert the removed word at the index of its first occurrence
op.insert(firstIndex, remove)
print(str(op))
You don't need regex for that. Convert the string to a list, find the index of the first occurrence of the word, and filter the word out of the slice that follows it:
import ast
lst = "['hi', 'what', 'are', 'are', 'what', 'hi']"
lst = ast.literal_eval(lst)
word = 'hi'
index = lst.index(word) + 1
lst = lst[:index] + [x for x in lst[index:] if x != word]
print(lst) # ['hi', 'what', 'are', 'are', 'what']

Return a list of words that contain a letter

I want to return a list of the words that contain a given letter, ignoring case.
Say I have sentence = "Anyone who has never made a mistake has never tried anything new", then f(sentence, 'a') would return
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
This is what i have
import re
def f(string, match):
    string_list = string.split()
    match_list = []
    for word in string_list:
        if match in word:
            match_list.append(word)
    return match_list
You don't need re. Use str.casefold:
[w for w in sentence.split() if "a" in w.casefold()]
Output:
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
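Wrapped into a function with the two-argument shape the question asks for, the comprehension might look like the sketch below. The name words_containing is made up for illustration, and the letter is case-folded as well so that passing "A" behaves the same as "a".

```python
def words_containing(sentence, letter):
    # Case-fold both sides so the comparison ignores case.
    needle = letter.casefold()
    return [w for w in sentence.split() if needle in w.casefold()]

sentence = "Anyone who has never made a mistake has never tried anything new"
print(words_containing(sentence, "A"))
# ['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
```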
You can use string splitting for it, if there is no punctuation.
match_list = [s for s in sentence.split(' ') if 'a' in s.lower()]
Here's another variation :
sentence = 'Anyone who has never made a mistake has never tried anything new'
def f(string, match):
    match_list = []
    for word in string.split():
        if match in word.lower():
            match_list.append(word)
    return match_list
print(f(sentence, 'a'))

How to remove a corresponding word in a dictionary from a string?

I have a dictionary and a text:
{"love":1, "expect":2, "annoy":-2}
test="i love you, that is annoying"
I need to remove the words from the string if they appear in the dictionary. I have tried this code:
for k in dict:
    if k in test:
        test=test.replace(k, "")
However the result is:
i you,that is ing
And this is not what I am looking for, as it should not remove "annoy" as a part of the word, the whole word should be evaluated. How can I achieve it?
First, you should not give variables names that are also the names of built-in classes, such as dict.
Variable test is a string composed of characters. When you write if k in test:, you are testing whether k is a substring of test. What you want to do is break up test into a list of words and compare k against each complete word in that list. If words are separated by a single space, then they may be "split" with:
test.split(' ')
The only complication is that this creates the following list:
['i', 'love', 'you,', 'that', 'is', 'annoying']
Note that the third item still has a , in it. So we should first get rid of the punctuation marks we might expect to find in our sentence:
test.replace('.', '').replace(',', ' ').split(' ')
Yielding:
['i', 'love', 'you', '', 'that', 'is', 'annoying']
The following will actually get rid of all punctuation:
import string
test.translate(str.maketrans('', '', string.punctuation))
So now our code becomes:
>>> import string
>>> d = {"love":1, "expect":2, "annoy":-2}
>>> test="i love you, that is annoying"
>>> for k in d:
...     if k in test.translate(str.maketrans('', '', string.punctuation)).split(' '):
...         test=test.replace(k, "")
...
>>> print(test)
i  you, that is annoying
>>>
You may now find you have extra spaces in your sentence, but you can figure out how to get rid of those.
You can use this:
query = "i love you, that is annoying"
query = query.replace('.', '').replace(',', '')
my_dict = {"love": 1, "expect": 2, "annoy": -2}
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in my_dict]
result = ' '.join(resultwords)
print(result)
# i you that is annoying
To match keys regardless of their case as well, convert all keys in my_dict to lowercase:
my_dict = {k.lower(): v for k, v in my_dict.items()}
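If the punctuation should stay in the sentence, a regular expression with word boundaries is another option. The sketch below is one possible approach, not the method of either answer: the helper name remove_whole_words and the variable scores are made up for illustration, and \b is assumed to be an acceptable definition of "whole word".

```python
import re

def remove_whole_words(text, words):
    # Build one alternation of the escaped words, anchored on word boundaries
    # so that "annoy" does not match inside "annoying".
    pattern = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b',
        flags=re.IGNORECASE,
    )
    stripped = pattern.sub('', text)
    # Collapse the runs of whitespace left behind by the removed words.
    return ' '.join(stripped.split())

scores = {"love": 1, "expect": 2, "annoy": -2}
print(remove_whole_words("i love you, that is annoying", scores))
# i you, that is annoying
```

Iterating over a dict yields its keys, so the dictionary can be passed directly. A removed word that sits right before a comma would leave a stray space in front of that comma, which this sketch does not try to clean up.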

Converting a String to a List of Words?

I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert to something like this :
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
I think this is the simplest way for anyone else stumbling on this post given the late response:
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
Try this:
import re
mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
How it works:
From the docs :
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case:
The pattern [^\w] matches any character that is not a word character.
\w matches any word character. For ASCII text this is the character set
[a-zA-Z0-9_]
(a to z, A to Z, 0 to 9 and underscore); in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.
So we match every non-word character and replace it with a space,
and then split() splits the string on whitespace and returns a list.
So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
Let me know if any doubts come up.
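A quick way to see how \w treats non-ASCII letters is to run the same substitution with and without the re.ASCII flag. This is a small sketch for Python 3; the helper name split_words and the sample text are made up for illustration.

```python
import re

def split_words(text, ascii_only=False):
    # With re.ASCII, \w is limited to [a-zA-Z0-9_]; without it, \w also
    # matches letters and digits outside the ASCII range.
    flags = re.ASCII if ascii_only else 0
    return re.sub(r'[^\w]', ' ', text, flags=flags).split()

text = 'na\u00efve caf\u00e9, hello-world!'  # "naïve café, hello-world!"
print(split_words(text))
# ['naïve', 'café', 'hello', 'world']
print(split_words(text, ascii_only=True))
# ['na', 've', 'caf', 'hello', 'world']
```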
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
The most simple way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
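Using string.punctuation for completeness:
import re
import string
s = 'This is a string, with words!'  # example input from the question
# re.escape keeps characters such as ], \ and ^ literal inside the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()
This handles newlines as well.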
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that both string and list are names of builtin types, so you probably don't want to use those as your variable names.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string
def extract_words(s):
    # escape the punctuation so that ], \ and ^ stay literal inside the character class
    punct = re.escape(string.punctuation)
    return [re.sub('^[{0}]+|[{0}]+$'.format(punct), '', w) for w in s.split()]
>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']
>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
Personally, I think this is slightly cleaner than the answers provided:
import re
def split_to_words(sentence):
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence))) #Use sentence.lower(), if needed
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
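As a rough illustration of that kind of control, the sketch below treats an apostrophe or hyphen as part of a word only when it sits between letters or digits. The pattern WORD_RE and the function find_words are made up for illustration, and the pattern is ASCII-only by assumption.

```python
import re

# One or more alphanumeric runs, optionally joined by a single ' or -.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def find_words(text):
    # The only group is non-capturing, so findall returns whole matches.
    return WORD_RE.findall(text)

print(find_words('''I'm a custom-built sentence with "tricky" words!'''))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words']
```

Quotes, leading or trailing apostrophes, and runs of dashes fall outside the pattern, so they are dropped rather than attached to a word.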
list=mystr.split(" ",mystr.count(" "))
This way you eliminate every special char outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL
s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L) # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
def split_string(string):
    return string.split()
This function will return the list of words of a given string.
In this case, if we call the function as follows,
string = 'This is a string, with words!'
split_string(string)
The return output of the function would be
['This', 'is', 'a', 'string,', 'with', 'words!']
This is from my attempt on a coding challenge that can't use regex,
outputList = "".join((c if c.isalnum() or c=="'" else ' ') for c in inputStr ).split(' ')
The role of apostrophe seems interesting.
Probably not very elegant, but at least you know what's going on.
my_str = "Simple sample, test! is, olny".lower()
my_lst = []
temp = ""
len_my_str = len(my_str)
number_letter_in_data = 0
list_words_number = 0
for number_letter_in_data in range(0, len_my_str, 1):
    if my_str[number_letter_in_data] in [',', '.', '!', '(', ')', ':', ';', '-']:
        # skip punctuation
        pass
    else:
        if my_str[number_letter_in_data] in [' ']:
            # if you want longer than 3 char words
            if len(temp) > 3:
                list_words_number += 1
                my_lst.append(temp)
            # reset at every space so that short words are not glued to the next word
            temp = ""
        else:
            temp = temp + my_str[number_letter_in_data]
# append whatever is left after the last space
my_lst.append(temp)
print(my_lst)
You can try and do this:
# Python 3: maketrans is a static method of str, and both arguments must have the same length
tryTrans = str.maketrans(",!", "  ")
text = "This is a string, with words!"
text = text.translate(tryTrans)
listOfWords = text.split()
