Removing punctuation in lists in Python - python

I am creating a Python program that converts a string to a list, uses a loop to remove any punctuation, then converts the list back into a string and prints the sentence without punctuation.
punctuation=['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
str=input("Type in a line of text: ")
alist=[]
alist.extend(str)
print(alist)
#Use loop to remove any punctuation (that appears on the punctuation list) from the list
print(''.join(alist))
This is what I have so far. I tried using something like: alist.remove(punctuation) but I get an error saying something like list.remove(x): x not in list. I didn't read the question properly at first and realized that I needed to do this by using a loop so I added that in as a comment and now I'm stuck. I was, however, successful in converting it from a list back into a string.

import string
punct = set(string.punctuation)
''.join(x for x in 'a man, a plan, a canal' if x not in punct)
Out[7]: 'a man a plan a canal'
Explanation: string.punctuation is pre-defined as:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
The rest is a straightforward comprehension. A set is used to speed up the filtering step.
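Since the exercise asks for an explicit loop over a list, the same filter can be sketched that way. This is a minimal sketch, assuming the punctuation list from the question; the function name remove_punctuation and the sample sentence are made up for illustration. It also hints at why alist.remove(punctuation) fails: remove looks for a single element equal to the whole punctuation list, which a list of single characters never contains.

```python
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

def remove_punctuation(text):
    """Return text without the characters listed in punctuation."""
    chars = list(text)          # convert the string to a list of characters
    kept = []
    for ch in chars:            # explicit loop, as the exercise requires
        if ch not in punctuation:
            kept.append(ch)
    return ''.join(kept)        # convert the list back into a string

cleaned = remove_punctuation("Hello, world! (It's me.)")
print(cleaned)  # Hello world Its me
```

Building a new list avoids modifying the list while iterating over it, which is a common source of skipped elements.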

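I found an easy way to do it:
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
text = input("Type in a line of text: ")
for mark in punctuation:
    # str.replace returns a new string, so reassign the result
    text = text.replace(mark, "")
print(text)
This approach avoids the list.remove error, because replace simply leaves the string unchanged when the character is absent.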

punctuation=['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
result = ""
# str is the input string from the question
for character in str:
    if character not in punctuation:
        result += character
print(result)

Here is how to tokenize the given text using Python. The Python version I used is 3.4.4.
Assume the text is saved as one.txt and the Python program is saved in the same directory as that file. The following is my Python program:
with open('one.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1)  # print the text with punctuation (before removal)
# The following is the list of punctuation marks to remove; add more if any are missing
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i, " ")  # put a space where the punctuation was
myList = []
myList.extend(str1.split(" "))
print(str1)  # print the text without punctuation (after removal)
for i in myList:
    # print ("____________")
    print(i, end='\n')
print("____________")
Next I will post how to remove stop words.
Until then, let me know in the comments if this is useful.
Thank you

Related

remove all the special chars from a list [duplicate]

This question already has answers here:
Removing punctuation from a list in python
(2 answers)
Closed last year.
I have a list of strings in which some elements are special characters. What would be the approach to exclude them from the resulting list?
list = ['ben','kenny',',','=','Sean',100,'tag242']
expected output = ['ben','kenny','Sean',100,'tag242']
Please guide me on how to achieve this. Thanks
The string module has a list of punctuation marks that you can use and exclude from your list of words:
import string
punctuations = list(string.punctuation)
input_list = ['ben','kenny',',','=','Sean',100,'tag242']
output = [x for x in input_list if x not in punctuations]
print(output)
Output:
['ben', 'kenny', 'Sean', 100, 'tag242']
This list of punctuation marks includes the following characters:
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
This can be done simply with the isalnum() string method. isalnum() returns True if the string is non-empty and contains only letters or digits; if it contains any other character, it returns False. (No import is needed: isalnum() is a built-in string method.)
code:
items = ['ben','kenny',',','=','Sean',100,'tag242']
olist = []
for a in items:
    # str() lets the check work on non-string items such as 100
    if str(a).isalnum():
        olist.append(a)
print(olist)
output:
['ben', 'kenny', 'Sean', 100, 'tag242']
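One caveat with this approach, sketched below with made-up sample values: isalnum() is False for any string that contains a space, hyphen, or underscore, and also for the empty string, so legitimate entries such as 'mary-jane' are dropped along with the pure punctuation entries.

```python
# Sample values invented for illustration
samples = ['ben', 'tag242', ',', '=', 'mary-jane', 'two words', '', 100]

# str() lets the check run on non-string items such as the integer 100
kept = [item for item in samples if str(item).isalnum()]
dropped = [item for item in samples if not str(item).isalnum()]

print(kept)     # ['ben', 'tag242', 100]
print(dropped)  # [',', '=', 'mary-jane', 'two words', '']
```

If such entries must survive, comparing against an explicit list of punctuation marks is the safer filter.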
my_list = ['ben', 'kenny', ',' ,'=' ,'Sean', 100, 'tag242']
stop_words = [',', '=']
filtered_output = [i for i in my_list if i not in stop_words]
The list with stop words can be expanded if you need to remove other characters.

Using a List as 1 argument in replace()

Trying to solve.
I have a string from user input, and I want to remove all the special characters in this list:
[',', '.', '"', '\'', ':',]
Using the replace function I'm able to remove them one by one, with something like:
string = "a,bhc:kalaej jff!"
string.replace(",", "")
but I want to remove all the special characters in one go. I have tried:
unwanted_specialchr = [',', '.', '"', '\'', ':',]
string = "a,bhc:kalaej jff!"
string.replace(unwanted_specialchr, "")
I figured it out:
def remove_specialchr(string):
    unwanted_specialchr = [',', '.', '"', '\'', ':',]
    for chr in string:
        if chr in unwanted_specialchr:
            string = string.replace(chr, '')
    return string
you can use re.sub:
import re
unwanted_specialchr = [',', '.', '"', '\'', ':',]
string = "a,bhc:kalaej jff!"
re.sub(f'[{"".join(unwanted_specialchr)}]', '', string)
output:
'abhckalaej jff!'
or you could use:
''.join(c for c in string if c not in unwanted_specialchr)
output:
'abhckalaej jff!'
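The character class built with join works for this particular list, but characters such as ], ^, - or \ have a special meaning inside [...]. A minimal sketch, assuming the list might later contain such characters (the extra entries and the sample string are invented for illustration), escapes each one with re.escape first:

```python
import re

# Hypothetical list that includes characters special inside a character class
unwanted = [',', '.', '"', "'", ':', ']', '^', '-', '\\']

# re.escape makes every character match literally inside the [...] class
pattern = re.compile('[' + ''.join(re.escape(c) for c in unwanted) + ']')

result = pattern.sub('', 'a,b]c^d-e\\f: "g"')
print(result)  # abcdef g
```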
I think your solution could be improved with a small optimization: use a set for the lookup and build the result in a single pass:
def remove_specialchr(string):
    specialChr = {',', '.', '"', '\'', ':'}
    stringS = ''
    for chr in string:
        if chr not in specialChr:
            stringS += chr
    return stringS
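Another way to remove several characters in one pass, not mentioned in the answers above, is str.translate with a table built by str.maketrans (Python 3). A minimal sketch using the list and sample string from the question:

```python
unwanted_specialchr = [',', '.', '"', '\'', ':']

# The third argument of str.maketrans lists the characters to delete
table = str.maketrans('', '', ''.join(unwanted_specialchr))

string = "a,bhc:kalaej jff!"
cleaned = string.translate(table)
print(cleaned)  # abhckalaej jff!
```

The table can be built once and reused for many strings, which keeps the per-string work to a single call.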

Python, Split the input string on elements of other list and remove digits from it

I have had some trouble with this problem, and I need your help.
I have to write a Python function, mySplit(x), which takes an input list (containing a single string as its only element) and splits that element on the characters of another list and on digits.
I use Python 3.6
So here is an example:
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
banned=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ' ', ';']
The returned lists should be like this:
mySplit(l)=['I', 'am', 'learning']
mySplit(l1)=['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I have tried the following, but I always get stuck:
def mySplit(x):
    l=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] #Banned chars
    l2=[i for i in x if i not in l] #Removing chars from input list
    l2=",".join(l2)
    l3=[i for i in l2 if not i.isdigit()] #Removes all the digits
    l4=[i for i in l3 if i is not ',']
    l5=[",".join(l4)]
    l6=l5[0].split(' ')
    return l6
and
mySplit(l1)
mySplit(l)
returns:
['T,h,i,s,e,x,a,m,p,l,e,a,i,n,t,e,a,s,y']
['I,', ',a,m,', ',l,e,a,r,n,i,n,g']
Use re.split() for this task:
import re
w_list = [i for i in re.split(r'[^a-zA-Z]',
'____-----This4ex5ample---aint___ea5sy;782') if i ]
Out[12]: ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I would import the punctuation marks from string and proceed with regular expressions as follows.
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
import re
from string import punctuation
punctuation # to see the punctuation marks.
>>> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
' '.join([re.sub('[' + re.escape(punctuation) + r'\d]', ' ', w) for w in l]).split()
Here is the output:
>>> ['I', 'am', 'learning']
Notice the \d attached at the end of the punctuation marks to remove any digits.
Similarly,
' '.join([re.sub('[' + re.escape(punctuation) + r'\d]', ' ', w) for w in l1]).split()
Yields
>>> ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
You can also modify your function as follows:
def mySplit(x):
    banned = ['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] + list('0123456789') # Banned chars
    # Replace each banned character with a space, then split on whitespace
    return ''.join([word if word not in banned else ' ' for word in list(x[0])]).split()
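The ideas in the answers above can be combined by building the split pattern from the banned list instead of hard-coding it. A minimal sketch, assuming the banned list from the question (the helper name my_split is made up here):

```python
import re

banned = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ' ', ';']

# One or more banned characters or digits act as a single separator
separator = re.compile('[' + ''.join(re.escape(c) for c in banned) + r'\d]+')

def my_split(x):
    """Split the only string in list x on banned characters and digits."""
    return [part for part in separator.split(x[0]) if part]

print(my_split(['I am learning']))
# ['I', 'am', 'learning']
print(my_split(['____-----This4ex5ample---aint___ea5sy;782']))
# ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
```

The final filter drops the empty strings that re.split produces when the input starts or ends with a separator.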

python remove weird apostrophe and other weird characters not in string.punctuation [duplicate]

This question already has answers here:
Remove punctuation from Unicode formatted strings
(4 answers)
Closed 6 years ago.
This is my string:
mystring = "How’s it going?"
This is what I did:
import string
exclude = set(string.punctuation)
def strip_punctuations(mystring):
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    new_string = new_string.replace("\xe2\x80\x99", "")
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string
OUTPUT:
If I did not include the line new_string = new_string.replace("\xe2\x80\x99","") this would be the output:
'How\xe2\x80\x99s it going'
I realized that exclude does not contain that curly apostrophe:
print set(exclude)
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])
How do I ensure all such characters are removed, instead of replacing each one manually in the future?
If you are working with long texts such as news articles or scraped web pages, you can use the "goose" or "NLTK" Python libraries. Neither is pre-installed.
You can go through their documentation to learn how to use them.
OR
If you don't want to use these libraries, you can build your own "exclude" list manually.
import re
toReplace = "how's it going?"
# Raw string; the hyphen, brackets and backslash are escaped inside the class
regex = re.compile(r'''[!#%$"&)'(+*,\-/.;:=<?>@\[\]_^`{}|~\\]''')
newVal = regex.sub('', toReplace)
print(newVal)
The regex matches every character listed in the set and replaces each one with an empty string.
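To catch characters such as the curly apostrophe without listing them by hand, one option on Python 3 is to test each character's Unicode category: every punctuation category starts with "P". This is a minimal sketch, and the function name is made up. Note that symbols such as $ or + belong to "S" categories and are kept by this filter, and the no-break space \xa0 is a separator, not punctuation, so it is kept as well.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode category is a punctuation one (P*)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

# \u2019 is the curly apostrophe from the question
print(strip_unicode_punctuation("How\u2019s it going?"))  # Hows it going
```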

strip punctuation with regex - python

I need to use regex to strip punctuation at the start and end of a word. It seems like regex would be the best option for this. I don't want punctuation removed from words like 'you're', which is why I'm not using .replace().
You don't need a regular expression for this task. Use str.strip with string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'
>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"
I think this function is a concise way to remove punctuation from each word in a list:
import re
def remove_punct(text):
    new_words = []
    for word in text:
        w = re.sub(r'[^\w\s]', '', word)  # remove everything except word characters and whitespace
        w = re.sub(r'_', '', w)  # \w also matches the underscore, so remove it separately
        new_words.append(w)
    return new_words
If you insist on using a regex, I recommend this solution:
import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me
Note also that you can pass your own punctuation marks:
my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
            '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_',
            '`', '{', '|', '}', '~', '»', '«', '“', '”']
punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19
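Both patterns above also remove apostrophes inside words (he's becomes hes). To strip punctuation only at the start and end of each word, as the question asks, the pattern can be anchored at both ends. A minimal sketch (the names edge_punct and strip_edges are made up), built from string.punctuation:

```python
import re
import string

# Match runs of punctuation only at the start (^) or end ($) of the word
edge_punct = re.compile(
    r'^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
)

def strip_edges(word):
    """Remove leading and trailing punctuation, leaving the interior intact."""
    return edge_punct.sub('', word)

sentence = "Hello, world. I'm a boy, you're a girl."
stripped = ' '.join(strip_edges(w) for w in sentence.split())
print(stripped)  # Hello world I'm a boy you're a girl
```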
You can remove standalone punctuation tokens from a text file or from a particular string as follows:
new_data = []
with open('/home/rahul/align.txt', 'r') as f:
    f1 = f.read()
    f2 = f1.split()
    all_words = f2
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    # You can add and remove punctuation marks as you like
    # To process a string instead of a file, split it into all_words:
    # new_str = "My name$## is . rahul -~"
    for word in all_words:
        # Drops tokens that are pure punctuation (substrings of the string above);
        # a word with attached punctuation, such as "rahul,", is kept unchanged
        if word not in punctuations:
            new_data.append(word)
    print(new_data)
Hope this helps!!
