I have a numpy.ndarray of strings. I have created a list of characters, and I would like to remove every character in that list from each string in the array, then put the symbol-free strings in a new array. How can I do this?
Input:
symbols = string.printable[62:]
symbolsList = list(symbols)
symbolsList
Output:
['!',
'"',
'#',
'$',
'%',
'&',
"'",
'(',
')',
'*',
'+',
',',
'-',
'.',
'/',
':',
';',
'<',
'=',
'>',
'?',
'@',
'[',
'\\',
']',
'^',
'_',
'`',
'{',
'|',
'}',
'~',
' ',
'\t',
'\n',
'\r',
'\x0b',
'\x0c']
A sample of string_array:
array(['[KFC] CHicken_Gravy_Coke_Biscuit This is my Order!!!<lf><lf>'], dtype=object)
I want it to look like this:
array(['KFC CHicken Gravy Coke Biscuit This is my Order lf lf'], dtype=object)
I tried:
cleanData = []
for i in string_array:
    cleanData.append(string_array[i].replace(symbolsList[i], " "))
and:
cleanData = []
for i in summary_data:
    cleanData = summary_data[i].replace(symbolsList[i], " ")
Both give the same error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Neither attempt works :( How can I make this work, or otherwise get the result I want?
This is one way.
import re, string, numpy as np
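def remove_chars_re(x):
    # Replace every punctuation or whitespace character with a space
    x = re.sub('[' + re.escape(string.printable[62:]) + ']', ' ', x)
    # Collapse runs of spaces and trim both ends
    return re.sub(' +', ' ', x).strip()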
arr = np.array(['[KFC] CHicken_Gravy_Coke_Biscuit This is my Order!!!<lf><lf>'], dtype=object)
list(map(remove_chars_re, arr))
# ['KFC CHicken Gravy Coke Biscuit This is my Order lf lf']
Explanation
The first re.sub replaces each unwanted character with a single space.
The second re.sub collapses runs of spaces into one space.
strip() removes whitespace from the start and end of the string.
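To make the three steps concrete, here is a small sketch that applies them one at a time to the sample string, using a plain Python string instead of a NumPy array. The names unwanted, step1, step2, and step3 are illustrative and not part of the answer above.

```python
import re
import string

# Punctuation plus whitespace, the same slice the question uses
unwanted = string.printable[62:]

text = '[KFC] CHicken_Gravy_Coke_Biscuit This is my Order!!!<lf><lf>'

# Step 1: replace every unwanted character with a space
step1 = re.sub('[' + re.escape(unwanted) + ']', ' ', text)

# Step 2: collapse runs of spaces into a single space
step2 = re.sub(' +', ' ', step1)

# Step 3: trim the leading and trailing space
step3 = step2.strip()

print(repr(step3))
```

Printing with repr() makes any leftover leading or trailing whitespace visible, which is handy when checking each step.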
Here is how I would go about it.
Iterate over each string in the array.
For each string, iterate over each undesired character (a nested loop).
Replace that character with str.replace(character, ' ').
Here is the code.
cleanData = []
for s in string_array:
    tmp_str = s  # start from the original string and filter one symbol at a time
    for ch in symbolsList:
        tmp_str = tmp_str.replace(ch, ' ')  # replace each undesired symbol with a space
    cleanData.append(tmp_str)
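As an alternative to the nested loop (a different technique, not what the answer above uses), str.translate can replace all the symbols in a single pass per string. This is a minimal sketch that assumes symbolsList holds single characters, as it does in the question; the sample data and the name table are illustrative.

```python
import string

symbolsList = list(string.printable[62:])
string_array = ['[KFC] CHicken_Gravy_Coke_Biscuit This is my Order!!!<lf><lf>']

# Map each unwanted character to a space
table = str.maketrans({ch: ' ' for ch in symbolsList})

cleanData = []
for s in string_array:
    replaced = s.translate(table)
    # split() with no argument drops the repeated spaces left behind
    cleanData.append(' '.join(replaced.split()))

print(cleanData)
```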
I have a list of strings, some of which are just special characters. How can I exclude those from the resulting list?
list = ['ben','kenny',',','=','Sean',100,'tag242']
expected output = ['ben','kenny','Sean',100,'tag242']
Please guide me toward an approach to achieve this. Thanks
The string module has a list of punctuation marks that you can use and exclude from your list of words:
import string
punctuations = list(string.punctuation)
input_list = ['ben','kenny',',','=','Sean',100,'tag242']
output = [x for x in input_list if x not in punctuations]
print(output)
Output:
['ben', 'kenny', 'Sean', 100, 'tag242']
This list of punctuation marks includes the following characters:
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
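One thing the membership test does not cover is an item made of several punctuation characters, such as '!!', because it only matches items equal to a single mark. Here is a minimal sketch of a stricter check, assuming you want to drop any string made up entirely of punctuation. The helper name is_punctuation_only and the extra '!!' item are hypothetical.

```python
import string

punct = set(string.punctuation)

def is_punctuation_only(item):
    # Non-strings (such as the integer 100) and empty strings are kept
    if not isinstance(item, str) or not item:
        return False
    return all(ch in punct for ch in item)

input_list = ['ben', 'kenny', ',', '=', '!!', 'Sean', 100, 'tag242']
output = [x for x in input_list if not is_punctuation_only(x)]
print(output)
```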
This can be done with the str.isalnum() method. isalnum() returns True if the string is non-empty and contains only letters and digits; if the string contains any other character, it returns False. No import is needed, because isalnum() is a built-in string method.
code:
items = ['ben','kenny',',','=','Sean',100,'tag242']
olist = []
for a in items:
    if str(a).isalnum():  # str() lets non-string items such as 100 be checked too
        olist.append(a)
print(olist)
output:
['ben', 'kenny', 'Sean', 100, 'tag242']
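A caveat worth checking before relying on isalnum(): it returns False for any string containing a space, an underscore, or a hyphen, and for the empty string, so such items would be dropped along with the punctuation. The sample values below are illustrative.

```python
samples = ['tag242', ',', 'new york', 'snake_case', '', str(100)]

# Map each sample to the result of isalnum()
results = {s: s.isalnum() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```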
my_list = ['ben', 'kenny', ',' ,'=' ,'Sean', 100, 'tag242']
stop_words = [',', '=']
filtered_output = [i for i in my_list if i not in stop_words]
The list with stop words can be expanded if you need to remove other characters.
I have a punctuation array like this:
punctuation_data = [ '=' '+' '_' '-' ')' '(' '*' '&' '^' '%'
'SSSS' 'AAAA' 'wwww' '!' '~' '،']
and I have text from which I want to remove this punctuation. I tried this, but it is not working:
list = [''.join(c for c in original_data if c not in punctuation_data)
        for s in list]
Edit: Original post did not delete longer substrings. I included a function that loops through the punctuation data and deletes the substrings.
You need to separate the items in your list with commas. Also, don't use built-in names like list as variable names.
This will work:
punctuation_data = [ '=', '+', '_', '-', ')', '(', '*', '&', '^', '%',
'SSSS', 'AAAA', 'wwww', '!', '~', '،']
orig_string = ['3+5=8']
def delete_substrings(orig_sub_string, punctuation_data):
    # Remove every entry of punctuation_data, including multi-character ones
    for element_to_delete in punctuation_data:
        orig_sub_string = orig_sub_string.replace(element_to_delete, "")
    return orig_sub_string
lst = [delete_substrings(s, punctuation_data) for s in orig_string]
print(lst)  # ['358']
Since you're trying to match a number of strings of varying lengths, it's best to use regex instead. Escape the strings with re.escape first so that they don't get interpreted as special characters in regex:
import re
punctuation_data = [ '=', '+', '_', '-', ')', '(', '*', '&', '^', '%', 'SSSS', 'AAAA', 'wwww', '!', '~', '،']
print(re.sub('|'.join(map(re.escape, punctuation_data)), '', 'abc*xyzAAAA123'))
This outputs:
abcxyz123
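One detail to watch with this alternation: the regex engine tries the alternatives from left to right, so if one entry is a prefix of another, the shorter one can win and leave part of the longer one behind. Here is a minimal sketch, using the hypothetical entries 'ab' and 'abc', that sorts the entries longest first.

```python
import re

tokens = ['ab', 'abc']  # hypothetical entries where one is a prefix of the other
text = 'xabcy'

# Listed order: 'ab' matches first, so the trailing 'c' survives
naive = re.sub('|'.join(map(re.escape, tokens)), '', text)

# Longest first: 'abc' gets the chance to match as a whole
ordered = sorted(tokens, key=len, reverse=True)
safe = re.sub('|'.join(map(re.escape, ordered)), '', text)

print(naive, safe)
```

With the punctuation_data list from the question no entry is a prefix of another, so the order does not matter there.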
This worked for me:
original_data = 'What is hello'
punctuation_data = ['=', '+', '_', '-', ')', '(', '*', '&', '^',
                    '%',
                    'SSSS', 'AAAA', 'wwww', '!', '~', '،']
original_data = original_data.split()
resultwords = [word for word in original_data if
               word.lower() not in punctuation_data]
result = ' '.join(resultwords)
print(result)
I have had some trouble with this problem, and I need your help.
I have to write a Python function, mySplit(x), that takes an input list containing a single string and splits that string on the elements of another list and on digits.
I use Python 3.6.
So here is an example:
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
banned=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ' ', ';']
The returned lists should be like this:
mySplit(l)=['I', 'am', 'learning']
mySplit(l1)=['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I have tried the following, but I always get stuck:
def mySplit(x):
    l=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] #Banned chars
    l2=[i for i in x if i not in l] #Removing chars from input list
    l2=",".join(l2)
    l3=[i for i in l2 if not i.isdigit()] #Removes all the digits
    l4=[i for i in l3 if i is not ',']
    l5=[",".join(l4)]
    l6=l5[0].split(' ')
    return l6
and
mySplit(l1)
mySplit(l)
returns:
['T,h,i,s,e,x,a,m,p,l,e,a,i,n,t,e,a,s,y']
['I,', ',a,m,', ',l,e,a,r,n,i,n,g']
Use re.split() for this task:
import re
w_list = [i for i in re.split(r'[^a-zA-Z]',
'____-----This4ex5ample---aint___ea5sy;782') if i ]
Out[12]: ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
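The same idea can be wrapped in the mySplit function the question asks for. This sketch uses re.findall to collect runs of letters instead of splitting on everything else, which avoids having to filter out the empty strings that re.split produces. It assumes, as in the question, that the input list holds exactly one string and that only ASCII letters should be kept.

```python
import re

def mySplit(x):
    # x is a list with a single string; keep each run of ASCII letters
    return re.findall(r'[a-zA-Z]+', x[0])

l = ['I am learning']
l1 = ['____-----This4ex5ample---aint___ea5sy;782']

print(mySplit(l))
print(mySplit(l1))
```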
I would import the punctuation marks from string and proceed with regular expressions as follows.
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
import re
from string import punctuation
punctuation # to see the punctuation marks.
>>> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
' '.join([re.sub('[' + re.escape(punctuation) + r'\d]', ' ', w) for w in l]).split()
Here is the output:
>>> ['I', 'am', 'learning']
Notice the \d attached at the end of the punctuation marks to remove any digits.
Similarly,
' '.join([re.sub('[' + re.escape(punctuation) + r'\d]', ' ', w) for w in l1]).split()
Yields
>>> ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
You can also modify your function as follows:
def mySplit(x):
    banned = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ';'] + list('0123456789')  # banned chars
    return ''.join([ch if ch not in banned else ' ' for ch in x[0]]).split()
I need to check a string for some symbols and replace them with a whitespace. My code:
string = 'so\bad'
symbols = ['•', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '>', '=', '?', '@', '[', ']', '\\', '^', '_', '`', '{', '}', '~', '|', '"', '⌐', '¬', '«', '»', '£', '$', '°', '§', '–', '—']
for symbol in symbols:
    string = string.replace(symbol, ' ')
print(string)
>> sad
Why does it replace a\b with nothing?
This is because \b is the ASCII backspace character:
>>> string = 'so\bad'
>>> print(string)
sad
You can find it and all the other escape sequences in the Python Reference Manual.
To get the behavior you expect, escape the backslash character or use a raw string:
# Both give 'so bad' after the replacement loop
string = 'so\\bad'
string = r'so\bad'
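A quick way to see the difference is to inspect the strings with repr() and len() rather than print(), since the terminal applies the backspace when printing. A small sketch (the variable names are illustrative):

```python
plain = 'so\bad'     # \b is one character: backspace (code point 8)
escaped = 'so\\bad'  # an escaped backslash followed by 'b'
raw = r'so\bad'      # raw string: the backslash is kept as written

print(repr(plain), len(plain))
print(repr(raw), len(raw))
print(escaped == raw)
```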
The issue you are facing is that \ is used as an escape character.
\b is a special character (backspace).
Use a string literal with the prefix r.
With the r prefix, backslashes are treated as literal characters:
string = r'so\bad'
You are not replacing anything: "\b" is a backspace, which moves the cursor one step to the left.
Note that even if you omit the symbols list and the for symbol in symbols: loop, you will still get "sad" when you print the string. This is because \b is interpreted as a single ASCII control character rather than as a backslash followed by the letter b.
Check out this Stack Overflow answer for a way to work around the issue: How can I print out the string "\b" in Python
I am creating a Python program that converts a string to a list, uses a loop to remove any punctuation, and then converts the list back into a string and prints the sentence without punctuation.
punctuation=['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
str=input("Type in a line of text: ")
alist=[]
alist.extend(str)
print(alist)
#Use loop to remove any punctuation (that appears on the punctuation list) from the list
print(''.join(alist))
This is what I have so far. I tried something like alist.remove(punctuation), but I get an error along the lines of list.remove(x): x not in list. I didn't read the task properly at first and only later realized that I need to do this with a loop, so I added that as a comment, and now I'm stuck. I was, however, able to convert the list back into a string.
import string
punct = set(string.punctuation)
''.join(x for x in 'a man, a plan, a canal' if x not in punct)
Out[7]: 'a man a plan a canal'
Explanation: string.punctuation is pre-defined as:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
The rest is a straightforward comprehension. A set is used to speed up the filtering step.
I found an easy way to do it:
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
text = input("Type in a line of text: ")
for i in punctuation:
    text = text.replace(i, "")
print(text)
This way you will not get any error.
punctuation=['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
text = input("Type in a line of text: ")
result = ""
for character in text:
    if character not in punctuation:
        result += character
print(result)
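For the list-based approach the question describes (string to list, loop, back to string), here is a minimal sketch. It also hints at why alist.remove(punctuation) fails: remove() looks for one element equal to its argument, and the whole punctuation list is never an element of alist. The fixed sample sentence stands in for input() so the sketch runs without interaction.

```python
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
text = "Hello, world! (It's a test.)"  # stand-in for input()

alist = list(text)  # convert the string to a list of characters

# Build a new list rather than removing items while iterating
kept = []
for character in alist:
    if character not in punctuation:
        kept.append(character)

result = ''.join(kept)  # convert the list back into a string
print(result)
```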
Here is how to tokenize the given statements using Python. The Python version I used is 3.4.4.
Assume the text is saved as one.txt, and the Python program is saved in the same directory as that file. The following is my Python program:
with open('one.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1)  # print the text with punctuation (before removal)
# The list of punctuation marks to remove; add more if any are missing
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i, " ")  # put a space where the punctuation mark was
myList = []
myList.extend(str1.split(" "))
print(str1)  # print the text without punctuation (after removal)
for i in myList:
    # print("____________")
    print(i, end='\n')
    print("____________")
Next I will post how to remove stop words.
Until then, let me know in the comments whether this is useful.
Thank you
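One detail in the program above: because each punctuation mark is replaced by a space, str1.split(" ") produces empty strings wherever two spaces end up next to each other. Here is a small sketch of the difference, assuming empty tokens are not wanted. The sample sentence and the names with_empties and tokens are illustrative.

```python
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
str1 = "Hello, world. How are you?"

for mark in punctuation:
    str1 = str1.replace(mark, " ")

with_empties = str1.split(" ")  # keeps '' between consecutive spaces
tokens = str1.split()           # splits on any run of whitespace

print(with_empties)
print(tokens)
```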