Removing all punctuation except white spaces python [duplicate]


It seems like there should be a simpler way than:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
Is there?

From an efficiency perspective, you're not going to beat this (Python 2):
s.translate(None, string.punctuation)
In Python 3, where str.translate() takes a single table argument, use:
s.translate(str.maketrans('', '', string.punctuation))
It's performing raw string operations in C with a lookup table - there's not much that will beat that but writing your own C code.
If speed isn't a worry, another option though is:
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
This is faster than s.replace with each char, but won't perform as well as non-pure python approaches such as regexes or string.translate, as you can see from the below timings. For this type of problem, doing it at as low a level as possible pays off.
Timing code:
import re, string, timeit
s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))
def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)
def test_re(s): # From Vinko's solution, with fix.
    return regex.sub('', s)
def test_trans(s):
    return s.translate(table, string.punctuation)
def test_repl(s): # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s
print "sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
This gives the following results:
sets : 19.8566138744
regex : 6.86155414581
translate : 2.12455511093
replace : 28.4436721802

Regular expressions are simple enough, if you know them.
import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

For convenience, here is a summary of how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for detailed explanations.
Python 2
import string
s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation) # Output: string without punctuation
Python 3
import string
s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
new_s = s.translate(table) # Output: string without punctuation

myString.translate(None, string.punctuation)

Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:
# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with - «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
You can generalize and strip other types of characters as well:
''.join(ch for ch in s if category(ch)[0] not in 'SP')
It will also strip characters like ~*+§$ which may or may not be "punctuation" depending on one's point of view.

I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
... s= s.replace(c,"")
...
>>> s
'string With Punctuation'

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; codepoints (integers) are looked up in that mapping and anything mapped to None is removed.
To remove (some?) punctuation then, use:
import string
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)
The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.
To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):
import unicodedata
import sys
remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
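A minimal sketch of how such a table might be used in Python 3. The helper name strip_unicode_punct and the sample string are illustrative and not from the answer, and the whitespace collapse at the end is an extra step added only to make the result easier to read.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P' to None.
# Building the table scans the whole code space once, so build it a single time.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_unicode_punct(text):
    """Delete all Unicode punctuation, then collapse runs of whitespace."""
    return ' '.join(text.translate(remove_punct_map).split())

print(strip_unicode_punct('¿Qué? «Hola», dijo…'))  # Qué Hola dijo
```

Building the table is the slow part; once it exists, translate() itself runs at the same speed as the ASCII-only version.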

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?
import regex
s = u"string. With. Some・Really Weird、Non？ASCII。 「（Punctuation）」?"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()
Personally, I believe this is the best way to remove punctuation from a string in Python because:
It removes all Unicode punctuation
It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation, but keep symbols like $.
You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.
This uses Unicode character properties, which you can read more about on Wikipedia.

I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or whitespace (\s). Since \w already matches digits, the \d is redundant but harmless:
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]+', '', s)

Here's a one-liner for Python 3.5:
import string
"l*ots! o(f. p#u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

This might not be the best solution however this is how I did it.
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you desire:
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","#","$","&",")","(","\""]
    for punc in puncList:
        # One pass over the list per punctuation mark is enough
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

Just as an update, I rewrote @Brian's example in Python 3 and moved the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing, can't share a regex object between your workers, and need to run re.compile on each worker. Also, I was curious to time two different implementations of maketrans for Python 3:
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
Plus I added another method to use set, where I take advantage of intersection function to reduce number of iterations.
This is the complete code:
import re, string, timeit
s = "string. With. Punctuation"
def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)
def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())
def test_re(s): # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)
def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)
def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))
def test_repl(s): # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s
print("sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
These are my results:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565

A one-liner might be helpful in not very strict cases:
''.join([c for c in s if c.isalnum() or c.isspace()])

Here's a solution without regex.
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
Replaces each punctuation character with a space
Replaces multiple spaces between words with a single space
Removes the trailing spaces, if any, with strip()
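The three steps above can be sketched for Python 3, where maketrans lives on str rather than in the string module. The function name replace_punctuation is illustrative, not from the answer.

```python
import string

# One space per punctuation character, so every mark maps to ' '.
punctuation_to_space = str.maketrans(
    string.punctuation, ' ' * len(string.punctuation)
)

def replace_punctuation(text):
    """Turn punctuation into spaces, then collapse and trim whitespace."""
    spaced = text.translate(punctuation_to_space)
    # split() with no arguments drops leading, trailing and repeated
    # whitespace, so a separate strip() is not needed.
    return ' '.join(spaced.split())

print(replace_punctuation("!where??and!!or$$then:)"))  # where and or then
```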

Why none of you use this?
''.join(filter(str.isalnum, s))
Too slow?

I was looking for a really simple solution. here's what I got:
import re
s = "string. With. Punctuation?"
s = re.sub(r'[\W\s]', ' ', s)
print(s)
'string  With  Punctuation '

Here's one other easy way to do it using RegEx
import re
punct = re.compile(r'(\w+)')
sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
'This is a sample sentence'

# FIRST METHOD
# Storing all punctuations in a variable
punctuation='!?,.:;"\')(_-'
newstring ='' # Creating empty string
word = raw_input("Enter string: ")
for i in word:
    if(i not in punctuation):
        newstring += i
print ("The string without punctuation is", newstring)
# SECOND METHOD
word = raw_input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
newstring = word.translate(None, punctuation)
print ("The string without punctuation is",newstring)
# Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage

with open('one.txt','r') as myFile:
    str1=myFile.read()
print(str1)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i," ")
myList=[]
myList.extend(str1.split(" "))
print (str1)
for i in myList:
    print(i,end='\n')
print ("____________")

Try that one :)
regex.sub(r'\p{P}','', s)

The question does not have a lot of specifics, so the approach I took is to come up with a solution with the simplest interpretation of the problem: just remove the punctuation.
Note that solutions presented don't account for contracted words (e.g., you're) or hyphenated words (e.g., anal-retentive)...which is debated as to whether they should or shouldn't be treated as punctuations...nor to account for non-English character set or anything like that...because those specifics were not mentioned in the question. Someone argued that space is punctuation, which is technically correct...but to me it makes zero sense in the context of the question at hand.
# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))
# using list comprehension
''.join('' if c in string.punctuation else c for c in s)

Apparently I can't supply edits to the selected answer, so here's an update which works for Python 3. The translate approach is still the most efficient option when doing non-trivial transformations.
Credit for the original heavy lifting to @Brian above. And thanks to @ddejohn for his excellent suggestion for improvement to the original test.
#!/usr/bin/env python3
"""Determination of most efficient way to remove punctuation in Python 3.
Results in Python 3.8.10 on my system using the default arguments:
set : 51.897
regex : 17.901
translate : 2.059
replace : 13.209
"""
import argparse
import re
import string
import timeit
parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")
def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)
def test_regex(s): # From Vinko's solution, with fix.
    return regex.sub("", s)
def test_translate(s):
    return s.translate(table)
def test_replace(s): # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s
opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

For serious natural language processing (NLP), you should let a library like SpaCy handle punctuation through tokenization, which you can then manually tweak to your needs.
For example, how do you want to handle hyphens in words? Exceptional cases like abbreviations? Begin and end quotes? URLs? In NLP it's often useful to separate out a contraction like "let's" into "let" and "'s" for further processing.

Considering unicode. Code checked in python3.
from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

You can also do this:
import string
' '.join(word.strip(string.punctuation) for word in text.split())  # text is the input string
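A short sketch of what this approach does and does not remove. The function name strip_word_edges and the sample sentence are made up for illustration. Because str.strip() only trims characters from both ends of each word, apostrophes and hyphens inside a word are left alone.

```python
import string

def strip_word_edges(text):
    """Trim punctuation from the start and end of each whitespace-separated word."""
    return ' '.join(word.strip(string.punctuation) for word in text.split())

print(strip_word_edges("Hello, world! It's a well-known (fact)."))
# Hello world It's a well-known fact
```

That makes it a reasonable fit when contractions and hyphenated words should be kept intact.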

When dealing with Unicode strings, I suggest using the PyPI regex module because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).
Just install the package by typing pip install regex (or pip3 install regex) in your terminal and hit ENTER.
In case you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits and whitespace) you can use
regex.sub(r'[\p{P}\p{S}]', '', text) # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text) # Same with a POSIX character class
See a Python demo online:
import regex
text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()
print(new_text)
# => भारत india 002
Here, I added a whitespace \s pattern to the character class

Related

Python return if statement

Unclear on how to frame the following function correctly:
Creating a function that will take in a string and return the string in camel case without spaces (or pascal case if the first letter was already capital), removing special characters
text = "This-is_my_test_string,to-capitalize"
def to_camel_case(text):
# Return 1st letter of text + all letters after
return text[:1] + text.title()[1:].replace(i" ") if not i.isdigit()
# Output should be "ThisIsMyTestStringToCapitalize"
the "if" statement at the end isn't working out, and I wrote this somewhat experimentally, but with a syntax fix, could the logic work?

Providing the input string does not contain any spaces then you could do this:
from re import sub
def to_camel_case(text, pascal=False):
    r = sub(r'[^a-zA-Z0-9]', ' ', text).title().replace(' ', '')
    return r if pascal else r[0].lower() + r[1:]
ts = 'This-is_my_test_string,to-capitalize'
print(to_camel_case(ts, pascal=True))
print(to_camel_case(ts))
Output:
ThisIsMyTestStringToCapitalize
thisIsMyTestStringToCapitalize

Here is a short solution using regex. First it uses title() as you did, then the regex finds non-alphanumeric-characters and removes them, and finally we take the first character to handle pascal / camel case.
import re
def to_camel_case(s):
    s1 = re.sub('[^a-zA-Z0-9]+', '', s.title())
    return s[0] + s1[1:]
text = "this-is2_my_test_string,to-capitalize"
print(to_camel_case(text)) # thisIs2MyTestStringToCapitalize
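The same three steps, spelled out one per line as a sketch. The pascal flag is borrowed from the previous answer and is not part of this one; the guard for an empty result is an added assumption so that input made only of separators does not raise an IndexError.

```python
import re

def to_camel_case(text, pascal=False):
    # Step 1: capitalize the first letter of every run of letters.
    titled = text.title()
    # Step 2: drop everything that is not an ASCII letter or digit.
    joined = re.sub(r'[^a-zA-Z0-9]+', '', titled)
    # Step 3: pick the case of the first character.
    if pascal or not joined:
        return joined
    return joined[0].lower() + joined[1:]

print(to_camel_case("this-is2_my_test_string,to-capitalize"))
# thisIs2MyTestStringToCapitalize
```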

The below should work for your example.
Split your example on anything that isn't alphanumeric or whitespace, capitalize each word, and return the re-joined string.
import re
def to_camel_case(text):
    words = re.split(r'[^a-zA-Z0-9\s]', text)
    return "".join([word.capitalize() for word in words])
text_to_camelcase = "This-is_my_test_string,to-capitalize"
print(to_camel_case(text_to_camelcase))

Use re.split to split on anything that is not a letter, digit, or whitespace, and .capitalize() to capitalize each word:
import re
text_to_camelcase = "This-is_my_test_string,to-capitalize"
def to_camel_case(text):
    split_text = re.split(r'[^a-zA-Z0-9\s]', text)
    cap_string = ''
    for word in split_text:
        cap_word = word.capitalize()
        cap_string += cap_word
    return cap_string
print(to_camel_case(text_to_camelcase))

Trying to remove all punctuation characters from a string but I keep getting // left in

I am trying to write a function to remove all punctuation characters from a string. I've tried several permutations on translate, replace, strip, etc. My latest attempt uses a brute force approach:
def clean_lower(sample):
    punct = list(string.punctuation)
    for c in punct:
        sample.replace(c, ' ')
    return sample.split()
That gets rid of almost all of the punctuation but I'm left with // in front of one of the words. I can't seem to find any way to remove it. I've even tried explicitly replacing it with sample.replace('//', ' ').
What do I need to do?

Using translate is the fastest way to remove punctuation; this will remove // too:
import string
s = "This is! a string, with. punctuations? //"
def clean_lower(s):
    return s.translate(str.maketrans('', '', string.punctuation))
s = clean_lower(s)
print(s)

Use regular expressions
import re
def clean_lower(s):
    return(re.sub(r'\W','',s))
The function above removes every non-word character, including whitespace, and keeps letters, digits, and the underscore.

Perhaps you should approach it from the perspective of what you want to keep:
For example:
import string
toKeep = set(string.ascii_letters + string.digits + " ")
toRemove = set(string.printable) - toKeep
cleanUp = str.maketrans('', '', "".join(toRemove))
usage:
s = "Hello! world of / and dice".translate(cleanUp)
# s will be 'Hello world of and dice'

As suggested by @jasonharper, you need to assign the result of replace() back to sample, because strings are immutable and replace() returns a new string:
import string
sample='// Hello?) // World!'
print(sample)
punct=list(string.punctuation)
for c in punct:
    sample=sample.replace(c,'')
print(sample.split())
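A small sketch of why the question's function leaves the punctuation in place, followed by a corrected version. It keeps the question's function name and replaces punctuation with spaces as the question does; the variable names untouched and text are illustrative.

```python
import string

sample = '// Hello?) // World!'

# str.replace returns a new string; calling it without using the result
# leaves the original unchanged.
untouched = sample
sample.replace('/', ' ')
print(sample == untouched)  # True

def clean_lower(text):
    for c in string.punctuation:
        # Rebind the name to the new string on every pass.
        text = text.replace(c, ' ')
    return text.split()

print(clean_lower(sample))  # ['Hello', 'World']
```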

How do I replace punctuation in a string in Python?

I would like to replace (and not remove) all punctuation characters by " " in a string in Python.
Is there something efficient of the following flavour?
text = text.translate(string.maketrans("",""), string.punctuation)

This answer is for Python 2 and will only work for ASCII strings:
The string module contains two things that will help you: a list of punctuation characters and the "maketrans" function. Here is how you can use them:
import string
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
text = text.translate(replace_punctuation)

Modified solution from Best way to strip punctuation from a string in Python
import string
import re
regex = re.compile('[%s]' % re.escape(string.punctuation))
out = regex.sub(' ', "This is, fortunately. A Test! string")
# out = 'This is  fortunately  A Test  string'

This workaround works in python 3:
import string
ex_str = 'SFDF-OIU .df !hello.dfasf sad - - d-f - sd'
# string.punctuation has 32 characters; len() avoids hard-coding that
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
res = ex_str.translate(table)
# res = 'SFDF OIU  df  hello dfasf sad     d f   sd'

There is a more robust solution which relies on a regex exclusion rather than inclusion through an extensive list of punctuation characters.
import re
print(re.sub(r'[^\w\s]', '', 'This is, fortunately. A Test! string'))
#Output - 'This is fortunately A Test string'
The regex matches anything that is not a word character (letter, digit, or underscore) or whitespace.
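One detail worth illustrating: \w includes the underscore, so the exclusion pattern keeps it. The sketch below compares that pattern with a variant that also drops underscores. The sample text and variable names are made up for the example.

```python
import re

text = 'snake_case, is it punctuation?'

# [^\w\s] leaves the underscore alone because '_' is a word character.
keep_underscore = re.sub(r'[^\w\s]', '', text)

# Adding '_' as an alternative removes it as well.
drop_underscore = re.sub(r'[^\w\s]|_', '', text)

print(keep_underscore)  # snake_case is it punctuation
print(drop_underscore)  # snakecase is it punctuation
```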

Replace with ''?
What's the difference between translating every ; into '' and removing every ;?
Here is how to remove every ; (Python 2):
s = 'dsda;;dsd;sad'
table = string.maketrans('','')
string.translate(s, table, ';')
And you can do your replacement with translate.

In my specific way, I removed "+" and "&" from the punctuation list:
all_punctuations = string.punctuation
selected_punctuations = re.sub(r'(\&|\+)', "", all_punctuations)
print selected_punctuations
str = "he+llo* ithis& place% if you * here ##"
punctuation_regex = re.compile('[%s]' % re.escape(selected_punctuations))
punc_free = punctuation_regex.sub("", str)
print punc_free
Result: he+llo ithis& place if you here

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.

Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)

Shorter way :
import re
cleanString = re.sub(r'\W+','', string )
If you want spaces between words and numbers substitute '' with ' '

TLDR
I timed the provided answers.
import re
re.sub(r'\W+','', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub(r'\W+','', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
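The caution about characters such as ø can be illustrated with a short sketch. In Python 3, \W on a str pattern is Unicode-aware by default, so non-ASCII letters count as word characters and are kept; passing re.ASCII restricts word characters to [a-zA-Z0-9_]. The sample text and variable names are illustrative.

```python
import re

text = 'smørrebrød & øl!'

# Default for str patterns in Python 3: 'ø' is a word character and survives.
unicode_result = re.sub(r'\W+', '', text)

# With re.ASCII, only [a-zA-Z0-9_] are word characters, so 'ø' is removed.
ascii_result = re.sub(r'\W+', '', text, flags=re.ASCII)

print(unicode_result)  # smørrebrødøl
print(ascii_result)    # smrrebrdl
```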

Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, filter() returns an iterator rather than a string, so you have to join the result to get a string back:
''.join(filter(str.isalnum, string))
Or pass a list to join (not sure, but it may be slightly faster):
''.join([*filter(str.isalnum, string)])
note: unpacking in [*args] valid from Python >= 3.5

#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr
You can add more special characters to the character class; each match is replaced by '' (nothing), i.e. it is removed.

Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, lowercase the string first; the regex then becomes simpler, and possibly faster, because it no longer has to match uppercase letters.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())

string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')
Look at the given URL above for all related categories. You can of course also filter by the punctuation categories.

For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)

This removes all special characters, punctuation, and spaces from a string, leaving only letters and numbers.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z]", "", sample_str)
print(f"op2 = ", op2)
special_char_list = ["$", "#", "#", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)

Use translate:
import string
def clean(instr):
    return instr.translate(None, string.punctuation + ' ')
Caveat: this is Python 2 only and works only on ASCII (byte) strings.

This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
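# if we need to count the word python that ends with or without ',' or '.' at end
# Assumption: the original snippet never defined `text`; a lowercased word list is assumed here.
text = my_string.lower().split()
count = 0
for i in text:
    if i.endswith("."):
        text[count] = re.sub("^([a-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("python"))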

After 10 years, here is what I consider the best solution.
You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.
from clean_text import clean
string = 'Special $#! characters spaces 888323'
new = clean(string,lower=False,no_currency_symbols=True, no_punct = True,replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
you can replace space if you want.
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'

function regexFuntion(st) {
    const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
    st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
    st = st.replace(/\s\s+/g, ' '); // remove multiple space
    return st;
}
console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you will see the result:
askhnlaskdjalsdk

splitting merged words in python

I am working with a text where all "\n"s have been deleted (which merges two words into one, like "I like bananasAnd this is a new line.And another one.") What I would like to do now is tell Python to look for combinations of a small letter followed by capital letter/punctuation followed by capital letter and insert a whitespace.
I thought this would be easy with regular expressions, but it is not - I couldn't find an "insert" function or anything, and the string methods don't seem to help either. How do I do this?
Any help would be greatly appreciated, I am despairing over here...
Thanks, patrick

Try the following:
re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", your_string)
For example:
import re
lines = "I like bananasAnd this is a new line.And another one."
print re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", lines)
# I like bananas And this is a new line. And another one.
If you want to insert a newline instead of a space, change the replacement to r"\1\n\2".
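The same substitution as a Python 3 sketch wrapped in a function. The name split_merged and the sep parameter are illustrative, not from the answer. Inside a character class the dot needs no escaping, so the pattern is written without the backslash.

```python
import re

def split_merged(text, sep=' '):
    """Insert sep between a lowercase letter or . ! ? and a following capital."""
    # Group 1 is the character that ends the first line, group 2 the capital
    # that starts the next one; both are kept via backreferences.
    return re.sub(r'([a-z.!?])([A-Z])', r'\1' + sep + r'\2', text)

lines = "I like bananasAnd this is a new line.And another one."
print(split_merged(lines))
# I like bananas And this is a new line. And another one.
```

Note that this also splits legitimate camelCase words such as "iPhone", so it is best suited to text where capitals only start sentences or lines.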

Using re.sub you should be able to make a pattern that grabs a lowercase and an uppercase letter and substitutes them with the same two letters separated by a newline (use a space in the replacement instead if you want to keep them on one line):
import re
re.sub(r'([a-z][.?]?)([A-Z])', '\\1\n\\2', mystring)

You're looking for the sub function. See http://docs.python.org/library/re.html for documentation.

Hmm, interesting. You can use regular expressions to replace text with the sub() function:
>>> import re
>>> string = 'fooBar'
>>> re.sub(r'([a-z][.!?]*)([A-Z])', r'\1 \2', string)
'foo Bar'

If you really don't have any caps except at the beginning of a sentence, it will probably be easiest to just loop through the string.
>>> import string
>>> s = "a word endsA new sentence"
>>> lastend = 0
>>> sentences = list()
>>> for i in range(0, len(s)):
...     if s[i] in string.uppercase:
...         sentences.append(s[lastend:i])
...         lastend = i
...
>>> sentences.append(s[lastend:])
>>> print sentences
['a word ends', 'A new sentence']

Here's another approach, which avoids regular expressions and does not use any imported libraries, just built-ins...
s = "I like bananasAnd this is a new line.And another one."
with_whitespace = ''
last_was_upper = True
for c in s:
    if c.isupper():
        if not last_was_upper:
            with_whitespace += ' '
        last_was_upper = True
    else:
        last_was_upper = False
    with_whitespace += c
print with_whitespace
Yields:
I like bananas And this is a new line. And another one.
