How can I remove all punctuation from a string? - python

I want to remove all punctuation from a string, x.
I want to use re.findall(), but I've been struggling to work out what to pass to it.
I know that I can get all the punctuation characters by writing:
import string
y = string.punctuation
But if I write:
re.findall(y,x)
it says:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Can someone explain what exactly we should pass to the re.findall function?

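You may not even need a regex for this. You can simply use translate, like this:
import string
# Python 3: str.maketrans('', '', chars) builds a table that deletes chars
print(data.translate(str.maketrans('', '', string.punctuation)))
# Python 2 equivalent: print data.translate(None, string.punctuation)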

Several characters in string.punctuation have special meaning in regular expressions, so they need to be escaped. (The output below is from Python 2; Python 3.7+ escapes only the characters that are special in a regex.)
>>> import re
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> re.escape(string.punctuation)
'\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~'
And if you want to match any one of them, use a character class ([...]):
>>> '[{}]'.format(re.escape(string.punctuation))
'[\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~]'
>>> pattern = '[{}]'.format(re.escape(string.punctuation))
>>> re.sub(pattern, '', 'Hell,o World.')
'Hello World'
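Since the question asked about re.findall specifically, here is a minimal sketch of the same escaped character class used with both findall and sub. The name PUNCT and the sample text are only illustrative. The original "multiple repeat" error most likely comes from unescaped sequences such as `*+` in string.punctuation being read as quantifiers.

```python
import re
import string

# Character class that matches any single punctuation character.
# re.escape neutralises characters such as * + ? [ ] that are special in a regex.
PUNCT = re.compile('[{}]'.format(re.escape(string.punctuation)))

text = 'Hell,o World.'  # illustrative sample string

found = PUNCT.findall(text)    # list of the punctuation characters present
cleaned = PUNCT.sub('', text)  # the text with those characters removed

print(found)
print(cleaned)
```

findall only reports what matched, so sub (or str.translate) is still the tool for the actual removal.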

Related

Remove n characters after certain character

I have a string that looks something like this:
*45hello I'm a string *2jwith some *plweird things
I need to remove every * and the 2 characters that follow it to get this:
hello I'm a string with some weird things
Is there a practical way to do it without iterating over the string?
Thanks!
Using a regular expression:
import re
s = "*45hello I'm a string *2jwith some *plweird things"
s = re.sub(r'\*..', '', s)
You can use regex:
import re
regex = r"\*(.{2})"
test_str = "*45hello I'm a string *2jwith some *plweird things"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, '', test_str, count=0)
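To cover the "n characters" in the title, the two answers above can be generalised roughly as follows. This is only a sketch: the function name and its parameters are made up for illustration, and it assumes the marker is always followed by at least n characters on the same line (a marker with fewer characters after it is left untouched).

```python
import re

def remove_marker_and_following(text, marker="*", n=2):
    """Delete each `marker` plus the n characters right after it (hypothetical helper)."""
    # re.escape makes the marker literal; .{n} matches exactly n non-newline characters.
    pattern = re.escape(marker) + ".{%d}" % n
    return re.sub(pattern, "", text)

s = "*45hello I'm a string *2jwith some *plweird things"
print(remove_marker_and_following(s))
```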

Remove just the alphabet characters from a string

In Python,
I have a string like
"dsafsadf_afasa_2.2.14_43.33_dsfd"
I need to get just
"2.2.14_43.33"
How do I do it?
Seems like you're trying to remove all letters and all underscores, except for underscores that sit between digits.
>>> import re
>>> s = "dsafsadf_afasa_2.2.14_43.33_dsfd"
>>> re.sub(r'[a-z]|(?<=\D)_(?=\d)|(?<=\d)_(?=\D)|(?<=\D)_(?=\D)|^_+|_+$', '', s)
'2.2.14_43.33'
You can use str.translate if you just want to remove the letters (Python 2 syntax):
s = "dsafsadf_afasa_2.2.14_43.33_dsfd"
from string import ascii_letters
print(s.translate(None,ascii_letters))
which outputs:
__2.2.14_43.33_
For Python 3:
from string import ascii_letters
print(s.translate({ord(ch):"" for ch in ascii_letters}))
If you really want to remove the underscores from both ends, use strip (again Python 2 syntax):
s = "dsafsadf_afasa_2.2.14_43.33_dsfd"
from string import ascii_letters
print(s.translate(None,ascii_letters).strip("_"))
Output:
2.2.14_43.33
You can simply do a re.findall.
import re
p = re.compile(r'\d+(?:[\W_]\d+)*')
test_str = "dsafsadf_afasa_2.2.14_43.33_dsfd"
re.findall(p, test_str)
See demo.
https://regex101.com/r/hF1wE3/2
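The findall answer above shows no output, so here is the same pattern run end to end. The pattern and test string are taken from that answer; the variable names matches and first are added for illustration. findall returns a list, so the single value still has to be picked out.

```python
import re

# One or more digits, optionally followed by (separator + digits) groups,
# where a separator is any non-word character or an underscore.
p = re.compile(r'\d+(?:[\W_]\d+)*')
test_str = "dsafsadf_afasa_2.2.14_43.33_dsfd"

matches = p.findall(test_str)           # list of every matching run
first = matches[0] if matches else None # guard against an empty result

print(matches)
print(first)
```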

Python converting string to latex using regular expression

Say I have a string
string = "{1/100}"
I want to use regular expressions in Python to convert it into
new_string = r"\frac{1}{100}"
I think I would need to use something like this
new_string = re.sub(r'{.+/.+}', r'', string)
But I'm stuck on what I would put in order to preserve the characters in the fraction, in this example 1 and 100.
You can use () to capture the numbers. Then use \1 and \2 to refer to them:
new_string = re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', string)
# \frac{1}{100}
Note: Don't forget to escape the backslash \\.
Capture the numbers using parentheses and then reference them in the replacement text using \1 and \2. For example:
>>> print(re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', "{1/100}"))
\frac{1}{100}
Anything inside the braces should have the form number/number, so use a digit class ([0-9]) in the regex instead of . (dot).
>>> import re
>>> string = "{1/100}"
>>> new = re.sub(r'{([0-9]+)/([0-9]+)}', r'\\frac{\1}{\2}', string)
>>> print(new)
\frac{1}{100}
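To see why the digit class matters, here is a small comparison on a string with two fractions. The input is made up for illustration, and the braces are escaped in the pattern only for clarity. The greedy .+ stretches from the first opening brace to the last closing one, while [0-9]+ keeps each match inside its own braces.

```python
import re

text = "{1/100} and {3/4}"  # made-up input containing two fractions

# Greedy .+ swallows everything up to the last '/', spanning both brace pairs.
loose = re.sub(r'\{(.+)/(.+)\}', r'\\frac{\1}{\2}', text)

# Digit-only groups cannot cross a brace, so each fraction is converted on its own.
strict = re.sub(r'\{([0-9]+)/([0-9]+)\}', r'\\frac{\1}{\2}', text)

print(loose)
print(strict)
```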
Use re.match. It's more flexible:
>>> m = re.match(r'{(.+)/(.+)}', string)
>>> m.groups()
('1', '100')
>>> new_string = "\\frac{%s}{%s}" % m.groups()
>>> print(new_string)
\frac{1}{100}

Python: strip a wildcard word

I have strings with words separated by dots.
Example:
string1 = 'one.two.three.four.five.six.eight'
string2 = 'one.two.hello.four.five.six.seven'
How can I process such a string in a Python method while treating one word as a wildcard (in this case, for example, the third word varies)? I am thinking of regular expressions, but I don't know whether the approach I have in mind is possible in Python.
For example:
string1.lstrip("one.two.[wildcard].four.")
or
string2.lstrip("one.two.'/.*/'.four.")
(I know that I can extract this with split('.')[-3:], but I am looking for a general way; lstrip is just an example.)
Use re.sub(pattern, '', original_string) to remove the matching part from original_string:
>>> import re
>>> string1 = 'one.two.three.four.five.six.eight'
>>> string2 = 'one.two.hello.four.five.six.seven'
>>> re.sub(r'^one\.two\.\w+\.four', '', string1)
'.five.six.eight'
>>> re.sub(r'^one\.two\.\w+\.four', '', string2)
'.five.six.seven'
BTW, you are misunderstanding str.lstrip: it treats its argument as a set of characters to strip, not as a prefix:
>>> 'abcddcbaabcd'.lstrip('abcd')
''
str.replace is more appropriate (of course, re.sub, too):
>>> 'abcddcbaabcd'.replace('abcd', '')
'dcba'
>>> 'abcddcbaabcd'.replace('abcd', '', 1)
'dcbaabcd'
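For the "general way" the question asks about, one possible sketch is to build the regex from a template in which a placeholder stands for the varying word. The helper name, the `*` placeholder and the choice of \w+ for "one word" are all assumptions made for illustration.

```python
import re

def strip_wildcard_prefix(text, template, wildcard="*"):
    """Remove a leading `template` from text; `wildcard` stands for one word (hypothetical helper)."""
    # Escape the literal pieces so '.' only matches a dot, then rejoin them with \w+.
    literal_parts = [re.escape(part) for part in template.split(wildcard)]
    pattern = "^" + r"\w+".join(literal_parts)
    return re.sub(pattern, "", text)

string1 = 'one.two.three.four.five.six.eight'
string2 = 'one.two.hello.four.five.six.seven'
print(strip_wildcard_prefix(string1, 'one.two.*.four.'))
print(strip_wildcard_prefix(string2, 'one.two.*.four.'))
```

If the template does not match at the start of the string, the text is returned unchanged.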

Python regex split, integer of arbitrary length

I'm trying to do a simple regex split in Python. The string is in the form of FooX where Foo is some string and X is an arbitrary integer. I have a feeling this should be really simple, but I can't quite get it to work.
On that note, can anyone recommend some good Regex reading materials?
You can't use split() here since it has to consume some characters (before Python 3.7, re.split could not split on an empty match), but you can use normal matching to do it.
>>> import re
>>> r = re.compile(r'(\D+)(\d+)')
>>> r.match('abc444').groups()
('abc', '444')
Using groups:
import re
m = re.match(r'^(?P<first>[A-Za-z]+)(?P<second>[0-9]+)$', "Foo9")
print(m.group('first'))
print(m.group('second'))
Using search:
import re
s = 'Foo9'
# Zero-width match at the boundary between a non-digit and a digit
m = re.search(r'(?<=\D)(?=\d)', s)
first = s[:m.start()]
second = s[m.end():]
print(first, second)
Keeping it simple:
>>> import re
>>> a = "Foo1String12345"
>>> re.split(r'(\d+)$', a)[0:2]
['Foo1String', '12345']
Assuming you want to split between the "Foo" and the number, you'd want something like:
r'(?<=\D)(?=\d)'
This matches at the point between a non-digit and a digit, without consuming any characters in the split.
>>> import re
>>> s="gnibbler1234"
>>> re.findall(r'(\D+)(\d+)',s)[0]
('gnibbler', '1234')
In the regex, \D means anything that is not a digit, so \D+ matches one or more things that are not digits.
Likewise, \d means anything that is a digit, so \d+ matches one or more digits.
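Putting the matching approach from these answers into a reusable form might look like the sketch below. The function name is hypothetical, and it assumes the integer is always at the very end of the string; the lazy prefix lets digits appear earlier in the name, as in "Foo1String12345".

```python
import re

def split_trailing_int(text):
    """Split 'Foo123' into ('Foo', 123); return None if there is no trailing integer."""
    # (.*?) takes as little as possible, (\d+) must then reach the end of the string.
    m = re.fullmatch(r'(.*?)(\d+)', text)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

print(split_trailing_int('Foo9'))
print(split_trailing_int('Foo1String12345'))
print(split_trailing_int('Foo'))
```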
