Remove n characters after certain character - python

I have an string that looks something like this:
*45hello I'm a string *2jwith some *plweird things
I need to remove all the * and the 2 chars that follow those * to get this:
hello I'm a string with some weird things
Is there a practical way to do it without iterating over the string?
Thanks!

Using regular expression:
import re
s = "*45hello I'm a string *2jwith some *plweird things"
s = re.sub(r'\*..', '', s)

You can use regex:
import re
regex = r"\*(.{2})"
test_str = "*45hello I'm a string *2jwith some *plweird things"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, '', test_str, 0)

Related

Python Regex: Remove optional characters

I have a regex pattern with optional characters however at the output I want to remove those optional characters. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.
You've already captured the intended data in groups and now you can use re.sub to replace the whole match with just contents of group1 and group2.
Try your modified Python code,
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice, how I've added .* around the pattern, so any of the extra characters around your data is matched and gets removed. Also, removed extra parenthesis that were not needed. This will also work with strings where you may have other digits surrounding that text like this hello123 a2017a12a some other 99 numbers
Output,
201712
Regex Demo
You can just use re.sub with the pattern \D (=not a number):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
Try this one:
import re
string = 'a2017a12a'
pattern = re.findall("(\d+)", string) # this regex will capture only digit
print("".join(p for p in pattern)) # combine all digits
Output:
201712
If you want to remove all character from string then you can do this
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'
You can use re module method to get required output, like:
import re
#method 1
string = 'a2017a12a'
print (re.sub(r'\D', '', string))
#method 2
pattern = re.findall("(\d+)", string)
print("".join(p for p in pattern))
You can also refer below doc for further knowledge.
https://docs.python.org/3/library/re.html

Substring regex from characters to end of word

I looking for a regex term that will capture a subset of a string beginning with a a certain sequence of characters (http in my case)up until a whitespace.
I am doing the problem in python, working over a list of strings and replacing the 'bad' substring with ''.
The difficulty stems from the characters not necessarily beginning the words within the substring. Example below, with bold being the part I am looking to capture:
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you
Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
You can try this:
strings = [] #your list of strings goes here
import re
new_strings = [re.sub("https.*?php|https.*?$", '.', i) for i in strings]

Combining two patterns with named capturing group in Python?

I have a regular expression which uses the before pattern like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'
and another like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'
How can I combine these two patterns in a way that, after parsing these two different lines,:
sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>
returns only the corresponding id to me.
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')
Use this, I've just tested and it is working for me. I've moved some part out.
Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:
>>> import re
>>>
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
... print(pattern.sub("", test_string).split(":")[-1])
...
I118uailfriedx151201005423521
I118uailfriedx151201005423521
You can achieve what you want with a single regex:
\bsid:\W*(?P<sid>\w+)
See the regex demo
The regex breakdown:
\bsid - whole word sid
: - a literal colon
\W* - zero or more non-word characters
(?P<sid>\w+) - one or more word characters captured into a group named "sid"
Python demo:
import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
print(m.group("sid"))

Python converting string to latex using regular expression

Say I have a string
string = "{1/100}"
I want to use regular expressions in Python to convert it into
new_string = "\frac{1}{100}"
I think I would need to use something like this
new_string = re.sub(r'{.+/.+}', r'', string)
But I'm stuck on what I would put in order to preserve the characters in the fraction, in this example 1 and 100.
You can use () to capture the numbers. Then use \1 and \2 to refer to them:
new_string = re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', string)
# \frac{1}{100}
Note: Don't forget to escape the backslash \\.
Capture the numbers using parens and then reference them in the replacement text using \1 and \2. For example:
>>> print re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', "{1/100}")
\frac{1}{100}
Anything inside the braces would be a number/number. So in the regex place numbers([0-9]) instead of a .(dot).
>>> import re
>>> string = "{1/100}"
>>> new = re.sub(r'{([0-9]+)/([0-9]+)}', r'\\frac{\1}{\2}', string)
>>> print new
\frac{1}{100}
Use re.match. It's more flexible:
>>> m = re.match(r'{(.+)/(.+)}', string)
>>> m.groups()
('1', '100')
>>> new_string = "\\frac{%s}{%s}"%m.groups()
>>> print new_string
\frac{1}{100}

How can I remove all the punctuations from a string?

for removing all punctuations from a string, x.
i want to use re.findall(), but i've been struggling to know what to write in it..
i know that i can get all the punctuations by writing:
import string
y = string.punctuation
but if i write:
re.findall(y,x)
it says:
raise error("multiple repeat")
sre_constants.error: multiple repeat
can someone explain what exactly we should write in re.findall function?
You may not even need RegEx for this. You can simply use translate, like this
import string
print data.translate(None, string.punctuation)
Several characters in string.punctuation have special meaning in regular expression. They should be escaped.
>>> import re
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> import re
>>> re.escape(string.punctuation)
'\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\#\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~'
And if you want to match any one of them, use character class ([...])
>>> '[{}]'.format(re.escape(string.punctuation))
'[\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\#\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~]'
>>> import re
>>> pattern = '[{}]'.format(re.escape(string.punctuation))
>>> re.sub(pattern, '', 'Hell,o World.')
'Hello World'

Categories

Resources