Updating a string using regular expressions in Python

I'm pretty sure that my question is very straightforward but I cannot find the answer to it. Let's say we have an input string like:
input = "This is an example"
Now I want to replace every word in the input with another string that includes the original match. ("Word" is just an example; in general it could be any substring matched by a regular expression.) For instance, I want to add a # to the left and right of every word in input. The output would be:
output = "#This# #is# #an# #example#"
What is the solution? I know how to use re.sub or replace, but I do not know how I can use them in a way that I can update the original matched strings and not completely replace them with something else.

You can use capture groups for that.
import re
input = "This is an example"
output = re.sub(r"(\w+)", r"#\1#", input)
A capture group is something that you can later reference, for example in the substitution string. In this case, I'm matching a word, putting it into a capture group and then replacing it with the same word, but with # added as a prefix and a suffix.
You can read about regexps in python more in the docs.
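As a small variation on the capture-group idea, here is a sketch of two other ways to reuse the matched text in the replacement: the \g<0> reference, which stands for the whole match, and a replacement function. The variable names are only illustrative.

```python
import re

text = "This is an example"

# \g<0> stands for the entire match, so no explicit group is needed.
with_backref = re.sub(r"\w+", r"#\g<0>#", text)

# A callable receives each match object and returns the replacement text.
with_function = re.sub(r"\w+", lambda m: "#" + m.group(0) + "#", text)

print(with_backref)   # #This# #is# #an# #example#
print(with_function)  # #This# #is# #an# #example#
```

The function form is handy when the replacement needs logic that a template string cannot express, such as changing case or looking something up.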

Here is an option using re.sub with lookarounds:
import re
input = "This is an example"
output = re.sub(r'(?<!\w)(?=\w)|(?<=\w)(?!\w)', '#', input)
print(output)
#This# #is# #an# #example#

Here is a version without the re library:
a = "This is an example"
l = []
for i in a.split(" "):
    l.append('#' + i + '#')
print(" ".join(l))

You can match only word boundaries with \b:
import re
input = "This is an example"
output = re.sub(r'\b', '#', input)
print(output)
#This# #is# #an# #example#

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
    input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
    input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
See this Python demo.
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
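To make the brace doubling easier to see, here is a minimal sketch that wraps the same f-string pattern in a helper. The name remove_runs is hypothetical, and k is assumed to be an integer of at least 2.

```python
import re

def remove_runs(text, k):
    """Remove every run of k identical consecutive characters (single pass)."""
    # {{ and }} are literal braces; {k - 1} is the interpolated repeat count.
    pattern = fr"(.)\1{{{k - 1}}}"
    return re.sub(pattern, "", text)

print(fr"(.)\1{{{3 - 1}}}")           # (.)\1{2}
print(remove_runs("abccbcbbb", 3))    # abccbc
```

Note that re.sub makes one left-to-right pass, so runs that only become adjacent after a removal are not removed again.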
If I were you, I would prefer the approach suggested above. But since I've already spent time on this question, here is my handmade solution.
The pattern below defines a named group, "letter", that captures one character at a time: first a, then b, and so on. A lookahead then checks whether that character is immediately followed by one or more copies of itself.
So every letter that is followed by a copy of itself is matched and replaced with an empty string, which collapses each run of repeated letters to a single letter.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only sequential duplicate characters? Is there a reason you need to use regex? You could likely do it with a simple loop. This is a really quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    # keep a letter only when it differs from the one before it
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
With a method like this, I'd assume you could make it easier to read and faster with a bit of modification.
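One possible tidier version of the same idea, sketched with itertools.groupby from the standard library, which groups consecutive equal items. The function name collapse_repeats is just an illustration.

```python
from itertools import groupby

def collapse_repeats(text):
    """Keep one character from each run of identical consecutive characters."""
    # groupby yields (key, group) pairs, one pair per run of equal characters.
    return "".join(key for key, _group in groupby(text))

print(collapse_repeats("abccbcbbb"))  # abcbcb
```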

How to get all the string after and before two specific words?

I want to replace everything after "my;encoded;image:" (which is the base64 data of the image) and stop before the word "END", but the following code also replaces the two strings "my;encoded;image:" and "END" themselves. Any suggestions?
import re
re.sub("my;encoded;image:.*END","random_words",image,flags=re.DOTALL)
NB: a simple way would be to use a plain replacement, but I want to use regex in my case. Thanks.
You can use a non-greedy regex to split the string into three groups. Then replace the second group with your string:
import re
# re.DOTALL lets . match newlines, in case the base64 data spans several lines
x = re.sub(r'(.*my;encoded;image:)(.*?)(END.*)', r"\1my string\3", image, flags=re.DOTALL)
print(x)
You can use f-strings with Python 3.6 and higher:
replacement = "hello"
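# \g<1> instead of \1 keeps the group reference unambiguous if replacement starts with a digit
x = re.sub(r'(.*my;encoded;image:)(.*?)(END.*)', fr'\g<1>{replacement}\g<3>', image, flags=re.DOTALL)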
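Another way to leave the two markers untouched is to match only the text between them using a lookbehind and a lookahead. The sketch below assumes the markers are fixed strings (Python lookbehinds must be fixed-width); the helper name replace_between and the sample text are made up for illustration.

```python
import re

def replace_between(text, replacement, start="my;encoded;image:", end="END"):
    """Replace the text between the first start marker and the next end marker."""
    pattern = re.compile(
        "(?<=" + re.escape(start) + ").*?(?=" + re.escape(end) + ")",
        flags=re.DOTALL,  # let . cross line breaks inside the data
    )
    # A function replacement is inserted literally, so backslashes or digits
    # in `replacement` are not interpreted as group references.
    return pattern.sub(lambda m: replacement, text, count=1)

sample = "header my;encoded;image:QUJD\nREVGEND footer"
print(replace_between(sample, "123"))  # header my;encoded;image:123END footer
```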

Extract a portion of string from another string using regex

Let's assume I have a string as follows:
s = '23092020_indent.xlsx'
I want to extract only indent from the above string. Now there are many approaches:
import re
# Via re.split()
s_f = re.split('_ |. ', s)  # returns only the original string, not the desired output
# Via re.findall()
s_f = re.findall(r'[A-Za-z]', s, re.I)
s_f
['i','n','d','e','n','t','x','l','s','x']
s_f = ''.join(s_f)  # returns 'indentxlsx', not the desired output
Am I missing anything? Or do I need to use regex at all?
P.S. In s, only the '.' delimiter is constant. All the other delimiters may change.
Use os.path.splitext and then str.split:
import os
name, ext = os.path.splitext(s)
name.split("_")[1] # If the position is always fixed
Output:
"indent"
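The same idea can be sketched with pathlib, assuming (as above) that the wanted part is whatever follows the first underscore in the file name without its extension:

```python
from pathlib import Path

s = "23092020_indent.xlsx"

# .stem drops the final extension: "23092020_indent"
stem = Path(s).stem

# partition splits at the first underscore only
prefix, _sep, label = stem.partition("_")

print(label)  # indent
```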
I LOVE regexes, so that's definitely the way I'd go.
The exactly right answer requires more information as to all possible input strings and what the right thing to extract is for each of them. Here's a solution that assumes:
1. one or more digits, then
2. a single underscore, then
3. a group of chars not containing a '.', then
4. a '.', then
5. anything besides a '.', but at least one char
Part #3 is captured.
import re
s = '23092020_indent.xlsx'
exp = re.compile(r"^\d+_(.*?)\.[^.]+$")
m = exp.match(s)
if m:
    print(m.group(1))
Result:
indent

how to match a pattern and add a character to it

I have something like:
GCF_002904975:2.6672e-05):2.6672e-05.
and I would like to add the suffix '_S' right after every GCF_<number> entry, before the next colon.
In other words, I would like my text to become:
GCF_002904975_S:2.6672e-05):2.6672e-05.
This pattern is repeated throughout my text.
This can be easily done with re.sub function. A working example would look like this:
import re
inp_string='(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,(GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
if __name__ == "__main__":
    outp_string = re.sub(r'GCF_(?P<gfc_number>\d+)\:', r'GCF_\g<gfc_number>_S:', inp_string)
    print(outp_string)
This code gives the following result, which is hopefully what you need:
(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,(GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
For more info take a look at the docs:
https://docs.python.org/3/library/re.html
You can use regular expressions with a function substitution. The solution below depends on the numbers always being 9 digits, but could be modified to work with other cases.
import re
test_str = '(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
new_str = re.sub(r"GCF_\d{9}", lambda x: x.group(0) + "_S", test_str)
print(new_str)
#(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
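If the numbers are not always 9 digits, one possible modification is to match any run of digits and require, via a lookahead, that a colon follows. This is only a sketch; the helper name add_suffix and the shortened sample string are illustrative.

```python
import re

def add_suffix(text, suffix="_S"):
    """Append suffix to every GCF_<digits> entry that is directly followed by a colon."""
    # (?=:) checks for the colon without consuming it, so it stays in place.
    return re.sub(r"(GCF_\d+)(?=:)", lambda m: m.group(1) + suffix, text)

sample = "(GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136"
print(add_suffix(sample))
# (GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136
```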
Why not just do a replace? Shortening your example string to make it easier to read:
"(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)...".replace(":","_S:")
Note that this puts _S before every colon, including colons that do not follow a GCF entry (such as the one in )0.924:0.060046136 in the full string), so it is only safe when every colon in the text follows a GCF entry.

How to use a regex variable in a regular expression?

I am using the following pattern to clean a piece of text (replacing the matches with an empty string):
{\s{\s\"[A-Za-z0-9.,\-:]*(?<!\bbecause\b)(?<!\bsince\b)\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}
I have a list of relators like "because" and "since" that could change every time. So I created a separate string which is a regex itself like:
lookahead_string = (?<!\bbecause\b)(?<!\bsince\b)
And put it in my original regex pattern and changed it like the following:
{\s{\s\"[A-Za-z0-9.,\-:]*'+lookahead_string+r'\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}
But the new pattern does not match the parts of the input text that could be matched using the original regex pattern. The code I am using is:
lookahead_string = ''
relators = ["because", "since"]
for rel in relators:
    lookahead_string += '(?<!\b'+rel+'\b)'
text = re.sub(r'{\s{\s\"[A-Za-z0-9.,\-:]*'+lookahead_string+r'\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}', "", text)
text = ' '.join(text.split())
What should I do to make it work?! I have already tried using re.escape and format string but none of them works in my case.
Edit: I removed the input and output text because I thought it was a little confusing. However, I thank @DYZ for the good suggestion.
A suggestion: Instead of messing with the complex string syntax, convert the string to a Python list.
import ast
l = ast.literal_eval("[" + s.replace("}", "],").replace("{", "[") + "]")
#[[[[['I'], 'PRP'], 'NP'], [[[[['did'], 'VBD'], [['not'], 'RB'], 'VP'],
# ..., 'S'], '']
Now you can apply simple list functions to your data and, when done, transform the list to a bracketed string.
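Separately from the list approach, one likely reason the concatenated pattern in the question stops matching is that '(?<!\b' is written as a normal string literal, where \b means a backspace character rather than a regex word boundary. A minimal sketch of building the lookbehinds from raw strings follows; the final pattern is a simplified stand-in for the one in the question, not the full original.

```python
import re

# In a normal string literal "\b" is a backspace (one character);
# in a raw string it stays as backslash + b (two characters).
print(len("\b"), len(r"\b"))  # 1 2

relators = ["because", "since"]

# Raw f-strings keep \b intact so the regex engine sees a word boundary.
lookbehinds = "".join(rf"(?<!\b{re.escape(rel)}\b)" for rel in relators)
print(lookbehinds)  # (?<!\bbecause\b)(?<!\bsince\b)

# Simplified stand-in pattern: a quoted token that does not end in a relator.
pattern = re.compile(r'"[A-Za-z0-9.,\-:]*' + lookbehinds + r'"')

print(bool(pattern.search('"therefore"')))  # True
print(bool(pattern.search('"because"')))    # False
```

Because Python lookbehinds must be fixed-width, this works as long as each relator is a plain literal word.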
