Replace a substring between two substrings - python

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.

You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Demo
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any string (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position

You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
See the Python demo online
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement pattern contains no backslashes, there is no need preprocessing (escaping) anything in it.

This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookahead and lookback assertions, ?<= and ?=. A match only occurs if a string is preceded and followed by the assertions in the pattern, but does not consume them. Meaning that re.sub looks for a string with page1/ in front and _type behind it, but only replaces the part in between.

Related

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))
How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

How to extract substring by not including alternate text using Python regex

I have the following two strings:
various_data/hmsc_proximal_distal/BB_152.HPMSC.distal.tss_ext500bp.narrowPeak
various_data/hmsc_proximal_distal/BB_147.HMSC-he.proximal.tss_ext500bp.narrowPeak
What I want to do is to capture:
BB_152.HPMSC
BB_147.HMSC-he
Why this regex failed:
.*\/([A-Z\_0-9\.\-a-z]+)\.[proximal|distal]
by giving;
BB_152.HPMSC.distal
BB_147.HMSC-he.proximal
What's the right way to do it?
You can use (?=... to form a lookahead group
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
import re
s = '''
various_data/hmsc_proximal_distal/BB_152.HPMSC.distal.tss_ext500bp.narrowPeak
various_data/hmsc_proximal_distal/BB_147.HMSC-he.proximal.tss_ext500bp.narrowPeak
'''
re.findall(r"([^/]*)\.(?=proximal|distal)", s)
yields
['BB_152.HPMSC', 'BB_147.HMSC-he']
The regex should be
.*\/([A-Z\_0-9\.\-a-z]+)\.(?:proximal|distal)
[] is a set of characters for one position, you have to use round brackets.
The solution using re.findall() function:
import re
s = '''
various_data/hmsc_proximal_distal/BB_152.HPMSC.distal.tss_ext500bp.narrowPeak
various_data/hmsc_proximal_distal/BB_147.HMSC-he.proximal.tss_ext500bp.narrowPeak
'''
result = re.findall(r'[A-Z]{2}_\d+\.[a-zA-Z-]+(?=\.proximal|\.distal)', s)
print(result)
The output:
['BB_152.HPMSC', 'BB_147.HMSC-he']
(?=\.proximal|\.distal) - lookahead positive assertion, ensures that crucial sequence is followed by either .proximal or .distal

Stripping variable borders with python re

How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = '''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
? so that for [[go]] first [a-zA-Z]* will match empty (shortest) string and second will get actual go string
\1 substitutes first (in this case the only) match group in a pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1
well first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = '(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> [('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and separates the whole match as group 1 and just the final word as group 2. You can than use \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which gets rid of grouping the entire match as group 1 and only groups what you want. you can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[ \]\] #literal matches of brackets
(?: )* #non-capturing group that can match 0 or more of whats inside
[a-zA-Z]*\| #matches any word that is followed by a '|' character
( ... ) #captures into group one the final word
I feel like this is stronger than what you originally had because it will also change if there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'

Add [] around numbers in strings

I like to add [] around any sequence of numbers in a string e.g
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but its eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar
Here the (\d+) will match any combinations of digits and re.sub function will replace it with the first group match within brackets r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.

Categories

Resources