The problem of regex strings containing special characters in python

The problem of regex strings containing special characters in python - python

I have a string: "s = string.charAt (0) == 'd'"
I want to retrieve a tuple of ('0', 'd')
I have used: re.search(r "\ ((. ?) \) == '(.?)' && "," string.charAt (0) == 'd' ")
I checked the s variable when printed as "\\ ((.?) \\) == '(.?) '&& "
How do I fix it?

You may try:
\((\d+)\).*?'(\w+)'
Explanation of the above regex:
\( - Matches a ( literally.
(\d+) - Represents the first capturing group matching digits one or more times.
\) - Matches a ) literally.
.*? - Lazily matches everything except a new-line.
'(\w+)' - Represents second capturing group matching ' along with any word character([0-9a-zA-Z_]) one or more times.
Regex Demo
import re
regex = r"\((\d+)\).*?'(\w+)'"
test_str = "s = string.charAt (0) == 'd'"
print(re.findall(regex, test_str))
# Output: [('0', 'd')]
You can find the sample run of the above implementation in here.

Your regular expression should be ".*\((.?)\) .* '(.?)\'". This will get both the character inside the parenthesis and then the character inside the single quotes.
>>> import re
>>> s = " s = string.charAt (0) == 'd'"
>>> m = re.search(r".*\((.?)\) .* '(.?)'", s)
>>> m.groups()
('0', 'd')

Use
\((.*?)\)\s*==\s*'(.*?)'
See proof. The first variable is captured inside Group 1 and the second variable is inside Group 2.
Python code:
import re
string = "s = string.charAt (0) == 'd'"
match_data = re.search(r"\((.*?)\)\s*==\s*'(.*?)'", string)
if match_data:
print(f"Var#1 = {match_data.group(1)}\nVar#2 = {match_data.group(2)}")
Output:
Var#1 = 0
Var#2 = d

Thanks everyone for the very helpful answer. My problem has been solved ^^

Related

Extract substring from a python string

I want to extract the string before the 9 digit number below:
tmp = place1_128017000_gw_cl_mask.tif
The output should be place1
I could do this:
tmp.split('_')[0] but I also want the solution to work for:
tmp = place1_place2_128017000_gw_cl_mask.tif where the result would be:
place1_place2
You can assume that the number will also be 9 digits long

Using regular expressions and the lookahead feature of regex, this is a simple solution:
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'.+(?=_\d{9}_)', tmp)
print(m.group())
Result:
place1_place2
Note that the \d{9} bit matches exactly 9 digits. And the bit of the regex that is in (?= ... ) is a lookahead, which means it is not part of the actual match, but it only matches if that follows the match.

Assuming we can phrase your problem as wanting the substring up to, but not including the underscore which is followed by all numbers, we can try:
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'^([^_]+(?:_[^_]+)*)_\d+_', tmp)
print(m.group(1)) # place1_place2

Use a regular expression:
import re
places = (
"place1_128017000_gw_cl_mask.tif",
"place1_place2_128017000_gw_cl_mask.tif",
)
pattern = re.compile("(place\d+(?:_place\d+)*)_\d{9}")
for p in places:
matched = pattern.match(p)
if matched:
print(matched.group(1))
prints:
place1
place1_place2
The regex works like this (adjust as needed, e.g., for less than 9 digits or a variable number of digits):
( starts a capture
place\d+ matches "places plus 1 to many digits"
(?: starts a group, but does not capture it (no need to capture)
_place\d+ matches more "places"
) closes the group
* means zero or many times the previous group
) closes the capture
\d{9} matches 9 digits
The result is in the first (and only) capture group.

Here's a possible solution without regex (unoptimized!):
def extract(s):
result = ''
for x in s.split('_'):
try: x = int(x)
except: pass
if isinstance(x, int) and len(str(x)) == 9:
return result[:-1]
else:
result += x + '_'
tmp = 'place1_128017000_gw_cl_mask.tif'
tmp2 = 'place1_place2_128017000_gw_cl_mask.tif'
print(extract(tmp)) # place1
print(extract(tmp2)) # place1_place2

Match if string starts with n digits and no more

I have strings like 6202_52_55_1959.txt
I want to match those which starts with 3 digits and no more.
So 6202_52_55_1959.txt should not match but 620_52_55_1959.txt should.
import re
regexp = re.compile(r'^\d{3}')
file = r'6202_52_55_1959.txt'
print(regexp.search(file))
<re.Match object; span=(0, 3), match='620'> #I dont want this example to match
How can I get it to only match if there are three digits and no more following?

Use a negative lookahead:
regexp = re.compile(r'^\d{3}(?!\d)')

Use the pattern ^\d{3}(?=\D):
inp = ["6202_52_55_1959.txt", "620_52_55_1959.txt"]
for i in inp:
if re.search(r'^\d{3}(?=\D)', i):
print("MATCH: " + i)
else:
print("NO MATCH: " + i)
This prints:
NO MATCH: 6202_52_55_1959.txt
MATCH: 620_52_55_1959.txt
The regex pattern used here says to match:
^ from the start of the string
\d{3} 3 digits
(?=\D) then assert that what follows is NOT a digit (includes end of string)

Can't get regex patterns right

I made a function that replaces multiple instances of a single character with multiple patterns depending on the character location.
There were two ways I found to accomplish this:
This one looks horrible but it works:
def xSubstitution(target_string):
while target_string.casefold().find('x') != -1:
x_finded = target_string.casefold().find('x')
if (x_finded == 0 and target_string[1] == ' ') or (target_string[x_finded-1] == ' ' and
((target_string[-1] == 'x' or 'X') or target_string[x_finded+1] == ' ')):
target_string = target_string.replace(target_string[x_finded], 'ecks', 1)
elif (target_string[x_finded+1] != ' '):
target_string = target_string.replace(target_string[x_finded], 'z', 1)
else:
target_string = target_string.replace(target_string[x_finded], 'cks', 1)
return(target_string)
This one technically works, but I just can't get the regex patterns right:
import re
def multipleRegexSubstitutions(sentence):
patterns = {(r'^[xX]\s'): 'ecks ', (r'[^\w]\s?[xX](?!\w)'): 'ecks',
(r'[\w][xX]'): 'cks', (r'[\w][xX][\w]'): 'cks',
(r'^[xX][\w]'): 'z',(r'\s[xX][\w]'): 'z'}
regexes = [
re.compile(p)
for p in patterns
]
for regex in regexes:
for match in re.finditer(regex, sentence):
match_location = sentence.casefold().find('x', match.start(), match.end())
sentence = sentence.replace(sentence[match_location], patterns.get(regex.pattern), 1)
return sentence
From what I figured it out, the only problem in the second function is the regex patterns. Could someone help me?
EDIT: Sorry I forgot to tell that the regexes are looking for the different x characters in a string, and replace an X in the beggining of a word for a 'Z', in the middle or end of a word for 'cks' and if it is a lone 'x' char replace with 'ecks'

You need \b (word boundary) and \B (position other than word boundary):
Replace an X in the beggining of a word for a 'Z'
re.sub(r'\bX\B', 'Z', s, flags=re.I)
In the middle or end of a word for 'cks'
re.sub(r'\BX', 'cks', s, flags=re.I)
If it is a lone 'x' char replace with 'ecks'
re.sub(r'\bX\b', 'ecks', s, flags=re.I)

I would just use the following set of substitutions for this:
string = re.sub(r"\b[Xx]\b", "ecks", string)
string = re.sub(r"\b[Xx](?!\s)", "Z", string)
string = re.sub(r"(?<=\w)[Xx](?=\w)", "cks", string)
Here,
(?!\s)
just asserts that the regex does not match any whitespace character,
\b
Edit: The last regex would also match x or X at the beginning of a word. So we can use the following, instead,
(?<=\w)[xX](?=\w)
to make sure there must be a character \w before/after x or X.

Replace subtext of a word

I want to replace this string
ramesh#gmail.com
to
rxxxxh#gxxxl.com
this is what I have done so far
print( re.sub(r'([A-Za-z](.*)[A-Za-z]#)','x', i))

One way to go is to use capturing groups and in the replacement for the parts that should be replaced with x return a repetition for number of characters in the matched group.
For the second and the fourth group use a negated character class [^ matching any char except the listed.
\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b
Regex demo | Python demo
For example
import re
i = "ramesh#gmail.com"
res = re.sub(
r'\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b',
lambda x: x.group(1) + "x" * len(x.group(2)) + x.group(3) + "x" * len(x.group(4)) + x.group(5),
i)
print(res)
Output
rxxxxh#gxxxl.com

How to use regex to only keep first n repeated words

If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and what I want to do is to only keep the first three replica for any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with re or regex module in python?

One option could be to use a capturing group with a backreference and use that in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
( Capture group 1
(\w+) capture group 2, match 1+ word chars (The example data only uses word characters. To make sure they are no part of a larger word use a word boundary \b)
(?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs or use \s+ to match 1+ whitespace chars. (Note that that would also match a newline)
) Close group 1
(?: \2)* Match 0+ times a space and a backreference to group 2 to match the same words that you want to remove
Regex demo | Python demo
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
print (result)
Result
ok ok, it is very very very hard

You can group a word and use a backreference to refer to it to ensure that it repeats for more than 2 times:
import re
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', input))
This outputs:
ok ok, it is very very very hard

One solution with re.sub with custom function:
s = 'ok ok, it is very very very very very hard'
def replace(n=3):
last_word, cnt = '', 0
current_word = yield
while True:
if last_word == current_word:
cnt += 1
else:
cnt = 0
last_word = current_word
if cnt >= n:
current_word = yield ''
else:
current_word = yield current_word
import re
replacer = replace()
next(replacer)
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

The problem of regex strings containing special characters in python - python

I have a string: "s = string.charAt (0) == 'd'" I want to retrieve a tuple of ('0', 'd') I have used: re.search(r "\ ((. ?) \) == '(.?)' && "," string.charAt (0) == 'd' ") I checked the s variable when printed as "\\ ((.?) \\) == '(.?) '&& " How do I fix it?

Your regular expression should be ".\((.?)\) . '(.?)\'". This will get both the character inside the parenthesis and then the character inside the single quotes. >>> import re >>> s = " s = string.charAt (0) == 'd'" >>> m = re.search(r".\((.?)\) . '(.?)'", s) >>> m.groups() ('0', 'd')

Thanks everyone for the very helpful answer. My problem has been solved ^^

Related

Extract substring from a python string

Match if string starts with n digits and no more

Can't get regex patterns right

Replace subtext of a word

How to use regex to only keep first n repeated words

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

The problem of regex strings containing special characters in python - python

I have a string: "s = string.charAt (0) == 'd'" I want to retrieve a tuple of ('0', 'd') I have used: re.search(r "\ ((. ?) \) == '(.?)' && "," string.charAt (0) == 'd' ") I checked the s variable when printed as "\\ ((.?) \\) == '(.?) '&& " How do I fix it?

Your regular expression should be ".*\((.?)\) .* '(.?)\'". This will get both the character inside the parenthesis and then the character inside the single quotes. >>> import re >>> s = " s = string.charAt (0) == 'd'" >>> m = re.search(r".*\((.?)\) .* '(.?)'", s) >>> m.groups() ('0', 'd')

Thanks everyone for the very helpful answer. My problem has been solved ^^

Related

Extract substring from a python string

Match if string starts with n digits and no more

Can't get regex patterns right

Replace subtext of a word

How to use regex to only keep first n repeated words

Categories

Resources

Your regular expression should be ".\((.?)\) . '(.?)\'". This will get both the character inside the parenthesis and then the character inside the single quotes. >>> import re >>> s = " s = string.charAt (0) == 'd'" >>> m = re.search(r".\((.?)\) . '(.?)'", s) >>> m.groups() ('0', 'd')