Regex match fails in Python, but works on regex101

The pattern seems to work when I test it at http://regex101.com/r/oU6eI5/1 , but when I run it in Python it appears to match the whole string.
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', "\1", str)
I want to get "tweqt/".

You need to use a raw string for the replacement:
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', r"\1", str)
#                        ^
Otherwise, "\1" is the escaped character \x01 rather than a reference to group 1. On my console, for instance, it shows up as a little smiley.
If for some reason you don't want to use a raw string, you'll have to escape the backslash:
re.sub('^.+/+([^/]+/$)', "\\1", str)
It is also good practice to write regex patterns as raw strings and to use consistent quotes, so I would advise using:
re.sub(r'^.+/+([^/]+/$)', r'\1', str)
Other notes
It might be simpler to match (using re.search) instead of using re.sub:
re.search(r'[^/]+/$', str).group()
# => tweqt/
And you might want to use a variable name other than str, because assigning to str shadows the built-in str().
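To see the three replacement spellings side by side, here is a minimal sketch. The variable names (path, pattern, plain, raw, escaped, found) are only illustrative and do not come from the question.

```python
import re

path = "galley/files/tew/tewt/tweqt/"
pattern = r'^.+/+([^/]+/$)'

# "\1" in a normal string literal is the single control character chr(1),
# so the whole match is replaced by that one character.
plain = re.sub(pattern, "\1", path)

# Both of these hand a backslash followed by "1" to re.sub,
# which treats it as a reference to group 1.
raw = re.sub(pattern, r"\1", path)
escaped = re.sub(pattern, "\\1", path)

# re.search with a simpler pattern avoids the replacement string entirely.
found = re.search(r'[^/]+/$', path).group()

print(repr(plain))          # '\x01'
print(raw, escaped, found)  # tweqt/ tweqt/ tweqt/
```

The regex matched the whole string in every case; only the replacement text differed.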

It would be better to define both the pattern and the replacement as raw strings.
>>> import re
>>> s = "galley/files/tew/tewt/tweqt/"
>>> m = re.sub(r'^.+/+([^/]+/$)', r'\1', s)
               ^                  ^
>>> m
'tweqt/'

Related

Python regex not working with special characters

SOLVED: it replaced the " symbols in the file with ' (in the data strings)
Do you know a way to only search for 1 or more words (not numbers) between [" and \n?
This works on regexr.com, but not in Python:
https://regexr.com/3tju7
(?<=\[\")(\D+)(?=\\n)
"S": ["Something\n13/8-2018 09:00 to 11:30
Python code:
re.search('(?<=[\")(\D+)(?=\n)', str(data))
I think \[, \" and \\n are the problem. I have tried to use a raw string in Python:
re.search('(?<=\[\")(\D+)(?=\\n)', '"S": ["Something\n13/8-201809:00 to 11:30').group()
This worked, but I have to use "data" because I have multiple strings, and it won't let me call .group() on that.
Error: AttributeError: 'NoneType' object has no attribute 'group'

Your problem is that the \n is being interpreted as a newline, instead of the literal characters \ and n. You can use a simpler regex, \["([\w\s]+)$, along with the MULTILINE flag, without modifying the data.
>>> import re
>>> data = '"S": ["Something\n13/8-201809:00 to 11:30'
>>> pattern = r'\["([\w\s]+)$'
>>> m = re.search(pattern, data, re.MULTILINE)
>>> m.group(1)
'Something'
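The difference between a real newline and the two characters \ and n can be checked directly. The following is only a sketch: the names with_newline, with_backslash_n and first_match are made up for illustration, and the helper returns None instead of raising the AttributeError from the question.

```python
import re

# One string contains a real newline, the other a backslash followed by "n".
with_newline = '"S": ["Something\n13/8-2018 09:00 to 11:30'
with_backslash_n = r'"S": ["Something\n13/8-2018 09:00 to 11:30'

# In a raw pattern, \n means "newline" and \\n means "backslash, then n".
newline_pattern = re.compile(r'(?<=\[")(\D+)(?=\n)')
literal_pattern = re.compile(r'(?<=\[")(\D+)(?=\\n)')


def first_match(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(first_match(newline_pattern, with_newline))      # Something
print(first_match(literal_pattern, with_backslash_n))  # Something
print(first_match(literal_pattern, with_newline))      # None
```

Checking the result of search before calling .group() is what avoids the 'NoneType' error when a pattern and its data disagree about which kind of \n is present.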

Try putting an r before the string that holds the pattern; this marks the string as "raw". It stops Python from interpreting escape sequences before the string is passed to the function:
re.search(r'\search', string)
Or:
rgx = re.compile(r'pattern')
rgx.search(string)
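A small sketch of what the r prefix changes, using \b as the example because it means "backspace" in a normal string literal but "word boundary" in a regex. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'a cat sat'

# Without r, '\b' is one character (backspace); with r, it is two
# characters (backslash and "b"), which the regex engine reads as a word boundary.
plain_len = len('\b')
raw_len = len(r'\b')

no_raw = re.search('\bcat\b', text)     # looks for backspace + "cat" + backspace
with_raw = re.search(r'\bcat\b', text)  # looks for the whole word "cat"

print(plain_len, raw_len)  # 1 2
print(no_raw)              # None
print(with_raw.group())    # cat
```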

Replace captured groups with empty string in python

I currently have a string similar to the following:
str = 'abcHello Wor=A9ld'
What I want to do is find the 'abc' and '=A9' and replace these matched groups with an empty string, such that my final string is 'Hello World'.
I am currently using this regex, which is correctly finding the groups I want to replace:
r'^(abc).*?(=[A-Z0-9]+)'
I have tried to replace these groups using the following code:
clean_str = re.sub(r'^(abc).*?(=[A-Z0-9]+)', '', str)
Using the above code has resulted in:
print(clean_str)
>>> 'ld'
My question is, how can I use re.sub to replace these groups with an empty string and obtain my 'Hello World'?

Capture everything else and put those groups in the replacement, like so:
re.sub(r'^abc(.*?)=[A-Z0-9]+(.*)', r'\1\2', s)

This worked for me.
re.sub(r'^(abc)(.*?)(=[A-Z0-9]+)(.*?)$', r"\2\4", str)
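As a quick check of why the original call returned 'ld' and the capturing version does not, here is a short sketch (the variable names are illustrative). re.sub replaces the entire match, not just the groups, so everything that should survive has to be captured and put back.

```python
import re

s = 'abcHello Wor=A9ld'

# The whole match "abcHello Wor=A9" is replaced by '', leaving only "ld".
whole_match_removed = re.sub(r'^(abc).*?(=[A-Z0-9]+)', '', s)

# Capturing the parts to keep and re-emitting them preserves the text.
kept = re.sub(r'^abc(.*?)=[A-Z0-9]+(.*)', r'\1\2', s)

# A string that does not start with "abc" is returned unchanged.
untouched = re.sub(r'^abc(.*?)=[A-Z0-9]+(.*)', r'\1\2', 'Hello Wor=A9ld')

print(whole_match_removed)  # ld
print(kept)                 # Hello World
print(untouched)            # Hello Wor=A9ld
```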

Is there a way that I can .. ensure that abc is present, otherwise don't replace the second pattern?
I understand that you need to first check if the string starts with abc, and if yes, remove the abc and all instances of =[0-9A-Z]+ pattern in the string.
I recommend:
import re
s = "abcHello wo=A9rld"
if s.startswith('abc'):
    print(re.sub(r'=[A-Z0-9]+', '', s[3:]))
Here, if s.startswith('abc'): checks if the string has abc in the beginning, then s[3:] truncates the string from the start removing the abc, and then re.sub removes all non-overlapping instances of the =[A-Z0-9]+ pattern.
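The same check can be wrapped in a small function so that strings without the abc prefix come back unchanged. This is only a sketch using the standard library; the function name strip_markers is hypothetical.

```python
import re


def strip_markers(text):
    """Drop a leading 'abc' and every =[A-Z0-9]+ run, but only if 'abc' is present."""
    if text.startswith('abc'):
        return re.sub(r'=[A-Z0-9]+', '', text[3:])
    return text


print(strip_markers('abcHello Wor=A9ld=B56'))  # Hello World
print(strip_markers('Hello Wor=A9ld'))         # Hello Wor=A9ld
```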
Note that you may use the PyPI regex module to do the same with one regex:
import regex
r = regex.compile(r'^abc|(?<=^abc.*?)=[A-Z0-9]+', regex.S)
print(r.sub('', 'abcHello Wor=A9ld=B56')) # Hello World
print(r.sub('', 'Hello Wor=A9ld')) # => Hello Wor=A9ld
Here,
^abc - abc at the start of the string only
| - or
(?<=^abc.*?) - check if there is abc at the start of the input and then any number of chars (including line break chars, because of the regex.S flag) immediately to the left of the current location
=[A-Z0-9]+ - a = followed with 1+ uppercase ASCII letters/digits.

This is a naïve approach but why can't you use replace twice instead of regex, like this:
str = str.replace('abc','')
str = str.replace('=A9','')
print(str) #'Hello World'

Python converting string to latex using regular expression

Say I have a string
string = "{1/100}"
I want to use regular expressions in Python to convert it into
new_string = "\frac{1}{100}"
I think I would need to use something like this
new_string = re.sub(r'{.+/.+}', r'', string)
But I'm stuck on what I would put in order to preserve the characters in the fraction, in this example 1 and 100.

You can use () to capture the numbers. Then use \1 and \2 to refer to them:
new_string = re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', string)
# \frac{1}{100}
Note: Don't forget to escape the backslash \\.

Capture the numbers using parens and then reference them in the replacement text using \1 and \2. For example:
>>> print(re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', "{1/100}"))
\frac{1}{100}

Anything inside the braces would be a number/number. So in the regex place numbers([0-9]) instead of a .(dot).
>>> import re
>>> string = "{1/100}"
>>> new = re.sub(r'{([0-9]+)/([0-9]+)}', r'\\frac{\1}{\2}', string)
>>> print(new)
\frac{1}{100}
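The choice between .+ and [0-9]+ starts to matter once a string holds more than one fraction. The sketch below uses a made-up input with two fractions and escapes the braces in the pattern for clarity; neither detail comes from the question.

```python
import re

text = "{1/2} + {3/4}"

# Greedy .+ runs to the last "/" and the last "}", so the two fractions
# are swallowed by a single match.
greedy = re.sub(r'\{(.+)/(.+)\}', r'\\frac{\1}{\2}', text)

# Restricting the groups to digits converts each fraction separately.
digits = re.sub(r'\{([0-9]+)/([0-9]+)\}', r'\\frac{\1}{\2}', text)

print(greedy)  # \frac{1/2} + {3}{4}
print(digits)  # \frac{1}{2} + \frac{3}{4}
```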

Use re.match. It's more flexible:
>>> m = re.match(r'{(.+)/(.+)}', string)
>>> m.groups()
('1', '100')
>>> new_string = "\\frac{%s}{%s}"%m.groups()
>>> print(new_string)
\frac{1}{100}

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile(r'\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print(result)
None
Alternatively, this works:
>>> char_regex = re.compile(r'\w')
>>> result = char_regex.match(string)
>>> print(result)
<re.Match object; span=(0, 1), match='t'>
Why does the second regex work, but not the first?

Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case the string doesn't have a digit (\d) at the beginning, but it does have a word character (\w), the t, at the beginning.
If you want to check for a digit in your string using the same mechanism, add .* to your regex:
digit_regex = re.compile(r'.*\d')

The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
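A short sketch comparing the three methods on the string from the question; the result variable names are illustrative.

```python
import re

text = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

at_start = digit_regex.match(text)      # only tries at position 0
anywhere = digit_regex.search(text)     # scans forward for the first match
all_digits = digit_regex.findall(text)  # collects every match as a list

print(at_start)          # None
print(anywhere.group())  # 4
print(all_digits)        # ['4']
```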

How is it possible to encode """ (triple quotes) into a raw string?

How do I encode """ in a raw python string?
The following does not seem to work:
string = r"""\"\"\""""
since when trying to match """ with a regular expression, I have to double-escape the character ":
Returns an empty list:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\"\"\")
""", re.S|re.X)
result = re.findall(regEx, string)
in this case result is an empty list.
This same regular expression returns ['"""'] when I load a string with """ from file content.
Returns double-escaped quotations:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\\"\\"\\")
""", re.S|re.X)
result = re.findall(regEx, string)
now result is equal to ['\\"\\"\\"'].
I want it to be equal to ['"""'].

In general, there are three options:
1. Don't use the r prefix. That's just a convenience to avoid excessive use of double-backslashes in regexes. It isn't required.
2. Use r'…', inside which the " character isn't special.
3. Mix and match r"…" and '…', e.g. pattern = '"""' + r"\s*\d\d-'\d\d'-\d\d\s*" + '"""'
In this instance, you can do both 1 and 2: single quotes and no r prefix.

The simplest way is to just do '"""'.
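The sketch below shows why the raw triple-quoted literal from the question never matches: it contains backslashes between the quotes, so there are no three consecutive " characters in it. The sample text and variable names are assumptions for illustration.

```python
import re

# A raw triple-quoted literal keeps the backslashes, so this string is
# six characters long: backslash, quote, backslash, quote, backslash, quote.
escaped_quotes = r"""\"\"\""""

# A plain single-quoted literal holds exactly three quote characters.
triple = '"""'

# Inside r'''...''' the " character needs no escaping, even in verbose mode.
verbose = re.compile(r'''
    (""")   # three consecutive double quotes
''', re.S | re.X)

text = 'x = """doc""" # end'

print(len(escaped_quotes), len(triple))  # 6 3
print(verbose.findall(text))             # ['"""', '"""']
print(verbose.findall(escaped_quotes))   # []
```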
