Deleting a string between repeated delimiters in python - python

I have to remove strings that start and end with "===" in Python (for example, replace the string "===Links===" with an empty string). The problem is that the delimiter can be three "=", four, or any number of '='. I tried re.sub('[=]*.*?[=]*', '', string), but when it is run on "===Refs===" it gives "Refs" instead of an empty string. Can you suggest something for this?

import re
string = '===Ref==='
pattern = r'^\=+.+\=+$'
string = re.sub(pattern, '', string)

Too late :-(
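import re
# named text rather than str, so the built-in str type is not shadowed
text = '===Links=== are great, but ===Refs=== bla bla == blub ===blub'
# raw string, so that \w reaches the regex engine unchanged
pattern = re.compile(r'=+\w+=+')
replaced = re.sub(pattern, '', text)
print(replaced)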

.? suggests that you are only accepting no or a single character between your =s. Try changing it to .* to match multiple characters between =s.
Perhaps you can use str.startswith() and str.endswith() to find out if the string starts/ends with ===?
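A minimal sketch of one way to handle "any number of =", assuming the opening and closing runs must have the same length and the title itself contains no '='. The names HEADING and strip_headings are made up for this illustration and do not come from the answers above.

```python
import re

# (=+) captures the opening run, \1 requires the same run to close it.
# The lookarounds stop the match from starting or ending inside a longer run of '='.
HEADING = re.compile(r'(?<!=)(=+)([^=]+)\1(?!=)')

def strip_headings(text):
    """Remove every =..=title=..= span whose '=' counts match on both sides."""
    return HEADING.sub('', text)

print(repr(strip_headings('===Refs===')))               # ''
print(repr(strip_headings('====Links==== are great')))  # ' are great'
print(repr(strip_headings('== blub ===blub')))          # unchanged, the runs do not balance
```

If unbalanced delimiters should also be removed, the simpler =+\w+=+ pattern from the answer above is the one to use instead.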

Related

Replace substring in python with matched after modify

In Python, I'm trying to replace a substring with a modified version of what was matched.
I have #ma{Z} and expect to replace it with #maZ.
line= '#ma{Z}'
re.sub(t=r'\#\{\w\}',t - 1 , line)
Thank you.
You don't need Regex for this, str.replace to replace { and } with null string would do:
In [855]: str_ = '#ma{Z}'
In [856]: str_.replace('{', '').replace('}', '')
Out[856]: '#maZ'
If you insist on using Regex, use a character class for { and }, and again replace with null string:
In [857]: re.sub(r'[{}]', '', str_)
Out[857]: '#maZ'
Edit based on comment:
As you actually want to remove the braces around Q in {<math>\\mathbb{Q}}, you can use \w+ to match one or more of alphanumerics or underscore and put the match in a captured group to refer it in the replacement with re.sub:
In [858]: str_ = '{<math>\\mathbb{Q}}'
In [859]: re.sub(r'\{(\w+)\}', r'\1', str_)
Out[859]: '{<math>\\mathbbQ}'
If you have patterns like Q, Z, E, it might be an option to use a character class [EQZ] (or to specify a range) to capture them in a group between the curly braces {} and then replace the match with the capturing group:
{([EQZ])}
import re
line = "#ma{Z}"
result = re.sub(r"{([EQZ])}", r"\1", line)
if result:
    print(result)
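When the match has to be modified before it is put back, re.sub also accepts a function as the replacement. The sketch below assumes the braces always wrap a single run of word characters; unwrap_braces and its transform parameter are hypothetical names used only for illustration.

```python
import re

def unwrap_braces(text, transform=lambda inner: inner):
    """Replace each {word} with transform(word); the braces themselves are dropped."""
    # The callable receives the match object and returns the replacement string.
    return re.sub(r'\{(\w+)\}', lambda m: transform(m.group(1)), text)

print(unwrap_braces('#ma{Z}'))             # #maZ
print(unwrap_braces('#ma{z}', str.upper))  # #maZ
```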

Replace captured groups with empty string in python

I currently have a string similar to the following:
str = 'abcHello Wor=A9ld'
What I want to do is find the 'abc' and '=A9' and replace these matched groups with an empty string, such that my final string is 'Hello World'.
I am currently using this regex, which is correctly finding the groups I want to replace:
r'^(abc).*?(=[A-Z0-9]+)'
I have tried to replace these groups using the following code:
clean_str = re.sub(r'^(abc).*?(=[A-Z0-9]+)', '', str)
Using the above code has resulted in:
print(clean_str)
>>> 'ld'
My question is, how can I use re.sub to replace these groups with an empty string and obtain my 'Hello World'?
Capture everything else and put those groups in the replacement, like so:
re.sub(r'^abc(.*?)=[A-Z0-9]+(.*)', r'\1\2', s)
This worked for me.
re.sub(r'^(abc)(.*?)(=[A-Z0-9]+)(.*?)$', r"\2\4", str)
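The reason the original attempt returns 'ld' is that re.sub replaces the whole match, not only the captured groups. A short side-by-side sketch (the variable names are illustrative):

```python
import re

s = 'abcHello Wor=A9ld'

# The match spans 'abcHello Wor=A9', and all of it is replaced by ''.
whole_match_removed = re.sub(r'^(abc).*?(=[A-Z0-9]+)', '', s)

# Capturing the parts to keep and writing them back preserves them.
kept = re.sub(r'^abc(.*?)=[A-Z0-9]+(.*)', r'\1\2', s)

print(whole_match_removed)  # ld
print(kept)                 # Hello World
```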
Is there a way that I can .. ensure that abc is present, otherwise don't replace the second pattern?
I understand that you need to first check if the string starts with abc, and if yes, remove the abc and all instances of =[0-9A-Z]+ pattern in the string.
I recommend:
import re
s = "abcHello wo=A9rld"
if s.startswith('abc'):
    print(re.sub(r'=[A-Z0-9]+', '', s[3:]))
Here, if s.startswith('abc'): checks if the string has abc in the beginning, then s[3:] truncates the string from the start removing the abc, and then re.sub removes all non-overlapping instances of the =[A-Z0-9]+ pattern.
Note you may use the PyPI regex module to do the same with one regex:
import regex
r = regex.compile(r'^abc|(?<=^abc.*?)=[A-Z0-9]+', regex.S)
print(r.sub('', 'abcHello Wor=A9ld=B56')) # Hello World
print(r.sub('', 'Hello Wor=A9ld')) # => Hello Wor=A9ld
Here,
^abc - abc at the start of the string only
| - or
(?<=^abc.*?) - check that there is abc at the start of the input, followed by any number of chars (line breaks included, because of the regex.S flag), immediately to the left of the current location
=[A-Z0-9]+ - a = followed by 1+ uppercase ASCII letters/digits.
This is a naïve approach but why can't you use replace twice instead of regex, like this:
str = str.replace('abc','')
str = str.replace('=A9','')
print(str) #'Hello World'

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
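A character class matches exactly one character, so [^\1] cannot mean "anything except what group 1 captured". In Python's re, \1 inside a character class is not a backreference at all: it is read as an octal escape for the character chr(1), so [^\1]* matches almost anything, quotes included, and the greedy * then runs to the last matching quote. Think what would have happened if there were more than one character in the first capturing group.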
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not match in the string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '
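To see the difference on the question's own input, here is a small sketch comparing the original pattern with the negative-lookahead version. The variable names are illustrative.

```python
import re

mytext = '"bbb"ccc"ddd'

# Inside [...] the \1 is the character chr(1), not a backreference,
# so [^\1]* runs across quotes and the greedy * takes the longest match.
greedy = re.match(r"([\"'])[^\1]*\1", mytext).group()

# (?!\1). consumes one character at a time, but never the opening quote character.
tempered = re.match(r"([\"'])(?:(?!\1).)*\1", mytext).group()

print(greedy)    # "bbb"ccc"
print(tempered)  # "bbb"
```

Adding ? to the original pattern gives the expected result only because the lazy quantifier stops at the first closing quote; the [^\1] part is still not doing what it appears to do.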

Python split string by start and end characters

Say you have a string like this: "(hello) (yes) (yo diddly)".
You want a list like this: ["hello", "yes", "yo diddly"]
How would you do this with Python?
import re
pattern = re.compile(r'\(([^)]*)\)')
The pattern matches the parentheses in your string (\(...\)) and these need to be escaped.
Then it defines a subgroup ((...)) - these parentheses are part of the regex-syntax.
The subgroup matches all characters except a right parenthesis ([^)]*)
s = "(hello) (yes) (yo diddly)"
pattern.findall(s)
gives
['hello', 'yes', 'yo diddly']
UPDATE:
It is probably better to use [^)]+ instead of [^)]*. The latter would also match an empty string.
Using the non-greedy modifier, as DSM suggested, probably makes the pattern easier to read: pattern = re.compile(r'\((.+?)\)')
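The three variants only differ on input that contains empty parentheses. A quick sketch, using a made-up input string with an empty pair, shows what each one returns:

```python
import re

s = "(hello) () (yo diddly)"

star = re.findall(r'\(([^)]*)\)', s)  # * also accepts the empty pair
plus = re.findall(r'\(([^)]+)\)', s)  # + skips the empty pair
lazy = re.findall(r'\((.+?)\)', s)    # . may match ')', so an empty pair is swallowed

print(star)  # ['hello', '', 'yo diddly']
print(plus)  # ['hello', 'yo diddly']
print(lazy)  # ['hello', ') (yo diddly']
```

So the non-greedy form reads more easily, but [^)]+ is the safer choice if empty parentheses can occur in the input.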
I would do it like this:
"(hello) (yes) (yo diddly)"[1:-1].split(") (")
First, we cut off the first and last characters (since they should be removed anyway). Next, we split the resulting string using ") (" as the delimiter, giving the desired list.
This will give you words from any string :
>>> s="(hello) (yes) (yo diddly)"
>>> import re
>>> words = re.findall(r'\((.*?)\)',s)
>>> words
['hello', 'yes', 'yo diddly']
As DSM said, the ? in the regex makes it non-greedy.

Python regex match text between quotes

In the following script I would like to pull out text between the double quotes ("). However, the python interpreter is not happy and I can't figure out why...
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.match(pattern, text)
print(m.group())
The output should be find.me-_/\.
match starts searching from the beginning of the text.
Use search instead:
#!/usr/bin/env python
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.search(pattern, text)
print(m.group(1))  # group(1) is the text between the quotes; group() would include the quotes
match and search return None when they fail to match.
I guess you are getting AttributeError: 'NoneType' object has no attribute 'group' from python: This is because you are assuming you will match without checking the return from re.match.
If you write:
m = re.search(pattern, text)
the pattern is found. The difference is:
match: matches only at the beginning of the text
search: scans the whole string for a match
Maybe this helps you to understand:
http://docs.python.org/library/re.html#matching-vs-searching
Split the text on quotes and take every other element starting with the second element:
def text_between_quotes(text):
    return text.split('"')[1::2]
my_string = 'Hello, "find.me-_/\\" please help and "this quote" here'
my_string.split('"')[1::2] # ['find.me-_/\\', 'this quote']
'"just one quote"'.split('"')[1::2] # ['just one quote']
This assumes you don't have quotes within quotes, and your text doesn't mix quotes or use other quoting characters like `.
You should validate your input. For example, what do you want to do if there's an odd number of quotes, meaning not all the quotes are balanced? You could do something like discard the last item if you have an even number of things after doing the split
def text_between_quotes(text):
    split_text = text.split('"')
    between_quotes = split_text[1::2]
    # discard the last element if the quotes are unbalanced
    if len(split_text) % 2 == 0 and between_quotes and not text.endswith('"'):
        between_quotes.pop()
    return between_quotes
# ['first quote', 'second quote']
text_between_quotes('"first quote" and "second quote" and "unclosed quote')
or raise an error instead.
Use re.search() instead of re.match(). The latter will match only at the beginning of strings (like an implicit ^).
You need re.search(), not re.match() which is anchored to the start of your input string.
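Putting the answers together, here is a minimal sketch that uses search and checks for None before calling group. QUOTED and first_quoted are hypothetical names for this illustration; the character class is the one from the question.

```python
import re

QUOTED = re.compile(r'"([A-Za-z0-9_./\\-]*)"')

def first_quoted(text):
    """Return the text inside the first double-quoted run, or None if there is none."""
    m = QUOTED.search(text)  # search scans the whole string, match only tries position 0
    return m.group(1) if m else None

text = 'Hello, "find.me-_/\\" please help with python regex'
print(QUOTED.match(text))              # None, because the text does not start with a quote
print(first_quoted(text))              # the text between the quotes, ending in one backslash
print(first_quoted('no quotes here'))  # None instead of an AttributeError
```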
