Stuck with Regex Replace - python

I am trying to replace "[!" at the start of the string only with "(".
The same holds for "!]" with ")", only at the end.
import re
l=["[!hdfjkhtxt.png!] abc", "hghjgfsdjhfg [a!=234]", "[![ITEM:15120710/1]/1587454425954.png!]", "abc"]
p=[re.sub("\[!\w+!]", '', l[i]) for i in range(len(l)) if l[i] != ""]
print(p)
the required output is
["(hdfjkhtxt.png)", "hghjgfsdjhfg [a!=234]", "([ITEM:15120710/1]/1587454425954.png)", "abc"]

Regex placing parens around content between matching pairs of '[!', '!]'
# content between '[!' and '!]' is in the capture group
[re.sub(r"\[!(.*)!\]", lambda m: "(" + m.group(1) + ")", s) for s in l]
Output
['(hdfjkhtxt.png) abc', 'hghjgfsdjhfg [a!=234]', '([ITEM:15120710/1]/1587454425954.png)', 'abc']

You describe your task as a combination of two parts:
substitute [! by ( and
substitute !] by ).
Whether this can be done in separate steps or has to be done simultaneously is addressed later.
First approach
Think if str.replace could do the job. It looks quite convenient and you don't even need to import re:
[e.replace("[!", "(").replace("!]", ")") for e in l]
BTW: there is no need to exclude the empty string ("") from the replacement because it's formally replaced by "" and will be technically skipped anyway.
Regex version
[re.sub(r"\[!", "(", re.sub(r"!\]", ")", e)) for e in l]
Decomposition
The nested substitutions may not look like two steps on first glance, so see the following example
import re
l = [
"[!hdfjkhtxt.png!] abc",
"hghjgfsdjhfg [a!=234]",
"[![ITEM:15120710/1]/1587454425954.png!]",
"abc"
]
for e in l:
    sd = re.sub(r"\[!", "(", e)
    sd = re.sub(r"!\]", ")", sd)
    print(e, "-->", sd)
that produces this output:
[!hdfjkhtxt.png!] abc --> (hdfjkhtxt.png) abc
hghjgfsdjhfg [a!=234] --> hghjgfsdjhfg [a!=234]
[![ITEM:15120710/1]/1587454425954.png!] --> ([ITEM:15120710/1]/1587454425954.png)
abc --> abc
See the re.sub documentation for correct argument use.
Refinement
Because re.sub also supports back references, it's also possible to do the replacement of paired brackets.
re.sub(r"\[!(.+)!\]", r"(\1)", e)
Choose wisely!
It's important to read the actual requirement carefully. If you have to replace bracket pairs, use the second approach. If you have to replace the sequences regardless of whether the occurrences are paired, use the first. Otherwise you are doing it wrong.
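To make the difference concrete, here is a minimal sketch comparing the two approaches on paired and unpaired input. The helper names replace_all and replace_paired and the sample strings are made up for illustration. The paired version uses a non-greedy .+? (an assumption, not the pattern from the answer) so that two pairs in one string are handled separately.

```python
import re

def replace_all(s):
    # First approach: replace every "[!" and "!]" independently.
    return s.replace("[!", "(").replace("!]", ")")

def replace_paired(s):
    # Second approach: replace only when "[!" has a matching "!]" later on.
    # The non-greedy .+? stops at the nearest "!]".
    return re.sub(r"\[!(.+?)!\]", r"(\1)", s)

samples = ["[!a.png!] abc", "only open [!x", "[!a!] and [!b!]"]
for s in samples:
    print(repr(s), "->", repr(replace_all(s)), "|", repr(replace_paired(s)))
```

On the unpaired sample the first function still rewrites the lone [! while the second leaves the string untouched.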
Escaping
Keep in mind that the backslash (\), as an escape character, has to be doubled in normal string literals; an alternative is to prefix the string literal with r. Doubling the backslash (or using the r prefix) is optional in all but the last example because \[ and \] have no function in Python string literals (although recent Python versions emit a warning for such unrecognized escapes), whereas \1 is the code for SOH (the control character in ASCII) or U+0001 (the Unicode code point).
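A short sketch of the escaping point, showing what happens when the replacement string is written with and without the r prefix. The variable names plain, raw and pattern are only for illustration.

```python
import re

# In a normal literal "\1" is a single character (U+0001);
# the raw literal keeps the backslash and the digit.
plain = "(\1)"
raw = r"(\1)"
print(len(plain), len(raw))  # 3 4

pattern = r"\[!(.+)!\]"
print(repr(re.sub(pattern, raw, "[!x!]")))    # '(x)'
print(repr(re.sub(pattern, plain, "[!x!]")))  # '(\x01)'
```

Only the raw version is seen by re.sub as a back reference to group 1.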

Related

Python:Regex to remove more than N consecutive letters

Let's say I have this string: Sayy Hellooooooo
If N = 2,
I want the result (using regex) to be: Sayy Helloo
Thank you in advance
Another option is to use re.sub with a callback:
N = 2
result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)
You could build the regex dynamically for a given n, and then call sub without callback:
import re
n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")
s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s)) # Sayy Helloo
Explanation:
{{: this double brace represents a literal brace in an f-string
{n-1} injects the value of n-1, so together with the additional (double) brace-wrap, this {{{n-1}}} produces {2} when n is 3.
The outer capture group captures the maximum allowed repetition of a character
The additional \2+ captures more subsequent occurrences of that same character, so these are the characters that need removal.
The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
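As a quick check of the explanation above, the same pattern can be wrapped in a function and tried for several values of n. The function name limit_runs is hypothetical; the regex itself is the one built above.

```python
import re

def limit_runs(text, n):
    """Trim every run of a repeated character to at most n copies."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1: the allowed run (one char plus n-1 copies); \2+ is the excess.
    regex = re.compile(rf"((.)\2{{{n-1}}})\2+")
    return regex.sub(r"\1", text)

for n in (1, 2, 3):
    print(n, limit_runs("Sayy Hellooooooo", n))
```

With n = 2 this prints Sayy Helloo, as in the question. Runs that are already short enough are left alone.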
You could use backreferences to match the previous character. So (a|b)\1 would match aa or bb. In your case you probably want any letter followed by at least n repetitions of it, so ([a-zA-Z])\1{n,}. Then substitute it with one occurrence using \1 again. So putting it all together:
import re
n=2
expression = r"([a-zA-Z])\1{"+str(n)+",}"
print(re.sub(expression,r"\1","hellooooo friiiiiend"))
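# Outputs hello friend
Note this only matches runs of at least N+1 identical letters, like your test cases: one letter, then N or more copies of it. If you also want to match runs of exactly N, subtract 1 from n. Note also that the replacement \1 collapses each matched run to a single letter.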
Remember to use r in front of regular expressions so you don't need to double escape backslashes.
Learn more about backreferences: https://www.regular-expressions.info/backref.html Learn more about repetition: https://www.regular-expressions.info/repeat.html
You need a regex that searches for multiple occurrences of the same char; that is done with (.)\1 (the \1 matches whatever group 1, the part in parentheses, matched).
To match
2 occurrences : (.)\1
3 occurrences : (.)\1\1 or (.)\1{2}
4 occurrences : (.)\1\1\1 or (.)\1{3}
So you can build it with an f-string and the value you want (that's a bit ugly because the literal braces need to be escaped by doubling them, and inside those you need another pair of braces to inject the value itself)
import re
def remove_letters(value: str, count: int):
    return re.sub(rf"(.)\1{{{count}}}", "", value)
print(remove_letters("Sayy Hellooooooo", 1)) # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2)) # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3)) # Sayy Hellooo
You may understand the pattern creation easier with that
r"(.)\1{" + str(count) + "}"
This seems to work:
When N=2: the regex pattern is compiled to : ((\w)\2{2,})
When N=3: the regex pattern is compiled to : ((\w)\2{3,})
Code:
import re
N = 2
p = re.compile(r"((\w)\2{" + str(N) + r",})")
text = "Sayy Hellooooooo"
matches = p.findall(text)
for match in matches:
    text = re.sub(match[0], match[1]*N, text)
print(text)
Output:
Sayy Helloo
Note:
Also tested with N=3, N=4 and other text inputs.

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
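since \w is a special sequence which means the same thing as [a-zA-Z0-9_] for ASCII matching; what else it matches depends on the re.LOCALE and re.UNICODE settings (in Python 3, str patterns match Unicode word characters by default unless re.ASCII is given).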
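A small sketch of how the flag changes what \w accepts in Python 3. The second bracketed value, with an accented letter, is invented for the demonstration.

```python
import re

# "caf\u00e9" is "cafe" with an accented e; a made-up value for the demo.
s = "[cus_Y4o9qMEZAugtnW] [caf\u00e9]"

unicode_hits = re.findall(r"\[(\w+)\]", s)
ascii_hits = re.findall(r"\[(\w+)\]", s, flags=re.ASCII)

print(unicode_hits)  # both bracketed values
print(ascii_hits)    # only the pure-ASCII one
```

If the identifiers are known to be ASCII only, passing re.ASCII (or spelling out [A-Za-z0-9_]) keeps the match as strict as the first version of the pattern.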
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
    print(s.group(0))
How about this? Example illustrated using a file:
import re
with open('abc.log', 'r') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines without a bracketed value
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
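To see why the non-greedy quantifier matters here, this sketch runs the lazy and the greedy variant against a shortened version of the string from the question. The variable names lazy and greedy are just for illustration.

```python
import re

s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>"

lazy = re.search(r"\[(.*?)\]", s).group(1)    # stops at the first "]"
greedy = re.search(r"\[(.*)\]", s).group(1)   # runs on to the last "]"

print(lazy)
print(greedy)
```

The lazy version returns just the customer id, while the greedy one swallows everything up to the closing bracket of [card].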
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

Using variables in re.findall() regex function

I have a list of regex patterns like k[a-z]p[a-z]+a
and a list of words that can fit into these patterns. Now, the problem is that,
when I use:
re.findall(r'k[a-z]p[a-z]+a', list)
Everything works properly, but when I replace the raw expression with a variable like:
pattern = "r'" + pattern + "'"
and then try:
re.findall(pattern, list)
or
re.findall(str(pattern), list)
It no longer works. How could I fix it?
Thanks!
Spike
You are overthinking it. The r prefix is not part of the pattern string itself, it merely indicates that the following string should not use escape codes for certain characters.
This will work without adjusting your pattern:
re.findall(pattern, list)
If your pattern contains characters that do need escaping (such as backslashes), you can add the prefix r to the pattern definition. Suppose you want to search for a different regex, then use
pattern = r'k\wp\wa'
re.findall(pattern, list)
and you don't need to escape it. Since pattern in itself is a perfectly ordinary string, you can concatenate it with other strings:
start = 'a'
middle = 'b'
end = 'c'
pattern = start + r'\w' + middle + r'\w' + end
re.findall(pattern, list)
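A minimal sketch of the point being made, assuming a made-up string of words: the r prefix only changes how a literal is written in source code, so a pattern held in a variable can be passed to re.findall as it is.

```python
import re

words = "kappa kupia koala kepta"   # hypothetical input, one string of words
pattern = "k[a-z]p[a-z]+a"          # plain string held in a variable

# The raw literal and the plain literal are the same string here.
assert r"k[a-z]p[a-z]+a" == pattern

matches = re.findall(pattern, words)
print(matches)

# Wrapping the pattern in "r'...'" makes the regex look for a literal r and quotes.
wrapped = "r'" + pattern + "'"
wrapped_matches = re.findall(wrapped, words)
print(wrapped_matches)
```

The wrapped pattern finds nothing because the input contains no literal r' sequence.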

Splitting a string with delimiters and conditions

I'm trying to split a general string of chemical reactions delimited by whitespace, +, = where there may be an arbitrary number of whitespaces. This is the general case but I also need it to split conditionally on the parentheses characters () when there is a + found inside the ().
For example:
reaction= 'C5H6 + O = NC4H5 + CO + H'
Should be split such that the result is
splitresult=['C5H6','O','NC4H5','CO','H']
This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M) which I'd also like to split off of as well leaving only the M. In this case, there will always be a +M inside the parentheses
For example:
reaction='C5H5 + H (+M)= C5H6 (+M)'
splitresult=['C5H5','H','M','C5H6','M']
However, there will be some cases where the parentheses will not be delimiters. In these cases, there will not be a +M but something else that doesn't matter.
For example:
reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
splitresult=['C5H5','HO2','C5H5O(2,4)','OH']
My best guess is to use negative lookahead and lookbehind to match the +M but I'm not sure how to incorporate that into the regex expression I used above on the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.
You could use re.findall() instead:
re.findall(pattern, string, flags=0)
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
re.findall(r'[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction0)
re.findall(r'[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction1)
re.findall(r'[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction2)
but, if you prefer re.split() and filter(), then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
list(filter(None, re.split(r'(?<!,[1-9])[\s+=()]+(?![1-9,])', reaction0)))
list(filter(None, re.split(r'(?<!,[1-9])[\s+=()]+(?![1-9,])', reaction1)))
list(filter(None, re.split(r'(?<!,[1-9])[\s+=()]+(?![1-9,])', reaction2)))
The pattern for findall is different from the pattern for split,
because findall and split are looking for opposite things:
findall looks for what you want to keep.
split looks for what you want to get rid of.
In findall, '[A-Z0-9]+(?:\([1-9],[1-9]\))?'
matches any upper-case letter or digit > [A-Z0-9],
one or more times > +, followed by a pair of digits with a comma in the middle, inside parentheses > \([1-9],[1-9]\)
(literal parentheses outside of character classes must be escaped with backslashes '\'), optionally > ?
\([1-9],[1-9]\) is wrapped in (?: ) so that
the ? makes the whole parenthesised part optional; ( ) instead of (?: ) would also group it, but as a capturing group it would make findall return the group's text instead of the whole match, so the non-capturing group (?: ) is the better choice here.
Try the same analysis with the regex used in the split.
That seems overly complicated to handle with a single regular expression to split the string. It'd be much easier to handle the special case of (+M) separately:
halfway = re.sub(r"\(\+M\)", "M", reaction)
result = list(filter(None, re.split(r'[\s+=]', halfway)))
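Here is the two-step idea as a small self-contained sketch, checked against the three reactions from the question. The function name split_reaction is hypothetical, and padding the M with spaces is an added assumption so that a (+M) written directly after a species still ends up as its own token.

```python
import re

def split_reaction(reaction):
    # Step 1: turn the special "(+M)" marker into a plain, space-separated M.
    halfway = re.sub(r"\(\+M\)", " M ", reaction)
    # Step 2: split on runs of whitespace, "+" and "=", dropping empty pieces.
    return [token for token in re.split(r"[\s+=]+", halfway) if token]

for reaction in (
    "C5H6 + O = NC4H5 + CO + H",
    "C5H5 + H (+M)= C5H6 (+M)",
    "C5H5 + HO2 = C5H5O(2,4) + OH",
):
    print(split_reaction(reaction))
```

Parentheses that do not form (+M), such as the (2,4) suffix, are never touched because neither step treats them as delimiters.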
So here is the regex which you are looking for.
Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))
Flags used:
g for global search. Or use them as per your situation.
Explanation:
((?=\(\+)\() matches a ( only when it is the start of (+. This covers the first part of your (+M) problem.
((?<=M)\)) matches a ) only when it is preceded by M. This covers the second part of your (+M) problem.
[\s+=] matches all the remaining whitespace, + and = characters. This covers the last part of your problem.
Note: Digits enclosed in () are left alone, because neither the positive lookahead nor the positive lookbehind assertion succeeds for them.
Check Regex101 demo for working
P.S: Make it suitable for yourself as I am not a python programmer yet.
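For use from Python, a hedged sketch of the same regex with re.split. One adjustment is assumed here: the capturing parentheses are dropped, because re.split also returns the text of every capturing group, which would put stray ( and ) entries into the result. The names SPLITTER and split_with_lookarounds are made up, and the g flag has no counterpart since re.split already handles every match.

```python
import re

# Same three alternatives as the regex above, without capturing groups.
SPLITTER = re.compile(r"(?=\(\+)\(|[\s+=]|(?<=M)\)")

def split_with_lookarounds(reaction):
    # Drop the empty strings produced by adjacent delimiters.
    return [part for part in SPLITTER.split(reaction) if part]

print(split_with_lookarounds("C5H5 + H (+M)= C5H6 (+M)"))
print(split_with_lookarounds("C5H5 + HO2 = C5H5O(2,4) + OH"))
```

The (2,4) suffix stays attached to its species because its ( is not followed by + and its ) is not preceded by M.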

How do I remove \n found between double quotes from a string?

Good day,
I am totally new to Python and I am trying to do something with string.
I would like to remove any \n characters found between double quotes ( " ) only, from a given string :
str = "foo,bar,\n\"hihi\",\"hi\nhi\""
The desired output must be:
foo,bar
"hihi", "hihi"
Edit:
The desired output must be similar to that string:
after = "foo,bar,\n\"hihi\",\"hihi\""
Any tips?
A simple stateful filter will do the trick.
in_string = False
input_str = 'foo,bar,\n"hihi","hi\nhi"'
output_str = ''
for ch in input_str:
    if ch == '"': in_string = not in_string
    if ch == '\n' and in_string: continue
    output_str += ch
print(output_str)
This should do:
def removenewlines(s):
    inquotes = False
    result = []
    for chunk in s.split("\""):
        if inquotes: chunk = chunk.replace("\n", "")
        result.append(chunk)
        inquotes = not inquotes
    return "\"".join(result)
Quick note: Python strings can use '' or "" as delimiters, so it's common practice to use one when the other is inside your string, for readability. Eg: 'foo,bar,\n"hihi","hi\nhi"'. On to the question...
You probably want the python regexp module: re.
In particular, the substitution function is what you want here. There are a bunch of ways to do it, but one quick option is to use a regexp that identifies the "" substrings, then calls a helper function to strip any \n out of them...
import re
def helper(match):
    return match.group().replace("\n","")
input = 'foo,bar,\n"hihi","hi\nhi"'
result = re.sub('(".*?")', helper, input, flags=re.S)
>>> str = "foo,bar,\n\"hihi\",\"hi\nhi\""
>>> re.sub(r'".*?"', lambda x: x.group(0).replace('\n',''), str, flags=re.S)
'foo,bar,\n"hihi","hihi"'
>>>
Short explanation:
re.sub is a substitution engine. It takes a regular expression, a substitution function or expression, a string to work on, and other options.
The regular expression ".*?" catches strings in double quotes that don't in themselves contain other double quotes (it has a small bug, because it wouldn't catch strings which contain escaped double-quotes).
lambda x: ... is an expression which can be used wherever a function can be used.
The substitution engine calls the function with the match object.
x.group(0) is "the whole matched string", which also includes the double quotes.
x.group(0).replace('\n','') is the matched string with every '\n' replaced by ''.
The flag re.S tells re.sub that '\n' is a valid character to catch with a dot.
Personally I find longer functions that say the same thing more tiring and less readable, in the same way that in C I would prefer i++ to i = i + 1. It's all about what one is used to reading.
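Regarding the small bug mentioned above about escaped double quotes, one possible variation is sketched below. It assumes that quotes inside a quoted string are escaped with a backslash (that convention is an assumption, not something stated in the question), and the names QUOTED and strip_newlines_in_quotes are made up.

```python
import re

# A quoted string: opening quote, then escaped pairs (backslash + any char)
# or ordinary characters other than quote/backslash, then the closing quote.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"', re.S)

def strip_newlines_in_quotes(text):
    # Remove newlines only inside the quoted parts found by QUOTED.
    return QUOTED.sub(lambda m: m.group(0).replace("\n", ""), text)

print(repr(strip_newlines_in_quotes('foo,bar,\n"hihi","hi\nhi"')))
print(repr(strip_newlines_in_quotes('a,"x\\"y\nz",\nb')))
```

In the second call the escaped quote after x no longer ends the quoted string, so the newline between y and z is removed while the one before b is kept.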
This regex works (assuming that quotes are correctly balanced):
import re
result = re.sub(r"""(?x) # verbose regex
\n # Match a newline
(?! # only if it is not followed by
(?:
[^"]*" # an even number of quotes
[^"]*" # (and any other non-quote characters)
)* # (yes, zero counts, too)
[^"]*
\Z # until the end of the string.
)""",
"", str)
Something like this
Break the CSV data into columns.
>>> m=re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)',s,re.M|re.S)
>>> m
[('foo', ','), ('bar', ',\n'), ('"hihi"', ','), ('"hi\nhi"', ''), ('', '')]
Replace just the field instances of '\n' with ''.
>>> [ field.replace('\n','') + sep for field,sep in m ]
['foo,', 'bar,\n', '"hihi",', '"hihi"', '']
Reassemble the resulting stuff (if that's really the point.)
>>> "".join(_)
'foo,bar,\n"hihi","hihi"'
