I have a string like this:
s = k0+k1+k1k2+k2k3+1+12
I want to convert this, such that every number, which follows a letter (k here) becomes surrounded by square brackets:
k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12
What is a good way to do that?
What I tried: Use replace() function 4 times (but it cannot handle numbers not followed by letters).
Here is one option using re module with regex ([a-zA-Z])(\d+), which matches a single letter followed by digits and with sub, you can enclose the matched digits with a pair of brackets in the replacement:
import re
s = "k0+k1+k1k2+k2k3+1+12"
re.sub(r"([a-zA-Z])(\d+)", r"\1[\2]", s)
# 'k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12'
To replace the matched letters with upper case, you can use a lambda in the replacement positions to convert them to upper case:
re.sub(r"([a-zA-Z])(\d+)", lambda p: "%s[%s]" % (p.groups(0)[0].upper(), p.groups(0)[1]), s)
# 'K[0]+K[1]+K[1]K[2]+K[2]K[3]+1+12'
How about this?
s = re.sub('([a-z]+)([0-9]+)', r"\1" + '[' + r"\2" + ']', s)
Related
lets say I have this string : Sayy Hellooooooo
if N = 2
I want the result to be (Using Regex): Sayy Helloo
Thank U in advance
Another option is to use re.sub with a callback:
N = 2
result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)
You could build the regex dynamically for a given n, and then call sub without callback:
import re
n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")
s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s)) # Sayy Helloo
Explanation:
{{: this double brace represents a literal brace in an f-string
{n-1} injects the value of n-1, so together with the additional (double) brace-wrap, this {{{n-1}}} produces {2} when n is 3.
The outer capture group captures the maximum allowed repetition of a character
The additional \2+ captures more subsequent occurrences of that same character, so these are the characters that need removal.
The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
You could use backreferences to mach the previous character. So (a|b)\1 would match aa or bb. In your case you would want probably any letter and any number of repetitions so ([a-zA-Z])\1{n,} for N repetitions. Then substitute it with one occurence using \1 again. So putting it all together:
import re
n=2
expression = r"([a-zA-Z])\1{"+str(n)+",}"
print(re.sub(expression,r"\1","hellooooo friiiiiend"))
# Outputs Hello friend
Attempt This Online!
Note this actually matches N+1 repetitions only, like your test cases. One item then N copies of it. If you want to match exactly N also subtract 1.
Remember to use r in front of regular expressions so you don't need to double escape backslashes.
Learn more about backreferences: https://www.regular-expressions.info/backref.html Learn more about repetition: https://www.regular-expressions.info/repeat.html
You need a regex that search for multiple occurence of the same char, that is done with (.)\1 (the \1 matches the group 1 (in the parenthesis))
To match
2 occurences : (.)\1
3 occurences : (.)\1\1 or (.)\1{2}
4 occurences : (.)\1\1\1 or (.)\1{3}
So you can build it with an f-string and the value you want (that's a bit ugly because you have literal brackets that needs to be escaped using double brackets, and inside that the bracket to allow the value itself)
def remove_letters(value: str, count: int):
return re.sub(rf"(.)\1{{{count}}}", "", value)
print(remove_letters("Sayy Hellooooooo", 1)) # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2)) # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3)) # Sayy Hellooo
You may understand the pattern creation easier with that
r"(.)\1{" + str(count) + "}"
This seems to work:
When N=2: the regex pattern is compiled to : ((\w)\2{2,})
When N=3: the regex pattern is compiled to : ((\w)\2{3,})
Code:
import re
N = 2
p = re.compile(r"((\w)\2{" + str(N) + r",})")
text = "Sayy Hellooooooo"
matches = p.findall(text)
for match in matches:
text = re.sub(match[0], match[1]*N, text)
print(text)
Output:
Sayy Helloo
Note:
Also tested with N=3, N=4 and other text inputs.
I have a huge string which contains emotions like "\u201d", AS WELL AS "\advance\"
all that I need is to remove back slashed so that:
- \u201d = \u201d
- \united\ = united
(as it breaks the process of uploading it to BigQuery database)
I know it should be somehow this way:
string.replace('\','') But not sure how to keep \u201d emotions.
ADDITIONAL:
Example of Unicode emotions
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can split on all '\' and then use a regex to replace your emotions with adding leading '\'
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emotions are 'u' and 4 hexa numbers, 'u[a-f0-9]{4}' will match them all, and you just have to add leading backslashes
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\')
And then we match every "emotion" with the regex u[a-f0-9]{4} (Which is u with 4 hex letters behind)
And with the regex sub, you replace every match with a leading \\
You could simply add the backslash in front of your string after replacement if your string starts with \u and have at least one digit.
import re
def clean(s):
re1='(\\\\)' # Any Single Character "\"
re2='(u)' # Any Single Character "u"
re3='.*?' # Non-greedy match on filler
re4='(\\d)' # Any Single Digit
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(s)
if m:
r = '\\'+s.replace('\\','')
else:
r = s.replace('\\','')
return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your sting if multiple entries are in the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
You can use this simple method to replace the last occurence of your character backslash:
Check the code and use this method.
def replace_character(s, old, new):
return (s[::-1].replace(old[::-1],new[::-1], 1))[::-1]
replace_character('\advance\', '\','')
replace_character('\u201d', '\','')
Ooutput:
\advance
\u201d
You can do it as simple as this
text = text.replace(text[-1],'')
Here you just replace the last character with nothing
I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo
I'm new to python and am trying to figure out python regex to find any strings that match -. For example, 'type1-001' and 'type2-001' should be a match, but 'type3-asdf001' shouldn't be a match. I would like to be able to match with a regex like [type1|type2|type3]-\d+ to find any strings that start with type1, type2, or type3 and then are appended with '-' and digits. Also, it would be cool to know how to search for any upper case text appended with '-' and digits.
Here's what I think should work, but I can't seem to get it right...
pref_num = re.compile(r'[type1|type2]-\d+')
[] will match any of the set of characters appearing between the brackets. To group regexes you need to use (). So, I think your regex should be something like:
pref_num = re.compile(r'(type1|type2)-\d+')
As to how to search any uppercase text appended with - and digits, I would suggest:
[A-Z]+-\d+
If you only want the digit after "type" to be variable then you should put only those in the square brackets like so:
re.compile(r'type[1|2]-\d+')
You can use the pattern
'type[1-3]-[0-9]{3}'
Demo
>>> import re
>>> p = 'type[1-3]-[0-9]{3}'
>>> s = 'type2-005 with some text type1-101 and then type1-asdf001'
>>> re.findall(p, s)
['type2-005', 'type1-101']
pref_num = re.compile(r'(type1|type2|type3)-\d+')
m = pref_num.search('type1-000')
if m != None: print(m.string)
m = pref_num.search('type2-000')
if m != None: print(m.string)
m = pref_num.search('type3-abc000')
if m != None: print(m.string)
How could I find all instances of a substring in a string?
For example I have the string ("%1 is going to the %2 with %3"). I need to extract all placeholders in this string (%1, %2, %3)
The current code could only find the first two because the ending is not a white space.
import re
string = "%1 is going to the %2 with %3"
r = re.compile('%(.*?) ')
m = r.finditer(string)
for y in m:
print (y.group())
Don't match on whitespace, match on a word boundary instead using \b:
r = re.compile(r'%(.*?)\b')
You may want to restrict your characters to word characters only instead of the . wildcard, and match at least one character:
r = re.compile(r'%(\w+)\b')
You don't appear to be using the capturing group either, so you could just omit that:
r = re.compile(r'%\w+\b')