I need to replace all occurrences of "W32 L30" with "W32in L30in" in a large corpus of text. The numbers after W, L also vary.
I thought of using these regular expressions:
[W]([-+]?\d*\.\d+|\d+)
[L]([-+]?\d*\.\d+|\d+)
But these only find the number after each W and L, so replacing every occurrence by hand would still be laborious and time-consuming. Is there a way to do the replacement directly with regex?
You can simplify the regex with a capture group and then use a backreference to do the replacement, like this:
import re
RGX = re.compile(r'([WL]([-+]?\d*\.\d+|\d+))(in)?')
result = RGX.sub(r'\1in', some_string)
The \1 references the first capture group: the part of the string captured by [WL]([-+]?\d*\.\d+|\d+). The last part, (in)?, optionally also matches the word in, so that if there is already an in directly after the number, we simply replace it with the same value instead of doubling it.
So if some_string is for instance:
>>> some_string
'A W2 in C3.15 where L2.4in and a bit A4'
>>> RGX.sub(r'\1in', some_string)
'A W2in in C3.15 where L2.4in and a bit A4'
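For the original "W32 L30" case, the same idea can be wrapped in a small helper. This is only a sketch, not part of the answer above: the name add_inches is hypothetical, and the extra \b and the sign applied to both number forms are assumptions, added so that an L or W in the middle of a word (as in "HTML5") is left alone.

```python
import re

# Assumed variant of the pattern above: \b requires W/L to start a word,
# and (?:in)? swallows an existing unit so it is not doubled.
DIM = re.compile(r'\b([WL][-+]?(?:\d*\.\d+|\d+))(?:in)?')


def add_inches(text):
    """Append 'in' to every W<number> / L<number> token (hypothetical helper)."""
    return DIM.sub(r'\1in', text)


print(add_inches("W32 L30"))       # W32in L30in
print(add_inches("W32in L1.5"))    # W32in L1.5in
print(add_inches("HTML5 and W2"))  # HTML5 and W2in
```

Running the helper twice on the same text gives the same result, which matters when part of the corpus has already been converted.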
I need help with regex to get the following out of the string
dal001.caxxxxx.test.com. ---> caxxxxx.test.com
caxxxx.test.com -----> caxxxx.test.com
So basically in the first example, I don't want dal001 or anything that starts with 3 letters and 3 digits and want the rest of the string if it starts with only ca.
In second example I want the whole string that starts only with ca.
So far I have tried (^[a-z]{3}[\d]+\.)?(ca.*) but it doesn't work when the string is
dal001.mycaxxxx.test.com.
Any help would be appreciated.
You can use
^(?:[a-z]{3}\d{3}\.)?(ca.*)
See the regex demo. To make it case insensitive, pass the re.I flag, e.g. re.search(rx, s, re.I).
Details:
^ - start of string
(?:[a-z]{3}\d{3}\.)? - an optional sequence of 3 letters and then 3 digits and a .
(ca.*) - Group 1: ca and the rest of the string.
See the Python demo:
import re
rx = r"^(?:[a-z]{3}\d{3}\.)?(ca.*)"
strs = ["dal001.caxxxxx.test.com","caxxxx.test.com"]
for s in strs:
    m = re.search(rx, s)
    if m:
        print( m.group(1) )
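To see how the pattern behaves on the problematic input from the question, here is a small sketch. The helper name strip_prefix is hypothetical, and returning None for non-matching strings is an assumption about what the caller wants.

```python
import re

# Same pattern as above, compiled once with re.I for case-insensitive matching.
RX = re.compile(r"^(?:[a-z]{3}\d{3}\.)?(ca.*)", re.I)


def strip_prefix(host):
    """Return the part starting with 'ca', or None if the string does not qualify."""
    m = RX.search(host)
    return m.group(1) if m else None


print(strip_prefix("dal001.caxxxxx.test.com"))   # caxxxxx.test.com
print(strip_prefix("caxxxx.test.com"))           # caxxxx.test.com
print(strip_prefix("dal001.mycaxxxx.test.com"))  # None
```

Because the optional prefix is anchored with ^ and must be followed directly by ca, a string such as dal001.mycaxxxx.test.com is rejected rather than partially matched.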
Use re.sub like so:
import re
strs = ['dal001.caxxxxx.test.com', 'caxxxx.test.com']
for s in strs:
    s = re.sub(r'^[A-Za-z]{3}\d{3}[.]', '', s)
    print(s)
# caxxxxx.test.com
# caxxxx.test.com
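Note that this substitution removes the prefix no matter what follows it. If the prefix should go only when the rest starts with ca, one possible variant is a lookahead. This is a sketch; drop_prefix is a hypothetical name, and leaving non-qualifying strings unchanged is an assumption.

```python
import re

# (?=ca) is a lookahead: the prefix is removed only when "ca" follows it.
PREFIX = re.compile(r'^[A-Za-z]{3}\d{3}\.(?=ca)')


def drop_prefix(host):
    """Strip a leading 3-letter/3-digit label when the next label starts with 'ca'."""
    return PREFIX.sub('', host)


for host in ['dal001.caxxxxx.test.com', 'caxxxx.test.com', 'dal001.mycaxxxx.test.com']:
    print(drop_prefix(host))
# caxxxxx.test.com
# caxxxx.test.com
# dal001.mycaxxxx.test.com
```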
If you are using re:
import re
my_strings = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com']
my_regex = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
compiled_regex = re.compile(my_regex)
for a_string in my_strings:
    if compiled_regex.match(a_string):
        # keep only Group 1 (the part starting with "ca")
        print(compiled_regex.sub(r'\1', a_string))
my_regex matches a string that starts (^ anchors to the start of the string) with [3 letters][3 digits][a .], but only optionally, and using a non-capturing group (a (?:...) group does not get a numbered reference to use in sub). In either case, the string must then contain ca followed by anything; this part is captured as Group 1 and used as the replacement in the call to sub. re.compile is used to make it a bit faster, in case you have many strings to match.
Note on re.compile:
Some answers don't bother pre-compiling the regex before the loop. They have made a trade: removing a single line of code, at the cost of an implicit pattern lookup on every iteration (the re module caches recently compiled patterns, so the regex is normally not rebuilt from scratch each time, but every call still pays for the lookup). If you will use a regex in a loop body, it is good practice to compile it first. Doing so can have a noticeable effect on the speed of a tight loop, and there is no added cost even when the number of iterations is small. Here is a comparison of compiled vs. non-compiled versions of the same loop using the same regex for different numbers of loop iterations and number of trials. Judge for yourself.
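A comparison of this kind can be reproduced with timeit. The sketch below is not the original benchmark: the function names, the list size, and the repeat count are all assumptions, and the numbers will differ between machines and Python versions.

```python
import re
import timeit

PATTERN = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
# Assumed workload: the two sample strings repeated many times.
STRINGS = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com'] * 1000


def with_module_function():
    # re.sub looks the pattern up in the module's cache on every call.
    return [re.sub(PATTERN, r'\1', s) for s in STRINGS]


def with_compiled():
    # The pattern object is created once and reused inside the loop.
    rx = re.compile(PATTERN)
    return [rx.sub(r'\1', s) for s in STRINGS]


if __name__ == "__main__":
    for fn in (with_module_function, with_compiled):
        print(fn.__name__, timeit.timeit(fn, number=20))
```

Both functions return the same list, so any difference in the printed timings comes from how the pattern is obtained on each iteration.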
I scraped some text from PDFs, and accents/umlauts on characters get scraped as separate characters next to their letter, e.g. "Jos´e" and "Mu¨ller". Because there are just a few of these characters, I would like to fix them to e.g. "José" and "Müller".
I am trying to adapt the pattern here Regex to match words with hyphens and/or apostrophes.
import re
pattern = r"(?=\S*[´])([a-zA-Z´]+)"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
    m.group()  # returns "Jos´e" and "Vald´ez"
    m.start()  # returns 0 and 6, but I want 3 and 10
In the example above, what pattern can I use to get the position of the '´' character? Then I can check the subsequent letter and replace the text accordingly.
My texts are scraped from scientific papers and could contain those characters elsewhere, for example in code. That is the reason why I am using regex instead of .replace or text normalization with e.g. unicodedata, because I want to make sure I am replacing "words" (more precisely the authors' first and last names).
EDIT: I can relax these conditions and simply replace those characters everywhere because, if they appear in non-words such as "F=m⋅x¨", I will discard non-words anyway. Therefore, I can use a simple replace approach.
I suggest using
import re
d = {'´e': 'é', 'u¨' : 'ü'}
pattern = "|".join([x for x in d])
print( re.sub(pattern, lambda m: d[m.group()], "Jos´e Vald´ez") )
# => José Valdéz
See the Python demo.
If you need to make sure there are word boundaries, you may consider using
pattern = r"\b´e|u¨\b"
See this Python demo. The \b before ´ and the \b after ¨ make sure there is a word character directly before the ´ and directly after the ¨.
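If more letters than e and u are affected, the replacement table can be generated instead of typed by hand. The following is a sketch under stated assumptions: the accent is taken to be U+00B4 placed before its letter and the diaeresis U+00A8 placed after its letter (as in the two examples), only plain vowels are covered, and build_table and fix_accents are hypothetical names.

```python
import re
import unicodedata

ACUTE, DIAERESIS = "\u00b4", "\u00a8"  # spacing accent characters as scraped


def build_table(letters="aeiouAEIOU"):
    """Map '<acute><letter>' and '<letter><diaeresis>' to precomposed characters."""
    table = {}
    for ch in letters:
        # NFC composes letter + combining mark into a single code point.
        table[ACUTE + ch] = unicodedata.normalize("NFC", ch + "\u0301")
        table[ch + DIAERESIS] = unicodedata.normalize("NFC", ch + "\u0308")
    return table


TABLE = build_table()
PATTERN = re.compile("|".join(re.escape(key) for key in TABLE))


def fix_accents(text):
    return PATTERN.sub(lambda m: TABLE[m.group()], text)


print(fix_accents("Jos\u00b4e Vald\u00b4ez"))  # José Valdéz
print(fix_accents("Mu\u00a8ller"))             # Müller
```

re.escape is not strictly needed for these keys, but it keeps the pattern safe if the table is later extended with characters that are special in a regex.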
A quick fix to the pattern returns the indexes you are looking for. Instead of matching the whole word, the group will catch the accent characters only.
import re
pattern = r"(?=\S*[´])[a-zA-Z]+([´]+)[a-zA-Z]+"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
    print(m.group())   # returns "Jos´e" and "Vald´ez"
    print(m.start(1))  # returns 3 and 10
I have a sentence and a regular expression. Is it possible to find out how far into the regex my sentence matches? For example, take the sentence MMMV and the regex M+V?T*Z+. The regex up to M+V? matches the sentence, and the remaining part of the regex, T*Z+, should be my output.
My approach right now is to break the regex into individual parts, store them in a list, and then match by concatenating the first n parts until the sentence matches. For example, if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in a loop, first against M+, then against M+V?, and so on until a complete match is found, and take the remaining list as the output. Below is the code:
import re
re_exp = ['M+', 'V?', 'T*', 'Z+']
def remaining_regex(sentence_language):
    for n in range(len(re_exp)):
        re_expression = ''.join(re_exp[:n+1])
        if re.match(r'{0}$'.format(re_expression), sentence_language):
            return re_exp[n+1:]
Is there a better approach to achieve this, maybe by using a parsing library?
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
import re
def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            string = string[len(g):]
            res, pat = res + g, pat + p
        else:
            break
    return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still take O(n²) time because of the repeated string slicing.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
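A middle ground is to keep using re for each part but pass a start position to the compiled pattern's match method, which avoids the repeated slicing. The sketch below also returns the unmatched parts, which is what the question asks for. The name remaining_parts is hypothetical, and returning None when the string does not fit the pattern is an assumption. It shares the limitation described above for patterns such as a?a+b.

```python
import re


def remaining_parts(pattern, string):
    """Return the pattern parts left over once the whole string is consumed."""
    parts = re.findall(r"\w[+*?]?", pattern)
    pos = 0
    for i, part in enumerate(parts):
        # Pattern.match(string, pos) starts matching at pos without copying.
        m = re.compile(part).match(string, pos)
        if m is None:
            return None  # the string diverges from the pattern here
        pos = m.end()
        if pos == len(string):
            return parts[i + 1:]
    return None  # pattern exhausted with input left over


print(remaining_parts("M+V?T*Z+", "MMMV"))    # ['T*', 'Z+']
print(remaining_parts("M+V?T*Z+", "MTTZZZ"))  # []
print(remaining_parts("M+V?T*Z+", "MVTZX"))   # None
```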
You can use () to enclose groups in a regex. For example, with M+V?(T*Z+), the output you want is stored in the first group of the match.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);
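Since the question is about Python, the same group lookup could be written with re.search. This is a direct translation of the snippet above; the variable name tail is an assumption.

```python
import re

# Group 1 holds the text matched by the trailing T*Z+ part.
m = re.search(r"M+V?(T*Z+)", "MMMVTZ")
tail = m.group(1) if m else None
print(tail)  # TZ
```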
I have seldom used | together with .* before, but when I used both together today, I found some results really confusing. The expression I used is as follows (in Python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The result of the first match is all right, but the second match is not what I expected: I thought the second alternative would capture "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the string (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
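For the other cases in the question, two further options might help. This is a sketch and not the only way to do it: overlapping matches that start at different positions can be collected with a capturing group inside a lookahead, and matches that share a start position can be obtained by running each alternative separately.

```python
import re

s = "abcdefg"

# Overlapping matches at different start positions: a zero-width lookahead
# consumes nothing, so the scan advances one character at a time.
overlapping = re.findall(r"(?=(a.*?c|b.*?e))", s)
print(overlapping)  # ['abc', 'bcde']

# Matches that share a start position: run each alternative on its own.
separate = [m.group() for p in (r"a.*?c", r".*g") for m in re.finditer(p, s)]
print(separate)  # ['abc', 'abcdefg']
```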
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
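The left-to-right, non-overlapping scan is easy to see by printing the span of each match with re.finditer. This is a small illustration; the variable name matches is an assumption.

```python
import re

# Each tuple is ((start, end), matched_text); the second match begins
# exactly where the first one ended.
matches = [(m.span(), m.group()) for m in re.finditer(r"(a.*?c)|(.*g)", "abcdefg")]
print(matches)  # [((0, 3), 'abc'), ((3, 7), 'defg')]
```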
Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all runs of duplicate chars (i.e. a-z) with the respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then the entire matched portion is replaced with a single occurrence of that letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
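You really want to use 'a+' for your regex instead of 'a*': the * operator matches "0 or more" occurrences and thus also matches the empty string between two non-a characters, whereas the + operator matches "1 or more". (The output above is from an older Python; since Python 3.7, empty matches adjacent to a previous match are replaced as well, so the same call returns 'aabababacacaca'.)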
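The difference can be checked directly. In this short sketch the variable names are assumptions, and only results that do not depend on the Python version are shown.

```python
import re

text = 'aaabbbccc'

# 'a+' needs at least one 'a', so only the real run is collapsed.
collapsed = re.sub('a+', 'a', text)
print(collapsed)  # abbbccc

# 'a*' succeeds even where there is no 'a' at all, with an empty match.
empty = re.match('a*', 'bbb')
print(repr(empty.group()), empty.span())  # '' (0, 0)
```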
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
import re
s = "ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s)  # prints 'abcdef'
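As a cross-check for the loop, the same result can be obtained without regex. The sketch below wraps the loop in a function and compares it with dict.fromkeys, which keeps the first occurrence of every character; both function names are hypothetical, and the comparison is only meant for inputs made of lowercase letters.

```python
import re


def dedupe_regex(s):
    # Repeatedly drop the later copy of any repeated lowercase letter.
    pattern = re.compile(r'([a-z])(.*)\1')
    while pattern.search(s):
        s = pattern.sub(r'\1\2', s)
    return s


def dedupe_plain(s):
    # dict preserves insertion order (Python 3.7+), so first occurrences win.
    return ''.join(dict.fromkeys(s))


s = "ababacbdefefbcdefde"
print(dedupe_regex(s))  # abcdef
print(dedupe_plain(s))  # abcdef
```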
A solution that covers every category of character, not just letters:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['