I want to develop a regex in Python where a component of the pattern is defined in a separate variable and combined to a single string on-the-fly using Python's .format() string method. A simplified example will help to clarify. I have a series of strings where the space between words may be represented by a space, an underscore, a hyphen etc. As an example:
new referral
new-referal
new - referal
new_referral
I can define a regex string to match these possibilities as:
space_sep = '[\s\-_]+'
(The hyphen is escaped to ensure it is not interpreted as defining a character range.)
I can now build a bigger regex to match the strings above using:
myRegexStr = "new{spc}referral".format(spc = space_sep)
The advantage of this method for me is that I need to define lots of reasonably complex regexes where there may be several different commonly-occurring stings that occur multiple times and in an unpredictable order; defining commonly-used patterns beforehand makes the regexes easier to read and allows the strings to be edited very easily.
However, a problem occurs if I want to define the number of occurrences of other characters using the {m,n} or {n} structure. For example, to allow for a common typo in the spelling of 'referral', I need to allow either 1 or 2 occurrences of the letter 'r'. I can edit myRegexStr to the following:
myRegexStr = "new{spc}refer{1,2}al".format(spc = space_sep)
However, now all sorts of things break due to confusion over the use of curly braces (either a KeyError in the case of {1,2} or an IndexError: tuple index out of range in the case of {n}).
Is there a way to use the .format() string method to build longer regexes whilst still being able to define number of occurrences of characters using {n,m}?
You can double the { and } to escape them or you can use the old-style string formatting (% operator):
my_regex = "new{spc}refer{{1,2}}al".format(spc="hello")
my_regex_old_style = "new%(spc)srefer{1,2}al" % {"spc": "hello"}
print(my_regex) # newhellorefer{1,2}al
print(my_regex_old_style) # newhellorefer{1,2}al
Related
I have a string (from an API call) that looks something like this:
val=
{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},
{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}
I need to do temp=json.loads(val); but the problem is that the string is not a valid JSON. The keys and values do not have the quotes around them. I tried explicitly putting the quotes and that worked.
How can I programatically include the quotes for such a string before reading it as a JSON?
Also, how can I replace the numbers scientific notations with decimals? eg. 0d-2 becomes "0" and 8.0d-1 becomes "0.8"?
You could catch anything thats a string with regex and replace it accordingly.
Assuming your strings that need quotes:
start with a letter
can have numbers at the end
never start with numbers
never have numbers or special characters in between them
This would be a regex code to catch them:
([a-z]*\d*):
You can try it out here. Or learn more about regex here.
Let's do it in python:
import re
# catch a string in json
json_string = '{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},
{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}' # note the single quotes!
# search the strings according to our rule
string_search = re.search('([a-z]*\d*):', json_string)
# extract the first capture group; so everything we matched in brackets
# this is to exclude the colon at the end from the found string as
# we don't want to enquote the colons as well
extracted_strings = string_search.group(1)
This is a solution in case you will build a loop later.
However if you just want to catch all possible strings in python as a list you can do simply the following instead:
import re
# catch ALL strings in json
json_string = '{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},
{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}' # note the single quotes!
extract_all_strings = re.findall(r'([a-z]*\d*):', json_string)
# note that this by default catches only our capture group in brackets
# so no extra step required
This was about basically regex and finding everything.
With these basics you could either use re.sub to replace everything with itself just in quotes, or generate a list of replacements to verify first that everything went right (probably somethign you'd rather want to do with this maybe a little bit unstable approach) like this.
Note that this is why I made this kind of comprehensive answer instead of just pointing you to a "re.sub" one-liner.
You can apporach your scientific number notation problem accordingly.
I have a question about making a pattern using fuzzy regex with the python regex module.
I have several strings such as TCATGCACGTGGGGCTGAC
The first eight characters of this string are variable (multiple options): TCAGTGTG, TCATGCAC, TGGTGGCT. In addition, there is a constant part after the variable part: GTGGGGCTGAC.
I would like to design a regex that can detect this string in a longer string, while allowing for at most 2 substitutions.
For example, this would be acceptable as two characters have been substituted:
TCATGCACGTGGGGCTGAC
TCCTGCACGTGGAGCTGAC
However, more substitutions should not be accepted.
In my code, I tried to do the following:
import regex
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG", "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
string_to_test = "TCATGCACGTGGGGCTGAC"
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)
pattern = regex.compile(r''+motif+'{s<=2}')
print(pattern.search(string_to_test))
I get a match when I run this code and when I change the last character of string_to_test. But when I manually add a substitution in the middle of string_to_test, I do not get any match (even while I want to allow up to 2 substitutions).
Now I know that my regex is probably total crap, but I would like to know what I exactly need to do to make this work and where in the code I need to add/remove/change stuff. Any suggestions/tips are welcome!
Right now, you only add the restriction to the last C in the pattern that looks likelooks like (TCAGTGTG|TCATGCAC|TGGTGGCT|GATAAGTG|ATTAGACG|CACTTCCG|GTCTGTAT|TGTCAAAG)GTGGGGCTGAC{s<=2}.
To apply the {s<=2} quantifier to the whole expression you need to enclose the pattern within a non-capturing group:
pattern = regex.compile(fr'(?:{motif}){{s<=2}}')
The example above shows how to declare your pattern with the help of an f-string literal, where literal braces are defined with {{ and }} (doubled) braces. It yields the same result as pattern = regex.compile('(?:'+motif+'){s<=2}').
Also, note that r''+ is redundant and has no effect on the final pattern.
I've been trying several things to use variables inside a regular expression. None seem to be capable of doing what I need.
I want to search for a consecutively repeated substring (e.g. foofoofoo) within a string (e.g. "barbarfoofoofoobarbarbar"). However, I need both the repeating substring (foo) and the number of repetitions (In this case, 3) to be dynamic, contained within variables. Since the regular expression for repeating is re{n}, those curly braces conflict with the variables I put inside the string, since they also need curly braces around.
The code should match foofoofoo, but NOT foo or foofoo.
I suspect I need to use string interpolation of some sort.
I tried stuff like
n = 3
str = "foo"
string = "barbarfoofoofoobarbarbar"
match = re.match(fr"{str}{n}", string)
or
match = re.match(fr"{str}{{n}}", string)
or escaping with
match = re.match(fr"re.escape({str}){n}", string)
but none of that seems to work. Any thoughts? It's really important both pieces of information are dynamic, and it matches only consecutive stuff. Perhaps I could use findall or finditer? No idea how to proceed.
Something I havent tried at all is not using regular expressions, but something like
if (str*n) in string:
match
I don't know if that would work, but if I ever need the extra functionality of regex, I'd like to be able to use it.
For the string barbarfoofoofoobarbarbar, if you wanted to capture foofoofoo, the regex would be r"(foo){3}". if you wanted to do this dynamically, you could do fr"({your_string}){{{your_number}}}".
If you want a curly brace in an f-string, you use {{ or }} and it'll be printed literally as { or }.
Also, str is not a good variable name because str is a class (the string class).
This question already has answers here:
Convert decimal mark when reading numbers as input
(8 answers)
Closed last year.
I'm trying to replace commas for cases like:
123,123
where the output should be:
123123
for that I tried this:
re.sub('\d,\d','','123,123')
but that is also deleting the the digits, how can avoid this?
I only want to remode the comma for that case in particular, that's way I'm using regex. For this case, e.g.
'123,123 hello,word'
The desired output is:
'123123 hello,word'
You can use regex look around to restrict the comma (?<=\d),(?=\d); use ?<= for look behind and ?= for look ahead; They are zero length assertions and don't consume characters so the pattern in the look around will not be removed:
import re
re.sub('(?<=\d),(?=\d)', '', '123,123 hello,word')
# '123123 hello,word'
This is one of the cases where you want regular expression "lookaround assertions" ... which have zero length (pattern capture semantics).
Doing so allows you to match cases which would otherwise be "overlapping" in your substitution.
Here's an example:
#!python
import re
num = '123,456,7,8.012,345,6,7,8'
pattern = re.compile(r'(?<=\d),(?=\d)')
pattern.sub('',num)
# >>> '12345678.012345678'
... note that I'm using re.compile() to make this more readable and also because that usage pattern is likely to perform better in many cases. I'm using the same regular expression as #Psidom; but I'm using a Python 'raw' string which is more commonly the way to express regular expressions in Python.
I'm deliberately using an example where the spacing of the commas would overlap if I were using a regular expression such as; re.compile(r'(\d),(\d)') and trying to substitute using back references to the captured characters pattern.sub(r'\1\2', num) ... that would work for many examples; but '1,2,3' would not match because the capturing causes them to be overlapping.
This one of the main reasons that these "lookaround" (lookahead and lookbehind) assertions exist ... to avoid cases where you'd have to repeatedly/recursively apply a pattern due to capture and overlap semantics. These assertions don't capture, they match "zero" characters (as with some PCRE meta patterns like \b ... which matches the zero length boundary between words rather than any of the whitespace (\s which or non-"word" (\W) characters which separate words).
I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
with open(fileName,'r') as file:
str = file.read()
str = str.decode("utf-8")
str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
with open(fileName,'wb') as file:
file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
str = "-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str).encode("utf-8")
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str).encode("utf-8")
print(str)
The first replacement targets all ranges which are not of the LaTex form {1-9} i.e. are not contained within curly braces. The second replacement targets all numbers prepended with a non number or the start of the string.
Demo
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need parentheses around your –):
re.sub(r"([^{]{1,4})-",r"\1–", str)