I am working with Grammatical Evolution (GE) on Python 3.7.
My grammar generates executable strings in the format:
np.where(<variable> <comparison_sign> <constant>, (<probability1>), (<probability2>))
Yet, the string can get quite complex, with several chained np.where .
<constant> in some cases contains leading zeros, which makes the executable string to generate errors. GE is supposed to generate expressions containing leading zeros, however, I have to detect and remove them.
An example of a possible solution containing leading zeros:
"np.where(x < 02, np.where(x > 01.5025, (0.9), (0.5)), (1))"
Problem:
There are two types of numbers containing leading zeros: int and float.
Supposing that I detect "02" in the string. If I replace all occurrences in the string from "02" to "2", the float "01.5025" will also be changed to "01.525", which cannot happen.
I've made several attempts with different re patterns, but couldn't solve it.
To detect that an executable string contains leading zeros, I use:
try:
_ = eval(expression)
except SyntaxError:
new_expression = fix_expressions(expression)
I need help building the fix_expressions Python function.
You could try to come up with a regular expression for numbers with leading zeros and then replace the leading zeros.
import re
def remove_leading_zeros(string):
return re.sub(r'([^\.^\d])0+(\d)', r'\1\2', string)
print(remove_leading_zeros("np.where(x < 02, np.where(x > 01.5025, (0.9), (0.5)), (1))"))
# output: np.where(x < 2, np.where(x > 1.5025, (0.9), (0.5)), (1))
The remove_leading_zeros function basically finds all occurrences of [^\.^\d]0+\d and removes the zeros. [^\.^\d]0+\d translates to not a number nor a dot followed by at least one zero followed by a number. The brackets (, ) in the regex signalize capture groups, which are used to preserve the character before the leading zeros and the number after.
Regarding Csaba Toth's comment:
The problem with 02+03*04 is that there is a zero at the beginning of the string.
One can modify the regex such that it matches also the beginning of the string in the first capture group:
r"(^|[^\.^\d])0+(\d)"
You can remove leading 0's in a string using .lstrip()
str_num = "02.02025"
print("Initial string: %s \n" % str_num)
str_num = str_num.lstrip("0")
print("Removing leading 0's with lstrip(): %s" % str_num)
Related
I have the Python code below and I would like the output to be a string: "P-1888" discarding all numbers after the 2nd "-" and removing the leading 0's after the 1st "-".
So far all I have been able to do in the following code is to remove the trailing 0's:
import re
docket_no = "P-01888-000"
doc_no_rgx1 = re.compile(r"^([^\-]+)\-(0+(.+))\-0[\d]+$")
massaged_dn1 = doc_no_rgx1.sub(r"\1-\2", docket_no)
print(massaged_dn1)
You can use the split() method to split the string on the "-" character and then use the join() method to join the first and second elements of the resulting list with a "-" character. Additionally, you can use the lstrip() method to remove the leading 0's after the 1st "-". Try this.
docket_no = "P-01888-000"
docket_no_list = docket_no.split("-")
docket_no_list[1] = docket_no_list[1].lstrip("0")
massaged_dn1 = "-".join(docket_no_list[:2])
print(massaged_dn1)
First way is to use capturing groups. You have already defined three of them using brackets. In your example the first capturing group will get "P", and the third capturing group will get numbers without leading zeros. You can get captured data by using re.match:
match = doc_no_rgx1.match(docket_no)
print(f'{match.group(1)}-{match.group(3)}') # Outputs 'P-1888'
Second way is to not use regex for such a simple task. You could split your string and reassemble it like this:
parts = docket_no.split('-')
print(f'{parts[0]}-{parts[1].lstrip("0")}')
It seems like a sledgehammer/nut situation but of you do want to use re then you could use:
doc_no_rgx1 = ''.join(re.findall('([A-Z]-)0+(\d+)-', docket_no)[0])
I don't think I'd use a regular expression for this purpose. Your usecase can be handled by standard string manipulation so using a regular expression would be overkill. Instead, consider doing this:
docket_nos = "P-01888-000".split('-')[:-1]
docket_nos[1] = docket_nos[1].lstrip('0')
docket_no = '-'.join(docket_nos)
print(docket_no) # P-1888
This might seem a little bit verbose but it does exactly what you're looking for. The first line splits docket_no by '-' characters, producing substrings P, 01888 and 000; and then discards the last substring. The second line strips leading zeros from the second substring. And the third line joins all these back together using '-' characters, producing your desired result of P-1888.
Functionally this is no different than other answers suggesting that you split on '-' and lstrip the zero(s), but personally I find my code more readable when I use multiple assignment to clarify intent vs. using indexes:
def convert_docket_no(docket_no):
letter, number, *_ = docket_no.split('-')
return f'{letter}-{number.lstrip("0")}'
_ is used here for a "throwaway" variable, and the * makes it accept all elements of the split list past the first two.
I need to print the 2 missing strings KMJD23KN0008393 and KMJD23KN0008394 but what I am receiving is KMJD23KN8393 and KMJD23KN8394 .I need those missing zeros also in our list.
ll = ['KMJD23KN0008391','KMJD23KN0008392','KMJD23KN0008395','KMJD23KN0008396']
missList=[]
for i in ll:
reList=re.findall(r"[^\W\d_]+|\d+", i)
print(reList)
The issue can be decomposed into three parts:
Extracting the trailing number string
Interpreting that substring as the actual number, which should form a consecutive sequence
Re-formatting the missing items based on the surrounding items.
There are multiple assumptions implicit in these items. You need to be aware of these, and ideally make them explicit. In the following, I’ve worked with the following assumptions:
Anything that forms a number at the end of the string is considered. Everything before that is a prefix and is assumed to be identical throughout the series.
The trailing numbers in the series always have the same width.
The items in the list are sorted in ascending order by their trailing number.
The following code implements these assumptions:
all_missing = []
last_num = int(re.search(r'\d+$', ll[-1])[0])
prefix = re.match('.*\D', ll[0])[0]
for item in ll:
num_str = re.search(r'\d+$', item)[0]
num = int(num_str)
num_width = len(num_str)
for missing in range(last_num + 1, num):
all_missing.append(f'{prefix}{missing:0{num_width}}')
last_num = num
print(all_missing)
Some notes here:
To extract the trailing number, a very simple regex is sufficient: \d+$. That is: one or more digits, until the end of the string.
Conversely, to extract the prefix, we search for any sequence of arbitrary characters where the last character is a non-digit. That is: .*\D.
To re-format the missing items, we concatenate the prefix with the missing number, and we pad the missing number with zeros (from left) until it is of the expected width. This is achieved by using Python’s f-strings with the format specifier '0{num_width}'.
I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
See this Python demo.
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
If I were you, I would prefer to do it like suggested before. But since I've already spend time on answering this question here is my handmade solution.
The pattern described below creates a named group named "letter". This group updates iterative, so firstly it is a, then b, etc. Then it looks ahead for all the repetitions of the group "letter" (which updates for each letter).
So it finds all groups of repeated letters and replaces them with empty string.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question correct you mean to turn "abccbcbbb" into "abcbcb" only removing sequential duplicate characters. Is there a reason you need to use regex? you could likely do a simple list comprehension. I mean this is a really cut and dirty way to do it but you could just put
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
if letter != previous : result += letter
previous = letter
result = "".join(result)
and with a method like this, you could make it easier to read and faster with a bit of modification id assume.
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.
you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all
I have a list of files, and I am trying to filter for a subset of file names that end in 000000, 060000, 120000, 180000. I know I could do a straight string match, but I would like to understand why the regular expression I attempted below r'[00|06|12|18]+0000', would not work (it is returning MSM_20130519210000.csv as well). I intend it to be match either one of 00, 06, 12, 18, follow by 0000. How can that be accomplished? Please keep the answer along the line of this intended regex instead of other functions, thanks.
Here is the code snippet:
import re
files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv',
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv',
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv',
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv',
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv',
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv',
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv',
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv',
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv',
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv',
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv',
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv',
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv',
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv',
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv',
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv',
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv',
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv',
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv',
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']
print files_in_input_directory
print "\n"
# trying to match any string with 000000, 060000, 120000, 180000
# Question: I use + meaning one or more, and | to indicates the options, but this will match
# 'MSM_20130519210000.csv' as well, and I don't know why
print filter(lambda x:re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)
print "\n"
# This verbose version works
print filter(lambda x:re.search(r'0000000|060000|120000|180000', x), files_in_input_directory)
print "\n"
If you are trying to match filenames that contain 000000, 060000, 120000 or 180000, then instead of
re.search(r'[00|06|12|18]+0000', x)
use
re.search(r'(00|06|12|18)0000', x)
The square brackets [...] only match a single character at a time, and the + character means "match 1 or more of the preceding expression".
[00|06|12|18] is the character set matching 00|06|12|18. Thus it will match 210000 in "SM_20130519210000.csv" because [00|06|12|18] is equivalent to writing [01268]. Not what you meant, I should think.
Instead of expressing a character set that can match one or more times, make it either a capturing group
r'(00|06|12|18)0000'
Or a negative lookbehind expression
r'(?<=00|06|12|18)0000'
They are equivalent for your purposes, since you don't care about the match or any groups.
The basic problem here is you were not grouping the patterns, but creating a character set fo match against using ``[ ... ]```.
This regex works: ((000)|(06)|(12)|(18))0000