Match arithmetic operators only one time foreach - python

I have to match the following type of strings:
HELLO
HELLO+2.20
HELLO*1.10
HELLO+2.12*2.99
HELLO*2.30+5.40
The plus and star operator can be there only one time (with their respective amount) so
HELLO+2.20+3.50
HELLO*2.11+1.25*9.99
HELLO*3.33*4.44
aren't valid matches
I tried this regex:
([A-Z]{2,12}(\*(\d+(?:\.\d{1,2})?))?(\+(\d+(?:\.\d{1,2})?))?)
but it only matches the star operator first and the plus operator last (both optional). This regex doesn't support this case:
HELLO+2.11*3.56

You might use an alternation | to match either of the 2 variations of + and *
^[A-Z]{2,12}(?:\+\d+\.\d{1,2}(?:\*\d+\.\d{1,2})?|\*\d+\.\d{1,2}(?:\+\d+\.\d{1,2})?)?$
In parts
^ Start of string
[A-Z]{2,12} Match 2-12 uppercase chars
(?: Non capturing group
\+\d+\.\d{1,2} Match + 1+ digits . and 1-2 digits
(?:\*\d+\.\d{1,2})? Optionally match the same as previous starting with *
| Or
\*\d+\.\d{1,2} Match * 1+ digits . and 1-2 digits
(?:\+\d+\.\d{1,2})? Optionally match the same as previous starting with +
)? Close group and make it optional to also match only 2-12 uppercase chars
$ End of string
Regex demo
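As a quick check, the pattern can be run against the sample strings from the question. This is only a minimal sketch; the names pattern, valid, invalid and results are illustrative and not part of the original answer.

```python
import re

# The alternation pattern from the answer, split over lines for readability.
pattern = re.compile(
    r'^[A-Z]{2,12}'
    r'(?:\+\d+\.\d{1,2}(?:\*\d+\.\d{1,2})?'
    r'|\*\d+\.\d{1,2}(?:\+\d+\.\d{1,2})?)?$'
)

# Sample strings taken from the question.
valid = ['HELLO', 'HELLO+2.20', 'HELLO*1.10', 'HELLO+2.12*2.99', 'HELLO*2.30+5.40']
invalid = ['HELLO+2.20+3.50', 'HELLO*2.11+1.25*9.99', 'HELLO*3.33*4.44']

# Map every sample to True (matches) or False (does not match).
results = {s: bool(pattern.match(s)) for s in valid + invalid}

for s, ok in results.items():
    print(s, '->', 'match' if ok else 'no match')
```

Note that this pattern requires a decimal part on every amount, so a string such as HELLO+2 would not match.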

A more straightforward alternative with builtin features (without regex search):
test_str = '''
HELLO+2.20
HELLO*1.10
HELLO+2.12*2.99
HELLO*2.30+5.40
HELLO+2.20+3.50
HELLO*2.11+1.25*9.99
HELLO*3.33*4.44'''
valid_strings = [s for s in test_str.splitlines()
if s and s.count('+') < 2 and s.count('*') < 2]
print(valid_strings)
The output:
['HELLO+2.20', 'HELLO*1.10', 'HELLO+2.12*2.99', 'HELLO*2.30+5.40']
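One thing to keep in mind is that counting the operators does not check the rest of the format. The sketch below compares the count-based filter with the regex from the other answer on a few made-up inputs; has_single_operators, samples and comparison are hypothetical names used only for illustration.

```python
import re

def has_single_operators(s):
    # Count-based check: non-empty, at most one '+' and at most one '*'.
    return bool(s) and s.count('+') < 2 and s.count('*') < 2

# Regex from the other answer, which also validates the name and the amounts.
pattern = re.compile(
    r'^[A-Z]{2,12}'
    r'(?:\+\d+\.\d{1,2}(?:\*\d+\.\d{1,2})?'
    r'|\*\d+\.\d{1,2}(?:\+\d+\.\d{1,2})?)?$'
)

# Made-up inputs: two of them are malformed but contain only one operator.
samples = ['HELLO+2.20', 'HELLO+', 'hello*abc', 'HELLO*3.33*4.44']

# For each sample: (result of the count check, result of the regex check).
comparison = {s: (has_single_operators(s), bool(pattern.match(s))) for s in samples}

for s, (by_count, by_regex) in comparison.items():
    print(f'{s}: count check={by_count}, regex check={by_regex}')
```

So the count-based filter is fine when the input is already known to be well formed, while the regex is the safer choice for untrusted input.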

Related

regex: don't match number preceded by certain character

Following code extracts the first sequence of numbers that appear in a string:
num = re.findall(r'^\D*(\d+)', string)
I'd like the regular expression not to match numbers preceded by v or V.
Example:
string = 'foobarv2_34 423_wd'
Output: '34'
If you need to get the first match, you need to use re.search, not re.findall.
In this case, you can use a simpler regular expression like (?<!v)\d+ with re.I:
import re
m = re.search(r'(?<!v)\d+', 'foobarv2_34 423_wd', re.I)
if m:
    print(m.group()) # => 34
See the Python demo.
Details
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
\d+ - one or more digits.
If you cannot use re.search for some reason, you can use
^.*?(?<!v)(\d+)
See this regex demo. Note that \D* (zero or more non-digits) is replaced with .*? that matches zero or more chars other than line break chars as few as possible (with re.S or re.DOTALL, it will also match line breaks) since there is a need to match all digits not preceded with v.
More details:
^ - start of string
.*? - zero or more chars other than line break chars as few as possible
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
(\d+) - Group 1: one or more digits.
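For completeness, here is a small sketch of the re.findall variant applied to the string from the question. The helper name first_number_not_after_v is hypothetical and only wraps the call for readability.

```python
import re

def first_number_not_after_v(text):
    # ^.*? lazily skips ahead; (?<!v) rejects digits right after v/V (re.I).
    found = re.findall(r'^.*?(?<!v)(\d+)', text, re.I)
    # The pattern is anchored at the start, so there is at most one result.
    return found[0] if found else None

print(first_number_not_after_v('foobarv2_34 423_wd'))  # 34
```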

Not able get desired output after string parsing through regex

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this matches the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
In the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
Example as below
https://regex101.com/r/pAduvy/1
You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match any of the chars listed in the character class, as few as possible
) Close group 1
(?::\d+:[a-z]+)? optionally match the part at the end that you don't want to include
$ End of string
Regex demo
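In Python, the capture group can be read with re.search. The following is a minimal sketch using the two input lines from the question; the variable names are illustrative.

```python
import re

# Group 1 is the wanted part; the optional :digits:letters tail is left out.
pattern = re.compile(r'(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$')

lines = [
    '6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh',
    '6:/BENM/Tabuler Trading/REM//IMP/2020-341',
]

# Keep group 1 for every line that matches.
extracted = [m.group(1) for m in map(pattern.search, lines) if m]

for value in extracted:
    print(value)
```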
Note: not sure if intended, but the last part of your pattern, [\/-:], denotes a range from ASCII 47 (/) to 58 (:), so it also matches the digits 0-9.
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+ Match 6: and 2 times / followed by 1+ word chars, then a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match :, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
Regex demo
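With the more precise pattern no capture group is needed, because the whole match is already the wanted part. A short sketch, again using the two lines from the question (precise and matches are illustrative names):

```python
import re

# The more precise pattern; the full match (group 0) is the result.
precise = re.compile(r'6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?')

lines = [
    '6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh',
    '6:/BENM/Tabuler Trading/REM//IMP/2020-341',
]

# Collect the full match for every line where the pattern is found.
matches = [m.group() for m in map(precise.search, lines) if m]

for value in matches:
    print(value)
```

This version is stricter about the layout of the line, so it only fits if all inputs follow the same structure as the two examples.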

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value of 1800.00. This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups back.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
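To see the precedence effect, the two patterns can be compared on the example lines from the question. This is just a sketch; the variable names are made up for the comparison.

```python
import re

percent_line = 'Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00'
zero_line = 'Unpacking/Unremoval fee Zero Rated 100.00'

# Without grouping, | splits the whole pattern into two separate alternatives.
ungrouped = re.compile(r'(.*)Zero.*Rated|%\s*(\S*)')
# With (?: ), only "Zero.*Rated" and "%" are alternatives.
grouped = re.compile(r'(.*)(?:Zero.*Rated|%)\s*(\S*)')

ungrouped_zero = ungrouped.search(zero_line).groups()
ungrouped_percent = ungrouped.search(percent_line).groups()
grouped_zero = grouped.search(zero_line).groups()
grouped_percent = grouped.search(percent_line).groups()

print(ungrouped_zero)     # only group 1 is set
print(ungrouped_percent)  # only group 2 is set
print(grouped_zero)       # both groups are set
print(grouped_percent)    # both groups are set
```

With the ungrouped pattern only one of the two groups takes part in each match. With the grouped pattern both groups are filled, although on the % line group 2 holds the text right after the % (=126.00) rather than the final amount, so the value part still needs a stricter pattern.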
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
Regex demo | Python demo
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
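The question asks for numeric values, while re.findall returns strings. One possible way to convert them, assuming the comma is always a thousands separator, is sketched below (amounts is an illustrative name):

```python
import re

regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"

test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
            "Unpacking/Unremoval fee Zero Rated 100.00\n"
            "Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")

# Assumption: "," only separates thousands, so it can be dropped before float().
amounts = {
    name: float(value.replace(",", ""))
    for name, value in re.findall(regex, test_str, re.MULTILINE)
}

print(amounts)
```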

Regex include only one digit between chars

I have to parse a PDF document and I'm using PyPDF2 with re(regex).
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code works, but sometimes it doesn't find anything because there can be a digit between these chars, for example, 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
See demo
\:\d+([A-Z].*?)(?=\d+\,\d+$)
\: matches the character : literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([A-Z].*?)
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\d+\,\d+$)
Assert that the Regex below matches
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
    print( m.group(1) )
See an online Python demo. With this solution, your data may start with any char, and will contain any char (excluding line break chars, since your data is on single lines).
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, : two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
See the regex demo.
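The matching variant can be wrapped in a small helper and checked against both sample lines, with and without a digit in the text. ROW and extract_text are hypothetical names used only for this sketch.

```python
import re

# Date, time, lazily captured text, then the trailing amount.
ROW = re.compile(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$')

def extract_text(line):
    # Return the text between the time and the amount, or None if no match.
    m = ROW.search(line)
    return m.group(1) if m else None

print(extract_text('18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40'))
print(extract_text('18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'))
```

Returning None for lines that do not match makes it easy to skip unrelated lines from the PDF text.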

regex of symbolic expression grouped

In Python, I am trying to write a regex for an expression like this:
function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)
I am using this regex
(?P<perf_name>\w*?)\((?P<perf_param>[\w]+)*(?:,*(?P<perf_param2>[\w]+)?)*\)
but I'm stuck because so far I can't get all the params_x which are not close to brackets (param_2, param_8 and param_9)
Plus, I am pretty sure there is some solution that would let me use a single perf_param instead of the two, perf_param and perf_param2.
Any ideas?
You should do that in 2 steps:
(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)
This regex will get you the name and params as two groups. Then, just split the second group with ,.
import re
p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)"
res = [(x.group("perf_name"), x.group("perf_params").split(",")) for x in p.finditer(s)]
print(res)
# => [('function_1', ['param_1', 'param_2', 'param_3']), ('function_2', ['param_4', 'param_5']), ('function_3', ['param_6']), ('function_4', ['']), ('function_5', ['param_7', 'param_8', 'param_9', 'param_10'])]
See the Python demo
The regex matches:
(?P<perf_name>\w*) - 0 or more alphanumeric/underscore characters
\( - a literal (
(?P<perf_params>\w*(?:,\w+)*) - 0+ word characters (\w*) followed with 0+ sequences of a comma and 1+ word characters
\) - closing ).
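As the output shows, a call without arguments such as function_4() yields [''], because splitting an empty string returns one empty item. If an empty list is preferred, the empty items can be filtered out. A minimal sketch on a shortened input (the string s and the name calls are illustrative):

```python
import re

p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')

# Shortened, made-up input with one call that has no parameters.
s = "function_1(param_1,param_2)+function_4()"

# Split on "," and drop empty items, so "()" gives [] instead of [''].
calls = [
    (m.group("perf_name"), [a for a in m.group("perf_params").split(",") if a])
    for m in p.finditer(s)
]

print(calls)
```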
