Not able to get desired output after string parsing through regex - python

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this matches the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
In the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
Example as below
https://regex101.com/r/pAduvy/1

You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match any of the chars listed in the character class, as few times as possible
) Close group 1
(?::\d+:[a-z]+)? optionally match the part at the end that you don't want to include
$ End of string
Regex demo
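A minimal sketch of how the capture-group pattern above could be applied in Python, assuming each input arrives as a separate string (the variable names here are illustrative, not from the question):

```python
import re

# Lazy group 1, followed by an optional trailing ":digits:letters" part.
pattern = re.compile(r'(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$')

inputs = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

results = []
for line in inputs:
    m = pattern.search(line)
    if m:
        # Group 1 excludes the optional tail matched by the second group.
        results.append(m.group(1))

print(results)
```

Because group 1 is lazy, the optional tail gets the chance to consume :65:ghgh before the end-of-string anchor is tried.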
Note: not sure if intended, but the last part of your pattern, [\/-:], denotes a range from ASCII code 47 (/) to 58 (:), so it also matches the digits 0-9.
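The range behaviour mentioned in the note can be checked quickly. This is a small sketch using a made-up probe string that contains the characters around codes 47 to 58:

```python
import re

# "/" is code 47 and ":" is code 58, so [\/-:] is a range that
# covers "/", the digits "0"-"9", and ":".
probe = "-./0123456789:;"  # hypothetical test string
matched = re.findall(r'[\/-:]', probe)

print(matched)
print([ord(c) for c in matched])
```

The hyphen and the dot in the probe are not matched, which suggests the class does not mean "slash, hyphen or colon" as probably intended. Placing the hyphen last, as in [\/:-], makes it a literal.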
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+ Match 6:, then 2 times / followed by 1+ word chars, and a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match :, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
Regex demo
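As a rough illustration, the more precise pattern can be run against both sample inputs like this (the list and variable names are assumptions made for the example):

```python
import re

# The stricter pattern spells out each path segment explicitly.
precise = re.compile(
    r'6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?'
)

inputs = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

# No capture group is needed here: the full match is the wanted text.
matches = [m.group(0) for m in map(precise.search, inputs) if m]
print(matches)
```

This variant is less forgiving than the character-class version: a line whose segments deviate from the two samples will simply not match.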

Related

How to extract string between space and symbol '>'?

String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
DEMO
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
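A short sketch of the \w+<[^>]+ pattern in Python, using the two strings from the question (the names strings and extracted are illustrative):

```python
import re

strings = [
    "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97",
    "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17",
]

# [^>]+ stops right before the closing ">", so it is not part of the match.
extracted = [re.findall(r'\w+<[^>]+', s) for s in strings]

for found in extracted:
    print(found)
```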
We can use re.findall here with the pattern (\w+<.*?)>:
import re
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    # Capture the word, the "<" and everything up to (but not including) the next ">"
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match <, then 1+ digits, a colon and 1+ digits
(?=>) Positive lookahead, assert > directly to the right
Regex demo
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>
Regex demo
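Both variants can be compared side by side. The sketch below uses the first string from the question and \d+ on both sides of the colon in each pattern; the variable names are made up for the example:

```python
import re

s1 = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"

# Lookaround variant: the space/pipe and the ">" are asserted, not consumed.
lookaround = re.findall(r'(?<=[ |])[A-Z]+<\d+:\d+(?=>)', s1)

# Capture-group variant: the delimiters are consumed, and findall
# returns only the contents of group 1.
captured = re.findall(r'(?:[ |])([A-Z]+<\d+:\d+)>', s1)

print(lookaround)
print(captured)
```

For this input both calls return the same list. The capture-group form may be preferable where lookbehind is unavailable or considered harder to read.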

Finding exact values associated with given word using regex in python

I am trying to find the values associated with a particular word using regex, but I am not getting the expected results.
I wrote a pattern that works fine for standard input only, and I want to do the same for all sorts of inputs.
What I have now:
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'''
Pattern which I wrote:
re.findall(r'{}=(.*)'.format(detected_word), search_query)[0].split(',')[0]
detected_word is a variable holding the left-hand side of the equals sign (WBC, RBC, ...), which I detect using another technique.
In the case above it works fine, but if I change the sentence pattern as below, I am unable to find a generic pattern.
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'''
string = r'''results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'''
No matter the string format, I am able to detect the words WBC, RBC and Hgb, but detecting the value associated with each word is what I am struggling with.
Could anyone please help me with this?
Thanks in advance
Here is an idea: use two separate patterns for the strings you provided as sample input, the first one will extract values coming after expected word= and the other will extract them from clauses of expected word1 + optional expected word2 + optional expected word3 + "to be" verb + value1, optional value2 and optional value3.
Pattern 1:
\b(WBC|RBC|Hgb)=(\S*)\b
See the regex demo.
\b(WBC|RBC|Hgb) - a whole word WBC, RBC or Hgb
= - a = char
(\S*)\b - Group 2: 0 or more non-whitespaces, that stops at last word boundary position
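A minimal sketch of Pattern 1 in Python, run against the two word=value sample strings from the question (the helper name extract_pairs is hypothetical):

```python
import re

PATTERN_1 = re.compile(r'\b(WBC|RBC|Hgb)=(\S*)\b')

def extract_pairs(text):
    """Return a dict mapping each detected word to its value."""
    # findall yields (word, value) tuples because there are two groups.
    return dict(PATTERN_1.findall(text))

first = extract_pairs("results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL")
second = extract_pairs("results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL")

print(first)
print(second)
```

The trailing \b is what drops the comma after each value: \S* first grabs it, then backtracks to the last position where a word character meets a non-word character.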
Pattern 2:
\b(WBC|RBC|Hgb)(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*(\d\S*)\b(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?
See regex demo.
\b(WBC|RBC|Hgb) - Group 1 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))? - an optional pattern:
(?:\s+and)? - an optional sequence of 1+ whitespaces and then and
(?:\s*,)? - an optional sequence of 0+ whitespaces and then a comma
\s+(WBC|RBC|Hgb) - 1+ whitespaces and Group 2 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - same as above, except that 0+ whitespaces are allowed before the word; captures the 3rd optional searched word into Group 3
\s+ - 1+ whitespaces
(?:is|(?:a|we)re|was|will\s+be) - a VERB, you may add more if you expect them to be at this position, or plainly try a \S+ or \w+ pattern instead
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma sequence, then 0+ whitespaces
(\d\S*)\b - Group 4 (pair it with Group 1 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - an optional group matching
(?:\s+and)? - an optional sequence of 1+ whitespaces and and
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma, then 0+ whitespaces
(\d\S*)\b - Group 5 (pair it with Group 2 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - same as above, with a capture group 6 that must be paired with Group 3.
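The group pairing described above (Group 1 with 4, 2 with 5, 3 with 6) could be sketched as follows. The pattern is the one given under Pattern 2, split across lines only for readability; the names text and pairs are illustrative:

```python
import re

PATTERN_2 = re.compile(
    r'\b(WBC|RBC|Hgb)'
    r'(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?'
    r'\s+(?:is|(?:a|we)re|was|will\s+be)'
    r'(?:\s*,)?\s*(\d\S*)\b'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
)

text = "results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL"

pairs = {}
m = PATTERN_2.search(text)
if m:
    groups = m.groups()
    words, values = groups[:3], groups[3:]
    # Optional groups that did not participate are None, so skip those.
    pairs = {w: v for w, v in zip(words, values) if w and v}

print(pairs)
```

This only covers up to three words and three values, as the pattern itself does. Longer lists would need more optional groups or a different approach.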

Regex include only one digit between chars

I have to parse a PDF document and I'm using PyPDF2 with re(regex).
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code works, but sometimes it doesn't find anything, since there can be a digit among these characters, for example, 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
See demo
\:\d+([A-Z].*?)(?=\d+\,\d+$)
\: matches the character : literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([A-Z].*?)
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\d+\,\d+$)
Assert that the Regex below matches
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
print( m.group(1) )
See an online Python demo. With this solution, your data may start with any char, and will contain any char (excluding line break chars, since your data is on single lines).
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, :, two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
See the regex demo.
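To show the difference from the pattern in the question, here is a small sketch that runs both regexes over the two sample lines, one without and one with a digit inside the name. The list lines stands in for the text extracted from the PDF, which is an assumption for the example:

```python
import re

# Pattern from the question: \D+ cannot cross a digit inside the name.
old_re = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')

# Suggested pattern: a lazy (.*?) followed by the amount anchored at the end.
new_re = re.compile(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$')

lines = [
    "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40",
    "18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40",
]

old_hits = [old_re.search(line) is not None for line in lines]
names = [m.group(1) for m in map(new_re.search, lines) if m]

print(old_hits)
print(names)
```

The lazy group works here because the amount is anchored with $, so a digit in the middle of the name cannot be mistaken for the start of the amount.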

RegEx for matching two digits and everything except new lines and dot

Using Python 3, I'm trying to find a string only if it contains one to two digits (and no more than that in the same number) along with everything else following it. The match breaks on periods or new lines.
\d{1,2}[^.\n]+ is almost right except it returns numbers greater than two digits.
For example:
"5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(jn."
Should return:
5+years {} experience
10 asdasdas
1abc1
Based upon your description and your sample data, you can use following regex to match the intended strings and discard others,
^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)
Regex Explanation:
^ - Start of line
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Also, notice that multiline mode is enabled, as ^ and $ need to match the start and end of each line.
Regex Demo 1
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1']
Also, if matching lines don't necessarily start with a digit, you can use the following regex. If you want the result to start at the first digit, take it from group 1; if the leading non-digit characters should be kept as well, use the whole match.
^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)
Regex Explanation:
^ - Start of line
[^\d\n]* - Allows zero or more non-digit characters before first digit
( - Starts first grouping pattern to capture the string starting with first digit
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
) - End of first capturing pattern
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Multiline mode must be enabled, either by placing the inline modifier (?m) at the start of the regex or by passing re.MULTILINE as the third argument to re.search
Regex Demo 2
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
aaa1abc1
aa2aa1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1', '1abc1']
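The explanation above mentions two ways of enabling multiline mode. A brief sketch comparing them, and showing what happens without the flag, using the first regex and its sample text (the variable names are illustrative):

```python
import re

s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(2jn.'''

pattern = r'^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)'

# Option 1: pass the flag as an argument.
with_flag = re.findall(pattern, s, flags=re.MULTILINE)
# Option 2: prefix the pattern with the inline modifier.
with_inline = re.findall('(?m)' + pattern, s)
# Without multiline mode, ^ only matches at the very start of the text.
without = re.findall(pattern, s)

print(with_flag)
print(with_inline)
print(without)
```

Both options give the same result. Which one to use is mostly a matter of whether the pattern should carry its own flags or the call site should decide.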

Extracting a string within a string and omitting the search string

I have a string:
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
I want to extract the numbers 23.99, 44.65, 98.44, 33.94, 2,300.00. I have this regex
\$(.*[^\s])
There are 2 issues with this.
It returns the '$' sign. I only want the number.
It only works when there is a space at the end of the number but sometimes there might be letters and it won't work in that case.
Thanks.
You can use regex as shown:
import re
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
res = re.findall(pattern=r"[\d.,]+", string=string)
output:
['23.99', '44.65', '98.44', '33.94', '2,300.00']
Try this regex:
(?<=\$)\d+(?:,\d+)*(?:\.\d+)?
Click for Demo
Explanation
(?<=\$) - positive lookbehind to find the position just preceded by a $
\d+ - matches 1+ occurrences of a digit
(?:,\d+)* - matches 0+ occurrences of a , followed by 1 or more digits
(?:\.\d+)? - matches a . followed by 1+ digits. ? in the end makes this decimal part optional
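A short sketch of the lookbehind pattern in Python. The second input, noisy, is a made-up string added only to illustrate how this pattern differs from a plain [\d.,]+ character class when other punctuation is present:

```python
import re

string = "soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"

AMOUNT = r'(?<=\$)\d+(?:,\d+)*(?:\.\d+)?'
amounts = re.findall(AMOUNT, string)
print(amounts)

# Hypothetical input with stray commas and dots outside the amount.
noisy = "total, approx. $5.00"
loose = re.findall(r'[\d.,]+', noisy)   # also picks up "," and "."
strict = re.findall(AMOUNT, noisy)      # only the number after "$"
print(loose)
print(strict)
```

The lookbehind keeps the $ out of the result without needing a capture group, so re.findall returns the numbers directly.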
