Data Set
Cider
631
Spruce
871
Honda
18813
Nissan
3292
Pine
10621
Walnut
10301
Code
#!/usr/bin/python
import re
text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"
f1 = re.findall(r"(Cider|Pine)\n(.*)",text)
print(f1)
Current Result
[('Cider', '631'), ('Pine', '10621')]
Question:
How do I change the regex so that it matches everything except several specified strings, e.g. (Honda|Nissan)?
Desired Result
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
You can exclude lines that consist only of one of the names or only of digits, and then match two consecutive lines where the first one starts with a non whitespace char.
^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)
The pattern matches:
^ Start of the string (or of each line when re.MULTILINE is set)
(?! Negative lookahead, assert that what is directly to the right is not
(?:Honda|Nissan|\d+)$ Any of the alternatives, followed by the end of the line
) Close lookahead
(\S.*) Capture group 1, match a non whitespace char followed by the rest of the line
\n Match a newline
(.*) Capture group 2, match the rest of the next line (any chars except a newline)
import re
text = ("Cider\n"
"631\n\n"
"Spruce\n"
"871\n\n"
"Honda\n"
"18813\n\n"
"Nissan\n"
"3292\n\n"
"Pine\n"
"10621\n\n"
"Walnut\n"
"10301")
f1 = re.findall(r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)
print(f1)
Output
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
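If the names to exclude come from a list rather than being typed into the pattern, the same lookahead can be built programmatically. This is only a sketch: the `excluded` list and the variable names are assumptions made for illustration, and `re.escape` is used so that names containing regex metacharacters are treated literally.

```python
import re

# Hypothetical list of names to skip (not part of the original question).
excluded = ["Honda", "Nissan"]

# Escape each name so characters like "." or "+" are matched literally.
alternation = "|".join(re.escape(name) for name in excluded)

# Same shape as the pattern above: reject excluded names and digit-only lines.
pattern = re.compile(rf"^(?!(?:{alternation}|\d+)$)(\S.*)\n(.*)", re.MULTILINE)

text = ("Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\n"
        "Nissan\n3292\n\nPine\n10621\n\nWalnut\n10301")
pairs = pattern.findall(text)
print(pairs)
```

The `\d+` alternative matters here: without it, a rejected name's number line (such as 18813) would itself be picked up as the start of a new pair.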
If the line should start with an uppercase char A-Z and the next line should consist of only digits:
^(?!Honda|Nissan)([A-Z].*)\n(\d+)$
This pattern matches:
^ Start of the string (or of each line when re.MULTILINE is set)
(?!Honda|Nissan) Negative lookahead, assert not Honda or Nissan directly to the right
([A-Z].*) Capture group 1, match an uppercase char A-Z followed by the rest of the line
\n Match a newline
(\d+) Capture group 2, match 1+ digits
$ End of the string (or of each line when re.MULTILINE is set)
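A short way to try this stricter pattern, assuming the same sample text as in the first example and the re.MULTILINE flag so that ^ and $ work per line. The variable names are made up for illustration.

```python
import re

text = ("Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\n"
        "Nissan\n3292\n\nPine\n10621\n\nWalnut\n10301")

# Name line must start with A-Z and must not begin with Honda or Nissan;
# the following line must consist of digits only.
strict = re.compile(r"^(?!Honda|Nissan)([A-Z].*)\n(\d+)$", re.MULTILINE)

strict_pairs = strict.findall(text)
print(strict_pairs)

# Without a trailing $ inside the lookahead, names that merely begin with
# an excluded word are rejected as well.
prefix_case = strict.findall("Hondas\n5")
print(prefix_case)
```

If names such as Hondas should be allowed, the lookahead would need an end-of-line assertion, as in the first pattern.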
Note that the caret ^ does not invert a group.
f1 = re.findall(r"(\s?^(Cider|Pine))\n(.*)",text)
Keep in mind that the caret has two meanings in a regex. Inside a character class, as in [^abc], it negates the class. Anywhere else it is an anchor that asserts the start of the string (or the start of a line with re.MULTILINE).
That is why putting an optional \s? in front of it does not turn it into an inverse operator: in the pattern above ^ is still an anchor, so without re.MULTILINE the only match is the Cider pair at the very start of the text. To exclude whole words such as Honda or Nissan, use a negative lookahead like (?!Honda|Nissan) instead.
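A quick check of how the caret behaves in Python's re module may help here. This is a small sketch on a shortened, made-up sample; it shows ^ negating a character class, ^ acting as a line anchor, and a negative lookahead doing the actual word exclusion.

```python
import re

text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813"

# Inside a character class, ^ negates: runs of chars that are neither digits nor newlines.
non_digits = re.findall(r"[^\d\n]+", text)

# Outside a class, ^ is an anchor; with re.MULTILINE it matches at each line start.
line_starts = re.findall(r"^[A-Z]\w*", text, re.MULTILINE)

# Excluding a whole word takes a negative lookahead, not a caret.
not_honda = re.findall(r"^(?!Honda$)[A-Z]\w*$", text, re.MULTILINE)

print(non_digits)
print(line_starts)
print(not_honda)
```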
Related
I want to match any string that starts with . and word and then optionally any character after a space.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want the second group to be mandatory if the first group is jargon.
Using Python, you can use a negative lookahead to exclude jargon when it is the only word, and then match 1 or more word characters.
Then optionally match 1 or more whitespace characters other than newlines, followed by 1 or more characters other than newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exclude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
Example
import re
strings = [
".just one two",
".just",
".blah one#nine",
".blah",
".jargon blah",
".jargon"
]
for s in strings:
    m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
    if m:
        print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah
One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
import re
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", ".jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
If successful, this will match the entire string. If one simply wants to know whether the string conforms to the requirements, .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the run of word characters is either followed by a space and one more character, or ends the string.
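To check this pattern against the sample strings from the question, a minimal sketch could look like the following. The `samples` and `accepted` names are assumptions made for illustration.

```python
import re

pattern = re.compile(r"^\.(?!jargon$)\w+(?= .|$).*")

samples = [".just one two", ".just", ".blah one#nine",
           ".blah", ".jargon blah", ".jargon"]

# Keep only the strings the pattern accepts; ".jargon" alone should be rejected.
accepted = [s for s in samples if pattern.match(s)]
print(accepted)
```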
I am a RegEx beginner trying to identify the endings of different statements in SMS messages.
How can I avoid selecting the letter that follows the full stop marking the end of a statement?
Note that some statements have <.><Alphabets> while some have <.><space><Alphabets>
Regex used: r"\. ?[\D]"
Sample SMS (I want to select just the full stop, and the space if there is one):
Txn of USD 00.00 done using TC XX at POS*MERCH on 30-Feb-22. Avl bal:USD 00.00. Call xxxxxx for dispute or SMS BLOCK xxxx to xxxxxxx
Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ.Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx
What you're looking for is a lookahead. Whether you use a positive lookahead with the negated class \D or a negative lookahead with \d makes little difference here. The only difference is at the very end of the string, where (?=\D) fails because there is no character left and (?!\d) succeeds. I'll outline both below:
regex = r"\. ?(?=\D)" # asserts that the following character matches \D
regex = r"\. ?(?!\d)" # asserts that the following character does NOT match \d
There are also lookbehind variants, (?<!pattern) and (?<=pattern), which assert that the pattern doesn't/does match just before the current position.
None of these groups consume or capture the matched text; they just "look ahead" or "look behind" without moving the current position.
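Both lookahead variants can be tried on the first sample SMS from the question. This is a sketch with the dot escaped as \. so that only literal full stops are considered. The "Done." string is a made-up input used to show how the two variants differ at the end of a string.

```python
import re

sms = ("Txn of USD 00.00 done using TC XX at POS*MERCH on 30-Feb-22. "
       "Avl bal:USD 00.00. Call xxxxxx for dispute or SMS BLOCK xxxx to xxxxxxx")

# Positive lookahead: the next character must exist and be a non-digit.
positive = re.findall(r"\. ?(?=\D)", sms)

# Negative lookahead: the next character must not be a digit (end of string is fine).
negative = re.findall(r"\. ?(?!\d)", sms)

print(positive)
print(negative)

# The variants only differ when the full stop is the last character.
at_end_positive = re.findall(r"\. ?(?=\D)", "Done.")
at_end_negative = re.findall(r"\. ?(?!\d)", "Done.")
print(at_end_positive, at_end_negative)
```

The decimal points in 00.00 are skipped in both cases because a digit follows them directly.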
\. ?[\D] consumes a single non-digit char after the dot, and that char can also be a space or a newline.
If you want to match only the dot, and not when it is the last character on the line, you can use a lookahead that first skips optional whitespace chars other than newlines
and then asserts a non whitespace char that is not a digit.
\.(?=[^\S\n]*[^\s\d])
The pattern matches:
\. Match a dot
(?= Positive lookahead to assert what is directly to the right of the current position is
[^\S\n]* Match optional whitespace chars without a newline
[^\s\d] Match a single non whitespace char other than a digit
) Close lookahead
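One possible use of this pattern is splitting a message into statements. The sketch below applies re.split to the second sample SMS; splitting and stripping are assumptions about what is done with the matches, not something the question asks for.

```python
import re

sms = ("Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ."
       "Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx")

# Split on full stops that are followed (after optional spaces) by a
# non-whitespace, non-digit character; decimal points are left alone.
statements = [part.strip() for part in re.split(r"\.(?=[^\S\n]*[^\s\d])", sms)]

for statement in statements:
    print(statement)
```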
String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
\w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode word characters unless re.ASCII is set)
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
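Applied to the two strings from the question, the pattern can be tried like this. It is a minimal sketch and the variable names are made up for illustration.

```python
import re

inputs = [
    "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97",
    "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17",
]

# Word characters, a literal "<", then everything up to (not including) the next ">".
results = [re.findall(r"\w+<[^>]+", line) for line in inputs]

for found in results:
    print(found)
```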
We can use re.findall here with the pattern (\w+<.*?)>:
import re
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A slightly more precise pattern asserts either a space or a pipe to the left, matches the digits separated by a colon, and asserts the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match < and then 1+ digits on either side of a :
(?=>) Positive lookahead, assert > directly to the right
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>
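Both variants can be compared on the first string from the question. In this sketch the capture-group form uses \d+ on both sides of the colon and a plain [ |] in place of the non-capturing group; the variable names are made up for illustration.

```python
import re

line = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"

# Lookaround variant: the space/pipe and the ">" are asserted, not consumed.
lookaround = re.findall(r"(?<=[ |])[A-Z]+<\d+:\d+(?=>)", line)

# Capture-group variant: the delimiters are consumed, only group 1 is returned.
captured = re.findall(r"[ |]([A-Z]+<\d+:\d+)>", line)

print(lookaround)
print(captured)
```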
I have a field that looks like this :
------->Total cash dispensed: 40000 MGA
I want to get only the "MGA" using regex but without using split
regex = r"Total cash dispensed:\s*([^ 0-9]*)"
The pattern I used to get anything that is not a number or whitespace does not work. How do I fix this?
You might use a capture group:
\bTotal cash dispensed:\s*\d+\s+([A-Z]+)\b
\bTotal cash dispensed:\s* Match the text starting with a word boundary and followed by : and optional whitespace chars
\d+\s+ Match 1+ digits and 1+ whitespace chars
([A-Z]+) Capture group 1, match 1+ chars A-Z
\b A word boundary to prevent a partial match
import re
pattern = r"\bTotal cash dispensed:\s*\d+\s+([A-Z]+)\b"
s = "------>Total cash dispensed: 40000 MGA"
matches = re.search(pattern, s)
if matches:
    print(matches.group(1))
Output
MGA
You can match anything except whitespace and digits with the following expression:
[^\s\d]+
"\s" is whitespace, including spaces, tabs, and newlines
"\d" is digits, same as [0-9] and some other numeric characters from other scripts
Example:
re.search(r'[^\s\d]+', '123 \t')
# -> No match
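On the field from the question, this character class on its own matches every run of non-whitespace, non-digit characters, not just the currency code. A small sketch, assuming the code is always the last such run in the field:

```python
import re

s = "------->Total cash dispensed: 40000 MGA"

# Every run of characters that are neither whitespace nor digits.
tokens = re.findall(r"[^\s\d]+", s)
print(tokens)

# Assumption: the currency code is the last token in the field.
currency = tokens[-1]
print(currency)
```

If that assumption does not hold, anchoring on the literal "Total cash dispensed:" text with a capture group is the safer option.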
I'm trying to extract a pattern from string using python regex. But it is not working with the below pattern
headerRegex = re.compile(r'^[^ ]*\s+\d*')
mo = headerRegex.search(string)
return mo.group()
My requirement is a regular expression that starts with anything except whitespace, followed by one or more whitespace characters and then one or more digits.
Example
i/p: test 7895 => o/p: 7895 (correct)
i/p: 8545 ==> not matching
i/p: #### 3453 ==> 3453
What is missing in my regex to implement this requirement?
In the pattern that you tried, only the whitespace chars are mandatory, because both [^ ]* and \d* can match zero characters, and \s can also match newlines.
Change the quantifiers to + to match 1+ times, and if you don't want to match newlines as well use [^\S\r\n]+ instead of \s+.
If only that exact format is allowed, add the anchor $ to assert the end of the string, or use \Z if a trailing newline must not be allowed either.
^\S+[^\S\r\n]+\d+$
^ Start of string
\S+ Match 1+ times a non whitespace char
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
\d+ Match 1+ digits
$ End of string
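The examples in the question suggest that only the digits are wanted as output. A sketch of that, assuming a capture group around \d+ is acceptable; the `extract_digits` helper is a hypothetical name, not part of the original code.

```python
import re

# Same pattern as above, with a capture group added around the digits.
header_regex = re.compile(r"^\S+[^\S\r\n]+(\d+)$")

def extract_digits(line):
    """Return the trailing digits, or None if the line does not fit the format."""
    mo = header_regex.search(line)
    return mo.group(1) if mo else None

for sample in ["test 7895", "8545", "#### 3453"]:
    print(repr(sample), "->", extract_digits(sample))
```

Returning None instead of calling mo.group() unconditionally avoids an AttributeError on lines that do not match.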