I have a text like this
EXPRESS blood| muscle| testis| normal| tumor| fetus| adult
RESTR_EXPR soft tissue/muscle tissue tumor
Right now I want to only extract the last item in EXPRESS line, which is adult.
My pattern is:
[|](.*?)\n
But the match starts at the first pipe and returns muscle| testis| normal| tumor| fetus| adult. Is there any way to solve this?
Match a pipe followed by optional spaces, then capture any characters that are not a pipe or a newline. The value you want is in capture group 1.
If there has to be a newline at the end of the string:
\|[^\S\n]*([^|\n]*)\n
Explanation
\| Match |
[^\S\n]* Match optional whitespace chars without newlines
( Capture group 1
[^|\n]* Match optional chars except for | or a newline
) Close group 1
\n Match a newline
Or asserting the end of the string:
\|[^\S\n]*([^|\n]*)$
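A minimal Python sketch of both patterns follows. The variable names are illustrative, the trailing newline in the sample text is assumed, and re.MULTILINE is an assumption added here so that $ can match at the end of the EXPRESS line instead of only at the end of the whole text.

```python
import re

# Sample text from the question; the trailing newline is assumed.
text = (
    "EXPRESS blood| muscle| testis| normal| tumor| fetus| adult\n"
    "RESTR_EXPR soft tissue/muscle tissue tumor\n"
)

# Variant 1: the captured value must be followed by a newline.
with_newline = re.compile(r"\|[^\S\n]*([^|\n]*)\n")

# Variant 2: assert the end of a line instead of consuming the newline.
with_anchor = re.compile(r"\|[^\S\n]*([^|\n]*)$", re.MULTILINE)

last_item_1 = with_newline.search(text).group(1)
last_item_2 = with_anchor.search(text).group(1)
print(last_item_1, last_item_2)
```

Both variants skip the earlier pipes because the negated class [^|\n] cannot cross a pipe, so only the last field can reach the newline or the end of the line.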
You could use this one. It skips the leading space, handles the \r\n case and is non-greedy:
\|\s*([^|\r\n]*?)\r?\n
I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the regexr site, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
Note, however, that this does not work for the string defined in the OP's code, which ends with a question mark, unlike the two examples given in the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.
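As a small runnable sketch of this split (the variable names are illustrative, not from the answer), applied to the two examples and to the string from the question's code:

```python
import re

# Split on spaces that follow ')' or that precede '('.
splitter = re.compile(r"(?<=\)) +| +(?=\()")

examples = ["(so) what (are you trying to say)", "what (do you mean)"]
results = [splitter.split(s) for s in examples]
for parts in results:
    print(parts)

# The string from the question's code ends with '?', which stays
# attached to the last element.
with_question_mark = splitter.split("(so) what (are you trying to say)?")
print(with_question_mark)
```

Because lookarounds consume nothing, only the spaces themselves are removed and the parentheses stay in the resulting items.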
I am a RegEx beginner trying to identify where the different statements in an SMS end. See screenshot below.
How can I avoid selecting the letter that follows the full stop marking the end of a statement?
Note that some statements have <.><Alphabets> while some have <.><space><Alphabets>
Regex used: r"\. ?[\D]"
Sample SMS (I want to select just the full stop and the space, if any):
Txn of USD 00.00 done using TC XX at POS*MERCH on 30-Feb-22. Avl bal:USD 00.00. Call xxxxxx for dispute or SMS BLOCK xxxx to xxxxxxx
Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ.Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx
screenshot from RegExr on regular pattern
What you're looking for is a lookahead. Whether you use a positive lookahead with the negated class \D or a negative lookahead with \d matters little here; the two differ only at the end of the string, where (?!\d) still succeeds and (?=\D) does not. I'll outline both below:
regex = r"\. ?(?=\D)" # asserts that the following character matches \D
regex = r"\. ?(?!\d)" # asserts that the following character does NOT match \d
There are also lookbehind variants, (?<!pattern) and (?<=pattern), which assert that the pattern doesn't/does match just before the current position.
None of these groups consume or capture text; they just "look ahead" or "look behind" without moving the match position.
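To see where the positive and negative lookahead differ, here is a short sketch with the dot escaped as \. so that it matches a literal full stop. The sample string is made up for illustration.

```python
import re

sample = "Bal 7.95. Call now."  # made-up text, not from the question

# Positive lookahead: some non-digit character must follow.
positive = re.findall(r"\. ?(?=\D)", sample)

# Negative lookahead: whatever follows (if anything) must not be a digit.
negative = re.findall(r"\. ?(?!\d)", sample)

print(positive)
print(negative)
```

The dot inside 7.95 is skipped by both patterns. The final dot is matched only by the negative lookahead, because at the end of the string there is no character left for (?=\D) to match.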
The pattern \. ?[\D] consumes a single non-digit character after the dot, and that character can itself be a space or a newline.
If you want to match only the dot, and not when it is the last character in the string, you can use a lookahead that allows optional spaces (without newlines)
followed by a non-whitespace character that is not a digit.
\.(?=[^\S\n]*[^\s\d])
The pattern matches:
\. Match a dot
(?= Positive lookahead to assert what is directly to the right of the current position is
[^\S\n]* Match optional whitespace chars without a newline
[^\s\d] Match a single non whitespace char other than a digit
) Close lookahead
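A quick way to check the pattern in Python is to replace every matched dot with a visible marker. This is only a sketch: the sample string is made up in the style of the SMS above, and the <END> marker is an arbitrary choice.

```python
import re

# Made-up text modeled on the sample SMS.
sms = "Paid USD 10.50 on some date.Info: ABC. Bal:USD 7.95.Call now"

statement_end = re.compile(r"\.(?=[^\S\n]*[^\s\d])")

# Replace each statement-ending dot with a marker to make the matches visible.
marked = statement_end.sub("<END>", sms)
print(marked)
```

The dots inside 10.50 and 7.95 are left alone because a digit follows them, while the dots before Info, Bal and Call are matched whether or not a space comes next.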
I'm trying to extract a pattern from string using python regex. But it is not working with the below pattern
headerRegex = re.compile(r'^[^ ]*\s+\d*')
mo = headerRegex.search(string)
return mo.group()
My requirement is a regular expression that starts with anything except whitespace, followed by one or more whitespace characters, then one or more digits.
Example
i/p: test 7895 ==> o/p: 7895 (correct)
i/p: 8545 ==> Not matching
i/p: #### 3453 ==> o/p: 3453
May I know what is missing in my regex to implement this requirement?
In the pattern you tried, only the whitespace is mandatory (both [^ ]* and \d* can match nothing), and \s+ can also match newlines.
Change the quantifiers to + to match 1+ times, and if you don't want to match newlines as well, use [^\S\r\n]+ instead of \s+.
If only that exact match is allowed, add the anchor $ to assert the end of the string, or use \Z if a trailing newline must not be allowed.
^\S+[^\S\r\n]+\d+$
^ Start of string
\S+ Match 1+ times a non whitespace char
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
\d+ Match 1+ digits
$ End of string
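The examples in the question expect only the digits as output, so one option is to wrap \d+ in a capture group. The sketch below assumes that reading; the capture group, the extract_digits helper and the strict_regex variant are illustrative additions, not part of the pattern above.

```python
import re

# Same pattern as above, with a capture group around the digits (an addition).
header_regex = re.compile(r"^\S+[^\S\r\n]+(\d+)$")

# Variant with \Z, which does not tolerate a trailing newline.
strict_regex = re.compile(r"^\S+[^\S\r\n]+(\d+)\Z")


def extract_digits(line, regex=header_regex):
    """Return the digits after the first token, or None if there is no match."""
    match = regex.search(line)
    return match.group(1) if match else None


for line in ["test 7895", "8545", "#### 3453"]:
    print(repr(line), "->", extract_digits(line))
```

In Python, $ also matches just before a single trailing newline, so "test 7895\n" still matches header_regex but not strict_regex.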
I have structured documents in the following format:
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|
xyz pqs 123
[report_end]
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|
xyz pqs 123
[report_end]
Where each record:
starts with an 11-field line delimited by |
has an intervening block of free text
ends with the tag "[report_end]"
How can I capture these three elements with a regular expression?
My approach would be to
search each line that has 11 | characters;
search each line that has [report_end];
search whatever is in between these two lines.
But I don't know how to accomplish this with regular expressions.
You can also try with:
^(?P<fields>(?:[^|]+\|){11})(?P<text>[\s\S]+?)(?P<end>\[report_end\])
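A hedged sketch of how this pattern might be used from Python. The re.MULTILINE flag is an assumption added here so that ^ can match at the start of every record line, not only at the start of the document, and the dictionary layout is purely illustrative.

```python
import re

# Two records copied from the question.
document = (
    "123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|"
    "LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|"
    "DNR Order Verification:Scanned|\n"
    "xyz pqs 123\n"
    "[report_end]\n"
    "123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|"
    "LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|\n"
    "xyz pqs 123\n"
    "[report_end]\n"
)

record_re = re.compile(
    r"^(?P<fields>(?:[^|]+\|){11})(?P<text>[\s\S]+?)(?P<end>\[report_end\])",
    re.MULTILINE,
)

records = []
for match in record_re.finditer(document):
    records.append({
        # Drop the final '|' and split the header line into its 11 fields.
        "fields": match.group("fields").rstrip("|").split("|"),
        # The free text sits between newlines, so strip them.
        "text": match.group("text").strip(),
        "end": match.group("end"),
    })

for record in records:
    print(len(record["fields"]), record["text"], record["end"])
```

Since [\s\S]+? can cross line breaks, the text group also works when the free-text block spans several lines.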
You can use something like:
r"((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])"
OUTPUT:
Match 1. [0-157] `123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|
xyz pqs 123
[report_end]`
Match 2. [159-292] `123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|
xyz pqs 123
[report_end]`
DEMO
https://regex101.com/r/xY5nI9/1
Regex Explanation
((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the regex below and capture its match into backreference number 1 «((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])»
   Match the regular expression below «(?:.*?\|){11}»
      Exactly 11 times «{11}»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “|” literally «\|»
   Match a single character that is a “whitespace character” «\s+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the regular expression below «(?:.*)»
      Match any single character that is NOT a line break character «.*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match a single character that is a “whitespace character” «\s+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the character “[” literally «\[»
   Match the character string “report_end” literally «report_end»
   Match the character “]” literally «\]»
UPDATE BASED ON YOUR COMMENTS
To get 3 groups you can use:
r"((?:.*?\|){11})\s+(.*)\s+(\[report_end\])"
To loop all groups:
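import re
pattern = re.compile(r"((?:.*?\|){11})\s+(.*)\s+(\[report_end\])")
# findall returns one (fields, text, end_tag) tuple per record.
# Note: (.*) does not cross line breaks, so the free text must fit on one line.
for fields, text, end_tag in pattern.findall(string):
    print(fields + "\n" + text + "\n" + end_tag + "\n")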
LIVE DEMO
http://ideone.com/k8sA3k
I'm new to regex. Here is my data.
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
I want to get this.
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Here is my regex.
(<p>\[tag(.*)\])(.+)(\[\/tag\]<\/p>)
But it doesn't work because of the newlines (\n). If I use re.DOTALL, it works, but if my data has multiple records like
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
re.findall() returns only one match. In short, I want this:
[data1,data2,data3...]. What can I do?
Simple as this:
\](.*?)\[
reobj = re.compile(r"\](.*?)\[", re.IGNORECASE | re.DOTALL | re.MULTILINE)
result = reobj.findall(YOURSTRING)
Output:
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Regex Explanation:
\] matches the character ] literally
1st Capturing group (.*?)
.*? matches any character
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\[ matches the character [ literally
s modifier: single line. Dot matches newline characters
You can use this regex:
\[tag\]([\s\S]*?)\[\/tag\]
Match information:
MATCH 1
1. [8-44] `y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38`
Update: what the pattern does
\[tag\]
([\s\S]*?) --> [\s\S]*? is used to match everything, since \S matches any
non-whitespace character and \s matches any whitespace character, newlines included.
This is just a trick; you can also use [\D\d] or [\W\w]. The *? is a non-greedy (lazy) quantifier.
\[\/tag\]
On the other hand, if you want to allow attributes in the tag you can use:
\[tag.*?\]([\s\S]*?)\[\/tag\]
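For the multi-record case from the question, a minimal sketch with re.findall could look like this. The data string repeats the question's record twice and the variable names are illustrative. The second pattern is the shorter \](.*?)\[ suggestion, included only for comparison.

```python
import re

record = "<p>[tag]y,m,m,l\n1997,f,e,2.34g\n2000,m,c,2.38[/tag]</p>\n"
data = record * 2  # two identical records, as in the question

# Anchoring on the literal tags: one captured block per record.
tag_re = re.compile(r"\[tag\]([\s\S]*?)\[/tag\]")
blocks = tag_re.findall(data)

# Anchoring only on ']' and '[' also captures the text between
# one record's closing tag and the next record's opening tag.
loose_re = re.compile(r"\](.*?)\[", re.DOTALL)
loose_blocks = loose_re.findall(data)

print(blocks)
print(loose_blocks)
```

The lazy quantifier stops at the first [/tag], so each record yields its own list item instead of one match spanning all records.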