I have a text in which I need to grab data and split it up. I need to find "Review frequency" within a large group of text, then once that is found, take everything after it and stop at the ')'.
Example text is:
No. of components Variable
Review frequency Quarterly (Mar., Jun., Sep., Dec.)
Quick facts
To learn more about the
What I need is 'Quarterly' and 'Mar., Jun., Sep., Dec.'
My current regex is:
((?=.*?\bReview frequency\b)(\b(Q|q)uarterly|(A|a)nnually|(S|s)emi-(A|a)nnually))
But this is not working. Essentially the 'Review frequency' needs to be the qualifier before we start picking up the other information, as there may be other dates/periods within the file. Thank you!
You are not matching the rest of the data on the line.
I suggest using:
(?m)^Review frequency[ \t]+(\w+)[ \t]+(.+)
If the first capturing group can only contain 3 words as indicated in your pattern, use
(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)
Use these patterns with re.findall:
import re
regex = r"(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)"
test = "No. of components Variable\nReview frequency Quarterly (Mar., Jun., Sep., Dec.)\nQuick facts\nTo learn more about the"
print(re.findall(regex, test))
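The patterns above return the second value with its parentheses still attached. As a minimal sketch, assuming the month list is always wrapped in a single pair of parentheses, the pattern can match the parentheses literally and capture only what is inside them. The `\(([^)]*)\)` tail and the variable names are illustrative additions, not part of the original answer.

```python
import re

# Variant of the pattern above: the parentheses are matched literally,
# so only their contents end up in the second capturing group.
pattern = re.compile(
    r"^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+\(([^)]*)\)",
    re.MULTILINE,
)

sample = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)

matches = pattern.findall(sample)
print(matches)  # [('Quarterly', 'Mar., Jun., Sep., Dec.')]
```

Because "Review frequency" is anchored at the start of a line, other dates or periods elsewhere in the text are ignored.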
I would like to create a regex which can match the following pattern
HR060178 RGP LUKA RIJEKA
30.09.2022 09:42:52
22HR060178U0078350MRN
To be specific, I need the date that comes after the text LUKA RIJEKA and the MRN number starting with 22.
The screenshot is of another problem, wherein the highlighted numbers need to be extracted.
Since you only provided this one sample, it's difficult to test edge cases and fine-tune the regex, but the following should give you a good starting point:
((?<=LUKA\sRIJEKA\s)\d\d\.\d\d\.\d\d\d\d)|((?<=\sLUKA\sRIJEKA\s\d\d\.\d\d\.\d\d\d\d\s\d\d:\d\d:\d\d\s)22[^\s]+?MRN)
I have split your problem into two match groups, one that looks for the date and one that looks for the MRN.
I would advise you to read through this article: Regex lookahead, lookbehind and atomic groups, since the solution relies heavily on lookbehinds. For creating and testing regexes, I would recommend using one of the many websites that come up when you search for "regex online".
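A minimal sketch of how the pattern above could be applied in Python, assuming the three lines of the sample are joined by newlines. Both lookbehinds are fixed-width, which Python's re module requires. The variable names are illustrative.

```python
import re

# The pattern from the answer, split across two string literals for readability.
pattern = re.compile(
    r"((?<=LUKA\sRIJEKA\s)\d\d\.\d\d\.\d\d\d\d)"
    r"|((?<=\sLUKA\sRIJEKA\s\d\d\.\d\d\.\d\d\d\d\s\d\d:\d\d:\d\d\s)22[^\s]+?MRN)"
)

text = "HR060178 RGP LUKA RIJEKA\n30.09.2022 09:42:52\n22HR060178U0078350MRN"

date = mrn = None
for m in pattern.finditer(text):
    # Each match fills exactly one of the two groups.
    if m.group(1):
        date = m.group(1)
    elif m.group(2):
        mrn = m.group(2)

print(date, mrn)  # 30.09.2022 22HR060178U0078350MRN
```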
This question showed how to replace a regex with another regex like this
$string = '"SIP/1037-00000014","SIP/CL-00000015","Dial","SIP/CL/61436523277,45"';
$pattern = '["SIP/CL/(\d*),(\d*)",]';
$replacement = '"SIP/CL/\1|\2",';
$string = preg_replace($pattern, $replacement, $string);
print($string);
However, I couldn't adapt that pattern to solve my case where I want to remove the full stop that lies between 2 words but not between a word and a number:
text = 'this . is bad. Not . 820'
regex1 = r'(\w+)(\s\.\s)(\D+)'
regex2 = r'(\w+)(\s)(\D+)'
re.sub(regex1, regex2, text)
# Desired outcome:
'this is bad. Not . 820'
Basically, I would like to remove the . between two alphabetic words. Could someone please help me with this problem? Thank you in advance.
These expressions might be close to what you might have in mind:
\s[.](?=\s\D)
or
(?<=\s)[.](?=\s\D)
Test
import re
regex = r"\s[.](?=\s\D)"
test_str = "this . is bad. Not . 820"
print(re.sub(regex, "", test_str))
Output
this is bad. Not . 820
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
Firstly, you can't really take PHP and apply it directly to Python, for obvious reasons.
Secondly, it always helps to specify which version of Python you're using as APIs change. Luckily in this instance, the API of re.sub has remained the same between Python 2.x and Python 3.
Onto your issue.
The second argument to re.sub is either a string or a function. If you pass in regex2, it'll just replace each match of regex1 with the string contents of regex2; it won't apply regex2 as a regex.
If you want to use groups derived from the first regex (similar to your example, which is using \1 and \2 to extract the first and second matching group from the first regex), you can either reference them as \1 and \2 in the replacement string, or use a function, which takes a match object as its sole argument, which you could then use to extract matching groups and return them as part of the replacement string.
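As a rough sketch of the two replacement styles re.sub accepts, the snippet below removes the isolated full stop using first a replacement string with backreferences and then a function. The pattern is an assumed adaptation of regex1 from the question (it captures only one non-digit character after the dot), and the name `join_words` is hypothetical.

```python
import re

text = "this . is bad. Not . 820"

# Group 1: the word before the dot; group 2: the first non-digit after it.
pattern = r"(\w+)\s\.\s(\D)"

# Style 1: replacement string with backreferences to the groups.
with_backrefs = re.sub(pattern, r"\1 \2", text)

# Style 2: replacement function receiving the match object.
def join_words(match):
    return match.group(1) + " " + match.group(2)

with_function = re.sub(pattern, join_words, text)

print(with_backrefs)  # this is bad. Not . 820
print(with_function)  # this is bad. Not . 820
```

The dot before 820 is left alone because \D refuses to match the digit that follows it.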
I have a txt file that I converted from a pdf that contains a long list of items. These items have a numbering convention as follows:
[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}
This expression would match something between:
A1.1.1
and
ZZ99.99.99
This works just fine. The issue I am having is that I am trying to capture this in group 1 and everything between each item number (the item description) in group 2.
I also need these returned as a list or an iterable so that, eventually, the contents captured can be exported to an excel spreadsheet.
This is the regex I have currently:
^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)
Follow this link to find a sample of what I have and the issues I am facing:
Debuggex Demo
Is anyone able to help me figure out how to capture everything between each number no matter how many paragraphs?
Any input would be greatly appreciated, thanks!
You are very close:
import re
s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s)
Output:
[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']
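By using (.*?) you can match any text between the letters and numbers as defined by your first regex. Note that . does not match newlines unless re.DOTALL is passed, so descriptions spanning several lines or paragraphs need that flag.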
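A possible sketch for capturing both the item number (group 1) and a multi-paragraph description (group 2) is shown below. It departs from the answer above by using a lookahead for the next item number instead of consuming it, so that consecutive items are not skipped, and it assumes every item number starts at the beginning of a line. The sample text and names are made up for illustration.

```python
import re

ITEM = r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}"

# Group 1: item number; group 2: everything up to the next item number
# at the start of a line, or the end of the text.
pattern = re.compile(
    rf"^({ITEM})\s+(.*?)(?=^{ITEM}\s|\Z)",
    re.MULTILINE | re.DOTALL,
)

text = """A1.1.1 First description
that spans two lines.

Second paragraph of the first item.
A1.1.2 Second description.
ZZ99.99.99 Last description.
"""

# A list of (number, description) tuples, ready to be written out as rows.
items = [(num, desc.strip()) for num, desc in pattern.findall(text)]
for num, desc in items:
    print(num, repr(desc))
```

The resulting list of tuples can be passed row by row to whatever spreadsheet writer is used later.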
I have a few data lines:
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Pro3Pro/c.9T>C SNPEFF_CODON_CHANGE=ccT/ccC
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Trp7Ser/c.20G>C SNPEFF_CODON_CHANGE=tGg/tCg
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Lys17Arg/c.50A>G SNPEFF_CODON_CHANGE=aAa/aGa
and so on..
I want to be able to extract just the values for the keys SNPEFF_AMINO_ACID_CHANGE, that is p.Pro3Pro/c.9T>C, p.Trp7Ser/c.20G>C and p.Lys17Arg/c.50A>G. Any ideas on how to create a pattern for this?
Usually, when questions like this are asked, some effort needs to be shown. Next time, please state the exact problem and include what you have already attempted.
To get you started, you could try the following regular expression:
>>> re.findall(r'SNPEFF_AMINO_ACID_CHANGE=(\S+)', text)
This will extract the values from the pattern and store them in a list.
Explanation:
SNPEFF_AMINO_ACID_CHANGE= # match 'SNPEFF_AMINO_ACID_CHANGE='
( # group and capture to \1:
\S+ # non-whitespace (1 or more times)
) # end of \1
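For completeness, here is a small self-contained run of that expression against the three sample lines from the question. The variable names `text` and `changes` are illustrative.

```python
import re

text = """ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Pro3Pro/c.9T>C SNPEFF_CODON_CHANGE=ccT/ccC
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Trp7Ser/c.20G>C SNPEFF_CODON_CHANGE=tGg/tCg
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Lys17Arg/c.50A>G SNPEFF_CODON_CHANGE=aAa/aGa"""

# \S+ stops at the first whitespace, so only the value of the key is captured.
changes = re.findall(r"SNPEFF_AMINO_ACID_CHANGE=(\S+)", text)
print(changes)  # ['p.Pro3Pro/c.9T>C', 'p.Trp7Ser/c.20G>C', 'p.Lys17Arg/c.50A>G']
```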
I am trying to get the text between two fixed headers in Python.
Please check this link: http://regex101.com/r/jV4oP5/1
I want to extract everything that comes after OPINION. The regex I wrote matches only the first line, and it also matches OPINION BY.
Is there any other regex that can fetch the data?
Any help is appreciated.
Use the DOTALL modifier (s), which lets . match newlines too, to extract everything after OPINION.
OPINION.*
If you don't want to match OPINION then use a lookbehind,
(?<=OPINION).*
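In Python, the DOTALL modifier is passed as the re.DOTALL flag. The sketch below applies the lookbehind version to a made-up document, since the original sample is only available behind the link. `document` and `rest` are illustrative names.

```python
import re

# Hypothetical input; the real document from the question is not shown here.
document = "HEADER\nOPINION BY: Judge\nFirst line.\nSecond line."

# re.DOTALL makes . match newlines, so .* runs to the end of the text.
m = re.search(r"(?<=OPINION).*", document, re.DOTALL)
rest = m.group(0) if m else ""
print(repr(rest))
```

re.search returns the first occurrence, so everything after the first OPINION in the document is captured.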
If you really mean you want the entire document after "OPINION", try: (OPINION(\n*.*)*$)
Your regex was only finding the first new line character followed by any normal characters (excluding new line).