Python Regex for selecting text between fixed title - python

I am trying to get the text between two fixed headers in Python.
Please check this link: http://regex101.com/r/jV4oP5/1
I want to extract everything that comes after OPINION. The regex I wrote matches only the first line, and it also matches OPINION BY.
Is there another regex that can fetch the data?
Any help is appreciated.

Use the dotall modifier (s; re.DOTALL in Python) so that . also matches newlines and the pattern extracts everything from OPINION onward.
OPINION.*
If you don't want OPINION itself in the match, use a lookbehind:
(?<=OPINION).*
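A minimal Python sketch of both patterns, assuming a made-up sample text (the real document is behind the regex101 link and is not reproduced here):

```python
import re

# Hypothetical sample; stands in for the document from the question.
text = "Title\nOPINION\nLine one.\nLine two."

# re.DOTALL lets "." match newlines, so ".*" runs to the end of the string.
with_header = re.search(r"OPINION.*", text, re.DOTALL).group(0)

# The lookbehind asserts that OPINION precedes the match without consuming it.
after_header = re.search(r"(?<=OPINION).*", text, re.DOTALL).group(0)

print(repr(with_header))
print(repr(after_header))
```

Note that re.search stops at the first occurrence of OPINION, so a line such as OPINION BY appearing earlier in the text would be picked up first.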

If you really want the entire document after "OPINION", try: (OPINION(\n*.*)*$)
Your regex matched only the first newline character followed by ordinary characters (anything except a newline), so it stopped after one line.

Related

regex pattern to get date after a particular text and a combination of word-number string ending with text MRN

I would like to create a regex which can match the following pattern
HR060178 RGP LUKA RIJEKA
30.09.2022 09:42:52
22HR060178U0078350MRN
To be specific, I need the date that comes after the text LUKA RIJEKA
and the MRN number starting with 22.
The screenshot is of another problem, in which the highlighted numbers need to be extracted.
Since you provided only this one sample, it is difficult to test edge cases and fine-tune the regex, but the following should give you a good starting point:
((?<=LUKA\sRIJEKA\s)\d\d\.\d\d\.\d\d\d\d)|((?<=\sLUKA\sRIJEKA\s\d\d\.\d\d\.\d\d\d\d\s\d\d:\d\d:\d\d\s)22[^\s]+?MRN)
I have split your problem into two match groups: one that looks for the date and one that looks for the MRN.
I would advise you to read the article "Regex lookahead, lookbehind and atomic groups", since the solution relies heavily on lookbehinds. For creating and testing regexes, I recommend one of the many websites that come up when you search for "regex online".
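As a rough check, the pattern can be run in Python against the single sample from the question. Python's re module accepts these lookbehinds because each one has a fixed width. The variable names below are illustrative only:

```python
import re

# The one sample given in the question.
text = "HR060178 RGP LUKA RIJEKA\n30.09.2022 09:42:52\n22HR060178U0078350MRN"

pattern = re.compile(
    r"((?<=LUKA\sRIJEKA\s)\d\d\.\d\d\.\d\d\d\d)"
    r"|((?<=\sLUKA\sRIJEKA\s\d\d\.\d\d\.\d\d\d\d\s\d\d:\d\d:\d\d\s)22[^\s]+?MRN)"
)

date = mrn = None
for m in pattern.finditer(text):
    # Each match fills exactly one of the two groups.
    if m.group(1):
        date = m.group(1)
    if m.group(2):
        mrn = m.group(2)

print(date, mrn)
```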

Python regex to exclude several words

I am searching for URLs and want to exclude some of them. The variable download_artist stores the base URL, and I want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |, like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need a lookaround in this case; see https://www.regular-expressions.info/lookaround.html for details.
I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. The site is very useful when working with regexes.
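A minimal sketch of the negative lookahead in Python. The base URL and the HTML snippet are invented for illustration, and re.escape is an addition so that the dots in the base URL are matched literally:

```python
import re

# Hypothetical base URL and page content, for illustration only.
download_artist = "https://example.com/artist/"
html = "\n".join([
    '<a href="https://example.com/artist/uploads/">Uploads</a>',
    '<a href="https://example.com/artist/track-one/">One</a>',
    '<a href="https://example.com/artist/favorites/">Favorites</a>',
    '<a href="https://example.com/artist/track-two/">Two</a>',
])

# (?!...) rejects a match when one of the listed words follows the base URL.
pattern = rf'{re.escape(download_artist)}(?!uploads/|favorites/|followers/|listens/)\S+"'
urls = [u.rstrip('"') for u in re.findall(pattern, html, re.IGNORECASE)]

print(urls)
```

The pattern ends with a literal quote, so each raw match includes the closing quote of the href attribute; the sketch strips it afterwards.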
^ works as a negation only inside a character class ([]); outside of one it anchors the match to the beginning of the input.
I suggest you do two matches: one to match all URLs and another to match the ones to exclude. Then remove the second set of URLs from the first.
That keeps the regexes simple and readable.
If you have to do it in one regex for whatever reason, you can try a negative lookaround (see https://www.rexegg.com/regex-lookarounds.html).
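The two-pass approach could look roughly like this. The base URL, the sample HTML and the variable names are assumptions made for the sketch:

```python
import re

# Hypothetical base URL and page content, for illustration only.
download_artist = "https://example.com/artist/"
html = "\n".join([
    '<a href="https://example.com/artist/uploads/">Uploads</a>',
    '<a href="https://example.com/artist/track-one/">One</a>',
    '<a href="https://example.com/artist/listens/">Listens</a>',
])

base = re.escape(download_artist)

# Pass 1: every URL that starts with the base, up to the closing quote.
all_urls = re.findall(rf'{base}[^"\s]+', html, re.IGNORECASE)

# Pass 2: drop the URLs whose next path segment is on the exclusion list.
excluded = re.compile(rf"{base}(?:uploads|favorites|followers|listens)/", re.IGNORECASE)
kept = [u for u in all_urls if not excluded.match(u)]

print(kept)
```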

how to find occurrence of special character using regex

I have a URL like this:
http://foo.com/bar_by_baz.html
I want to extract baz from that URL using a regex, but so far I have only managed to write this much:
[_]+?\w[^.]+
This is giving me
_by_baz
as output. How can I match a special character exactly once, or what would be the best approach to solve this with a regex?
I am trying it on Python 3.x.
Here's your regex: [_]+?([^_.]+). The capture group isolates the target from the underscore and the dot. The URL contains two underscores, so the group captures by first and baz second; take the last match to get baz.
Alternatively, this variant captures only the alphanumerics: [_]+?([A-Za-z0-9]+)
I am going to assume from your profile that you are seeking a JavaScript-friendly solution (you should update your question and tags).
For JavaScript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
The pattern matches the substring containing no underscores that is followed by a dot and then one or more lowercase letters up to the end of the string.
There are several ways to accomplish your task. Finding the best or most efficient one is only possible if you provide more information about the coding environment or language and a few more sample strings.
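Since the question mentions Python 3.x, here is a small sketch that runs both suggested patterns against the sample URL. The lookahead pattern is written without the JavaScript slashes, and the variable names are illustrative:

```python
import re

url = "http://foo.com/bar_by_baz.html"

# Capture-group approach: one capture per underscore, so the last one is the target.
captures = re.findall(r"[_]+?([^_.]+)", url)
last_capture = captures[-1]

# Lookahead approach: underscore-free text directly before the final ".ext".
lookahead_match = re.search(r"[^_]+(?=\.[a-z]+$)", url).group(0)

print(captures, last_capture, lookahead_match)
```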

alternative regex to match all text in between first two dashes

I'm trying to use the following regex: \-(.*?)-|\-(.*?)* It seems to work fine on regexr, but Python says there's "nothing to repeat".
I'm trying to match all text between the first two dashes or, if there is no second dash after the first, all text from the first - onwards.
Also, the regex above includes the dashes, but I would prefer to exclude them so I don't have to do an extra replace.
You can use re.search with this pattern:
-([^-]*)
Note that - doesn't need to be escaped.
Another way is to find only the positions of the first two dashes and extract the substring between them. Or you can use split:
>>> 'aaaaa-bbbbbb-ccccc-ddddd'.split('-')[1]
'bbbbbb'
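The re.search variant might be wrapped like this; the helper name between_first_dashes is made up for the sketch:

```python
import re

def between_first_dashes(s):
    """Return the text after the first dash, up to the next dash or the end."""
    m = re.search(r"-([^-]*)", s)
    # No dash at all means there is nothing to extract.
    return m.group(1) if m else None

print(between_first_dashes("aaaaa-bbbbbb-ccccc-ddddd"))
print(between_first_dashes("aaaaa-bbbbbb"))
print(between_first_dashes("nodash"))
```

Unlike split('-')[1], which raises IndexError when the string has no dash, this sketch returns None in that case.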

Regex pattern to match two datetime formats

I am doing a directory listing and need to get all directory names that follow one of two patterns: Feb14-2014 or 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. In other words, I want to match only the directories that follow the pattern %b%d-%Y or %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to write the rest because my regex skills are poor, sorry. I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory entry is on its own line:
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
This matches both cases, and only if the text runs uninterrupted (by dots or anything else, for that matter) until the end of the line.
Some notes:
If you want to match everything except a dot, you may replace the final \w+ with [^.]+.
You need the multiline modifier (/m; re.MULTILINE in Python) for this to work, otherwise $ matches only at the end of the string.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory.
Of course you may expand this regex to use (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this, to keep it readable. I would also suggest storing the month names in a Python list and inserting them into the regex dynamically, for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
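A short Python sketch of this answer, assuming a newline-separated listing (the sample names beyond those in the question are invented) and adding the optional ^ anchor:

```python
import re

# Hypothetical listing, one directory name per line.
listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more\nnotes"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^([A-Z]\w{2}\d{1,2}-\d{4}|\d{7,8}-\w+)$", re.MULTILINE)
matches = pattern.findall(listing)

print(matches)
```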
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.
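As written, this pattern has no anchors, so with re.search it would also match the leading part of a name such as 14022014-sometext.more. One possible way around that in Python is re.fullmatch on each name. The sketch below also builds the month alternation from a list; the names MONTHS and candidates are illustrative:

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Build the month alternation from the list instead of hard-coding it.
month_alt = "|".join(MONTHS)
pattern = re.compile(rf"(?:{month_alt})\d\d-\d\d\d\d|\d\d\d\d\d\d\d\d-\w+")

# Hypothetical directory names, for illustration only.
candidates = ["Feb14-2014", "14022014-sometext", "14022014-sometext.more", "Foo14-2014"]

# fullmatch requires the whole name to match, which rules out names with dots.
kept = [name for name in candidates if pattern.fullmatch(name)]

print(kept)
```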
