Regex in python: matching duplicates of optional substrings

I am developing a python package that needs to, among other things, process a file containing a list of dataset names and I need to extract the components of these names.
Examples of dataset names would be:
diskLineLuminosity:halpha:rest:z1.0
diskLineLuminosity:halpha:rest:z1.0:dust
diskLineLuminosity:halpha:rest:z1.0:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OII:contam_OIII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust
diskLineLuminosity:halpha:rest:z1.0:contam_OII:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent
I'm looking for a way to parse the dataset names using regex to extract all the dataset information, including a list of all instances of "contam_*" (where zero instances are allowed). I realise that I could just split the string and use fnmatch.filter, or equivalent, but I also need to be able to flag erroneous dataset names that do not match the above syntax. Also, regex is currently used extensively in similar situations throughout the package, so I prefer not to introduce a second parsing method.
As an MWE, with an example dataset name, I have pieced together:
import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)
This returns:
print(M.group(1,2,3,4,5,6,7))
('disk', 'halpha', 'rest', '1.0', None, ':contam_NII', None)
In the package, this regex search needs to go into a function similar to:
def getDatasetNameInformation(datasetName):
    INFO = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)
    if not INFO:
        raise ParseError("Cannot parse '" + datasetName + "'!")
    return INFO
I am still new to using regex so how can I modify the re.search string to successfully parse all of the above dataset names and extract the information in the substrings (including a list of all the instances of contamination)?
Thanks for any help you can provide!

If you are still learning regular expressions (and, to be honest, later as well), get in the habit of using verbose mode as often as possible; it makes for better code and more readable expressions.
That said, you could use
^
(disk|spheroid)
LineLuminosity:
([^:]+):
([^:]+):
z([\d\.]+)
((?::contam_[^:]+)+)?
(:recent)?
(:dust[^:]*)?
I just changed the order a bit and used a non-capturing group inside the contam part; see the demo on regex101.com.
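As a rough sketch of how this verbose pattern could be wired into the question's function: the names DATASET_RE and parse_dataset_name are made up for illustration, the trailing $ is an added assumption so that names with unexpected trailing components are rejected, and ValueError merely stands in for the package's own ParseError.

```python
import re

# Verbose-mode version of the pattern above. The final "$" is an assumption
# added here so that unknown trailing components make the match fail.
DATASET_RE = re.compile(
    r"""
    ^
    (disk|spheroid)          # component
    LineLuminosity:
    ([^:]+):                 # emission line, e.g. halpha
    ([^:]+):                 # frame, e.g. rest
    z([\d\.]+)               # redshift
    ((?::contam_[^:]+)+)?    # all contam_ parts captured as one string
    (:recent)?
    (:dust[^:]*)?
    $
    """,
    re.VERBOSE,
)


def parse_dataset_name(name):
    """Hypothetical helper: return the name's components as a dict."""
    match = DATASET_RE.search(name)
    if match is None:
        # Stand-in for the package's own ParseError.
        raise ValueError("Cannot parse '" + name + "'!")
    component, line, frame, redshift, contam, recent, dust = match.groups()
    return {
        "component": component,
        "line": line,
        "frame": frame,
        "redshift": redshift,
        # Second pass over the captured contam string only.
        "contaminants": re.findall(r":contam_([^:]+)", contam or ""),
        "recent": recent is not None,
        "dust": dust is not None,
    }


info = parse_dataset_name(
    "diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust"
)
print(info["contaminants"])  # ['NII', 'OIII']
```

Note that this ordering expects any contam_ parts before recent and dust, which is how they appear in the example names in the question.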

You could capture all of those contam_ with ((?::contam_[^:]+)*): this will capture all of them in one group. Then launch a second regular expression, apply it just on that match alone, and use that result as a nested list within the first results:
import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:recent:contam_NII:contam_NII:dust"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(?::(recent))?((?::contam_[^:]+)*)(?::(dust))?", datasetName)
lst = list(M.groups())
if lst[5]:
    lst[5] = re.findall(r":contam_([^:]+)", lst[5])
print(lst)
Output:
['disk', 'halpha', 'rest', '1.0', 'recent', ['NII', 'NII'], 'dust']

Related

Wildcard in python dictionary

I am trying to create a python dictionary to reference 'WHM1',2,3, 'HISPM1',2,3, etc. and other iterations to create a new column with a specific string, e.g. White or Hispanic. Using regex seems like the right path, but I am missing something here and refuse to hard-code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict :
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages named regexdict, so you should probably point out which one you use. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case in regex), because this sort of detail may well differ between implementations.
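Since the behaviour of regexdict depends on which package is meant, here is a minimal sketch of the same anchored-prefix idea using only the standard re module. RACE_PATTERNS and lookup_race are hypothetical names, and the sample codes are invented for illustration.

```python
import re

# Ordered (pattern, label) pairs; "^" anchors each pattern at the start.
RACE_PATTERNS = [
    (re.compile(r"^W"), "White"),
    (re.compile(r"^H"), "Hispanic"),
]


def lookup_race(code):
    """Return the label of the first pattern that matches, else None."""
    for pattern, label in RACE_PATTERNS:
        if pattern.search(code):
            return label
    return None


codes = ["WHM1", "WHM2", "HISPM1", "HISPM3", "XYZ"]
print([lookup_race(code) for code in codes])
# ['White', 'White', 'Hispanic', 'Hispanic', None]
```

Because pandas' Series.map also accepts a function, a helper like this could be passed to .map() on the EEOC_Code column in place of the dictionary.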
I'm not sure I have completely understood your problem. However, if all your labels have the structure WHM... and HISP..., then you can simply check the first character:
eeoc_nac2_All_unpivot_df['Race'] = [
    "White" if code.startswith('W') else "Hispanic"
    for code in eeoc_nac2_All_unpivot_df['EEOC_Code']
]
Note: this only works if eeoc_nac2_All_unpivot_df['EEOC_Code'] is an iterable of strings.

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses and a string that contains one outside of them. The problem is that the parentheses may be nested:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is that it does not match example 2, because the digit there is followed by a ( rather than a ).
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
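Short of a full parsing library, a hand-written scanner that tracks nesting depth may already be enough for the examples in the question. This is only a sketch: the function name is made up, and treating unbalanced parentheses as a non-match is an assumption, not something the question specifies.

```python
def digits_only_inside_parens(text):
    """Return True if text has at least one digit and every digit is nested
    inside parentheses. Unbalanced input is rejected (an assumption)."""
    depth = 0
    found_digit = False
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis without an opener
        elif ch.isdigit():
            if depth == 0:
                return False  # digit outside any parentheses
            found_digit = True
    return found_digit and depth == 0


for sample in [
    "hey(example1)",
    "also(this(onetoo2(hard)))",
    "but(here(is(a(harder)one)maybe23)Hehe)",
    "this(one)is22misleading",
    "how(to(go)on)with(multiple)3parent(heses(around))",
]:
    print(sample, digits_only_inside_parens(sample))
```

The first three samples come out True and the last two False, which is the split the question asks for.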
These types of regexes are not always easy, but sometimes it's possible to come up with one, provided the input remains somewhat consistent. A pattern like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder though, is what you want here a split() or a search()? It looks (from the sample) like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/, where XXX is one or more digits, to the end of the string. If that's the case, then a match/search might be more suitable: with a split, if the regex can match multiple times, you'll split into multiple groups. For example:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'watchId': 'watch/589851/', 'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URLs into constituent parts, the search() with named groups approach works extremely well, as it allows you to name the various parts in the regex itself and, via groupdict(), get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Crop the path to UP TO length 2
new_path = str("/".join(parts.path.split("/")[:3]))
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
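Applied to the URL from the question with re.match, this pattern might be used roughly as follows. The variable names are illustrative only. Because .* is greedy, the last /watch/ followed by digits wins if the URL happens to contain more than one.

```python
import re

url = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"

# "/" needs no escaping in Python regexes, so the backslashes are dropped.
match = re.match(r".*/watch/\d+", url)
prefix = match.group(0) if match else None
print(prefix)  # http://www.hulu.jp/watch/589851
```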

Referencing a RegEx Variable

I'm using python to loop through a large list of self reported locations to try to match them to their home states. The RegEx expression I'm using is:
/^"[^\s]+,\s*([a-zA-Z]{2})"$/
Basically, I'm trying to find a pattern that looks like XXXCITYXXX, [Statecode], where statecode is only two letters.
My issue is that I don't know how to reference the varying state code once I find a matching string. I know in Perl that I could use:
$state = uc($1)
However, I don't know the equivalent Python syntax. Anyone know?
You can do it with re.search, which returns a match object (if the regex matches at all) with a groups() method that returns the captured groups:
import re
match = re.search(r'^[^\s]+,\s*([a-zA-Z]{2})$', my_string)
if match:
    print(match.groups()[0])
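To mirror Perl's uc($1) more closely, the captured group can be upper-cased with str.upper(). The sketch below wraps that in a small helper; STATE_RE, state_code and the sample locations are invented for illustration.

```python
import re

# Same pattern as above: non-space city text, a comma, then two letters.
STATE_RE = re.compile(r'^[^\s]+,\s*([a-zA-Z]{2})$')


def state_code(location):
    """Return the upper-cased two-letter code, or None if nothing matches."""
    match = STATE_RE.search(location)
    return match.group(1).upper() if match else None


print(state_code("Austin, tx"))  # TX
print(state_code("somewhere"))   # None
```

Since [^\s]+ cannot match whitespace, a city name containing a space would not match this pattern as written.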

Regular Expressions Dependent on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or have I just asked yet another question not meant for RE?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
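A minimal sketch of that match-then-postprocess idea, assuming the payload may contain any characters (so only the length prefix is matched by the regex and the rest is sliced by hand). The function name is hypothetical.

```python
import re


def read_length_prefixed(text):
    """Return the payload of a 'LenOfStr:Str' string, or None if the prefix
    is missing or fewer characters follow than the prefix promises."""
    match = re.match(r'(\d+):', text)
    if match is None:
        return None
    length = int(match.group(1))
    start = match.end()
    payload = text[start:start + length]
    # The length check happens after matching, outside the regex.
    if len(payload) < length:
        return None
    return payload


print(read_length_prefixed("5:5:str"))  # 5:str
print(read_length_prefixed("9:abc"))    # None
```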
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
