Wildcard in python dictionary

I am trying to create a Python dictionary to reference 'WHM1', 2, 3, 'HISPM1', 2, 3, etc. and other iterations, to create a new column with a specific string, e.g. White or Hispanic. Using regex seems like the right path, but I am missing something here and refuse to hard-code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict :
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.

Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages named regexdict, so you should probably point out which one you use. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case in regex), because this sort of detail may well differ between implementations.
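If you would rather not depend on a regexdict package at all, a plain-re sketch along these lines should also work; the lookup helper is my own, not part of any package, and the column names follow the question:
import re

patterns = {'^W': 'White', '^H': 'Hispanic'}

def lookup(code):
    # Return the label of the first pattern that matches the EEOC code.
    for pattern, label in patterns.items():
        if re.search(pattern, code):
            return label
    return None  # no pattern matched

eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(lookup)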

I'm not sure I have completely understood your problem. However, if all your labels have the structure WHM... and HISP..., then you can simply check the first character:
for i, race in eeoc_nac2_All_unpivot_df['EEOC_Code'].items():
    if race.startswith('W'):
        eeoc_nac2_All_unpivot_df.loc[i, 'Race'] = "White"
    else:
        eeoc_nac2_All_unpivot_df.loc[i, 'Race'] = "Hispanic"
Note: this only works if the values in eeoc_nac2_All_unpivot_df['EEOC_Code'] are strings.

Related

Edit regex strings in Python using format method

I want to develop a regex in Python where a component of the pattern is defined in a separate variable and combined to a single string on-the-fly using Python's .format() string method. A simplified example will help to clarify. I have a series of strings where the space between words may be represented by a space, an underscore, a hyphen etc. As an example:
new referral
new-referal
new - referal
new_referral
I can define a regex string to match these possibilities as:
space_sep = r'[\s\-_]+'
(The hyphen is escaped to ensure it is not interpreted as defining a character range.)
I can now build a bigger regex to match the strings above using:
myRegexStr = "new{spc}referral".format(spc = space_sep)
The advantage of this method for me is that I need to define lots of reasonably complex regexes where there may be several different commonly-occurring strings that occur multiple times and in an unpredictable order; defining commonly-used patterns beforehand makes the regexes easier to read and allows the strings to be edited very easily.
However, a problem occurs if I want to define the number of occurrences of other characters using the {m,n} or {n} structure. For example, to allow for a common typo in the spelling of 'referral', I need to allow either 1 or 2 occurrences of the letter 'r'. I can edit myRegexStr to the following:
myRegexStr = "new{spc}refer{1,2}al".format(spc = space_sep)
However, now all sorts of things break due to confusion over the use of curly braces (either a KeyError in the case of {1,2} or an IndexError: tuple index out of range in the case of {n}).
Is there a way to use the .format() string method to build longer regexes whilst still being able to define number of occurrences of characters using {n,m}?
You can double the { and } to escape them or you can use the old-style string formatting (% operator):
my_regex = "new{spc}refer{{1,2}}al".format(spc="hello")
my_regex_old_style = "new%(spc)srefer{1,2}al" % {"spc": "hello"}
print(my_regex) # newhellorefer{1,2}al
print(my_regex_old_style) # newhellorefer{1,2}al
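Putting it together with the space_sep pattern from the question, a quick check (expected output in the comment) would look like:
import re

space_sep = r'[\s\-_]+'
my_regex = "new{spc}refer{{1,2}}al".format(spc=space_sep)

for s in ("new referral", "new-referal", "new - referal", "new_referral"):
    print(s, bool(re.match(my_regex, s)))  # prints True for all four strings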

Python re module groups match mechanism

Question Formation
background
As I am reading through the tutorial in the Python 2.7 re documentation, it introduces the behavior of groups:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understand how this works on its own, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Thus, the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S.
I want to learn about the regex language interpreter; where should I start? Books, or a particular regex implementation? Thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening is that, as you can see in the visual representation of the automaton, your regexp captures a single character into the group, one or more times, until it reaches the end of the match. Then that last character is what ends up in the group.
If you want to get the output you say, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read up on the theory of NFAs and regexps. The MIT lecture notes on the topic are pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason why it isn't returning the results you'd expect is because you're using the repeat operator + outside of the group. This ends up causing the group to equal only the last match, and thus only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if a quantifier repeats it. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
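A short demonstration of the difference (expected output shown in the comments):
import re

# Quantifier outside the group: the group is re-captured on every
# repetition, so only the last character survives.
print(re.match(r"([abc])+", "abc").groups())   # ('c',)

# Quantifier inside the group: the whole repeated run is captured at once.
print(re.match(r"([abc]+)", "abc").groups())   # ('abc',)

# To collect every repetition separately, findall is the usual tool.
print(re.findall(r"[abc]", "abc"))             # ['a', 'b', 'c']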

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types, to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be more clear, I want to add single quotes after every occurrence of "MM(" and before the ")" which comes after it, and after every occurrence of "func(" and before the "," which comes after it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answer to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
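For what it's worth, running the two substitutions on the sample input from the question (with the second applied to the result of the first) reproduces the requested output:
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"

input_user2 = re.sub(r'MM\(([^)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)

print(output)
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))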

Regular Expressions Dependent on Previous Matches

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or have I just asked yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
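A minimal sketch of that two-step idea, assuming the part after the first colon may contain any characters:
import re

s = "5:5:str"
m = re.match(r"(\d+):(.*)", s)
# Post-process: keep the match only if the declared length agrees.
if m and len(m.group(2)) == int(m.group(1)):
    print(m.group(2))  # 5:str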
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

Conditional Regular Expressions

I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything which matches includeRegEx but does not match excludeRegEx. Attention: excludeRegEx is still in the negative form (as you can see in the example above), so it would be better to say: if something matches includeRegEx, I check whether it also matches excludeRegEx, and if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.
Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
if re.match(r"And(?!rea$).*", subject):
    # Successful match
    # Note that re.match always anchors the match
    # to the start of the string.
    ...
else:
    # Match attempt failed
    ...
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex
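A quick sketch of that general recipe in Python (the test names are only illustrative):
import re

include_regex = r"And.*"
exclude_regex = r"Andrea"

combined = r"(?!{excl}$){incl}".format(excl=exclude_regex, incl=include_regex)

for name in ("Andrew", "Andrea", "Anders"):
    print(name, bool(re.match(combined, name)))
# Andrew True, Andrea False, Anders True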
