String after Escaped Characters in Regex - python

I'm using regular expressions in Python. I'm trying to pull out all the data between two markers: it starts with {"justin_h and ends with "}, special characters included. However, I'm having trouble with the regex syntax.
I've been using:
[{]["][justin_h...["][}]
And it returns no results. I know for a fact it's in there, and [{]["] on its own returns results, but once I add the start of the string it stops working. Where am I going wrong?

Use capturing groups or lookarounds.
r'\{"justin_h(.*?)"}'
Grab the string you want from group index 1. This won't work if the part you want to grab contains a newline character. In that case, you need to use the (?s) DOTALL flag.
r'(?s)\{"justin_h(.*?)"}'
Example:
>>> import re
>>> re.findall(r'\{"justin_h(.*?)"}', 'foo{"justin_hfoobar"}barfoo')
['foobar']
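To see why the DOTALL flag matters, here is a minimal sketch. The sample strings are made up for illustration, and the lookaround variant is one possible way to write the "lookarounds" option mentioned above, not necessarily what the answerer had in mind.

```python
import re

# Hypothetical sample text: the wanted payload spans two lines.
text = 'foo{"justin_hline one\nline two"}bar'

# Without DOTALL, "." stops at the newline, so nothing is found.
without_dotall = re.findall(r'\{"justin_h(.*?)"}', text)

# With (?s), "." also matches newline characters.
with_dotall = re.findall(r'(?s)\{"justin_h(.*?)"}', text)

# Lookaround variant: the markers are asserted but not consumed,
# so the whole match is already the wanted text (no group needed).
lookaround = re.findall(r'(?<=\{"justin_h).*?(?="\})', 'foo{"justin_hfoobar"}barfoo')

print(without_dotall)  # []
print(with_dotall)     # ['line one\nline two']
print(lookaround)      # ['foobar']
```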

Related

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">".
Does this syntax say what I want it to say: this character followed by anything?
regex_firstline = re.compile("[>]{1}.*")
The Pythonic way to handle such a task is the str.startswith() method; you don't need a regex.
As for your regex "[>]{1}.*": you don't need {1} after your character class, and you can mark the start of the string with the anchor ^. So it can be "^>.*".
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the site also notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove those to simplify it even further: ^>.*
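A small sketch comparing the two suggestions. The sample strings are invented for illustration; note that str.startswith() and the anchored pattern only help when the > is the very first character, while an unanchored search finds it anywhere.

```python
import re

line = ">sequence_1 sample header"   # hypothetical input that begins with ">"
other = "text with > inside"         # hypothetical input with ">" in the middle

# Plain string method: no regex needed when only the first character matters.
print(line.startswith(">"))    # True
print(other.startswith(">"))   # False

# Anchored pattern: equivalent check that also returns the matched text.
anchored = re.compile(r"^>.*")
print(anchored.match(line).group())   # >sequence_1 sample header
print(anchored.match(other))          # None

# Unanchored search: finds ">" followed by anything, wherever it occurs.
print(re.search(r">.*", other).group())   # > inside
```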

Negative lookahead - exclude entire match if words are found?

I am trying to parse text journals, and I am only interested in specific sections of text.
I thought that I was doing fine until I noticed I was inadvertently identifying sections.
Suppose that I want to match the following section.
Section 7 - Delivering Terminal Diagnosis's
which may also show up as
Section 7. Delivering a Terminal Diagnosis
But I don't want to match anything if the words see or under precede my string like below.
see Section 7. Delivering a Terminal Diagnosis
or
filed under Section 7. Delivering a Terminal Diagnosis
should not match anything.
I tried using a negative look-ahead, but it only excludes the words, it doesn't throw out the entire match.
((?!see )Section[\s\\n]+7[\s+]+?[-:\\n\.]+?[\s+]+?(Delivering|Deliver)(.*terminal[\s+]+Diagnosis('s)?)?[\.]?)
I don't think that I am grasping the look-around concept properly. help?
Negative look-ahead does what it says: specifies a group that cannot match after your main expression. But you don't have anything before it.
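Use negative lookbehinds:
(?<!see )(?<!under )
in lieu of (?!see ). (Python's re module only accepts fixed-width lookbehinds, so the alternation (?<!see|under) fails to compile because see and under differ in length; two separate lookbehinds avoid that.)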
Other comments: you have a case error (terminal should be Terminal) and if you make your entire string "raw" by prepending it with an r like r'my string' you don't need to double-escape characters like \n.
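The difference between the two assertions can be sketched as follows. The pattern is cut down to the word Section to keep the example short, and the two exclusions are written as separate lookbehinds because Python's re module wants each lookbehind to have a fixed width.

```python
import re

text = "see Section 7. Delivering a Terminal Diagnosis"

# The lookahead inspects the text starting AT the match position ("Section..."),
# so it can never see the word that comes before it and always passes.
lookahead_hit = re.search(r"(?!see )Section", text)

# The lookbehind inspects the text just BEFORE the match position.
lookbehind_hit = re.search(r"(?<!see )Section", text)

# Two fixed-width lookbehinds cover both unwanted words.
both = re.compile(r"(?<!see )(?<!under )Section")

print(lookahead_hit is not None)                        # True
print(lookbehind_hit)                                   # None
print(both.search("filed under Section 7"))             # None
print(both.search("Section 7 - Delivering").group())    # Section
```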
Try the following.
Whatever case you are matching, I would put r in front of your regular expression. r is Python’s raw string notation, which avoids double-escaping backslashes in patterns. To avoid worrying about uppercase versus lowercase, use re.I for case-insensitive matching.
Here's a possible solution using two negative lookbehinds.
(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)
Here is what I meant by using raw string notation and re.I:
import re
# s is the text to search
matches = re.findall(r"(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)", s, re.I)
print(matches)
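A quick check of that pattern against the cases from the question. The sample lines are adapted from the question; the leading words "Read" and "Intro:" are made up, because the pattern requires whitespace before "section" and so would not match a heading at the very start of the text.

```python
import re

pattern = re.compile(
    r"(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)",
    re.I,
)

samples = [
    "Read Section 7. Delivering a Terminal Diagnosis",
    "Intro: Section 7 - Delivering Terminal Diagnosis's",
    "see Section 7. Delivering a Terminal Diagnosis",
    "filed under Section 7. Delivering a Terminal Diagnosis",
]

# findall returns the contents of the single capturing group for each match.
results = [pattern.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), "->", found)
```

The first two samples each yield one match, and the last two yield empty lists.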

How could I get regex to start when it has reached a specific point within a string?

Say I have a string like {{ComputersRule}} and a regex like: [^\}]+. How would I get regular expressions to start at a specified point in the string, i.e. once it has reached the third character in the string? If it's relevant, and I doubt it is, I'm working in Python version 2.7.3. Thank you.
I'd recommend using Python to grab the substring from the third character onwards, and then apply the regex to the rest.
Otherwise, you could just use the regex . (any character except newline) to gobble up the first n characters:
^.{3}([^\}]+)
Notice the ^.{3} which forces the [^\}]+ to not include the first three characters of the string (the ^ anchors to the start of the string/line). The brackets capture the bit you want to extract (so get capturing group 1).
In your particular case, if it's just a case of "I want the text inside the {{ and }}" you could do \{\{([^\}]+)\}\} or [^\{\}]+.
It appears that what you want to do is to match text within the double braces.
The trick is to specify the braces in the regex but capture the part within. In this case try
\{\{([^}]+)\}\}
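The options above can be sketched side by side. One addition that the answers do not mention: the match() and search() methods of a compiled pattern accept an optional pos argument, which starts matching at a given index without slicing the string.

```python
import re

text = "{{ComputersRule}}"
pattern = re.compile(r"[^}]+")

# Option 1: slice off the first two characters, then apply the regex.
sliced = pattern.match(text[2:]).group()

# Option 2: pass a start index (pos) to the compiled pattern's match().
# Index 2 is the third character of the string.
from_pos = pattern.match(text, 2).group()

# Option 3: spell out the braces and capture only what lies between them.
between = re.search(r"\{\{([^}]+)\}\}", text).group(1)

print(sliced)     # ComputersRule
print(from_pos)   # ComputersRule
print(between)    # ComputersRule
```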

beginning and ending sign in regular expression in python

'[A-Za-z0-9-_]*'
'^[A-Za-z0-9-_]*$'
I want to check if a string only contains the characters in the above expression; I just want to make sure no weird signs like #%&/() are in the strings.
I am wondering if there's any difference between these two regular expressions? Do the beginning and ending signs matter? Will they affect the result somehow?
With re.match, Python regular expressions are anchored at the beginning of the string (re.search, by contrast, scans the whole string): hence with re.match the ^ sign at the beginning doesn’t make any difference. However, the $ sign does very much make one: if you don’t include it, you’re only going to match the beginning of your string, and the end could contain anything, including the characters you want to exclude. Just try re.match("[a-z0-9]", "abcdef/%&").
In addition to that, you may want to use a regular expression that simply excludes the characters you’re testing for; it’s much safer (hence [^#%&/()]; parentheses need no escaping inside a character class).
The beginning and end sign match the beginning and end of a String.
The first will match any String that contains zero or more occurrences of the class [A-Za-z0-9-_] (basically any string whatsoever...).
The second will match an empty String, but not one that contains characters not defined in [A-Za-z0-9-_]
Yes it will. A regex can match anywhere in its input. A string containing # will still match your first regex.
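A short sketch of the difference, using made-up test strings. The hyphen is moved to the end of the character class here so that it is unambiguously a literal; re.fullmatch is an alternative to writing the anchors by hand and is not something the answers above mention.

```python
import re

allowed = r"[A-Za-z0-9_-]*"
bad = "abc#%&"          # hypothetical input with unwanted characters
good = "abc_DEF-123"    # hypothetical input with only allowed characters

# Unanchored at the end: re.match is satisfied by the "abc" prefix.
prefix_only = bool(re.match(allowed, bad))

# Anchored at both ends: the whole string must consist of allowed characters.
anchored_bad = bool(re.match("^" + allowed + "$", bad))
anchored_good = bool(re.match("^" + allowed + "$", good))

# re.search on the unanchored pattern succeeds even for "#",
# because * is allowed to match the empty string.
empty_hit = bool(re.search(allowed, "#"))

# "$" also matches just before a trailing newline; re.fullmatch does not.
dollar_newline = bool(re.match("^" + allowed + "$", "abc\n"))
full_newline = bool(re.fullmatch(allowed, "abc\n"))

print(prefix_only, anchored_bad, anchored_good)   # True False True
print(empty_hit, dollar_newline, full_newline)    # True True False
```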

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> import re
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam." We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(."
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the \'s as escape characters.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
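Here is a minimal sketch of both approaches on the sample URL. The capturing pattern is one possible way to write it, chosen for illustration.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lookbehind: "(K(" must precede the match but is not part of it.
lookbehind = re.search(r"(?<=\(K\()[^)]*", url)

# Capturing group: match the surrounding parentheses too,
# then pull out only the parenthesised sub-pattern.
captured = re.search(r"\(K\(([^)]*)\)\)", url)

print(lookbehind.group())   # ThinkCode
print(captured.group(0))    # (K(ThinkCode))
print(captured.groups())    # ('ThinkCode',)
```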
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parentheses is quite a bit of a pain in regex. If that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until you see a close paren, then make sure there is one more close paren somewhere after it.
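Applied to the sample URL from the question, that pattern behaves as in the sketch below. It relies on the format staying the same, as noted above; it does not handle arbitrary nesting.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# outer "(" ... inner "(" captured text ")" ... outer ")"
nested = re.compile(r"\(.*?\((.*?)\).*?\)")
m = nested.search(url)

print(m.group(0))   # (K(ThinkCode))
print(m.group(1))   # ThinkCode
```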
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))
