How to properly process regex match - python

I have strings that come from outside to my application and may look like these with quotes:
Prefix content. "Some content goes here". More contents without quotes.
Prefix content. "Another "Additional" goes here". More contents without quotes.
Prefix content. "Just another "content". More contents without quotes.
The key note is that the strings come with quotes and I need to process these quotes properly. Actually I need to catch all the content inside quotes. I tried patterns like .*(".*").* and .*(".+").* but they seem to catch only content between two closest quotes.

Looks like you just want everything from the first quote to the last quote, even if there are other quotes between. This should suffice:
".*"
The leading and trailing .* in your regex were never needed, and the leading one was distorting your results. It would initially consume the whole input, then back off just far enough to let the rest of the regex match, meaning the (".*") would only ever match the last two quotes.
You don't need the parentheses, either. The part of the string you're after is now the entire match, so you can retrieve it with group(0) instead of group(1). If there may be newlines in the string and you want to match those, too, you can change it to:
(?s)".*"
The . metacharacter normally doesn't match newlines, but (?s) turns on DOTALL mode for the rest of the regex.
EDIT: I forgot to mention that you should be using the search() method in this case, not match(). match() only works if the match is found at the very beginning of the input, as if you had added the start anchor (e.g., ^".*"). search() performs a more traditional regex match, where the match can appear anywhere in the input.
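As a minimal sketch of the difference, here are the three sample strings from the question run through both methods. The variable names are illustrative only, and print() is used so that it runs on Python 3:

```python
import re

samples = [
    'Prefix content. "Some content goes here". More contents without quotes.',
    'Prefix content. "Another "Additional" goes here". More contents without quotes.',
    'Prefix content. "Just another "content". More contents without quotes.',
]

# Greedy ".*" spans from the first quote to the last quote in the string.
quoted = re.compile(r'".*"')

# search() scans the whole string; group(0) is the entire match.
found = [quoted.search(s).group(0) for s in samples]
for text in found:
    print(text)

# match() is anchored at position 0, so it finds nothing here:
# every sample starts with 'Prefix', not with a quote.
anchored = [quoted.match(s) for s in samples]
print(anchored)
```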

I'm not certain what you're trying to extract, so I'm guessing. I'd suggest using partition and rpartition string methods.
Does this do what you want?
>>> samples = [
... 'Prefix content. "Some content goes here". More contents without quotes.',
... 'Prefix content. "Another "Additional" goes here". More contents without quotes.',
... 'Prefix content. "Just another "content". More contents without quotes.',
... ]
>>> def get_content(data):
...     return data.partition('"')[2].rpartition('"')[0]
...
>>> for sample in samples:
...     print get_content(sample)
...
Some content goes here
Another "Additional" goes here
Just another "content

EDIT: I now see the other answer and I might have misunderstood your question.
Try changing this
.*(".+").*
to
.*?(".+?")
The ? makes the match non-greedy, so it stops as soon as it finds the next matching character (i.e. a quote). I also removed the .* at the end, as it would match the rest of the string (regardless of the quotes). If you want to match empty quotes as well, just change the + to *. Use re.findall to extract all the content from within the quotes.
PS: I assumed your last line is wrong as it doesn't have matching quotes.
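A small sketch of how the non-greedy pattern behaves with re.findall, using the first two sample strings from the question (variable names are illustrative). Note that with nested quotes it pairs the quotes up in order, which may or may not be what is wanted:

```python
import re

pattern = r'.*?(".+?")'

simple = 'Prefix content. "Some content goes here". More contents without quotes.'
nested = 'Prefix content. "Another "Additional" goes here". More contents without quotes.'

# findall returns only the captured group for each match.
simple_hits = re.findall(pattern, simple)
nested_hits = re.findall(pattern, nested)

print(simple_hits)
print(nested_hits)
```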

I'm not quite sure if this is what you wanted to achieve.
The finditer method from the re module may be helpful here.
>>> import re
>>> s = '''Prefix content. "Some content goes here". More contents without quotes.
... Prefix content. "Another "Additional" goes here". More contents without quotes.
... Prefix content. "Just another "content". More contents without quotes.'''
>>> pattern = '".+?"'
>>> results = [m.group(0) for m in re.finditer(pattern, s)]
>>> print results
['"Some content goes here"', '"Another "', '" goes here"', '"Just another "']

Related

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a django query to run regexes based on the query result values. My regex seems like it should be working, and works if I copy the values returned by the query, but for some reason it isn't matching when all the parts are put together.
My code is:
with open("/path_to_my_file") as myfile:
    data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
    regexString = "^\s*"+item.feature_key+":"
    print regexString #to verify its what I want it to be, ie debug
    pq = re.compile(regexString, re.M)
    if pq.match(data):
        #do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking there's some esoteric python/django thing going on (or maybe not so esoteric, as python isn't my first language).
And for examples sake, the output of print regexString is :
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, I dumped the types of both item.feature_key and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working on this for a couple of hours, so any help is appreciated. Cheers!

According to the documentation, re.match only ever matches at the beginning of the string, never at the beginning of each line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s*"): in this case, it is equivalent to "^\s*" because \s is not a recognized string escape sequence (unlike \r or \n), so Python leaves the backslash in place. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too).
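A minimal sketch of the match()/search() difference under re.M. The data string is a shortened stand-in for the file contents in the question, feature_key stands in for item.feature_key, and the re.escape call is an extra precaution that is not part of the answer above:

```python
import re

# Shortened stand-in for the file contents shown in the question.
data = """productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
}"""

# Stand-in for item.feature_key; re.escape protects against regex
# metacharacters in the key (an assumption, not from the original code).
feature_key = "allOff"
pq = re.compile(r"^\s*" + re.escape(feature_key) + ":", re.M)

# match() only tries position 0 of the string, even with re.M.
print(pq.match(data))

# search() lets ^ match at the start of any line under re.M.
print(pq.search(data).group(0))
```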

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".

You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that {: prefix. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of JSON objects, each prefixed with {:, the right way to parse it is:
import json
d = json.loads(text[2:])  # skip the leading {: before parsing
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, double-check that the name you print to is the file object you opened for 'needed.txt'. If your real code opens 'needed.txt' under one name but prints to another, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
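To illustrate the first side note about the unescaped dot, here is a small sketch. The text value is made up for the demonstration (it deliberately contains "tr" before the real ".tr" suffix) and is not the data from the question:

```python
import re

# Made-up value that happens to contain "tr" before the real ".tr" suffix.
text = '{"JOL":"abc-tr-def.tr","LAPTOP":"error"}'

# Unescaped dot: ".tr" means "any character, then tr", so the lazy group
# stops at the first place that fits, which is just before "-tr".
loose = re.search('JOL":"(.+?).tr', text).group(1)

# Escaped dot in a raw string: only a literal ".tr" ends the group.
strict = re.search(r'JOL":"(.+?)\.tr', text).group(1)

print(loose)
print(strict)
```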

If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Removing TAGS in a document

I need to find all the tags in a .txt file (an SEC filing) and remove them from the filing.
Well, as a beginner of Python, I used the following code to find the tags, but it returns None, None, ... and I don't know how to remove all the tags. My question is how to find all the tags <....> and remove all the tags so that the document contains everything but tags.
import re
tags = [re.search(r'<.+>', line) for line in mylist]
#mylist is the filename opened by open(filename, 'rU').readlines()
Thanks for your time.

Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
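A minimal sketch of both points, assuming a few made-up lines in place of the real filing: re.search returns None for any line without a tag, while re.sub with [^>] removes each tag individually.

```python
import re

# Made-up lines standing in for the contents of the filing.
lines = [
    '<TYPE>10-K',
    'Plain line without tags',
    'Revenue was <B>up</B> this year',
]

# The original approach: search() yields None where a line has no tag.
matches = [re.search(r'<.+>', line) for line in lines]

# Substituting every <...> run with an empty string strips the tags.
cleaned = [re.sub(r'<[^>]+>', '', line) for line in lines]

print(matches[1])
print(cleaned)
```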

re.sub(r'<.*?>', '', line)
Use re.sub and <.*?> expression

Well, for starters, you're going to need a different regex. The one you have will select everything between the first '<' and the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
<b>BOLD</b>
The way to fix this is to use a lazy quantifier, so you should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easy. Let's start with what you know:
.+?
matches a string of arbitrary length. The ? means it will match the shortest string possible. (The laziness we added before)
(?<=...)
is a lookbehind. It literally looks behind itself without capturing the expression.
(?=...)
is a lookahead. It works the same way as a lookbehind, but looks ahead of the current position instead. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
Now, you can iterate over the list and trim any unnecessary spaces that got left behind and make for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:</?.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overcareful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub matches in your result). It's not really necessary in this situation for your purposes, but groups are always useful things to think about, and it's good practice to only capture the ones you need. Tacking a + onto the end of that (as I did) will match as many tags as are right next to each other, collapsing them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized  !
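A short sketch comparing the approaches above on the same example line. It takes the tag part of the whitespace-collapsing pattern to be </?.+?> (an assumption about the intended pattern), and the variable names are illustrative:

```python
import re

line = 'This is <b> <i> overemphasized </b> </i>!'

# Removing each tag on its own leaves the surrounding spaces behind.
plain = re.sub(r'<.+?>', '', line)

# Consuming runs of tags plus adjacent whitespace, then inserting one space.
tidy = re.sub(r'\s*(?:</?.+?>\s*)+', ' ', line)

# Lookbehind/lookahead: text that sits between a '>' and the next '<'.
between = re.findall(r'(?<=>).+?(?=<)', 'I can type in <b>BOLD</b>')

print(repr(plain))
print(repr(tidy))
print(between)
```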

strange behavior of parenthesis in python regex

I'm writing a python regex that looks through a text document for quoted strings (quotes of airline pilots recorded from blackboxes). I started by trying to write a regex with the following rules:
Return what is between quotes.
if it opens with single, only return if it closes with single.
if it opens with double, only return if it closes with double.
For instance I don't want to match "hi there', or 'hi there", but "hi there" and 'hi there'.
I use a testing page which contains things like:
CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
So I decided to start simple:
re.findall('("|\').*?\\1', page)
########## /("|').*?\1/ <-- raw regex I think I'm going for.
This regex acts very unexpectedly.
I thought it would:
( " | ' ) Match EITHER single OR double quotes, save as back reference \1.
.*? Match non-greedy wildcard.
\1 Match whatever it finds in back reference \1 (step one).
Instead, it returns an array of quotes but never anything else.
['"', '"', "'", "'"]
I'm really confused because the equivalent (afaik) regex works just fine in VIM.
\("\|'\).\{-}\1/)
My question is this:
Why does it return only what is inside parenthesis as the match? Is this a flaw in my understanding of back references? If so then why does it work in VIM?
And how do I write the regex I'm looking for in python?
Thank you for your help!

You aren't capturing anything except for the quotes, which is what Python is returning.
If you add another group, things work much better:
for quote, match in re.findall(r'("|\')(.*?)\1', page):
    print match
I prefixed your string literal with an r to make it a raw string, which is useful when you need to use a ton of backslashes (\\1 becomes \1).

You need to catch everything with an extra pair of parentheses.
re.findall('(("|\').*?\\2)', page)

Read the documentation. re.findall returns the groups, if there are any. If you want the entire match you must group it all, or use re.finditer. See this question.
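A minimal sketch of the three behaviours discussed in these answers, using a made-up two-line page rather than the real transcript:

```python
import re

# Made-up stand-in for the test page.
page = 'CA "Roger that"\nST \'May I?\'\n'

one_group = r'("|\').*?\1'
two_groups = r'("|\')(.*?)\1'

# With one group, findall returns only that group: the quote characters.
only_quotes = re.findall(one_group, page)

# finditer yields match objects, so group(0) gives the entire match.
whole = [m.group(0) for m in re.finditer(one_group, page)]

# With two groups, findall returns (quote, content) tuples.
pairs = re.findall(two_groups, page)

print(only_quotes)
print(whole)
print(pairs)
```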

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.

>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam." We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the backslashes as string escapes, so they reach the regex engine untouched.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
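A minimal sketch of the capturing-group approach on the sample URL. The pattern here is one possible way to write it, not necessarily the one used in the other answer:

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Match the literal "(K(", then capture everything up to the next ")".
mo = re.search(r"\(K\(([^)]*)\)", url)

if mo:
    print(mo.group(1))   # the first (and only) captured group
    print(mo.groups())   # all captured groups as a tuple
```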

It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.

If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parentheses is quite a pain in regex. If that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure there is one more close paren somewhere after it.

mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)
