I am trying to write code that can extract the values of variables from a text file.
So if the file contained
"bob= 1255 mike = 13"
then when I specified bob as var_name it would extract 1255, and so on.
I based my code on this, but it doesn't seem to be working:
var_name = 'bob'
regexp = re.compile(r''+var_name+'.*?([0-9.-]+)')
with open("textfile") as s:
    for line in s:
        match = regexp.match(line)
        if match:
            print(match.group(1))
var_name = 'mike'
regexp = re.compile(r''+var_name+'.*?([0-9.-]+)')
with open("textfile") as s:
    for line in s:
        match = regexp.match(line)
        if match:
            print(match.group(1))
You are using re.match, which only finds things at the start of the string (and mike is not at the start of the string). Use re.search, which finds things at any position.
Slightly off-topic: Note that r'...' does not mean "regexp literal". It means "raw string literal". The purpose of it is to avoid having to escape backslashes inside the string. Now, '' very obviously does not contain any backslashes, so r'' is not at all different from ''. On the other hand, .*?([0-9.-]+) is complex enough that we are not sure whether or not there are (or will be) any backslashes in it - and yet you don't make it into a raw string literal. Puzzling. :) I would have written var_name + r'.*?([0-9.-]+)', without the useless r'' +...
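A minimal sketch of both points above, using the sample line from the question. The helper name first_number_after is made up for illustration, and the re.escape call is an extra precaution that is not in the original code:

```python
import re

def first_number_after(var_name, line):
    # Hypothetical helper. The raw-string prefix sits on the part that may
    # contain backslashes; re.escape protects any special characters in
    # the variable name (an added precaution, not in the question's code).
    regexp = re.compile(re.escape(var_name) + r'.*?([0-9.-]+)')
    match = regexp.search(line)  # search scans the whole line, unlike match
    return match.group(1) if match else None

line = "bob= 1255 mike = 13"
print(first_number_after('bob', line))   # 1255
print(first_number_after('mike', line))  # 13
```

With regexp.match instead of regexp.search, the 'mike' lookup finds nothing, because mike is not at the start of the line.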
You did not mention what does work / what does not work.
Instead of .*? you should use \s*=\s*. Otherwise you can catch things like #edsakjj*kjn - and I assume you do not want this.
You may also make sure that the number is really a number: -?\d+(\.\d+)?: optional - (minus, for negative numbers), mandatory digit(s), optionally: decimal mark followed by digit(s).
Regarding python code, I am not your guy, sorry :(
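For what it is worth, here is one way the suggestion might look in Python. This is only a sketch: extract_value is a hypothetical helper name, the decimal part is written as (?:\.\d+)? so that a decimal mark must be followed by digits, and the leading \b (so that "ike" does not match inside "mike") is an added assumption:

```python
import re

def extract_value(text, var_name):
    # \s*=\s* requires an actual "=" between the name and the number,
    # so unrelated text after the name is not picked up.
    pattern = re.compile(
        r"\b" + re.escape(var_name) + r"\s*=\s*(-?\d+(?:\.\d+)?)"
    )
    match = pattern.search(text)
    return match.group(1) if match else None

line = "bob= 1255 mike = 13"
print(extract_value(line, "bob"))   # 1255
print(extract_value(line, "mike"))  # 13
```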
Using the python re.sub, is there a way I can extract the first alphanumeric characters and disregard the rest from a string that starts with a special character and might have special characters in the middle of the string? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words it replaces the whole string with just what was matched by the part of the regexp inside the brackets. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* will be greedy (match as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
    print(matchobj.group(1))
else:
    print("did not match")
But the question called for the use of re.sub.
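To make the behaviour of the re.sub version concrete, here is a small sketch run against the strings from the question. The function name first_alnum_run and the third sample string are made up for illustration:

```python
import re

PATTERN = r".*?([A-Za-z0-9]+).*"

def first_alnum_run(text):
    # Replace the whole string with the first alphanumeric run (group 1).
    return re.sub(PATTERN, r"\1", text)

print(first_alnum_run("#my,name"))  # my
print(first_alnum_run("#my"))       # my
print(first_alnum_run("#,!"))       # #,!  (no match, so returned unchanged)
```

The last call shows the caveat described above: with no alphanumeric characters there is nothing to substitute, and the input comes back as-is.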
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
... print (res.group())
...
my
This is not a complete answer. [A-Za-z]+ will give you ['my','name']
Use this to further explore: https://regex101.com/
I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on the parts of your pattern that contain special characters, which turns them into literals, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
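The effect of re.DOTALL described above can be checked directly. This is a small sketch on a made-up multi-line string (the contents of text are an assumption, not the file from the question), and it uses a capturing group in case only the text in between is wanted:

```python
import re

# Made-up sample text with the prefix on the first line and the suffix
# on the last one.
text = "header_abc:\nfirst line\nsecond line\nxyz\n"

# Without re.DOTALL, "." stops at newlines, so the match cannot span lines.
no_flag = re.search(r"_abc:(.*)xyz", text)
print(no_flag)  # None

# With re.DOTALL, "." also matches "\n"; group(1) holds the text in between.
between = re.search(r"_abc:(.*)xyz", text, re.DOTALL).group(1)
print(repr(between))  # '\nfirst line\nsecond line\n'
```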
I have a URL that is going to be either united-states/boulder-21781/tool-&-anchor/mulligan-21/ or, assuming the best strategy is to encode the &, united-states/boulder-21781/tool-%26-anchor/mulligan-21/.
I'm trying to write a url conf that will accept this, but the regex I'm using isn't working. I have:
url(r'^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': '(?i)([\.\-\_\w]+)'}, 'view_tip_page', name='tip_page'),
What do I add to capture the %? Or should I just include the &?
My first recommendation would be to not do it. As you yourself are demonstrating, not everybody knows that a & is a perfectly valid character in a URI before the first ?, and you are bound to get into trouble. It also looks ugly, is harder to type, and more jarring than, say, and, or even just n. Having said that, if you really want it in there, just put it in there in the character class.
Not related to your question, the way you're building that regex is weird; you're not capturing any of the bits of the path for use by the view. You're also including the (?i) global modifier four times, and specifying _ which is already part of \w. I dunno, I'd expect something like
r'(?i)(?P<country>[.\w-]+)/(?P<city>[.\w-]+)-(?P<cityno>\d+)/...etc...
but maybe I'm missing something.
Well currently there is no way for you to match % or & in your regex. Depending on whether it is encoded or not, you will need to add one or the other to the character class in your regex, and it should match.
I might change it to something like the following:
r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'}
And proof that it works:
>>> pattern = re.compile(r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'})
>>> s = 'united-states/boulder-21781/tool-%26-anchor/mulligan-21/'
>>> match = pattern.match(s)
>>> match.groups()
('united-states', 'boulder', '21781', 'tool-%26-anchor', 'mulligan', '21')
A few comments on your regex:
The (?i) isn't really doing anything, since you are using \w which will already match both upper and lowercase. If you do want to use (?i) I would move it out of the replacement string and into the format string ('(?i)...' % {'regex': '...'} instead of '...' % {'regex': '(?i)...'}), since otherwise it will show up multiple times.
Note that character class was changed from [\.\-\_\w] to [-.%\w], this is because underscores are included in \w, you don't need to escape the hyphen if it comes at the beginning of the character class, and you don't need to escape the . inside of character classes.
Also, \w does match digits so technically to match something like 'boulder-21781' you could just use %(regex)s instead of %(regex)s-(\d+), but I didn't want to change that in case it was intentionally adding some additional verification of the format.
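Pulling the suggestions from these answers together, a pattern with named groups that accepts both the raw & and the encoded %26 might look like the sketch below. It is tested with plain re rather than inside a Django URL conf, and the group names (country, city, city_id, business, tip, tip_id) and the helper parse_tip_url are invented for illustration:

```python
import re

# One path segment: hyphen, dot, percent sign, ampersand or word characters.
SEGMENT = r"[-.%&\w]+"

TIP_URL = re.compile(
    r"^(?P<country>{seg})/(?P<city>{seg})-(?P<city_id>\d+)/"
    r"(?P<business>{seg})/(?P<tip>{seg})-(?P<tip_id>\d+)/$".format(seg=SEGMENT)
)

def parse_tip_url(path):
    # Return the named parts as a dict, or None if the path does not match.
    match = TIP_URL.match(path)
    return match.groupdict() if match else None

print(parse_tip_url("united-states/boulder-21781/tool-%26-anchor/mulligan-21/"))
print(parse_tip_url("united-states/boulder-21781/tool-&-anchor/mulligan-21/"))
```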
I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to leave the backslashes alone, so they reach the regex engine intact.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
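As a rough side-by-side of the two approaches on the sample URL (the capturing pattern here is just one possible way to write it, not necessarily the one the other answer uses):

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lookbehind: the match itself is only the text after "(K(".
lookbehind = re.search(r"(?<=\(K\()[^)]*", url)
print(lookbehind.group())   # ThinkCode

# Capturing group: the match covers "(K(ThinkCode)", and the inner
# brackets remember just the part we want.
captured = re.search(r"\(K\(([^)]*)\)", url)
print(captured.group())     # (K(ThinkCode)
print(captured.groups())    # ('ThinkCode',)
```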
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them, such as \( and \\. Escaping / as \/ is harmless but not required in Python.
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure there is one more close paren somewhere after it.
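Applied to the sample URL from the question in Python, that pattern could be used roughly like this (the variable names are arbitrary):

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# group(1) is whatever sits between the inner pair of parentheses.
match = re.search(r"\(.*?\((.*?)\).*?\)", url)
inner = match.group(1) if match else None
print(inner)  # ThinkCode
```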
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))
I have a python template engine that heavily uses regexp. It uses concatenation like:
re.compile( regexp1 + "|" + regexp2 + "*|" + regexp3 + "+" )
I can modify the individual substrings (regexp1, regexp2 etc).
Is there any small and light expression that matches nothing, which I can use inside a template where I don't want any matches? Unfortunately, sometimes '+' or '*' is appended to the regexp atom so I can't use an empty string - that will raise a "nothing to repeat" error.
This shouldn't match anything:
re.compile('$^')
So if you replace regexp1, regexp2 and regexp3 with '$^' it will be impossible to find a match. Unless you are using the multi line mode.
After some tests I found a better solution
re.compile('a^')
It is impossible to match and will fail earlier than the previous solution. You can replace a with any other character and it will always be impossible to match
(?!) should always fail to match. It is the zero-width negative look-ahead. If what is in the parentheses matches then the whole match fails. Given that it has nothing in it, it will fail the match for anything (including nothing).
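A quick sketch to check this in Python. The last part is an assumption on my side rather than something from the answers: a quantifier such as * also allows zero repetitions, so a throwaway literal is placed after (?!) to give an appended * or + something to bind to while the lookahead still blocks the alternative.

```python
import re

NEVER = r"(?!)"  # empty negative lookahead: fails at every position

samples = ["", "abc", "line1\nline2\n"]
never_hits = [re.search(NEVER, s, re.MULTILINE) for s in samples]
print(never_hits)  # [None, None, None]

# By contrast, '$^' can match an empty string, where start and end coincide.
dollar_caret_hit = re.search(r"$^", "") is not None
print(dollar_caret_hit)  # True

# Assumed workaround: with a trailing literal, the appended quantifier
# applies to "a" only, and (?!) still fails first in every alternative.
atom = NEVER + "a"
combined = re.compile(atom + "|" + atom + "*|" + atom + "+")
print(combined.search("aaa"))  # None
```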
To match an empty string - even in multiline mode - you can use \A\Z, so:
re.compile(r'\A\Z|\A\Z*|\A\Z+')
The difference is that \A and \Z are start and end of string, whilst ^ and $ these can match start/end of lines, so $^|$^*|$^+ could potentially match a string containing newlines (if the flag is enabled).
And to fail to match anything (even an empty string), simply attempt to find content before the start of the string, e.g:
re.compile(r'.\A|.\A*|.\A+')
Since no characters can come before \A (by definition), this will always fail to match.
Maybe '.{0}'?
You could use
\z..
This is the absolute end of string, followed by two of anything
If + or * is tacked on the end, this still works, refusing to match anything
Or, use some list comprehension to remove the useless regexp entries and join to put them all together. Something like:
re.compile('|'.join([x for x in [regexp1, regexp2, ...] if x is not None]))
Be sure to add some comments next to that line of code though :-)
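A slightly fuller sketch of that idea, assuming unused slots are passed as None. The function name combine is hypothetical, and falling back to (?!) when nothing is left is my own addition:

```python
import re

def combine(*fragments):
    # Keep only the fragments that are actually in use.
    active = [f for f in fragments if f is not None]
    if not active:
        # Nothing to match at all: fall back to a pattern that always fails.
        return re.compile(r"(?!)")
    return re.compile("|".join(active))

pattern = combine("foo", None, "ba+r")
print(pattern.search("xx baar").group())      # baar
print(combine(None, None).search("anything")) # None
```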