How do you replace regex end-of-line with string? - python

I have a for loop that produces a variable current_out_dir. Sometimes the value has /. at the end (that is, it matches /.$), and I want to replace that trailing /. with /. Currently I have .replace('/.','/'), but this would also replace the start of hidden directories that begin with a dot, e.g. /home/.log/file.txt
I've looked into re.sub() but I can't figure out how to apply it.

An unescaped dot matches any character except a newline, so you need to escape the dot to match a literal dot.
re.sub(r'(?<=/)\.$', r'', string)
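A minimal sketch of how that substitution behaves on the two kinds of path from the question. The helper name strip_trailing_dot and the sample paths are made up for illustration.

```python
import re

def strip_trailing_dot(path):
    # Remove a final "." only when it directly follows a "/" at the end of the string.
    return re.sub(r'(?<=/)\.$', '', path)

print(strip_trailing_dot('/home/user/out/.'))     # /home/user/out/
print(strip_trailing_dot('/home/.log/file.txt'))  # /home/.log/file.txt (unchanged)
```

The lookbehind keeps the slash out of the match, so only the dot is deleted and hidden directories in the middle of the path are left alone.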

/\.(?=$)
Try this, replacing the match with /. It should work for you. It uses a positive lookahead to assert the end of the string.

The question was about using regex, but I've come up with a more pythonic solution to the problem.
import os
if os.path.split(current_out_dir)[1] == '.':
    current_out_dir = os.path.split(current_out_dir)[0]
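For comparison, here is the same idea wrapped in a small function. The name drop_trailing_dot is hypothetical. Note that, unlike the regex answers, this returns the directory without a trailing slash.

```python
import os

def drop_trailing_dot(path):
    # os.path.split returns (head, tail); tail is "." when the path ends in "/."
    head, tail = os.path.split(path)
    return head if tail == '.' else path

print(drop_trailing_dot('/home/user/out/.'))     # /home/user/out
print(drop_trailing_dot('/home/.log/file.txt'))  # /home/.log/file.txt
```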

Related

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
A more Pythonic way to handle such tasks is the str.startswith() method, which needs no regex.
As for your regex "[>]{1}.*": you don't need {1} after the character class, and you can anchor the pattern to the start of the string with ^. So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (regex101 notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
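A short sketch comparing the anchored regex with str.startswith(). The sample lines are invented for illustration; both approaches should select the same entries.

```python
import re

lines = [">seq1 description", "ACGT", "AC>GT"]

# Anchored regex: the line must begin with ">".
pattern = re.compile(r"^>.*")
regex_hits = [ln for ln in lines if pattern.match(ln)]

# Plain string method, no regex needed.
prefix_hits = [ln for ln in lines if ln.startswith(">")]

print(regex_hits)   # ['>seq1 description']
print(prefix_hits)  # ['>seq1 description']
```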

convert vim regex to python for re.sub

I have a working regex under vim: /^ \{-}\a.*$\n
I implement a global search and replace as :%s/^ \{-}\a.*$\n//
This works great -- removes all lines that start with any number of spaces (matched non-greedily), followed by a letter and anything else to the end of the line including the newline.
I cannot (to save my soul) figure out the analogous regex in Python. Here's what makes sense to me:
x = re.sub("^ *?\a.$\n","",y)
But this doesn't do anything.
Many thanks for your sagacious replies.
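\a means the bell character (0x07) in Python, and $\n is redundant because \n alone already marks the end of the line. Also, ^ only matches at the start of every line when re.MULTILINE is set, so:
x = re.sub(r"^ *[A-Za-z].*\n", "", y, flags=re.MULTILINE)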
Also, there's no reason to write ' *?' instead of ' *' here, as it's always going to be followed by a non-space if it's matching.
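If you want to match any amount of whitespace, you can also use the \s sequence. Note that \s also matches newlines, so blank lines directly before a matching line are removed too.
Any letter will be matched by the [a-zA-Z] character class. You also don't need both $ and \n; the \n alone is enough.
Suggest the following (re.MULTILINE makes ^ match at the start of every line):
x = re.sub(r"^\s*[a-zA-Z].*(\r|\n)", "", y, flags=re.MULTILINE)
If you want at least one whitespace character, use \s+ instead of \s*.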
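One thing worth checking when porting the vim command: in Python, ^ matches only at the very start of the string unless re.MULTILINE is passed. The sketch below, with made-up sample text, shows the difference.

```python
import re

y = "  alpha\n123\nbeta\n  456\n"
pattern = r"^ *[A-Za-z].*\n"

# Without re.MULTILINE, only a match at the start of the whole string is removed.
single = re.sub(pattern, "", y)

# With re.MULTILINE, every line that starts with optional spaces and a letter is removed.
multi = re.sub(pattern, "", y, flags=re.MULTILINE)

print(repr(single))  # '123\nbeta\n  456\n'
print(repr(multi))   # '123\n  456\n'
```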

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches every single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!
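In your pattern, \] is an escaped bracket, so the negated class does not end after the backslash but runs on to the final ]; that is why it is matching everything. Even a correct [^\\] would consume the preceding character. You want a negative lookbehind assertion:
((?<!\\)[#\$%\^&_\{\}~\\])
(?<!...) matches at a position only if ... does not match immediately before it. You can check this out in the Python re docs.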
The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving the parentheses should fix your original regex: ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].
If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.
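To make the lookbehind approach concrete, here is a simplified sketch of an escaping helper. The function name escape_latex is hypothetical, and the character set deliberately leaves out ^, ~ and \, since those need more than a prefixed backslash in LaTeX. It also does not handle the even-number-of-backslashes case discussed above.

```python
import re

# Reserved characters that LaTeX escapes with a single leading backslash,
# matched only when they are not already preceded by one.
UNESCAPED = re.compile(r'(?<!\\)([#$%&_{}])')

def escape_latex(text):
    # In the replacement template, \\ is a literal backslash and \1 is the matched character.
    return UNESCAPED.sub(r'\\\1', text)

print(escape_latex('50% & \\% done'))  # 50\% \& \% done
```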

python regular expressions to return complete result

How can I get what was matched from a python regular expression?
re.match("^\\\w*", "/welcome")
All python returns is a valid match; but I want the entire result returned; how do I do that?
Just use the re.findall function.
>>> re.findall("a+", 'sdaaddaa')
['aa', 'aa']
You could use a group.
res = re.search("^(\\\w*)", "/welcome")
if res:
    print(res.group(1))
Calling the group() method of the returned match object without any arguments will return the matched portion of the string.
The regular expression "^\\\w*" will match a string beginning with a backslash followed by 0 or more w characters. The string you are searching begins with a forward slash so your regex won't match. That's why you aren't getting anything back.
Note that if you print out your pattern string, it contains \\w. To the regex engine, the \\ means match a single backslash, and the w then means match a literal w. If you want a backslash followed by word characters, you need to escape the first backslash, and the easiest way is to use a raw string: r"^\\\w*" would match "\\welcome" but still not match "/welcome".
Notice that your "^" says the match has to start at the beginning of a line. RegexBuddy doesn't tell you that by default.
Maybe you want to tell us what exactly you are trying to find?
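A small sketch of the points above: the original pattern does not match "/welcome" at all, while a pattern that expects a forward slash does, and group() with no arguments returns the matched text. The second pattern is an assumption about what the asker intended.

```python
import re

# Equivalent to the question's "^\\\w*": a literal backslash followed by zero or more "w".
no_match = re.match(r"^\\w*", "/welcome")
print(no_match)  # None

# Assumed intent: a forward slash followed by word characters.
m = re.match(r"^/\w*", "/welcome")
matched = m.group() if m else None
print(matched)  # /welcome
```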

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the \'s as escape characters, so they reach the regex engine unchanged.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
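As a rough illustration of the capturing-group idea, the sketch below matches the whole (K(...)) wrapper and pulls out only the inner part with .groups(). The pattern is one possible choice, not necessarily the one the other answer uses.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# The inner parentheses (unescaped) form the capturing group;
# the escaped ones match literal "(" and ")".
m = re.search(r"\(K\(([^)]*)\)\)", url)
captured = m.groups() if m else ()
print(captured)  # ('ThinkCode',)
```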
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \( or \\ (escaping / as \/ is harmless but unnecessary in Python).
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until you see a close paren, then make sure a second close paren follows somewhere after it.
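Applied to the sample URL from the question, that pattern can be tried like this (a quick sketch, assuming the URL always has exactly this nesting):

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lazy quantifiers (.*?) stop at the first paren that lets the rest of the pattern match.
m = re.search(r"\(.*?\((.*?)\).*?\)", url)
inner = m.group(1) if m else None
print(inner)  # ThinkCode
```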
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
re.sub(r'^.*\((\w+)\).*', r'\1', mystr)
