In my text I want to replace all leading tabs with two spaces but leave the non-leading tabs alone.
For example:
a
\tb
\t\tc
\td\te
f\t\tg
("a\n\tb\n\t\tc\n\td\te\nf\t\tg")
should turn into:
a
  b
    c
  d\te
f\t\tg
("a\n  b\n    c\n  d\te\nf\t\tg")
For my case I could do that with multiple replacement operations, repeating as many times as the maximum nesting level, or until nothing changes.
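For concreteness, here is a rough sketch of that multi-pass approach (just to show what I mean):
import re

s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"
# Keep converting one leading tab per line per pass until a pass changes nothing.
while True:
    new = re.sub(r"^((?:  )*)\t", r"\1  ", s, flags=re.MULTILINE)
    if new == s:
        break
    s = new
print(repr(s))  # 'a\n  b\n    c\n  d\te\nf\t\tg'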
But wouldn't it also be possible to do in a single run?
I tried but didn't manage to come up with something; the best I have come up with so far uses lookarounds:
re.sub(r'(^|(?<=\t))\t', '  ', a, flags=re.MULTILINE)
which "only" makes one wrong replacement (the second tab between f and g).
Now it might be that this is simply impossible to do with regex in a single run, because the already-replaced parts can't be matched again (or rather, the replacement does not happen right away) and you can't sort of "count" in regex. In that case I would love to see a more detailed explanation of why (as long as this won't shift too much into [cs.se] territory).
I am working in Python currently but this could apply to pretty much any similar regex implementation.
You may match the tabs at the start of the lines, and use a lambda inside re.sub to replace with the double spaces multiplied by the length of the match:
import re
s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg";
print(re.sub(r"^\t+", lambda m: " "*len(m.group()), s, flags=re.M))
It is also possible to do this without regex using replace() in a one liner:
>>> s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"
>>> "\n".join(x.replace("\t"," ",len(x)-len(x.lstrip("\t"))) for x in s.split("\n"))
'a\n b\n c\n d\te\nf\t\tg'
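Unpacked a little (the helper name is my own), the trick is that len(x) - len(x.lstrip("\t")) counts the leading tabs, and str.replace with a count argument only replaces that many occurrences, which are exactly the leading ones:
def expand_leading_tabs(line):
    leading = len(line) - len(line.lstrip("\t"))  # number of leading tabs
    return line.replace("\t", "  ", leading)      # only the first `leading` tabs are replaced

s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"
print("\n".join(expand_leading_tabs(line) for line in s.split("\n")))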
This here is kinda crazy, but it works:
"\n".join([ re.sub(r"^(\t+)"," "*(2*len(re.sub(r"^(\t+).*","\1",x))),x) for x in "a\n\tb\n\t\tc\n\td\te\nf\t\tg".splitlines() ])
Related
I am writing a snippet for the Vim plugin UltiSnips which will trigger on a regex pattern (as supported by Python 3). To avoid conflicts I want to make sure that my snippet only triggers when contained somewhere inside of $$___$$. Note that the trigger pattern might contain an indefinite string in front or behind it. So as an example I might want to match all "a" in "$$ccbbabbcc$$" but not "ccbbabbcc". Obviously this would be trivial if I could simply use indefinite look behind. Alas, I may not as this isn't .NET and vanilla Python will not allow it. Is there a standard way of implementing this kind of expression? Note that I will not be able to use any python functions. The expression must be a self-contained trigger.
If what you are looking for only occurs once between the '$$', then:
\$\$.*?(a)(?=.*?\$\$)
This allows you to match all 3 a characters in the following example:
\$\$ Matches '$$'
.*? Matches 0 or more characters, non-greedily
(a) Captures the 'a'
(?=.*?\$\$) The match must be followed by 0 or more arbitrary characters and then '$$'
The code:
import re
s = "$$ccbbabbcc$$xxax$$bcaxay$$"
print(re.findall(r'\$\$.*?(a)(?=.*?\$\$)', s))
Prints:
['a', 'a', 'a']
The following should work:
re.findall("\${2}.+\${2}", stuff)
Breakdown:
Looks for two '$'
"\${2}
Then looks for one or more of any character
.+
Then looks for two '$' again
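A quick check of that pattern against the strings from the question (my own test):
import re

stuff = "$$ccbbabbcc$$ccbbabbcc"
print(re.findall(r"\${2}.+\${2}", stuff))  # ['$$ccbbabbcc$$']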
I believe this regex would work to match the a within the $$:
text = '$$ccbbabbcc$$ccbbabbcc'
re.findall(r'\${2}.*(a).*\${2}', text)
# prints
['a']
Alternatively:
A simple approach (requiring two checks instead of one regex) would be to first find all parts enclosed in your quoting text, then check whether your search string is present within.
Example:
text = '$$ccbbabbcc$$ccbbabbcc'
search_string = 'a'
parts = re.findall(r'\${2}.+\${2}', text)
[p for p in parts if search_string in p]
# prints
['$$ccbbabbcc$$']
I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because here the . is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is ''. I understand why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.sub does not see a literal full stop to replace; it interprets '.' as the usual regex wildcard instead.
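Spelling that out:
import re

s1 = r'a.b.c.d.e.f.g.h'
captured = re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
print(captured)                  # '.'  (a plain one-character string)
print(re.sub(captured, '', s1))  # ''   because the pattern '.' matches every character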
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, pass the same regular expression to re.sub instead of re.search, not to both.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets; this is a context where dot and colon have no special meaning [1]), repeated as many times as possible, and lookarounds on both sides saying that we cannot match these characters when there is whitespace \s on either side. The lookarounds also exclude the dot and colon characters themselves, so that there is no way for the regex engine to find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
[1] You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
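A tiny illustration of that difference:
import re

print(re.findall(r'.', 'a.b'))    # ['a', '.', 'b']  (dot matches any character)
print(re.findall(r'[.]', 'a.b'))  # ['.']            (inside a class, dot is literal)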
My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first alternative, allowing you to drop all the parentheses and match what you want directly. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string directly from the groups the regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
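For instance, with a made-up input string, wrapping each captured group in parentheses builds a new string out of the captures:
import re

string1 = "1a 2"
print(re.sub(r"(\d.*?)\s(\d.*?)", r"(\1) (\2)", string1))  # prints: (1a) (2)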
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
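A small made-up example of that ambiguity, appending a literal 0 right after group 1:
import re

print(re.sub(r"(\d)", r"\g<1>0", "a5b"))  # a50b
# re.sub(r"(\d)", r"\10", "a5b") would fail instead: \10 is read as a reference to group 10.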
I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") has white spaces, in the second part (a subtitle) each word is separated with the "_" character, and finally there is a number (a page number). I want to split each line into 3 (obvious) parts, because I want to create some sort of directory in Python.
I was trying with the re module, but as the caret character has a special meaning in that module, I couldn't figure out how to do it.
Could someone please help me?
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want three pieces you can accomplish this with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters will produce an empty string in the result. You could protect against this by either collapsing consecutive carets down into a single one, or by detecting empty strings in the resulting list.
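For instance, collapsing the consecutive carets first (a small sketch; filtering out the empty strings instead is what the list comprehension above does):
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
parts = re.sub(r"\^+", "^", line).split("^")
print(parts)  # ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']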
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
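A sketch of reading such a file with csv (the file name here is hypothetical):
import csv

with open("menu.txt", newline="") as f:
    for row in csv.reader(f, delimiter="^"):
        print(row)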
Also, for the re module, any special characters can be escaped with \, like r'\^'. Before jumping to re, though, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a solution to your problem that doesn't need them: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»