Regular Expressions: Special Characters and Tab Spaces - python

I was testing out a function that I wrote. It is supposed to give me the count of full stops (.) in a line or string. The full stop (.) that I am interested in counting has a tab space before and after it.
Here is what I have written.
def Seek():
    a = '1 . . 3 .'
    b = a.count(r'\t\.\t')
    return b
Seek()
However, when I test it, it returns 0. From a, there are 2 full stops (.) with both a tab space before and after it. Am I using regular expressions improperly? Represented a incorrectly? Any help is appreciated.
Thanks.

It doesn't look like a has any tabs in it. Although you may have hit the Tab key on your keyboard, the text editor probably interpreted that as "insert enough spaces to reach the next tab stop". You need your line to look like this:
a = '1\t.\t.\t3\t.'
Note also that str.count takes a literal substring, not a regular expression, so the argument r'\t\.\t' is searched for as literal backslashes and letters. For a pattern like this you need the re module.
A more complete example:
import re

def Seek():
    a = '1\t.\t.\t3\t.'
    # Lookbehind/lookahead require the tabs without consuming them
    pattern = re.compile(r'(?<=\t)\.(?=\t)')
    return len(pattern.findall(a))

print(Seek())
This uses "lookahead" and "lookbehind" to require the tab characters without consuming them. What does that mean? It means that in \t.\t.\t you will match both the first and the second period. A pattern such as \t\.\t would have matched the initial \t.\t and consumed all three characters. The next search would then start at a period with no tab left in front of it, so there would be no second match. Lookaround syntax is "zero width": the expression is tested, but it takes up no space in the final match. Thus the code snippet above returns 2, just as you would expect.

It will work if you replace the '\t' with a single tab key press.
Note that count only counts non-overlapping occurrences of a substring so it won't work as intended unless you use regex instead, or change your substring to only test for a tab in front of the period.
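To make the difference concrete, here is a small sketch comparing the two approaches on the tab-separated string from the answer above. The variable names are only illustrative.

```python
import re

# The sample string from the answer above, with real tab characters.
a = '1\t.\t.\t3\t.'

# str.count takes a literal substring and counts non-overlapping hits,
# so the tab shared by the two periods can only be used once.
literal_count = a.count('\t.\t')

# Lookarounds check for the tabs without consuming them,
# so both tab-delimited periods are found.
regex_count = len(re.findall(r'(?<=\t)\.(?=\t)', a))

print(literal_count, regex_count)
```

The literal count comes out lower than the regex count because the middle tab belongs to both candidate matches.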

Related

Regex: match a character except at the beginning of a string

I'm trying to strip a character from a string, unless that character is at the beginning of a string.
So far, my code looks like this:
import re

def strip_string(value):
    return re.sub(r"[^0-9\.]", '', value)
# strip_string('1-23') => '123'
I want to remove only the dashes that aren't the first character though:
strip_string('-1-23') => '-123'
I know how to target dashes that are the first character (r"^-"), but not the inverse.
Is it possible to do this, or do I need to go about it differently?
The simplest way to match a character that is not at the beginning of a string is to put a (?!^) / (?!\A) negative lookahead in front of it. However, you can't just use re.sub(r"(?!^)[^0-9.]",'',value): it would also keep any other non-digit character at the start of the string, while your scenario implies you only want to keep a leading hyphen.
Thus, in Python 3.5 and newer (where a reference to a group that did not participate in the match is replaced with an empty string) you may use:
re.sub(r"^(-)|[^0-9.]+", r"\1", value)
Or, you may fall back to
re.sub(r"(?!^)-|[^0-9.-]+", "", value) # This one is somewhat easier to understand
re.sub(r"-(?<!^-)|[^0-9.-]+", "", value) # This one is a bit more efficient
Both -(?<!^-) and (?!^)- match a - that is not at the start of a string.
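As a quick check of the patterns above, the following sketch wraps two of them in functions and runs them on a few inputs. The function name strip_string_lookahead and the sample strings are made up for illustration.

```python
import re

def strip_string(value):
    # Keep a leading hyphen via group 1; drop every other character
    # that is not a digit or a period. Needs Python 3.5+ so that the
    # unmatched \1 is replaced by an empty string.
    return re.sub(r"^(-)|[^0-9.]+", r"\1", value)

def strip_string_lookahead(value):
    # Same idea without a backreference: remove hyphens that are not
    # at the start, plus anything that is not a digit, period, or hyphen.
    return re.sub(r"(?!^)-|[^0-9.-]+", "", value)

for sample in ("1-23", "-1-23", "-1a.5-"):
    print(sample, strip_string(sample), strip_string_lookahead(sample))
```

Both functions should agree on these inputs; which one to prefer is mostly a matter of readability.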

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.
re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
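If the last '>' line might not end with a newline, a variant using the MULTILINE flag may be more robust. This is only a sketch and not part of the answer above: it assumes the '>' always sits at the start of a line, and the sample text is made up.

```python
import re

# Hypothetical input whose final line has no trailing newline.
text = ">test1\n(something else here)\n>test2"

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so the final line is found even without a trailing newline.
matches = re.findall(r"^>(.*)$", text, flags=re.MULTILINE)
result = "\n".join(matches)
print(result)
```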
You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing group with an empty alternative: if the matched text is followed by a newline that is itself followed by a non-newline character, the group matches that newline; otherwise it matches nothing.
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Python string exact end, no additional characters

I'm trying to replace a string element, but only if it doesn't have additional characters after the match, though the characters before the match can vary. For example, if I tokenize a name containing underscores, I want to replace anything that ends with "R", but not elements that start with it. So it would replace "R" or "SideR", but not "Rear", because there are characters that follow the "R". I remember someone showing me something like this before, but I can't find it. It was something akin to \n (but it wasn't \n, which is a newline, and there is no newline here) that could be put at the end of a pattern to denote no further characters. (There was either one for the start as well, or it may have been the same thing for start and end.)
test="New_R_SideR_Rear_Object"
tokens=test.split("_")
newtest=""
for each in tokens:
    if "R" in each:
        each=each.replace("R", "L")
    newtest=(newtest+each+"_")
I'm positive there is something I can add to the end of the 'if "R" in each' line, or the .replace line, that will ensure that "Rear" doesn't become "Lear", but both "R" and "SideR" do get replaced.
The above is just simplified for ease of explanation. Thanks for your time
You can use a regular expression. The regular expression language provides a compact way to express how to match text. For your example:
$ python3
>>> import re
>>> test="New_R_SideR_Rear_Object"
>>> re.sub(r"R(_|\b)", r"L\1", test)
'New_L_SideL_Rear_Object'
>>>
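The marker the question is trying to recall is most likely the $ anchor, which matches at the end of a string (with ^ as its counterpart for the start). The sketch below applies it per token, keeping the question's split-on-underscore approach; new_tokens is an illustrative name.

```python
import re

test = "New_R_SideR_Rear_Object"
tokens = test.split("_")

# "R$" matches an R only when nothing follows it in the token,
# so "R" and "SideR" change while "Rear" is left alone.
new_tokens = [re.sub(r"R$", "L", token) for token in tokens]
newtest = "_".join(new_tokens)
print(newtest)
```

Using "_".join also avoids the trailing underscore that the original loop appends. For a plain suffix check without regex, token.endswith("R") would work too.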

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
As a more Pythonic way to do this, you can use the str.startswith() method and skip regex altogether.
As for your regex "[>]{1}.*": you don't need {1} after the character class, and you can anchor the pattern to the start of the string with ^. So it can be "^>.*".
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the tool also notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list were provided inside the square brackets (as opposed to a single character), the regex would attempt to match a single character from the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of the string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
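The sketch below puts the two suggestions side by side: str.startswith and the anchored pattern ^>.*. The sample lines are invented for illustration.

```python
import re

regex_firstline = re.compile(r"^>.*")

# Hypothetical sample lines.
lines = [">header one", "plain text", "a > b"]

for line in lines:
    by_method = line.startswith(">")
    by_regex = regex_firstline.match(line) is not None
    print(repr(line), by_method, by_regex)

# Without the anchor, search() also finds a ">" in the middle of a line.
unanchored = re.search(r">.*", "a > b")
print(unanchored.group())
```

The two checks should agree on every line here; the unanchored search shows why the ^ matters if only lines that begin with > are wanted.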

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the things inside the brackets are (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is no different from [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] is just a set of single characters: Python reads (-- as the range from ( to - (which covers ( ) * + , and -), followed by a literal - and a literal ).
Thus the character class [^(---)]+ matches a run of characters that contains no hyphen or parenthesis (nor *, + or a comma):
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
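To confirm that later hyphen lines are ignored, the sketch below runs the (?ms)^---(.*)\Z pattern from this answer on a made-up sample containing two --- lines.

```python
import re

pat = re.compile(r'(?ms)^---(.*)\Z')

# Hypothetical sample with a second --- line after the first one.
text = "above\n---\nfirst part\n---\nsecond part\n"

match = pat.search(text)
captured = match.group(1) if match else ""

# The capture starts right after the first ---, so it begins with the
# newline that ended that line; strip it for display.
print(captured.lstrip("\n"))
```

Because search() stops at the first position where ^--- matches and .* then runs to the end of the string, the second --- ends up inside the captured text.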
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
with open('myfile', 'r') as f:
    for line in f:
        if line[:3] == "---":
            break
    # Everything after the first "---" line
    text = f.readlines()
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
s = open(myfile).read().split('\n\n---\n\n', 1)
print(s[0])  # first part
print(s[1])  # second part after the dashes
This should work for your example, as long as the --- line has a blank line before and after it. The second parameter to split specifies how many times to split the string.
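The non-regex idea can also be sketched against an in-memory string, without depending on blank lines around the separator. The function name text_after_first_marker and the sample text are hypothetical.

```python
def text_after_first_marker(text, marker="---"):
    """Return everything after the first line that starts with marker."""
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.startswith(marker):
            # Later marker lines are part of the captured content.
            return "".join(lines[index + 1:])
    return ""  # no marker line found


sample = "above\n---\ncontent\n---\nmore content\n"
print(text_after_first_marker(sample))
```

Returning an empty string when no marker is found is a design choice made for this sketch; raising an error might suit other callers better.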
