Python string exact end, no additional characters - python

I'm trying to replace a string element, but only if it doesn't have additional characters after the match, though the characters before the match can vary. For example, if I tokenize a name containing underscores, I want to replace anything that ends with "R", but not elements that start with it. So it would replace "R" or "SideR", but not "Rear", because there are characters that follow the "R". I remember someone showing me something like this before, but can't find it. It was something akin to \n (but wasn't \n, which is a newline; there is no newline) that could be put at the end of a string to denote no further characters. (There was either one for the start too, or it may have been the same thing for start and end.)
test="New_R_SideR_Rear_Object"
tokens=test.split("_")
newtest=""
for each in tokens:
    if "R" in each:
        each=each.replace("R", "L")
    newtest=(newtest+each+"_")
I'm positive there is something I can add to the end of the "if "R" in each" line, or the .replace line, that will allow me to ensure that "Rear" doesn't become "Lear", but both "R" and "SideR" do get replaced.
The above is just simplified for ease of explanation. Thanks for your time

You can use a regular expression. The regular expression language provides a compact way to express how to match text. For your example:
$ python3
>>> import re
>>> test="New_R_SideR_Rear_Object"
>>> re.sub(r"R(_|\b)", r"L\1", test)
'New_L_SideL_Rear_Object'
>>>
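The marker the question is trying to recall is most likely the $ anchor (end of string), with ^ as its start-of-string counterpart. A minimal sketch, assuming the tokens are handled one at a time as in the question's loop; the helper name swap_trailing_r is made up for illustration:

```python
import re

def swap_trailing_r(token):
    # "R$" matches an "R" only when nothing follows it in the token.
    return re.sub(r"R$", "L", token)

test = "New_R_SideR_Rear_Object"
newtest = "_".join(swap_trailing_r(t) for t in test.split("_"))
print(newtest)  # New_L_SideL_Rear_Object
```

Joining with "_".join also avoids the trailing underscore that the original loop leaves on newtest.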

Related

Regex to split text file in python

I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
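Just change (\ |\.|\'|\d) to [\ .\'\d] or (?:\ |\.|\'|\d). re.split also returns the text matched by any capturing group, so the group needs to become a character class or a non-capturing group: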
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
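To see the difference, here is a small sketch that runs the capturing and the non-capturing pattern on the question's string. The variable names are only illustrative:

```python
import re

mystring = ("OBAMA: said something \nO'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Capturing group inside the look-ahead: its last value is added to the result.
capturing = re.split(r"\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)", mystring)

# Non-capturing group: only the pieces between the split points are returned.
non_capturing = re.split(r"\n(?=[A-Z]+(?:\ |\.|\'|\d)*[A-Z]*:)", mystring)

print(len(capturing))      # 5: three turns plus two group values
print(len(non_capturing))  # 3: just the turns
```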
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if ":" not in line:  # skip blank lines (such as the one after the final \n)
        continue
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
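Since the question's edit says a speech can contain newlines and colons of its own, a purely line-based split may cut a turn in the wrong place. A rough sketch that combines the two answers, assuming the speaker-label pattern from the regex answer is good enough for the real data: split on the look-ahead first, then cut each turn at its first colon. The sample transcript is invented for illustration.

```python
import re

transcript = (
    "OBAMA: said something\nthat spans two lines\n"
    "O'MALLEY: said: something else\n"
    "GOV. HICKENLOOPER: said something else entirely"
)

# Split only at newlines that are followed by a speaker label.
label = r"[A-Z]+[ .'\d]*[A-Z]*:"
turns = re.split(r"\n(?=" + label + ")", transcript)

# partition(":") cuts at the first colon only, so later colons stay in the speech.
pairs = []
for turn in turns:
    speaker, _, utterance = turn.partition(":")
    pairs.append((speaker.strip(), utterance.strip()))

print(pairs)
```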

Regex: Complement a group of characters (Python)

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case, it will match any character, except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh, and the | are actually interpreted as literal pipe characters, which just appear multiple times in the character class (where duplicates are ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
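As a sketch of the word-boundary variant, assuming the goal is to pick matching words out of a longer text: the pattern below uses the two lookbehinds (without the space in the first class) and closes with \b. The sample words are made up.

```python
import re

# Two fixed-width lookbehinds, then the trailing "s", then a word boundary
# so every word in the text is checked, not only the end of the string.
pattern = re.compile(r"\b\w*(?<![sxyzaeiou])(?<!ch|sh)s\b")

words = pattern.findall("bots boxs churchs cats days dogs")
print(words)  # ['bots', 'cats', 'dogs']
```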
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print(re.search(rx, 'bots')) # ok
print(re.search(rx, 'boxs')) # None
Note that re doesn't support variable-width lookbehinds, so you need two of them.
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even though it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]h|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unnecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the stuff inside the brackets is (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is no different from [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] is equivalent to [-()], the character class including the hyphen and left and right parentheses.
Thus the character class [^(---)]+ matches any character other than the hyphen or parentheses:
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
with open('myfile', 'r') as f:
    for i in f:
        if i[:3] == "---":
            break
    text = f.readlines()
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
s = open(myfile).read().split('\n\n---\n\n', 1)
print(s[0]) # first part
print(s[1]) # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.
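The literal separator only matches when the blank lines around the hyphens are exactly as written. A hedged alternative, assuming only that the first --- sits on a line of its own and that at least one such line exists: split once on a line-anchored pattern. The sample text is adapted from the question.

```python
import re

text = (
    "Anything above this first set of hyphens should not be captured.\n"
    "---\n"
    "This is content. It should be captured.\n"
    "---\n"
    "Later hyphens stay in the captured text.\n"
)

# (?m) lets ^ and $ match at line boundaries; maxsplit=1 stops after the
# first separator line, so later ones are left untouched.
before, after = re.split(r"(?m)^---$\n?", text, maxsplit=1)
print(after)
```

If the file might contain no separator at all, the two-name unpacking would fail, so check the length of the split result first.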

Regular Expressions: Special Characters and Tab Spaces

I was testing out a function that I wrote. It is supposed to give me the count of full stops (.) in a line or string. The full stop (.) that I am interested in counting has a tab space before and after it.
Here is what I have written.
def Seek():
    a = '1 . . 3 .'
    b = a.count(r'\t\.\t')
    return b
Seek()
However, when I test it, it returns 0. From a, there are 2 full stops (.) with both a tab space before and after it. Am I using regular expressions improperly? Represented a incorrectly? Any help is appreciated.
Thanks.
It doesn't look like a has any tabs in it. Although you may have hit the tab key on your keyboard, that character would have been interpreted by the text editor as "insert a number of spaces to align with the next tab character". You need your line to look like this:
a = '1\t.\t.\t3\t.'
That should do it.
A more complete example:
import re
def Seek():
    a = '1\t.\t.\t3\t.'
    pattern = re.compile(r'(?<=\t)\.(?=\t)')  # a dot with a tab on each side
    return len(pattern.findall(a))
print(Seek())
This uses "lookahead" and "lookbehind" to match the tab character without consuming it. What does that mean? It means that when you have \t.\t.\t, you will actually match both the first and the second \.. The original expression would have matched the initial \t\.\t and discarded them. After, there would have been a \. with nothing in front of it, and thus no second match. The lookaround syntax is "zero width" - the expression is tested but it ends up taking no space in the final match. Thus, the code snippet I just gave returns 2, just as you would expect.
It will work if you replace the '\t' with a single tab key press.
Note that count only counts non-overlapping occurrences of a substring so it won't work as intended unless you use regex instead, or change your substring to only test for a tab in front of the period.
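A short sketch of that difference, assuming the test string really contains tab characters (the variable names are illustrative):

```python
import re

a = "1\t.\t.\t3\t."

# str.count is non-overlapping: the tab after the first dot is used up,
# so the second dot is missed.
plain_count = a.count("\t.\t")
print(plain_count)  # 1

# Lookarounds check for the tabs without consuming them.
regex_count = len(re.findall(r"(?<=\t)\.(?=\t)", a))
print(regex_count)  # 2

# Testing only for the tab in front avoids the overlap, but it also
# counts the last dot, which has no tab after it.
front_only = a.count("\t.")
print(front_only)  # 3
```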

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to leave the \'s alone, so they reach the regex engine unchanged.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
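A minimal sketch of the capturing-group approach, assuming the same sample URL; the pattern is only one possible way to write it:

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# The escaped \(K\( and \) are literal parentheses; the unescaped pair
# in the middle is the capturing group.
match = re.search(r"\(K\(([^)]*)\)", url)
if match:
    print(match.groups())  # ('ThinkCode',)
    print(match.group(1))  # ThinkCode
```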
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until you see a close paren, then make sure there is one more close paren somewhere after it.
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)
