Match string between special characters - python

I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will be in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = "(?<=\\n\\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print(text)
else:
    print("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.

You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re
print(re.findall(r'(?<=\n\n\*)[^*]*(?=\*)', '\n\n*text here, can be any spaces, etc. etc.*'))
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
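As a quick check of the two patterns above, here is a minimal sketch that runs both against the sample string from the question. The variable names are only illustrative.

```python
import re

s = "\n\n*text here, can be any spaces, etc. etc.*"

# Lookbehind and lookahead keep both delimiters out of the match.
inner = re.search(r'(?<=\n\n\*)[^*]*(?=\*)', s)
# Only the two newlines are excluded; the asterisks are part of the match.
with_stars = re.search(r'(?<=\n\n)\*[^*]*\*', s)

inner_text = inner.group() if inner else None
starred_text = with_stars.group() if with_stars else None
print(inner_text)
print(starred_text)
```

Because of [^*]*, both patterns stop at the first closing asterisk, so they will not span text that itself contains an asterisk.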

Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
...
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
Adjust the slice indexes as needed; it wasn't clear to me whether you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to:
return s[2:]

Related

how do you do regex in python

I have a string like this:
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
I need to get rid of everything until the first instance of the underline (inclusive) in regex.
I've tried this:
re.sub("(^.*\_)", "", data)
but this gets rid of everything up to the last underscore:
ProcessCpuUsage
I need it to be:
jvmRuntimeModule_ProcessCpuUsage
Use this instead:
data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
result = data[data.find("_") + 1:]
print(result)
re.sub("(^.*\_)", "", data)
This makes .* match every character in the line. Once it gets to the end and can't match any more characters, it moves on to the next token, which is an underscore. So it backtracks to just before _ProcessCpuUsage, where it can match an underscore, and then completes the match.
You should ask the * quantifier to be less greedy. You also do not need to capture the contents, so drop the parens. The backslash does nothing, so drop it. Keep the leading ^ anchor, though: re.sub replaces every match, and without the anchor the lazy pattern would match again after the first underscore and strip the second segment too.
re.sub("^.*?_", "", data)
You have become a victim of greedy matching. The expression matches the longest sequence that it possibly can.
I know there's a way to turn off greedy matching, but I never remember it. Instead there's a trick I use when there's a character I want to stop at. Instead of matching on every character with . I match on every character except the one I want to stop at.
re.sub("^[^_]*_", "", data)
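To make the difference between the variants discussed above concrete, here is a small sketch that runs them side by side on the string from the question. The variable names are mine, and the unanchored cases are included only to show what re.sub does when the pattern can match more than once.

```python
import re

data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'

# Greedy: .* runs to the end and backtracks to the LAST underscore.
greedy = re.sub(r'^.*_', '', data)
# Lazy: .*? stops at the FIRST underscore.
lazy = re.sub(r'^.*?_', '', data)
# Negated class: [^_]* cannot cross an underscore at all.
negated = re.sub(r'^[^_]*_', '', data)
# Without the anchor, re.sub replaces every match, one per underscore.
unanchored = re.sub(r'.*?_', '', data)
# count=1 limits the substitution to the first match.
unanchored_once = re.sub(r'.*?_', '', data, count=1)

for name, value in [('greedy', greedy), ('lazy', lazy), ('negated', negated),
                    ('unanchored', unanchored), ('unanchored_once', unanchored_once)]:
    print(name, '->', value)
```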
This should do:
import re
def get_last_part(d):
    m = re.match('[^_]*_(.*)', d)
    if m:
        return m.group(1)
    else:
        return None
print(get_last_part('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
You can use str.index:
>>> data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
>>> data[data.index('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.split:
>>> data.split('_',1)[1]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.find:
>>> data[data.find('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Take a look at Python's string methods.
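The three string-method variants above behave differently when the input has no underscore: str.index raises ValueError, split('_', 1)[1] raises IndexError, and the str.find version silently returns the whole string. As a sketch of one way to make that case explicit, here is a version using str.partition. The function name and the choice to return the input unchanged are my own assumptions, not something the question asked for.

```python
def after_first_underscore(data):
    """Return the text after the first underscore, or data itself if there is none."""
    head, sep, tail = data.partition('_')
    # sep is '' when no underscore was found
    return tail if sep else data

print(after_first_underscore('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
print(after_first_underscore('NoUnderscoreHere'))
```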
Try this regex:
result = re.sub("^.*?_", "", text)
What the regex ^.*?_ does:
^ .. Assert that the position is at the beginning of the string.
.*? .. Match any character that is not a line break, between zero and unlimited times, as few times as possible.
_ .. Match the character _ literally.
Try using split():
s = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
print(s.split('_',1)[1])
Result:
jvmRuntimeModule_ProcessCpuUsage

Find ISBN with regex in Python

I have a text (actually lots of texts), each with one ISBN somewhere inside, and I have to find it.
I know: my ISBN-13 will start with "978" followed by 10 digits.
I don't know: how many '-' (minus) there are and whether they are in the correct place.
My code will only find me the ISBN without any Minus:
regex=r'978[0-9]{10}'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
But how can I find ISBN like these:
978-123-456-789-0
978-1234-567890
9781234567890
etc...
Is this possible with one regex-pattern?
Thanks!
This matches 10 digits and allows one optional hyphen before each:
regex = r'978(?:-?\d){10}'
Since you can't have 2 consecutive hyphens, and it must end with a digit:
r'978(-?\d){10}'
... allowing for a hyphen right after the 978, mandating a digit after every hyphen (so it does not end in a hyphen), and allowing for consecutive digits by making each hyphen optional.
I would add \b before the 978 and after the {10}, to make sure the ISBNs are well separated from surrounding text.
Also, I would add ?: right after the opening parenthesis, to make those non-capturing (slightly better performance, and also more expressive), making it:
r'\b978(?:-?\d){10}\b'
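As a rough sketch of how the final pattern above could be used, the following finds ISBNs in a text and strips the hyphens so that all spellings compare equal. The function name find_isbns and the sample text are made up for illustration.

```python
import re

ISBN_RE = re.compile(r'\b978(?:-?\d){10}\b')

def find_isbns(text):
    """Return every matched ISBN with the hyphens removed."""
    return [m.group().replace('-', '') for m in ISBN_RE.finditer(text)]

sample = "See 978-123-456-789-0, 978-1234-567890 and 9781234567890."
print(find_isbns(sample))
```

finditer is used instead of findall only to make it obvious that the whole match is taken; with a non-capturing group, findall would return the same strings.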
What about adding the - char in the pattern for the regex? This way, it will look for any combination of (number or -)x10 times.
regex=r'978[0-9\-]{10}'
Although it may be better to use
regex=r'978[0-9\-]+'
because otherwise if we use {10} and some - are found, not all the digits will be found.
Test
>>> import re
>>> regex=r'978[0-9\-]+'
>>> pattern = re.compile(regex, re.UNICODE)
>>> mytext="978-123-456-789-0"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-123-456-789-0
>>> mytext="978-1234-567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-1234-567890
>>> mytext="9781234567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
9781234567890
>>>
You can try to match any run of digits and - characters. The drawback is that you no longer know how many characters will be matched:
regex=r'978[\d\-]+\d'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
If your ISBN is stuck between other digits or hyphens, you'll have some problems, but if it's clearly separated, no worries :)
EDIT: According to the first comment, you can add an extra \d at the end of the regex (I've updated the code above) because you know that an ISBN ends with a digit.
The simplest way should be
regex=r'978[-0-9]{10,15}'
which will accept them.
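One thing to keep in mind with the character-class patterns above is that they count hyphens and digits alike, so they also accept strings that contain no ISBN at all. A small sketch of the difference, using a made-up sample text:

```python
import re

text = "code 978---------- and 978-1234-567890"

# Character class: any mix of 10 to 15 digits or hyphens after 978.
loose = re.findall(r'978[-0-9]{10,15}', text)
# One optional hyphen before each of exactly 10 digits.
strict = re.findall(r'978(?:-?\d){10}', text)

print(loose)
print(strict)
```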
If someone is still looking, see "ISBN Detail and Constraints".
Easy one: regex = r'^(978-?|979-?)?\d(-?\d){9}$'
Strong one: isbnRegex = r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$', combined with a length check of 10 or 13 after removing hyphens. (Note: for length 13, also check that the string starts with 978 or 979. Some edge cases still need to be checked.)
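The "strong" variant plus the length checks described above could be sketched roughly like this. The function name looks_like_isbn is hypothetical, and this only checks the shape of the string. It does not verify the check digit or handle an ISBN-10 ending in X.

```python
import re

ISBN_SHAPE = re.compile(r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$')

def looks_like_isbn(candidate):
    """Shape and length check only; no check-digit validation."""
    if not ISBN_SHAPE.match(candidate):
        return False
    digits = candidate.replace('-', '')
    if len(digits) == 13:
        # 13-digit ISBNs must carry the 978 or 979 prefix
        return digits.startswith(('978', '979'))
    return len(digits) == 10

for s in ['978-123-456-789-0', '9781234567890', '1234567890',
          '1234567890123', '978-12']:
    print(s, looks_like_isbn(s))
```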

Searching a string and returning only things I specify

Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem that the number could be more than 1 digit, so this bottom piece of code only works if the number is 1 digit. If the value were I=10mV, it would only return I=1, but if I put [/0-9] in twice, the code no longer works with a single-digit value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather than checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better, you could for example use str.startswith() to check if the words starts with a certain string or re.match() to check if the word matches a pattern.
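Putting the steps above together, a minimal sketch might look like the following. The function name find_word, the default argument, and the stripping of trailing punctuation are my own additions. Note that next() without a default raises StopIteration when nothing matches, which is why a default is passed here.

```python
def find_word(text, keyword, default=None):
    """Return the first whitespace-separated word containing keyword."""
    # Strip common punctuation so "I=10mV," comes back as "I=10mV".
    words = (w.strip('.,;:!?') for w in text.split())
    return next((w for w in words if keyword in w), default)

print(find_word("One of the values, I=1mV is used", "I="))
print(find_word("blah blah I=10mV, blah etc?", "I="))
print(find_word("nothing to see here", "I="))
```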
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. The pattern above happens to work as a plain string too, because \s, \d and \. are not string escape sequences, but recent Python versions warn about such unrecognized escapes, and one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
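For readers who want to compare notes, here is one possible sketch covering both exercises; it is certainly not the only answer. The names CURRENT_RE and parse_current are made up, and I have assumed the unit is optional and that no whitespace appears between the minus sign and the digits.

```python
import re

CURRENT_RE = re.compile(r"""
    I=                       # literal "I="
    \s*                      # possible whitespace
    (?P<value>
        -?\d+(?:\.\d+)?      # integer or decimal, possibly negative
        (?:[eE][-+]?\d+)?    # optional exponent, e.g. e3 or e-4
    )
    \s*                      # possible whitespace before the unit
    (?P<unit>mA|kA|A)?       # optional unit, using the alternation operator
""", re.VERBOSE)

def parse_current(text):
    """Return (value, unit) as strings, or None if no 'I=' value is found."""
    m = CURRENT_RE.search(text)
    return (m.group("value"), m.group("unit")) if m else None

print(parse_current("Loadflow current was I=30.63kA"))
print(parse_current("I=-3.141e-4 A"))
print(parse_current("I=1.00e3"))
```

Named groups are used so the caller does not have to remember group numbers; the unit comes back as None when it is missing.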
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
print(l[1].split()[0])
If you have more than one I=, do the above for every element of l after the first, since l[0] is the text before the first I=.
This is a good way, since it still works if someone writes "One of the values, I= 1mV is used", and I guess you want to get that I is 1mV.
BTW, I is current and its units are amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
in case you are interested in extracting that.

Python Regex Split Keeps Split Pattern Characters

Easiest way to explain this is an example:
I have this string: 'Docs/src/Scripts/temp'
Which I know how to split two different ways:
re.split('/', 'Docs/src/Scripts/temp') -> ['Docs', 'src', 'Scripts', 'temp']
re.split('(/)', 'Docs/src/Scripts/temp') -> ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']
Is there a way to split by the forward slash, but keep the slash part of the words?
For example, I want the above string to look like this:
['Docs/', '/src/', '/Scripts/', '/temp']
Any help would be appreciated!
Interesting question, I would suggest doing something like this:
>>> 'Docs/src/Scripts/temp'.replace('/', '/\x00/').split('\x00')
['Docs/', '/src/', '/Scripts/', '/temp']
The idea here is to first replace all / characters by two / characters separated by a special character that would not be a part of the original string. I used a null byte ('\x00'), but you could change this to something else, then finally split on that special character.
Regex isn't actually great here because re.split() cannot split on zero-length matches in Python versions before 3.7, and re.findall() does not find overlapping matches, so you would potentially need to do several passes over the string.
Also, re.split('/', s) will do the same thing as s.split('/'), but the second is more efficient.
A solution without split() but with lookaheads:
>>> s = 'Docs/src/Scripts/temp'
>>> r = re.compile(r"(?=((?:^|/)[^/]*/?))")
>>> r.findall(s)
['Docs/', '/src/', '/Scripts/', '/temp']
Explanation:
(?= # Assert that it's possible to match...
( # and capture...
(?:^|/) # the start of the string or a slash
[^/]* # any number of non-slash characters
/? # and (optionally) an ending slash.
) # End of capturing group
) # End of lookahead
Since a lookahead assertion is tried at every position in the string and doesn't consume any characters, it doesn't have a problem with overlapping matches.
1) You do not need regular expressions to split on a single fixed character:
>>> 'Docs/src/Scripts/temp'.split('/')
['Docs', 'src', 'Scripts', 'temp']
2) Consider using this method:
import os.path
def components(path):
    start = 0
    for end, c in enumerate(path):
        if c == os.path.sep:
            yield path[start:end+1]
            start = end
    yield path[start:]
It doesn't rely on clever tricks like split-join-splitting, which makes it much more readable, in my opinion.
If you don't insist on having slashes on both sides, it's actually quite simple:
>>> re.findall(r"([^/]*/)", 'Docs/src/Scripts/temp')
['Docs/', 'src/', 'Scripts/']
Neither re nor split are really cut out for overlapping strings, so if that's what you really want, I'd just add a slash to the start of every result except the first.
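The "add the slashes back yourself" idea from the paragraph above could be sketched like this, using plain str.split instead of re. The function name split_keep_slashes is only illustrative.

```python
def split_keep_slashes(path):
    """Split on '/' and put a slash back on each side that had one."""
    parts = path.split('/')
    last = len(parts) - 1
    result = []
    for i, part in enumerate(parts):
        prefix = '/' if i > 0 else ''      # every part but the first had a slash before it
        suffix = '/' if i < last else ''   # every part but the last had a slash after it
        result.append(prefix + part + suffix)
    return result

print(split_keep_slashes('Docs/src/Scripts/temp'))
```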
Try this:
re.split(r'(/)', 'Docs/src/Scripts/temp')
From python's documentation
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the
occurrences of pattern. If capturing parentheses are used in pattern,
then the text of all groups in the pattern are also returned as part
of the resulting list. If maxsplit is nonzero, at most maxsplit splits
occur, and the remainder of the string is returned as the final
element of the list. (Incompatibility note: in the original Python 1.5
release, maxsplit was ignored. This has been fixed in later releases.)
I'm not sure there is an easy way to do this. This is the best I could come up with...
import re
lSplit = re.split('/', 'Docs/src/Scripts/temp')
print([lSplit[0] + '/'] + ['/' + x + '/' for x in lSplit[1:-1]] + ['/' + lSplit[-1]])
Kind of a mess, but it does do what you wanted.

Python Regular Expression Matching: ## ##

I'm searching a file line by line for the occurrence of ##random_string##. It works except for the case of multiple #...
pattern='##(.*?)##'
prog=re.compile(pattern)
string='lala ###hey## there'
result=prog.search(string)
print(re.sub(result.group(1), 'FOUND', string))
Desired Output:
"lala #FOUND there"
Instead I get the following because it's grabbing the whole ###hey##:
"lala FOUND there"
So how would I ignore any number of # at the beginning or end, and only capture "##string##"?
To match at least two hashes at either end:
pattern='##+(.*?)##+'
Your problem is with your inner match. You use ., which matches any character that isn't a line end, and that means it matches # as well. So when it gets ###hey##, it matches (.*?) to #hey.
The easy solution is to exclude the # character from the matchable set:
prog = re.compile(r'##([^#]*)##')
Protip: Use raw strings (e.g. r'') for regular expressions so you don't have to go crazy with backslash escapes.
Trying to allow # inside the hashes will make things much more complicated.
EDIT: If you do not want to allow blank inner text (i.e. "####" shouldn't match with an inner text of ""), then change it to:
prog = re.compile(r'##([^#]+)##')
+ means "one or more."
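To see the difference described above, here is a short sketch that runs the lazy-dot pattern and the negated-class pattern against the string from the question. The variable names are mine.

```python
import re

s = 'lala ###hey## there'

# Lazy dot: the match starts at the first '##', so the third '#' ends up in the group.
lazy = re.search(r'##(.*?)##', s)
# Negated class: the group cannot contain '#', so the match starts one character later.
strict = re.search(r'##([^#]*)##', s)

lazy_inner = lazy.group(1) if lazy else None
strict_inner = strict.group(1) if strict else None
replaced = re.sub(r'##[^#]+##', 'FOUND', s)

print(lazy_inner)
print(strict_inner)
print(replaced)
```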
'^#{2,}([^#]*)#{2,}' -- any number of # >= 2 on either end
be careful with using lazy quantifiers like (.*?) because it'd match '##abc#####' and capture 'abc###'. also lazy quantifiers are very slow
Try the "block comment trick": /##((?:[^#]|#[^#])+?)##/
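The slashes in the pattern above are regex-literal delimiters from other languages; in Python you would pass just the part between them. As a rough sketch of what this pattern does, it allows a single # inside the text as long as it is not followed by another #:

```python
import re

block = re.compile(r'##((?:[^#]|#[^#])+?)##')

# A lone '#' inside the delimiters is accepted.
m1 = block.search('##a#b##')
inner_single_hash = m1.group(1) if m1 else None

# On the question's string, the extra leading '#' is still captured.
m2 = block.search('lala ###hey## there')
inner_question = m2.group(1) if m2 else None

print(inner_single_hash)
print(inner_question)
```

So this variant is useful when the inner text may contain single hashes, but by itself it does not drop the extra leading # from the question's example.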
Adding + to regex, which means to match one or more character.
pattern='#+(.*?)#+'
prog=re.compile(pattern)
string='###HEY##'
result=prog.search(string)
print(result.group(1))
Output:
HEY
have you considered doing it non-regex way?
>>> string='lala ####hey## there'
>>> string.split("####")[1].split("#")[0]
'hey'
>>> import re
>>> text= 'lala ###hey## there'
>>> matcher= re.compile(r"##[^#]+##")
>>> print(matcher.sub("FOUND", text))
lala #FOUND there
