Regex match single characters between strings - python

I have a string with some markup which I'm trying to parse, generally formatted like this.
'[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
I want to match the asterisks within the [list] tags so I can re.sub them as [**] but I'm having trouble forming an expression to grab them. So far, I have:
match = re.compile('\[list\].+?\[/list\]', re.DOTALL)
This gets everything within the list, but I can't figure out a way to narrow it down to the asterisks alone. Any advice would be massively appreciated.

You may use a re.sub and use a lambda in the replacement part. You pass the match to the lambda and use a mere .replace('*','**') on the match value.
Here is the sample code:
import re
s = '[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
match = re.compile('\[list].+?\[/list]', re.DOTALL)
print(match.sub(lambda m: m.group().replace('*', '**'), s))
# = > [*]
# [list][**][**][/list][*]text[list][**][/list]
See the IDEONE demo
Note that a ] outside of a character class does not have to be escaped in Python re regex.

Related

Python matching dashes using Regular Expressions

I am currently new to Regular Expressions and would appreciate if someone can guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and trying to extract the string with dashes (B01234-56-K-9870) using python regular expression. I have following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.
Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches 1 or more words followed by a -
\S+ matches non-space characters
Regex demo!

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Using the python re.sub, is there a way I can extract the first alpha numeric characters and disregard the rest form a string that starts with a special character and might have special characters in the middle of the string? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words it replaces the whole string with just what was matched by the part of the regexp inside the brackets. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* will be greedy (match as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
print(matchobj.group(1))
else:
print("did not match")
But the question called for the use of re.sub.
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
... print (res.group())
...
my
Code Demo
This is not a complete answer. [A-Za-z]+ will give give you ['my','name']
Use this to further explore: https://regex101.com/

Regex multiline syntax help (python)

I'm struggling to do multiline regex with multiple matches.
I have data separated by newline/linebreaks like below. My pattern matches each of these lines if i test it separately. How can i match all the occurrences (specifically numbers?
I've read that i could/should use DOTALL somehow (possibly with MULTILINE). This seems to match any character (newlines also) but not sure of any eventual side effects. Don't want to have it match an integer or something and give me malformed data in the end.
Any info on this would be great.
What i really need though, is some assistance in making this example code work. I only need to fetch the numbers from the data.
I used re.fullmatch when i only needed one specific match in a previous case and not entirely sure which function i should use now by the way (finditer, findall, search etc.).
Thank you for any and all help :)
data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""
regPattern = '^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$'
evaluateData = re.search(regPattern, data, re.DOTALL | re.MULTILINE)
if evaluateString2 is not None:
print('do stuff')
else:
print('found no match')
import re
p = re.compile(ur'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$', re.MULTILINE)
test_str = u"http://store.steampowered.com/app/254060/\nhttp://www.store.steampowered.com/app/254061/\nhttps://www.store.steampowered.com/app/254062\nstore.steampowered.com/app/254063\n254064"
re.findall(p, test_str)
https://regex101.com/r/rC9rI0/1
this gives [u'254060', u'254061', u'254062', u'254063', u'254064'].
Are you trying to return those specific integers?
re.search stop at the first occurrence
You should use this intead
re.findall(regPattern, data, re.MULTILINE)
['254060', '254061', '254062', '254063', '254064']
Note: Search was not working for me (python 2.7.9). It just return the first line of data
/ has no special meaning so you do not have to escape it (and in not-raw strings you would have to escape every \)
try this
regPattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the \s* before is optional, you can remove it, like capture groups.
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)

re.sub repl function returning \1 does not replace the group

I am trying to write a generic replace function for a regex sub operation in Python (trying in both 2 and 3) Where the user can provide a regex pattern and a replacement for the match. This could be just a simple string replacement to replacing using the groups from the match.
In the end, I get from the user a dictionary in this form:
regex_dict = {pattern:replacement}
When I try to replace all the occurrences of a pattern via this command, the replacement works for replacements for a group number, (such as \1) and I call the following operation:
re.sub(pattern, regex_dict[pattern], text)
This works as expected, but I need to do additional stuff when a match is found. Basically, what I try to achieve is as follows:
replace_function(matchobj):
result = regex_dict[matchobj.re]
##
## Do some other things
##
return result
re.sub(pattern, replace_function, text)
I see that this works for normal replacements, but the re.sub does not use the group information to get the match when the function is used.
I also tried to convert the \1 pattern to \g<1>, hoping that the re.sub would understand it, but to no avail.
Am I missing something vital?
Thanks in advance!
Additional notes: I compile the pattern using strings as in bytes, and the replacements are also in bytes. I have non-Latin characters in my pattern, but I read everything in bytes, including the text where the regex substitution will operate on.
EDIT
Just to clarify, I do not know in advance what kind of replacement the user will provide. It could be some combination of normal strings and groups, or just a string replacement.
SOLUTION
replace_function(matchobj):
repl = regex_dict[matchobj.re]
##
## Do some other things
##
return matchobj.expand(repl)
re.sub(pattern, replace_function, text)
I suspect you're after .expand, if you've got a compiled regex object (for instance), you can provide a string to be taken into consideration for the replacements, eg:
import re
text = 'abc'
# This would be your key in the dict
rx = re.compile('a(\w)c')
# This would be the value for the key (the replacement string, eg: `\1\1\1`)
res = rx.match(text).expand(r'\1\1\1')
# bbb

Categories

Resources