How can I ignore a string in python regex group matching? - python

Say I have the following string
>>> mystr = 'A-ABd54-Bf657'
(a random string of dash-delimited character groups) and want to match the opening part, and the rest of the string, in separate groups. I can use
>>> re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+)', mystr)
This produces a groupdict() like this:
{'a': 'A', 'b': 'ABd54-Bf657'}
How can I get the same regex to match group b but separately match a specific suffix (or set of suffixes) if it exists (they exist)? Ideally something like this
>>> myregex = <help me here>
>>> re.match(myregex, 'A-ABd54-Bf657').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-blah').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657-blah', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-test').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': 'test'}
Thanks.

mystr = 'A-ABd54-Bf657'
re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+?)(?:-(?P<test>test))?$', mystr)
The ? after the + in group b makes that quantifier non-greedy, so that it consumes the minimum possible.
The ? after the (?:...) group makes the -test suffix optional.
The $ is necessary: without it, the non-greedy group b would stop after a single character and the optional group would simply match nothing.

Related

Regex Replace Python [duplicate]

This question already has answers here:
Character Translation using Python (like the tr command)
(6 answers)
Closed 2 years ago.
In my project I need to be able to replace the matches of one regex in a string with those of another.
For example if I have 2 regular expressions [a-c] and [x-z], I need to be able to replace the string "abc" with "xyz", or the string "hello adan" with "hello xdxn". How would I do this?
We build a map and then translate letter by letter.
The second argument to the dictionary's get method specifies what to return if the key is not found.
>>> trans = dict(zip(list("xyz"),list("abc")))
>>> trans
{'x': 'a', 'y': 'b', 'z': 'c'}
>>> "".join([trans.get(i,i) for i in "hello xdxn"])
'hello adan'
>>>
Or change the order in trans to go in other direction
>>> trans = dict(zip(list("abc"),list("xyz")))
>>> trans
{'a': 'x', 'b': 'y', 'c': 'z'}
>>> "".join([trans.get(i,i) for i in "hello adan"])
'hello xdxn'
>>>
Try with re.sub
>>> import re
>>> replace = re.sub(r'[a-c]+', 'x', 'Hello adan')
>>> replace
'Hello xdxn'
>>> re.sub(r'[a-c]+', 'x', 'Hello bob')
'Hello xox'
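Note that the re.sub call above replaces every run of a, b or c with a single 'x'. If each character should map to its counterpart (a to x, b to y, c to z), one possible sketch uses str.translate, or re.sub with a replacement function. The variable names table, mapping and result are only illustrative.

```python
import re

# Character-for-character table: a->x, b->y, c->z.
table = str.maketrans("abc", "xyz")
print("hello adan".translate(table))

# Same mapping with re.sub and a replacement function that looks up each match.
mapping = dict(zip("abc", "xyz"))
result = re.sub(r"[a-c]", lambda m: mapping[m.group(0)], "abc bob")
print(result)
```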

python regex: string with maximum one whitespace

Hello, I would like to know how to create a regex pattern for strings whose groups may contain at most one whitespace character in a row. More specifically:
s = "a  b d d  c"
pattern = "(?P<a>.*) +(?P<b>.*) +(?P<c>.*)"
print(re.match(pattern, s).groupdict())
returns:
{'a': 'a b d d', 'b': '', 'c': 'c'}
I would like to have:
{'a': 'a', 'b': 'b d d', 'c': 'c'}
Another option could be to use zip and a dict and generate the characters based on the length of the matches.
You can get the matches which contain at max one whitespace using a repeating pattern matching a non whitespace char \S and repeat 0+ times a space followed by a non whitespace char:
\S(?: \S)*
For example:
import re
a=97
regex = r"\S(?: \S)*"
test_str = "a  b d d  c"
matches = re.findall(regex, test_str)
chars = list(map(chr, range(a, a+len(matches))))
print(dict(zip(chars, matches)))
Result
{'a': 'a', 'b': 'b d d', 'c': 'c'}
With the help of The fourth bird's answer I managed to do it the way I imagined it:
import re
s = "a  b d d  c"
pattern = r"(?P<a>\S(?: \S)*) +(?P<b>\S(?: \S)*) +(?P<c>\S(?: \S)*)"
print(re.match(pattern, s).groupdict())
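The repeated sub-pattern can also be written once and reused. A minimal sketch, assuming the fields are separated by two or more spaces; the helper name field is made up for illustration.

```python
import re

def field(name):
    # One non-space char, then any number of "single space + non-space char".
    return r"(?P<%s>\S(?: \S)*)" % name

# Build the sample with explicit double spaces between the three fields.
s = "  ".join(["a", "b d d", "c"])

# Join the three named groups with " +" so runs of spaces act as separators.
pattern = " +".join(field(n) for n in "abc")
print(re.match(pattern, s).groupdict())
```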
Looks like you just want to split your string on 2 or more spaces. You can do it this way:
s = "a  b d d  c"
re.split(r' {2,}', s)
will return you:
['a', 'b d d', 'c']
It's probably easier to use re.split, since the delimiter is known (2 or more spaces), but the patterns in-between are not. I'm sure someone better at regex than myself can work out the look-aheads, but by splitting on \s{2,}, you can greatly simplify the problem.
You can make your dictionary of named groups like so:
import re
s = "a  b d d  c"
x = dict(zip('abc', re.split(r'\s{2,}', s)))
x
{'a': 'a', 'b': 'b d d', 'c': 'c'}
Where the first arg in zip is the named groups. To extend this to more general names:
groups = ['group_1', 'another group', 'third_group']
x = dict(zip(groups, re.split(r'\s{2,}', s)))
{'group_1': 'a', 'another group': 'b d d', 'third_group': 'c'}
I found another solution that I like even better:
import re
s = "a b dll d c"
pattern = r"(?P<a>(\S*[\t]?)*) +(?P<b>(\S*[\t ]?)*) +(?P<c>(\S*[\t ]?)*)"
print(re.match(pattern, s).groupdict())
Here it's even possible to have more than one letter per word.

Convert string tuples to dict

I have a malformed string:
a = '(a,1.0),(b,6.0),(c,10.0)'
I need a dict:
d = {'a':1.0, 'b':6.0, 'c':10.0}
I tried:
print (ast.literal_eval(a))
#ValueError: malformed node or string: <_ast.Name object at 0x000000000F67E828>
Then I tried replacing characters to build a 'string dict'; it is ugly and does not work:
b = (a.replace(',(','|{').replace(',',' : ')
     .replace('|',', ').replace('(','{').replace(')','}'))
print (b)
{a : 1.0}, {b : 6.0}, {c : 10.0}
print (ast.literal_eval(b))
#ValueError: malformed node or string: <_ast.Name object at 0x000000000C2EA588>
What should I do? Am I missing something? Is it possible to use a regex?
Given the string has the above stated format, you could use regex substitution with backrefs:
import ast
import re
a = '(a,1.0),(b,6.0),(c,10.0)'
a_fix = re.sub(r'\((\w+),', r"('\1',",a)
So you look for the pattern (x, (where x is a sequence of \w characters) and replace it with ('x',. The result is then:
# result
a_fix == "('a',1.0),('b',6.0),('c',10.0)"
and then parse a_fix and convert it to a dict:
result = dict(ast.literal_eval(a_fix))
The result is then:
>>> dict(ast.literal_eval(a_fix))
{'b': 6.0, 'c': 10.0, 'a': 1.0}
No need for regexes, if your string is in this format.
>>> a = '(a,1.0),(b,6.0),(c,10.0)'
>>> d = dict([x.split(',') for x in a[1:-1].split('),(')])
>>> print(d)
{'c': '10.0', 'a': '1.0', 'b': '6.0'}
We remove the first opening parenthesis and the last closing parenthesis, then get the key-value pairs by splitting on ),(. The pairs can then be split on the comma.
To cast to float, the list comprehension gets a little longer:
d = dict([(k, float(v)) for (k, v) in [x.split(',') for x in a[1:-1].split('),(')]])
If there are always 2 comma-separated values inside parentheses and the second is of a float type, you may use
import re
s = '(a,1.0),(b,6.0),(c,10.0)'
print({k: float(v) for k, v in re.findall(r'\(([^),]+),([^)]*)\)', s)})
This pattern matches a (, then 1+ chars other than a comma and ) captured into Group 1, then a comma, then any 0+ chars other than ) (captured into Group 2), and a ).
As the pattern above is suitable when you have pre-validated data, the regex can be restricted for your current data as
r'\((\w+),(\d*\.?\d+)\)'
Details:
\( - a literal (
(\w+) - Capturing group 1: one or more word (letter/digit/_) chars
, - a comma
(\d*\.?\d+) - a common integer/float regex: zero or more digits, an optional . (decimal separator) and 1+ digits
\) - a literal closing parenthesis.
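Putting the restricted pattern to work, here is a sketch that builds the dict with re.findall and float. The name PAIR_RE is illustrative only.

```python
import re

# Restricted pattern from above: "(" word "," number ")"
PAIR_RE = re.compile(r'\((\w+),(\d*\.?\d+)\)')

s = '(a,1.0),(b,6.0),(c,10.0)'

# findall returns a list of (key, value) string tuples; convert each value to float.
d = {key: float(value) for key, value in PAIR_RE.findall(s)}
print(d)
```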
The reason eval() does not work is that a, b and c are not defined. We can define them as their own string forms, and eval will then use those strings:
In [11]: text = '(a,1.0),(b,6.0),(c,10.0)'
In [12]: a, b, c = 'a', 'b', 'c'
In [13]: eval(text)
Out[13]: (('a', 1.0), ('b', 6.0), ('c', 10.0))
In [14]: dict(eval(text))
Out[14]: {'a': 1.0, 'b': 6.0, 'c': 10.0}
To do this the regex way:
In [21]: re.sub(r'\((.+?),', r'("\1",', text)
Out[21]: '("a",1.0),("b",6.0),("c",10.0)'
In [22]: eval(_)
Out[22]: (('a', 1.0), ('b', 6.0), ('c', 10.0))
In [23]: dict(_)
Out[23]: {'a': 1.0, 'b': 6.0, 'c': 10.0}

Regex How to match Empty

I have a log structure that looks like
a b c|
so for example:
Mozilla 5.0 white|
should be matched/extracted to something like
a: Mozilla, b: 5.0, c: white
but one entry in my log is:
iOS|
which can be explained as
a:iOS, b:null, c:null
I am using Python 3's re module, matching with named groups (?P<name>...).
Is there any way to achieve this?
>>> import re
>>> m = re.match(r"(?P<a>[^\s]+)(?:\s+(?P<b>[^\s]+))?(?:\s+(?P<c>[^\s]+))?\s*\|", "iOS|")
>>> m.groups()
('iOS', None, None)
>>> m.groupdict()
{'a': 'iOS', 'b': None, 'c': None}
>>> m = re.match(r"(?P<a>[^\s]+)(?:\s+(?P<b>[^\s]+))?(?:\s+(?P<c>[^\s]+))?\s*\|", "Mozilla 5.0 white|")
>>> m.groups()
('Mozilla', '5.0', 'white')
>>> m.groupdict()
{'a': 'Mozilla', 'b': '5.0', 'c': 'white'}
UPDATE:
I noticed that the previous version included spaces in the returned groups - I had factored the \s+ into the (?P<>...) to save a couple bytes, but it had that side effect. So I fixed that and also made it tolerant of spaces before the final '|'
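As a sketch of how this could be applied to a whole log, assuming one entry per line and each entry ending in |. LOG_RE and parse_entry are hypothetical names, and the wrappers around the optional fields are written here as non-capturing groups so that only a, b and c are captured.

```python
import re

LOG_RE = re.compile(
    r"(?P<a>\S+)"            # first field is required
    r"(?:\s+(?P<b>\S+))?"    # second field is optional
    r"(?:\s+(?P<c>\S+))?"    # third field is optional
    r"\s*\|"                 # optional spaces, then the closing pipe
)

def parse_entry(line):
    """Return a dict with keys a, b, c (missing fields are None), or None if no match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

for line in ("Mozilla 5.0 white|", "iOS|"):
    print(parse_entry(line))
```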
You can put your group names in a list like the following:
>>> pattern = ['a', 'b', 'c']
Then use re.findall() to find the corresponding parts, and use zip and dict to build the dictionary:
>>> s = "IOS|"
>>> dict(zip(pattern, re.findall(r'([^\s]+)?\s?([^\s]+)?\s?([^\s]+)?\|', s)[0]))
{'a': 'IOS', 'c': '', 'b': ''}
>>>
>>> s = "Mozilla 5.0 white|"
>>>
>>> dict(zip(pattern, re.findall(r'([^\s]+)?\s?([^\s]+)?\s?([^\s]+)?\|', s)[0]))
{'a': 'Mozilla', 'c': 'white', 'b': '5.0'}

Python Parse JSON value with space delimited subfields

I need to parse JSON like this:
{
"entity": " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null",
"otherField": "text"
}
I want to get values for a, b, c, d, e, timestamp separately. Is there a better way than assigning the entity value to a string, then parsing with REGEX?
There is nothing in the JSON standard that parses that value for you; you'll have to do this in Python.
It could be easier to just split that string on whitespace, then on =:
entities = dict(keyvalue.split('=', 1) for keyvalue in data['entity'].split())
This results in:
>>> data = {'entity': " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null"}
>>> dict(keyvalue.split('=', 1) for keyvalue in data['entity'].split())
{'a': '123455', 'c': 'S', 'b': '234234', 'e': '1', 'd': 'CO', 'f': 'user1', 'timestamp': 'null'}
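An end-to-end sketch, starting from the JSON text and using the standard json module. Turning the string 'null' into None is an assumption about what the data means, not something the question states; raw and entities are illustrative names.

```python
import json

raw = '''
{
  "entity": " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null",
  "otherField": "text"
}
'''

data = json.loads(raw)

# Split on whitespace, then on the first "=" of each key=value token.
entities = dict(token.split("=", 1) for token in data["entity"].split())

# Assumption: the literal text "null" stands for a missing value.
entities = {k: (None if v == "null" else v) for k, v in entities.items()}
print(entities["a"], entities["timestamp"])
```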
What about this:
>>> dic = dict(item.split("=") for item in s['entity'].strip().split(" "))
>>> dic
{'a': '123455', 'c': 'S', 'b': '234234', 'e': '1', 'd': 'CO', 'f': 'user1', 'timestamp':'null'}
>>> dic['a']
'123455'
>>> dic['b']
'234234'
>>> dic['c']
'S'
>>> dic['d']
'CO'
>>>
