I am new to the regex module and am working through a simple case: extracting the keys and values from a simple dictionary.
The dictionary cannot contain nested dicts or lists, but it may contain simple tuples.
MWE:
import re
# Note: the dictionaries are simple and do NOT contain lists or nested dicts; these two examples suffice for the regex matching.
d = "{'a':10,'b':True,'c':(5,'a')}" # ['a', 10, 'b', True, 'c', (5,'a') ]
d = "{'c':(5,'a'), 'd': 'TX'}" # ['c', (5,'a'), 'd', 'TX']
regexp = r"(.*):(.*)" # I am not sure how to repeat this pattern separated by ,
out = re.match(regexp,d).groups()
out
You should not use regex for this job. When the input string is valid Python syntax, you can use ast.literal_eval.
Like this:
import ast
# ...
out = ast.literal_eval(d)
Now you have a dictionary object in Python. You can for instance get the key/value pairs in a (dict_items) list:
print(out.items())
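If the flat [key, value, key, value, ...] list from the comments in the question is what you are after, a minimal sketch could look like this. The helper name flatten_dict_literal is made up for illustration, and it assumes the string really is a valid Python dict literal:

```python
import ast
from itertools import chain

def flatten_dict_literal(text):
    """Parse a dict literal and return [key1, value1, key2, value2, ...]."""
    parsed = ast.literal_eval(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    # items() yields (key, value) pairs; chain flattens them into one list.
    return list(chain.from_iterable(parsed.items()))

print(flatten_dict_literal("{'a':10,'b':True,'c':(5,'a')}"))
# ['a', 10, 'b', True, 'c', (5, 'a')]
```

The tuple value survives intact because literal_eval does the parsing, so the comma inside (5,'a') is never seen as a separator.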
Regex
Regex is not the right tool. There will always be some boundary case that gets parsed wrongly. But to get the repeated matches, you are better off using findall. Here is a simple example regex:
regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"
out = re.findall(regexp, d)
This will give a list of pairs.
Regex would be hard (perhaps impossible, but I'm not versed enough to say confidently) to use because of the ',' nested in your tuples. Just for the sake of it, I wrote (regex-less) code to parse your string for separators, ignoring parts inside parentheses:
d = "{'c':(5,'a',1), 'd': 'TX', 1:(1,2,3)}"
d=d.replace("{","").replace("}","")
indices = []
inside = False
for i,l in enumerate(d):
    if inside:
        if l == ")":
            inside = False
            continue
        continue
    if l == "(":
        inside = True
        continue
    if l in {":",","}:
        indices.append(i)
indices.append(len(d))
parts = []
start = 0
for i in indices:
    parts.append(d[start:i].strip())
    start = i+1
parts
# ["'c'", "(5,'a',1)", "'d'", "'TX'", '1', '(1,2,3)']
There's a logfile with text in the form of space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:
' '.join([f'{k}={v!r}' for k,v in d.items()])
The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more no less.
How to process this logfile and turn the lines back into Python dicts? Example:
>>> to_dict("key='hello world'")
{'key': 'hello world'}
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Here is some extra context about the data:
Keys are valid names
Input lines are well-formed (e.g. no dangling brackets)
The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
Order is not important. Performance is not important. Correctness is important.
Edit: As requested in the comments, here is an MCVE and an example code that didn't work correctly
>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
...
...
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'} # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234} # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'} # Incorrect, the value was corrupted
Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.
The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.
import ast
import io
import tokenize
def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of a binary
    # file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))
    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']
    names = [tokens[i-1][1] for i in eqsign_locations]
    # Values are harder than keys.
    val_starts = [i+1 for i in eqsign_locations]
    val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]
    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]
    return dict(zip(names, vals))
This behaves correctly on your example inputs, as well as on an example with backslashes:
>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.
Regex replacement functions to the rescue
I'm not rewriting an ast-like parser for you, but one trick that works pretty well is to use regular expressions to replace the quoted strings with "variables" (I've chosen __token(number)__), a bit like you're obfuscating some code.
Make a note of the strings you're replacing (that takes care of the spaces inside them), replace spaces with commas (guarding against a preceding symbol such as : is what lets the last test pass), and then put the strings back.
import re, itertools
def to_dict(s):
    rep_dict = {}
    cnt = itertools.count()
    def rep_func(m):
        rval = "__token{}__".format(next(cnt))
        rep_dict[rval] = m.group(0)
        return rval
    # replaces single/double quoted strings by token variable-like idents
    # going out on a limb to support escaped quotes in the string and double escapes at the end of the string
    s = re.sub(r"(['\"]).*?([^\\]|\\\\)\1", rep_func, s)
    # replaces spaces that follow a letter/digit/underscore by comma
    s = re.sub(r"(\w)\s+", r"\1,", s)
    #print("debug",s) # uncomment to see temp string
    # put back the original strings
    s = re.sub(r"__token\d+__", lambda m: rep_dict[m.group(0)], s)
    return eval("dict({s})".format(s=s))
print(to_dict("k1='v1' k2='v2'"))
print(to_dict("s='1234' n=1234"))
print(to_dict(r"key='hello world'"))
print(to_dict('key="hello world"'))
print(to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""))
# extreme string test
print(to_dict(r"key='hello \'world\\'"))
prints:
{'k2': 'v2', 'k1': 'v1'}
{'n': 1234, 's': '1234'}
{'key': 'hello world'}
{'key': 'hello world'}
{'k5': {'k6': ['potato']}, 'k4': 'k5="hello"'}
{'key': "hello 'world\\"}
The key is to extract the strings (single or double quoted) using a non-greedy regex and replace them with non-strings (as if they were string variables rather than literals) in the expression. The regex has been tuned so it can accept escaped quotes and a double escape at the end of the string (custom solution).
The replacement function is an inner function so it can make use of the nonlocal dictionary and counter and track the replaced text, so it can be restored once the spaces have been taken care of.
When replacing the spaces with commas, you have to be careful not to do it after a colon (last test); all things considered, it is only done after an alphanumeric character or underscore (hence the \w guard in the comma replacement regex).
If we uncomment the debug print code just before the original strings are put back that prints:
debug k1=__token0__,k2=__token1__
debug s=__token0__,n=1234
debug key=__token0__
debug k4=__token0__,k5={__token1__: [__token2__]}
debug key=__token0__
The strings have been pwned, and the replacement of spaces has worked properly. With some more effort, it should probably be possible to quote the keys and replace k1= by "k1": so ast.literal_eval can be used instead of eval (more risky, and not required here)
I'm sure some super-complex expressions can break my code (I've even heard that there are very few json parsers able to parse 100% of the valid json files), but for the tests you submitted, it'll work (of course if some funny guy tries to put __tokenxx__ idents in the original strings, that'll fail, maybe it could be replaced by some otherwise invalid-as-variable placeholders). I have built an Ada lexer using this technique some time ago to be able to avoid spaces in strings and that worked pretty well.
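The ast.literal_eval variant mentioned above could look roughly like the sketch below. It is not something I have battle-tested: the name to_dict_literal is made up for illustration, the extra step (rewriting name= as "name":) assumes keys are plain identifiers, and it inherits the placeholder-collision problem described above.

```python
import ast
import itertools
import re

def to_dict_literal(s):
    saved = {}
    counter = itertools.count()

    def stash(m):
        # Swap a quoted string for a placeholder and remember the original.
        key = "__token{}__".format(next(counter))
        saved[key] = m.group(0)
        return key

    # Same steps as before: hide the strings, then turn separators into commas.
    s = re.sub(r"(['\"]).*?([^\\]|\\\\)\1", stash, s)
    s = re.sub(r"(\w)\s+", r"\1,", s)
    # Extra step (assumption: keys are identifiers): name=  ->  "name":
    s = re.sub(r"(\w+)=", r'"\1":', s)
    # Put the original strings back and parse the result as a dict literal.
    s = re.sub(r"__token\d+__", lambda m: saved[m.group(0)], s)
    return ast.literal_eval("{" + s + "}")

print(to_dict_literal("""k4='k5="hello"' k5={'k6': ['potato']}"""))
# {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
```

The key rewrite is safe at that point only because every string has already been swapped for a placeholder, so any = left in the text is a key/value separator.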
You can find all the occurrences of = characters, and then find the maximum runs of characters which give a valid ast.literal_eval result. Those characters can then be parsed for the value, associated with a key found by a string slice between the last successful parse and the index of the current =:
import ast, typing
def is_valid(_str:str) -> bool:
    try:
        _ = ast.literal_eval(_str)
    except:
        return False
    else:
        return True
def parse_line(_d:str) -> typing.Generator[typing.Tuple, None, None]:
    _eq, last = [i for i, a in enumerate(_d) if a == '='], 0
    for _loc in _eq:
        if _loc >= last:
            _key = _d[last:_loc]
            _inner, seen, _running, _worked = _loc+1, '', _loc+2, []
            while True:
                try:
                    val = ast.literal_eval(_d[_inner:_running])
                except:
                    _running += 1
                else:
                    _max = max([i for i in range(len(_d[_inner:])) if is_valid(_d[_inner:_running+i])])
                    yield (_key, ast.literal_eval(_d[_inner:_running+_max]))
                    last = _running+_max
                    break
def to_dict(_d:str) -> dict:
    return dict(parse_line(_d))
print([to_dict("key='hello world'"),
to_dict("k1='v1' k2='v2'"),
to_dict("s='1234' n=1234"),
to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""),
to_dict("val=['100', 100, 300]"),
to_dict("val=[{'t':{32:45}, 'stuff':100, 'extra':[]}, 100, 300]")
]
)
Output:
{'key': 'hello world'}
{'k1': 'v1', 'k2': 'v2'}
{'s': '1234', 'n': 1234}
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
{'val': ['100', 100, 300]}
{'val': [{'t': {32: 45}, 'stuff': 100, 'extra': []}, 100, 300]}
Disclaimer:
This solution is not as elegant as @Jean-FrançoisFabre's, and I am not sure whether it can parse 100% of what might be passed to to_dict, but it may give you inspiration for your own version.
Provide two helper functions.
popstr: split thing from start of string that looks like string
If it starts with a single or double quote mark, I'll look for the next one and split at that point.
def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]
poptrt: split off the thing at the start of the string that is surrounded by brackets ('[]', '()', '{}').
If it starts with a bracket, I increment a counter for every instance of the starting character and decrement it for every instance of its complement. When the counter reaches zero, I split.
def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')
Chew through string until there is no more string to chew
def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d
All tests passed
assert to_dict("key='hello world'") == {'key': 'hello world'}
assert to_dict("k1='v1' k2='v2'") == {'k1': 'v1', 'k2': 'v2'}
assert to_dict("s='1234' n=1234") == {'s': '1234', 'n': 1234}
assert to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""") == {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Deficiencies
Did not account for backslashes
Did not account for nested goofy formatting
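For the backslash deficiency, one possible direction is a string popper that skips escaped characters. This is only a sketch: popstr_escaped is a hypothetical replacement for popstr, it assumes the input starts with a quote character, and it has not been wired into poptrt or to_dict.

```python
def popstr_escaped(s):
    """Split a leading quoted string off s, honouring backslash escapes."""
    quote = s[0]
    i = 1
    while i < len(s):
        if s[i] == '\\':
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if s[i] == quote:
            # Found the unescaped closing quote: split right after it.
            return s[:i + 1], s[i + 1:]
        i += 1
    raise ValueError('Your string has an unterminated quote.')

print(popstr_escaped(r"'it\'s' rest"))
# ("'it\\'s'", ' rest')
```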
All Together
import ast
def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]
def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')
def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d
I had a similar problem: converting a 'key1="value1" key2="value2" ...' string into a dict. I split the string on spaces to get a list of 'key="value"' pairs, then loop through that list, split each pair on '=', and add it to the dict.
Code:
str_attr = 'name="Attr1" type="Attr2" use="Attr3"'
list_attr = str_attr.split(' ')
dict_attr = {}
for item in list_attr:
    list_item = item.split('=')
    dict_attr.update({list_item[0] : list_item[1]})
print(dict_attr)
result:
{'name': '"Attr1"', 'type': '"Attr2"', 'use': '"Attr3"'}
Limitations:
Keys and values must not contain a space (' ') or an equals sign ('=').
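If the values may contain spaces or '=' inside the quotes, one way around these limitations is the standard library's shlex.split, which keeps quoted text together. This is a different technique from the plain split above, and note that it strips the quotes from the values. The function name attrs_to_dict and the sample input are made up for illustration:

```python
import shlex

def attrs_to_dict(text):
    result = {}
    # shlex.split honours quotes, so "Attr 1" stays in one token.
    for token in shlex.split(text):
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = token.partition('=')
        if not sep:
            raise ValueError(f"missing '=' in {token!r}")
        result[key] = value
    return result

print(attrs_to_dict('name="Attr 1" expr="a=b" use=Attr3'))
# {'name': 'Attr 1', 'expr': 'a=b', 'use': 'Attr3'}
```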
If you have different delimiters, such as spaces, commas, commas followed by spaces, semicolons, et cetera, use a regex to split the string, separating the alternatives with '|':
'\s+|,\s*|;\s*'
"\s+" - one or more whitespace characters
",\s*" - a comma, optionally followed by whitespace
";\s*" - a semicolon, optionally followed by whitespace
"+" means "one or more"
"*" means "zero or more"
import re
str_attr = 'name="Attr1" type="Attr2", use="Attr3",new="yes";old="no"'
list_attr = re.split(r'\s+|,\s*|;\s*', str_attr)
dict_attr = {}
for item in list_attr:
    if item:
        list_item = item.split('=')
        dict_attr.update({list_item[0] : list_item[1]})
print(dict_attr)
Result:
{'name': '"Attr1"', 'type': '"Attr2"', 'use': '"Attr3"', 'new': '"yes"', 'old': '"no"'}
I want to turn this string into a dictionary.
s = 'SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y)'
The result should look like this:
{'SEQ':'A=1%B=2', 'OPS':'CC=0%G=2', 'T1':'R=N', 'T2':'R=Y'}
I tried this code:
d = dict(item.split('(') for item in s.split(')'))
But an error occurred
ValueError: dictionary update sequence element #4 has length 1; 2 is required
I know why this error occurs: the trailing ')' makes split produce an empty last element. Deleting the closing bracket at the end avoids it:
s = 'SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y'
But that is not a good option for me. Is there a better way to turn this string into a dictionary?
More compactly:
import re
s = 'SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y)'
print(dict(re.findall(r'(.+?)\((.*?)\)', s)))
Add an if condition to your generator expression.
>>> s = 'SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y)'
>>> s.split(')')
['SEQ(A=1%B=2', 'OPS(CC=0%G=2', 'T1(R=N', 'T2(R=Y', '']
>>> d = dict(item.split('(') for item in s.split(')') if item!='')
>>> d
{'T1': 'R=N', 'OPS': 'CC=0%G=2', 'T2': 'R=Y', 'SEQ': 'A=1%B=2'}
Alternatively, this could be solved with a regular expression:
>>> import re
>>> s = 'SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y)'
>>> print(dict(match.groups() for match in re.finditer(r'([^(]+)\(([^)]+)\)', s)))
{'SEQ': 'A=1%B=2', 'OPS': 'CC=0%G=2', 'T1': 'R=N', 'T2': 'R=Y'}
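One thing the regex approach hides is malformed input: anything that does not match is silently skipped. If that matters to you, a sketch along these lines checks that the matches cover the whole string. The name parse_groups is made up, and the stricter pattern (no nested parentheses) is my own assumption about the format:

```python
import re

# NAME(contents), where neither part contains parentheses (assumed format).
PAIR = re.compile(r'([^()]+)\(([^()]*)\)')

def parse_groups(s):
    result = {}
    pos = 0
    for m in PAIR.finditer(s):
        if m.start() != pos:
            # Something between two groups did not match the pattern.
            raise ValueError(f"unexpected text at position {pos}")
        result[m.group(1)] = m.group(2)
        pos = m.end()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return result

print(parse_groups('SEQ(A=1%B=2)OPS(CC=0%G=2)T1(R=N)T2(R=Y)'))
# {'SEQ': 'A=1%B=2', 'OPS': 'CC=0%G=2', 'T1': 'R=N', 'T2': 'R=Y'}
```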
I have a string input in the following format: (x,y) where x and y are doubles.
For example : (1,2.556) can be a vector.
I want the easiest way to split it into the x,y values, 1 and 2.556 in this case.
What would you suggest?
You could use code like this:
import ast
text = '(1,2.556)'
vector = ast.literal_eval(text)
print(vector)
The literal_eval function does not have the security risks associated with eval, and it works just as well in this particular case.
The eval answers are good. But if you are sure of the format of your strings -- always start and end with parentheses, no spaces in the string, etc., then you can do this fairly efficiently:
x, y = (float(num) for num in s[1:-1].split(','))
eval works:
>>> s = "(1.2,3.40)"
>>> eval(s)
(1.2, 3.4)
>>> x,y = eval(s)
>>> x
1.2
>>> y
3.4
eval has potential security risks, but if you trust that you are dealing with strings of that form then this is adequate.
Remove the first and last (, ) and then do splitting according to the comma.
re.sub(r'^\(|\)$', '',string).split(',')
OR
>>> s = "(1,2.556)"
>>> x = [i for i in re.split(r'[,()]', s) if i]
>>> x[0]
'1'
>>> x[1]
'2.556'
If you're sure they'll be passed in exactly this way, try this:
>>> s = '(1,2.556)'
>>> [float(i) for i in s[1:-1].split(',')]
[1.0, 2.556]
I have this line
Server:x.x.x.x # U:100 # P:100 # Pre:0810 # Tel:xxxxxxxxxx
and I want to extract the value 0810 that comes after Pre:
How can I do that?
You could use the re module ('re' stands for regular expressions)
This solution assumes that your Pre: field will always have four numbers.
If the length of the number varies, you could replace the {4} (expect exactly 4) with + (expect 'one or more').
>>> import re
>>> x = "Server:x.x.x.x # U:100 # P:100 # Pre:0810 # Tel:xxxxxxxxxx"
>>> num = re.findall(r'Pre:(\d{4})', x)[0] # re.findall returns a list
>>> print(num)
0810
You can read about it here:
https://docs.python.org/2/howto/regex.html
As usual in these cases, the best answer depends upon how your strings will vary, and we only have one example to generalize from.
Anyway, you could use string methods like str.split to get at it directly:
>>> s = "Server:x.x.x.x # U:100 # P:100 # Pre:0810 # Tel:xxxxxxxxxx"
>>> s.split()[6].split(":")[1]
'0810'
But I tend to prefer more general solutions. For example, depending on how the format changes, something like
>>> d = dict(x.split(":") for x in s.split(" # "))
>>> d
{'Pre': '0810', 'P': '100', 'U': '100', 'Tel': 'xxxxxxxxxx', 'Server': 'x.x.x.x'}
which makes a dictionary of all the values, after which you could access any element:
>>> d["Pre"]
'0810'
>>> d["Server"]
'x.x.x.x'
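One caveat with the dict approach: x.split(":") breaks if a value itself contains a colon (say, a host:port pair). A slightly more defensive sketch splits each field on the first colon only. The function name parse_fields and the sample address are made up for illustration:

```python
def parse_fields(line, field_sep=" # ", key_sep=":"):
    fields = {}
    for chunk in line.split(field_sep):
        # partition splits on the first separator only, so values may contain ':'.
        key, sep, value = chunk.partition(key_sep)
        if not sep:
            raise ValueError(f"missing {key_sep!r} in {chunk!r}")
        fields[key.strip()] = value.strip()
    return fields

line = "Server:10.0.0.1:8080 # U:100 # P:100 # Pre:0810 # Tel:xxxxxxxxxx"
fields = parse_fields(line)
print(fields["Pre"])     # 0810
print(fields["Server"])  # 10.0.0.1:8080
```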