Convert string tuples to dict - python

I have a malformed string:
a = '(a,1.0),(b,6.0),(c,10.0)'
I need a dict:
d = {'a':1.0, 'b':6.0, 'c':10.0}
I tried:
print (ast.literal_eval(a))
#ValueError: malformed node or string: <_ast.Name object at 0x000000000F67E828>
Then I tried replacing characters to build a 'string dict'; it is ugly and does not work:
b = (a.replace(',(','|{').replace(',',' : ')
      .replace('|',', ').replace('(','{').replace(')','}'))
print (b)
{a : 1.0}, {b : 6.0}, {c : 10.0}
print (ast.literal_eval(b))
#ValueError: malformed node or string: <_ast.Name object at 0x000000000C2EA588>
What should I do? Am I missing something? Is it possible to use regex?

Given the string has the above stated format, you could use regex substitution with backrefs:
import ast
import re
a = '(a,1.0),(b,6.0),(c,10.0)'
a_fix = re.sub(r'\((\w+),', r"('\1',", a)
So you look for the pattern (x, (with x a sequence of \w characters) and replace it with ('x',. The result is then:
# result
a_fix == "('a',1.0),('b',6.0),('c',10.0)"
and then parse a_fix and convert it to a dict:
result = dict(ast.literal_eval(a_fix))
The result is then:
>>> dict(ast.literal_eval(a_fix))
{'b': 6.0, 'c': 10.0, 'a': 1.0}

No need for regexes, if your string is in this format.
>>> a = '(a,1.0),(b,6.0),(c,10.0)'
>>> d = dict([x.split(',') for x in a[1:-1].split('),(')])
>>> print(d)
{'c': '10.0', 'a': '1.0', 'b': '6.0'}
We remove the first opening parenthesis and the last closing parenthesis, then get the key-value pairs by splitting on ),(. The pairs can then be split on the comma.
To cast to float, the list comprehension gets a little longer:
d = dict([(k, float(v)) for (k, v) in [x.split(',') for x in a[1:-1].split('),(')]])

If there are always 2 comma-separated values inside parentheses and the second is of a float type, you may use
import re
s = '(a,1.0),(b,6.0),(c,10.0)'
print(dict((k, float(v)) for k, v in re.findall(r'\(([^),]+),([^)]*)\)', s)))
See the Python demo and the (quite generic) regex demo. This pattern just matches a (, then 1+ chars other than a comma and ) capturing into Group 1, then a comma, then any 0+ chars other than ) (captured into Group 2), and a ).
As the pattern above is suitable when you have pre-validated data, the regex can be restricted for your current data as
r'\((\w+),(\d*\.?\d+)\)'
Details:
\( - a literal (
(\w+) - Capturing group 1: one or more word (letter/digit/_) chars
, - a comma
(\d*\.?\d+) - a common integer/float regex: zero or more digits, an optional . (decimal separator) and 1+ digits
\) - a literal closing parenthesis.
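A minimal sketch of how the restricted pattern could be applied, assuming every pair has the form (key,number) with no spaces. The helper name parse_pairs is made up for illustration and is not part of the answer above:

```python
import re

# Restricted pattern from above: a word key, a comma, then an integer or float
PAIR_RE = re.compile(r'\((\w+),(\d*\.?\d+)\)')

def parse_pairs(text):
    """Return a dict mapping each key to its value converted to float."""
    return {key: float(value) for key, value in PAIR_RE.findall(text)}

print(parse_pairs('(a,1.0),(b,6.0),(c,10.0)'))
# {'a': 1.0, 'b': 6.0, 'c': 10.0}
```

Pairs that do not fit the pattern are silently skipped by findall, so this only suits input you expect to be well formed.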

The reason eval() does not work is that a, b, and c are not defined. We can define each name as its own string form, and eval will then use those strings:
In [11]: text = '(a,1.0),(b,6.0),(c,10.0)'
In [12]: a, b, c = 'a', 'b', 'c'
In [13]: eval(text)
Out[13]: (('a', 1.0), ('b', 6.0), ('c', 10.0))
In [14]: dict(eval(text))
Out[14]: {'a': 1.0, 'b': 6.0, 'c': 10.0}
To do this the regex way:
In [21]: re.sub(r'\((.+?),', r'("\1",', text)
Out[21]: '("a",1.0),("b",6.0),("c",10.0)'
In [22]: eval(_)
Out[22]: (('a', 1.0), ('b', 6.0), ('c', 10.0))
In [23]: dict(_)
Out[23]: {'a': 1.0, 'b': 6.0, 'c': 10.0}

Related

python regex: string with maximum one whitespace

Hello, I would like to know how to create a regex pattern for a string whose groups may contain at most one whitespace character in a row. More specifically:
s = "a  b d d  c"
pattern = "(?P<a>.*) +(?P<b>.*) +(?P<c>.*)"
print(re.match(pattern, s).groupdict())
returns:
{'a': 'a  b d d', 'b': '', 'c': 'c'}
I would like to have:
{'a': 'a', 'b': 'b d d', 'c': 'c'}
Another option could be to use zip and a dict and generate the characters based on the length of the matches.
You can get the matches that contain at most one space in a row by matching a non-whitespace char \S, followed by 0+ repetitions of a space and another non-whitespace char:
\S(?: \S)*
For example:
import re
a=97
regex = r"\S(?: \S)*"
test_str = "a  b d d  c"
matches = re.findall(regex, test_str)
chars = list(map(chr, range(a, a+len(matches))))
print(dict(zip(chars, matches)))
Result
{'a': 'a', 'b': 'b d d', 'c': 'c'}
With the help of The fourth bird's answer, I managed to do it the way I imagined:
import re
s = "a  b d d  c"
pattern = r"(?P<a>\S(?: \S)*) +(?P<b>\S(?: \S)*) +(?P<c>\S(?: \S)*)"
print(re.match(pattern, s).groupdict())
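A short check of what the named-group pattern returns. This is only a sketch: it assumes the fields are separated by runs of two or more spaces, and the variable name field is introduced here just to avoid repeating the sub-pattern:

```python
import re

# Assumption: fields are separated by two or more spaces,
# while a single space may occur inside a field.
s = "a  b d d  c"
field = r"\S(?: \S)*"  # single non-space chars joined by single spaces
pattern = rf"(?P<a>{field}) +(?P<b>{field}) +(?P<c>{field})"

groups = re.match(pattern, s).groupdict()
print(groups)
# {'a': 'a', 'b': 'b d d', 'c': 'c'}
```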
Looks like you just want to split your string on 2 or more spaces. You can do it this way:
import re
s = "a  b d d  c"
re.split(r' {2,}', s)
will return:
['a', 'b d d', 'c']
It's probably easier to use re.split, since the delimiter is known (2 or more spaces), but the patterns in-between are not. I'm sure someone better at regex than myself can work out the look-aheads, but by splitting on \s{2,}, you can greatly simplify the problem.
You can make your dictionary of named groups like so:
import re
s = "a  b d d  c"
x = dict(zip('abc', re.split(r'\s{2,}', s)))
x
{'a': 'a', 'b': 'b d d', 'c': 'c'}
Where the first arg in zip is the named groups. To extend this to more general names:
groups = ['group_1', 'another group', 'third_group']
x = dict(zip(groups, re.split(r'\s{2,}', s)))
{'group_1': 'a', 'another group': 'b d d', 'third_group': 'c'}
I found another solution that I like even better:
import re
s = "a b dll d c"
pattern = r"(?P<a>(\S*[\t]?)*) +(?P<b>(\S*[\t ]?)*) +(?P<c>(\S*[\t ]?)*)"
print(re.match(pattern, s).groupdict())
Here it is even possible to have more than one letter per word.

Split string into character and numbers and store in a map Python

I've a string like
'A15B7C2'
It represents the count of each character.
I am using re right now to split it into characters and numbers. After that I will eventually store it in a dict.
import re
data_str = 'A15B7C2'
re.split(r"(\d+)", data_str)
# prints --> ['A', '15', 'B', '7', 'C', '2', '']
But if I have a string like
'A15B7CD2Ef5'
it means that the count of C is 1 (it's implicit) and the count of Ef is 5. (An uppercase letter and a subsequent lowercase letter count as one key.) Consequently I get
'CD' = 2 (Not correct)
'Ef' = 5 (Correct)
How do I modify it to give me the proper counts?
What's the best approach to parse the string, get the counts, and store them in a dict?
You can do this all in one fell swoop:
In [2]: s = 'A15B7CD2Ef5'
In [3]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[3]: {'A': 15, 'B': 7, 'C': 1, 'D': 2, 'Ef': 5}
The regex is essentially a direct translation of your requirements, leveraging .findall and capture groups:
r"([A-Z][a-z]?)(\d+)?"
Essentially, an uppercase letter that may be followed by a lowercase letter as the first group, and one or more digits that may or may not be there as the second group (this will return '' if they aren't there).
A trickier example:
In [7]: s = 'A15B7CD2EfFGHK5'
In [8]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[8]: {'A': 15, 'B': 7, 'C': 1, 'D': 2, 'Ef': 1, 'F': 1, 'G': 1, 'H': 1, 'K': 5}
Finally, breaking it down with an even trickier example:
In [10]: s = 'A15B7CD2EfFGgHHhK5'
In [11]: re.findall(r"([A-Z](?:[a-z])?)(\d+)?", s)
Out[11]:
[('A', '15'),
('B', '7'),
('C', ''),
('D', '2'),
('Ef', ''),
('F', ''),
('Gg', ''),
('H', ''),
('Hh', ''),
('K', '5')]
In [12]: {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", s)}
Out[12]:
{'A': 15,
'B': 7,
'C': 1,
'D': 2,
'Ef': 1,
'F': 1,
'Gg': 1,
'H': 1,
'Hh': 1,
'K': 5}
You could use some regex logic and .span():
([A-Z])[a-z]*(\d+)
See a demo on regex101.com.
In Python this would be:
import re
string = "A15B7CD2Ef5"
rx = re.compile(r'([A-Z])[a-z]*(\d+)')
def analyze(string=None):
    result = []; lastpos = 0;
    for m in rx.finditer(string):
        span = m.span()
        if lastpos != span[0]:
            result.append((string[lastpos], 1))
        else:
            result.append((m.group(1), m.group(2)))
        lastpos = span[1]
    return result
print(analyze(string))
# [('A', '15'), ('B', '7'), ('C', 1), ('E', '5')]
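Note that the output above drops D and shortens Ef to E. The following is a hedged sketch of a variant that keeps every key. It leaves out the .span() bookkeeping and assumes that a key is one uppercase letter plus any trailing lowercase letters, and that a missing number means 1. The name analyze_all is made up for illustration:

```python
import re

# Key: an uppercase letter plus optional lowercase letters; count: optional digits
rx = re.compile(r'([A-Z][a-z]*)(\d*)')

def analyze_all(string):
    """Return (key, count) pairs, using 1 where no count is given."""
    result = []
    for m in rx.finditer(string):
        key, digits = m.group(1), m.group(2)
        result.append((key, int(digits) if digits else 1))
    return result

print(analyze_all("A15B7CD2Ef5"))
# [('A', 15), ('B', 7), ('C', 1), ('D', 2), ('Ef', 5)]
```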
Search for the letters in the string, instead of digits.
import re
data_str = 'A15B7C2'
temp = re.split("([A-Za-z])", data_str)[1:] # First element is just "", don't want that
temp = [a if a != "" else "1" for a in temp] # add the 1's that were implicit in the original string
finalDict = dict(zip(temp[0::2], temp[1::2])) # turn the list into a dict
In keeping with your original logic: instead of using re.split(), we can find all the numbers, split the string on the first match, keep the second half of the string for the next split, and store the pairs as tuples for later.
import re
raw = "A15B7CD2Ef5"
# find all the numbers
found = re.findall(r"(\d+)", raw)
# save the pairs as a list of tuples
pairs = []
# check that numbers were found
if found:
    # iterate over all matches
    for f in found:
        # split the raw string, with a max split of one, so that duplicate numbers don't cause more than 2 parts
        part = raw.split(f, 1)
        # set the original string to the second half of the split
        raw = part[1]
        # append pair
        pairs.append((part[0], f))
# Now for fun expand values
long_str = ""
for p in pairs:
    long_str += p[0] * int(p[1])
print(pairs)
print(long_str)

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz)
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?
To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)', # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']
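For comparison, the same results can be obtained without regex by scanning the string once and keeping a stack of open parentheses. This is a different technique from the re.subn loop above, sketched here under the assumption that only ( and ) act as brackets. The function name contents_by_paren is hypothetical:

```python
def contents_by_paren(text):
    """Collect the text directly inside each balanced pair of parentheses."""
    stack = []   # one list of characters per currently open "("
    found = []
    for ch in text:
        if ch == '(':
            stack.append([])
        elif ch == ')':
            if stack:                     # ignore a stray ")"
                found.append(''.join(stack.pop()))
        elif stack:
            stack[-1].append(ch)          # belongs to the innermost open "("
    return found

print(contents_by_paren("uuv(wh(x(yz)))"))  # ['yz', 'x', 'wh']
print(contents_by_paren("a(b(c)de)"))       # ['c', 'bde']
```

As with the regex version, an opening parenthesis that is never closed contributes nothing to the result.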
Would a string split work instead of a regex?
>>> s = 'uv(wh(x(yz))'
>>> match = [''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
'uv'
You can pop off the first element: if the string starts with a parenthesis, the first element is blank, which you don't want; if it is not blank, it was not inside parentheses, so you don't want it either.
Since that wasn't flexible enough something like this would work:
def match(string):
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']
Using regex a pattern such as this might potentially work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
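Your current pattern escapes both ( and ), so it requires a literal ) right after the word characters and only matches the innermost (yz).
Well, if you know how to convert a PHP (PCRE) regex to Python, you can use this recursive pattern (note that Python's built-in re module does not support (?R); the third-party regex module does):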
\(((?>[^()]+)|(?R))*\)

How can I ignore a string in python regex group matching?

Say I have the following string
>>> mystr = 'A-ABd54-Bf657'
(a random string of dash-delimited character groups) and want to match the opening part, and the rest of the string, in separate groups. I can use
>>> re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+)', mystr)
This produces a groupdict() like this:
{'a': 'A', 'b': 'ABd54-Bf657'}
How can I get the same regex to match group b but separately match a specific suffix (or set of suffixes) if it exists (they exist)? Ideally something like this
>>> myregex = <help me here>
>>> re.match(myregex, 'A-ABd54-Bf657').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-blah').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657-blah', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-test').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': 'test'}
Thanks.
mystr = 'A-ABd54-Bf657'
re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+?)(?:-(?P<test>test))?$', mystr)
                                                 ^                    ^
The first indicated ? makes the + quantifier non-greedy, so that it consumes the minimum possible.
The second indicated ? makes the group optional.
The $ is necessary; otherwise the non-greedy quantifier plus the optional group would match as little as possible: b would capture a single character and test would never match.
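A quick sketch that runs the pattern against the three inputs from the question, to show how the lazy quantifier and the optional group interact. The variable names myregex and results are illustrative only:

```python
import re

# Pattern from the answer: lazy group b, optional "-test" suffix, anchored with $
myregex = r'(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+?)(?:-(?P<test>test))?$'

results = {}
for s in ('A-ABd54-Bf657', 'A-ABd54-Bf657-blah', 'A-ABd54-Bf657-test'):
    results[s] = re.match(myregex, s).groupdict()
    print(s, '->', results[s])
# A-ABd54-Bf657 -> {'a': 'A', 'b': 'ABd54-Bf657', 'test': None}
# A-ABd54-Bf657-blah -> {'a': 'A', 'b': 'ABd54-Bf657-blah', 'test': None}
# A-ABd54-Bf657-test -> {'a': 'A', 'b': 'ABd54-Bf657', 'test': 'test'}
```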

Convert String of letters to numbers according to ordinal value?

I have no idea where to start with this. I need to write a function that will return a string of numbers in ordinal value. So like
stringConvert('DABC')
would give me '4123'
stringConvert('XPFT')
would give me '4213'
I thought maybe I could make a dictionary and associate each letter from the string with an integer, but that seems too inefficient and I still don't know how to put them in order.
You could sort the unique characters in the input string and apply indices to each letter by using the enumerate() function:
def stringConvert(s):
    ordinals = {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)}
    return ''.join([ordinals[c] for c in s])
The second argument to enumerate() is the integer at which to start counting; since your ordinals start at 1 you use that as the starting value rather than 0. set() gives us the unique values only.
ordinals then is a dictionary mapping character to an integer, in alphabetical order.
Demo:
>>> def stringConvert(s):
...     ordinals = {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)}
...     return ''.join([ordinals[c] for c in s])
...
>>> stringConvert('DABC')
'4123'
>>> stringConvert('XPFT')
'4213'
Breaking that all down a little:
>>> s = 'XPFT'
>>> set(s) # unique characters
set(['X', 'F', 'T', 'P'])
>>> sorted(set(s)) # unique characters in sorted order
['F', 'P', 'T', 'X']
>>> list(enumerate(sorted(set(s)), 1)) # unique characters in sorted order with index
[(1, 'F'), (2, 'P'), (3, 'T'), (4, 'X')]
>>> {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)} # character to number
{'P': '2', 'T': '3', 'X': '4', 'F': '1'}
Take a look at maketrans and translate (functions of the string module in Python 2, str methods in Python 3).
With those, your code may look like
def stringConvert(letters):
    unique = ''.join(sorted(set(letters)))
    return letters.translate(str.maketrans(unique, '123456789'[:len(unique)]))
and pass your strings as variable
You could make a character translation table and use the translate() string method:
TO = ''.join(str(i + 1)[0] for i in range(256))
def stringConvert(s):
    frm = ''.join(sorted(set(s)))
    return s.translate(str.maketrans(frm, TO[:len(frm)]))
print(stringConvert('DABC')) # --> 4123
print(stringConvert('XPFT')) # --> 4213
