Python regex query to parse very simple dictionary - python

I am new to regex module and learning a simple case to extract key and values from a simple dictionary.
the dictionary can not contain nested dicts and any lists, but may have simple tuples
MWE
import re
# note: the dictionary are simple and does NOT contains list, nested dicts, just these two example suffices for the regex matching.
d = "{'a':10,'b':True,'c':(5,'a')}" # ['a', 10, 'b', True, 'c', (5,'a') ]
d = "{'c':(5,'a'), 'd': 'TX'}" # ['c', (5,'a'), 'd', 'TX']
regexp = r"(.*):(.*)" # I am not sure how to repeat this pattern separated by ,
out = re.match(regexp,d).groups()
out

You should not use regex for this job. When the input string is valid Python syntax, you can use ast.literal_eval.
Like this:
import ast
# ...
out = ast.literal_eval(d)
Now you have a dictionary object in Python. You can for instance get the key/value pairs in a (dict_items) list:
print(out.items())
Regex
Regex is not the right tool. There will always be cases where some boundary case will be wrongly parsed. But to get the repeated matches, you can better use findall. Here is a simple example regex:
regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"
out = re.findall(regexp, d)
This will give a list of pairs.

Regex would be hard (perhaps impossible, but I'm not versed enough to say confidently) to use because of the ',' nested in your tuples. Just for the sake of it, I wrote (regex-less) code to parse your string for separators, ignoring parts inside parentheses:
d = "{'c':(5,'a',1), 'd': 'TX', 1:(1,2,3)}"
d=d.replace("{","").replace("}","")
indices = []
inside = False
for i,l in enumerate(d):
if inside:
if l == ")":
inside = False
continue
continue
if l == "(":
inside = True
continue
if l in {":",","}:
indices.append(i)
indices.append(len(d))
parts = []
start = 0
for i in indices:
parts.append(d[start:i].strip())
start = i+1
parts
# ["'c'", "(5,'a',1)", "'d'", "'TX'", '1', '(1,2,3)']

Related

String manipulation with python and storing alphabets and digits in separate lists

For a given string s='ab12dc3e6' I want to add 'ab' and '12' in two different lists. that means for output i am trying to achieve as temp1=['ab','dc','e'] and for temp2=['12,'3','6'].
I am not able to do so with the following code. Can someone provide an efficient way to do it?
S = "ab12dc3e6"
temp=list(S)
x=''
temp1=[]
temp2=[]
for i in range(len(temp)):
while i<len(temp) and (temp[i] and temp[i+1]).isdigit():
x+=temp[i]
i+=1
temp1.append(x)
if not temp[i].isdigit():
break
You can also solve this without any imports:
S = "ab12dc3e6"
def get_adjacent_by_func(content, func):
"""Returns a list of elements from content that fullfull func(...)"""
result = [[]]
for c in content:
if func(c):
# add to last inner list
result[-1].append(c)
elif result[-1]: # last inner list is filled
# add new inner list
result.append([])
# return only non empty inner lists
return [''.join(r) for r in result if r]
print(get_adjacent_by_func(S, str.isalpha))
print(get_adjacent_by_func(S, str.isdigit))
Output:
['ab', 'dc', 'e']
['12', '3', '6']
you can use regex, where you group letters and digits, then append them to lists
import re
S = "ab12dc3e6"
pattern = re.compile(r"([a-zA-Z]*)(\d*)")
temp1 = []
temp2 = []
for match in pattern.finditer(S):
# extract words
#dont append empty match
if match.group(1):
temp1.append(match.group(1))
print(match.group(1))
# extract numbers
#dont append empty match
if match.group(2):
temp2.append(match.group(2))
print(match.group(2))
print(temp1)
print(temp2)
Your code does nothing for isalpha - you also run into IndexError on
while i<len(temp) and (temp[i] and temp[i+1]).isdigit():
for i == len(temp)-1.
You can use itertools.takewhile and the correct string methods of str.isdigit and str.isalpha to filter your string down:
S = "ab12dc3e6"
r = {"digit":[], "letter":[]}
from itertools import takewhile, cycle
# switch between the two test methods
c = cycle([str.isalpha, str.isdigit])
r = {}
i = 0
while S:
what = next(c) # get next method to use
k = ''.join(takewhile(what, S))
S = S[len(k):]
r.setdefault(what.__name__, []).append(k)
print(r)
Output:
{'isalpha': ['ab', 'dc', 'e'],
'isdigit': ['12', '3', '6']}
This essentially creates a dictionary where each seperate list is stored under the functions name:
To get the lists, use r["isalpha"] or r["isdigit"].

For loop to dictionary comprehension correct translation (Python)

I need to find the longest string in a list for each letter of the alphabet.
My first straight forward approach looked like this:
alphabet = ["a","b", ..., "z"]
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
result = {key:"" for key in alphabet} # create a dictionary
# go through all words, if that word is longer than the current longest, save it
for word in text:
if word[0].lower() in alphabet and len(result[word[0].lower()]) < len(word):
result[word[0].lower()] = word.lower()
print(result)
which returns:
{'a': 'andhjtje9'}
as it is supposed to do.
In order to practice dictionary comprehension I tried to solve this in just one line:
result2 = {key:"" for key in alphabet}
result2 = {word[0].lower(): word.lower() for word in text if word[0].lower() in alphabet and len(result2[word[0].lower()]) < len(word)}
I just copied the if statement into the comprehension loop...
results2 however is:
{'a': 'ajhe5'}
can someone explain me why this is the case? I feel like I did exactly the same as in the first loop...
Thanks for any help!
List / Dict / Set - comprehension can not refer to themself while building itself - thats why you do not get what you want.
You can use a complicated dictionary comprehension to do this - with help of collections.groupby on a sorted list this could look like this:
from string import ascii_lowercase
from itertools import groupby
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:sorted(value, key=len)[-1]
for key,value in groupby((s for s in sorted(text)
if s[0].lower() in frozenset(ascii_lowercase)),
lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:next(value) for key,value in groupby(
(s for s in sorted(text, key=lambda x: (x[0],-len(x)))
if s[0].lower() in frozenset(ascii_lowercase)),
lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or several other ways ... but why would you?
Having it as for loops is much cleaner and easier to understand and would, in this case, follow the zen of python probably better.
Read about the zen of python by running:
import this

pattern match get list and dict from string

I have string below,and I want to get list,dict,var from this string.
How can I to split this string to specific format?
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
import re
m1 = re.findall (r'(?=.*,)(.*?=\[.+?\],?)',s)
for i in m1 :
print('m1:',i)
I only get result 1 correctly.
Does anyone know how to do?
m1: list_c=[1,2],
m1: a=3,b=1.3,c=abch,list_a=[1,2],
Use '=' to split instead, then you can work around with variable name and it's value.
You still need to handle the type casting for values (regex, split, try with casting may help).
Also, same as others' comment, using dict may be easier to handle
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
al = s.split('=')
var_l = [al[0]]
value_l = []
for a in al[1:-1]:
var_l.append(a.split(',')[-1])
value_l.append(','.join(a.split(',')[:-1]))
value_l.append(al[-1])
output = dict(zip(var_l, value_l))
print(output)
You may have better luck if you more or less explicitly describe the right-hand side expressions: numbers, lists, dictionaries, and identifiers:
re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
# [('list_c', '[1,2]'), ('a', '3'), ('b', '1.3'), ('c', 'abch'),
# ('list_a', '[1,2]'), ('dict_a', '{a:2,b:3}')]
The answer is like below
import re
from pprint import pprint
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1],Save,Record,dict_a={a:2,b:3}'
m1 = re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
temp_d = {}
for i,j in m1:
temp = i.strip(',').split(',')
if len(temp)>1:
for k in temp[:-1]:
temp_d[k]=''
temp_d[temp[-1]] = j
else:
temp_d[temp[0]] = j
pprint(temp_d)
Output is like
{'Record': '',
'Save': '',
'a': '3',
'b': '1.3',
'c': 'abch',
'dict_a': '{a:2,b:3}',
'list_a': '[1]',
'list_c': '[1,2]'}
Instead of picking out the types, you can start by capturing the identifiers. Here's a regex that captures all the identifiers in the string (for lowercase only, but see note):
regex = re.compile(r'([a-z]|_)+=')
#note if you want all valid variable names: r'([a-z]|[A-Z]|[0-9]|_)+'
cases = [x.group() for x in re.finditer(regex, s)]
This gives a list of all the identifiers in the string:
['list_c=', 'a=', 'b=', 'c=', 'list_a=', 'dict_a=']
We can now define a function to sequentially chop up s using the
above list to partition the string sequentially:
def chop(mystr, mylist):
temp = mystr.partition(mylist[0])[2]
cut = temp.find(mylist[1]) #strip leading bits
return mystr.partition(mylist[0])[2][cut:], mylist[1:]
mystr = s[:]
temp = [mystr]
mylist = cases[:]
while len() > 1:
mystr, mylist = chop(mystr, mylist)
temp.append(mystr)
This (convoluted) slicing operation gives this list of strings:
['list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'list_a=[1,2],dict_a={a:2,b:3}',
'dict_a={a:2,b:3}']
Now cut off the ends using each successive entry:
result = []
for x in range(len(temp) - 1):
cut = temp[x].find(temp[x+1]) - 1 #-1 to remove commas
result.append(temp[x][:cut])
result.append(temp.pop()) #get the last item
Now we have the full list:
['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']
Each element is easily parsable into key:value pairs (and is also executable via exec).

most pythonic way to compare substrings l in list L to string S & edit S according to l in L?

The list ['a','a #2','a(Old)'] should become {'a'} because '#' and '(Old)' are to be excised and a list of duplicates isn't needed. I struggled to develop a list comprehension with a generator and settled on this since I knew it'd work and valued time more than looking good:
l = []
groups = ['a','a #2','a(Old)']
for i in groups:
if ('#') in i: l.append(i[:i.index('#')].strip())
elif ('(Old)') in i: l.append(i[:i.index('(Old)')].strip())
else: l.append(i)
groups = set(l)
What's the slick way to get this result?
Here is general solution, if you want to clean elements of list lst from parts in wastes:
lst = ['a','a #2','a(Old)']
wastes = ['#', '(Old)']
cleaned_set = {
min([element.split(waste)[0].strip() for waste in wastes])
for element in arr
}
You could write this whole expression in a single set comprehension
>>> groups = ['a','a #2','a(Old)']
>>> {i.split('#')[0].split('(Old)')[0].strip() for i in groups}
{'a'}
This will get everything preceding a # and everything preceding '(Old)', then trim off whitespace. The remainder is placed into a set, which only keeps unique values.
You could define a helper function to apply all of the splits and then use a set comprehension.
For example:
lst = ['a','a #2','a(Old)', 'b', 'b #', 'b(New)']
splits = {'#', '(Old)', '(New)'}
def split_all(a):
for s in splits:
a = a.split(s)[0]
return a.strip()
groups = {split_all(a) for a in lst}
#{'a', 'b'}

Populate dictionary from list

I have a list of strings (from a .tt file) that looks like this:
list1 = ['have\tVERB', 'and\tCONJ', ..., 'tree\tNOUN', 'go\tVERB']
I want to turn it into a dictionary that looks like:
dict1 = { 'have':'VERB', 'and':'CONJ', 'tree':'NOUN', 'go':'VERB' }
I was thinking of substitution, but it doesn't work that well. Is there a way to tag the tab string '\t' as a divider?
Try the following:
dict1 = dict(item.split('\t') for item in list1)
Output:
>>>dict1
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
Since str.split also splits on '\t' by default ('\t' is considered white space), you could get a functional approach by feeding dict with a map that looks quite elegant:
d = dict(map(str.split, list1))
With the dictionary d now being in the wanted form:
print(d)
{'and': 'CONJ', 'go': 'VERB', 'have': 'VERB', 'tree': 'NOUN'}
If you need a split only on '\t' (while ignoring ' ' and '\n') and still want to use the map approach, you can create a partial object with functools.partial that only uses '\t' as the separator:
from functools import partial
# only splits on '\t' ignoring new-lines, white space e.t.c
tabsplit = partial(str.split, sep='\t')
d = dict(map(tabsplit, list1))
this, of course, yields the same result for d using the sample list of strings.
do that with a simple dict comprehension and a str.split (without arguments strip splits on blanks)
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dict1 = {x.split()[0]:x.split()[1] for x in list1}
result:
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
EDIT: the x.split()[0]:x.split()[1] does split twice, which is not optimal. Other answers here do it better without dict comprehension.
A short way to solve the problem, since split method splits '\t' by default (as pointed out by Jim Fasarakis-Hilliard), could be:
dictionary = dict(item.split() for item in list1)
print dictionary
I also wrote down a more simple and classic approach.
Not very pythonic but easy to understand for beginners:
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dictionary1 = {}
for item in list1:
splitted_item = item.split('\t')
word = splitted_item[0]
word_type = splitted_item[1]
dictionary1[word] = word_type
print dictionary1
Here I wrote the same code with very verbose comments:
# Let's start with our word list, we'll call it 'list1'
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
# Here's an empty dictionary, 'dictionary1'
dictionary1 = {}
# Let's start to iterate using variable 'item' through 'list1'
for item in list1:
# Here I split item in two parts, passing the '\t' character
# to the split function and put the resulting list of two elements
# into 'splitted_item' variable.
# If you want to know more about split function check the link available
# at the end of this answer
splitted_item = item.split('\t')
# Just to make code more readable here I now put 1st part
# of the splitted item (part 0 because we start counting
# from number 0) in "word" variable
word = splitted_item[0]
# I use the same apporach to save the 2nd part of the
# splitted item into 'word_type' variable
# Yes, you're right: we use 1 because we start counting from 0
word_type = splitted_item[1]
# Finally I add to 'dictionary1', 'word' key with a value of 'word_type'
dictionary1[word] = word_type
# After the for loop has been completed I print the now
# complete dictionary1 to check if result is correct
print dictionary1
Useful links:
You can quickly copy and paste this code here to check how it works and tweak it if you like: http://www.codeskulptor.com
If you want to learn more about split and string functions in general: https://docs.python.org/2/library/string.html

Categories

Resources