I'd like to parse tag/value descriptions using the delimiters :, and •
E.g. the Input would be:
Name:Test•Title: Test•Keywords: A,B,C
the expected result should be the name value dict
{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}
Ideally the keywords value "A,B,C" would already be split into a list. (This is a minor detail, since Python's built-in str.split will happily do this.)
Also applying a mapping
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
as a mapping between names and dict keys would be helpful but could be a separate step.
I tried the code below https://trinket.io/python3/8dbbc783c7
# pyparsing named values
# Wolfgang Fahl
# 2023-01-28 for Stackoverflow question
import pyparsing as pp
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
"Name": "name",
"Titel": "title",
"Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
name_values_grammar=pp.delimited_list(
pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
+":"+pp.Suppress(pp.Optional(pp.White()))
+pp.delimited_list(
pp.OneOrMore(pp.Word(pp.printables+" ", exclude_chars=",:"))
,delim=",")("value")
,delim=runDelim).setResultsName("tag", list_all_matches=True)
results=name_values_grammar.parseString(notes_text)
print(results.dump())
and variations of it, but I am not even close to the expected result. Currently the dump shows:
['Name', ':', 'Test']
- key: 'Name'
- tag: [['Name', ':', 'Test']]
[0]:
['Name', ':', 'Test']
- value: ['Test']
It seems I don't know how to define the grammar and how to process the ParseResults in a way that yields the needed dict.
The main questions for me are:
Should I use parse actions?
How is the naming of part results done?
How is the navigation of the resulting tree done?
How is it possible to get the list back from delimitedList?
What does list_all_matches=True achieve? Its behavior seems strange.
I searched for answers to the above questions here on Stack Overflow and couldn't find a consistent picture of what to do.
Pyparsing delimited list only returns first element
Finding lists of elements within a string using Pyparsing
PyParsing seems to be a great tool, but I find it very unintuitive. Fortunately there are lots of answers here, so I hope to learn how to get this example working.
Trying it myself, I took a stepwise approach:
First I checked the delimitedList behavior, see https://trinket.io/python3/25e60884eb
# Try out pyparsing delimitedList
# WF 2023-01-28
from pyparsing import printables, OneOrMore, Word, delimitedList
notes_text="A,B,C"
comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")
grammar = comma_separated_values
result=grammar.parseString(notes_text)
print(f"result:{result}")
print(f"dump:{result.dump()}")
print(f"asDict:{result.asDict()}")
print(f"asList:{result.asList()}")
which returns
result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']
which looks promising and the key success factor seems to be to name this list with "clist" and the default behavior looks fine.
https://trinket.io/python3/bc2517e25a
shows in more detail where the problem is.
# Try out pyparsing delimitedList
# see https://stackoverflow.com/q/75266188/1497139
# WF 2023-01-28
from pyparsing import printables, oneOf, OneOrMore,Optional, ParseResults, Suppress,White, Word, delimitedList
def show_result(title:str,result:ParseResults):
    """
    show pyparsing result details
    Args:
        result(ParseResults)
    """
    print(f"result for {title}:")
    print(f" result:{result}")
    print(f" dump:{result.dump()}")
    print(f" asDict:{result.asDict()}")
    print(f" asList:{result.asList()}")
    # asXML is deprecated and doesn't work any more
    # print(f"asXML:{result.asXML()}")
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
comma_text="A,B,C"
keys={
"Name": "name",
"Titel": "title",
"Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")
cresult=comma_separated_values.parseString(comma_text)
show_result("comma separated values",cresult)
grammar=delimitedList(
oneOf(keywords,as_keyword=True)
+Suppress(":"+Optional(White()))
+comma_separated_values
,delim=runDelim
)("namevalues")
nresult=grammar.parseString(notes_text)
show_result("name value list",nresult)
#ogrammar=OneOrMore(
# oneOf(keywords,as_keyword=True)
# +Suppress(":"+Optional(White()))
# +comma_separated_values
#)
#oresult=grammar.parseString(notes_text)
#show_result("name value list with OneOf",nresult)
output:
result for comma separated values:
result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']
result for name value list:
result:['Name', 'Test']
dump:['Name', 'Test']
- clist: ['Test']
- namevalues: ['Name', 'Test']
asDict:{'clist': ['Test'], 'namevalues': ['Name', 'Test']}
asList:['Name', 'Test']
While the first result makes sense to me, the second is unintuitive. I'd expected a nested result: a dict containing a dict of lists.
What causes this unintuitive behavior and how can it be mitigated?
The grammar has two issues: you are wrapping OneOrMore inside delimited_list when you only want the outer delimited_list, and you aren't telling the parser how your data needs to be structured to give the names meaning.
You also don't need the whitespace suppression, since pyparsing skips whitespace between tokens by default.
Adding parse_all=True to the parse_string call will help to see where not everything is being consumed.
name_values_grammar = pp.delimited_list(
pp.Group(
pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.delimited_list(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
, delim=',')
)
, delim='•'
).setResultsName('tag', list_all_matches=True)
Should i use parse actions? As you can see, you don't technically need to, but you've ended up with a data structure that might be less efficient for what you want. If the grammar gets more complicated, I think using some parse actions would make sense. Take a look below for some examples to map the key names (only if they are found), and cleaning up list parsing for a more complicated grammar.
How is the naming of part results done? By default in a ParseResults object, the last part that is labelled with a name will be returned when you ask for that name. Asking for all matches to be returned using list_all_matches will only work usefully for some simple structures, but it does work. See below for examples.
How is the navigation of the resulting tree done? By default, everything gets flattened. You can use pyparsing.Group to tell the parser not to flatten its contents into the parent list (and therefore retain useful structure and part names).
How is it possible to get the list back from delimitedList? If you don't wrap the delimited_list result in another list then the flattening that is done will remove the structure. Parse actions or Group on the internal structure again to the rescue.
What does list_all_matches=True achieve - its behavior seems strange? It seems strange because of the grammar structure it is used in. Consider the different outputs in:
import pyparsing as pp
print(
pp.delimited_list(
pp.Word(pp.printables, exclude_chars=',').setResultsName('word', list_all_matches=True)
).parse_string('x,y,z').dump()
)
print(
pp.delimited_list(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
)
.parse_string('x:a,y:b,z:c').dump()
)
print(
pp.delimited_list(
pp.Group(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
)
).setResultsName('tag', list_all_matches=True)
.parse_string('x:a,y:b,z:c').dump()
)
The first one makes sense, giving you a list of all the tokens you would expect. The third one also makes sense, since you have a structure you can walk. But with the second one you end up with two lists that are not necessarily (in a more complicated grammar) going to be easy to match up.
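Building on the grouped variant, here is a minimal sketch (not part of the original answer) of how Group combined with pp.Dict can hand back the mapping directly. It assumes pyparsing 3.x and reuses the keys mapping from the question; the names key, item, value, pair and grammar are only illustrative.

```python
import pyparsing as pp

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}

# Only known keywords are accepted; the parse action maps them to dict keys.
key = pp.one_of(list(keys), as_keyword=True)
key.set_parse_action(lambda t: keys[t[0]])

# pp.printables is ASCII only, so a Word stops at the "•" delimiter by itself.
item = pp.Word(pp.printables, exclude_chars=",:")
value = pp.Group(pp.delimited_list(item, delim=","))

# Each key/value pair is grouped so that Dict can use token 0 as the key.
pair = pp.Group(key + pp.Suppress(":") + value)
grammar = pp.Dict(pp.delimited_list(pair, delim="•"))

result = grammar.parse_string("Name:Test•Title: Test•Keywords: A,B,C", parse_all=True)
parsed = result.as_dict()
print(parsed)
```

In this sketch every value comes back as a list, including the single-element ones; unwrapping those would need one more parse action or a post-processing step.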
Here's a different way of building the grammar so that it supports quoting strings with delimiters in them so they don't become lists, and keywords that aren't in your mapping. It's harder to do this without parse actions.
import pyparsing as pp
import json
test_string = "Name:Test•Title: Test•Extra: '1,2,3'•Keywords: A,B,C,'D,E',F"
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
g_key = pp.Word(pp.alphas)
g_item = pp.Word(pp.printables, excludeChars='•,\'') | pp.QuotedString(quote_char="'")
g_value = pp.delimited_list(g_item, delim=',')
l_key_value_sep = pp.Suppress(pp.Literal(':'))
g_key_value = g_key + l_key_value_sep + g_value
g_grammar = pp.delimited_list(g_key_value, delim='•')
g_key.add_parse_action(lambda x: keys[x[0]] if x[0] in keys else x)
g_value.add_parse_action(lambda x: [x] if len(x) > 1 else x)
g_key_value.add_parse_action(lambda x: (x[0], x[1].as_list()) if isinstance(x[1],pp.ParseResults) else (x[0], x[1]))
key_values = dict()
for k,v in g_grammar.parse_string(test_string, parse_all=True):
    key_values[k] = v
print(json.dumps(key_values, indent=2))
Another approach using regular expressions would be:
import re
import typing

def _extractByKeyword(keyword: str, string: str) -> typing.Union[str, None]:
    """
    Extract the value for the given key from the given string.
    designed for simple key value strings without further formatting
    e.g.
        Title: Hello World
        Goal: extraction
    For keyword="Goal" the string "extraction" would be returned
    Args:
        keyword: extract the value associated to this keyword
        string: string to extract from
    Returns:
        str: value associated to given keyword
        None: keyword not found in given string
    """
    if string is None or keyword is None:
        return None
    # https://stackoverflow.com/a/2788151/1497139
    # value is closure of not space not / colon
    pattern = rf"({keyword}:(?P<value>[\s\w,_-]*))(\s+\w+:|\n|$)"
    match = re.search(pattern, string)
    value = None
    if match is not None:
        value = match.group('value')
        if isinstance(value, str):
            value = value.strip()
    return value
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
notes_text="Name:Test Title: Test Keywords: A,B,C"
lod = {v: _extractByKeyword(k, notes_text) for k,v in keys.items()}
The extraction function was tested with:
import typing
from dataclasses import dataclass
from unittest import TestCase
class TestExtraction(TestCase):

    def test_extractByKeyword(self):
        """
        tests the keyword extraction
        """
        @dataclass
        class TestParam:
            expected: typing.Union[str, None]
            keyword: typing.Union[str, None]
            string: typing.Union[str, None]

        testParams = [
            TestParam("test", "Goal", "Title:Title\nGoal:test\nLabel:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test Label:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces\nLabel:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces Label:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces"),
            TestParam("SQL-DML", "Goal", "Title:Title\nGoal:SQL-DML"),
            TestParam("SQL_DML", "Goal", "Title:Title\nGoal:SQL_DML"),
            TestParam(None, None, "Title:Title\nGoal:test"),
            TestParam(None, "Label", None),
            TestParam(None, None, None),
        ]
        for testParam in testParams:
            with self.subTest(testParam=testParam):
                actual = _extractByKeyword(testParam.keyword, testParam.string)
                self.assertEqual(testParam.expected, actual)
For the time being I am using a simple workaround, see https://trinket.io/python3/7ccaa91f7e
# Try out parsing name value list
# WF 2023-01-28
import json
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
result={}
key_values=notes_text.split("•")
for key_value in key_values:
    key,value=key_value.split(":")
    value=value.strip()
    result[keys[key]]=value # could do another split here if need be
print(json.dumps(result,indent=2))
output:
{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}
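A slightly more defensive variant of the same workaround can be sketched as follows. It is only an illustration: the function name parse_notes and the list_keys parameter are made up here, unknown keys are passed through unchanged, and values containing a colon are kept intact by splitting only at the first one.

```python
def parse_notes(text, key_map, list_keys=("keywords",)):
    """Split a '•'-delimited tag/value string into a dict (hypothetical helper)."""
    result = {}
    for chunk in text.split("•"):
        if not chunk.strip():
            continue  # tolerate empty segments, e.g. a trailing delimiter
        raw_key, _, raw_value = chunk.partition(":")  # split at the first colon only
        raw_key = raw_key.strip()
        key = key_map.get(raw_key, raw_key)  # unmapped keys are kept as they are
        value = raw_value.strip()
        if key in list_keys:
            value = [v.strip() for v in value.split(",")]
        result[key] = value
    return result


keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
parsed_notes = parse_notes("Name:Test•Title: Test•Keywords: A,B,C", keys)
print(parsed_notes)
```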
Related
I am trying to parse JSON input as string in Python, not able to parse as list or dict since the JSON input is not in a proper format (Due to limitations in the middleware can't do much here.)
{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
I tried json.loads and ast.literal_eval (invalid syntax error).
How can I load this?
The sad answer is: the contents of your "Records" field are simply not JSON. No amount of ad-hoc patching (= to :, adding quotes) will change that. You have to find out the language/format specification for what the producing system emits and write/find a proper parser for that particular format.
As a crutch, and only if the above example already captures all the variability you might see in production data, a much simpler approach based on regular expressions (see the re package or edd's pragmatic answer) might be sufficient.
If the producer of the data is consistent, you can start with something like the following, that aims to bridge the JSON gap.
import re
import json
source = {
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
s = source["Records"]
# We'll start by removing any extraneous white spaces
s2 = re.sub(r'\s', '', s)
# Surrounding any word with "
s3 = re.sub(r'(\w+)', r'"\g<1>"', s2)
# Replacing = with :
s4 = re.sub('=', ':', s3)
# Lastly, fixing missing closing ], }
## Note that }} is an escaped } for f-string.
s5 = f"{s4}]}}"
>>> json.loads(s5)
{'Output': [{'_fields': [{'Entity': 'ABC', 'No': '12345', 'LineNo': '1', 'EffDate': '20200630'}, {'Entity': 'ABC', 'No': '567', 'LineNo': '1', 'EffDate': '20200630'}]}]}
Follow up with some robust testing and have a nice polished ETL with your favorite tooling.
As I understand it, you are trying to parse the value of the Records item in the dictionary as JSON; unfortunately you cannot.
The string in that value is not JSON, so you must write your own parser that first converts it into a JSON string, following the format the string is actually written in. (Unfortunately, we don't know which "middleware" you are talking about.)
tldr: Parse it into a JSON string, then parse the JSON into a Python dictionary. Read this to find out more about JSON (JavaScript Object Notation) rules.
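To make the "write a parser for the actual format" advice concrete, here is a small recursive-descent sketch. The grammar is an assumption read off the single example in the question ({key=value, ...} objects, [...] arrays, bare scalar values, possibly missing closing brackets at the end); the function names are hypothetical and all scalars are kept as strings.

```python
def _skip_ws(s, i):
    while i < len(s) and s[i].isspace():
        i += 1
    return i

def _parse_value(s, i):
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == "{":
        return _parse_object(s, i + 1)
    if i < len(s) and s[i] == "[":
        return _parse_array(s, i + 1)
    j = i
    while j < len(s) and s[j] not in ",]}":  # a scalar ends at the next separator
        j += 1
    return s[i:j].strip(), j

def _parse_object(s, i):
    obj = {}
    while True:
        i = _skip_ws(s, i)
        if i >= len(s):
            return obj, i  # tolerate a missing closing "}"
        if s[i] == "}":
            return obj, i + 1
        if s[i] == ",":
            i += 1
            continue
        eq = s.index("=", i)  # raises ValueError on malformed input
        key = s[i:eq].strip()
        obj[key], i = _parse_value(s, eq + 1)

def _parse_array(s, i):
    items = []
    while True:
        i = _skip_ws(s, i)
        if i >= len(s):
            return items, i  # tolerate a missing closing "]"
        if s[i] == "]":
            return items, i + 1
        if s[i] == ",":
            i += 1
            continue
        item, i = _parse_value(s, i)
        items.append(item)

def parse_records(s):
    value, _ = _parse_value(s, 0)
    return value

records = parse_records(
    "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, "
    "{Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
)
print(records)
```

The result is an ordinary Python structure, so json.dumps(records) gives a real JSON string if one is needed downstream.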
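There you go. This code quotes the keys, which moves the Records string closer to JSON. Note that the keys end up in single quotes and bare values such as ABC stay unquoted, so more fixes are needed before the inner string parses as JSON: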
notjson = """{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}"""
notjson = notjson.replace("=","':") #adds a singlequote and makes it more valid
notjson = notjson.replace("{","{'")
notjson = notjson.replace(", ",", '")
notjson = notjson.replace("}, '{","}, {")
json = "{" + notjson[2:]
print(json)
print(notjson)
In a Flask-RESTful based API, I want to allow clients to retrieve a JSON response partially, via the ?fields=... parameter. It lists field names (keys of the JSON object) that will be used to construct a partial representation of the larger original.
This may be, in its simplest form, a comma-separated list:
GET /v1/foobar?fields=name,id,date
That can be done with webargs' DelimitedList schema field easily, and is no trouble for me.
But, to allow nested objects' keys to be represented, the delimited field list may include arbitrarily nested keys enclosed in matching parentheses:
GET /v1/foobar?fields=name,id,another(name,id),date
{
"name": "",
"id": "",
"another": {
"name": "",
"id": ""
},
"date": ""
}
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
{
"id": "",
"one": {
"id": "",
"two": {
"id": "",
"three": {
"id": ""
},
"date": ""
}
},
"date": ""
}
GET /v1/foobar?fields=just(me)
{
"just": {
"me": ""
}
}
My question is two-fold:
Is there a way to do this (validate & deserialize) with webargs and marshmallow natively?
If not, how would I do this with a parsing framework like pyparsing? Any hint on what the BNF grammar is supposed to look like is highly appreciated.
Pyparsing has a couple of helpful built-ins, delimitedList and nestedExpr. Here is an annotated snippet that builds up a parser for your values. (I also included an example where your list elements might be more than just simple alphabetic words):
import pyparsing as pp
# placeholder element that will be used recursively
item = pp.Forward()
# your basic item type - expand as needed to include other characters or types
word = pp.Word(pp.alphas + '_')
list_element = word
# for instance, add support for numeric values
list_element = word | pp.pyparsing_common.number
# retain x(y, z, etc.) groupings using Group
grouped_item = pp.Group(word + pp.nestedExpr(content=pp.delimitedList(item)))
# define content for placeholder; must use '<<=' operator here, not '='
item <<= grouped_item | list_element
# create parser
parser = pp.Suppress("GET /v1/foobar?fields=") + pp.delimitedList(item)
You can test any pyparsing expression using runTests:
parser.runTests("""
GET /v1/foobar?fields=name,id,date
GET /v1/foobar?fields=name,id,another(name,id),date
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
GET /v1/foobar?fields=just(me)
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
""", fullDump=False)
Gives:
GET /v1/foobar?fields=name,id,date
['name', 'id', 'date']
GET /v1/foobar?fields=name,id,another(name,id),date
['name', 'id', ['another', ['name', 'id']], 'date']
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
['id', ['one', ['id', ['two', ['id', ['three', ['id']], 'date']]]], 'date']
GET /v1/foobar?fields=just(me)
[['just', ['me']]]
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
[['numbers', [1, 2, 3.7, -260000000000.0]]]
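The nested lists above still have to be turned into the nested objects shown in the question. A minimal sketch of that step, assuming the list shape printed by runTests (a plain string for a leaf, a [name, [children]] pair for a nested key); fields_to_dict is a made-up helper name, and a real ParseResults would first be converted with as_list().

```python
def fields_to_dict(items):
    """Convert ['a', ['b', ['c']]] style field lists into nested dicts."""
    result = {}
    for item in items:
        if isinstance(item, list):
            name, children = item  # grouped item: name followed by its sub-fields
            result[name] = fields_to_dict(children)
        else:
            result[item] = ""  # leaf field, empty placeholder as in the question
    return result


nested = fields_to_dict(
    ['id', ['one', ['id', ['two', ['id', ['three', ['id']], 'date']]]], 'date']
)
print(nested)
```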
The ultimate goal, and the origin of the problem, is to have a field that is compatible with json_extract_path_text in Redshift.
This is how it looks right now:
{'error': "Feed load failed: Parameter 'url' must be a string, not object", 'errorCode': 3, 'event_origin': 'app', 'screen_id': '6118964227874465', 'screen_class': 'Promotion'}
To extract the field I need from the string in Redshift, I replaced single quotes with double quotes.
This particular record gives an error because the value of error itself contains single quotes. If those get replaced as well, the string becomes invalid JSON.
So what I need is:
{"error": "Feed load failed: Parameter 'url' must be a string, not object", "errorCode": 3, "event_origin": "app", "screen_id": "6118964227874465", "screen_class": "Promotion"}
Several ways, one is to use the regex module with
"[^"]*"(*SKIP)(*FAIL)|'
See a demo on regex101.com.
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
new_string = rx.sub('"', old_string)
With the original re module, you'd need to use a function and see if the group has been matched or not - (*SKIP)(*FAIL) lets you avoid exactly that.
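For comparison, the function-based variant with the standard re module could look like the sketch below. The helper name single_to_double_quotes is made up, and the approach assumes that double-quoted strings never contain escaped double quotes and that single-quoted strings never contain a double quote.

```python
import json
import re

# Alternative 1 captures a complete double-quoted string (left untouched),
# alternative 2 is a bare single quote (replaced).
_QUOTE_RE = re.compile(r'("[^"]*")|\'')

def single_to_double_quotes(text):
    # group(1) is only set when the double-quoted alternative matched
    return _QUOTE_RE.sub(lambda m: m.group(1) if m.group(1) is not None else '"', text)


old_string = (
    "{'error': \"Feed load failed: Parameter 'url' must be a string, not object\", "
    "'errorCode': 3, 'event_origin': 'app'}"
)
new_string = single_to_double_quotes(old_string)
record = json.loads(new_string)
print(new_string)
```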
I tried a regex approach but found it too complicated and slow. So I wrote a simple "bracket-parser" which keeps track of the current quotation mode. It cannot handle multiple levels of nesting; you'd need a stack for that. For my use case, converting str(dict) to proper JSON, it works:
example input:
{'cities': [{'name': "Upper Hell's Gate"}, {'name': "N'zeto"}]}
example output:
{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}
python unit test
def testSingleToDoubleQuote(self):
    jsonStr='''
    {
        "cities": [
        {
            "name": "Upper Hell's Gate"
        },
        {
            "name": "N'zeto"
        }
        ]
    }
    '''
    listOfDicts=json.loads(jsonStr)
    dictStr=str(listOfDicts)
    if self.debug:
        print(dictStr)
    jsonStr2=JSONAble.singleQuoteToDoubleQuote(dictStr)
    if self.debug:
        print(jsonStr2)
    self.assertEqual('''{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}''',jsonStr2)
singleQuoteToDoubleQuote
def singleQuoteToDoubleQuote(singleQuoted):
    '''
    convert a single quoted string to a double quoted one
    Args:
        singleQuoted(string): a single quoted string e.g. {'cities': [{'name': "Upper Hell's Gate"}]}
    Returns:
        string: the double quoted version of the string e.g.
    see
        - https://stackoverflow.com/questions/55600788/python-replace-single-quotes-with-double-quotes-but-leave-ones-within-double-q
    '''
    cList=list(singleQuoted)
    inDouble=False
    inSingle=False
    for i,c in enumerate(cList):
        #print ("%d:%s %r %r" %(i,c,inSingle,inDouble))
        if c=="'":
            if not inDouble:
                inSingle=not inSingle
                cList[i]='"'
        elif c=='"':
            inDouble=not inDouble
    doubleQuoted="".join(cList)
    return doubleQuoted
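For the specific use case named above, converting str(dict) back to JSON, a different technique than the quote tracker is also possible: let Python read its own literal syntax with ast.literal_eval and serialize the result with json.dumps. This is only a sketch; dict_repr_to_json is a made-up name, and it works solely when the input really is a Python literal.

```python
import ast
import json

def dict_repr_to_json(dict_str):
    """Turn the repr of a dict made of plain literals into a JSON string."""
    return json.dumps(ast.literal_eval(dict_str))


dict_str = str({"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]})
json_str = dict_repr_to_json(dict_str)
print(json_str)
```

Unlike a character-level replacement, this also maps None, True and False to null, true and false, but it raises an error for text that merely looks like a dict.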
I'm writing a JSON configuration (i.e config file in JSON format) interpreter with PLY.
There are huge swaths of the configuration file that I'd like to ignore. Some parts that I'd like to ignore contain tokens that I can't ignore in other parts of the file.
For example, I want to ignore:
"features" : [{
"name" : "someObscureFeature",
"version": "1.2",
"options": {
"values" : ["a", "b", "c"]
"allowWithoutContentLength": false,
"enabled": true
}
...
}]
But I do NOT want to ignore:
"features" : [{
"name" : "importantFeature",
"version": "1.1",
"options": {
"value": {
"id": 587842,
"description": "ramy-single-hostmatch",
"products": [
"Fresca"
]
...
}]
There are also lots of other tokens within the array of features that I want to ignore if the name value is not 'importantFeature'. For example there is likely to be an array of values in both important and obscure features. I need to ignore accordingly.
Notice also that I need to extract certain elements of the values field and that I'd like the values field to be tokenized so I can make use of it. Effectively, I'd like to conditionally tokenize the values field if it's inside of an importantMatch.
Also note that importantFeature is just standing in for what will eventually be about a dozen different features, each with their own grammar inside of the their respective features blocks.
The problem I'm running into is that every feature, obviously, has a name. I'd like to write something along these lines:
def p_FEATURES(p):
    '''FEATURES : ARRAY_START FEATURE COMMA FEATURES ARRAY_END
                | ARRAY_START FEATURE ARRAY_END'''

def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK '''

def p_IGNORE_BLOCK(p):
    '''IGNORE_BLOCK : BLOCK_START LINES BLOCK_END'''
However, the problem I'm running into is that I can't just "IGNORE_BLOCK", because the block will have a 'name' and I have a token in my lexer called 'name':
def t_NAME_KEY(t): r'name'; return t
Any help greatly appreciated.
When you define a regex rule function, you can choose whether or not to return the token. Depending on what is returned, the token is either ignored or considered. For example:
def t_BLOCK(t):
    r'\{[\s]*name[\s]*:[\s]*(importantFeature)|(obscureFeature)\}' # will match a full block with the 'name' key in it
    # t.value holds the matched text; returning nothing discards the token
    if 'obscureFeature' not in t.value:
        return t
    else:
        pass
You can build a rule somewhat along these lines, and then choose whether to return the token or not based on whether your important feature was present or not.
Also, the general convention for declaring a token that should be ignored is to prefix its name with t_ignore_.
Based on OP's edit: forget about elimination during tokenisation. What you could do instead is manually rebuild the JSON as you parse it with the grammar. For example:
Replace
def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK '''
with
data = []

def p_FEATURE(p):
    '''FEATURE : BLOCK_START DATA BLOCK_END FEATURE
               | BLOCK_START DATA BLOCK_END'''

def p_DATA(p):
    '''DATA : KEY COLON VALUE COMMA DATA
            | KEY COLON VALUE ''' # and so on (have another function for values)
What you can do now is to examine p[2] and see if it is important. If yes, add it to your data variable. Else, ignore.
This is just a rough idea. You'll still have to figure out the grammar rules exactly (for example, VALUE would also probably lead to another state), and adding the right blocks to data and how. But it is possible.
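The "parse every block, then keep only what matters" idea can be illustrated without PLY. The sketch below swaps in the standard json module purely to show the filtering step, under the assumption that the configuration is valid JSON; IMPORTANT_NAMES, collect_important and the sample text are made up for the illustration.

```python
import json

# Hypothetical set of feature names that deserve further processing.
IMPORTANT_NAMES = {"importantFeature"}

def collect_important(config_text):
    """Parse the whole configuration, then keep only the important features."""
    config = json.loads(config_text)
    data = []
    for feature in config.get("features", []):
        # the same check one would apply to p[2] in the grammar action
        if feature.get("name") in IMPORTANT_NAMES:
            data.append(feature)
    return data


sample = """
{"features": [
  {"name": "someObscureFeature", "version": "1.2", "options": {"values": ["a", "b", "c"]}},
  {"name": "importantFeature", "version": "1.1", "options": {"value": {"id": 587842}}}
]}
"""
important = collect_important(sample)
print(important)
```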
What is the simplest way to convert a string of keyword=values to a dictionary, for example the following string:
name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"
to the following python dictionary:
{'name':'John Smith', 'age':34, 'height':173.2, 'location':'US', 'avatar':':,=)'}
The 'avatar' key is just to show that the strings can contain = and , so a simple 'split' won't do. Any ideas? Thanks!
This works for me:
# get all the items
matches = re.findall(r'\w+=".+?"', s) + re.findall(r'\w+=[\d.]+',s)
# partition each match at '=' (findall returns plain strings, not match objects)
matches = [m.split('=', 1) for m in matches]
# use results to make a dict
d = dict(matches)
I would suggest a lazy way of doing this.
test_string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
eval("dict({})".format(test_string))
{'age': 34, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith', 'height': 173.2}
Hope this helps someone !
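If eval on untrusted input is a concern, the same trick can be sketched with the ast module, which accepts only literal values. This assumes, like the eval version, that the text is valid Python keyword-argument syntax; kwargs_to_dict is a made-up name.

```python
import ast

def kwargs_to_dict(text):
    """Read 'a=1, b="x"' as the keyword arguments of a dict(...) call."""
    call = ast.parse("dict({})".format(text), mode="eval").body
    # literal_eval accepts AST nodes and rejects anything that is not a literal
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


test_string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
parsed_kwargs = kwargs_to_dict(test_string)
print(parsed_kwargs)
```

A value such as __import__("os") is rejected with a ValueError instead of being executed.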
Edit: since the csv module doesn't deal as desired with quotes inside fields, it takes a bit more work to implement this functionality:
import re
quoted = re.compile(r'"[^"]*"')
class QuoteSaver(object):

    def __init__(self):
        self.saver = dict()
        self.reverser = dict()

    def preserve(self, mo):
        s = mo.group()
        if s not in self.saver:
            self.saver[s] = '"%d"' % len(self.saver)
            self.reverser[self.saver[s]] = s
        return self.saver[s]

    def expand(self, mo):
        return self.reverser[mo.group()]

x = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
qs = QuoteSaver()
y = quoted.sub(qs.preserve, x)
kvs_strings = y.split(',')
kvs_pairs = [kv.split('=') for kv in kvs_strings]
kvs_restored = [(k, quoted.sub(qs.expand, v)) for k, v in kvs_pairs]

def converter(v):
    if v.startswith('"'): return v.strip('"')
    try: return int(v)
    except ValueError: return float(v)

thedict = dict((k.strip(), converter(v)) for k, v in kvs_restored)
for k in thedict:
    print("%-8s %s" % (k, thedict[k]))
print(thedict)
I'm emitting thedict twice to show exactly how and why it differs from the required result; the output is:
age 34
location US
name John Smith
avatar :,=)
height 173.2
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)',
'height': 173.19999999999999}
As you see, the output for the floating point value is as requested when directly emitted with print, but it isn't and cannot be (since there IS no floating point value that would display 173.2 in such a case!-) when the print is applied to the whole dict (because that inevitably uses repr on the keys and values -- and the repr of 173.2 has that form, given the usual issues about how floating point values are stored in binary, not in decimal, etc, etc). You might define a dict subclass which overrides __str__ to specialcase floating-point values, I guess, if that's indeed a requirement.
But, I hope this distraction doesn't interfere with the core idea -- as long as the doublequotes are properly balanced (and there are no doublequotes-inside-doublequotes), this code does perform the required task of preserving "special characters" (commas and equal signs, in this case) from being taken in their normal sense when they're inside double quotes, even if the double quotes start inside a "field" rather than at the beginning of the field (csv only deals with the latter condition). Insert a few intermediate prints if the way the code works is not obvious -- first it changes all "double quoted fields" into a specially simple form ("0", "1" and so on), while separately recording what the actual contents corresponding to those simple forms are; at the end, the simple forms are changed back into the original contents. Double-quote stripping (for strings) and transformation of the unquoted strings into integers or floats is finally handled by the simple converter function.
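On the floating-point remark: the long form in the output above comes from older Python versions, whose float repr used 17 significant digits. Current Python versions print the shortest string that round-trips, so the difference only shows up when the longer format is requested explicitly. A small sketch:

```python
value = 173.2

short = repr(value)                # shortest round-tripping form on current Pythons
long_form = format(value, ".17g")  # 17 significant digits, as in the output above

print(short)      # 173.2
print(long_form)  # 173.19999999999999

# Both strings denote the same binary floating-point number.
assert float(short) == float(long_form) == value
```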
Here is a more verbose approach to the problem using pyparsing. Note the parse actions
which do the automatic conversion of types from strings to ints or floats. Also, the
QuotedString class implicitly strips the quotation marks from the quoted value. Finally,
the Dict class takes each 'key = val' group in the comma-delimited list, and assigns
results names using the key and value tokens.
from pyparsing import *
key = Word(alphas)
EQ = Suppress('=')
real = Regex(r'[+-]?\d+\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
qs = QuotedString('"')
value = real | integer | qs
dictstring = Dict(delimitedList(Group(key + EQ + value)))
Now to parse your original text string, storing the results in dd. Pyparsing returns an
object of type ParseResults, but this class has many dict-like features (support for keys(),
items(), in, etc.), or can emit a true Python dict by calling asDict(). Calling dump()
shows all of the tokens in the original parsed list, plus all of the named items. The last
two examples show how to access named items within a ParseResults as if they were attributes of
a Python object.
text = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
dd = dictstring.parseString(text)
print(dd.keys())
print(dd.items())
print(dd.dump())
print(dd.asDict())
print(dd.name)
print(dd.avatar)
Prints:
['age', 'location', 'name', 'avatar', 'height']
[('age', 34), ('location', 'US'), ('name', 'John Smith'), ('avatar', ':,=)'), ('height', 173.19999999999999)]
[['name', 'John Smith'], ['age', 34], ['height', 173.19999999999999], ['location', 'US'], ['avatar', ':,=)']]
- age: 34
- avatar: :,=)
- height: 173.2
- location: US
- name: John Smith
{'age': 34, 'height': 173.19999999999999, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith'}
John Smith
:,=)
The following code produces the correct behavior, but is just a bit long! I've added a space in the avatar to show that it deals well with commas and spaces and equal signs inside the string. Any suggestions to shorten it?
import hashlib
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
strings = {}
def simplify(value):
    try:
        return int(value)
    except ValueError:
        return float(value)

# replace every quoted string with the md5 hex digest of its content
while True:
    try:
        p1 = string.index('"')
        p2 = string.index('"',p1+1)
        substring = string[p1+1:p2]
        key = hashlib.md5(substring.encode("utf-8")).hexdigest()
        strings[key] = substring
        string = string[:p1] + key + string[p2+1:]
    except ValueError:
        # no more quotes left
        break

d = {}
for pair in string.split(', '):
    key, value = pair.split('=')
    if value in strings:
        d[key] = strings[value]
    else:
        d[key] = simplify(value)

print(d)
Here is an approach with eval. I consider it unreliable, but it works for your example.
>>> import re
>>>
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>>
>>> eval("{"+re.sub('(\w+)=("[^"]+"|[\d.]+)','"\\1":\\2',s)+"}")
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}
>>>
Update:
Better use the one pointed out by Chris Lutz in the comments. I believe it is more reliable, because it should work even if there are (single/double) quotes in the dict values.
Here's a somewhat more robust version of the regexp solution:
import re
keyval_re = re.compile(r'''
\s* # Leading whitespace is ok.
(?P<key>\w+)\s*=\s*( # Search for a key followed by..
(?P<str>"[^"]*"|\'[^\']*\')| # a quoted string; or
(?P<float>\d+\.\d+)| # a float; or
(?P<int>\d+) # an int.
)\s*,?\s* # Handle comma & trailing whitespace.
|(?P<garbage>.+) # Complain if we get anything else!
''', re.VERBOSE)
def handle_keyval(match):
    if match.group('garbage'):
        raise ValueError("Parse error: unable to parse: %r" %
                         match.group('garbage'))
    key = match.group('key')
    if match.group('str') is not None:
        return (key, match.group('str')[1:-1]) # strip quotes
    elif match.group('float') is not None:
        return (key, float(match.group('float')))
    elif match.group('int') is not None:
        return (key, int(match.group('int')))
It automatically converts floats & ints to the right type; handles single and double quotes; handles extraneous whitespace in various locations; and complains if a badly formatted string is supplied.
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>> print dict(handle_keyval(m) for m in keyval_re.finditer(s))
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}
do it step by step
d={}
mystring='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"';
s = mystring.split(", ")
for item in s:
    i=item.split("=",1)
    d[i[0]]=i[-1]
print d
I think you just need to set maxsplit=1, for instance the following should work.
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
newDict = dict(map(lambda z: z.split("=",1), string.split(", ")))
Edit (see comment):
I didn't notice that ", " was a value under avatar, the best approach would be to escape ", " wherever you are generating data. Even better would be something like JSON ;). However, as an alternative to regexp, you could try using shlex, which I think produces cleaner looking code.
import shlex
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
lex = shlex.shlex ( string )
lex.whitespace += "," # Default whitespace doesn't include commas
lex.wordchars += "." # Word char should include . to catch decimal
words = [ x for x in iter( lex.get_token, '' ) ]
newDict = dict ( zip( words[0::3], words[2::3]) )
Always comma separated? Use the CSV module to split the line into parts (not checked):
import csv
import cStringIO
parts=csv.reader(cStringIO.StringIO(<string to parse>)).next()