I am trying to parse JSON input that arrives as a string in Python, but I am not able to parse it into a list or dict since the input is not in proper JSON format. (Due to limitations in the middleware, I can't do much about that.)
{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
I tried json.loads and ast.literal_eval (invalid syntax error).
How can I load this?
The sad answer is: the contents of your "Records" field are simply not JSON. No amount of ad-hoc patching (= to :, adding quotes) will change that. You have to find out the language/format specification for what the producing system emits and write/find a proper parser for that particular format.
As a crutch, and only if the example above already captures all the variability you might see in production data, a much simpler approach based on regular expressions (see the re package or edd's pragmatic answer) might be sufficient.
If the producer of the data is consistent, you can start with something like the following, which aims to bridge the JSON gap.
import re
import json
source = {
    "Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
s = source["Records"]
# Start by removing any extraneous whitespace
s2 = re.sub(r'\s', '', s)
# Surround every word with double quotes
s3 = re.sub(r'(\w+)', r'"\g<1>"', s2)
# Replace = with :
s4 = re.sub('=', ':', s3)
# Lastly, fix the missing closing ] and }
## Note that }} is an escaped } inside an f-string.
s5 = f"{s4}]}}"
>>> json.loads(s5)
{'Output': [{'_fields': [{'Entity': 'ABC', 'No': '12345', 'LineNo': '1', 'EffDate': '20200630'}, {'Entity': 'ABC', 'No': '567', 'LineNo': '1', 'EffDate': '20200630'}]}]}
Follow up with some robust testing and have a nice polished ETL with your favorite tooling.
As I understand it, you are trying to parse the value of the Records item in the dictionary as JSON; unfortunately, you cannot.
The string in that value is not JSON, so you will have to write a parser yourself that first turns the string into a JSON string, following whatever format the producing system writes it in. (We don't know what "middleware" you are talking about, unfortunately.)
tl;dr: Parse it into a JSON string, then parse the JSON into a Python dictionary. Read this to find out more about the JSON (JavaScript Object Notation) rules.
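As a rough illustration of that two-step idea (a sketch only, assuming the sample in the question captures the whole format and that only the innermost {...} groups carry key=value data):
import re
import json

raw = '{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}'

records = []
# Only the innermost {...} groups contain plain key=value pairs.
for group in re.findall(r'\{([^{}]*)\}', raw):
    row = {}
    for pair in group.split(','):
        key, _, value = pair.partition('=')
        row[key.strip()] = value.strip()
    records.append(row)

# First build a proper JSON string, then load it back as a dict.
json_string = json.dumps({"Output": [{"_fields": records}]})
data = json.loads(json_string)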
There you go. This chain of string replacements will push the value toward valid JSON (you may still need to quote the bare values such as ABC and 12345):
notjson = """{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}"""
notjson = notjson.replace("=", "':")   # adds a single quote and makes it more valid
notjson = notjson.replace("{", "{'")
notjson = notjson.replace(", ", ", '")
notjson = notjson.replace("}, '{", "}, {")
fixed = "{" + notjson[2:]              # avoid calling this "json", which would shadow the json module
print(fixed)
print(notjson)
E.g. I have a dictionary:
{"John": "Doe", "Jane" : "Smith", "Jack" : "Poland"}
I want to create a list
["John Doe", "Jane Smith", "Jack Poland"]
Given dict names, you can use an f-string in a list comprehension.
[f'{key} {value}' for key, value in names.items()]
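For example, with the question's dictionary bound to the (assumed) name names:
names = {"John": "Doe", "Jane": "Smith", "Jack": "Poland"}
full_names = [f'{key} {value}' for key, value in names.items()]
print(full_names)  # ['John Doe', 'Jane Smith', 'Jack Poland']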
An f-string in a list comprehension would be the most efficient, which it looks like someone has already answered.
Another way, which might be easier to wrap your head around, would be to just do a for loop and iterate through the dictionary once. Use some Python string concatenation, which you may have learned already, and then append the result to a new list.
(FYI: the f-string approach gives the same result with less code.)
Using string concatenation:
full_name = []
name_dict = {"John": "Doe", "Jane": "Smith", "Jack": "Poland"}
for first, last in name_dict.items():
    name = first + " " + last
    full_name.append(name)
print(full_name)
There are a few other ways too, but this one and the f-string popped out at me right off the bat.
You could also define multiple variables and lists at the top, which is a little less confusing than putting everything on one line, is similar to other exercises you may have done, and keeps things more organized.
Another neat thing is that the combined name sits in its own variable inside the for loop; it isn't in your new list until you append it. That is handy because you can use the variable for anything else you want to do with the combined string, not just append it to the list.
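For instance, here is a small, purely illustrative variation on the loop above that uses the intermediate variable for something else (a made-up length check) before appending:
full_name = []
name_dict = {"John": "Doe", "Jane": "Smith", "Jack": "Poland"}
for first, last in name_dict.items():
    name = first + " " + last
    if len(name) > 30:          # do something else with the combined string first
        name = name[:30]
    full_name.append(name)
print(full_name)  # ['John Doe', 'Jane Smith', 'Jack Poland']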
Try this out:
dic = {"John": "Doe", "Jane" : "Smith", "Jack" : "Poland"}
lst =[key+" "+value for key,value in dic.items()]
print(lst)
output:
['Jack Poland', 'John Doe', 'Jane Smith']
The ultimate goal, or the origin of the problem, is to have a field compatible with json_extract_path_text in Redshift.
This is how it looks right now:
{'error': "Feed load failed: Parameter 'url' must be a string, not object", 'errorCode': 3, 'event_origin': 'app', 'screen_id': '6118964227874465', 'screen_class': 'Promotion'}
To extract the field I need from the string in Redshift, I replaced single quotes with double quotes.
This particular record gives an error because the value of error itself contains a single quote ('url'); if that one gets replaced as well, the string becomes invalid JSON.
So what I need is:
{"error": "Feed load failed: Parameter 'url' must be a string, not object", "errorCode": 3, "event_origin": "app", "screen_id": "6118964227874465", "screen_class": "Promotion"}
There are several ways; one is to use the regex module with
"[^"]*"(*SKIP)(*FAIL)|'
See a demo on regex101.com.
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
new_string = rx.sub('"', old_string)
With the original re module, you'd need to use a function and check whether the group has been matched or not; (*SKIP)(*FAIL) lets you avoid exactly that.
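For comparison, here is a rough equivalent with the standard re module, using a shortened version of the sample record from the question; the callback is exactly the extra step that (*SKIP)(*FAIL) lets you skip:
import re

old_string = ("{'error': \"Feed load failed: Parameter 'url' must be a string, not object\", "
              "'errorCode': 3, 'event_origin': 'app'}")

# Match either a whole double-quoted span (group 1) or a lone single quote;
# keep the former untouched and rewrite the latter to a double quote.
pattern = re.compile(r'("[^"]*")|\'')

def replace(match):
    return match.group(1) if match.group(1) else '"'

new_string = pattern.sub(replace, old_string)
print(new_string)
# {"error": "Feed load failed: Parameter 'url' must be a string, not object", "errorCode": 3, "event_origin": "app"}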
I tried a regex approach but found it too complicated and slow, so I wrote a simple "bracket parser" which keeps track of the current quotation mode. It cannot handle multiple levels of nesting; you'd need a stack for that. For my use case, converting str(dict) output to proper JSON, it works:
example input:
{'cities': [{'name': "Upper Hell's Gate"}, {'name': "N'zeto"}]}
example output:
{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}'
python unit test
def testSingleToDoubleQuote(self):
    jsonStr = '''
    {
      "cities": [
        {
          "name": "Upper Hell's Gate"
        },
        {
          "name": "N'zeto"
        }
      ]
    }
    '''
    listOfDicts = json.loads(jsonStr)
    dictStr = str(listOfDicts)
    if self.debug:
        print(dictStr)
    jsonStr2 = JSONAble.singleQuoteToDoubleQuote(dictStr)
    if self.debug:
        print(jsonStr2)
    self.assertEqual('''{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}''', jsonStr2)
singleQuoteToDoubleQuote
def singleQuoteToDoubleQuote(singleQuoted):
    '''
    convert a single quoted string to a double quoted one

    Args:
        singleQuoted(string): a single quoted string e.g. {'cities': [{'name': "Upper Hell's Gate"}]}

    Returns:
        string: the double quoted version of the string

    see
        - https://stackoverflow.com/questions/55600788/python-replace-single-quotes-with-double-quotes-but-leave-ones-within-double-q
    '''
    cList = list(singleQuoted)
    inDouble = False
    inSingle = False
    for i, c in enumerate(cList):
        # print("%d:%s %r %r" % (i, c, inSingle, inDouble))
        if c == "'":
            if not inDouble:
                inSingle = not inSingle
                cList[i] = '"'
        elif c == '"':
            inDouble = not inDouble
    doubleQuoted = "".join(cList)
    return doubleQuoted
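A quick usage sketch (assuming the function above is importable as a plain function; in the original code it lives on a JSONAble class):
import json

s = '''{'cities': [{'name': "Upper Hell's Gate"}, {'name': "N'zeto"}]}'''
fixed = singleQuoteToDoubleQuote(s)
print(fixed)
# {"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}
print(json.loads(fixed)["cities"][1]["name"])  # N'zeto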
I have some data that are not properly saved in an old database. I am moving the system to a new database and reformatting the old data as well. The old data looks like this:
a:10:{
s:7:"step_no";s:1:"1";
s:9:"YOUR_NAME";s:14:"Firtname Lastname";
s:11:"CITIZENSHIP"; s:7:"Indian";
s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited";
s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment";
s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";
s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";
s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content";
s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital";
s:14:"ANOTHER_AMOUNT";s:0:"";
}
I want the new format to be proper JSON so I can read it in Python just like this:
data = {
    "step_no": "1",
    "YOUR_NAME": "Firtname Lastname",
    "CITIZENSHIP": "Indian",
    "PROPOSE_NAME_BUSINESS1": "ABC Limited",
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment",
    "PROPOSE_NAME_BUSINESS3": "",
    "PROPOSE_NAME_BUSINESS4": "",
    "PURPOSE_NATURE_BUSINESS": "Some dummy content",
    "CAPITAL_COMPANY": "20 Million Capital",
    "ANOTHER_AMOUNT": ""
}
I am thinking that using regex to strip out the unwanted parts and reformat the content using the names in caps would work, but I don't know how to go about it.
Regexes would be the wrong approach here. There is no need, and the format is a little more complex than you assume it is.
You have data in the PHP serialize format. You can trivially deserialise it in Python with the phpserialize library:
import phpserialize
import json

def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # PHP has no lists, only mappings; produce a list for
            # a dictionary with integer keys to 'repair'
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True)))
Note that PHP strings are byte strings, not Unicode text, so especially in Python 3 you'd have to decode your key-value pairs after the fact if you want to be able to re-encode to JSON. The decode_strings=True flag takes care of this for you. The default is UTF-8; pass in an encoding argument to pick a different codec.
PHP also uses arrays for sequences, so you may have to convert any decoded dict object with integer keys to a list first, which is what the fixup_php_arrays() function does.
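For example, a quick illustration with a hand-written serialised PHP list (the byte string below is made up, not data from the question):
>>> phpserialize.loads(b'a:2:{i:0;s:3:"foo";i:1;s:3:"bar";}', decode_strings=True)
{0: 'foo', 1: 'bar'}
>>> fixup_php_arrays(phpserialize.loads(b'a:2:{i:0;s:3:"foo";i:1;s:3:"bar";}', decode_strings=True))
['foo', 'bar']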
Demo (with repaired data, many string lengths were off and whitespace was added):
>>> import phpserialize, json
>>> from pprint import pprint
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}'
>>> pprint(phpserialize.loads(data, decode_strings=True))
{'ANOTHER_AMOUNT': '',
'CAPITAL_COMPANY': '20 Million Capital',
'CITIZENSHIP': 'Indian',
'PROPOSE_NAME_BUSINESS1': 'ABC Limited',
'PROPOSE_NAME_BUSINESS2': 'XYZ Investment',
'PROPOSE_NAME_BUSINESS3': '',
'PROPOSE_NAME_BUSINESS4': '',
'PURPOSE_NATURE_BUSINESS': 'Some dummy content',
'YOUR_NAME': 'Firstname Lastname',
'step_no': '1'}
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4))
{
"ANOTHER_AMOUNT": "",
"CAPITAL_COMPANY": "20 Million Capital",
"CITIZENSHIP": "Indian",
"PROPOSE_NAME_BUSINESS1": "ABC Limited",
"PROPOSE_NAME_BUSINESS2": "XYZ Investment",
"PROPOSE_NAME_BUSINESS3": "",
"PROPOSE_NAME_BUSINESS4": "",
"PURPOSE_NATURE_BUSINESS": "Some dummy content",
"YOUR_NAME": "Firstname Lastname",
"step_no": "1"
}
What is the simplest way to convert a string of keyword=values to a dictionary, for example the following string:
name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"
to the following python dictionary:
{'name':'John Smith', 'age':34, 'height':173.2, 'location':'US', 'avatar':':,=)'}
The 'avatar' key is just to show that the strings can contain = and , characters, so a simple split won't do. Any ideas? Thanks!
This works for me:
import re

# s is the question's string
# get all the items
matches = re.findall(r'\w+=".+?"', s) + re.findall(r'\w+=[\d.]+', s)
# partition each match at '=' (findall returns plain strings, so split them directly)
matches = [m.split('=', 1) for m in matches]
# use the results to make a dict
d = dict(matches)
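A quick usage sketch against the question's string; note that the double quotes stay in the values, so an extra cleanup pass (the clean helper below is only an illustration) may be wanted:
import re

s = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
matches = re.findall(r'\w+=".+?"', s) + re.findall(r'\w+=[\d.]+', s)
d = dict(m.split('=', 1) for m in matches)

def clean(v):
    # strip surrounding quotes, otherwise convert to a number
    if v.startswith('"') and v.endswith('"'):
        return v[1:-1]
    return float(v) if '.' in v else int(v)

d = {k: clean(v) for k, v in d.items()}
print(d)
# {'name': 'John Smith', 'location': 'US', 'avatar': ':,=)', 'age': 34, 'height': 173.2}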
I would suggest a lazy way of doing this.
test_string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
eval("dict({})".format(test_string))
{'age': 34, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith', 'height': 173.2}
Hope this helps someone!
Edit: since the csv module doesn't deal as desired with quotes inside fields, it takes a bit more work to implement this functionality:
import re

quoted = re.compile(r'"[^"]*"')

class QuoteSaver(object):

    def __init__(self):
        self.saver = dict()
        self.reverser = dict()

    def preserve(self, mo):
        s = mo.group()
        if s not in self.saver:
            self.saver[s] = '"%d"' % len(self.saver)
            self.reverser[self.saver[s]] = s
        return self.saver[s]

    def expand(self, mo):
        return self.reverser[mo.group()]

x = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
qs = QuoteSaver()
y = quoted.sub(qs.preserve, x)
kvs_strings = y.split(',')
kvs_pairs = [kv.split('=') for kv in kvs_strings]
kvs_restored = [(k, quoted.sub(qs.expand, v)) for k, v in kvs_pairs]

def converter(v):
    if v.startswith('"'): return v.strip('"')
    try: return int(v)
    except ValueError: return float(v)

thedict = dict((k.strip(), converter(v)) for k, v in kvs_restored)

for k in thedict:
    print "%-8s %s" % (k, thedict[k])
print thedict
I'm emitting thedict twice to show exactly how and why it differs from the required result; the output is:
age 34
location US
name John Smith
avatar :,=)
height 173.2
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)',
'height': 173.19999999999999}
As you see, the output for the floating point value is as requested when directly emitted with print, but it isn't and cannot be (since there IS no floating point value that would display 173.2 in such a case!-) when the print is applied to the whole dict (because that inevitably uses repr on the keys and values -- and the repr of 173.2 has that form, given the usual issues about how floating point values are stored in binary, not in decimal, etc.). You might define a dict subclass which overrides __str__ to special-case floating-point values, I guess, if that's indeed a requirement.
But, I hope this distraction doesn't interfere with the core idea -- as long as the doublequotes are properly balanced (and there are no doublequotes-inside-doublequotes), this code does perform the required task of preserving "special characters" (commas and equal signs, in this case) from being taken in their normal sense when they're inside double quotes, even if the double quotes start inside a "field" rather than at the beginning of the field (csv only deals with the latter condition). Insert a few intermediate prints if the way the code works is not obvious -- first it changes all "double quoted fields" into a specially simple form ("0", "1" and so on), while separately recording what the actual contents corresponding to those simple forms are; at the end, the simple forms are changed back into the original contents. Double-quote stripping (for strings) and transformation of the unquoted strings into integers or floats is finally handled by the simple converter function.
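If that display really were a requirement, a rough sketch of the suggested dict subclass (the class name and %g formatting are illustrative choices, reusing thedict from the code above) might look like:
class FloatFriendlyDict(dict):
    # override __str__ to special-case floats so 173.2 prints as 173.2
    def __str__(self):
        items = ", ".join(
            "%r: %s" % (k, ("%g" % v) if isinstance(v, float) else repr(v))
            for k, v in self.items()
        )
        return "{%s}" % items

print(FloatFriendlyDict(thedict))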
Here is a more verbose approach to the problem using pyparsing. Note the parse actions, which do the automatic conversion of types from strings to ints or floats. Also, the QuotedString class implicitly strips the quotation marks from the quoted value. Finally, the Dict class takes each 'key = val' group in the comma-delimited list and assigns results names using the key and value tokens.
from pyparsing import *
key = Word(alphas)
EQ = Suppress('=')
real = Regex(r'[+-]?\d+\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
qs = QuotedString('"')
value = real | integer | qs
dictstring = Dict(delimitedList(Group(key + EQ + value)))
Now to parse your original text string, storing the results in dd. Pyparsing returns an object of type ParseResults, but this class has many dict-like features (support for keys(), items(), in, etc.), or can emit a true Python dict by calling asDict(). Calling dump() shows all of the tokens in the original parsed list, plus all of the named items. The last two examples show how to access named items within a ParseResults as if they were attributes of a Python object.
text = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
dd = dictstring.parseString(text)
print dd.keys()
print dd.items()
print dd.dump()
print dd.asDict()
print dd.name
print dd.avatar
Prints:
['age', 'location', 'name', 'avatar', 'height']
[('age', 34), ('location', 'US'), ('name', 'John Smith'), ('avatar', ':,=)'), ('height', 173.19999999999999)]
[['name', 'John Smith'], ['age', 34], ['height', 173.19999999999999], ['location', 'US'], ['avatar', ':,=)']]
- age: 34
- avatar: :,=)
- height: 173.2
- location: US
- name: John Smith
{'age': 34, 'height': 173.19999999999999, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith'}
John Smith
:,=)
The following code produces the correct behavior, but is just a bit long! I've added a space in the avatar to show that it deals well with commas and spaces and equal signs inside the string. Any suggestions to shorten it?
import hashlib

string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
strings = {}

def simplify(value):
    try:
        return int(value)
    except:
        return float(value)

while True:
    try:
        p1 = string.index('"')
        p2 = string.index('"', p1 + 1)
        substring = string[p1 + 1:p2]
        key = hashlib.md5(substring).hexdigest()
        strings[key] = substring
        string = string[:p1] + key + string[p2 + 1:]
    except:
        break

d = {}
for pair in string.split(', '):
    key, value = pair.split('=')
    if value in strings:
        d[key] = strings[value]
    else:
        d[key] = simplify(value)

print d
Here is an approach with eval. I consider it unreliable, but it works for your example.
>>> import re
>>>
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>>
>>> eval("{"+re.sub('(\w+)=("[^"]+"|[\d.]+)','"\\1":\\2',s)+"}")
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}
>>>
Update:
Better to use the approach pointed out by Chris Lutz in the comments; I believe it's more reliable, because it may still work even when there are (single/double) quotes inside the dict values.
Here's a somewhat more robust version of the regexp solution:
import re

keyval_re = re.compile(r'''
    \s*                               # Leading whitespace is ok.
    (?P<key>\w+)\s*=\s*(              # Search for a key followed by..
        (?P<str>"[^"]*"|\'[^\']*\')|  #   a quoted string; or
        (?P<float>\d+\.\d+)|          #   a float; or
        (?P<int>\d+)                  #   an int.
    )\s*,?\s*                         # Handle comma & trailing whitespace.
    |(?P<garbage>.+)                  # Complain if we get anything else!
    ''', re.VERBOSE)

def handle_keyval(match):
    if match.group('garbage'):
        raise ValueError("Parse error: unable to parse: %r" %
                         match.group('garbage'))
    key = match.group('key')
    if match.group('str') is not None:
        return (key, match.group('str')[1:-1])  # strip quotes
    elif match.group('float') is not None:
        return (key, float(match.group('float')))
    elif match.group('int') is not None:
        return (key, int(match.group('int')))
It automatically converts floats and ints to the right type, handles single and double quotes, handles extraneous whitespace in various locations, and complains if a badly formatted string is supplied:
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>> print dict(handle_keyval(m) for m in keyval_re.finditer(s))
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}
Do it step by step:
d = {}
mystring = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
s = mystring.split(", ")
for item in s:
    i = item.split("=", 1)
    d[i[0]] = i[-1]
print d
I think you just need to set maxsplit=1; for instance, the following should work:
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
newDict = dict(map( lambda(z): z.split("=",1), string.split(", ") ))
Edit (see comment):
I didn't notice that ", " appeared inside the avatar value; the best approach would be to escape ", " wherever you are generating the data. Even better would be something like JSON ;). However, as an alternative to regexp, you could try using shlex, which I think produces cleaner-looking code.
import shlex

string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
lex = shlex.shlex(string)
lex.whitespace += ","  # default whitespace doesn't include commas
lex.wordchars += "."   # word chars should include . to catch the decimal
words = [x for x in iter(lex.get_token, '')]
newDict = dict(zip(words[0::3], words[2::3]))
Always comma separated? Use the CSV module to split the line into parts (not checked):
import csv
import cStringIO
parts=csv.reader(cStringIO.StringIO(<string to parse>)).next()