Parsing unquoted JavaScript object literal as JSON using YAML & JSON modules - python

I have scraped a JavaScript object which I can't parse as JSON because it has unquoted keys.
I found a solution here which says to load the object as a Python data structure using the PyYaml library, and then write it back out as valid JSON:
https://stackoverflow.com/a/31030022/10601287
This would be a great solution for me, however yaml.load(js_obj) causes the keys & values to merge together as a key, and causes the value to default to 'None'. This is my code snippet:
import yaml
yaml_obj = yaml.safe_load(js_obj)
print(yaml_obj)
Example of the JavaScript object before being loaded as YAML (in reality it is much bigger than this):
{
path:"1/83656/83659/83669/83670",
is_active:!0,
level:4,
children_count:0,
product_count:59,
parent_id:83669,
name:"Red Wine",
position:1,
id:83670,
include_in_menu:1,
url_key:"red-wine-83670",
url_path:"liquor/wine/red-wine.html",
_score:null,
slug:"red-wine-83670"
}
After yaml.load(js_obj):
{
'path:"1/83656/83659/83669/83670"': None,
'is_active:!0': None,
'level:4': None,
'children_count:0': None,
'product_count:59': None,
'parent_id:83669': None,
'name:"Red Wine"': None,
'position:1': None,
'id:83670': None,
'include_in_menu:1': None,
'url_key:"red-wine-83670"': None,
'url_path:"liquor/wine/red-wine.html"': None,
'_score:null': None,
'slug:"red-wine-83670"': None
}
Any advice would be greatly appreciated.

YAML requires the colon in a mapping to be followed by at least one space character, so your input isn't valid YAML either. If the format is as simple as your example indicates, you could preprocess it into YAML by searching for a word at the beginning of a line followed by a colon and inserting a space after the colon. (Or you could insert quotes around the word to make it JSON, but you're going to have a problem with is_active:!0, because !0 isn't a JSON value.)
So you could try something like:
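import re
import yaml
# re.MULTILINE lets ^ match at the start of every line, not just the first
first_word = re.compile(r"^\s*[_a-zA-Z]\w*:", re.MULTILINE)
# ...
# compiled patterns have .sub(), not .replace(); \g<0> is the whole match
yaml_obj = yaml.safe_load(first_word.sub(r"\g<0> ", js_obj))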
Of course, if the input is less regular, that could fail horribly.
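If you would rather go straight to JSON, here is a minimal sketch that uses only the standard library. It assumes the input looks like the example in the question: one key per line, and !0 / !1 appearing only as minified true / false values directly after a colon. The names js_object_to_json, _KEY and _MINIFIED_BOOL are made up for this illustration. It also sidesteps !0 on the YAML side, where ! introduces a tag and so may not load as a boolean.

```python
import json
import re

# Unquoted key at the start of a line, e.g. `path:` or `_score:`.
_KEY = re.compile(r'^(\s*)([_a-zA-Z]\w*):', re.MULTILINE)
# Minified booleans: `!0` means true and `!1` means false in JavaScript.
# Assumption: they only occur as whole values right after a colon.
_MINIFIED_BOOL = re.compile(r':\s*!([01])(?=\s*[,}])')


def js_object_to_json(js_obj):
    """Quote the keys, expand !0/!1, then parse the result as JSON."""
    quoted = _KEY.sub(r'\1"\2":', js_obj)
    fixed = _MINIFIED_BOOL.sub(
        lambda m: ': true' if m.group(1) == '0' else ': false', quoted)
    return json.loads(fixed)


js_obj = '''{
path:"1/83656/83659/83669/83670",
is_active:!0,
level:4,
name:"Red Wine",
_score:null
}'''

parsed = js_object_to_json(js_obj)
print(json.dumps(parsed, indent=2))
```

Like the YAML approach, this is regex preprocessing, so a string value that happens to contain :!0 or a key that is not at the start of a line would break it.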

Related

JSON Parsing with python from Rethink database [Python]

I'm trying to retrieve data from a database named RethinkDB. It outputs JSON when called with r.db("Databasename").table("tablename").insert([{ "id or primary key": line}]).run(), and when doing so it outputs [{'id': 'ValueInRowOfid\n'}]. I want to parse that down to just the value, e.g. "ValueInRowOfid". I've tried the json module in Python, but I always end up with TypeError: list indices must be integers or slices, not str, and I've been told that this is because the database outputs invalid JSON. My question is how a JSON format can be invalid (I can't see what is invalid about the output), and what would be the best way to parse it so that the value "ValueInRowOfid" ends up in a variable, e.g. Value = ("ValueInRowOfid").
This part imports the modules used and connects to RethinkDB:
import json
from rethinkdb import RethinkDB
r = RethinkDB()
r.connect( "localhost", 28015).repl()
This part is getting the output/value and my trial at parsing it:
getvalue = r.db("Databasename").table("tablename").sample(1).run() # gets a single row/value from the table
print(getvalue) # If I print that, it shows [{'id': 'ValueInRowOfid\n'}]
dumper = json.dumps(getvalue) # I can't call `json.loads(getvalue)` directly, as the JSON object must be a str, which the output of the database isn't (the output is a list)
parsevalue = json.loads(dumper) # After `json.dumps(getvalue)` I can load it, but I can't use the loaded JSON.
print(parsevalue["id"]) # This now says the list index is a str and needs to be an integer or slice. Quite frustrating, as it contradicts itself: first it wants a str and now it can't use a str
print(parsevalue{'id'}) # I also tried to shuffle it around as seen here, but still the same result
I know this is janky, and it may be hard to comprehend the level of stupidity I might be on, as I don't know whether this is the simplest of problems or something that just isn't possible (which it should be, or else I can't use the data in my database).
Thank you for reading this through and not jumping straight into the comments to say that I have to read the JSON documentation, because I have, and I haven't found a single piece that could help me.
I tried reading the documentation and watching tutorials about JSON and JSON parsing. I also looked for others who have had the same problem and couldn't find any.
It looks like it's returning a dictionary ({}) inside a list ([]) of one element.
Try:
getvalue = r.db("Databasename").table("tablename").sample(1).run()
print(getvalue[0]['id'])
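To spell that out: what you printed is already a Python list holding one dict, so there is nothing left for json to parse. A small sketch, using a hard-coded list as a stand-in for the driver's result (the value is copied from your printed output, so no database is needed to run it):

```python
import json

# Stand-in for the printed result in the question: a Python list
# containing one dict, not a JSON string.
getvalue = [{'id': 'ValueInRowOfid\n'}]

# Round-tripping through json gives back an equal list, so it adds nothing.
round_tripped = json.loads(json.dumps(getvalue))
print(round_tripped == getvalue)

first_row = getvalue[0]          # index the list with an integer first
value = first_row['id'].strip()  # then the dict by key; strip() drops the "\n"
print(value)
```

The TypeError came from indexing the list with "id"; the list needs an integer index before the dict can be indexed by key.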

Decoding json data to Python dictionary

I am currently trying to create a dictionary from a json formatted server response:
{"id": null,{"version": "1.1","result": "9QtirjtH9b","error": null}}
Therefore I am using json.loads(). But I always get the following error:
ValueError: Expecting property name: line 1 column 12 (char 12)
I know that this means that there is an error in the json syntax and I found some threads (like this one) here at stackoverflow, but they did not include an answer that solved my problem.
However, I was not sure if the null value within the json response causes the error, so I had a closer look at the json.org Reference Manual and it seems to be a valid syntax. Any ideas?
It's not valid. The outer object needs a property name for the second element; raw values are not valid in an object.
{"id": null, "somename":{"version": "1.1","result": "9QtirjtH9b","error": null}}
The problem here is the lack of a key for the nested object, not the null. You'd need to find a way to fix that syntax or parse it yourself.
If we make a few assumptions about the syntax, you should be able to use a regular expression to fix the JSON data before decoding:
import re
from itertools import count
def _gen_id(match, count=count()):
    return '{1}"generated_id_{0}":{2}'.format(next(count), *match.groups())
_no_key = re.compile(r'(,)({)')
def fix_json(json_data):
    return _no_key.sub(_gen_id, json_data)
This assumes that any ,{ combo indicates the location of a missing key, and generates one to insert there. That is a reasonable assumption to make, but may break things if you have string data with exactly that sequence.
Demo:
>>> json_data = '{"id": null,{"version": "1.1","result": "9QtirjtH9b","error": null}}'
>>> fix_json(json_data)
'{"id": null,"generated_id_0":{"version": "1.1","result": "9QtirjtH9b","error": null}}'
>>> json.loads(fix_json(json_data))
{u'id': None, u'generated_id_1': {u'version': u'1.1', u'result': u'9QtirjtH9b', u'error': None}}
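Note that the second call in the demo produced generated_id_1 rather than generated_id_0: the counter is a default argument, so it is shared across calls and keeps counting. If you would prefer the numbering to restart for every document, one possible variation (a sketch, not the only way to do it) is to create the counter inside fix_json:

```python
import json
import re
from itertools import count

_no_key = re.compile(r'(,)({)')


def fix_json(json_data):
    counter = count()  # fresh counter per call, so ids restart at 0

    def _gen_id(match):
        # match.group(1) is the comma, match.group(2) the opening brace
        return '{0}"generated_id_{1}":{2}'.format(
            match.group(1), next(counter), match.group(2))

    return _no_key.sub(_gen_id, json_data)


json_data = '{"id": null,{"version": "1.1","result": "9QtirjtH9b","error": null}}'
first = json.loads(fix_json(json_data))
second = json.loads(fix_json(json_data))
print(sorted(first), sorted(second))
```

The same caveat applies: string data containing the exact sequence ,{ would also be rewritten.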

urllib's urlencode returning weird encoded results

I'm trying to use Facebook's REST api, and am encoding a JSON string/dictionary using urllib.urlencode. The result I get however, is different from the correct encoded result (as displayed by pasting the dictionary in the attachment field here http://developers.facebook.com/docs/reference/rest/stream.publish/). I was wondering if anyone could offer any help.
Thanks.
EDIT:
I'm trying to encode the following dictionary:
{"media": [{"type":"flash", "swfsrc":"http://shopperspoll.webfactional.com/media/flashFile.swf", "height": '100', "width": '100', "expanded_width":"160", "expanded_height":"120", "imgsrc":"http://shopperspoll.webfactional.com/media/laptop1.jpg"}]}
This is the encoded string using urllib.urlencode:
"media=%5B%7B%27swfsrc%27%3A+%27http%3A%2F%2Fshopperspoll.webfactional.com%2Fmedia%2FflashFile.swf%27%2C+%27height%27%3A+%27100%27%2C+%27width%27%3A+%27100%27%2C+%27expanded_width%27%3A+%27160%27%2C+%27imgsrc%27%3A+%27http%3A%2F%2Fshopperspoll.webfactional.com%2Fmedia%2Flaptop1.jpg%27%2C+%27expanded_height%27%3A+%27120%27%2C+%27type%27%3A+%27flash%27%7D%5D"
It's not letting me copy the result being thrown out from the facebook rest documentation link, but on copying the above dictionary in the attachment field, the result is different.
urllib.urlencode isn't meant for urlencoding a single value (as functions of the same name are in many languages), but for encoding a dict of separate values. For example, if I had the dict {"a": 1, "b": 2} it would produce the string "a=1&b=2".
First, you want to encode your dict as JSON.
data = {"media": [{"type":"flash", "swfsrc":"http://shopperspoll.webfactional.com/media/flashFile.swf", "height": '100', "width": '100', "expanded_width":"160", "expanded_height":"120", "imgsrc":"http://shopperspoll.webfactional.com/media/laptop1.jpg"}]}
import json
json_encoded = json.dumps(data)
You can then either use urllib.urlencode to create a complete query string
import urllib
# `example` stands for your access token
urllib.urlencode({"access_token": example, "attachment": json_encoded})
# produces a long string in the form "access_token=...&attachment=..."
or use urllib.quote to just encode your attachment parameter
urllib.quote(json_encoded)
# produces just the part following "&attachment="
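The %27 sequences in the string from the question are single quotes: the list was turned into a string with str(), which gives Python's repr rather than JSON. Encoding with json.dumps first yields double quotes (%22) instead. The snippets above are Python 2; on Python 3 the same functions live in urllib.parse. A sketch of the Python 3 equivalent, with a shortened dict and a placeholder token that are only for illustration:

```python
import json
from urllib.parse import parse_qs, quote, unquote, urlencode

# Shortened version of the attachment from the question.
data = {"media": [{"type": "flash", "height": "100", "width": "100"}]}
json_encoded = json.dumps(data)

# Full query string; "PLACEHOLDER_TOKEN" is not a real token.
query = urlencode({"access_token": "PLACEHOLDER_TOKEN",
                   "attachment": json_encoded})

# Just the encoded attachment value.
attachment_only = quote(json_encoded)

# Both forms decode back to the original structure.
decoded_from_query = json.loads(parse_qs(query)["attachment"][0])
decoded_from_quote = json.loads(unquote(attachment_only))
print(query)
print(attachment_only)
```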

json object as get parameter

I'm writing an API for a mongo database. I need to pass a JSON object as a GET parameter:
example.com/api/obj/list/1/?find={"foo":"bar"}
How should I organize this better?
I thought about using JSON-like objects without quotes and spaces, for example:
{$or:[{a:foo+bar},{b:2}]}
So are there any tools to parse it in Python/Django?
It should be fine as long as the JSON objects aren't too big, they don't contain sensitive data (it sucks to see your password in your browser history) and you URL-escape them.
Unfortunately, you have to take shortcuts if you want to have a human-readable JSON parameter. All JSON brackets ({, }, [, ]) are recommended for escaping. You don't have to escape them, but you are taking a risk if you don't. More annoying is the :, which is ubiquitous in JSON and must be escaped.
If you want human-readable query strings, then the sensible solution is to encode all query parameters explicitly. A compromise that might work quite well is to unpack the top-level JSON object into explicit query parameters, each of which remains JSON-encoded. Going a small step further, you could drop any top-level delimiters that remain, e.g.:
JSON: {"foo":"bar", "items":[1, 2, 3], "staff":{"id":432, "first":"John", "last":"Doe"}}
Query: foo=bar&items=1,2,3&staff="id"%3A432,"first"%3A"John","last"%3A"Doe"
Since you know that foo is a string, items is an array and staff is an object, you can rehydrate the JSON syntax correctly before sending the lot to a JSON parser.
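A rough sketch of that rehydration step on the server side, using the example query above. The SCHEMA mapping and the rehydrate function are hypothetical names for this illustration; in a real API the expected type of each parameter would come from your own code.

```python
import json
from urllib.parse import parse_qs

# Hypothetical description of what each parameter is expected to be.
SCHEMA = {"foo": "string", "items": "array", "staff": "object"}


def rehydrate(query):
    """Put back the delimiters that were dropped, then parse as JSON."""
    params = parse_qs(query)  # also undoes the %3A escaping
    result = {}
    for name, kind in SCHEMA.items():
        raw = params[name][0]
        if kind == "string":
            result[name] = raw                           # plain string, no quotes sent
        elif kind == "array":
            result[name] = json.loads("[" + raw + "]")   # restore [ ]
        else:
            result[name] = json.loads("{" + raw + "}")   # restore { }
    return result


query = 'foo=bar&items=1,2,3&staff="id"%3A432,"first"%3A"John","last"%3A"Doe"'
rehydrated = rehydrate(query)
print(rehydrated)
```

This only works because the parameter types are known in advance; a string value containing & or = would still need to be escaped by the client.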

XML to store system paths in Python with lxml

I'm using an XML file to store the configuration for a piece of software.
One of these settings is a system path like
set_value = "c:\\test\\3 tests\\test"
I can store it using:
setting = etree.SubElement(settings, "setting", name=tmp_set_name, type=set_type, value=set_value)
If I use
doc.write(output_file, method='xml',encoding = 'utf-8', compression=0)
the file would be:
<setting type="str" name="MyPath" value="c:\test\3 tests\test"/>
Now I read it again with the etree.parse method. I obtain an etree child object with a string value, but the string contains the \3 character, and if I try to use it to write to XML again it will be interpreted! So I cannot use it as a path anymore.
Maybe I'm only missing a simple string operation, but I cannot see it =)
How would you solve this in a smart way?
This is just an example, but what do you think is the best way to store paths in XML and parse them with lxml?
Thank you!
> Now I read it again with the etree.parse method I obtain an etree child object with a string value, but the string contains the \3 character and if i try to use it to write again to xml it will be interpreted !!!!!
I just tried that, and it doesn't get "interpreted". The element's attributes, as returned after parsing, are:
{'type': 'str', 'name': 'yowza!', 'value': 'c:\\test\\3 tests\\test'}
So as you see this works just as you expected it to work. If you really have this problem, you are doing something else than what you are saying. Show us the real code, or make a small example code where you demonstrate the problem and use that.
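For reference, here is the kind of small round-trip test I mean. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; the calls used here (Element, SubElement, ElementTree.write, parse) have the same shape in lxml.etree, but treat that swap as an assumption and rerun it with lxml to be sure. The element and attribute names are taken from your example.

```python
import io
import xml.etree.ElementTree as etree

set_value = "c:\\test\\3 tests\\test"  # the string itself holds single backslashes

# Build <settings><setting .../></settings> and write it to an in-memory file.
settings = etree.Element("settings")
etree.SubElement(settings, "setting", name="MyPath", type="str", value=set_value)
buffer = io.BytesIO()
etree.ElementTree(settings).write(buffer, encoding="utf-8")

# Parse it back and read the attribute.
buffer.seek(0)
parsed_root = etree.parse(buffer).getroot()
read_back = parsed_root.find("setting").get("value")

print(read_back)               # shows the path with single backslashes
print(repr(read_back))         # repr doubles them, like the dict output above
print(read_back == set_value)
```

The doubled backslashes in the attribute dict are only how Python displays the string; the value itself is unchanged by the write and parse.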
