I am creating a JSON file from a pseudo-XML file. However, I get commas between the JSON objects, which I don't want.
This is a sample of what I get:
[{"a": a , "b": b } , {"a": a , "b": b }]
However, I want this:
{"a": a , "b": b } {"a": a , "b": b }
It might not be valid JSON, but I want it that way so that I can shuffle it by doing:
shuf -n 100000 original.json > sample.json
Otherwise, it will be just one big line of JSON.
This is my code:
import json
import optparse

from bs4 import BeautifulSoup

def read_html_file(file_name):
    # Read the whole pseudo-XML file and parse it with the lenient HTML parser
    with open(file_name, "r", encoding="ISO-8859-1") as f:
        html = f.read()
    parsed_html = BeautifulSoup(html, "html.parser")
    return parsed_html

def process_reviews(parsed_html):
    # Collect one dict per <review> element
    reviews = []
    for r in parsed_html.findAll('review'):
        review_text = r.find('review_text').text
        asin = r.find('asin').text
        rating = r.find('rating').text
        product_type = r.find('product_type').text
        reviewer_location = r.find('reviewer_location').text
        reviews.append({
            'review_text': review_text.strip(),
            'asin': asin.strip(),
            'rating': rating.strip(),
            'product_type': product_type.strip(),
            'reviewer_location': reviewer_location.strip()
        })
    return reviews

def write_json_file(file_name, reviews):
    # Dumps the whole list as a single JSON array on one line
    with open('{f}.json'.format(f=file_name), 'w') as outfile:
        json.dump(reviews, outfile)

if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-f', '--file_name', action="store", dest="file_name",
                      help="name of the input html file to parse", default="positive.html")
    options, args = parser.parse_args()
    file_name = options.file_name
    html = read_html_file(file_name)
    reviews_list = process_reviews(html)
    write_json_file(file_name, reviews_list)
The outer [ ] comes from reviews = [], and I can remove it manually, but I also don't want commas between my JSON objects.
What you are asking for is simply not JSON. By definition, the standard requires a comma between the elements of an array. You have two options going forward:
Update your parser to match the standards (highly recommended).
For display purposes, or other internal processing you may have, in case you really want the structure you specified: capture the JSON object and transform it to something else, but please do not call it JSON, because it isn't.
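For the shuffling use case in the question, one common choice for that "something else" is to write one JSON document per line (often called JSON Lines or newline-delimited JSON), which line-oriented tools such as shuf can handle. A minimal sketch follows; the function name write_json_lines and the .jsonl extension are choices made for this example, not part of the original code:

```python
import json
import os
import tempfile


def write_json_lines(path, records):
    """Write each record as a single-line JSON document."""
    with open(path, 'w') as outfile:
        for record in records:
            # Without indent, json.dumps escapes newlines inside strings,
            # so every record stays on exactly one line
            outfile.write(json.dumps(record) + '\n')


reviews = [{"a": "a", "b": "b"}, {"a": "a", "b": "b"}]
path = os.path.join(tempfile.mkdtemp(), 'sample.jsonl')
write_json_lines(path, reviews)

with open(path) as infile:
    print(infile.read())
```

Each line of the resulting file can be parsed back on its own with json.loads, even after the lines have been shuffled.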
You're mixing a few concepts in your question!
1. What you have is not a dict, but a list of dicts.
2. You don't have JSON, neither in your input list nor in your expected output.
Now for the solution: if you simply want to print your objects without the commas separating them, you only need to print every element of the list, which you can do with:
sample = [{"a": "a" , "b": "b" } , {"a": "a" , "b": "b" }]
print(" ".join([str(element) for element in sample]))
Now, if what you really want is to manipulate it as JSON, you have two options using the json lib:
Add each element from your sample as JSON and manipulate it individually
Each element is already a JSON-serializable dict, so you can use the json lib to pretty-print it as a string (dumps) or manipulate it in any other way:
import json
for element in sample:
    print(json.dumps(element, indent=4))
Make your sample list become JSON
You can either add all your elements under a single key, say a key called elements, which would be:
sample_json = {"elements": []}
for data in sample:
    sample_json["elements"].append(data)
# Output from sample_json
# {'elements': [{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'b'}]}
Or you can add each element under a different key. As an example, I'll create a counter, and each value of the counter will be the key for that specific element:
sample_json = {}
counter = 0
for data in sample:
    sample_json[counter] = data
    counter += 1
# Output from sample_json
# {0: {'a': 'a', 'b': 'b'}, 1: {'a': 'a', 'b': 'b'}}
You could use text keys as well, for this second case.
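One detail worth knowing if a counter-keyed dict like this is later serialized: JSON object keys are always strings, so json.dumps converts the integer keys to strings. A small sketch, reusing the sample list from above (enumerate stands in for the manual counter):

```python
import json

sample = [{"a": "a", "b": "b"}, {"a": "a", "b": "b"}]

# Key each element by its position in the list
sample_json = {index: data for index, data in enumerate(sample)}

# JSON object keys must be strings, so the int keys 0 and 1
# come back as "0" and "1" after a dump/load round trip
encoded = json.dumps(sample_json)
decoded = json.loads(encoded)
print(encoded)
print(list(decoded.keys()))
```

If the keys need to stay integers after loading, they have to be converted back explicitly.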
I need a little help converting a string to a dict. The string is not in a common format; it is the output of a UDF.
The return from the PySpark UDF looks like the string below:
"{list=[{a=1}, {a=2}, {a=3}]}"
And I need to convert it to a python dictionary with the structure below:
{
    "list": [
        {"a": 1},
        {"a": 2},
        {"a": 3}
    ]
}
So I can access its values, like
dict["list"][1]["a"]
I already tried using:
json.loads
ast.literal_eval()
Could someone please help me?
As an example of how this unparsed string is generated:
from pyspark.sql.functions import udf
@udf()
def execute_method():
    return {"list": [{"a": 1}, {"b": 1}, {"c": 1}]}
df_result = df_source.withColumn("result", execute_method())
At the very least, you will need to replace = with : and surround the keys with double quotes:
import json
import re
string = "{list=[{a=1}, {a=2}, {a=3}]}"
fixed_string = re.sub(r'(\w+)=', r'"\1":', string)
print(type(fixed_string), fixed_string)
parsed = json.loads(fixed_string)
print(type(parsed), parsed)
outputs
<class 'str'> {"list":[{"a":1}, {"a":2}, {"a":3}]}
<class 'dict'> {'list': [{'a': 1}, {'a': 2}, {'a': 3}]}
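Try this:
import re
import json
data="{list=[{a=1}, {a=2}, {a=3}]}"
data=data.replace('=',':')
# Collect every alphabetic token (the keys) so each can be quoted
pattern=[e.group() for e in re.finditer('[a-z]+', data, flags=re.IGNORECASE)]
# Note: plain str.replace assumes no key is a substring of another key
for e in set(pattern):
    data=data.replace(e,"\""+e+"\"")
print(json.loads(data))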
I'm a newbie in Python trying to turn information from an Excel file into JSON output.
I'm trying to parse this Python list:
value = ['Position: Backstab, Gouge,', 'SumPosition: DoubleParse, Pineapple']
into this JSON format:
"value": [
    {
        "Position": [
            "Backstab, Gouge,"
        ]
    },
    {
        "SumPosition": [
            "DoubleParse, Pineapple"
        ]
    }
]
Please note:
This list was previously a string:
value = 'Position: Backstab, Gouge, SumPosition: DoubleParse, Pineapple'
I turned it into a list using re.split(), but I still can't turn each list item into a dict whose value is a list.
Is that even possible? Should I format the list/string as JSON directly, or prepare the string beforehand so that it can be passed to json.dump?
Thanks in advance!
You can iterate over the list to achieve the desired result.
d = {'value': []}
for val in value:
    # Split on the first colon only, in case the value itself contains one
    k, v = val.split(':', 1)
    tmp = {k.strip(): [v.strip()]}
    d['value'].append(tmp)
print(d)
{'value': [{'Position': ['Backstab, Gouge,']},
{'SumPosition': ['DoubleParse, Pineapple']}]}
Here is a quick way.
value = ['Position: Backstab, Gouge,',
'SumPosition: DoubleParse, Pineapple']
dictionary_result = {}
for line in value:
    key, vals = line.split(':')
    vals = vals.split(',')
    dictionary_result[key] = vals
Remaining tasks for you: strip the whitespace and drop the empty strings from result lists like [' Backstab', ' Gouge', ''], and actually convert the data from a Python dict to a JSON file.
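Those remaining steps might look like the following sketch. The helper name parse_lines is made up for this illustration, and json.dumps is used instead of writing a file so that the example stays self-contained:

```python
import json

value = ['Position: Backstab, Gouge,',
         'SumPosition: DoubleParse, Pineapple']


def parse_lines(lines):
    """Map each 'key: a, b, c' line to key -> list of cleaned values."""
    result = {}
    for line in lines:
        key, vals = line.split(':', 1)
        # Strip whitespace and drop the empty string left by a trailing comma
        result[key.strip()] = [v.strip() for v in vals.split(',') if v.strip()]
    return result


dictionary_result = parse_lines(value)
json_text = json.dumps(dictionary_result, indent=4)
print(json_text)
```

To write a file instead, json.dump(dictionary_result, outfile) with an open file handle does the same serialization.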
I have a JSON file that is getting continuously appended with new data. Each time it gets updated I need it to be "well-formed". The problem is that my JSON looks like this (each item is dumped serially):
{"one": 1},
{"two": 2}
I need the data to be properly formed so enclosing in square brackets could work, or an outer curly bracket. But I'm not quite sure how to do that.
[
{"one": 1},
{"two": 2}
]
Here is the code performing the JSON writing:
import json
def printJSONFile(data):
    # fullpath and serialize are defined elsewhere in the program
    json_dump = json.dumps(data, default=serialize)
    try:
        jf = open(fullpath, "a+")
        jf.write(json_dump + ",\n")
        jf.close()
    except IOError:
        print("ERROR: Unable to open/write to {}".format(fullpath))
    return
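Since the file is only ever appended to, one option is to leave the writer alone and repair the text at read time. The sketch below assumes every record is followed by a comma and a newline, exactly as printJSONFile writes it; the helper name load_appended_json is hypothetical:

```python
import json


def load_appended_json(text):
    """Parse text made of JSON values separated by commas and newlines."""
    # Drop surrounding whitespace and the trailing comma, then wrap in brackets
    body = text.strip().rstrip(',')
    return json.loads('[' + body + ']')


records = load_appended_json('{"one": 1},\n{"two": 2},\n')
print(records)
```

This keeps the append cheap, at the cost of the file on disk not being valid JSON by itself.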
I have a text file that contains several thousand JSON objects (meaning the textual representation of JSON) one after the other. They're not separated, and I would prefer not to modify the source file. How can I load/parse each JSON object in Python? (I have seen this question, but if I'm not mistaken, it only works for a list of JSON objects already separated by commas.) My file looks like this:
{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}...
I don't see a clean way to do this without using the real JSON parser. The other options, modifying the text or using a non-JSON parser, are risky. So the best way to go is to find a way to iterate using the real JSON parser, so that you're sure to comply with the JSON spec.
The core idea is to let the real JSON parser do all the work in identifying the groups:
import json, re
combined = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
start = 0
while start != len(combined):
    try:
        # Succeeds only when a single object is left
        result = json.loads(combined[start:])
        end = len(combined)
    except ValueError as e:
        # Find the location where the parsing failed
        end = start + int(re.search(r'column (\d+)', e.args[0]).group(1)) - 1
        result = json.loads(combined[start:end])
    start = end
    print(result)
This outputs:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
I think the following would work as long as there are no non-comma-delimited json arrays of json sub-objects inside any of the outermost json objects. It's somewhat brute-force in that it reads the whole file into memory and attempts to fix it.
import json
def get_json_array(filename):
    with open(filename, 'rt') as jsonfile:
        json_array = '[{}]'.format(jsonfile.read().replace('}{', '},{'))
    return json.loads(json_array)
for obj in get_json_array('multiobj.json'):
    print(obj)
Output:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
Instead of modifying the source file, just make a copy. Use a regex to replace }{ with },{ and then hopefully a pre-built json reader will take care of it nicely.
EDIT: quick solution:
from re import sub
with open(inputfile, 'r') as fin:
    text = sub(r'\}\{', '},{', fin.read())
with open(outfile, 'w') as fout:
    fout.write('[')
    fout.write(text)
    fout.write(']')
>>> import ast
>>> s = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
>>> [ast.literal_eval(ele + '}') for ele in s.split('}')[:-1]]
[{'json': 1}, {'json': 2}, {'json': 3}, {'json': 4}, {'json': 5}]
Provided you have no nested objects and splitting on '}' is feasible, this can be accomplished pretty simply.
Here is one Pythonic way to do it:
from json.scanner import make_scanner
from json import JSONDecoder
def load_jsons(multi_json_str):
    s = multi_json_str.strip()
    scanner = make_scanner(JSONDecoder())
    idx = 0
    objects = []
    while idx < len(s):
        obj, idx = scanner(s, idx)
        objects.append(obj)
    return objects
I think json was never supposed to be used this way, but it solves your problem.
I agree with @Raymond Hettinger: you need to use json itself to do the work, because text manipulation doesn't work for complex JSON objects. His answer parses the exception message to find the split position. It works, but it looks like a hack and hence is not Pythonic :)
EDIT:
Just found out this is actually supported by the json module; just use raw_decode like this:
decoder = JSONDecoder()
# raw_decode returns the decoded object and the index where it ended in the string
first_obj, end = decoder.raw_decode(multi_json_str)
Read http://pymotw.com/2/json/index.html#mixed-data-streams
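Extending the raw_decode call to the whole string could look like the sketch below. The generator name iter_json_objects is invented for this example. The whitespace skipping is an addition of this sketch, because raw_decode does not accept leading whitespace at the start index:

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    idx = 0
    length = len(text)
    while idx < length:
        # Skip any whitespace between documents before decoding
        while idx < length and text[idx].isspace():
            idx += 1
        if idx >= length:
            break
        # raw_decode returns the value and the index just past it
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


multi_json_str = '{"json":1}{"json":2} \n{"json":3}'
objects = list(iter_json_objects(multi_json_str))
print(objects)
```

Because the real decoder finds each boundary, nested objects and braces inside strings are handled without any text manipulation.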
I have JSON data as an array of dictionaries which comes as the request payload.
[
{ "Field1": 1, "Feld2": "5" },
{ "Field1": 3, "Feld2": "6" }
]
I tried ijson.items(f, '') which yields the entire JSON object as one single item. Is there a way I can iterate the items inside the array one by one using ijson?
Here is the sample code I tried which is yielding the JSON as one single object.
import ijson
f = open("metadatam1.json")
objs = ijson.items(f, '')
for o in objs:
    print(str(o) + "\n")
[{'Feld2': u'5', 'Field1': 1}, {'Feld2': u'6', 'Field1': 3}]
I'm not very familiar with ijson, but reading some of its code it looks like calling items with a prefix of "item" should work to get the items of the array, rather than the top-level object:
for item in ijson.items(f, "item"):
    # do stuff with the item dict
    print(item)