I need some help converting a string to a dict. The string is not in a common format; it is the output of a PySpark UDF.
The value returned by the UDF looks like the string below:
"{list=[{a=1}, {a=2}, {a=3}]}"
I need to convert it to a Python dictionary with the structure below:
{
    "list": [
        {"a": 1},
        {"a": 2},
        {"a": 3}
    ]
}
so that I can access its values, like:
dict["list"][1]["a"]
I already tried using:
json.loads()
ast.literal_eval()
Could someone please help me?
As an example of how this unparsed string is generated:
from pyspark.sql.functions import udf

@udf()
def execute_method():
    return {"list": [{"a": 1}, {"b": 1}, {"c": 1}]}

df_result = df_source.withColumn("result", execute_method())
At the very least, you will need to replace = with : and surround the keys with double quotes:
import json
import re
string = "{list=[{a=1}, {a=2}, {a=3}]}"
fixed_string = re.sub(r'(\w+)=', r'"\1":', string)
print(type(fixed_string), fixed_string)
parsed = json.loads(fixed_string)
print(type(parsed), parsed)
Output:
<class 'str'> {"list":[{"a":1}, {"a":2}, {"a":3}]}
<class 'dict'> {'list': [{'a': 1}, {'a': 2}, {'a': 3}]}
Try this:
import re
import json
data="{list=[{a=1}, {a=2}, {a=3}]}"
data=data.replace('=',':')
pattern=[e.group() for e in re.finditer('[a-z]+', data, flags=re.IGNORECASE)]
for e in set(pattern):
    data=data.replace(e,"\""+e+"\"")
print(json.loads(data))
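As a minimal sketch, the regex-based approach above can be wrapped in a small helper so the result can be indexed the way the question asks. The function name parse_udf_string is hypothetical, and the sketch assumes that keys are plain word characters and that all values are numbers, lists, or nested maps; bare string values would still need quoting.

```python
import json
import re


def parse_udf_string(text):
    """Turn a string like '{list=[{a=1}]}' into a Python dict (hypothetical helper)."""
    # Quote every bare key and replace the '=' that follows it with ':'
    quoted = re.sub(r'(\w+)=', r'"\1":', text)
    return json.loads(quoted)


parsed = parse_udf_string("{list=[{a=1}, {a=2}, {a=3}]}")
print(parsed["list"][1]["a"])  # prints 2
```

If the UDF is under your control, having it return a JSON string built with json.dumps is likely simpler than repairing the text afterwards.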
Related
I am creating a JSON file from a pseudo-XML file. However, I get commas between the JSON objects, which I don't want.
This is a sample of what I get:
[{"a": a , "b": b } , {"a": a , "b": b }]
However, I want this:
{"a": a , "b": b } {"a": a , "b": b }
It might not be valid JSON, but I want it that way so that I can shuffle it by doing:
shuf -n 100000 original.json > sample.json
Otherwise, it will be just one big line of JSON.
This is my code:
import json
import optparse

from bs4 import BeautifulSoup


def read_html_file(file_name):
    # Read the whole file and parse it with the built-in HTML parser
    with open(file_name, "r", encoding="ISO-8859-1") as f:
        html = f.read()
    parsed_html = BeautifulSoup(html, "html.parser")
    return parsed_html


def process_reviews(parsed_html):
    # Collect one dict per <review> element
    reviews = []
    for r in parsed_html.findAll('review'):
        review_text = r.find('review_text').text
        asin = r.find('asin').text
        rating = r.find('rating').text
        product_type = r.find('product_type').text
        reviewer_location = r.find('reviewer_location').text
        reviews.append({
            'review_text': review_text.strip(),
            'asin': asin.strip(),
            'rating': rating.strip(),
            'product_type': product_type.strip(),
            'reviewer_location': reviewer_location.strip()
        })
    return reviews


def write_json_file(file_name, reviews):
    # Dumps the whole list as a single JSON array on one line
    with open('{f}.json'.format(f=file_name), 'w') as outfile:
        json.dump(reviews, outfile)


if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-f', '--file_name', action="store", dest="file_name",
                      help="name of the input html file to parse", default="positive.html")
    options, args = parser.parse_args()
    file_name = options.file_name
    html = read_html_file(file_name)
    reviews_list = process_reviews(html)
    write_json_file(file_name, reviews_list)
The outer [ ] comes from reviews = [], and I can remove it manually, but I also don't want commas between my JSON objects.
What you are asking for is simply not JSON. The standard requires a comma between the elements of an array. You have two options going forward:
Update your parser to match the standards (highly recommended).
For display purposes, or other internal processing you may have, in case you really want the structure you specified: capture the JSON object and transform it to something else, but please do not call it JSON, because it isn't.
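One common way to get one object per line, which is what a line-based tool like shuf needs, is to serialize each dict separately and end it with a newline (often called JSON Lines). The sketch below is only an illustration: the helper names write_json_lines and read_json_lines and the temporary file path are assumptions, not part of the original code.

```python
import json
import os
import tempfile


def write_json_lines(path, reviews):
    """Write each dict as its own JSON document on a separate line."""
    with open(path, "w", encoding="utf-8") as outfile:
        for review in reviews:
            outfile.write(json.dumps(review) + "\n")


def read_json_lines(path):
    """Parse the file back, one JSON document per non-empty line."""
    with open(path, encoding="utf-8") as infile:
        return [json.loads(line) for line in infile if line.strip()]


sample = [{"a": "a", "b": "b"}, {"a": "a", "b": "b"}]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_json_lines(path, sample)
print(read_json_lines(path))
```

Each line is valid JSON on its own, even though the file as a whole is not a single JSON document, so the consumer has to parse it line by line.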
You're mixing up a few concepts in your question!
1. What you have is not a dict, but a list of dicts.
2. You don't have JSON, neither in your input list nor in your expected output.
As for a solution: if you simply want to print your objects without the commas separating them, you only need to print every element of the list, which you can do with:
sample = [{"a": "a" , "b": "b" } , {"a": "a" , "b": "b" }]
print(" ".join([str(element) for element in sample]))
Now, if what you really want is to manipulate it as a JSON object, you have two options using the json library:
Treat each element of your sample as JSON and manipulate it individually
The elements are already structured like JSON objects, so you can use the json library to pretty-print them as strings (dumps) or manipulate them in any other way:
import json
for element in sample:
    print(json.dumps(element, indent = 4))
Turn your sample list into a single JSON object
You can add all your elements under a single key, say one called elements:
sample_json = {"elements": []}
for data in sample:
    sample_json["elements"].append(data)
# Output from sample_json
# {'elements': [{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'b'}]}
Or you can add every element under a different key. As an example, I'll create a counter, and each value of the counter will be the key for one element:
sample_json = {}
counter = 0
for data in sample:
    sample_json[counter] = data
    counter += 1
# Output from sample_json
# {0: {'a': 'a', 'b': 'b'}, 1: {'a': 'a', 'b': 'b'}}
You could use text keys as well, for this second case.
Here is the problem: I have a string in the following format (note: there are no line breaks). I simply want this string to be deserialized into a Python dictionary or a JSON object so that I can navigate it easily. I have tried both ast.literal_eval and json, but the end result is either an error or simply another string. I have been scratching my head over this for some time, and I suspect there is a simpler and more elegant solution than writing my own parser.
{
table_name:
{
"columns":
[
{
"col_1":{"col_1_1":"value_1_1","col_1_2":"value_1_2"},
"col_2":{"col_2_1":"value_2_1","col_2_2":"value_2_2"},
"col_3":"value_3","col_4":"value_4","col_5":"value_5"}],
"Rows":1,"Total":1,"Flag":1,"Instruction":none
}
}
Note that the JSON decoder expects each property name to be enclosed in double quotes. Use the following approach with the re.sub() and json.loads() functions:
import json, re
s = '{table_name:{"columns":[{"col_1":{"col_1_1":"value_1_1","col_1_2":"value_1_2"},"col_2":{"col_2_1":"value_2_1","col_2_2":"value_2_2"},"col_3":"value_3","col_4":"value_4","col_5":"value_5"}],"Rows":1,"Total":1,"Flag":1,"Instruction":none}}'
s = re.sub(r'\b(?<!\")([_\w]+)(?=\:)', r'"\1"', s).replace('none', '"None"')
obj = json.loads(s)
print(obj)
The output:
{'table_name': {'columns': [{'col_5': 'value_5', 'col_2': {'col_2_1': 'value_2_1', 'col_2_2': 'value_2_2'}, 'col_3': 'value_3', 'col_1': {'col_1_2': 'value_1_2', 'col_1_1': 'value_1_1'}, 'col_4': 'value_4'}], 'Flag': 1, 'Total': 1, 'Instruction': 'None', 'Rows': 1}}
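The approach above turns the bare none into the string "None". If a real null value is preferred, a variation along the following lines might work. This is a sketch on a shortened version of the input, and it assumes that unquoted keys only appear right after { or , and that no string value contains text such as ,word: or :none.

```python
import json
import re

s = '{table_name:{"columns":[{"col_3":"value_3"}],"Rows":1,"Instruction":none}}'

# Quote bare identifiers that directly follow '{' or ',' and precede ':'
s = re.sub(r'(?<=[{,])\s*([A-Za-z_]\w*)\s*(?=:)', r'"\1"', s)
# Map a bare none value to JSON null, which json.loads turns into None
s = re.sub(r'(?<=:)\s*none\b', 'null', s)

obj = json.loads(s)
print(obj["table_name"]["Instruction"])  # prints None
```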
I am writing a test in Python where I specify the JSON string in a parameter as follows:
json = '...[{"MemberOperand":{
"AttributeName":"TEST",
"Comparison":"=",
"Value":"Test"}
}]...'
In this example I have the value as "Test", but I want to run the test with several values. Could you tell me how I can parameterize the value of "Value"?
You can construct proper JSON:
import json
the_value = 'Test'
data = [{"MemberOperand": {
    "AttributeName": "TEST",
    "Comparison": "=",
    "Value": the_value}
}]
json_text = json.dumps(data)
This is a regular (nested) dictionary formatted as a string:
import json

def changer(x):
    # Parse the JSON string, then overwrite one field of the resulting structure
    d = json.loads('[{"MemberOperand":{"AttributeName":"TEST","Comparison":"=","Value":"Test"}}]')
    d[0]['MemberOperand']['AttributeName'] = x
    return d

print(changer('New_TEST'))
Output:
[{'MemberOperand': {'AttributeName': 'New_TEST', 'Comparison': '=', 'Value': 'Test'}}]
Add a function that returns a different JSON string each time, based on the value passed as a parameter:
import json

def get_mock_json(value='Test'):
    # json.dumps quotes and escapes the value so the result stays valid JSON
    return '...[{"MemberOperand":{"AttributeName":"TEST","Comparison":"=","Value":%s}}]...' % json.dumps(value)

print(get_mock_json('test'))
print(get_mock_json('ttttttest'))
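To run the same test with several values, one option is to build the payload from a Python structure for each value and let json.dumps handle the quoting. A rough sketch follows; the names build_payload and test_values are hypothetical, and how the payload is fed into the test depends on the test framework in use.

```python
import json


def build_payload(value):
    """Return the JSON text for one test case (hypothetical helper)."""
    data = [{"MemberOperand": {
        "AttributeName": "TEST",
        "Comparison": "=",
        "Value": value,
    }}]
    return json.dumps(data)


# Values with quotes or non-string types are escaped correctly by json.dumps
test_values = ["Test", 'say "hi"', 42]
payloads = [build_payload(v) for v in test_values]
for payload in payloads:
    print(payload)
```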
I'm parsing a really big log file with some embedded JSON.
I'll see lines like this:
foo="{my_object:foo, bar:baz}" a=b c=d
The problem is that the embedded JSON can contain spaces, but outside of the JSON, spaces act as tuple delimiters (except where they have unquoted strings; huzzah for whatever idiot thought that was a good idea). I'm not sure how to figure out where the JSON string ends without reimplementing large portions of a JSON parser.
Is there a JSON parser for Python where I can give it '{"my_object":"foo", "bar":"baz"} asdfasdf', and it can return ({'my_object' : 'foo', 'bar':'baz'}, 'asdfasdf'), or am I going to have to reimplement the JSON parser by hand?
Found a really cool answer. Use json.JSONDecoder's scan_once function
In [30]: import json
In [31]: d = json.JSONDecoder()
In [32]: my_string = 'key="{"foo":"bar"}"more_gibberish'
In [33]: d.scan_once(my_string, 5)
Out[33]: ({u'foo': u'bar'}, 18)
In [37]: my_string[18:]
Out[37]: '"more_gibberish'
Just be careful with the start index:
In [38]: d.scan_once(my_string, 6)
Out[38]: (u'foo', 11)
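The documented counterpart of this trick is JSONDecoder.raw_decode, which parses one JSON document from the start of a string and returns the object together with the index where it ended. A small sketch, assuming the embedded JSON is properly quoted; the wrapper name split_json_prefix is made up for illustration.

```python
import json


def split_json_prefix(text):
    """Return (parsed_object, remaining_text) for a string that starts with JSON."""
    decoder = json.JSONDecoder()
    # raw_decode stops at the end of the first JSON document and ignores the rest
    obj, end = decoder.raw_decode(text)
    return obj, text[end:].lstrip()


obj, rest = split_json_prefix('{"my_object":"foo", "bar":"baz"} asdfasdf')
print(obj)   # prints {'my_object': 'foo', 'bar': 'baz'}
print(rest)  # prints asdfasdf
```

Note that raw_decode expects the JSON to begin at the very start of the text it is given, so any prefix such as foo=" has to be sliced off first.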
Match everything around it.
>>> re.search('^foo="(.*)" a=.+ c=.+$', 'foo="{my_object:foo, bar:baz}" a=b c=d').group(1)
'{my_object:foo, bar:baz}'
Use shlex and json.
Something like:
import shlex
import json
def decode_line(line):
    decoded = {}
    # shlex.split honours the quoting, so the JSON value stays in one field
    fields = shlex.split(line)
    for f in fields:
        k, v = f.split('=', 1)
        if k == "foo":
            v = json.loads(v)
        decoded[k] = v
    return decoded
This does assume that the JSON inside the quotes is quoted properly.
Here's a short example program that uses the above (shlex.quote replaces pipes.quote on Python 3):
testdict = {"hello": "world", "foo": "bar"}
line = 'foo=' + shlex.quote(json.dumps(testdict)) + ' a=b c=d'
print(line)
print(decode_line(line))
With output:
foo='{"hello": "world", "foo": "bar"}' a=b c=d
{'foo': {'hello': 'world', 'foo': 'bar'}, 'a': 'b', 'c': 'd'}
I want to convert a query string like this:
a=1&b=2
to a JSON string:
{"a":1, "b":2}
Is there an existing solution?
Python 3+
import json
from urllib.parse import parse_qs
json.dumps(parse_qs("a=1&b=2"))
Python 2:
import json
from urlparse import parse_qs
json.dumps(parse_qs("a=1&b=2"))
In both cases the result is
'{"a": ["1"], "b": ["2"]}'
This is actually better than your {"a":1, "b":2}, because URL query strings can legally contain the same key multiple times, i.e. multiple values per key.
>>> strs="a=1&b=2"
>>> {x.split('=')[0]:int(x.split('=')[1]) for x in strs.split("&")}
{'a': 1, 'b': 2}
Python 3.x
from json import dumps
from urllib.parse import parse_qs
dumps(parse_qs("a=1&b=2"))
yields
'{"a": ["1"], "b": ["2"]}'
dict((itm.split('=')[0],itm.split('=')[1]) for itm in qstring.split('&'))
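If the exact shape from the question ({"a": 1, "b": 2}, with numbers instead of lists of strings) is needed, the standard-library parser can be combined with a small conversion step. A sketch only: the function name query_to_json is hypothetical, it assumes that digit-only values should become integers, and a repeated key keeps just its last value.

```python
import json
from urllib.parse import parse_qsl


def query_to_json(query):
    """Convert 'a=1&b=2' into a JSON object string (hypothetical helper)."""
    result = {}
    # parse_qsl also handles percent-decoding of keys and values
    for key, value in parse_qsl(query):
        # Assumption: digit-only values are meant to be integers
        result[key] = int(value) if value.isdigit() else value
    return json.dumps(result)


print(query_to_json("a=1&b=2"))  # prints {"a": 1, "b": 2}
```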