I am brand new to working with JSON data and fairly new to Python. I am struggling to parse the following JSON data in Python so that I can import it into a SQL Server database. I already have a program that imports the parsed data into SQL Server using pyodbc; however, I can't for the life of me figure out how to correctly parse the JSON data into a Python dictionary.
I know there are a number of threads that address this issue, but I was unable to find any examples with the same JSON structure. Any help would be greatly appreciated, as I am completely stuck on this. Thank you! Below is a cut of the JSON data I am working with:
{
"data":
[
{
"name": "Mobile Application",
"url": "https://www.example-url.com",
"metric": "users",
"package": "example_pkg",
"country": "USA",
"data": [
[ 1396137600000, 5.76 ],
[ 1396224000000, 5.79 ],
[ 1396310400000, 6.72 ],
....
[ 1487376000000, 7.15 ]
]
}
],"as_of":"2017-01-22"}
Again, I apologize if this thread is repetitive, but as I mentioned above, I was not able to work out the logic from the other threads because I am brand new to JSON.
Thank you again for any help or advice.
import json
with open("C:\\Pathyway\\7Park.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
The above code results in the following error:
Traceback (most recent call last):
File "JSONpy", line 10, in <module>
data = json.load(json_file)
File "C:\json\__init__.py", line 291, in load
**kw)
File "C:\json\__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "C:\json\decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 7 column 1 (char 23549 - 146249)
Assuming the data you've described (less the ... ellipsis) is in a file called j.json, this code parses the JSON document into a Python object:
import json
with open("j.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
From your error message it seems possible that your file is not a single JSON document, but a sequence of JSON documents separated by newlines. If that is the case, then this code might be more helpful:
import json
with open("j.json") as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data["data"][0]["metric"])
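Either way, once the document is parsed, the nested "data" arrays still need flattening into rows before they can go to SQL Server. Here is a rough sketch of that step; the row layout and the comment about pyodbc are assumptions to adapt to whatever schema your existing import program expects:
import json
from datetime import datetime, timezone

with open("j.json") as json_file:
    doc = json.load(json_file)

rows = []
for series in doc["data"]:
    for timestamp_ms, value in series["data"]:
        # The first element looks like a Unix timestamp in milliseconds
        when = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        rows.append((series["name"], series["country"], series["metric"], when, value))

# rows can then be handed to cursor.executemany(...) in your pyodbc program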
I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:
[
{
"orderkey": 1,
"name": "John",
"age": 23,
"email": "john#example.com"
},
{
"orderkey": 2,
"name": "Mark",
"age": 33,
"email": "mark#example.com"
},
...
]
The data is sorted by "orderkey", and I need to shuffle it.
I tried the following Python code. It worked for a smaller JSON file, but did not work for my 405 MB one.
import json
import random

with open("sorted.json") as f:
    data = json.load(f)

random.shuffle(data)

# The output file must be opened for writing before json.dump
with open("sorted.json", "w") as f:
    json.dump(data, f, indent=2)
How to do it?
UPDATE:
Initially I got the following error:
~/Desktop/shuffleData$ python3 toShuffle.py
Traceback (most recent call last):
File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
data = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)
Figured out that the problem was an extra "}" at the end of the JSON file. I had [{...},{...}]}, which is not valid.
Removing the trailing "}" fixed the problem.
The provided Python code works.
Well this should ideally work unless you have memory constraints.
import random
random.shuffle(data)
In case you are looking for another way and would like to benchmark which is faster for the huge set, you can use the scikit-learn library's shuffle function.
from sklearn.utils import shuffle
shuffled_data = shuffle(data)
print(shuffled_data)
Note: an additional package, scikit-learn, has to be installed (pip install -U scikit-learn).
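If you also care about getting the same shuffled order across runs, scikit-learn's shuffle accepts a random_state argument. A small usage note, assuming data is the list loaded from sorted.json as in the question:
from sklearn.utils import shuffle

# random_state makes the shuffled order reproducible across runs
shuffled_data = shuffle(data, random_state=42)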
I have a very big JSON file that I would like to import into Neo4j, but when I use APOC I get this error:
Failed to invoke procedure `apoc.import.json`: Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.LinkedHashMap<java.lang.Object,java.lang.Object>` from Array value (token `JsonToken.START_ARRAY`)
at [Source: (String)"[[{ "; line: 1, column: 1]
The code I am using to import the file is:
CALL apoc.import.json("file:///eight9.json")
The start of the file looks like this:
[[{
"id" : "149715690143449899009",
"objectType" : "activity",
"actor" : {
But when I checked online it is a valid JSON file.
It is complaining about "[[{ ". The example below is taken from the Neo4j documentation: https://neo4j.com/labs/apoc/4.3/import/load-json/. A JSON file in that format starts with {, so your JSON is NOT accepted by Neo4j.
For example:
{
"name":"Michael",
"age": 41,
"children": ["Selina","Rana","Selma"]
}
Please remove the [[ at the start and the ]] at the end of your file, then try it again.
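If editing the file by hand is impractical, here is a rough Python sketch of that clean-up, assuming eight9.json really is a JSON array wrapped in one extra pair of brackets ([[ ... ]]). The output file name and the one-object-per-line layout are choices made here, not requirements taken from the documentation:
import json

with open("eight9.json") as src:
    outer = json.load(src)

# Flatten the extra level of nesting: [[{...}, {...}]] -> [{...}, {...}]
records = [obj for inner in outer for obj in inner]

# Write each record as its own JSON object, one per line
with open("eight9_fixed.json", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")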
Python 3.8.5 with Pandas 1.1.3
This code, which converts a Python list of JSON dicts to CSV using pandas, works without issue:
import csv
import pandas as pd
import json
data = [{"results": [{"type": "ID", "value": "1234", "normalized": "1234", "count": 1, "offsets": [{"start": 14, "end": 25}], "id_b": "10"}, {"type": "ID", "value": "5678", "normalized": "5678", "count": 1, "offsets": [{"start": 32, "end": 43}], "id_b": "11"}], "responseHeaders": {"Date": "Tue, 25 May 2021 14:41:28 GMT", "Content-Type": "application/json", "Content-Length": "350", "Connection": "keep-alive", "Server": "openresty", "X-StuffAPI-ProcessedLanguage": "eng", "X-StuffAPI-Request-Id": "abcdef", "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload", "X-StuffAPI-App-Id": "123456789", "X-StuffAPI-Concurrency": "1"}}]
pd.read_json(json.dumps(data)).to_csv('file.csv')
The value in the data variable above is pasted directly from the response of an API call to one of our services. The problem occurs when I attempt to do everything including the API call in one script. Let's first look at everything in the script that seems to be working fine:
import csv
import pandas as pd
import json
import stuff.api

def run(key, url):
    # Create an API instance
    api = API(user_key=key, service_url=url)
    # submit data from a text file to the API parser
    file1 = open("123.txt", "r")
    text_data = file1.read()
    params = DocumentParameters()
    params["content"] = text_data
    file1.close()
    try:
        return api.data(params)
    except StuffAPIException as exception:
        print(exception)

if __name__ == '__main__':
    result = run('1234', 'https://192.168.0.125:8100/rest/')
    y = json.dumps(result)
    t = type(y)
    print(y)
    print(t)
The print(y) statement above returns exactly the data shown in the data variable in the first code block. The print(t) statement was there so I could capture the return type to help diagnose the issue; its result is <class 'str'>.
So now we add this right under the print(t) line (exactly as in the first code block):
pd.read_json(json.dumps(result)).to_csv('file.csv')
And I get this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I have seen the many threads about this error, but none of them seem to pertain exactly to what is happening here.
With my limited experience thus far, I am guessing this issue may be due to the return type being a string? I'm not sure, but this troubleshooting step is just the first hurdle to overcome: I eventually need to parse the data into separate columns of the CSV file, but for now I just need to get it into the CSV file without errors.
I understand you won't be able to fully reproduce this without access to my server, but hoping that's not needed to figure this out.
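One hedged guess, since pd.read_json worked when data was a list of response dicts: api.data(params) may be returning a single dict, so wrapping result in a list before serialising would reproduce the shape of the working example. This is a sketch to try inside the script above, not a confirmed fix:
# result is the dict returned by run(...) in the script above
pd.read_json(json.dumps([result])).to_csv('file.csv')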
I am trying to write a Python function as part of my job to check for the existence of data in a JSON file, which I can only get by downloading it from a website. I am the only resource here with any coding or scripting experience (HTML, CSS & SQL), so this has fallen to me to sort out. I have no experience with Python so far.
I am not allowed to change the structure or format of the JSON file. Its format is:
{
"naglowek": {
"dataGenerowaniaDanych": "20210514",
"liczbaTransformacji": "5000",
"schemat": "RRRRMMDDNNNNNNNNNNBBBBBBBBBBBBBBBBBBBBBBBBBB"
},
"skrotyPodatnikowCzynnych": [
"examplestring1",
"examplestring2",
"examplestring3",
"examplestring4",
],
"maski": [
"examplemask1",
"examplemask2",
"examplemask3",
"examplemask4"
]
}
I have tried numerous examples found online, but none of them seem to work. Based on various websites, the Python code I have is:
import json

with open('20210514.json') as myfile:
    data = json.load(myfile)

print(data)

keyVal = 'examplestring2'

if keyVal in data:
    # Print the success message and the value of the key
    print("Data is found in JSON data")
else:
    # Print the message if the value does not exist
    print("Data is not found in JSON data")
But I am getting the errors below. I am a complete newbie to Python, so I am having trouble deciphering them:
D:\PycharmProjects\venv\Scripts\python.exe D:/PycharmProjects/json_test.py
Traceback (most recent call last):
File "D:\PycharmProjects\json_test.py", line 4, in <module>
data = json.load(myfile)
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 12 column 5 (char 921)
Process finished with exit code 1
Any help would be massively appreciated!
{
"naglowek": {
"dataGenerowaniaDanych": "20210514",
"liczbaTransformacji": "5000",
"schemat": "RRRRMMDDNNNNNNNNNNBBBBBBBBBBBBBBBBBBBBBBBBBB"
},
"skrotyPodatnikowCzynnych": [
"examplestring1",
"examplestring2",
"examplestring3",
"examplestring4"
],
"maski": [
"examplemask1",
"examplemask2",
"examplemask3",
"examplemask4"
]
}
This should work. The problem here is that you have a comma at the end of a list, which your parser can't handle. ECMAScript 5 introduced the ability to parse trailing commas, but JSON in general doesn't support them (yet?). So make sure not to have a comma at the end of a list.
For your if-else statement to be correct, you'd have to change it to something like this:
keyVal = 'examplestring2'
keyName = 'skrotyPodatnikowCzynnych'

if keyName in data.keys() and keyVal in data[keyName]:
    # Print the success message and the value of the key
    print("Data is found in JSON data")
else:
    # Print the message if the value does not exist
    print("Data is not found in JSON data")
Remove the trailing comma. The JSON specification does not allow trailing commas.
If you don't want to change the file structure then you have to do this:
import yaml

with open('20210514.json') as myfile:
    data = yaml.load(myfile, Loader=yaml.FullLoader)

print(data)
You also need to install PyYAML first: https://pyyaml.org/
I am new to JSON. I am doing a project for Vehicle Number Plate Detection.
I have a dataset of the form:
{"content": "http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb0646e9cf9016473f1a561002a/77d1f81a-bee6-487c-aff2-0efa31a9925c____bd7f7862-d727-11e7-ad30-e18a56154311.jpg.jpeg","annotation":[{"label":["number_plate"],"notes":"","points":[{"x":0.7220843672456576,"y":0.5879828326180258},{"x":0.8684863523573201,"y":0.6888412017167382}],"imageWidth":806,"imageHeight":466}],"extras":null},
{"content": "http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb0646e9cf9016473f1a561002a/4eb236a3-6547-4103-b46f-3756d21128a9___06-Sanjay-Dutt.jpg.jpeg","annotation":[{"label":["number_plate"],"notes":"","points":[{"x":0.16194331983805668,"y":0.8507795100222717},{"x":0.582995951417004,"y":1}],"imageWidth":494,"imageHeight":449}],"extras":null},
There are 240 blocks of data in total.
I want to do two things with the above dataset.
First, I need to download all the images from each block, and second, I need to get the values of the "points" field into a text file.
I am having a problem getting those values.
import json

jsonFile = open('Indian_Number_plates.json', 'r')
x = json.load(jsonFile)

for criteria in x['annotation']:
    for key, value in criteria.items():
        print(key, 'is:', value)
        print('')
I have written the above code to get all the values under "annotation", but I am getting the following error:
Traceback (most recent call last):
File "prac.py", line 13, in <module>
x = json.load(jsonFile)
File "C:\python364\Lib\json\__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "C:\python364\Lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\python364\Lib\json\decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 394 (char 393)
Please help me get the values for the "points" field and also download the images from the links in the "content" field.
I found this answer while searching. Essentially, you can read an object, catch the exception when the JSON parser sees an unexpected object, and then seek/reparse and build up a list of objects.
In Java, I'd just tell you to use Jackson and its SAX-style streaming interface, as I've done that to read a list of objects formatted like this; if Python's JSON support has a streaming API, I'd use that instead of the exception-handler workaround.
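Python's standard library can do something similar without the exception handling: json.JSONDecoder.raw_decode parses one document from a string and reports where it stopped, so you can walk through the file record by record. A rough sketch of that idea:
import json

def parse_concatenated_json(text):
    decoder = json.JSONDecoder()
    index, records = 0, []
    while index < len(text):
        # Skip whitespace and the commas that separate the records
        while index < len(text) and text[index] in ' \t\r\n,':
            index += 1
        if index >= len(text):
            break
        # raw_decode returns the parsed object and the index where it ended
        obj, end = decoder.raw_decode(text, index)
        records.append(obj)
        index = end
    return records

with open('Indian_Number_plates.json') as f:
    records = parse_concatenated_json(f.read())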
The error comes because your file contains two or more records:
{"content": "http://com.dataturks.a96- } ..... {"content": .....
To solve this, you should reformat your JSON so that all the records are contained in an array:
{ "data" : [ {"content": "http://com.dataturks.a96- .... },{"content":... }]}
To download the images, extract the image names and URLs and use requests:
import requests

with open(image_name, 'wb') as handle:
    response = requests.get(pic_url, stream=True)
    if not response.ok:
        print(response)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
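As a follow-on for the second part of the question, the "points" values can be written to a text file once the records have been parsed; this assumes records is the list produced by either parsing sketch above:
import json

with open('points.txt', 'w') as out:
    for record in records:
        for annotation in record['annotation']:
            # One line per annotation, serialised back to JSON text
            out.write(json.dumps(annotation['points']) + '\n')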