How to import a JSON file to Neo4j - Python

I have a very big JSON file and I would like to import it into Neo4j, but when I use APOC I get this error:
Failed to invoke procedure `apoc.import.json`: Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.LinkedHashMap<java.lang.Object,java.lang.Object>` from Array value (token `JsonToken.START_ARRAY`)
at [Source: (String)"[[{ "; line: 1, column: 1]
The code I am using to import the file is:
CALL apoc.import.json("file:///eight9.json")
The Start of the file looks like this:
[[{
"id" : "149715690143449899009",
"objectType" : "activity",
"actor" : {
But when I checked it with online validators, it is reported as a valid JSON file.

It is complaining about "[[{ ". The following is taken from the Neo4j documentation: https://neo4j.com/labs/apoc/4.3/import/load-json/. A JSON file in the accepted format starts with {, so your JSON is NOT accepted by Neo4j.
For example:
{
    "name": "Michael",
    "age": 41,
    "children": ["Selina", "Rana", "Selma"]
}
Please remove the [[ at the start and the ]] at the end of your file, then try again.
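If editing a very big file by hand is impractical, a small Python sketch along these lines could strip the extra nesting (a sketch, assuming the whole file fits in memory and that the outer [[ ... ]] is exactly one redundant level of nesting; the file name comes from the question):

import json

# Parse the original [[{...}, ...]] structure.
with open("eight9.json") as f:
    data = json.load(f)

# Unwrap the redundant outer array, leaving a flat list of objects.
records = data[0] if data and isinstance(data[0], list) else data

# Write one JSON object per line, the newline-delimited style that
# apoc.export.json produces and that apoc.import.json is built around.
with open("eight9_fixed.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")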

Related

How to shuffle a big JSON file?

I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:
[
  {
    "orderkey": 1,
    "name": "John",
    "age": 23,
    "email": "john@example.com"
  },
  {
    "orderkey": 2,
    "name": "Mark",
    "age": 33,
    "email": "mark@example.com"
  },
  ...
]
The data is sorted by "orderkey" and I need to shuffle it.
I tried the following Python code. It worked for a smaller JSON file, but did not work for my 405 MB one.
import json
import random

with open("sorted.json") as f:
    data = json.load(f)

random.shuffle(data)

# Reopen in write mode; json.dump needs a writable file handle.
with open("sorted.json", "w") as f:
    json.dump(data, f, indent=2)
How to do it?
UPDATE:
Initially I got the following error:
~/Desktop/shuffleData$ python3 toShuffle.py
Traceback (most recent call last):
File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
data = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)
Figured out that the problem was a stray "}" at the end of the JSON file. I had [{...},{...}]}, which is not valid.
Removing the trailing "}" fixed the problem.
The provided Python code works.
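For anyone hitting the same "Extra data" error, json.JSONDecoder.raw_decode can show exactly what trails the first complete JSON value, since it returns the offset where parsing stopped (a small diagnostic sketch, not part of the original fix):

import json

with open("sorted.json") as f:
    text = f.read()

# raw_decode parses one JSON value and reports where it ended;
# anything after `end` is the "extra data" the error complains about.
data, end = json.JSONDecoder().raw_decode(text)
print(repr(text[end:]))  # e.g. '}' in the case described above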
Well, this should ideally work unless you have memory constraints.
import random
random.shuffle(data)
In case you are looking for another way and would like to benchmark which is faster for a huge data set, you can use the scikit-learn library's shuffle function.
from sklearn.utils import shuffle
shuffled_data = shuffle(data)
print(shuffled_data)
Note: an additional package, scikit-learn, has to be installed (pip install -U scikit-learn).
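If the final write is also a concern at this size, one option (my suggestion, not from the answers above) is to emit newline-delimited JSON instead of a pretty-printed array; the output stays compact and downstream tools can stream it line by line:

import json
import random

with open("sorted.json") as f:
    data = json.load(f)

random.shuffle(data)

# One compact object per line (JSON Lines) instead of indent=2.
with open("shuffled.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")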

Unable to parse JSON from file using Python3

I'm trying to get the value of ["pooled_metrics"]["vmaf"]["harmonic_mean"] from a JSON file I want to parse using Python. This is the current state of my code:
for crf in crf_ranges:
    vmaf_output_crf_list_log = job['config']['output_dir'] + '/' + build_name(stream) + f'/vmaf_{crf}.json'
    # read the vmaf_output_crf_list_log file and get value from ["pooled_metrics"]["vmaf"]["harmonic_mean"]
    with open(vmaf_output_crf_list_log, 'r') as json_vmaf_file:
        # load the json_string["pooled_metrics"] into a python dictionary
        vm = json.loads(json_vmaf_file.read())
    vmaf_values.append((crf, vm["pooled_metrics"]["vmaf"]["harmonic_mean"]))
This will give me back the following error:
AttributeError: 'dict' object has no attribute 'loads'
I always get back the same AttributeError no matter if I use "load" or "loads".
I validated the contents of the JSON with various online validators and it is valid, but still I am not able to load it for further parsing operations.
I expect that I can load a file that contains valid JSON data. The content of the file looks like this:
{
    "frames": [
        {
            "frameNum": 0,
            "metrics": {
                "integer_vif_scale2": 0.997330
            }
        }
    ],
    "pooled_metrics": {
        "vmaf": {
            "min": 89.617207,
            "harmonic_mean": 99.868023
        }
    },
    "aggregate_metrics": {
    }
}
Can somebody give me some advice on this behavior? Why does it seem so impossible to load this JSON file?
loads is a method of the json library, as the docs say: https://docs.python.org/3/library/json.html#json.loads. Since you are getting an AttributeError, you have probably created another variable named "json"; when you call json.loads, you are calling that variable instead of the module, and it has no loads method.
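A minimal reproduction of that shadowing mistake (the variable contents here are hypothetical):

import json

json = {"pooled_metrics": {}}  # oops: this variable now shadows the json module
json.loads('{"a": 1}')         # AttributeError: 'dict' object has no attribute 'loads'

Renaming that variable restores access to the real json.loads.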

Using Schema_Object in GoogleCloudStorageToBigQueryOperator in Airflow

I want to load GCS files written in JSON format into a BQ table through an Airflow DAG.
So I used the GoogleCloudStorageToBigQueryOperator. Additionally, to avoid using the autodetect option, I created a schema JSON file, stored in the same GCS bucket as my raw JSON data files, to be used as the schema_object.
Below is the JSON schema file:
[{"name": "id", "type": "INTEGER", "mode": "NULLABLE"},{"name": "description", "type": "INTEGER", "mode": "NULLABLE"}]
And my JSON raw data file looks like this (a newline-delimited JSON file):
{"description":"HR Department","id":9}
{"description":"Restaurant Department","id":10}
Here is what my operator looks like:
gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id=table_name + "_gcs_to_bq",
    bucket=bucket_name,
    bigquery_conn_id="bigquery_default",
    google_cloud_storage_conn_id="google_cloud_storage_default",
    source_objects=[table_name + "/{{ ds_nodash }}/data_json/*.json"],
    schema_object=table_name + "/{{ ds_nodash }}/data_json/schema_file.json",
    allow_jagged_rows=True,
    ignore_unknown_values=True,
    source_format="NEWLINE_DELIMITED_JSON",
    destination_project_dataset_table=project_id + "." + data_set + "." + table_name,
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    dag=dag,
)
The error I got is:
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false; Parser terminated before end of string File: schema_file.json
Could you please help me solve this issue?
Thanks in advance.
I see two problems:
Your BigQuery table schema is incorrect: the type of the description column is INTEGER instead of STRING. You have to set it to STRING (see the corrected schema file below).
You are using an old Airflow version. In recent versions, the schema object is retrieved by default from the bucket specified in the bucket parameter; for the old version, I am not sure about this behaviour. You can set the full path to check whether it solves your issue, for example: schema_object='gs://test-bucket/schema.json'
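For the first point, a sketch of the corrected schema file, identical to the one in the question except for the description type:

[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "description", "type": "STRING", "mode": "NULLABLE"}
]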

How to test and check JSON files and make a report of the errors

I am currently using jsonschema with Python to check my JSON files. I have created a schema.json that checks the JSON files being looped over, and I display whether each file is valid or invalid according to the checks I have added in my schema.json.
Currently I am not able to display and highlight the error in the JSON file being checked. For example:
"data_type": {
"type": "string",
"not": { "enum": ["null", "0", "NULL", ""] }
},
If the JSON file has the value "data_type": "null" or anything similar from the enum, it will report that the file is invalid, but I want it to highlight the error and display something like "you added null in the data_type field". Can I do that with jsonschema?
If there is another way without using jsonschema, that will work too; the main goal is to check multiple JSON files in a folder (which can be looped over), verify whether they are valid according to a few rules, and produce a clean, readable report that tells us the exact problem.
What I did is use Draft7Validator from jsonschema, which allows displaying error messages:
validator = jsonschema.Draft7Validator(schema)
errors = sorted(validator.iter_errors(jsonData[a]), key=lambda e: e.path)
for error in errors:
    print(f"{error.message}")
    report.write(f"In {file_name}, object {a}, {error.message}{N}")
    error_report.write(f"In {file_name}, object {a},{error.json_path[2:]} {error.message}{N}")
This prints:
In test1.json, object 0, apy '' is not of type 'number'
In test2.json, object 0, asset 'asset_contract_address' is a required property
A helpful link that helped me achieve this: Handling Validation Errors.
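Putting it together, a self-contained sketch of the folder loop might look like this (the schema path, folder name, and report file name are assumptions on my part; error.json_path starts with "$.", hence the [2:] slice):

import json
from pathlib import Path

import jsonschema

schema = json.loads(Path("schema.json").read_text())
validator = jsonschema.Draft7Validator(schema)

with open("error_report.txt", "w") as report:
    for path in Path("json_files").glob("*.json"):
        data = json.loads(path.read_text())
        # iter_errors yields every violation instead of stopping at the first
        for error in sorted(validator.iter_errors(data), key=lambda e: e.path):
            report.write(f"In {path.name}, {error.json_path[2:]}: {error.message}\n")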

Extract JSON Data in Python - Example Code Included

I am brand new to using JSON data and fairly new to Python. I am struggling to parse the following JSON data in Python in order to import it into a SQL Server database. I already have a program that imports the parsed data into SQL Server using pyodbc; however, I can't for the life of me figure out how to correctly parse the JSON into a Python dictionary.
I know there are a number of threads that address this issue, but I was unable to find any examples with the same JSON data structure. Any help would be greatly appreciated, as I am completely stuck. Thank you SO! Below is a cut of the JSON data I am working with:
{
    "data":
    [
        {
            "name": "Mobile Application",
            "url": "https://www.example-url.com",
            "metric": "users",
            "package": "example_pkg",
            "country": "USA",
            "data": [
                [ 1396137600000, 5.76 ],
                [ 1396224000000, 5.79 ],
                [ 1396310400000, 6.72 ],
                ....
                [ 1487376000000, 7.15 ]
            ]
        }
    ],
    "as_of": "2017-01-22"
}
Again, I apologize if this thread is repetitive; however, as I mentioned above, I was not able to work out the logic from other threads, as I am brand new to using JSON.
Thank you again for any help or advice.
import json

with open("C:\\Pathyway\\7Park.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
The above code results with the following error:
Traceback (most recent call last):
File "JSONpy", line 10, in <module>
data = json.load(json_file)
File "C:\json\__init__.py", line 291, in load
**kw)
File "C:\json\__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "C:\json\decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 7 column 1 (char 23549 - 146249)
Assuming the data you've described (less the ... ellipsis) is in a file called j.json, this code parses the JSON document into a Python object:
import json

with open("j.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
From your error message it seems possible that your file is not a single JSON document, but a sequence of JSON documents separated by newlines. If that is the case, then this code might be more helpful:
import json

with open("j.json") as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data["data"][0]["metric"])
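Since the goal in the question is a SQL Server import, a hedged sketch of flattening the nested time-series pairs into rows for pyodbc could look like this (the table layout and chosen columns are my assumptions, not part of the answer):

import json

with open("j.json") as json_file:
    doc = json.load(json_file)

rows = []
for entry in doc["data"]:
    # each inner pair is [timestamp_ms, value]
    for timestamp_ms, value in entry["data"]:
        rows.append((entry["name"], entry["package"], timestamp_ms, value))

# rows can then be bulk-inserted with pyodbc, for example:
# cursor.executemany("INSERT INTO metrics (name, package, ts_ms, value) VALUES (?, ?, ?, ?)", rows)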
