I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:
[
  {
    "orderkey": 1,
    "name": "John",
    "age": 23,
    "email": "john@example.com"
  },
  {
    "orderkey": 2,
    "name": "Mark",
    "age": 33,
    "email": "mark@example.com"
  },
  ...
]
The data is sorted by "orderkey", and I need to shuffle it.
I tried the following Python code. It worked for a smaller JSON file, but did not work for my 405 MB one.
import json
import random

with open("sorted.json") as f:
    data = json.load(f)

random.shuffle(data)

# note the "w" mode: json.dump needs a writable file
with open("sorted.json", "w") as f:
    json.dump(data, f, indent=2)
How to do it?
UPDATE:
Initially I got the following error:
~/Desktop/shuffleData$ python3 toShuffle.py
Traceback (most recent call last):
File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
data = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)
Figured out that the problem was a stray "}" at the end of the JSON file: I had [{...},{...}]}, which is not valid JSON. Removing the trailing "}" fixed the problem, and the provided Python code works.
Well, this should ideally work, unless you have memory constraints:

import random

random.shuffle(data)
In case you are looking for another way and would like to benchmark which is faster for a huge dataset, you can use the shuffle function from the scikit-learn library:
from sklearn.utils import shuffle
shuffled_data = shuffle(data)
print(shuffled_data)
Note: this requires installing an additional package, scikit-learn (pip install -U scikit-learn).
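If memory is the real bottleneck, one workaround (a sketch, not part of the answers above: the JSON Lines conversion and the small records list are assumptions for illustration) is to convert the array to JSON Lines once, then shuffle raw text lines instead of a fully parsed 405 MB object:

```python
import json
import random

# Stand-in for the 1,000,000-record file; each dict mirrors the
# question's {"orderkey": ..., "name": ...} shape.
records = [{"orderkey": i, "name": "user%d" % i} for i in range(10)]

# Write one record per line (JSON Lines). Shuffling can then operate
# on raw lines, which is much lighter than shuffling parsed objects.
with open("data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open("data.jsonl") as f:
    lines = f.readlines()
random.shuffle(lines)

# Parse back only when needed
shuffled = [json.loads(line) for line in lines]
```

The one-time conversion still loads the array once, but every later shuffle avoids building a million Python dicts in memory.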
Related
I have a very big JSON file that I would like to import into Neo4j, but when I use APOC I get this error:
Failed to invoke procedure `apoc.import.json`: Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.LinkedHashMap<java.lang.Object,java.lang.Object>` from Array value (token `JsonToken.START_ARRAY`)
at [Source: (String)"[[{ "; line: 1, column: 1]
The code I am using to import the file is:
CALL apoc.import.json("file:///eight9.json")
The Start of the file looks like this:
[[{
  "id" : "149715690143449899009",
  "objectType" : "activity",
  "actor" : {
But when I checked it online, it is a valid JSON file.
It is complaining about "[[{ ". The example below is taken from the Neo4j documentation (https://neo4j.com/labs/apoc/4.3/import/load-json/): the procedure expects the file to start with {, so your JSON is NOT accepted by Neo4j.
For example:
{
  "name": "Michael",
  "age": 41,
  "children": ["Selina", "Rana", "Selma"]
}
Please remove the [[ at the start and the ]] at the end of your file, then try again.
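A minimal sketch of that cleanup in Python (the nested list here is a stand-in for the real file, and flattening one level of nesting is an assumption based on the [[{ prefix shown in the question):

```python
import json

# Stand-in for eight9.json, which apparently parses as a doubly
# nested array: [[{...}, {...}]]
nested = [[{"id": "149715690143449899009", "objectType": "activity"},
           {"id": "2", "objectType": "activity"}]]

# Flatten the outer level so each element is a plain object
flat = [obj for inner in nested for obj in inner]

# Re-serialize as newline-delimited objects (no wrapping [[ ... ]])
lines = "\n".join(json.dumps(obj) for obj in flat)
```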
As part of my job, I am trying to write a Python function to check for the existence of data in a JSON file, which I can only get by downloading it from a website. I am the only resource here with any coding or scripting experience (HTML, CSS and SQL), so this has fallen to me to sort out. I have no experience so far with Python.
I am not allowed to change the structure or format of the JSON file, the format of it is:
{
  "naglowek": {
    "dataGenerowaniaDanych": "20210514",
    "liczbaTransformacji": "5000",
    "schemat": "RRRRMMDDNNNNNNNNNNBBBBBBBBBBBBBBBBBBBBBBBBBB"
  },
  "skrotyPodatnikowCzynnych": [
    "examplestring1",
    "examplestring2",
    "examplestring3",
    "examplestring4",
  ],
  "maski": [
    "examplemask1",
    "examplemask2",
    "examplemask3",
    "examplemask4"
  ]
}
I have tried numerous examples found online, but none of them seem to work. Based on various websites, the Python code I have is:
import json

with open('20210514.json') as myfile:
    data = json.load(myfile)
print(data)

keyVal = 'examplestring2'

if keyVal in data:
    # Print the success message and the value of the key
    print("Data is found in JSON data")
else:
    # Print the message if the value does not exist
    print("Data is not found in JSON data")
But I am getting the errors below. I am a complete newbie to Python, so I am having trouble deciphering them:
D:\PycharmProjects\venv\Scripts\python.exe D:/PycharmProjects/json_test.py
Traceback (most recent call last):
File "D:\PycharmProjects\json_test.py", line 4, in <module>
data = json.load(myfile)
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\xyz\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 12 column 5 (char 921)
Process finished with exit code 1
Any help would be massively appreciated!
{
  "naglowek": {
    "dataGenerowaniaDanych": "20210514",
    "liczbaTransformacji": "5000",
    "schemat": "RRRRMMDDNNNNNNNNNNBBBBBBBBBBBBBBBBBBBBBBBBBB"
  },
  "skrotyPodatnikowCzynnych": [
    "examplestring1",
    "examplestring2",
    "examplestring3",
    "examplestring4"
  ],
  "maski": [
    "examplemask1",
    "examplemask2",
    "examplemask3",
    "examplemask4"
  ]
}
This should work. The problem is that you have a comma at the end of a list, which the parser can't handle. ECMAScript 5 introduced the ability to parse trailing commas, but JSON in general does not support them. So make sure not to leave a comma at the end of a list.
For your if-else statement to be correct, you'd have to change it to something like this:
keyVal = 'examplestring2'
keyName = 'skrotyPodatnikowCzynnych'

if keyName in data.keys() and keyVal in data[keyName]:
    # Print the success message and the value of the key
    print("Data is found in JSON data")
else:
    # Print the message if the value does not exist
    print("Data is not found in JSON data")
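Put together, a minimal runnable version of the corrected lookup (the inline raw string below is a stand-in for the fixed, trailing-comma-free file):

```python
import json

# Inline stand-in for 20210514.json with the trailing comma removed
raw = """{
  "naglowek": {"dataGenerowaniaDanych": "20210514"},
  "skrotyPodatnikowCzynnych": ["examplestring1", "examplestring2"],
  "maski": ["examplemask1"]
}"""
data = json.loads(raw)

keyName = 'skrotyPodatnikowCzynnych'
keyVal = 'examplestring2'

# Check the list under the key, not the top-level dict keys
found = keyName in data and keyVal in data[keyName]
print("Data is found in JSON data" if found else "Data is not found in JSON data")
```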
Remove the trailing comma. The JSON specification does not allow trailing commas.
If you don't want to change the file structure, then you have to do this:

import yaml

with open('20210514.json') as myfile:
    data = yaml.load(myfile, Loader=yaml.FullLoader)
print(data)

You also need to install PyYAML first:
https://pyyaml.org/
I am new to JSON. I am doing a project on vehicle number plate detection.
I have a dataset of the form:
{"content": "http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb0646e9cf9016473f1a561002a/77d1f81a-bee6-487c-aff2-0efa31a9925c____bd7f7862-d727-11e7-ad30-e18a56154311.jpg.jpeg","annotation":[{"label":["number_plate"],"notes":"","points":[{"x":0.7220843672456576,"y":0.5879828326180258},{"x":0.8684863523573201,"y":0.6888412017167382}],"imageWidth":806,"imageHeight":466}],"extras":null},
{"content": "http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb0646e9cf9016473f1a561002a/4eb236a3-6547-4103-b46f-3756d21128a9___06-Sanjay-Dutt.jpg.jpeg","annotation":[{"label":["number_plate"],"notes":"","points":[{"x":0.16194331983805668,"y":0.8507795100222717},{"x":0.582995951417004,"y":1}],"imageWidth":494,"imageHeight":449}],"extras":null},
There are 240 blocks of data in total.
I want to do two things with this dataset.
First, I need to download all the images from each block; second, I need to get the values of the "points" field into a text file.
I am having problems getting the values of the fields.
import json

jsonFile = open('Indian_Number_plates.json', 'r')
x = json.load(jsonFile)

for criteria in x['annotation']:
    # dict.iteritems() is Python 2 only; on Python 3 use items()
    for key, value in criteria.items():
        print(key, 'is:', value)
        print('')
I have written the above code to get all the values under "annotation", but I get the following error:
Traceback (most recent call last):
File "prac.py", line 13, in <module>
x = json.load(jsonFile)
File "C:\python364\Lib\json\__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "C:\python364\Lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\python364\Lib\json\decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 394 (char 393)
Please help me get the values of the "points" field and also download the images from the links in the "content" section.
I found this answer while searching. Essentially, you can read an object, catch the exception when the JSON parser sees an unexpected object, and then seek/reparse and build a list of objects.
In Java, I'd just tell you to use Jackson and its SAX-style streaming interface, as I've done that to read a list of objects formatted like this. If JSON in Python has a streaming API, I'd use that instead of the exception-handler workaround.
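Python's stdlib does have a building block for this: json.JSONDecoder.raw_decode parses one value from a string and returns the index where it stopped, so concatenated or comma-separated documents can be read without catch-and-reparse. A small sketch (the raw string here is illustrative, not the asker's data):

```python
import json

raw = '{"a": 1},{"a": 2} {"a": 3}'

decoder = json.JSONDecoder()
objects, idx = [], 0
while idx < len(raw):
    obj, end = decoder.raw_decode(raw, idx)  # parse one document
    objects.append(obj)
    idx = end
    # skip any comma/whitespace separators between documents
    while idx < len(raw) and raw[idx] in ", \n":
        idx += 1
```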
The error comes because your file contains two or more records:
{"content": "http://com.dataturks.a96- } ..... {"content": .....
To solve this, you should reformat your JSON so that all the records are contained in an array:
{ "data" : [ {"content": "http://com.dataturks.a96- .... },{"content":... }]}
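One quick way to get there in Python (sketched with inline stand-in records; the real data would be read from the file) is to strip the trailing comma and wrap the comma-separated records in [ ... ] before parsing:

```python
import json

# Two stand-in records in the file's layout: objects separated by
# commas/newlines, with a trailing comma at the end
raw = ('{"content": "http://example.com/img1.jpeg", '
       '"annotation": [{"points": [{"x": 0.1, "y": 0.2}]}]},\n'
       '{"content": "http://example.com/img2.jpeg", '
       '"annotation": [{"points": [{"x": 0.3, "y": 0.4}]}]},')

# Strip any trailing comma/newline, then wrap in an array
records = json.loads("[" + raw.rstrip(",\n") + "]")

# Collect every "points" entry across all records
points = [p for rec in records
            for ann in rec["annotation"]
            for p in ann["points"]]
```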
To download the images, extract the image names and URLs and use requests:
import requests

# image_name and pic_url are taken from each parsed record
with open(image_name, 'wb') as handle:
    response = requests.get(pic_url, stream=True)
    if not response.ok:
        print(response)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
I am brand new to using JSON data and fairly new to Python. I am struggling to parse the following JSON data in Python in order to import it into a SQL Server database. I already have a program that imports the parsed data into SQL Server using pyodbc; however, I can't for the life of me figure out how to correctly parse the JSON data into a Python dictionary.
I know there are a number of threads that address this issue; however, I was unable to find any examples with the same JSON data structure. Any help would be greatly appreciated, as I am completely stuck on this issue. Thank you SO! Below is a cut of the JSON data I am working with:
{
  "data":
  [
    {
      "name": "Mobile Application",
      "url": "https://www.example-url.com",
      "metric": "users",
      "package": "example_pkg",
      "country": "USA",
      "data": [
        [ 1396137600000, 5.76 ],
        [ 1396224000000, 5.79 ],
        [ 1396310400000, 6.72 ],
        ....
        [ 1487376000000, 7.15 ]
      ]
    }
  ],
  "as_of": "2017-01-22"
}
Again, I apologize if this thread is repetitive; however, as I mentioned above, I was not able to work out the logic from other threads, as I am brand new to JSON.
Thank you again for any help or advice.
import json

with open("C:\\Pathyway\\7Park.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
The above code results with the following error:
Traceback (most recent call last):
File "JSONpy", line 10, in <module>
data = json.load(json_file)
File "C:\json\__init__.py", line 291, in load
**kw)
File "C:\json\__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "C:\json\decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 7 column 1 (char 23549 - 146249)
Assuming the data you've described (less the ... ellipsis) is in a file called j.json, this code parses the JSON document into a Python object:
import json

with open("j.json") as json_file:
    data = json.load(json_file)
assert data["data"][0]["metric"] == "users"
From your error message it seems possible that your file is not a single JSON document, but a sequence of JSON documents separated by newlines. If that is the case, then this code might be more helpful:
import json

with open("j.json") as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data["data"][0]["metric"])
Background:
I'm attempting to follow a tutorial in which I'm importing a CSV file of approximately 324 MB to MongoLab's sandbox plan (capped at 500 MB), via pymongo in Python 3.4.
The file holds ~770,000 records, and after inserting ~164,000 I hit my quota and received:
raise OperationFailure(error.get("errmsg"), error.get("code"), error)
OperationFailure: quota exceeded
Question:
Would it be accurate to say the JSON-like structure of NoSQL takes more space to hold the same data as a CSV file? Or am I doing something screwy here?
Further information:
Here are the database metrics:
Here's the Python 3.4 code I used:
import sys
import pymongo
import csv

MONGODB_URI = '***credentials removed***'

def main(args):
    client = pymongo.MongoClient(MONGODB_URI)
    db = client.get_default_database()
    projects = db['projects']
    with open('opendata_projects.csv') as f:
        records = csv.DictReader(f)
        projects.insert(records)
    client.close()

if __name__ == '__main__':
    main(sys.argv[1:])
Yes, JSON takes up much more space than CSV. Here's an example:
name,age,job
Joe,35,manager
Fred,47,CEO
Bob,23,intern
Edgar,29,worker
translated in JSON, it would be:
[
  {
    "name": "Joe",
    "age": 35,
    "job": "manager"
  },
  {
    "name": "Fred",
    "age": 47,
    "job": "CEO"
  },
  {
    "name": "Bob",
    "age": 23,
    "job": "intern"
  },
  {
    "name": "Edgar",
    "age": 29,
    "job": "worker"
  }
]
Even with all whitespace removed, the JSON is 158 characters, while the CSV is only 69 characters.
Not accounting for things like compression, a set of JSON documents takes up more space than a CSV, because the field names are repeated in every record, whereas in the CSV the field names appear only in the first row.
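The overhead is easy to measure. A small self-contained comparison (the same four records as above, serialized both ways with the stdlib):

```python
import csv
import io
import json

rows = [
    {"name": "Joe",   "age": 35, "job": "manager"},
    {"name": "Fred",  "age": 47, "job": "CEO"},
    {"name": "Bob",   "age": 23, "job": "intern"},
    {"name": "Edgar", "age": 29, "job": "worker"},
]

# CSV: field names appear once, in the header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age", "job"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON with all whitespace removed: field names repeat in every record
json_text = json.dumps(rows, separators=(",", ":"))

print(len(csv_text), len(json_text))
```

The minified JSON comes out more than twice the size of the CSV for the same four rows, and the gap only grows with the record count.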
The way files are allocated is another factor:
In the filesize section of the Database Metrics screenshot you attached, notice that the first file allocated is 16 MB, the next is 32 MB, and so on. So when your data grew past 240 MB total, you had 5 files: 16 MB, 32 MB, 64 MB, 128 MB, and 256 MB. This explains why your filesize total is 496 MB even though your data size is only about 317 MB. The next file to be allocated would be 512 MB, which would put you well past the 500 MB limit.