Reading pretty-printed JSON files in Apache Spark - Python

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed: each file holds one massive dictionary, but it is not on a single line. As per this thread, a dictionary in a JSON file should be on one line, which is a limitation of Apache Spark, and I don't have my files structured that way.
My JSON schema looks like this -
{
  "dataset": [
    {
      "key1": [
        {
          "range": "range1",
          "value": 0.0
        },
        {
          "range": "range2",
          "value": 0.23
        }
      ]
    }, {..}, {..}
  ],
  "last_refreshed_time": "2016/09/08 15:05:31"
}
Here are my questions -
Can I avoid converting these files to match the format required by Apache Spark (one dictionary per line in a file) and still be able to read them?
If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket, and the bucket is partitioned by day.
Is there any other tool better suited than Apache Spark for querying these files? I'm on the AWS stack, so I can try out any suggested tool with a Zeppelin notebook.

You could use sc.wholeTextFiles(). Here is a related post.
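A minimal sketch of that approach, assuming PySpark 2.x; the bucket path is hypothetical and the field names come from the question's example:
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, contents) pairs, so multi-line JSON is not a problem
raw = sc.wholeTextFiles("s3://my-bucket/2016/09/08/*.json")
parsed = raw.map(lambda kv: json.loads(kv[1]))

# flatten each file's "dataset" list into individual records and let Spark infer the schema
records = parsed.flatMap(lambda doc: doc["dataset"])
df = spark.read.json(records.map(json.dumps))
df.printSchema()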
Alternatively, you could reformat your JSON using a simple function and load the generated file:
import json

def reformat_json(input_path, output_path):
    # write one JSON object per line so Spark's JSON reader can parse it
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
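Once every record sits on its own line, Spark's regular JSON reader can load the result; a short sketch with an illustrative output path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://my-bucket/reformatted/2016-09-08.json")  # hypothetical path
df.show(5)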

Related

Neo4j APOC load JSON from external variable

I'm trying to load a json document into Neo4j but, if possible, I don't want to use a file because, in my case, it's a waste of time.
WHAT I'M DOING NOW:
Python query to Elasticsearch Database
Push data into a .json file
From Neo4j Python Library, run apoc.load.json('file:///file.json')
WHAT I WANT TO DO:
Python query to Elasticsearch Database
From Neo4j Python Library, run apoc.load.json()
Is there any syntax that could help me with that? Thank you
If you already have APOC installed, you can use the APOC-to-Elasticsearch connector without having to use apoc.load.json.
Here is an example from the documentation:
CALL apoc.es.query("localhost","bank","_doc",null,{
  query: { match_all: {} },
  sort: [
    { account_number: "asc" }
  ]
})
YIELD value
UNWIND value.hits.hits AS hit
RETURN hit;
Link to docs: https://neo4j.com/labs/apoc/4.1/overview/apoc.es/
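Since the goal is to drive this from Python, here is a minimal sketch of running that query through the official neo4j driver; the connection details, credentials, and index name are assumptions:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed credentials

query = """
CALL apoc.es.query("localhost", "bank", "_doc", null, {query: {match_all: {}}})
YIELD value
UNWIND value.hits.hits AS hit
RETURN hit
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["hit"])

driver.close()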

Strange formatting on append - JSON

Recently I have been working on a project and I needed to append a list of dictionaries to my existing JSON file, but it behaves somewhat strangely.
Here is what I have:
def write_records_to_json(json_object):
    with open("tracker.json", "r+") as f:
        json_file = json.load(f)
        json_file.append(json_object)
        print(json_file)
This is the object I'm trying to append (the object is formatted this way):
[
    {
        "file": "dnc_complaint_numbers_2021-12-03.csv",
        "date": "2021-12-03"
    }
]
And this is what I get (pay attention to the end):
Please excuse me for not having it in a more readable form.
[{'file': 'dnc_complaint_numbers_2021-12-01.csv', 'date': '2021-12-01'}, {'file': 'dnc_complaint_numbers_2021-12-02.csv', 'date': '2021-12-02'}, '[\n {\n "file": "dnc_complaint_numbers_2021-12-03.csv",\n "date": "2021-12-03"\n }\n]']
Can someone tell me why is that and how to fix it? Thanks a lot.
From your code and output, we can infer that json_object refers to a string containing JSON. json_file, on the other hand, is not JSON; it is a list that was deserialised from JSON.
If you want to add json_object to json_file you should first deserialise the former:
json_file.extend(json.loads(json_object))
You also want to use extend instead of append here, so the new records end up at the same level as the rest of the data.
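Putting it together, a minimal sketch of the corrected function; note that writing the merged list back to the file is an assumption, since the original snippet only prints it:
import json

def write_records_to_json(json_object):
    with open("tracker.json", "r+") as f:
        json_file = json.load(f)
        # deserialise the incoming JSON string and merge it at the top level
        json_file.extend(json.loads(json_object))
        # rewind and overwrite the file with the updated list
        f.seek(0)
        json.dump(json_file, f, indent=4)
        f.truncate()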

How to convert from TSV file to JSON file?

So I know this question might be a duplicate, but I just want to understand how you can convert a TSV file to JSON. I've tried searching everywhere and I can't find a clue or understand the code.
This is not Python code, but it's the TSV file that I want to convert to JSON:
title     content                               difficulty
week01    python syntax                         very easy
week02    python data manipulation              easy
week03    python files and requests             intermediate
week04    python class and advanced concepts    hard
And this is the JSON file that I want as an output.
[{
  "title": "week 01",
  "content": "python syntax",
  "difficulty": "very easy"
},
{
  "title": "week 02",
  "content": "python data manipulation",
  "difficulty": "easy"
},
{
  "title": "week 03",
  "content": "python files and requests",
  "difficulty": "intermediate"
},
{
  "title": "week 04",
  "content": "python class and advanced concepts",
  "difficulty": "hard"
}
]
The built-in modules you need for this are csv and json.
To read tab-separated data with the csv module, pass the delimiter="\t" parameter.
Even more conveniently, the csv module has a DictReader that automatically reads the first row as column keys and returns the remaining rows as dictionaries:
import csv
import json

with open('file.txt') as file:
    reader = csv.DictReader(file, delimiter="\t")
    data = list(reader)

print(json.dumps(data, indent=4))
The JSON module can also write directly to a file instead of a string.
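For example, continuing from the snippet above, json.dump writes the converted rows straight to a file (the output filename is just an illustration):
with open('output.json', 'w') as out:
    json.dump(data, out, indent=4)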
If you are using pandas, you can use the to_json method with the option orient="records" to obtain the list of entries you want.
my_data_frame.to_json(orient="records")
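A minimal end-to-end sketch, assuming the TSV is stored in file.txt:
import pandas as pd

df = pd.read_csv("file.txt", sep="\t")   # read_csv handles TSV via sep="\t"
df.to_json("output.json", orient="records")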

JSON file isn't finished writing to by the time I load it, behave BDD

My program writes to a JSON file and then loads, reads, and POSTs it. The writing part is done by behave BDD.
import json
import requests

# writing to the JSON file is done by behave
data = json.load(open('results.json', 'r'))
r = requests.post(MyAPIEndpoint, json=data)
I'm running into an issue since the writing is not completed before I begin loading. (The file is missing the closing ] after the final }.)
HOOK-ERROR in after_all: ValueError: Expecting object: line 2 column 2501 (char 2502)
Is there a way of getting past this, either by changing something in my call to behave's __main__ or by changing how or when I load the JSON file?
I think the problem here has a few parts. For one, you can wait for the file to finish being written and to be closed before you read it; you can do that inside your code, for example along the lines of check if a file is open in Python.
On the other hand, to the computer the data is just data; it won't analyse it for you. You know where the error is because you looked at the output yourself, so it seems obvious to you, but it isn't obvious to the computer: how many errors are there, where are they, is the structure right, is all the data we need present? For a computer to know any of that, you would have to write code that checks it explicitly.
If your program produces multiple result files, I think the better way is to use temp files: you can freely create one, write to it, check when it is ready, and then use it, without having to worry about another process using it at the same time.
Another option is to check that the JSON is valid before loading it (see Python: validate and format JSON files) and only load it once it is valid.
Hope this helps.
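As an illustration of that last point, a minimal sketch that keeps retrying until the file parses as valid JSON; the retry count, delay, and file name are assumptions:
import json
import time

def load_json_when_ready(path, retries=10, delay=0.5):
    # retry until the file contains complete, valid JSON
    for _ in range(retries):
        try:
            with open(path) as f:
                return json.load(f)
        except (ValueError, FileNotFoundError):
            # missing or still being written; wait and try again
            time.sleep(delay)
    raise RuntimeError(path + " never became valid JSON")

data = load_json_when_ready('results.json')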
One way to address this problem is to change your file format from being JSON at the top level to newline-delimited JSON (NDJSON), also called line-delimited JSON (LDJSON) or JSON lines (JSONL).
https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON
For example, this JSON file:
{
  "widgets": [
    {"name": "widget1", "color": "red"},
    {"name": "widget2", "color": "green"},
    {"name": "widget3", "color": "blue"}
  ]
}
Would become this NDJSON file:
{"name": "widget1", "color": "red"}
{"name": "widget2", "color": "green"}
{"name": "widget3", "color": "blue"}
It's especially useful in the context of streaming data, which sounds like your use case: one process writing to a file continuously while another reads it.
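On the writing side, a minimal sketch of appending one record per line (the file and record names are illustrative):
import json

def append_record(path, record):
    # each record is a complete JSON document on its own line, so a reader
    # only ever has to skip an incomplete final line rather than repair the whole file
    with open(path, 'a') as f:
        f.write(json.dumps(record) + "\n")

append_record('widgets.json', {"name": "widget4", "color": "yellow"})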
You could then read the NDJSON file like so:
import json
from pprint import pprint
with open('widgets.json') as f:
    all_lines = [json.loads(l) for l in f.readlines()]
    all_data = {'widgets': all_lines}
pprint(all_data)
Output:
{'widgets': [{'color': 'red', 'name': 'widget1'},
             {'color': 'green', 'name': 'widget2'},
             {'color': 'blue', 'name': 'widget3'}]}

Writing BSON to disk

I'm storing hierarchical data in a format similar to JSON:
{
  "preferences": {
    "is_latest": true,
    "revision": 18,
    // ...
  },
  "updates": [
    { "id": 1, "content": "..." },
    // ...
  ]
}
I'm writing this data to disk and I'd like to store it efficiently. I assume that, towards this end, BSON would be more efficient as a storage format than raw JSON.
How can I read and write BSON trees to/from disk in Python?
I haven't used it, but it looks like there is a bson module on PyPI:
https://pypi.python.org/pypi/bson
The project is hosted in GitHub here:
https://github.com/martinkou/bson
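For what it's worth, a minimal sketch of what round-tripping a document to disk might look like with that package, assuming it exposes dumps/loads as its description suggests (untested):
import bson  # the standalone bson package from PyPI, not pymongo's bson

doc = {"preferences": {"is_latest": True, "revision": 18}}

# serialise to BSON bytes and write them to disk
with open("data.bson", "wb") as f:
    f.write(bson.dumps(doc))

# read the bytes back and deserialise
with open("data.bson", "rb") as f:
    restored = bson.loads(f.read())

print(restored)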
