I'm storing hierarchical data in a format similar to JSON:
{
    "preferences": {
        "is_latest": true,
        "revision": 18,
        // ...
    },
    "updates": [
        { "id": 1, "content": "..." },
        // ...
    ]
}
I'm writing this data to disk and I'd like to store it efficiently. I assume that, towards this end, BSON would be more efficient as a storage format than raw JSON.
How can I read and write BSON trees to/from disk in Python?
I haven't used it, but it looks like there is a bson module on PyPI:
https://pypi.python.org/pypi/bson
The project is hosted on GitHub here:
https://github.com/martinkou/bson
I have gridded wind data in geojson, so all the geometries are points, and each feature has wind speed and direction in its properties section. I would like to generate a wind barb map from this data.
I am totally new to Python, but I found an excellent sample to do this at the Unidata site:
https://unidata.github.io/python-training/gallery/500hpa_hght_winds/
The only thing I think I need to do is modify it so that it can use geojson as a data source instead of a netcdf file.
Being completely new to this, I googled for a while and didn't find much to help with the geojson bit directly, but I did find that I could use geopandas to convert geojson to a shapefile then use gdal to convert the shapefile to a netcdf file. I am currently trying to work out how to modify the sample to use this data structure so I may encounter issues along the way.
I feel like I may have gone down a rabbit hole on this, though. Can anyone recommend a better way of doing this? Or alternatively, is this a valid method?
Here is a sample of the geojson:
{
    "type": "FeatureCollection",
    "totalFeatures": 2124,
    "features": [
        {
            "type": "Feature",
            "id": "1bdffe7b-af88-4f33-a792-e3a2a53c08b3",
            "geometry": {
                "type": "Point",
                "coordinates": [
                    -120,
                    40
                ]
            },
            "properties": {
                "altitude": 38615,
                "flightLevel": 400,
                "temperature": -60.619,
                "windSpeed": 18.9,
                "windDirection": 352.7
            }
        },
        ...
And here is my code to do the 2-stage convert, but if I can go to the dataset straight from the geopandas dataframe that would be great!
import geopandas as gpd
import xarray as xr
df = gpd.read_file('./winds.json')
df[['altitude', 'flightLevel', 'temperature', 'windSpeed', 'windDirection', 'geometry']].to_file('./winds.shp')
from osgeo import gdal
inputfile = './winds.shp'
outputfile = './winds.nc'
# actually I haven't got this bit working yet; temp workaround is the command-line app (ogr2ogr -F netCDF './winds.nc' './winds.shp')
gdal.Translate(outputfile, inputfile, format='NetCDF')
ds = xr.open_dataset('./winds.nc')
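For reference, one thing I'm considering is skipping the shapefile and netCDF steps entirely: since the geometries are all points, the coordinates can be pulled out of the geojson directly and turned into a frame. A rough sketch (using a tiny inline stand-in for the real file; xarray is noted as optional at the end):

```python
import pandas as pd

# minimal stand-in for the geojson FeatureCollection above
gj = {"features": [
    {"geometry": {"type": "Point", "coordinates": [-120, 40]},
     "properties": {"windSpeed": 18.9, "windDirection": 352.7}},
]}

# flatten each point feature into one row: lon, lat, plus its properties
rows = []
for feat in gj["features"]:
    lon, lat = feat["geometry"]["coordinates"]
    rows.append({"lon": lon, "lat": lat, **feat["properties"]})

df = pd.DataFrame(rows)
# with xarray installed, df.set_index(["lat", "lon"]).to_xarray()
# would give a Dataset gridded on lat/lon
```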
I am trying to dump JSON data in Python to a file.
I receive the data as an ImmutableMultiDict from a Flask post request.
It looks as follows: ImmutableMultiDict([('prefix', ''), ('key1', 'value1'), ('key2', 'value2')])
The data should look like this in the file:
{
    "prefix": [
        {"key1": "value1"},
        {"key2": "value2"}
    ]
}
The prefix as well as all the other data is part of the post request. My question now is: How can I json.dump the ImmutableMultiDict so it appears like this in the file? Right now it looks like this:
{
    "prefix": "",
    "key1": "value1",
    "key2": "value2"
}
The reason I want it the other way is that I want to append data later on to the array under the "prefix" key. Can anyone show me a way to do this properly, please?
Thanks.
EDIT:
Ok. I fixed it so it looks the way it should now. The Python code:
import json

def write_to_json(file, data, prefix):
    with open(file, "a", encoding="utf-8") as f:
        json.dump({prefix: list(data)}, f, indent=4)
Result:
{
    "prefix": [
        "key1",
        "key2"
    ]
}
Since you did not mention anything about it, I'd recommend checking ImmutableMultiDict#to_dict() with the flat parameter; see the documentation.
Maybe I am understanding incorrectly, but are you trying to have multiple keys under "prefix"? That goes against the definition of JSON and will not work, as a dict (MultiDict) uses something like a set to store its keys.
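If the goal is the nested layout shown in the question, one stdlib-only sketch (simulating the multidict as a plain list of (key, value) pairs, and assuming the first pair always names the prefix) is:

```python
import json

# the flat (key, value) pairs as they would arrive in the ImmutableMultiDict
items = [("prefix", ""), ("key1", "value1"), ("key2", "value2")]

# assuming the first pair names the prefix and the rest are data
prefix_key = items[0][0]
data = {prefix_key: [{k: v} for k, v in items[1:]]}
print(json.dumps(data, indent=4))
```

Appending later is then a matter of extending the list stored under that key before dumping again.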
So I know this question might be a duplicate, but I just want to understand how you can convert a TSV file to JSON. I've tried searching everywhere and I can't find a clue or understand the code.
This is the TSV file that I want to convert to JSON:
title content difficulty
week01 python syntax very easy
week02 python data manipulation easy
week03 python files and requests intermediate
week04 python class and advanced concepts hard
And this is the JSON file that I want as an output.
[{
"title": "week 01",
"content": "python syntax",
"difficulty": "very easy"
},
{
"title": "week 02",
"content": "python data manipulation",
"difficulty": "easy"
},
{
"title": "week 03",
"content": "python files and requests",
"difficulty": "intermediate"
},
{
"title": "week 04",
"content": "python class and advanced concepts",
"difficulty": "hard"
}
]
The built-in modules you need for this are csv and json.
To read tab-separated data with the CSV module, use the delimiter="\t" parameter:
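For example, with csv.reader (using an in-memory string here to stand in for the file):

```python
import csv
import io

# in-memory stand-in for the TSV file
tsv = "title\tcontent\tdifficulty\nweek01\tpython syntax\tvery easy\n"
reader = csv.reader(io.StringIO(tsv), delimiter="\t")
rows = list(reader)
print(rows[1])  # ['week01', 'python syntax', 'very easy']
```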
Even more conveniently, the CSV module has a DictReader that automatically reads the first row as column keys, and returns the remaining rows as dictionaries:
import csv
import json

with open('file.txt') as file:
    reader = csv.DictReader(file, delimiter="\t")
    data = list(reader)

print(json.dumps(data, indent=4))
The JSON module can also write directly to a file instead of a string.
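For example, json.dump writes straight to a file handle:

```python
import json

data = [{"title": "week01", "content": "python syntax", "difficulty": "very easy"}]
with open("file.json", "w") as f:
    json.dump(data, f, indent=4)
```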
If you are using pandas, you can use the to_json method with the option orient="records" to obtain the list of entries you want.
my_data_frame.to_json(orient="records")
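A sketch of the full round trip with pandas (reading the TSV from an in-memory string here for illustration):

```python
import io
import json
import pandas as pd

tsv = ("title\tcontent\tdifficulty\n"
       "week01\tpython syntax\tvery easy\n"
       "week02\tpython data manipulation\teasy\n")

# read the tab-separated data, then emit one JSON object per row
df = pd.read_csv(io.StringIO(tsv), sep="\t")
records = json.loads(df.to_json(orient="records"))
```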
I have the following structure in a geoJSON file:
{"crs":
{"type": "name",
"properties":
{"name": "urn:ogc:def:crs:EPSG::4326"}
},
"type": "FeatureCollection",
"features": [
{"geometry":
{"type": "Polygon",
"coordinates": [[[10.914622377957983, 45.682007076150505],
[10.927456267537572, 45.68179119797432],
[10.927147329501077, 45.672795442796335],
[10.914315493899755, 45.67301125363092],
[10.914622377957983, 45.682007076150505]]]},
"type": "Feature",
"id": 0,
"properties": {"cellId": 38}
},
{"geometry":
{"type": "Polygon",
"coordinates":
... etc. ...
I want to read this geoJSON into Google Maps and have each cell colored based on a property I calculated in Python for each cell individually. So my main question is: how can I read the geoJSON in with Python and add another property to these polygons (there are about 12,000 polygons, so adding them one by one is not an option), then write the new file?
I think what I'm looking for is a library for Python that can handle geoJSON, so I don't have to add these features via string manipulation.
There is a way with the Python geojson package.
With it, you can read the geojson as an object:
import geojson
loaded = geojson.loads("Any geojson string")
for feature in loaded.features[0:50]:  # [0:50] for only the first 50 features
    print(feature)
There are Feature, FeatureCollection, and custom classes to help you add your attributes.
The geoJSON is just a JSON doc (a simplification, but it is all you need for this purpose). Python reads that as a dict object.
Since dicts are updated in place, we don't need to store a new variable for the geo objects.
import json

# read in the json
with open("/path/to/files") as f:
    geo_objects = json.load(f)

# update each feature in place ("myValue" is an illustrative property name)
for feature in geo_objects["features"]:
    feature["properties"]["myValue"] = calculated_value

with open("/path/to/output/file", "w") as f:
    json.dump(geo_objects, f)
No string manipulation needed, and no new library to load!
I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is they are pretty-printed. One JSON file has just one massive dictionary, but it is not on one line. As per this thread, a dictionary in the JSON file should be on one line, which is a limitation of Apache Spark. I don't have it structured that way.
My JSON schema looks like this -
{
"dataset": [
{
"key1": [
{
"range": "range1",
"value": 0.0
},
{
"range": "range2",
"value": 0.23
}
]
}, {..}, {..}
],
"last_refreshed_time": "2016/09/08 15:05:31"
}
Here are my questions -
Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read it?
If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket. The bucket is partitioned by day.
Is there any other tool better suited to query these files other than Apache Spark? I'm on AWS stack so can try out any other suggested tool with Zeppelin notebook.
You could use sc.wholeTextFiles(). Here is a related post.
Alternatively, you could reformat your json using a simple function and load the generated file.
import json

def reformat_json(input_path, output_path):
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    with open(output_path, 'w') as f:
        for entry in jarr:
            f.write(json.dumps(entry) + "\n")
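Note that with the schema shown in the question, the file's top level is a single object rather than an array, so you would iterate over its "dataset" list instead. For example:

```python
import json

# stand-in for the parsed file (top level is an object, not an array)
doc = {"dataset": [{"key1": [{"range": "range1", "value": 0.0}]},
                   {"key1": [{"range": "range2", "value": 0.23}]}],
       "last_refreshed_time": "2016/09/08 15:05:31"}

# one JSON object per line: the layout Spark's json reader expects
lines = [json.dumps(entry) for entry in doc["dataset"]]
output = "\n".join(lines)
```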