How to create AWS Glue DynamicFrame from json with array of objects


I have JSON files in S3, each containing an array of objects, as shown below.
[{
"id": "c147162a-a304-11ea-aa90-0242ac110028",
"clientId": "xxx",
"contextUUID": "1bb6b39e-b181-4a6d-b43b-4040f9d254b8",
"tags": {},
"timestamp": 1592855898
}, {
"id": "c147162a-a304-11ea-aa90-0242ac110028",
"clientId": "yyy",
"contextUUID": "1bb6b39e-b181-4a6d-b43b-4040f9d254b8",
"tags": {},
"timestamp": 1592855898
}]
I used a crawler to detect the schema and load it into the catalog. It succeeded and created a schema with a single column named array with data type array<struct<id:string,clientId:string,contextUUID:string,tags:string,timestamp:int>>.
Next, I tried to load the data using the glueContext.create_dynamic_frame.from_catalog function, but I could not see any data. I tried printing the schema and the data as shown below.
ds = glueContext.create_dynamic_frame.from_catalog(
    database = "dbname",
    table_name = "tablename")
ds.printSchema()
root
ds.schema()
StructType([], {})
ds.show()
empty
ds.toDF().show()
++
||
++
++
Any idea what I am doing wrong? I am planning to extract each object in the array and transform it to a different schema.

You can pass a JsonPath expression in format_options to tell Glue how it should read the data. With "$[*]", each element of the top-level array is read as a separate record. The following code worked for me:
glueContext.create_dynamic_frame_from_options(
    's3',
    {'paths': ["s3://glue-test-bucket-12345/events/101-1.json"]},
    format="json",
    format_options={"jsonPath": "$[*]"}
).toDF()
I hope this solves the problem.
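To see what "$[*]" selects, it can help to mimic it locally. The following is a plain-Python sketch, not Glue code: it assumes input shaped like the files in the question, treats each element of the top-level array as one record, and reshapes it into a hypothetical target schema (the helper names to_records and reshape and the output fields client and ts are made up for illustration).

```python
import json

# Sample input shaped like the files in the question (values shortened).
raw = '''[
  {"id": "a1", "clientId": "xxx", "tags": {}, "timestamp": 1592855898},
  {"id": "a2", "clientId": "yyy", "tags": {}, "timestamp": 1592855898}
]'''


def to_records(text):
    """Return the elements of a top-level JSON array, one per record.

    This mirrors what the JsonPath "$[*]" selects: every element of the root array.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


def reshape(record):
    """Map one record to a hypothetical target schema (field names are assumptions)."""
    return {
        "id": record["id"],
        "client": record["clientId"],
        "ts": record["timestamp"],
    }


records = [reshape(r) for r in to_records(raw)]
print(records)
```

In a Glue job the same per-record reshaping would be done on the DynamicFrame or DataFrame rather than on Python lists.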

Related

How to get value off JSON in Robot framework

For example, I have a log that changes each time it is run; an example is below. I would like to take one of the values (say, id), store it in a variable, and either log only the id to the console or use that value somewhere else.
[
{
"#type": "type",
"href": [
{
"#url": "url1",
"#method": "get"
},
{
"#url": "url2",
"#method": "post"
},
{
"#url": "url3",
"#method": "post"
}
],
"id": "3",
"meta": [
{
"key": "key1",
"value": "value1"
},
{
"key": "key2",
"value": "value2"
}
]
}
]
I want to get the id into a variable because the id changes each time Robot Framework is run.

The JSON you are getting is a list. To get a value from it, you first need to parse the JSON string, then take the dictionary out of the list, and only then access the key you want to extract.
Robot Framework supports evaluating Python expressions with the Evaluate keyword. To parse a string as JSON, we can use
${DATA}=    Evaluate    json.loads("""${DATA}""")    json
Note that ${DATA} must contain your JSON as a string. The trailing json argument tells Evaluate to import the json module; newer Robot Framework versions can import it automatically.
Now that we have the parsed object, we can do whatever we want with it. Your JSON is a dictionary nested inside a list (see the surrounding []). So first extract the dictionary from the list, then access the dictionary key normally. The following should work fine.
${DICT}=    Get From List    ${DATA}    0
${ID}=    Get From Dictionary    ${DICT}    id
Get From List and Get From Dictionary come from the Collections library, so it must be imported in the settings section.
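For comparison, here is roughly what those keywords do in plain Python. This is only a sketch: response_text is a trimmed, hypothetical stand-in for the log from the question, and the variable names are illustrative.

```python
import json

# A trimmed copy of the response from the question.
response_text = '[{"#type": "type", "id": "3", "meta": [{"key": "key1", "value": "value1"}]}]'

data = json.loads(response_text)  # the outer JSON value is a list
first = data[0]                   # like: Get From List    ${DATA}    0
item_id = first["id"]             # like: Get From Dictionary    ${DICT}    id
print(item_id)
```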

Mapbox: Programatically update mapbox dataset from .geojson file

I have a .geojson file (call it data.geojson) which I use to manually update a dataset on mapbox.
Suppose that my data.geojson file is structured as follows:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"suburb": "A",
"unemployed": 10
},
"geometry": {
"type": "Point",
"coordinates": [
0,
0
]
}
},
{
"type": "Feature",
"properties": {
"suburb": "B",
"unemployed": 20
},
"geometry": {
"type": "Point",
"coordinates": [
1,
1
]
}
}
]
}
data.geojson is stored locally, and every 12 hours the 'unemployed' property of each feature is updated by another Python script that scrapes data from the web.
Currently, to update these properties in the online dataset (stored at mapbox.com), I manually navigate to the Mapbox website and re-upload the data.geojson file. I am looking for a way to do this programmatically in Python.
Any help would be greatly appreciated!

You can set up a timer of some sort to update the data automatically using JavaScript functions. Here I am using a source and layer named "sti", which is just GeoJSON line data.
The function would first add the data source as well as the layer:
var STI_SOURCE = 'json/sti/STI.json'; // declare URL for data
map.addSource('sti', { type: 'geojson', data: STI_SOURCE }); // Add source using URL
// Add the actual layer using the source
map.addLayer({
"id": "sti",
"type": "line",
"source": "sti",
"layout": {
"line-join": "miter",
"line-cap": "round"
},
"paint": {
"line-color": "#fff",
"line-width": 1,
"line-dasharray": [6, 2]
}
});
Then, when you want to refresh the data, remove the layer and the source:
map.removeLayer('sti');
map.removeSource('sti');
Then you can re-add them by starting again from the beginning. There are other (and better) ways to do this, but this is one way that works. GeoJSON sources also have a setData() method, which I think does this better. But hopefully this can get you started.

My solution, in the end, was simply to point the source of the Mapbox layer to the locally stored dataset.geojson file rather than the corresponding dataset stored online at mapbox.com.
I was able to edit the locally stored dataset.geojson using the 'json' python package. Since the Mapbox layer source was pointing directly to the local dataset, all updates to this local file would then be reflected in the Mapbox layer. This way, there is no need to upload any data to Mapbox.
@David also posted a helpful solution if you wish to go down that route.
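Editing the local file with the json package could look something like the sketch below. It assumes the structure shown in the question (a FeatureCollection whose features carry suburb and unemployed properties); the function name update_unemployed and the new_values mapping, which stands in for whatever the scraper produces, are hypothetical.

```python
import json


def update_unemployed(path, new_values):
    """Rewrite the 'unemployed' property of every feature whose suburb is in new_values.

    path: location of the local GeoJSON file.
    new_values: dict mapping suburb name -> new unemployed count (assumed scraper output).
    """
    with open(path) as f:
        collection = json.load(f)

    for feature in collection["features"]:
        suburb = feature["properties"]["suburb"]
        if suburb in new_values:
            feature["properties"]["unemployed"] = new_values[suburb]

    # Overwrite the same file so anything pointing at it sees the new values.
    with open(path, "w") as f:
        json.dump(collection, f, indent=2)
    return collection
```

A call such as update_unemployed("data.geojson", {"A": 12, "B": 25}) would then be run after each scrape.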

Adding a unique feature id to GeoJSON in QGIS or Python

According to the GeoJSON Format Specification:
"If a feature has a commonly used identifier, that identifier should be included as a member of the feature object with the name "id"."
My question is how do I add this to my GeoJSON?
If I create it as an attribute and then save it as GeoJSON in QGIS it ends up in Properties instead of in Features.
This is what I want to do:
{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name":"urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "id":"1", "properties": { "Namn":.................
This is what QGIS produces:
{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": 1, "Name"..................
I have also tried PyGeoj (https://github.com/karimbahgat/PyGeoj). It has a function to add a unique id, but it also adds it under properties.
If I open the GeoJSON and write the ids in by hand, it works, but I don't want to do this for all of my layers, some of which contain many features.

Thought I would write here how I have solved it in case anyone else comes across the same problem.
A simple Python script:
count = 0
# read the original file and write the edited copy to a new file
with open('kommun.geojson', 'r') as myfile, open('kommun_id.geojson', 'w') as file:
    for line in myfile:
        # lines that don't need an 'id' are copied unchanged
        if not line.startswith('{ "type": '):
            file.write(line)
        else:
            # feature lines: insert the id right after '{ "type": "Feature",'
            count = count + 1
            idNr = str(count)
            file.write(line[0:20] + '"id":' + '"' + idNr + '",' + line[21:])
It might not work for all GeoJSON files, since it relies on each feature sitting on its own line, but it works for the ones I have created with QGIS.

You can fix this by using the preserve_fid flag in ogr2ogr. You're going to "convert" the bad GeoJSON that QGIS dumps to the format you want, with the id field exposed.
ogr2ogr -f GeoJSON fixed.geojson qgis.geojson -preserve_fid
Here qgis.geojson is the file that QGIS created and fixed.geojson is the new version with the exposed id field.

This is due to ogr2ogr not knowing which QGIS field to use as the geoJSON ID.
Per this QGIS issue (thank you @gioman), a workaround is to manually specify the ID field in Custom Options --> Layer Settings when you do the export.
If you need to do this to geoJSONs that have already been created, you can use the standalone ogr2ogr command-line program:
ogr2ogr output.geojson input.geojson -lco id_field=id
If you need to convert the IDs from properties inside multiple files, you can use the command-line tool jq inside a Bash for-loop:
for file in *.geojson; do jq '(.features[] | select(.properties.id != null)) |= (.id = .properties.id)' "$file" > "${file}_tmp" && mv "${file}_tmp" "$file"; done
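If jq is not available, the same transformation can be sketched in Python. This is an illustration rather than a drop-in replacement: the function name promote_property_id and the sample data are hypothetical, and like the jq filter it copies properties.id to the feature-level id only where that property exists, leaving the property itself in place.

```python
import json


def promote_property_id(collection):
    """Copy properties['id'] to the feature-level 'id' for features that have one."""
    for feature in collection.get("features", []):
        prop_id = (feature.get("properties") or {}).get("id")
        if prop_id is not None:
            feature["id"] = prop_id
    return collection


# Small made-up FeatureCollection: one feature with an id property, one without.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"id": 1, "Name": "first"}, "geometry": None},
        {"type": "Feature", "properties": {"Name": "second"}, "geometry": None},
    ],
}

fixed = promote_property_id(sample)
print(json.dumps(fixed, indent=2))
```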

I had the same problem and I managed it with this little Python script.
import json
# load the original GeoJSON
with open('georef-spain-municipio.geojson') as json_file:
    data = json.load(json_file)
# add a sequential feature-level id, starting at 1
i = 0
for p in data['features']:
    i += 1
    p['id'] = i
# write the result to a new file
with open('georef-spain-municipio_id.geojson', 'w') as outfile:
    json.dump(data, outfile)
I hope it helps.

How do I generate python class source code from JSON? [duplicate]

Is there a python library for converting a JSON schema to a python class definition, similar to jsonschema2pojo -- https://github.com/joelittlejohn/jsonschema2pojo -- for Java?

So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
'name': 'Country',
'properties': {
'name': {'type': 'string'},
'abbreviation': {'type': 'string'},
},
'additionalProperties': False,
}
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that Warlock produces don't offer much in the way of introspection. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark; the only downside is that metaclasses in Python can't really be turned into 'code'.
Other useful modules to look at:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - An interesting structure builder that's halfway between a dotdict and construct
If you end up finding a good one-stop solution for this, please follow up on your question - I'd love to find one. I pored through GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p

python-jsonschema-objects is an alternative to warlock, built on top of jsonschema.
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in Python.
Usage:
Sample Json Schema
schema = '''{
"title": "Example Schema",
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
},
"age": {
"description": "Age in years",
"type": "integer",
"minimum": 0
},
"dogs": {
"type": "array",
"items": {"type": "string"},
"maxItems": 4
},
"gender": {
"type": "string",
"enum": ["male", "female"]
},
"deceased": {
"enum": ["yes", "no", 1, 0, "true", "false"]
}
},
"required": ["firstName", "lastName"]
} '''
Converting the schema object to class
import python_jsonschema_objects as pjs
import json
schema = json.loads(schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
Person = ns.ExampleSchema
james = Person(firstName="James", lastName="Bond")
>>> james.lastName
u'Bond'
>>> james
<example_schema lastName=Bond age=None firstName=James>
Validation:
james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
The problem is that it still uses Draft 4 validation, while jsonschema has since moved past Draft 4; I filed an issue on the repo about this.
Unless you are using an old version of jsonschema, the above package will work as shown.

I just created this small project to generate classes from a JSON schema. Even though it targets Python, I think it can be useful when working on business projects:
pip install jsonschema2popo
Running the following command generates a Python module containing the classes defined by the JSON schema (it uses Jinja2 templating):
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
More info at: https://github.com/frx08/jsonschema2popo
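To make the idea of generating class source from a schema concrete, here is a deliberately tiny standard-library sketch. It is not how jsonschema2popo, warlock, or python-jsonschema-objects work internally; the function class_source is hypothetical, it only handles flat schemas whose property names are valid Python identifiers, and it performs no validation.

```python
def class_source(schema):
    """Render Python source for a plain class from a flat JSON schema (sketch only)."""
    # Derive a class name from the schema title, e.g. "Example Schema" -> "ExampleSchema".
    name = schema.get("title", "Generated").title().replace(" ", "")
    props = schema.get("properties", {})

    # Every property becomes an optional keyword argument of __init__.
    params = "".join(f", {prop}=None" for prop in props)
    lines = [f"class {name}:", f"    def __init__(self{params}):"]
    for prop, spec in props.items():
        lines.append(f"        self.{prop} = {prop}  # schema type: {spec.get('type', 'any')}")
    if not props:
        lines.append("        pass")
    return "\n".join(lines) + "\n"


schema = {
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "age": {"type": "integer"},
    },
}

source = class_source(schema)
print(source)

# The generated text is ordinary Python, so it can be written to a file or executed.
namespace = {}
exec(source, namespace)
person = namespace["ExampleSchema"](firstName="James")
```

The libraries mentioned in the answers cover what this sketch leaves out, such as nested objects, required fields, and validation.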

JSON parsing using python

I'm attempting to understand the basics of JSON and thought using some Google translate examples would be interesting. I'm not actually making requests via the API but they have the following example I have saved as "file.json":
{
"data": {
"detections": [
[
{
"language": "en",
"isReliable": false,
"confidence": 0.18397073
}
]
]
}
}
I read in the file and parsed it with simplejson:
import simplejson
json_data = open('file.json').read()
json = simplejson.loads(json_data)
>>> json
{'data': {'detections': [[{'isReliable': False, 'confidence': 0.18397073, 'language': 'en'}]]}}
I've tried multiple ways to print the value of 'language' with no success. For example, this fails. Any pointers would be appreciated!
print json['detections']['language']

You need json['data']['detections'][0][0]['language']. As your example data shows, 'language' is a key of a dict that is inside a list, which is itself inside another list stored under the 'detections' key of the 'data' dict.
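Walking down one level at a time makes the nesting easier to see. This sketch uses the standard json module instead of simplejson and inlines the sample document as a string; the intermediate variable names are only illustrative.

```python
import json

text = '''{"data": {"detections": [[
    {"language": "en", "isReliable": false, "confidence": 0.18397073}
]]}}'''

parsed = json.loads(text)
detections = parsed["data"]["detections"]  # a list of lists
first_group = detections[0]                # the inner list
first_detection = first_group[0]           # the dict holding the fields
language = first_detection["language"]
print(language)
```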
