create dynamic frame with schema from catalog table - python

I have created a table in the Glue Data Catalog through the create_table call in the AWS Glue API. The code sample below creates the table in the catalog successfully. However, when I create a dynamic frame from this table, it is empty and has no schema. I want to create an empty dynamic frame with these four columns:
response = client.create_table(
    DatabaseName='xxxxxxxxxx',
    TableInput={
        'Name': 'xxxxxxxxxx',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'column_1', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_2', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_3', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_4', 'Type': 'string', 'Comment': 'None'}
            ],
            'Location': 's3://xxxxxxx/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat',
            'SerdeInfo': {
                'Name': 'avro',
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.avro.AvroSerDe',
                # SerdeInfo Parameters must be a dict of strings; the Avro
                # schema JSON goes under the 'avro.schema.literal' key.
                'Parameters': {
                    'avro.schema.literal': '{"type":"record","name":"DynamicRecord","namespace":"root","fields":[{"name":"column_1","type":["string","null"]},{"name":"column_2","type":["string","null"]},{"name":"column_3","type":["string","null"]},{"name":"column_4","type":["string","null"]}]}'
                }
            }
        }
    }
)

A DynamicFrame is similar to a DataFrame, except that each record is
self-describing, so no schema is required initially. Instead, AWS Glue
computes a schema on-the-fly when required, and explicitly encodes
schema inconsistencies using a choice (or union) type. You can resolve
these inconsistencies to make your datasets compatible with data
stores that require a fixed schema.
Link: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-schema
You can use apply_mapping() to set the schema explicitly, or you need some data at the S3 location so that Glue can infer a schema from the records.
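For example, a rough sketch of the apply_mapping route (the database and table names are the placeholders from the question, and glueContext comes from the standard Glue job boilerplate):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# With no data at the S3 location, this frame is empty and Glue
# computes an empty schema for it.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='xxxxxxxxxx',
    table_name='xxxxxxxxxx')

# Declare the four columns explicitly:
# (source column, source type, target column, target type)
mapped = dyf.apply_mapping([
    ('column_1', 'string', 'column_1', 'string'),
    ('column_2', 'string', 'column_2', 'string'),
    ('column_3', 'string', 'column_3', 'string'),
    ('column_4', 'string', 'column_4', 'string')])
mapped.printSchema()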


Why does apispec validation fail on format similar to documentation example for Python Flask API backend?

I am using apispec in a Python/Flask API backend and followed the format found in the apispec documentation example.
See: https://apispec.readthedocs.io/en/latest/
Can anyone tell me why I am getting a validation error with the JSON schema and data below? It says "responses" is required, but it looks like it is there. Is the structure incorrect? Any help appreciated!
openapi_spec_validator.exceptions.OpenAPIValidationError: 'responses' is a required property

Failed validating 'required' in schema['properties']['paths']['patternProperties']['^/']['properties']['get']:
{'additionalProperties': False,
'description': 'Describes a single API operation on a path.',
'patternProperties': {'^x-': {'$ref': '#/definitions/specificationExtension'}},
'properties': {'callbacks': {'$ref': '#/definitions/callbacksOrReferences'},
'deprecated': {'type': 'boolean'},
'description': {'type': 'string'},
'externalDocs': {'$ref': '#/definitions/externalDocs'},
'operationId': {'type': 'string'},
'parameters': {'items': {'$ref': '#/definitions/parameterOrReference'},
'type': 'array',
'uniqueItems': True},
'requestBody': {'$ref': '#/definitions/requestBodyOrReference'},
'responses': {'$ref': '#/definitions/responses'},
'security': {'items': {'$ref': '#/definitions/securityRequirement'},
'type': 'array',
'uniqueItems': True},
'servers': {'items': {'$ref': '#/definitions/server'},
'type': 'array',
'uniqueItems': True},
'summary': {'type': 'string'},
'tags': {'items': {'type': 'string'},
'type': 'array',
'uniqueItems': True}},
'required': ['responses'],
'type': 'object'}
On instance['paths']['/v1/activity']['get']:
{'get': {'description': 'Activity Get',
'responses': {'200': {'content': {'application/json': {'schema': 'ActivitySchema'}},
'description': 'success'}},
'security': [{'AccessTokenAuth': []}],
'tags': ['user', 'admin']}}
For reference, here is the original source comment that the data comes from:
"""
---
get:
description: Activity Get
responses:
200:
description: success
content:
application/json:
schema: ActivitySchema
security:
- AccessTokenAuth: []
tags:
- user
- admin
"""
I found the answer in the apispec documentation at:
https://apispec.readthedocs.io/en/latest/using_plugins.html#example-flask-and-marshmallow-plugins
where it says:
"If your API uses method-based dispatching, the process is similar. Note that the method no longer needs to be included in the docstring."
This is slightly misleading since it's not "no longer needs to be included" but rather "cannot be included".
So the correct doc string is:
"""
---
description: Activity Get
responses:
200:
description: success
content:
application/json:
schema: ActivitySchema
security:
- AccessTokenAuth: []
tags:
- user
- admin
"""

How to get schema from confluent schema registry with schema id and version using python

Can we pass both the schema id and the version to get the schema from the Schema Registry? I know about these functions:
Getting schema by ID
sr = SchemaRegistryClient('localhost:8081')
my_schema = sr.get_by_id(schema_id=1)
which returns,
{'type': 'record', 'name': 'io.confluent.examples.clients.basicavro.Customer', 'fields': [{'name': 'customerId', 'type': 'string'}, {'name': 'firstName', 'type': 'int'}, {'name': 'lastName', 'type': 'string'}, {'name': 'email', 'type': 'string'}, {'name': 'phone', 'type': 'string'}], '__fastavro_parsed': True}
And,
Getting schema by subject name
sr = SchemaRegistryClient('localhost:8081')
my_schema = sr.get_schema(subject='mySubject', version='latest')
which returns,
SchemaVersion(subject='mySubject', schema_id=1, schema=<schema_registry.client.schema.AvroSchema object at 0x000001A7271D6C18>, version=1)
With get_schema() I am able to specify the version, but the schema is not in the proper format. With get_by_id() I get the schema in the proper format, but I am not able to choose the version.
Is there a way to get the schema and also choose the version? Any help would be appreciated.
I think you're on the right track, but you need to use the fields on that returned class:
version = 1
schema_version = sr.get_schema(subject='mySubject', version=version)
print(schema_version.schema)
print(schema_version.version == version)
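Putting the two calls together: since the SchemaVersion returned by get_schema() carries the matching schema_id, you can pass that id to get_by_id() to obtain the parsed schema dict for whichever version you choose. A small sketch:
from schema_registry.client import SchemaRegistryClient

sr = SchemaRegistryClient('localhost:8081')

# Resolve the id that belongs to the version you want...
schema_version = sr.get_schema(subject='mySubject', version=1)

# ...then fetch that id to get the schema as a plain dict.
parsed = sr.get_by_id(schema_id=schema_version.schema_id)
print(parsed)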

Query nested JSON document in MongoDB collection using Python

I have a MongoDB collection containing multiple documents. A document looks like this:
{
    'name': 'sys',
    'type': 'system',
    'path': 'sys',
    'children': [{
        'name': 'folder1',
        'type': 'folder',
        'path': 'sys/folder1',
        'children': [{
            'name': 'folder2',
            'type': 'folder',
            'path': 'sys/folder1/folder2',
            'children': [{
                'name': 'textf1.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf1.txt',
                'children': ['abc', 'def']
            }, {
                'name': 'textf2.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf2.txt',
                'children': ['a', 'b', 'c']
            }]
        }, {
            'name': 'text1.txt',
            'type': 'file',
            'path': 'sys/folder1/text1.txt',
            'children': ['aaa', 'bbb', 'ccc']
        }]
    }],
    '_id': ObjectId('5d1211ead866fc19ccdf0c77')
}
There are other documents with a similar structure. How can I query this collection to find the matching part of one document, across multiple documents, where path matches sys/folder1/text1.txt?
My desired output would be:
{
    'name': 'text1.txt',
    'type': 'file',
    'path': 'sys/folder1/text1.txt',
    'children': ['aaa', 'bbb', 'ccc']
}
EDIT:
What I have come up with so far is this. My Flask endpoint:
class ExecuteQuery(Resource):
    def get(self, collection_name):
        result_list = []  # List to store query results
        query_list = []   # List to store the incoming queries
        for k, v in request.json.items():
            query_list.append({k: v})  # Store query items in list
        cursor = mongo.db[collection_name].find(*query_list)  # Execute query
        for document in cursor:
            encoded_data = JSONEncoder().encode(document)  # Encode the query results to String
            result_list.append(json.loads(encoded_data))  # Update dict by iterating over Documents
        return result_list  # Return query result to client
My client side:
request = {"name": "sys"}
response = requests.get(url, json=request, headers=headers)
print(response.text)
This gives me the entire document but I cannot extract a specific part of the document by matching the path.
I don't think MongoDB supports recursive or deep queries within a document (nor a recursive $unwind). What it does provide, however, are recursive queries across documents that reference one another, i.e. aggregating elements from a graph ($graphLookup).
This answer explains pretty well what you need to do to query a tree.
Although it does not directly address your problem, you may want to reevaluate your data structure. It certainly is intuitive, but updates can be painful, as can queries for nested elements, as you just noticed.
Since $graphLookup allows you to create a view equivalent to your current document, I cannot think of any advantage the explicitly nested structure has over one document per path. There will be a slight performance loss for reading and writing the entire tree, but with proper indexing it should be fine.
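To illustrate the one-document-per-path alternative, a small sketch (the connection string and collection name are made up): each node becomes its own document, and the lookup from the question turns into a plain indexed find:
from pymongo import MongoClient

nodes = MongoClient('mongodb://localhost:27017').mydb.nodes

# One document per node, keyed by its materialized path.
nodes.create_index('path', unique=True)

doc = nodes.find_one({'path': 'sys/folder1/text1.txt'}, {'_id': 0})
# -> {'name': 'text1.txt', 'type': 'file',
#     'path': 'sys/folder1/text1.txt', 'children': ['aaa', 'bbb', 'ccc']}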

Using Softlayer Object Filters for activeTransaction

I am trying to use the Python SoftLayer API to return a list of virtual servers that do not have an active transaction in "RECLAIM_WAIT" status (which is the state you get when you delete a virtual server in SoftLayer). I am expecting to get back all virtual servers that have no activeTransaction at all, as well as ones that have an activeTransaction in a status other than "RECLAIM_WAIT".
I call the vs manager with a filter that I think should work:
f = {'virtualGuests': {'activeTransaction': {'transactionStatus': {'name': {'operation': '!= RECLAIM_WAIT'}}}}}
instance = vs.list_instances(hostname="node5-0", filter=f)
but it returns only instances that have an activeTransaction (including the ones that have a RECLAIM_WAIT status).
Here is an example of a returned instance from that call:
[{'status': {'keyName': 'DISCONNECTED', 'name': 'Disconnected'}, 'datacenter': {'statusId': 2, 'id': 265592, 'name': 'xxxx', 'longName': 'xxx'}, 'domain': 'xxxx', 'powerState': {'keyName': 'HALTED', 'name': 'Halted'}, 'maxCpu': 2, 'maxMemory': 8192, 'hostname': 'node5-0', 'primaryIpAddress': 'xxxx', 'activeTransaction': {'modifyDate': '2017-01-16T05:20:01-06:00', 'statusChangeDate': '2017-01-16T05:20:01-06:00', 'elapsedSeconds': 22261, 'createDate': '2017-01-16T05:19:05-06:00', 'hardwareId': '', 'guestId': 27490599, 'id': 46204349, 'transactionStatus': {'friendlyName': 'This is a buffer time in which the customer may cancel the server', 'name': 'RECLAIM_WAIT'}}, 'globalIdentifier': 'xx', 'primaryBackendIpAddress': 'xxx', 'id': xxx, 'fullyQualifiedDomainName': 'xxx'}]
What am I doing wrong with the filter?
There is nothing wrong with your request; unfortunately, it's not possible to filter transactions by their transactionStatus, because the transaction does not expose a "transactionStatusId" key. You can check the transaction datatype: "transactionStatusId" does not exist among its local properties.
SoftLayer_Provisioning_Version1_Transaction
So, the best way would be to filter directly in your code.
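For example, a small sketch of that client-side filter, reusing the vs manager from the question and the key names visible in the sample response:
instances = vs.list_instances(hostname='node5-0')

def in_reclaim_wait(instance):
    # Guests without an activeTransaction have no status to inspect.
    tx = instance.get('activeTransaction') or {}
    return tx.get('transactionStatus', {}).get('name') == 'RECLAIM_WAIT'

wanted = [i for i in instances if not in_reclaim_wait(i)]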

How to change the series name using django-chartit?

I am going crazy trying to change the series name using django-chartit. I have googled for solutions but cannot find one.
Here's my views.py. I didn't even know where I should add the series name attribute, so I just guessed it might belong in series_options. But as you may have guessed, nothing changed, and the series name still reads "click_times".
def userdata_chart_view(request):
    userdata = \
        DataPool(
            series=[{
                'options': {
                    'source': User_Data.objects.filter(user=request.user.username)},
                'terms': [
                    'user',
                    'word',
                    'click_times']}
            ])
    cht = Chart(
        datasource=userdata,
        series_options=[{
            'options': {
                'type': 'column',
                'stacking': False},
            'terms': {
                'word': ['click_times']
            },
            'name': '搜索次数',
        }],
        chart_options={
            'title': {'text': '搜索频率'},
            'xAxis': {'title': {'text': '词条'}},
            'yAxis': {'title': {'text': '频率'}},
        }
    )
    content1 = {'user_data_chart': cht}
    return render(request, 'yigu/charts.html', content1)
I had the exact same difficulty and found a way that I'll share with you: based on this example from the chartit demo, it looks like you can simply rename a field during the DataPool creation. I used the syntax from the example (though it is based on a PivotDataPool, it looks fine for any DataPool...) by changing the 'terms' value from a list to a dictionary. In this dictionary the keys are the customized names used in the chart objects, and the values are the original field names.
For your example, it might look like this:
userdata = \
    DataPool(
        series=[{
            'options': {
                'source': User_Data.objects.filter(user=request.user.username)},
            'terms': {
                'user': 'user',
                'word': 'word',
                '<your customized name>': 'click_times'}}
        ])
cht = Chart(
    datasource=userdata,
    series_options=[{
        'options': {
            'type': 'column',
            'stacking': False},
        'terms': {
            'word': ['<your customized name>']
        },
        ...
I hope this works for you; I'd be happy to know if there are more conventional ways to do it...
PolRaguénès' answer pointed me in the right direction. It was just the bracketing in the DataPool definition for terms that had to be fixed:
userdata = \
    DataPool(
        series=[{
            'options': {
                'source': User_Data.objects.filter(user=request.user.username)},
            'terms': [
                'user',
                'word',
                {'<your customized name>': 'click_times'}]}
        ])
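Putting the corrected terms list together with the matching series_options, a sketch using the display name from the question (chart_options trimmed for brevity):
userdata = DataPool(
    series=[{
        'options': {
            'source': User_Data.objects.filter(user=request.user.username)},
        # A dict entry inside the terms list renames a field: the key is
        # the display name, the value is the original model field.
        'terms': [
            'user',
            'word',
            {'搜索次数': 'click_times'}]}])

cht = Chart(
    datasource=userdata,
    series_options=[{
        'options': {'type': 'column', 'stacking': False},
        # Reference the renamed term here, not the original field name.
        'terms': {'word': ['搜索次数']}}],
    chart_options={'title': {'text': '搜索频率'}})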
