Changing schema of avro file when writing to it in append mode

Changing schema of avro file when writing to it in append mode - python

I'm looking for a way to modify the schema of an avro file in python. Taking the following example, using the fastavro package, first write out some initial records, with corresponding schema:
from fastavro import writer, parse_schema
schema = {
'name': 'test',
'type': 'record',
'fields': [
{'name': 'id', 'type': 'int'},
{'name': 'val', 'type': 'long'},
],
}
records = [
{u'id': 1, u'val': 0.2},
{u'id': 2, u'val': 3.1},
]
with open('test.avro', 'wb') as f:
writer(f, parse_schema(schema), records)
Uhoh, I've got some more records, but they contain None values. I'd like to append these records to the avro file, and modify my schema accordingly:
more_records = [
{u'id': 3, u'val': 1.5},
{u'id': 2, u'val': None},
]
schema['fields'][1]['type'] = ['long', 'null']
with open('test.avro', 'a+b') as f:
writer(f, parse_schema(schema), more_records)
Instead of overwriting the schema, this results in an error:
ValueError: Provided schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}]}}} does not match file writer_schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}]}}}
Is there a workaround for this? The fastavro docs for this suggest it's not possible, but I'm hoping someone knows of a way!
Cheers

The append API in fastavro does not currently support this. You could open an issue in that repository and discuss if something like this makes sense.

Related

Create a new dictionary with the key-value pair from values in a list of dictionaries based on matches from a separate list

I am trying to get a new dictionary with the k: v pair from values in a list of dictionary based on matches from a separate list.
The casing is different.
My data looks like this:
list_of_dicts = [
{'fieldname': 'Id', 'source': 'microsoft', 'nullable': True, 'type': 'int'},
{'fieldname': 'FirstName', 'source': 'microsoft', 'nullable': True, 'type': 'string'},
{'fieldname': 'LastName', 'source': 'microsoft', 'nullable': False, 'type': 'string'},
{'fieldname': 'Address1', 'source': 'microsoft', 'nullable': False, 'type': 'string'}
]
fieldname_list = ['FIRSTNAME', 'LASTNAME']
From this I would like to create a new dictionary as follows:
new_dict = {'FirstName': 'string', 'LastName': 'string'}
I think this should be possible with a dictionary comprehension, but I can't work it out - can anyone help?

Do you want to do?
list_of_dicts = [
{'fieldname': 'Id', 'source': 'microsoft', 'nullable': True, 'type': 'int'},
{'fieldname': 'FirstName', 'source': 'microsoft', 'nullable': True, 'type': 'string'},
{'fieldname': 'LastName', 'source': 'microsoft', 'nullable': False, 'type': 'string'},
{'fieldname': 'Address1', 'source': 'microsoft', 'nullable': False, 'type': 'string'}
]
fieldname_list = ['FIRSTNAME', 'LASTNAME']
new_dict = {x['fieldname']: x['type'] for x in list_of_dicts if x['fieldname'].upper() in fieldname_list}
print(new_dict)
# output: {'FirstName': 'string', 'LastName': 'string'}

Convert nested dictionary within JSON from a string

I have JSON data that I loaded that appears to have a bit of a messy data structure where nested dictionaries are wrapped in single quotes and recognized as a string, rather than a single dictionary which I can loop through. What is the best way to drop the single quotes from the key-value property ('value').
Provided below is an example of the structure:
for val in json_data:
print(val)
{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'},
If I add a nested look targeting ['value'], it loops by character and not key-value pair in the dictionary.

Using json.loads to convert string to dict
import json
json_data = [{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'}]
# the result is a Python dictionary:
for val in json_data:
print(json.loads(val['value']))
this should be work!!

How to parse VirtualMachinePaged object using Azure SDK for Python?

I am trying to get list of VMs in a Resource Group using Azure SDK for Python. I configured my Visual Studio code with all the required Azure Tools. I created a function and used below code to get List of VMs.
import os
import random
import string
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.resource import ResourceManagementClient
def main():
SUBSCRIPTION_ID = os.environ.get("SUBSCRIPTION_ID", None)
GROUP_NAME = "testgroupx"
VIRTUAL_MACHINE_NAME = "virtualmachinex"
SUBNET_NAME = "subnetx"
INTERFACE_NAME = "interfacex"
NETWORK_NAME = "networknamex"
VIRTUAL_MACHINE_EXTENSION_NAME = "virtualmachineextensionx"
resource_client = ResourceManagementClient(
credential=DefaultAzureCredential(),
subscription_id=SUBSCRIPTION_ID
)
network_client = NetworkManagementClient(
credential=DefaultAzureCredential(),
subscription_id=SUBSCRIPTION_ID
)
compute_client = ComputeManagementClient(
credential=DefaultAzureCredential(),
subscription_id=SUBSCRIPTION_ID
)
vm = compute_client .virtual_machines.list(
'RGName'
)
print("Get virtual machine:\n{}", vm)
When I see the logs, I see below as the print response.
<azure.mgmt.compute.v2019_12_01.models._paged_models.VirtualMachinePaged object at 0x0000024584F92EC8>
I am really trying to get the actual object, I am not sure how can I parse it. Any ideas?

Since it returns a collection you need to use Use for loop , You can do something like this
for vm in compute_client .virtual_machines.list('RGName'):
print("\tVM: {}".format(vm.name))

VirtualMachinePaged contains a collection of an object of type VirtualMachine. You can see the source code of that class here: https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/compute/azure-mgmt-compute/azure/mgmt/compute/v2019_12_01/models/_models.py.
From this link, here're the list of attributes:
{
'id': {'key': 'id', 'type': 'str'},
'name': {'key': 'name', 'type': 'str'},
'type': {'key': 'type', 'type': 'str'},
'location': {'key': 'location', 'type': 'str'},
'tags': {'key': 'tags', 'type': '{str}'},
'plan': {'key': 'plan', 'type': 'Plan'},
'hardware_profile': {'key': 'properties.hardwareProfile', 'type': 'HardwareProfile'},
'storage_profile': {'key': 'properties.storageProfile', 'type': 'StorageProfile'},
'additional_capabilities': {'key': 'properties.additionalCapabilities', 'type': 'AdditionalCapabilities'},
'os_profile': {'key': 'properties.osProfile', 'type': 'OSProfile'},
'network_profile': {'key': 'properties.networkProfile', 'type': 'NetworkProfile'},
'diagnostics_profile': {'key': 'properties.diagnosticsProfile', 'type': 'DiagnosticsProfile'},
'availability_set': {'key': 'properties.availabilitySet', 'type': 'SubResource'},
'virtual_machine_scale_set': {'key': 'properties.virtualMachineScaleSet', 'type': 'SubResource'},
'proximity_placement_group': {'key': 'properties.proximityPlacementGroup', 'type': 'SubResource'},
'priority': {'key': 'properties.priority', 'type': 'str'},
'eviction_policy': {'key': 'properties.evictionPolicy', 'type': 'str'},
'billing_profile': {'key': 'properties.billingProfile', 'type': 'BillingProfile'},
'host': {'key': 'properties.host', 'type': 'SubResource'},
'provisioning_state': {'key': 'properties.provisioningState', 'type': 'str'},
'instance_view': {'key': 'properties.instanceView', 'type': 'VirtualMachineInstanceView'},
'license_type': {'key': 'properties.licenseType', 'type': 'str'},
'vm_id': {'key': 'properties.vmId', 'type': 'str'},
'resources': {'key': 'resources', 'type': '[VirtualMachineExtension]'},
'identity': {'key': 'identity', 'type': 'VirtualMachineIdentity'},
'zones': {'key': 'zones', 'type': '[str]'},
}
For Python 3, the code can be found here: https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/compute/azure-mgmt-compute/azure/mgmt/compute/v2019_12_01/models/_models_py3.py.

How to convert a JSON list into two different columns in a CSV file?

This is my first question on this spectacular website, I need to know how to export complex information from a JSON to a CSV.
The problem is that I need from the list that I have in the column to have two different values.
I tried a lot of different combinations and I couldn't so one of my last resources are asked to the community.
My code is this:
def output(alerts):
output = list()
for alert in alerts:
applications = alerts['applications']
for app in applications:
categories = app['categories']
for cat in categories:
output_alert = [list(cat.items())[0], app['confidence'], app['icon'],
app['name'], app['version'], app['website'], alerts['language'], alerts['status']]
output.append(output_alert)
df = pd.DataFrame(output, columns=['Categories', 'Confidence', 'Icon', 'Name', 'Version', 'Website',
'Language', 'Status'])
df.to_csv(args.output)
print('Scan completed, you already have your new CSV file')
return
enter image description here
I left you a picture of the CSV file with the problem in column B (I have a list there) but I need actually two columns with each value...
I attached the JSON response that I have from a REST API
{'applications': [{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Lo-dash.png',
'name': 'Lodash',
'version': '4.17.15',
'website': 'http://www.lodash.com'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '100',
'icon': 'RequireJS.png',
'name': 'RequireJS',
'version': '2.3.6',
'website': 'http://requirejs.org'},
{'categories': [{'13': 'Issue trackers'}],
'confidence': '100',
'icon': 'Sentry.svg',
'name': 'Sentry',
'version': '4.6.2',
'website': 'https://sentry.io/'},
{'categories': [{'1': 'CMS'},
{'6': 'Ecommerce'},
{'11': 'Blogs'}],
'confidence': '100',
'icon': 'Wix.png',
'name': 'Wix',
'version': None,
'website': 'https://www.wix.com'},
{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Zepto.png',
'name': 'Zepto',
'version': None,
'website': 'http://zeptojs.com'},
{'categories': [{'19': 'Miscellaneous'}],
'confidence': '100',
'icon': 'webpack.svg',
'name': 'webpack',
'version': None,
'website': 'https://webpack.js.org/'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '0',
'icon': 'React.png',
'name': 'React',
'version': None,
'website': 'https://reactjs.org'}], 'language': 'es', 'status': 'success'}
[{'59': 'JavaScript libraries'}] this last thing is my big problem! Thank you for your time and help!

You could try using list(cat.keys())[0], list(cat.values())[0] in your output_alert variable to extract key and value separately.

You can use json_normalize to extract your columns without the for-loop, and then create two new columns with the extracted keys and values from categories:
result = pd.json_normalize(
alerts,
record_path=["applications"],
meta=["language", "status"]
).explode("categories")
result["category_labels"] = result.categories.apply(lambda x: list(x.keys())[0])
result["category_values"] = result.categories.apply(lambda x: list(x.values())[0])
The output is:

Accessing YAML data in Python

I have a YAML file that parses into an object, e.g.:
{'name': [{'proj_directory': '/directory/'},
{'categories': [{'quick': [{'directory': 'quick'},
{'description': None},
{'table_name': 'quick'}]},
{'intermediate': [{'directory': 'intermediate'},
{'description': None},
{'table_name': 'intermediate'}]},
{'research': [{'directory': 'research'},
{'description': None},
{'table_name': 'research'}]}]},
{'nomenclature': [{'extension': 'nc'}
{'handler': 'script'},
{'filename': [{'id': [{'type': 'VARCHAR'}]},
{'date': [{'type': 'DATE'}]},
{'v': [{'type': 'INT'}]}]},
{'data': [{'time': [{'variable_name': 'time'},
{'units': 'minutes since 1-1-1980 00:00 UTC'},
{'latitude': [{'variable_n...
I'm having trouble accessing the data in python and regularly see the error TypeError: list indices must be integers, not str
I want to be able to access all elements corresponding to 'name' so to retrieve each data field I imagine it would look something like:
import yaml
settings_stream = open('file.yaml', 'r')
settingsMap = yaml.safe_load(settings_stream)
yaml_stream = True
print 'loaded settings for: ',
for project in settingsMap:
print project + ', ' + settingsMap[project]['project_directory']
and I would expect each element would be accessible via something like ['name']['categories']['quick']['directory']
and something a little deeper would just be:
['name']['nomenclature']['data']['latitude']['variable_name']
or am I completely wrong here?

The brackets, [], indicate that you have lists of dicts, not just a dict.
For example, settingsMap['name'] is a list of dicts.
Therefore, you need to select the correct dict in the list using an integer index, before you can select the key in the dict.
So, giving your current data structure, you'd need to use:
settingsMap['name'][1]['categories'][0]['quick'][0]['directory']
Or, revise the underlying YAML data structure.
For example, if the data structure looked like this:
settingsMap = {
'name':
{'proj_directory': '/directory/',
'categories': {'quick': {'directory': 'quick',
'description': None,
'table_name': 'quick'}},
'intermediate': {'directory': 'intermediate',
'description': None,
'table_name': 'intermediate'},
'research': {'directory': 'research',
'description': None,
'table_name': 'research'},
'nomenclature': {'extension': 'nc',
'handler': 'script',
'filename': {'id': {'type': 'VARCHAR'},
'date': {'type': 'DATE'},
'v': {'type': 'INT'}},
'data': {'time': {'variable_name': 'time',
'units': 'minutes since 1-1-1980 00:00 UTC'}}}}}
then you could access the same value as above with
settingsMap['name']['categories']['quick']['directory']
# quick

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Changing schema of avro file when writing to it in append mode - python

The append API in fastavro does not currently support this. You could open an issue in that repository and discuss if something like this makes sense.

Related

Create a new dictionary with the key-value pair from values in a list of dictionaries based on matches from a separate list

Convert nested dictionary within JSON from a string

How to parse VirtualMachinePaged object using Azure SDK for Python?

How to convert a JSON list into two different columns in a CSV file?

Accessing YAML data in Python

Categories

Resources