How to load a nested json file into a pandas DataFrame - python

Please help, I cannot seem to get the JSON data into a DataFrame.
I loaded the data:
data = json.load(open(r'path'))  # this works fine and displays:
{'type': 'FeatureCollection', 'name': 'Altstadt Nord', 'crs': {'type': 'name', 'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}}, 'features': [{'type': 'Feature', 'properties': {'Name': 'City-Martinsviertel', 'description': None}, 'geometry': {'type': 'Polygon', 'coordinates': [[[6.9595637, 50.9418396], [6.956624, 50.9417382], [6.9543173, 50.941603], [6.9529869, 50.9413664], [6.953062, 50.9408593], [6.9532873, 50.9396289], [6.9533624, 50.9388176], [6.9529333, 50.9378373], [6.9527509, 50.9371815], [6.9528367, 50.9360659], [6.9532122, 50.9352884], [6.9540705, 50.9350653], [6.9553258, 50.9350044], [6.9568815, 50.9351667], [6.9602074, 50.9355047], [6.9608189, 50.9349165], [6.9633939, 50.9348827], [6.9629433, 50.9410622], [6.9616236, 50.9412176], [6.9603898, 50.9414881], [6.9595637, 50.9418396]]]}}, {'type': 'Feature', 'properties': {'Name': 'Gereonsviertel', 'description': None}, 'geometry': {'type': 'Polygon', 'coordinates': [[[6.9629433, 50.9410622], [6.9629433, 50.9431646], [6.9611408, 50.9433539], [6.9601752, 50.9436649], [6.9588234, 50.9443409], [6.9579651, 50.9449763], [6.9573213, 50.945801], [6.9563128, 50.9451926], [6.9551756, 50.9448546], [6.9535663, 50.9446518], [6.9523432, 50.9449763], [6.9494464, 50.9452602], [6.9473435, 50.9454495], [6.9466998, 50.9456928], [6.9458415, 50.946531], [6.9434168, 50.9453954], [6.9424726, 50.9451926], [6.9404342, 50.9429888], [6.9404771, 50.9425156], [6.9403269, 50.9415016], [6.9400479, 50.9405281], [6.9426228, 50.9399872], [6.9439103, 50.9400143], [6.9453051, 50.9404875], [6.9461634, 50.9408931], [6.9467427, 50.941096], [6.9475581, 50.9410013], [6.9504227, 50.9413191], [6.9529869, 50.9413664], [6.9547464, 50.9416368], [6.9595637, 50.9418396], [6.9603898, 50.9414881], [6.9616236, 50.9412176], [6.9629433, 50.9410622]]]}}, {'type': 'Feature', 'properties': {'Name': 'Kunibertsviertel', 'description': None}, 'geometry': {'type': 'Polygon', 'coordinates': [[[6.9629433, 50.9431646], [6.9637129, 50.9454917], [6.9651506, 50.9479252], [6.9666097, 50.9499124], [6.9667599, 50.9500882], [6.9587777, 50.9502504], [6.9573213, 50.945801], [6.9579651, 50.9449763], [6.9588234, 50.9443409], [6.9601752, 50.9436649], [6.9611408, 50.9433539], [6.9629433, 50.9431646]]]}}, {'type': 'Feature', 'properties': {'Name': 'Nördlich Neumarkt', 'description': None}, 'geometry': {'type': 'Polygon', 'coordinates': [[[6.9390331, 50.9364418], [6.9417153, 50.9358738], [6.9462214, 50.9358062], [6.9490109, 50.9355628], [6.9505129, 50.9353329], [6.9523798, 50.9352924], [6.9532122, 50.9352884], [6.9528367, 50.9360659], [6.9527509, 50.9371815], [6.9529333, 50.9378373], [6.9533624, 50.9388176], [6.9532381, 50.9398222], [6.9529869, 50.9413664], [6.9504227, 50.9413191], [6.9475581, 50.9410013], [6.9467427, 50.941096], [6.9453051, 50.9404875], [6.9439103, 50.9400143], [6.9424663, 50.9399574], [6.9400479, 50.9405281], [6.9390331, 50.9364418]]]}}]}
Now I cannot seem to fit it into a DataFrame:
pd.DataFrame(data) --> ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I tried to flatten it with json_flatten, but got ModuleNotFoundError: No module named 'flatten_json', even though I installed json-flatten via pip install.
I also tried:
df = pd.DataFrame.from_dict(data, orient='index')
df
Out[22]:
0
type FeatureCollection
name Altstadt Nord
crs {'type': 'name', 'properties': {'name': 'urn:o...
features [{'type': 'Feature', 'properties': {'Name': 'C...

I think you can use json_normalize to load the features into pandas.
test.json in this case is your full JSON file (with double quotes, i.e. valid JSON rather than the Python repr shown above).
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

# In pandas >= 1.0 json_normalize is available as pd.json_normalize;
# the older location was pandas.io.json.json_normalize.
df = pd.json_normalize(data, record_path=['features'], meta=['name'])
print(df)
This results in a DataFrame with one row per feature and columns such as type, properties.Name, properties.description, geometry.type, geometry.coordinates, and name.
You can pass a deeper record_path to json_normalize to create more columns for the polygon coordinates.
You can find more documentation at https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html
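If you want one row per coordinate pair instead, here is a minimal sketch in plain Python (assuming the GeoJSON structure shown in the question):
import pandas as pd

# Flatten each polygon ring into (Name, lon, lat) rows.
rows = []
for feature in data['features']:
    name = feature['properties']['Name']
    for ring in feature['geometry']['coordinates']:
        for lon, lat in ring:
            rows.append({'Name': name, 'lon': lon, 'lat': lat})

df_coords = pd.DataFrame(rows)
print(df_coords.head())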
Hope that helps.

The JSON data contains values of different datatypes, and these cannot be loaded directly into one single DataFrame.
View the datatypes in the JSON:
[type(data[k]) for k in data.keys()]
# Out: [str, str, dict, list]
data.keys()
# Out: dict_keys(['type', 'name', 'crs', 'features'])
You can load each chunk of data into a separate DataFrame like this:
df_crs = pd.DataFrame(data['crs'])
df_features = pd.DataFrame(data['features'])
data['type'] and data['name'] are plain strings:
data['type']
# Out 'FeatureCollection'
data['name']
# Out 'Altstadt Nord'
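Note that df_features will still contain dict-valued columns ('properties' and 'geometry'). As a follow-up sketch, pd.json_normalize can flatten those dicts into ordinary columns:
import pandas as pd

# One row per feature, nested dicts expanded into columns such as
# 'properties.Name' and 'geometry.coordinates'.
df_features_flat = pd.json_normalize(data['features'])
print(df_features_flat.columns.tolist())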

Related

Changing schema of avro file when writing to it in append mode

I'm looking for a way to modify the schema of an Avro file in Python. Taking the following example using the fastavro package: first, write out some initial records with the corresponding schema:
from fastavro import writer, parse_schema

schema = {
    'name': 'test',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'val', 'type': 'long'},
    ],
}
records = [
    {u'id': 1, u'val': 0.2},
    {u'id': 2, u'val': 3.1},
]
with open('test.avro', 'wb') as f:
    writer(f, parse_schema(schema), records)
Uhoh, I've got some more records, but they contain None values. I'd like to append these records to the avro file, and modify my schema accordingly:
more_records = [
    {u'id': 3, u'val': 1.5},
    {u'id': 2, u'val': None},
]
schema['fields'][1]['type'] = ['long', 'null']
with open('test.avro', 'a+b') as f:
    writer(f, parse_schema(schema), more_records)
Instead of overwriting the schema, this results in an error:
ValueError: Provided schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}]}}} does not match file writer_schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}]}}}
Is there a workaround for this? The fastavro docs for this suggest it's not possible, but I'm hoping someone knows of a way!
Cheers
The append API in fastavro does not currently support this. You could open an issue in that repository and discuss whether something like this makes sense.
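In the meantime, a hedged workaround (not part of fastavro's append API) is to read the existing records back and rewrite the whole file under the widened schema:
from fastavro import reader, writer, parse_schema

# Read everything written so far under the old schema.
with open('test.avro', 'rb') as f:
    old_records = list(reader(f))

# Rewrite the file from scratch with the union type, old records first.
schema['fields'][1]['type'] = ['long', 'null']
with open('test.avro', 'wb') as f:
    writer(f, parse_schema(schema), old_records + more_records)

This trades appending for a full rewrite, so it only suits files small enough to re-serialize.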

Convert nested dictionary within JSON from a string

I have JSON data with a messy structure: nested dictionaries are wrapped in single quotes and therefore parsed as strings, rather than as dictionaries I can loop through. What is the best way to convert the 'value' property from a string back into a dictionary?
Provided below is an example of the structure:
for val in json_data:
    print(val)
{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'},
If I add a nested loop targeting ['value'], it iterates by character rather than by key-value pair, because the value is still a string.
Use json.loads to convert the string to a dict:
import json
json_data = [{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'}]
# the result is a Python dictionary:
for val in json_data:
    print(json.loads(val['value']))
This should work!
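If you want to keep the parsed dictionaries rather than just print them, a small follow-up sketch converts each 'value' in place:
import json

for val in json_data:
    val['value'] = json.loads(val['value'])

# val['value'] is now a dict, so nested loops iterate over key-value pairs, not characters.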

Iterate through dictionary to replace leading zeros?

I want to iterate through this dictionary, find any 'id' that has a leading zero, like the one below, and replace it with the zero removed. So 'id': '01001' would become 'id': '1001'.
Here is how to get the data I'm working with:
from urllib.request import urlopen
import json

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
So far I've been able to get one ID at a time, but I'm not sure how to loop through to get all of the IDs.
My code so far: counties['features'][0]['id']. The data looks like this:
{ 'type': 'FeatureCollection',
'features': [{'type': 'Feature',
'properties': {'GEO_ID': '0500000US01001',
'STATE': '01',
'COUNTY': '001',
'NAME': 'Autauga',
'LSAD': 'County',
'CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon',
'coordinates': [[[-86.496774, 32.344437],
[-86.717897, 32.402814],
[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437]]]},
'id': '01001'}
]
}
from urllib.request import urlopen
import json

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
Then iterate over the list of ids in your JSON structure and update each id as:
counties['features'][0]['id'] = counties['features'][0]['id'].lstrip("0")
lstrip will remove the leading zeroes from the string; apply the same line to every element of counties['features'] in a loop (as shown in the next answer).
Suppose your dictionary counties has the following data. You can use the following code:
counties={'type': 'FeatureCollection',
'features': [ {'type': 'Feature','properties': {'GEO_ID': '0500000US01001','STATE': '01','COUNTY': '001','NAME': 'Autauga', 'LSAD': 'County','CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon','coordinates': [[[-86.496774, 32.344437],[-86.717897, 32.402814],[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437] ]] } ,'id': '01001'}, {'type': 'Feature','properties': {'GEO_ID': '0500000US01001','STATE': '01','COUNTY': '001','NAME': 'Autauga', 'LSAD': 'County','CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon','coordinates': [[[-86.496774, 32.344437],[-86.717897, 32.402814],[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437] ]] } ,'id': '000000000001001'} ]}
for feature in counties['features']:
    feature['id'] = feature['id'].lstrip("0")
print(counties)
Here is a shorter and faster way of doing this, using json object hooks. The object_hook is called once for every JSON object as it is decoded, so every feature dict passes through stripZeroes:
def stripZeroes(d):
    if 'id' in d:
        d['id'] = d['id'].lstrip('0')
    return d

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response, object_hook=stripZeroes)
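A quick sanity check that the leading zeros are gone after loading:
print(counties['features'][0]['id'])  # e.g. '1001' rather than '01001'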

Mapbox Distances with Python

I am trying to calculate distances among given points with Mapbox in Python. I read the documentation and some examples and I came up with the following code.
from mapbox import Distance

glinavos = {
    'type': 'Feature',
    'properties': {'name': 'glinavos'},
    'geometry': {
        'type': 'Point',
        'coordinates': [39.754598, 20.654121]}}
zoinos = {
    'type': 'Feature',
    'properties': {'name': 'zoinos'},
    'geometry': {
        'type': 'Point',
        'coordinates': [39.754204, 20.640761]}}
katogi = {
    'type': 'Feature',
    'properties': {'name': 'katogi'},
    'geometry': {
        'type': 'Point',
        'coordinates': [39.776992, 21.180688]}}

myDistance = Distance(access_token="pk.eyJ1IjoiaWxpdHNlIiwiYSI6ImNpenZmcm11YjAwMGQyd2x1Nm9nd2pqcGUifQ.1PZaOWTVajnQZGeBb_x1Bw")
result = myDistance.distances([glinavos, zoinos, katogi], 'driving')
It keeps returning a 403 error, while everything else seems fine. The three test points are real places, and I have tried two access tokens: my public one and my secret (more privileges) one. In addition, I have tried calling the service with the same points and tokens through curl, and it worked fine. What is wrong with my script? The code above uses the public token.

Accessing YAML data in Python

I have a YAML file that parses into an object, e.g.:
{'name': [{'proj_directory': '/directory/'},
{'categories': [{'quick': [{'directory': 'quick'},
{'description': None},
{'table_name': 'quick'}]},
{'intermediate': [{'directory': 'intermediate'},
{'description': None},
{'table_name': 'intermediate'}]},
{'research': [{'directory': 'research'},
{'description': None},
{'table_name': 'research'}]}]},
{'nomenclature': [{'extension': 'nc'},
{'handler': 'script'},
{'filename': [{'id': [{'type': 'VARCHAR'}]},
{'date': [{'type': 'DATE'}]},
{'v': [{'type': 'INT'}]}]},
{'data': [{'time': [{'variable_name': 'time'},
{'units': 'minutes since 1-1-1980 00:00 UTC'},
{'latitude': [{'variable_n...
I'm having trouble accessing the data in Python and regularly see the error TypeError: list indices must be integers, not str.
I want to be able to access all elements corresponding to 'name', so to retrieve each data field I imagine it would look something like:
import yaml

settings_stream = open('file.yaml', 'r')
settingsMap = yaml.safe_load(settings_stream)
yaml_stream = True

print('loaded settings for: ')
for project in settingsMap:
    print(project + ', ' + settingsMap[project]['project_directory'])
and I would expect each element would be accessible via something like ['name']['categories']['quick']['directory']
and something a little deeper would just be:
['name']['nomenclature']['data']['latitude']['variable_name']
or am I completely wrong here?
The brackets, [], indicate that you have lists of dicts, not just a dict.
For example, settingsMap['name'] is a list of dicts.
Therefore, you need to select the correct dict in the list using an integer index, before you can select the key in the dict.
So, given your current data structure, you'd need to use:
settingsMap['name'][1]['categories'][0]['quick'][0]['directory']
Or, revise the underlying YAML data structure.
For example, if the data structure looked like this:
settingsMap = {
    'name': {
        'proj_directory': '/directory/',
        'categories': {
            'quick': {'directory': 'quick',
                      'description': None,
                      'table_name': 'quick'},
            'intermediate': {'directory': 'intermediate',
                             'description': None,
                             'table_name': 'intermediate'},
            'research': {'directory': 'research',
                         'description': None,
                         'table_name': 'research'}},
        'nomenclature': {
            'extension': 'nc',
            'handler': 'script',
            'filename': {'id': {'type': 'VARCHAR'},
                         'date': {'type': 'DATE'},
                         'v': {'type': 'INT'}},
            'data': {'time': {'variable_name': 'time',
                              'units': 'minutes since 1-1-1980 00:00 UTC'}}}}}
then you could access the same value as above with
settingsMap['name']['categories']['quick']['directory']
# quick
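For completeness, a small sketch (with hypothetical YAML content) showing how such a file could be written with nested mappings so that yaml.safe_load produces nested dicts directly:
import yaml

# Nested mappings (no leading '-' list items) parse into nested dicts.
doc = """
name:
  proj_directory: /directory/
  categories:
    quick:
      directory: quick
      description: null
      table_name: quick
"""

settingsMap = yaml.safe_load(doc)
print(settingsMap['name']['categories']['quick']['directory'])  # quick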
