Flattening of JSON for analysis - python

I need a little help. I need to flatten this JSON so that I can use it for analysis.
A sample of the JSON is:
{'data': [{'tag': 'U128_CRC_2', 'timestamp': 1575234889002, 'value': 0.0}],
'metadata': {'event': 'alarm.reset',
'idx': '1372',
'timestamp': 1575234889.002701},
'productID': '41ae4b41-56be-4bf2-a6a8-7aee4d15bf54',
'timestamp': 1575234889008,
'topicIdx': '1'}
I ran the following code:
import json
from pandas import json_normalize

with open('NewJson.json') as f:
    d1 = json.load(f)
works_data = json_normalize(data=d1, record_path='data',
                            meta=['tag', 'value', 'timestamp'])
I get the error below:
KeyError: "Try running with errors='ignore' as key 'tag' is not always present"
Can anybody please help?

Your problem is that the value under the 'data' key is a list containing a dictionary. You have to unwrap the list manually:
Example:
from pandas import json_normalize

d1 = {'data': [{'tag': 'U128_CRC_2', 'timestamp': 1575234889002, 'value': 0.0}],
      'metadata': {'event': 'alarm.reset',
                   'idx': '1372',
                   'timestamp': 1575234889.002701},
      'productID': '41ae4b41-56be-4bf2-a6a8-7aee4d15bf54',
      'timestamp': 1575234889008,
      'topicIdx': '1'}
d1['data'] = d1.get('data')[0]
works_data = json_normalize(data=d1)
works_data
Output:
productID timestamp topicIdx data.tag data.timestamp data.value metadata.event metadata.idx metadata.timestamp
0 41ae4b41-56be-4bf2-a6a8-7aee4d15bf54 1575234889008 1 U128_CRC_2 1575234889002 0.0 alarm.reset 1372 1.575235e+09
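If you would rather keep the 'data' list intact (for example when it holds several readings), a minimal sketch along these lines should also work; the top-level fields go into meta instead of the keys inside 'data', which is what triggered the original KeyError:
import json
from pandas import json_normalize

with open('NewJson.json') as f:   # same file name as in the question
    d1 = json.load(f)

# record_path walks the list under 'data'; meta pulls fields from the top level
works_data = json_normalize(
    d1,
    record_path='data',
    meta=['productID', 'timestamp', 'topicIdx', ['metadata', 'event']],
    record_prefix='data.',
)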

Related

Replace single quotes with doubles to turn contents of a file into a nested JSON and normalize it afterwards

I have 70k files all of which look similar to this:
{'id': 24, 'name': None, 'city': 'City', 'region_id': 19,
'story_id': 1, 'description': 'text', 'uik': None, 'ustatus': 'status',
'wuiki_tik_name': '', 'reaction': None, 'reaction_official': '',
'created_at': '2011-09-07T07:24:44.420Z', 'lat': 54.7, 'lng': 20.5,
'regions': {'id': 19, 'name': 'name'}, 'stories': {'id': 1, 'name': '2011-12-04'}, 'assets': [], 'taggings': [{'tags': {'id': 6, 'name': 'name',
'tag_groups': {'id': 3, 'name': 'Violation'}}},
{'tags': {'id': 8, 'name': 'name', 'tag_groups': {'id': 5, 'name': 'resource'}}},
{'tags': {'id': 1, 'name': '01. Federal', 'tag_groups': {'id': 1, 'name': 'Level'}}},
{'tags': {'id': 3, 'name': '03. Local', 'tag_groups': {'id': 1, 'name': 'stuff'}}},
{'tags': {'id': 2, 'name': '02. Regional', 'tag_groups':
{'id': 1, 'name': 'Level'}}}], 'message_id': None, '_count': {'assets': 0, 'other_messages': 0, 'similars': 0, 'taggings': 5}}
The ultimate goal is to export it into a single CSV file. It can be successfully done without flattening. But since it has a lot of nested values, I would like to flatten it, and this is where I began facing problems related to data types. Here's the code:
import json
from pandas.io.json import json_normalize
import glob

path = glob.glob("all_messages/*.json")
for file in path:
    with open(file, "r") as filer:
        content = json.loads(json.dumps(filer.read()))
    if content != 404:
        df_main = json_normalize(content)
        df_regions = json_normalize(content, record_path=['regions'], record_prefix='regions.', meta=['id'])
        df_stories = json_normalize(content, record_path=['stories'], record_prefix='stories.', meta=['id'])
        # ... More code related to normalization
df_out.to_csv('combined_json.csv')
This code occasionally throws:
AttributeError: 'str' object has no attribute 'values' or ValueError: DataFrame constructor not properly called!. I realise that this is caused by json.dumps() producing a JSON string rather than a dict. However, I have failed to turn it into anything usable.
Any possible solutions to this?
If you only need to change ' to ":
...
for file in path:
    with open(file, "r") as filer:
        content = filer.read().replace("'", '"')
...
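A sketch of how that fits back into the original loop; note that json.loads will still reject Python-only literals such as None or True, so this only works if the files stay within plain JSON values:
import json

for file in path:
    with open(file, "r") as filer:
        content = json.loads(filer.read().replace("'", '"'))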
Making copies and using grep would be easier
While it is not the solution I was initially expecting, this approach worked as well. I kept getting error messages about the structure of the dict literals, which were reluctant to become JSON, so I took the CSV file that I wanted to normalise and worked with each column one by one:
import json
import pandas as pd

df = pd.read_csv("combined_json.csv")
df['regions'] = df['regions'].apply(lambda x: x.replace("'", '"'))
regions = pd.json_normalize(df['regions'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('regions.', ''))
df['regions'] = regions['name']
Or, if it had more nested levels:
df['taggings'] = df['taggings'].apply(lambda x: x.replace("'", '"'))
taggings = pd.concat([pd.json_normalize(json.loads(j)) for j in df['taggings']])
df = df.reset_index(drop=True)
taggings = taggings.reset_index(drop=True)
df[['tags_id', 'nametag', 'group_tag', 'group_tag_name']] = taggings[['tags.id', 'tags.name', 'tags.tag_groups.id', 'tags.tag_groups.name']]
The result was eventually written out with df.to_csv().
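Another option worth noting, since files like these are usually Python dict literals rather than JSON (single quotes, None instead of null), is to parse them with ast.literal_eval instead of rewriting quotes. A minimal sketch, assuming the same all_messages/*.json layout as in the question:
import ast
import glob
import pandas as pd

records = []
for file in glob.glob("all_messages/*.json"):
    with open(file, "r") as filer:
        # literal_eval understands single quotes, None, True/False
        records.append(ast.literal_eval(filer.read()))

# json_normalize flattens nested dicts such as 'regions' and 'stories'
df_main = pd.json_normalize(records)
df_main.to_csv('combined_json.csv', index=False)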

Trying to follow django docs to create serialized json

I'm trying to seed a database in a Django app. I have a CSV file that I converted to JSON, and now I need to reformat it to match the serialization format Django requires, found here.
This is what the JSON needs to look like to be acceptable to Django (which looks an awful lot like a list of dictionaries with 3 keys, the third having a value which is a dictionary itself):
[
    {
        "pk": "4b678b301dfd8a4e0dad910de3ae245b",
        "model": "sessions.session",
        "fields": {
            "expire_date": "2013-01-16T08:16:59.844Z",
            ...
        }
    }
]
My json data looks like this after converting it from csv with pandas:
[{'model': 'homepage.territorymanager', 'pk': 1, 'Name': 'Aaron ##', 'Distributor': 'National Energy', 'State': 'BC', 'Brand': 'Trane', 'Cell': '778-###-####', 'email address': None, 'Notes': None, 'Unnamed: 9': None}, {'model': 'homepage.territorymanager', 'pk': 2, 'Name': 'Aaron Martin ', 'Distributor': 'Pierce ###', 'State': 'PA', 'Brand': 'Bryant/Carrier', 'Cell': '267-###-####', 'email address': None, 'Notes': None, 'Unnamed: 9': None},...]
I am using this function to try to reformat it:
def re_serialize_reg_json(d, jsonFilePath):
    for i in d:
        d2 = {'Name': d[i]['Name'], 'Distributor': d[i]['Distributor'], 'State': d[i]['State'], 'Brand': d[i]['Brand'], 'Cell': d[i]['Cell'], 'EmailAddress': d[i]['email address'], 'Notes': d[i]['Notes']}
        d[i] = {'pk': d[i]['pk'], 'model': d[i]['model'], 'fields': d2}
    print(d)
and it returns this error, which doesn't make sense to me because the format that Django requires has a dictionary as the value of the third key:
d2 = {'Name': d[i]['Name'], 'Distributor' : d[i]['Distributor'], 'State' : d[i]['State'], 'Brand' : d[i]['Brand'], 'Cell' : d[i]['Cell'], 'EmailAddress' : d[i]['email address'], 'Notes' : d[i]['Notes']}
TypeError: list indices must be integers or slices, not dict
Any help appreciated!
Here is what I did to get d:
df = pandas.read_csv('/Users/justinbenfit/territorymanagerpython/territory managers - Sheet1.csv')
df.to_json('/Users/justinbenfit/territorymanagerpython/territorymanagers.json', orient='records')
jsonFilePath = '/Users/justinbenfit/territorymanagerpython/territorymanagers.json'

def load_file(file_path):
    with open(file_path) as f:
        d = json.load(f)
    return d

d = load_file(jsonFilePath)
print(d)
d is actually a list containing multiple dictionaries, so in order to make it work you want to change the for i in d part to for i in range(len(d)).
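A minimal sketch of the corrected function (using enumerate rather than range(len(d)), which does the same thing; it assumes every record has all of the listed keys):
def re_serialize_reg_json(d, jsonFilePath):
    for i, record in enumerate(d):
        fields = {'Name': record['Name'], 'Distributor': record['Distributor'],
                  'State': record['State'], 'Brand': record['Brand'],
                  'Cell': record['Cell'], 'EmailAddress': record['email address'],
                  'Notes': record['Notes']}
        # Django fixture format: pk, model, and a nested 'fields' dict
        d[i] = {'pk': record['pk'], 'model': record['model'], 'fields': fields}
    return d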

Converting nested list into data frame using pandas

I have the structure below stored in a variable called data:
{'id': 255719,
'display_name': 'Off Broadway Apartments',
'access_right': {'status': 'OWNED', 'api_enabled': True},
'creation_time': '2021-04-26T15:53:29+00:00',
'status': {'value': 'OFFLINE', 'last_change': '2021-07-10T17:26:50+00:00'},
'licence': {'id': 74213,
'code': '23AF-15A8-0514-2E4B-04DE-5C19-A574-B20B',
'bound_mac_address': '00:11:32:C2:FE:6A',
'activation_time': '2021-04-26T15:53:29+00:00',
'type': 'SUBSCRIPTION'},
'timezone': 'America/Chicago',
'version': {'agent': '3.7.0-b001', 'package': '2.5.1-0022'},
'location': {'latitude': '41.4126', 'longitude': '-99.6345'}}
I would like to convert it into a data frame. I tried the code below:
df = pd.DataFrame(data)
but it's not coming out properly because of the many nested values. Can anyone advise?
from pandas import json_normalize
# load json
json = {'id': 255719,
'display_name': 'Off Broadway Apartments',
'access_right': {'status': 'OWNED', 'api_enabled': True},
'creation_time': '2021-04-26T15:53:29+00:00',
'status': {'value': 'OFFLINE', 'last_change': '2021-07-10T17:26:50+00:00'},
'licence': {'id': 74213,
'code': '23AF-15A8-0514-2E4B-04DE-5C19-A574-B20B',
'bound_mac_address': '00:11:32:C2:FE:6A',
'activation_time': '2021-04-26T15:53:29+00:00',
'type': 'SUBSCRIPTION'},
'timezone': 'America/Chicago',
'version': {'agent': '3.7.0-b001', 'package': '2.5.1-0022'},
'location': {'latitude': '41.4126', 'longitude': '-99.6345'}}
Create a function to flatten nested JSON:
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
You can now use that function on your original json file:
flat = flatten_json(json)
df = json_normalize(flat)
Results:
id display_name ... location_latitude location_longitude
0 255719 Off Broadway Apartments ... 41.4126 -99.6345
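As a side note, in this particular example every nested value is a dict (there are no lists), so pd.json_normalize can flatten the data directly; a minimal sketch, using an underscore separator to match the column names above:
import pandas as pd

data = {'id': 255719,
        'display_name': 'Off Broadway Apartments',
        'access_right': {'status': 'OWNED', 'api_enabled': True},
        'location': {'latitude': '41.4126', 'longitude': '-99.6345'}}  # truncated sample from the question

# nested dicts are expanded into columns such as access_right_status, location_latitude
df = pd.json_normalize(data, sep='_')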

How to convert json into a pandas dataframe?

I'm trying to convert an API response from JSON to a dataframe in pandas. The problem I am having is that the data is nested in the JSON format and I am not getting the right columns in my dataframe.
The data is collected from an API with the following format:
{'tickets': [{'url': 'https...',
'id': 1,
'external_id': None,
'via': {'channel': 'web',
'source': {'from': {}, 'to': {}, 'rel': None}},
'created_at': '2020-05-01T04:16:33Z',
'updated_at': '2020-05-23T03:02:49Z',
'type': 'incident',
'subject': 'Subject',
'raw_subject': 'Raw subject',
'description': 'Hi, this is the description',
'priority': 'normal',
'status': 'closed',
'recipient': None,
'requester_id': 409467360874,
'submitter_id': 409126461453,
'assignee_id': 409126461453,
'organization_id': None,
'group_id': 360009916453,
'collaborator_ids': [],
'follower_ids': [],
'email_cc_ids': [],
'forum_topic_id': None,
'problem_id': None,
'has_incidents': False,
'is_public': True,
'due_at': None,
'tags': ['tag_1',
'tag_2',
'tag_3',
'tag_4'],
'custom_fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'satisfaction_rating': {'score': 'unoffered'},
'sharing_agreement_ids': [],
'comment_count': 2,
'fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'followup_ids': [],
'ticket_form_id': 360003608013,
'deleted_ticket_form_id': 360003608013,
'brand_id': 360004571673,
'satisfaction_probability': None,
'allow_channelback': False,
'allow_attachments': True},
What I have already tried is the following: I converted the JSON into a dict as follows:
x = response.json()
df = pd.DataFrame(x['tickets'])
But I'm struggling with the output. I don't know how to get a correct, ordered, normalized dataframe.
(I'm new in this :) )
Let's suppose you get your request data with this code: r = requests.get(url, auth=auth)
Your data isn't clear yet, so let's get a dataframe of it: data = pd.read_json(json.dumps(r.json(), ensure_ascii=False))
But you will probably get a dataframe with one single row.
When I faced a problem like this, I wrote this function to get the full data:
listParam = []

def listDict(entry):
    if type(entry) is dict:
        listParam.append(entry)
    elif type(entry) is list:
        for ent in entry:
            listDict(ent)
Because your data is wrapped in a dict ({'tickets': ...}), you will need to get the information like this:
listDict(data.iloc[0][0])
And then,
pd.DataFrame(listParam)
I can't show the results because you didn't post the complete data nor say where I can find the data to test it, but this will probably work.
You have to convert the JSON to a dictionary first and then convert the dictionary value for the key 'tickets' into a dataframe.
import json
import pandas as pd

file = open('file.json').read()
ticketDictionary = json.loads(file)
df = pd.DataFrame(ticketDictionary['tickets'])
'file.json' contains your data here.
df now contains your data as a dataframe.
For the lists within the response you can have separate dataframes if required:
for field in df['fields']:
    fields_df = pd.DataFrame(field)
It will give you this for each 'fields' entry:
id value
0 360042034433 value of the first custom field
1 360041487874 value of the second custom field
2 360041489414 value of the third custom field
3 360040980053 correo_electrónico
4 360040980373 suscribe_newsletter
5 360042046173 None
6 360041028574 product
7 360042103034 None
This can be one way to structure as you haven't mentioned the exact expected format.
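As an alternative, pd.json_normalize flattens the nested dicts (via, satisfaction_rating) into dotted columns in one call. A minimal sketch, assuming x is the parsed response from the question (x = response.json()):
import pandas as pd

# x = response.json()   (the parsed API response from the question)
df = pd.json_normalize(x['tickets'])   # via.channel, satisfaction_rating.score, ... become columns

# list-valued columns such as 'custom_fields' stay as Python lists; they can be
# expanded into their own table, one row per custom field per ticket:
custom = df[['id', 'custom_fields']].explode('custom_fields').reset_index(drop=True)
custom = pd.concat(
    [custom[['id']].rename(columns={'id': 'ticket_id'}),
     pd.json_normalize(custom['custom_fields'].tolist())],
    axis=1)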

Iterate through dictionary to replace leading zeros?

I want to iterate through this dictionary, find any 'id' that has a leading zero, like the one below, and replace it with a version without the zero, so 'id': '01001' would become 'id': '1001'.
Here is how to get the data I'm working with:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
So far I've been able to get one ID at a time, but I'm not sure how to loop through to get all of the IDs.
My code so far: counties['features'][0]['id']
{ 'type': 'FeatureCollection',
'features': [{'type': 'Feature',
'properties': {'GEO_ID': '0500000US01001',
'STATE': '01',
'COUNTY': '001',
'NAME': 'Autauga',
'LSAD': 'County',
'CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon',
'coordinates': [[[-86.496774, 32.344437],
[-86.717897, 32.402814],
[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437]]]},
'id': '01001'}
]
}
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
Then iterate over the list of ids in your JSON structure and update each id as follows:
counties['features'][0]['id'] = counties['features'][0]['id'].lstrip("0")
lstrip will remove leading zeroes from the string.
Suppose your dictionary counties has the following data. You can use the following code:
counties={'type': 'FeatureCollection',
'features': [ {'type': 'Feature','properties': {'GEO_ID': '0500000US01001','STATE': '01','COUNTY': '001','NAME': 'Autauga', 'LSAD': 'County','CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon','coordinates': [[[-86.496774, 32.344437],[-86.717897, 32.402814],[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437] ]] } ,'id': '01001'}, {'type': 'Feature','properties': {'GEO_ID': '0500000US01001','STATE': '01','COUNTY': '001','NAME': 'Autauga', 'LSAD': 'County','CENSUSAREA': 594.436},
'geometry': {'type': 'Polygon','coordinates': [[[-86.496774, 32.344437],[-86.717897, 32.402814],[-86.814912, 32.340803],
[-86.890581, 32.502974],
[-86.917595, 32.664169],
[-86.71339, 32.661732],
[-86.714219, 32.705694],
[-86.413116, 32.707386],
[-86.411172, 32.409937],
[-86.496774, 32.344437] ]] } ,'id': '000000000001001'} ]}
for feature in counties['features']:
    feature['id'] = feature['id'].lstrip("0")
print(counties)
Here is a shorter and faster way of doing this using JSON object hooks:
def stripZeroes(d):
    if 'id' in d:
        d['id'] = d['id'].lstrip('0')
    return d
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response, object_hook=stripZeroes)
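A quick check after loading (a sketch; the first feature in this dataset is the one shown in the question, with id '01001'):
print(counties['features'][0]['id'])   # '1001' once the leading zero is stripped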
