Import JSON to dataframe and normalize - python

I have the following json document which i want to import into a dataframe:
{
"agents": [
{
"core_build": "17",
"core_version": "7.1.1",
"distro": "win-x86-64",
"groups": [
{
"id": 101819,
"name": "O Laptops"
}
],
"id": 2198802,
"ip": "x.x.x.x",
"last_connect": 1539962159,
"last_scanned": 1539373347,
"linked_on": 1534964847,
"name": "x1x1x1x1",
"platform": "WINDOWS",
"plugin_feed_id": "201810182051",
"status": "on",
"uuid": "ca8b941a-80cd-4c1c-8044-760e69781eb7"
},
{
"core_build": "17",
"core_version": "7.1.1",
"distro": "win-x86-64",
"groups": [
{
"id": 101839,
"name": "G Personal"
},
{
"id": 102037,
"name": "W6"
},
{
"id": 102049,
"name": "MS8"
}
],
"id": 2097601,
"ip": "x.x.x.x",
"last_connect": 1539962304,
"last_scanned": 1539437865,
"linked_on": 1529677890,
"name": "x2xx2x2x2",
"platform": "WINDOWS",
"plugin_feed_id": "201810181351",
"status": "on",
"uuid": "7e3ef1ff-4f08-445a-b500-e7ce3ca9a2f2"
},
{
"core_build": "14",
"core_version": "7.1.0",
"distro": "win-x86-64",
"id": 2234103,
"ip": "x6x6x6x6x",
"last_connect": 1537384290,
"linked_on": 1537384247,
"name": "x7x7x7x",
"platform": "WINDOWS",
"status": "off",
"uuid": "0696ee38-402a-4866-b753-2816482dfce6"
}],
"pagination": {
"limit": 5000,
"offset": 0,
"sort": [
{
"name": "name",
"order": "asc"
}
],
"total": 14416
}
}
I've written the following code for the same purpose:
import json
from pandas.io.json import json_normalize
with open('out.json') as f:
data = json.load(f)
df = json_normalize(data, 'agents', [['groups', 'name']], errors='ignore')
print(df)
This unpacks all the fields within 'agents'(along with the 'groups' field as a multi-value field) as is, along with a new field called 'groups.name' which is null(all values are NaN).
I only wish to unpack the fields within the 'agents' field into a dataframe, with the fields within 'groups' field being unpacked into individual columns ('core_build', 'core_version', 'distro', 'groups.name', 'id', 'ip','last_connect', 'last_scanned', 'linked_on', 'name', 'platform','plugin_feed_id', 'status', 'uuid').
How can I achieve this?
Edit:
Doing the following
df = json_normalize(pd.concat([pd.DataFrame(i) for i in data['agents']]).to_dict('r'))
returns an error
ValueError: If using all scalar values, you must pass an index

You can use pd.concat() with a list comprehension:
df = pd.concat([pd.DataFrame(i) for i in my_json['agents']])
Or try this if you would like to unpack the group column of type dict to separate columns:
df = json_normalize(pd.concat([pd.DataFrame(i) for i in my_json['agents']]).to_dict('r'))
Yields:
core_build core_version distro groups.id groups.name id \
0 17 7.1.1 win-x86-64 101819 O Laptops 2198802
1 17 7.1.1 win-x86-64 101893 V Laptops 2169839
2 17 7.1.1 win-x86-64 101839 Personal 2097601
3 17 7.1.1 win-x86-64 102037 Wi 2097601
4 17 7.1.1 win-x86-64 102049 MS8 2097601
ip last_connect last_scanned linked_on name platform \
0 x.x.x.x 1539962159 1539373347 1534964847 x1x1x1x1 WINDOWS
1 x.x.x.x 1539962767 1539374603 1533666075 x2x2x2x2 WINDOWS
2 x.x.x.x 1539962304 1539437865 1529677890 x3x3x3x3 WINDOWS
3 x.x.x.x 1539962304 1539437865 1529677890 x3x3x3x3 WINDOWS
4 x.x.x.x 1539962304 1539437865 1529677890 x3x3x3x3 WINDOWS
plugin_feed_id status uuid
0 201810182051 on ca8b941a-80cd-4c1c-8044-760e69781eb7
1 201810171657 on 9400817b-235b-423b-b163-c4c86f973232
2 201810181351 on 7e3ef1ff-4f08-445a-b500-e7ce3ca9a2f2
3 201810181351 on 7e3ef1ff-4f08-445a-b500-e7ce3ca9a2f2
4 201810181351 on 7e3ef1ff-4f08-445a-b500-e7ce3ca9a2f2

Related

Converting json to pandas dataframe with weather datasets

How can we convert this to dataframes? I have tried multiple ways on how it can be achived, i have tried with json file on w3school but it is working correctly, i am new with python, any recommendations on this?
Json format is
[
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
The expected output is :
it tried to do a conversion but i am receiving error and it is not the whole data
with open('data/history.city.list.json') as f:
data = json.load(f)
but not able to load as data, This is what i have tried but i feel
_id = []
country = []
coord_lat = []
coord_lon = []
counter = 0
for i in data:
_id.append(data[counter]['id'])
country.append(data[counter]['city']['country'])
coord_lat.append(data[counter]['city']['coord']['lon'])
coord_lat.append(data[counter]['city']['coord']['lat'])
counter += 1
When i have tried to print it as a dataframe
df = pd.DataFrame({'Longtitude' : coord_lat , 'Latitude' : coord_lat})
df.head(10)
This was able to set it to dataframe, but as soon as i add 'Country' to pd.dataframe() , it will return as ValueError: arrays must all be same length.
i understand that country column does not match the other columns but can we achieve this and is there a simpler way to do it ?
You can use json_normalize() as described here:
import pandas as pd
d = [
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
pd.io.json.json_normalize(d)
Output:
id city.id.$numberLong city.name city.findname city.country city.coord.lon city.coord.lat city.zoom.$numberLong id.$numberLong
0 14256.0 14256 Azadshahr AZADSHAHR IR 48.570728 34.790878 10 NaN
1 NaN 465726 Zadonsk ZADONSK RU 38.926102 52.390400 16 465726
The column names do not match your expected output, but you can change that easily with df.columns = ['Id', 'city', ... 'Zoom']

Convert JSON with nested objects to Pandas Dataframe

I am trying to load json from a url and convert to a Pandas dataframe, so that the dataframe would look like the sample below.
I've tried json_normalize, but it duplicates the columns, one for each data type (value and stringValue). Is there a simpler way than this method and then dropping and renaming columns after creating the dataframe? I want to keep the stringValue.
Person ID Position ID Job ID Manager
0 192 936 93 Tom
my_json = {
"columns": [
{
"alias": "c3",
"label": "Person ID",
"dataType": "integer"
},
{
"alias": "c36",
"label": "Position ID",
"dataType": "string"
},
{
"alias": "c40",
"label": "Job ID",
"dataType": "integer",
"entityType": "job"
},
{
"alias": "c19",
"label": "Manager",
"dataType": "integer"
},
],
"data": [
{
"c3": {
"value": 192,
"stringValue": "192"
},
"c36": {
"value": "936",
"stringValue": "936"
},
"c40": {
"value": 93,
"stringValue": "93"
},
"c19": {
"value": 12412453,
"stringValue": "Tom"
}
}
]
}
If c19 is of type string, this should work
alias_to_label = {x['alias']: x['label'] for x in my_json["columns"]}
is_str = {x['alias']: ('string' == x['dataType']) for x in my_json["columns"]}
data = []
for x in my_json["data"]:
data.append({
k: v["stringValue" if is_str[k] else 'value']
for k, v in x.items()
})
df = pd.DataFrame(data).rename(columns=alias_to_label)

Python parse large JSON nests and lists - string indices must be integers

NIST recently released all CVE data in JSON format, and I am trying to parse it out to add to a MySQL database so I can compare my security findings to what NIST shows.
The data, is very confusing to parses because there is a lot of nesting, with some lists included.
Here is a snippet of the JSON.
{
"CVE_data_type": "CVE",
"CVE_data_format": "MITRE",
"CVE_data_version": "4.0",
"CVE_data_numberOfCVEs": "600",
"CVE_data_timestamp": "Fri Apr 28 16:00:10 EDT 2017",
"CVE_Items": [
{
"CVE_data_meta": {
"CVE_ID": "CVE-2007-6761"
},
"CVE_affects": {
"CVE_vendor": {
"CVE_data_version": "4.0",
"CVE_vendor_data": [
{
"CVE_vendor_name": "linux",
"CVE_product": {
"CVE_product_data": [
{
"CVE_data_version": "4.0",
"CVE_product_name": "linux_kernel",
"CVE_version": {
"CVE_version_data": [
{
"CVE_version_value": "2.6.23",
"CVE_version_affected": "<="
}
]
}
}
]
}
}
]
}
},
"CVE_configurations": {
"CVE_data_version": "4.0",
"CVE_configuration_data": [
{
"operator": "OR",
"cpe": [
{
"vulnerable": true,
"previousVersions": true,
"cpeMatchString": "cpe:/o:linux:linux_kernel:2.6.23",
"cpe23Uri": "cpe:2.3:o:linux:linux_kernel:2.6.23:*:*:*:*:*:*:*"
}
]
}
]
},
"CVE_description": {
"CVE_data_version": "4.0",
"CVE_description_data": [
{
"lang": "en",
"value": "drivers/media/video/videobuf-vmalloc.c in the Linux kernel before 2.6.24 does not initialize videobuf_mapping data structures, which allows local users to trigger an incorrect count value and videobuf leak via unspecified vectors, a different vulnerability than CVE-2010-5321."
}
]
},
"CVE_references": {
"CVE_data_version": "4.0",
"CVE_reference_data": [
{
"url": "http://www.linuxgrill.com/anonymous/kernel/v2.6/ChangeLog-2.6.24",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "http://www.securityfocus.com/bid/98001",
"name": "BID",
"publish_date": "04/26/2017"
},
{
"url": "https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827340",
"name": "MISC",
"publish_date": "04/24/2017"
},
{
"url": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "https://github.com/torvalds/linux/commit/0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
}
]
},
"CVE_impact": {
"CVE_impact_cvssv2": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"au": "NONE",
"c": "PARTIAL",
"i": "PARTIAL",
"a": "PARTIAL",
"score": "4.6"
}
},
"CVE_impact_cvssv3": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"pr": "LOW",
"ui": "NONE",
"scope": "UNCHANGED",
"c": "HIGH",
"i": "HIGH",
"a": "HIGH",
"score": "7.8"
}
}
},
"CVE_problemtype": {
"CVE_data_version": "4.0",
"CVE_problemtype_data": [
{
"description": [
{
"lang": "en",
"value": "CWE-119"
}
]
}
]
}
}
]
}
When I try to parse it to get the info I want, I run into errors. Here is the code test.
import json
with open('/tmp/nvdcve-1.0-recent.json') as data_file:
cve_data = json.load(data_file)
product_list = []
for data_list in cve_data["CVE_Items"]:
for cve_tag,cve_id in data_list["CVE_data_meta"].items():
cve = str(cve_id)
for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
print(product_name)
The Error
TypeError Traceback (most recent call last)
<ipython-input-10-81b0239327c1> in <module>()
10 cve = str(cve_id)
11
---> 12 for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
13 for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
14 print data_version
TypeError: list indices must be integers, not str
This is confusing to me because there is nests within nests, and lists within theses nests. I am having a hard time figuring out how to get some of this super nested info.
I feel your pain, but after closer inspection "CVE_vendor_data" is not a dictionary, but a list of dictionaries. Notice the "[]" after the colon. That is why it needs integers to index the list. Same goes for "CVE_product_data". It is also a list of dictionaries.

parsing nested JSON into multiple dataframe using pandas python

I have a nested JSON as shown below and want to parse into multiple dataframe in python .. please help
{
"tableName": "cases",
"url": "EndpointVoid",
"tableDataList": [{
"_id": "100017252700",
"title": "Test",
"type": "TECH",
"created": "2016-09-06T19:00:17.071Z",
"createdBy": "193164275",
"lastModified": "2016-10-04T21:50:49.539Z",
"lastModifiedBy": "1074113719",
"notes": [{
"id": "30",
"title": "Multiple devices",
"type": "INCCL",
"origin": "D",
"componentCode": "PD17A",
"issueCode": "IP321",
"affectedProduct": "134322",
"summary": "testing the json",
"caller": {
"email": "katie.slabiak#spps.org",
"phone": "651-744-4522"
}
}, {
"id": "50",
"title": "EDU: Multiple Devices - Lightning-to-USB Cable",
"type": "INCCL",
"origin": "D",
"componentCode": "PD17A",
"issueCode": "IP321",
"affectedProduct": "134322",
"summary": "parsing json 2",
"caller": {
"email": "testing1#test.org",
"phone": "123-345-1111"
}
}],
"syncCount": 2316,
"repair": [{
"id": "D208491610",
"created": "2016-09-06T19:02:48.000Z",
"createdBy": "193164275",
"lastModified": "2016-09-21T12:49:47.000Z"
}, {
"id": "D208491610"
}, {
"id": "D208491628",
"created": "2016-09-06T19:03:37.000Z",
"createdBy": "193164275",
"lastModified": "2016-09-21T12:49:47.000Z"
}
],
"enterpriseStatus": "8"
}],
"dateTime": 1475617849,
"primaryKeys": ["$._id"],
"primaryKeyVals": ["100017252700"],
"operation": "UPDATE"
}
I want to parse this and create 3 tables/dataframe/csv as shown below.. please help..
Output table in this format
I don't think this is best way, but I wanted to show you possibility.
import pandas as pd
from pandas.io.json import json_normalize
import json
with open('your_sample.json') as f:
dt = json.load(f)
Table1
df1 = json_normalize(dt, 'tableDataList', 'dateTime')[['_id', 'title', 'type', 'created', 'createdBy', 'lastModified', 'lastModifiedBy', 'dateTime']]
print df1
_id title type created createdBy \
0 100017252700 Test TECH 2016-09-06T19:00:17.071Z 193164275
lastModified lastModifiedBy dateTime
0 2016-10-04T21:50:49.539Z 1074113719 1475617849
Table 2
df2 = json_normalize(dt['tableDataList'], 'notes', '_id')
df2['phone'] = df2['caller'].map(lambda x: x['phone'])
df2['email'] = df2['caller'].map(lambda x: x['email'])
df2 = df2[['_id', 'id', 'title', 'email', 'phone']]
print df2
_id id title \
0 100017252700 30 Multiple devices
1 100017252700 50 EDU: Multiple Devices - Lightning-to-USB Cable
email phone
0 katie.slabiak#spps.org 651-744-4522
1 testing1#test.org 123-345-1111
Table 3
df3 = json_normalize(dt['tableDataList'], 'repair', '_id').dropna()
print df3
created createdBy id lastModified \
0 2016-09-06T19:02:48.000Z 193164275 D208491610 2016-09-21T12:49:47.000Z
2 2016-09-06T19:03:37.000Z 193164275 D208491628 2016-09-21T12:49:47.000Z
_id
0 100017252700
2 100017252700

Convert deeply nested json from facebook to dataframe in python

I am trying to get user details of persons who has put likes, comments on Facebook posts. I am using python facebook-sdk package. Code is as follows.
import facebook as fi
import json
graph = fi.GraphAPI('Access Token')
data = json.dumps(graph.get_object('DSIfootcandy/posts'))
From the above, I am getting a highly nested json. Here I will put only a json string for one post in the fb.
{
"paging": {
"next": "https://graph.facebook.com/v2.0/425073257683630/posts?access_token=&limit=25&until=1449201121&__paging_token=enc_AdD0DL6sN3aDZCwfYY25rJLW9IZBZCLM1QfX0venal6rpjUNvAWZBOoxTjbOYZAaFiBImzMqiv149HPH5FBJFo0nSVOPqUy78S0YvwZDZD",
"previous": "https://graph.facebook.com/v2.0/425073257683630/posts?since=1450843741&access_token=&limit=25&__paging_token=enc_AdCYobFJpcNavx6STzfPFyFe6eQQxRhkObwl2EdulwL7mjbnIETve7sJZCPMwVm7lu7yZA5FoY5Q4sprlQezF4AlGfZCWALClAZDZD&__previous=1"
},
"data": [
{
"picture": "https://fbcdn-photos-e-a.akamaihd.net/hphotos-ak-xfa1/v/t1.0-0/p130x130/1285_5066979392443_n.png?oh=b37a42ee58654f08af5abbd4f52b1ace&oe=570898E7&__gda__=1461440649_aa94b9ec60f22004675c4a527e8893f",
"is_hidden": false,
"likes": {
"paging": {
"cursors": {
"after": "MTU3NzQxODMzNTg0NDcwNQ==",
"before": "MTU5Mzc1MjA3NDE4ODgwMA=="
}
},
"data": [
{
"id": "1593752074188800",
"name": "Maduri Priyadarshani"
},
{
"id": "427605680763414",
"name": "Darshi Mashika"
},
{
"id": "599793563453832",
"name": "Shakeer Nimeshani Shashikala"
},
{
"id": "1577418335844705",
"name": "Däzlling Jalali Muishu"
}
]
},
"from": {
"category": "Retail and Consumer Merchandise",
"name": "Footcandy",
"category_list": [
{
"id": "2239",
"name": "Retail and Consumer Merchandise"
}
],
"id": "425073257683630"
},
"name": "Timeline Photos",
"privacy": {
"allow": "",
"deny": "",
"friends": "",
"description": "",
"value": ""
},
"is_expired": false,
"comments": {
"paging": {
"cursors": {
"after": "WTI5dGJXVnVkRjlqZFhKemIzSUVXdNVFExTURRd09qRTBOVEE0TkRRNE5EVT0=",
"before": "WTI5dGJXVnVkRjlqZFhKemIzNE16Y3dNVFExTVRFNE9qRTBOVEE0TkRRME5UVT0="
}
},
"data": [
{
"from": {
"name": "NiFû Shafrà",
"id": "1025030640553"
},
"like_count": 0,
"can_remove": false,
"created_time": "2015-12-23T04:20:55+0000",
"message": "wow lovely one",
"id": "50018692683829_500458145118",
"user_likes": false
},
{
"from": {
"name": "Shamnaz Lukmanjee",
"id": "160625809961884"
},
"like_count": 0,
"can_remove": false,
"created_time": "2015-12-23T04:27:25+0000",
"message": "Nice",
"id": "500186926838929_500450145040",
"user_likes": false
}
]
},
"actions": [
{
"link": "https://www.facebook.com/425073257683630/posts/5001866838929",
"name": "Comment"
},
{
"link": "https://www.facebook.com/42507683630/posts/500186926838929",
"name": "Like"
}
],
"updated_time": "2015-12-23T04:27:25+0000",
"link": "https://www.facebook.com/DSIFootcandy/photos/a.438926536298302.1073741827.4250732576630/50086926838929/?type=3",
"object_id": "50018692838929",
"shares": {
"count": 3
},
"created_time": "2015-12-23T04:09:01+0000",
"message": "Reach new heights in the cute and extremely comfortable \"Silviar\" www.focandy.lk",
"type": "photo",
"id": "425077683630_50018926838929",
"status_type": "added_photos",
"icon": "https://www.facebook.com/images/icons/photo1.gif"
}
]
}
Now I need to get this data into a dataframe as follows(no need to get all).
item | Like_id |Like_username | comments_userid |comments_username|comment(msg)|
-----+---------+--------------+-----------------+-----------------+------------+
Bag | 45546 | noel | 641 | James | nice work |
-----+---------+--------------+-----------------+-----------------+------------+
Any Help will be Highly Appreciated.
Not exactly like your intended format, but here is the making of a solution :
import pandas
DictionaryObject_as_List = str(mydict).replace("{","").replace("}","").replace("[","").replace("]","").split(",")
newlist = []
for row in DictionaryObject_as_List :
row = row.replace('https://',' ').split(":")
exec('newlist.append ( ' + "[" + " , ".join(row)+"]" + ')')
DataFrame_Object = pandas.DataFrame(newlist)
print DataFrame_Object

Categories

Resources