A efficient way to unpack nested json into a dataframe

A efficient way to unpack nested json into a dataframe - python

I have a nested json, and i want to transform it into a pandas dataframe. I was able to normalize with json_normalize.
However, there are still json layer within the dataframe, which i also want to unpack. How can i do it in the best way? I will likely have to deal with this a few more times within the project i am doing currently
The json i have is the following
{
"data": {
"allOpportunityApplication": {
"data": [
{
"id": "111111111",
"opportunity": {
"programme": {
"short_name": "XX"
}
},
"person": {
"home_lc": {
"name": "NAME"
}
},
"standards": [
{
"constant_name": "constant1",
"standard_option": {
"option": "true"
}
},
{
"constant_name": "constant2",
"standard_option": {
"option": "true"
}
}
]
}
]
}
}
}
Used json_normalize
standards_df = json_normalize(
standard_json['allOpportunityApplication']['data'],
record_path=['standards'],
meta=['id','person','opportunity']
)
with that i get a dataframe with the columns: constant_name, standard_option, id, person, opportunity. The problem is that the data standard_option, person and opportunity are json, with a single option inside.
The current ouput and expected output for each column is as follow
Standard_option
Currently an item in the column "standard_option" looks like:
{'option': 'true'}
I want it to be just true
Person
Currently an item in the column "person" looks like:
{'programme': {'short_name': 'XX'}}
I want it to look like: XX
Opportunity
Currently an item in the column "opportunity" looks like:
{'home_lc': {'name': 'NAME'}}
I want it to look like: NAME

Might not be the best way, but I think it works.
standards_df['person'] = (standards_df.loc[:, 'person']
.apply(lambda x: x['home_lc']['name']))
standards_df['opportunity'] = (standards_df.loc[:, 'opportunity']
.apply(lambda x: x['programme']['short_name']))
constant_name standard_option.option id person opportunity
0 constant1 true 111111111 NAME XX
1 constant2 true 111111111 NAME XX
standard_option was already fine when I run your code

Related

How do I convert this nested ndjson data into a pandas df with the desired output?

My desired output looks like this.
type.code
type.display
type.system
type.value
npi
National Provider Identifier
http://hl7.org/fhir/us/carin-bb/CodeSystem/C4BBIdentifierType
MI767478
npi
Referring
http://hl7.org/fhir/us/carin-bb/CodeSystem/C4BBClaimCareTeamRole
5150
My goal is to create specific columns by creating a column that uses the path. One example of a column name would be "provider.identifier.type.coding.code". The name would kinda look similar to a file path.
data =
[
{
"provider":{
"identifier":{
"type":{
"coding":[
{
"code":"npi",
"display":"National Provider Identifier",
"system":"http://hl7.org/fhir/us/carin-bb/CodeSystem/C4BBIdentifierType"
}
]
},
"value":"1033407770"
}
},
"role":{
"coding":[
{
"code":"referring",
"display":"Referring",
"system":"http://hl7.org/fhir/us/carin-bb/CodeSystem/C4BBClaimCareTeamRole"
}
]
},
"sequence":1
}
]

How can I define a structure of a json to transform it to csv

I have a json structured as this:
{
"data": [
{
"groups": {
"data": [
{
"group_name": "Wedding planning - places and bands (and others) to recommend!",
"date_joined": "2009-03-12 01:01:08.677427"
},
{
"group_name": "Harry Potter and the Deathly Hollows",
"date_joined": "2009-01-15 01:38:06.822220"
},
{
"group_name": "Xbox , Playstation, Wii - console fans",
"date_joined": "2010-04-02 04:02:58.078934"
}
]
},
"id": "0"
},
{
"groups": {
"data": [
{
"group_name": "Lost&Found (Strzegom)",
"date_joined": "2010-02-01 14:13:34.551920"
},
{
"group_name": "Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)",
"date_joined": "2008-09-24 17:29:43.356992"
}
]
},
"id": "1"
}
]
}
How does one parse jsons in this form? Should i try building a class resembling this format? My desired output is a csv where index is an "id" and in the first column I have the most recently taken group, in the second column the second most recently taken group and so on.
Meaning the result of this would be:
most recent second most recent
0 Xbox , Playstation, Wii - console fans Wedding planning - places and bands (and others) to recommend!
1 Lost&Found (Strzegom) Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)

solution could be like this:
data = json.load(f)
result = []
# it's max element in there for each id. Helping how many group_name here for this example [3,2]
max_element_group_name = [len(data['data'][i]['groups']['data']) for i in range(len(data['data']))]
max_element_group_name.sort()
for i in range(len(data['data'])):
# get id for each groups
id = data['data'][i]['id']
# sort data_joined in groups
sorted_groups_by_date = sorted(data['data'][i]['groups']['data'],key=lambda x : time.strptime(x['date_joined'],'%Y-%m-%d %H:%M:%S.%f'),reverse=True)
# get groups name using minumum value in max_element_group_name for this example [2]
group_names = [sorted_groups_by_date[j]['group_name'] for j in range(max_element_group_name[0])]
# add result list with id
result.append([id]+group_names)
# create df for list
df = pd.DataFrame(result, columns = ['id','most recent', 'second most recent'])
# it could be better.

Bulk Update for elasticsearch documents using Python

I have elasticsearch documents like below where I need to rectify age value based on creationtime currentdate
age = creationtime - currentdate
:
hits = [
{
"_id":"CrRvuvcC_uqfwo-WSwLi",
"creationtime":"2018-05-20T20:57:02",
"currentdate":"2021-02-05 00:00:00",
"age":"60 months"
},
{
"_id":"CrRvuvcC_uqfwo-WSwLi",
"creationtime":"2013-07-20T20:57:02",
"currentdate":"2021-02-05 00:00:00",
"age":"60 months"
},
{
"_id":"CrRvuvcC_uqfwo-WSwLi",
"creationtime":"2014-08-20T20:57:02",
"currentdate":"2021-02-05 00:00:00",
"age":"60 months"
},
{
"_id":"CrRvuvcC_uqfwo-WSwLi",
"creationtime":"2015-09-20T20:57:02",
"currentdate":"2021-02-05 00:00:00",
"age":"60 months"
}
]
I want to do bulk update based on each document ID, but the problem is I need to correct 6 months of data & per data size (doc count of Index) is almost 535329, I want to efficiently do bulk update on age based on _id for each day on all documents using python.
Is there a way to do this, without looping through, all examples I came across using Pandas dataframes for update is based on a known value. But here _id I will get as and when the code runs.
The logic I had written was to fetch all doc & store their _id & then for each _id update the age . But its not an efficient way if I want to update all documents in bulk for each day of 6 months.
Can anyone give me some ideas for this or point me in the right direction.

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!
A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:
POST your-index-name/_update_by_query
{
"query": {
"match_all": {}
},
"script": {
"source": """
def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, currentdate);
ctx._source._age = months + ' month' + (months > 1 ? 's' : '');
""",
"lang": "painless"
}
}
The official python client has this method too. Here's a working example.
🔑 Try running this update script on a small subset of your documents before letting in out on your whole index by adding a query other than the match_all I put there.
💡 It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.
You see, if your index mapping's dates are properly defined like so:
{
"mappings": {
"properties": {
"creationtime": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss"
},
"currentdate": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
...
}
}
}
the age can be calculated as a script field:
POST ttimes/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"age_calculated": {
"script": {
"source": """
def months = ChronoUnit.MONTHS.between(
doc['creationtime'].value,
doc['currentdate'].value );
return months + ' month' + (months > 1 ? 's' : '');
"""
}
}
}
}
The only caveat is, the value won't be inside of the _source but rather inside of its own group called fields (which implies that more script fields are possible at once!).
"hits" : [
{
...
"_id" : "FFfPuncBly0XYOUcdIs5",
"fields" : {
"age_calculated" : [ "32 months" ] <--
}
},
...

How do I return an upper field in a JSON with python?

So, I need some help returning an ID having found a certain string. My JSON looks something like this:
{
"id": "id1"
"field1": {
"subfield1": {
"subrield2": {
"subfield3": {
"subfield4": [
"string1",
"string2",
"string3"
]
}
}
}
}
"id": "id2"
"field1": {
"subfield1": {
"subrield2": {
"subfield3": {
"subfield4": [
"string4",
"string5",
"string6"
]
}
}
}
}
}
Now, I need to get the ID from a certain string, for example:
For "string5" I need to return "id2"
For "string2" I need to return "id1"
In order to find these strings I have used objectpath python module like this: json_Tree.execute('$..subfield4'))
After doing an analysis on a huge amount of strings, I need to return the ones that are meeting my criterias. I have the strings that I need (for example "string3"), but now I have to return the IDs.
Thank you!!
Note: I don't have a lot of experience with coding, I just started a few months ago to work on a project in Python and I have been stuck on this for a while

Making some assumptions about the actual structure of the data as being:
[
{
"id": "id1",
"subfield1": {
"subfield2": {
"subfield3": {
"subfield4": [
"string1",
"string2",
"string3"
]
}
}
}
}
// And so on
]
And assuming that each string1, string2 etc. is in only one id, then you can construct this mapping like so:
data: List[dict] # The json parsed as a list of dicts
string_to_id_mapping = {}
for record in data:
for string in record["subfield1"]["subfield2"]["subfield3"]["subfield4"]:
string_to_id_mapping[string] = record["id"]
assert string_to_id_mapping["string3"] == "id1"
If each string can appear in multiple ids then the following will catch all of them:
from collections import defaultdict
data: List[dict] # The json parsed as a list of dicts
string_to_id_mapping = defaultdict(set)
for record in data:
for string in record["subfield1"]["subfield2"]["subfield3"]["subfield4"]:
string_to_id_mapping[string].add(record["id"])
assert string_to_id_mapping["string3"] == {"id1"}

How to read this JSON into dataframe with specfic dataframe format

This is my JSON string, I want to make it read into dataframe in the following tabular format.
I have no idea what should I do after pd.Dataframe(json.loads(data))
JSON data, edited
{
"data":[
{
"data":{
"actual":"(0.2)",
"upper_end_of_central_tendency":"-"
},
"title":"2009"
},
{
"data":{
"actual":"2.8",
"upper_end_of_central_tendency":"-"
},
"title":"2010"
},
{
"data":{
"actual":"-",
"upper_end_of_central_tendency":"2.3"
},
"title":"longer_run"
}
],
"schedule_id":"2014-03-19"
}

That's a somewhat overly nested JSON. But if that's what you have to work with, and assuming your parsed JSON is in jdata:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ] + ['schedule_id' ]
sched_id = jdata['schedule_id']
rows = [ [item['data'][rn] for item in datapts ] + [sched_id] for rn in rownames]
df = pd.DataFrame(rows, index=rownames, columns=colnames)
df is now:
If you wanted to simplify that a bit, you could construct the core data without the asymmetric schedule_id field, then add that after the fact:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ]
rows = [ [item['data'][rn] for item in datapts ] for rn in rownames]
d2 = pd.DataFrame(rows, index=rownames, columns=colnames)
d2['schedule_id'] = jdata['schedule_id']
That will make an identical DataFrame (i.e. df == d2). It helps when learning pandas to try a few different construction strategies, and get a feel for what is more straightforward. There are more powerful tools for unfolding nested structures into flatter tables, but they're not as easy to understand first time out of the gate.
(Update) If you wanted a better structuring on your JSON to make it easier to put into this format, ask pandas what it likes. E.g. df.to_json() output, slightly prettified:
{
"2009": {
"actual": "(0.2)",
"upper_end_of_central_tendency": "-"
},
"2010": {
"actual": "2.8",
"upper_end_of_central_tendency": "-"
},
"longer_run": {
"actual": "-",
"upper_end_of_central_tendency": "2.3"
},
"schedule_id": {
"actual": "2014-03-19",
"upper_end_of_central_tendency": "2014-03-19"
}
}
That is a format from which pandas' read_json function will immediately construct the DataFrame you desire.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

A efficient way to unpack nested json into a dataframe - python

Related

How do I convert this nested ndjson data into a pandas df with the desired output?

How can I define a structure of a json to transform it to csv

Bulk Update for elasticsearch documents using Python

How do I return an upper field in a JSON with python?

How to read this JSON into dataframe with specfic dataframe format

Categories

Resources