How can I convert nested dictionary to pd.dataframe faster?

How can I convert nested dictionary to pd.dataframe faster? - python

I have a json file which looks like this
{
"file": "name",
"main": [{
"question_no": "Q.1",
"question": "what is ?",
"answer": [{
"user": "John",
"comment": "It is defined as",
"value": [
{
"my_value": 5,
"value_2": 10
},
{
"my_value": 24,
"value_2": 30
}
]
},
{
"user": "Sam",
"comment": "as John said above it simply means",
"value": [
{
"my_value": 9,
"value_2": 10
},
{
"my_value": 54,
"value_2": 19
}
]
}
],
"closed": "no"
}]
}
desired result:
Question_no question my_value_sum value_2_sum user comment
Q.1 what is ? 29 40 john It is defined as
Q.1 what is ? 63 29 Sam as John said above it simply means
What I have tried is data = json_normalize(file_json, "main") and then using a for loop like
for ans, row in data.iterrows():
....
....
df = df.append(the data)
But the issue using this is that it is taking a lot of time that my client would refuse the solution. there is around 1200 items in the main list and there are 450 json files like this to convert. So this intermediate process of conversion would take almost an hour to complete.
EDIT:
is it possible to get the sum of the my_value and value_2 as a column? (updated the desired result also)

Select dictionary by main with parameter record_path and meta:
data = pd.json_normalize(file_json["main"],
record_path='answer',
meta=['question_no', 'question'])
print (data)
user comment question_no question
0 John It is defined as Q.1 what is ?
1 Sam as John said above it simply means Q.1 what is ?
Then if order is important convert last N columns to first positions:
N = 2
data = data[data.columns[-N:].tolist() + data.columns[:-N].tolist()]
print (data)
question_no question user comment
0 Q.1 what is ? John It is defined as
1 Q.1 what is ? Sam as John said above it simply means

Related

How to flatten dict in a DataFrame & concatenate all resultant rows

I am using Github's GraphQL API to fetch some issue details.
I used Python Requests to fetch the data locally.
This is how the output.json looks like
{
"data": {
"viewer": {
"login": "some_user"
},
"repository": {
"issues": {
"edges": [
{
"node": {
"id": "I_kwDOHQ63-s5auKbD",
"title": "test issue 1",
"number": 146,
"createdAt": "2023-01-06T06:39:54Z",
"closedAt": null,
"state": "OPEN",
"updatedAt": "2023-01-06T06:42:00Z",
"comments": {
"edges": [
{
"node": {
"id": "IC_kwDOHQ63-s5R2XCV",
"body": "comment 01"
}
},
{
"node": {
"id": "IC_kwDOHQ63-s5R2XC9",
"body": "comment 02"
}
}
]
},
"labels": {
"edges": []
}
},
"cursor": "Y3Vyc29yOnYyOpHOWrimww=="
},
{
"node": {
"id": "I_kwDOHQ63-s5auKm8",
"title": "test issue 2",
"number": 147,
"createdAt": "2023-01-06T06:40:34Z",
"closedAt": null,
"state": "OPEN",
"updatedAt": "2023-01-06T06:40:34Z",
"comments": {
"edges": []
},
"labels": {
"edges": [
{
"node": {
"name": "food"
}
},
{
"node": {
"name": "healthy"
}
}
]
}
},
"cursor": "Y3Vyc29yOnYyOpHOWripvA=="
}
]
}
}
}
}
The json was put inside a list using
result = response.json()["data"]["repository"]["issues"]["edges"]
And then this list was put inside a DataFrame
import pandas as pd
df = pd.DataFrame (result, columns = ['node', 'cursor'])
df
These are the contents of the data frame
id
title
number
createdAt
closedAt
state
updatedAt
comments
labels
I_kwDOHQ63-s5auKbD
test issue 1
146
2023-01-06T06:39:54Z
None
OPEN
2023-01-06T06:42:00Z
{'edges': [{'node': {'id': 'IC_kwDOHQ63-s5R2XCV","body": "comment 01"}},{'node': {'id': 'IC_kwDOHQ63-s5R2XC9","body": "comment 02"}}]}
{'edges': []}
I_kwDOHQ63-s5auKm8
test issue 2
147
2023-01-06T06:40:34Z
None
OPEN
2023-01-06T06:40:34Z
{'edges': []}
{'edges': [{'node': {'name': 'food"}},{'node': {'name': 'healthy"}}]}
I would like to split/explode the comments and labels columns.
The values in these columns are nested dictionaries
I would like there to be as many rows for a single issue, as there are comments & labels.
I would like to flatten out the data frame.
So this involves split/explode and concat.
There are several stackoverflow answers that delve on this topic. And I have tried the code from several of them.
I can not paste the links to those questions, because stackoverflow marks my question as spam due to many links.
But these are the steps I have tried
df3 = df2['comments'].apply(pd.Series)
Drill down further
df4 = df3['edges'].apply(pd.Series)
df4
Drill down further
df5 = df4['node'].apply(pd.Series)
df5
The last statement above gives me the KeyError: 'node'
I understand, this is because node is not a key in the DataFrame.
But how else can i split this dictionary and concatenate all columns back to my issues row?
This is how I would like the output to look like
id
title
number
createdAt
closedAt
state
updatedAt
comments
labels
I_kwDOHQ63-s5auKbD
test issue 1
146
2023-01-06T06:39:54Z
None
OPEN
2023-01-06T06:42:00Z
comment 01
Null
I_kwDOHQ63-s5auKbD
test issue 1
146
2023-01-06T06:39:54Z
None
OPEN
2023-01-06T06:42:00Z
comment 02
Null
I_kwDOHQ63-s5auKm8
test issue 2
147
2023-01-06T06:40:34Z
None
OPEN
2023-01-06T06:40:34Z
Null
food
I_kwDOHQ63-s5auKm8
test issue 2
147
2023-01-06T06:40:34Z
None
OPEN
2023-01-06T06:40:34Z
Null
healthy

If dct is your dictionary from the question you can try:
df = pd.DataFrame(d['node'] for d in dct['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df.to_markdown(index=False))
Prints:
id
title
number
createdAt
closedAt
state
updatedAt
comments
labels
I_kwDOHQ63-s5auKbD
test issue 1
146
2023-01-06T06:39:54Z
OPEN
2023-01-06T06:42:00Z
comment 01
nan
I_kwDOHQ63-s5auKbD
test issue 1
146
2023-01-06T06:39:54Z
OPEN
2023-01-06T06:42:00Z
comment 02
nan
I_kwDOHQ63-s5auKm8
test issue 2
147
2023-01-06T06:40:34Z
OPEN
2023-01-06T06:40:34Z
nan
food
I_kwDOHQ63-s5auKm8
test issue 2
147
2023-01-06T06:40:34Z
OPEN
2023-01-06T06:40:34Z
nan
healthy

#andrej-kesely has answered my question.
I have selected his response as the answer for this question.
I am now posting a consolidated script that includes my poor code and andrej's great code.
In this script i want to fetch details from Github's GraphQL API Server.
And put it inside pandas.
Primary source for this script is this gist.
And a major chunk of remaining code is an answer by #andrej-kesely.
Now onto the consolidated script.
First import the necessary packages and set headers
import requests
import json
import pandas as pd
headers = {"Authorization": "token <your_github_personal_access_token>"}
Now define the query that will fetch data from github.
In my particular case, I am fetching issue details form a particular repo
it can be something else for you.
query = """
{
viewer {
login
}
repository(name: "your_github_repo", owner: "your_github_user_name") {
issues(states: OPEN, last: 2) {
edges {
node {
id
title
number
createdAt
closedAt
state
updatedAt
comments(first: 10) {
edges {
node {
id
body
}
}
}
labels(orderBy: {field: NAME, direction: ASC}, first: 10) {
edges {
node {
name
}
}
}
comments(first: 10) {
edges {
node {
id
body
}
}
}
}
cursor
}
}
}
}
"""
Execute the query and save the response
def run_query(query):
request = requests.post('https://api.github.com/graphql', json={'query': query}, headers=headers)
if request.status_code == 200:
return request.json()
else:
raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))
result = run_query(query)
And now is the trickiest part.
In my query response, there are several nested dictionaries.
I would like to split them - more details in my question above.
This magic code from #andrej-kesely does that for you.
df = pd.DataFrame(d['node'] for d in result['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df)

Python Pandas json_normalize with multiple lists of dicts

I'm trying to flatten a JSON file that was originally converted from XML using xmltodict(). There are multiple fields that may have a list of dictionaries. I've tried using record_path with meta data to no avail, but I have not been able to get it to work when there are multiple fields that may have other nested fields. It's expected that some fields will be empty for any given record
I have tried searching for another topic and couldn't find my specific problem with multiple nested fields. Can anyone point me in the right direction?
Thanks for any help that can be provided!
Sample base Python (without the record path)
import pandas as pd
import json
with open('./example.json', encoding="UTF-8") as json_file:
json_dict = json.load(json_file)
df = pd.json_normalize(json_dict['WIDGET'])
print(df)
df.to_csv('./test.csv', index=False)
Sample JSON
{
"WIDGET": [
{
"ID": "6",
"PROBLEM": "Electrical",
"SEVERITY_LEVEL": "1",
"TITLE": "Battery's Missing",
"CATEGORY": "User Error",
"LAST_SERVICE": "2020-01-04T17:39:37Z",
"NOTICE_DATE": "2022-01-01T08:00:00Z",
"FIXABLE": "1",
"COMPONENTS": {
"WHATNOTS": {
"WHATNOT1": "Battery Compartment",
"WHATNOT2": "Whirlygig"
}
},
"DIAGNOSIS": "Customer needs to put batteries in the battery compartment",
"STATUS": "0",
"CONTACT_TYPE": {
"CALL": "1"
}
},
{
"ID": "1004",
"PROBLEM": "Electrical",
"SEVERITY_LEVEL": "4",
"TITLE": "Flames emit from unit",
"CATEGORY": "Dangerous",
"LAST_SERVICE": "2015-06-04T21:40:12Z",
"NOTICE_DATE": "2022-01-01T08:00:00Z",
"FIXABLE": "0",
"DIAGNOSIS": "A demon seems to have possessed the unit and his expelling flames from it",
"CONSEQUENCE": "Could burn things",
"SOLUTION": "Call an exorcist",
"KNOWN_PROBLEMS": {
"PROBLEM": [
{
"TYPE": "RECALL",
"NAME": "Bad Servo",
"DESCRIPTION": "Bad servo's shipped in initial product"
},
{
"TYPE": "FAILURE",
"NAME": "Operating outside normal conditions",
"DESCRIPTION": "Device failed when customer threw into wood chipper"
}
]
},
"STATUS": "1",
"REPAIR_BULLETINS": {
"BULLETIN": [
{
"#id": "4",
"#text": "Known target of the occult"
},
{
"#id": "5",
"#text": "Not meant to be thrown into wood chippers"
}
]
},
"CONTACT_TYPE": {
"CALL": "1"
}
}
]
}
Sample CSV
ID
PROBLEM
SEVERITY_LEVEL
TITLE
CATEGORY
LAST_SERVICE
NOTICE_DATE
FIXABLE
DIAGNOSIS
STATUS
COMPONENTS.WHATNOTS.WHATNOT1
COMPONENTS.WHATNOTS.WHATNOT2
CONTACT_TYPE.CALL
CONSEQUENCE
SOLUTION
KNOWN_PROBLEMS.PROBLEM
REPAIR_BULLETINS.BULLETIN
6
Electrical
1
Battery's Missing
User Error
2020-01-04T17:39:37Z
2022-01-01T08:00:00Z
1
Customer needs to put batteries in the battery compartment
0
Battery Compartment
Whirlygig
1
1004
Electrical
4
Flames emit from unit
Dangerous
2015-06-04T21:40:12Z
2022-01-01T08:00:00Z
0
A demon seems to have possessed the unit and his expelling flames from it
1
1
Could burn things
Call an exorcist
[{'TYPE': 'RECALL', 'NAME': 'Bad Servo', 'DESCRIPTION': "Bad servo's shipped in initial product"}, {'TYPE': 'FAILURE', 'NAME': 'Operating outside normal conditions', 'DESCRIPTION': 'Device failed when customer threw into wood chipper'}]
[{'#id': '4', '#text': 'Known target of the occult'}, {'#id': '5', '#text': 'Not meant to be thrown into wood chippers'}]

I have attempted to extract the data and turned it into nested dictionary (instead of nested with list), so that pd.json_normalize() can work
for row in range(len(json_dict['WIDGET'])):
try:
lis = json_dict['WIDGET'][row]['KNOWN_PROBLEMS']['PROBLEM']
del json_dict['WIDGET'][row]['KNOWN_PROBLEMS']['PROBLEM']
for i, item in enumerate(lis):
json_dict['WIDGET'][row]['KNOWN_PROBLEMS'][str(i)] = item
lis = json_dict['WIDGET'][row]['REPAIR_BULLETINS']['BULLETIN']
del json_dict['WIDGET'][row]['REPAIR_BULLETINS']['BULLETIN']
for i, item in enumerate(lis):
json_dict['WIDGET'][row]['REPAIR_BULLETINS'][str(i)] = item
except KeyError:
continue
df = pd.json_normalize(json_dict['WIDGET']).T
print(df)
If you have to manually add the varying keys from the larger dataset, here's a way to extract them automatically by identifying them as type list (and provided they are nested by 2 levels only)
linkage = []
for item in json_dict['WIDGET']:
for k1 in item.keys(): #get keys from first level
if isinstance(item[k1], str):
continue
#print(item[k1])
for k2 in item[k1].keys(): #get keys from second level
if isinstance(item[k1][k2], str):
continue
#print(item[k1][k2])
if isinstance(item[k1][k2], list):
linkage.append((k1, k2))
print(linkage)
# [('KNOWN_PROBLEMS', 'PROBLEM'), ('REPAIR_BULLETINS', 'BULLETIN')]
for row in range(len(json_dict['WIDGET'])):
for link in linkage:
try:
lis = json_dict['WIDGET'][row][link[0]][link[1]]
del json_dict['WIDGET'][row][link[0]][link[1]] #delete original dict value (which is a list)
for i, item in enumerate(lis):
json_dict['WIDGET'][row][link[0]][str(i)] = item #replace list with dict value (which is a dict)
except KeyError:
continue
df = pd.json_normalize(json_dict['WIDGET']).T
print(df)
Output:
0 1
ID 6 1004
PROBLEM Electrical Electrical
SEVERITY_LEVEL 1 4
TITLE Battery's Missing Flames emit from unit
CATEGORY User Error Dangerous
LAST_SERVICE 2020-01-04T17:39:37Z 2015-06-04T21:40:12Z
NOTICE_DATE 2022-01-01T08:00:00Z 2022-01-01T08:00:00Z
FIXABLE 1 0
DIAGNOSIS Customer needs to put batt... A demon seems to have poss...
STATUS 0 1
COMPONENTS.WHATNOTS.WHATNOT1 Battery Compartment NaN
COMPONENTS.WHATNOTS.WHATNOT2 Whirlygig NaN
CONTACT_TYPE.CALL 1 1
CONSEQUENCE NaN Could burn things
SOLUTION NaN Call an exorcist
KNOWN_PROBLEMS.0.TYPE NaN RECALL
KNOWN_PROBLEMS.0.NAME NaN Bad Servo
KNOWN_PROBLEMS.0.DESCRIPTION NaN Bad servo's shipped in ini...
KNOWN_PROBLEMS.1.TYPE NaN FAILURE
KNOWN_PROBLEMS.1.NAME NaN Operating outside normal c...
KNOWN_PROBLEMS.1.DESCRIPTION NaN Device failed when custome...
REPAIR_BULLETINS.0.#id NaN 4
REPAIR_BULLETINS.0.#text NaN Known target of the occult
REPAIR_BULLETINS.1.#id NaN 5
REPAIR_BULLETINS.1.#text NaN Not meant to be thrown int...

How can I define a structure of a json to transform it to csv

I have a json structured as this:
{
"data": [
{
"groups": {
"data": [
{
"group_name": "Wedding planning - places and bands (and others) to recommend!",
"date_joined": "2009-03-12 01:01:08.677427"
},
{
"group_name": "Harry Potter and the Deathly Hollows",
"date_joined": "2009-01-15 01:38:06.822220"
},
{
"group_name": "Xbox , Playstation, Wii - console fans",
"date_joined": "2010-04-02 04:02:58.078934"
}
]
},
"id": "0"
},
{
"groups": {
"data": [
{
"group_name": "Lost&Found (Strzegom)",
"date_joined": "2010-02-01 14:13:34.551920"
},
{
"group_name": "Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)",
"date_joined": "2008-09-24 17:29:43.356992"
}
]
},
"id": "1"
}
]
}
How does one parse jsons in this form? Should i try building a class resembling this format? My desired output is a csv where index is an "id" and in the first column I have the most recently taken group, in the second column the second most recently taken group and so on.
Meaning the result of this would be:
most recent second most recent
0 Xbox , Playstation, Wii - console fans Wedding planning - places and bands (and others) to recommend!
1 Lost&Found (Strzegom) Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)

solution could be like this:
data = json.load(f)
result = []
# it's max element in there for each id. Helping how many group_name here for this example [3,2]
max_element_group_name = [len(data['data'][i]['groups']['data']) for i in range(len(data['data']))]
max_element_group_name.sort()
for i in range(len(data['data'])):
# get id for each groups
id = data['data'][i]['id']
# sort data_joined in groups
sorted_groups_by_date = sorted(data['data'][i]['groups']['data'],key=lambda x : time.strptime(x['date_joined'],'%Y-%m-%d %H:%M:%S.%f'),reverse=True)
# get groups name using minumum value in max_element_group_name for this example [2]
group_names = [sorted_groups_by_date[j]['group_name'] for j in range(max_element_group_name[0])]
# add result list with id
result.append([id]+group_names)
# create df for list
df = pd.DataFrame(result, columns = ['id','most recent', 'second most recent'])
# it could be better.

How to find rows in a dataframe based on other rows and other dataframes

From the question I asked here I took a JSON response looking similar to this:
(please note: id's in my sample data below are numeric strings but some are alphanumeric)
data=↓**
{
"state": "active",
"team_size": 20,
"teams": {
"id": "12345679",
"name": "Good Guys",
"level": 10,
"attacks": 4,
"destruction_percentage": 22.6,
"members": [
{
"id": "1",
"name": "John",
"level": 12
},
{
"id": "2",
"name": "Tom",
"level": 11,
"attacks": [
{
"attackerTag": "2",
"defenderTag": "4",
"damage": 64,
"order": 7
}
]
}
]
},
"opponent": {
"id": "987654321",
"name": "Bad Guys",
"level": 17,
"attacks": 5,
"damage": 20.95,
"members": [
{
"id": "3",
"name": "Betty",
"level": 17,
"attacks": [
{
"attacker_id": "3",
"defender_id": "1",
"damage": 70,
"order": 1
},
{
"attacker_id": "3",
"defender_id": "7",
"damage": 100,
"order": 11
}
],
"opponentAttacks": 0,
"some_useless_data": "Want to ignore, this doesn't show in every record"
},
{
"id": "4",
"name": "Fred",
"level": 9,
"attacks": [
{
"attacker_id": "4",
"defender_id": "9",
"damage": 70,
"order": 4
}
],
"opponentAttacks": 0
}
]
}
}
I loaded this using:
df = json_normalize([data['team'], data['opponent']],
'members',
['id', 'name'],
meta_prefix='team.',
errors='ignore')
print(df.iloc(1))
attacks [{'damage': 70, 'order': 4, 'defender_id': '9'...
id 4
level 9
name Fred
opponentAttacks 0
some_useless_data NaN
team.name Bad Guys
team.id 987654321
Name: 3, dtype: object
I have a 3 part question in essense.
How do I get a row like the one above using the member tag? I've tried:
member = df[df['id']=="1"].iloc[0]
#Now this works, but am I correctly doing this?
#It just feels weird is all.
How would I retrieve a member's defenses based only given that only attacks are recorded and not defenses (even though defender_id is given)? I have tried:
df.where(df['tag']==df['attacks'].str.get('defender_id'), df['attacks'], axis=0)
#This is totally not working.. Where am I going wrong?
Since I am retrieving new data from an API, I need to check vs the old data in my database to see if there are any new attacks. I can then loop through the new attacks where I then display to the user the attack info.
This I honestly cannot figure out, I've tried looking into this question and this one as well that I felt were anywhere close to what I needed and am still having trouble wrapping my brain around the concept. Essentially my logic is as follows:
def get_new_attacks(old_data, new_data)
'''params
old_data: Dataframe loaded from JSON in database
new_data: Dataframe loaded from JSON API response
hopefully having new attacks
returns:
iterator over the new attacks
'''
#calculate a dataframe with new attacks listed
return df.iterrows()
I know the function above shows little to no effort other than the docs I gave (basically to show my desired input/output) but trust me I've been wracking my brain over this part the most. I've been looking into merging all attacks then doing reset_index() and that just raises an error due to the attacks being a list. The map() function in the second question I linked above has me stumped.

Referring to your questions in order (code below):
I looks like id is a unique index of the data and so you can use df.set_index('id') which allows you to access data by player id via df.loc['1'] for example.
As far as I understand your data, all the dictionaries listed in each of the attacks are self-contained in a sense that the corresponding player id is not needed (as attacker_id or defender_id seems to be enough to identify the data). So instead of dealing with a rows that contains lists I recommend swapping that data out in its own data frame which makes it easily accessible.
Once you store attacks in its own data frame you can simply compare indices in order to filter out the old data.
Here's some example code to illustrate the various points:
# Question 1.
df.set_index('id', inplace=True)
print(df.loc['1']) # For example player id 1.
# Question 2 & 3.
attacks = pd.concat(map(
lambda x: pd.DataFrame.from_dict(x).set_index('order'), # Is 'order' the right index?
df['attacks'].dropna()
))
# Question 2.
print(attacks[attacks['defender_id'] == '1']) # For example defender_id 1.
# Question 3.
old_attacks = attacks.iloc[:2] # For example.
new_attacks = attacks[~attacks.index.isin(old_attacks.index)]
print(new_attacks)

How to take the first two letters of a variable in python?

I have a dataset like below:
data
id message
1 ffjffjf
2 ddjbgg
3 vvnvvv
4 eevvff
5 ggvfgg
Expected output:
data
id message splitmessage
1 ffjffjf ff
2 ddjbgg dd
3 vvnvvv vv
4 eevvff ee
5 ggvfgg gg
I am very new to Python. So how can I take 1st two letters from each row in splitmessage variable.
my data exactly looks like below image
so from the image i want only hour and min's which are 12 to 16 elements in each row of vfreceiveddate variable.

dataset = [
{ "id": 1, "message": "ffjffjf" },
{ "id": 2, "message": "ddjbgg" },
{ "id": 3, "message": "vvnvvv" },
{ "id": 4, "message": "eevvff" },
{ "id": 5, "message": "ggvfgg" }
]
for d in dataset:
d["splitmessage"] = d["message"][:2]

What you want is a substring.
mystr = str(id1)
splitmessage = mystr[:2]
Method we are using is called slicing in python, works for more than strings.
In next example 'message' and 'splitmessage' are lists.
message = ['ffjffjf', 'ddjbgg']
splitmessage = []
for myString in message:
splitmessage.append(myString[:2])
print splitmessage

If you want to do it for the entire table, I would do it like this:
dataset = [id1, id2, id3 ... ]
for id in dataset:
splitid = id[:2]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I convert nested dictionary to pd.dataframe faster? - python

Related

How to flatten dict in a DataFrame & concatenate all resultant rows

Python Pandas json_normalize with multiple lists of dicts

How can I define a structure of a json to transform it to csv

How to find rows in a dataframe based on other rows and other dataframes

How to take the first two letters of a variable in python?

Categories

Resources