Extracting str from pandas dataframe using json - python

I read a CSV file into a dataframe named df.
Each row contains a string like the one below.
'{"id":2140043003,"name":"Olallo Rubio",...}'
I would like to extract "name" and "id" from each row and store them in a new dataframe.
The code I use to extract them raises the error below. Please let me know if you have any suggestions on how to solve this problem. Thanks.
JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31)
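Since the extraction code itself is not shown, the following is only a minimal sketch of one way to decode a column of JSON strings while keeping track of rows that fail to parse. The column name "raw", the helper parse_row and the two sample rows are assumptions made for illustration; the second row is deliberately malformed so that it triggers the same kind of JSONDecodeError.

```python
import json
import pandas as pd

# Hypothetical stand-in for the CSV data: one JSON string per row, in a
# column called "raw" (the real column name is not given in the question).
df = pd.DataFrame({"raw": [
    '{"id":2140043003,"name":"Olallo Rubio","is_registered":true}',
    '{"id":1,"name":"broken" "x":2}',  # deliberately malformed row
]})

def parse_row(text):
    """Return the decoded dict, or None when the string is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

parsed = df["raw"].apply(parse_row)

# Rows that could not be decoded, kept aside for inspection.
bad_rows = df[parsed.isna()]

# Build the new dataframe from the rows that did decode.
records = [{"id": p["id"], "name": p["name"]} for p in parsed if p is not None]
result = pd.DataFrame(records, columns=["id", "name"])
print(result)
```

Looking at bad_rows is usually the quickest way to see what is wrong with the strings that do not decode.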

text={
"id": 2140043003,
"name": "Olallo Rubio",
"is_registered": True,
"chosen_currency": 'Null',
"avatar": {
"thumb": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls": {
"web": {
"user": "https://www.kickstarter.com/profile/2140043003"
},
"api": {
"user": "https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}
def extract(text, *args):
    list1 = []
    for i in args:
        list1.append(text[i])
    return list1
print(extract(text, 'name', 'id'))
# ['Olallo Rubio', 2140043003]

Here's what I came up with using pandas.json_normalize():
import pandas as pd
sample = [{
"id": 2140043003,
"name":"Olallo Rubio",
"is_registered": True,
"chosen_currency": None,
"avatar":{
"thumb":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls":{
"web":{
"user":"https://www.kickstarter.com/profile/2140043003"
},
"api":{
"user":"https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}]
# Create dataframe
df = pd.json_normalize(sample)
# Select columns into new dataframe.
df1 = df.loc[:, ["name", "id",]]
Check df1:
Input:
print(df1)
Output:
name id
0 Olallo Rubio 2140043003

Related

How to flatten dict in a DataFrame & concatenate all resultant rows

I am using GitHub's GraphQL API to fetch some issue details.
I used Python Requests to fetch the data locally.
This is what output.json looks like:
{
"data": {
"viewer": {
"login": "some_user"
},
"repository": {
"issues": {
"edges": [
{
"node": {
"id": "I_kwDOHQ63-s5auKbD",
"title": "test issue 1",
"number": 146,
"createdAt": "2023-01-06T06:39:54Z",
"closedAt": null,
"state": "OPEN",
"updatedAt": "2023-01-06T06:42:00Z",
"comments": {
"edges": [
{
"node": {
"id": "IC_kwDOHQ63-s5R2XCV",
"body": "comment 01"
}
},
{
"node": {
"id": "IC_kwDOHQ63-s5R2XC9",
"body": "comment 02"
}
}
]
},
"labels": {
"edges": []
}
},
"cursor": "Y3Vyc29yOnYyOpHOWrimww=="
},
{
"node": {
"id": "I_kwDOHQ63-s5auKm8",
"title": "test issue 2",
"number": 147,
"createdAt": "2023-01-06T06:40:34Z",
"closedAt": null,
"state": "OPEN",
"updatedAt": "2023-01-06T06:40:34Z",
"comments": {
"edges": []
},
"labels": {
"edges": [
{
"node": {
"name": "food"
}
},
{
"node": {
"name": "healthy"
}
}
]
}
},
"cursor": "Y3Vyc29yOnYyOpHOWripvA=="
}
]
}
}
}
}
The json was put inside a list using
result = response.json()["data"]["repository"]["issues"]["edges"]
And then this list was put inside a DataFrame
import pandas as pd
df = pd.DataFrame(result, columns=['node', 'cursor'])
df
These are the contents of the data frame:
| id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
|---|---|---|---|---|---|---|---|---|
| I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | {'edges': [{'node': {'id': 'IC_kwDOHQ63-s5R2XCV', 'body': 'comment 01'}}, {'node': {'id': 'IC_kwDOHQ63-s5R2XC9', 'body': 'comment 02'}}]} | {'edges': []} |
| I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | {'edges': []} | {'edges': [{'node': {'name': 'food'}}, {'node': {'name': 'healthy'}}]} |
I would like to split/explode the comments and labels columns.
The values in these columns are nested dictionaries.
I would like there to be as many rows for a single issue as there are comments and labels.
I would like to flatten out the data frame.
So this involves split/explode and concat.
There are several Stack Overflow answers that delve into this topic, and I have tried the code from several of them.
I cannot paste the links to those questions, because Stack Overflow marks my question as spam due to the number of links.
But these are the steps I have tried:
df3 = df2['comments'].apply(pd.Series)
Drill down further:
df4 = df3['edges'].apply(pd.Series)
df4
Drill down further:
df5 = df4['node'].apply(pd.Series)
df5
The last statement above gives me KeyError: 'node'.
I understand this is because node is not a column in the DataFrame.
But how else can I split this dictionary and concatenate all columns back to my issues row?
This is how I would like the output to look:
| id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
|---|---|---|---|---|---|---|---|---|
| I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | comment 01 | Null |
| I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | comment 02 | Null |
| I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | Null | food |
| I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | Null | healthy |
If dct is your dictionary from the question you can try:
df = pd.DataFrame(d['node'] for d in dct['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df.to_markdown(index=False))
Prints:
| id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
|---|---|---|---|---|---|---|---|---|
| I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | | OPEN | 2023-01-06T06:42:00Z | comment 01 | nan |
| I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | | OPEN | 2023-01-06T06:42:00Z | comment 02 | nan |
| I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | | OPEN | 2023-01-06T06:40:34Z | nan | food |
| I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | | OPEN | 2023-01-06T06:40:34Z | nan | healthy |
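One caveat about chaining two explode calls: an issue that has both comments and labels would end up with one row for every comment/label pair, because the second explode runs on rows already duplicated by the first. If one row per comment plus one row per label is wanted instead (as in the expected output in the question), a plain-Python flattening step is one option. The sketch below is an alternative under that assumption, not the answerer's method; flatten_issues and the tiny edges sample are hypothetical.

```python
import pandas as pd

def flatten_issues(edges):
    """Return one row per comment and one row per label for each issue."""
    rows = []
    for edge in edges:
        node = edge["node"]
        # Copy every scalar field of the issue; the nested ones are handled below.
        base = {k: v for k, v in node.items() if k not in ("comments", "labels")}
        comments = [c["node"]["body"] for c in node["comments"]["edges"]]
        labels = [l["node"]["name"] for l in node["labels"]["edges"]]
        for body in comments:
            rows.append({**base, "comments": body, "labels": None})
        for name in labels:
            rows.append({**base, "comments": None, "labels": name})
        # Keep issues that have neither comments nor labels as a single row.
        if not comments and not labels:
            rows.append({**base, "comments": None, "labels": None})
    return pd.DataFrame(rows)

# Hypothetical, trimmed-down sample in the same shape as the API response.
edges = [
    {"node": {"id": "A", "title": "issue with both",
              "comments": {"edges": [{"node": {"body": "c1"}}, {"node": {"body": "c2"}}]},
              "labels": {"edges": [{"node": {"name": "food"}}]}}},
    {"node": {"id": "B", "title": "empty issue",
              "comments": {"edges": []},
              "labels": {"edges": []}}},
]
flat = flatten_issues(edges)
print(flat)
```

With this sample, issue A yields three rows (two comments and one label) rather than the two comment/label pairs a chained explode would produce.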
@andrej-kesely has answered my question.
I have selected his response as the answer for this question.
I am now posting a consolidated script that includes my poor code and andrej's great code.
In this script I want to fetch details from GitHub's GraphQL API server
and put them inside pandas.
The primary source for this script is this gist.
A major chunk of the remaining code is the answer by @andrej-kesely.
Now on to the consolidated script.
First, import the necessary packages and set the headers:
import requests
import json
import pandas as pd
headers = {"Authorization": "token <your_github_personal_access_token>"}
Now define the query that will fetch data from GitHub.
In my particular case, I am fetching issue details from a particular repo;
it can be something else for you.
query = """
{
  viewer {
    login
  }
  repository(name: "your_github_repo", owner: "your_github_user_name") {
    issues(states: OPEN, last: 2) {
      edges {
        node {
          id
          title
          number
          createdAt
          closedAt
          state
          updatedAt
          comments(first: 10) {
            edges {
              node {
                id
                body
              }
            }
          }
          labels(orderBy: {field: NAME, direction: ASC}, first: 10) {
            edges {
              node {
                name
              }
            }
          }
        }
        cursor
      }
    }
  }
}
"""
Execute the query and save the response
def run_query(query):
    request = requests.post('https://api.github.com/graphql', json={'query': query}, headers=headers)
    if request.status_code == 200:
        return request.json()
    else:
        raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))
result = run_query(query)
And now comes the trickiest part.
In my query response, there are several nested dictionaries.
I would like to split them (more details in my question above).
This magic code from @andrej-kesely does that for you.
df = pd.DataFrame(d['node'] for d in result['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df)
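GraphQL servers commonly report query-level problems inside an HTTP 200 response, under a top-level "errors" key, so a status-code check alone may not catch a bad query. As a hedged sketch, a small guard can be applied to the decoded response before it is indexed. The function name extract_issue_edges and both sample payloads are made up for illustration.

```python
def extract_issue_edges(result):
    """Return the list of issue edges, raising if the response reports errors."""
    if "errors" in result:
        messages = "; ".join(e.get("message", "unknown error") for e in result["errors"])
        raise RuntimeError("GraphQL query failed: " + messages)
    return result["data"]["repository"]["issues"]["edges"]

# Hypothetical decoded responses: one successful, one carrying an error.
ok = {"data": {"repository": {"issues": {"edges": [
    {"node": {"id": "X"}, "cursor": "abc"},
]}}}}
failed = {"errors": [{"message": "Field 'foo' doesn't exist on type 'Issue'"}]}

print(extract_issue_edges(ok))
try:
    extract_issue_edges(failed)
except RuntimeError as exc:
    print(exc)
```

The returned list can then be passed to the DataFrame construction in place of indexing result directly.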

How to remove pipe delimiters from my JSON column and split it into different columns with their respective values

"description": ID|100|\nName|Sam|\nCity|New York City|\nState|New York|\nContact|1234567890|\nEmail|1234#yahoo.com|
This is what the data in my JSON file looks like. I want to convert this JSON file to an Excel sheet and split the nested column into separate columns. I have used pandas for it but couldn't achieve it. The output I want in my Excel sheet is:
| ID | Name | City | State | Contact | Email |
|---|---|---|---|---|---|
| 100 | Sam | New York City | New York | 1234567890 | 1234#yahoo.com |
I want to remove those pipes, and the solution should use pandas. Please help me out with this.
The code I am trying:
I want output as:
The output on excel sheet:
[2]: https://i.stack.imgur.com/QjSUU.png
The list of dict column looks like:
"assignees": [{
"id": 1234,
"username": "xyz",
"name": "XYZ",
"state": "active",
"avatar_url": "aaaaaaaaaaaaaaa",
"web_url": "bbbbbbbbbbb"
},
{
"id": 5678,
"username": "abcd",
"name": "ABCD",
"state": "active",
"avatar_url": "hhhhhhhhhhh",
"web_url": "mmmmmmmmm"
}
],
This could be one way:
import pandas as pd
df = pd.read_json('Sample.json')
rows = []
for i in df.index:
    desc = df['description'][i]
    attributes = desc.split("\n")
    d = {}
    for attrib in attributes:
        # As in the original answer, lines starting with 'Name' or '-----' are skipped.
        if not (attrib.startswith('Name') or attrib.startswith('-----')):
            kv = attrib.split("|")
            d[kv[0]] = kv[1]
    # DataFrame.append was removed in pandas 2.0, so collect the rows first.
    rows.append(d)
df2 = pd.DataFrame(rows)
print(df2)
df2.to_csv("output.csv")
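For reference, here is a small self-contained sketch of the splitting step on the sample description string from the question. It keeps every key, including Name, since the desired output lists it. The helper name parse_description is hypothetical, and the sketch assumes every line has the form Key|Value|.

```python
import pandas as pd

def parse_description(desc):
    """Turn 'Key|Value|' lines into a dict, ignoring lines without a key/value pair."""
    record = {}
    for line in desc.split("\n"):
        parts = line.split("|")
        if len(parts) >= 2 and parts[0]:
            record[parts[0]] = parts[1]
    return record

# The description value from the question, with "\n" as real newlines.
description = (
    "ID|100|\nName|Sam|\nCity|New York City|\nState|New York|\n"
    "Contact|1234567890|\nEmail|1234#yahoo.com|"
)

# One dict per record becomes one row; the keys become the column names.
table = pd.DataFrame([parse_description(description)])
print(table)
```

Writing the frame to a spreadsheet with DataFrame.to_excel should work the same way as to_csv, provided an Excel engine such as openpyxl is installed.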
Output xls:

Creating pandas dataframe from accessing specific keys of nested dictionary

How can the dictionary below be converted to the expected dataframe shown after it?
{
"getArticleAttributesResponse": {
"attributes": [{
"articleId": {
"id": "2345",
"locale": "en_US"
},
"keyValuePairs": [{
"key": "tags",
"value": "[{\"displayName\": \"Nice\", \"englishName\": \"Pradeep\", \"refKey\": \"Key2\"}, {\"displayName\": \"Family Sharing\", \"englishName\": \"Sarvendra\", \"refKey\": \"Key1\", \"meta\": {\"customerDisplayable\": [false]}}}]"
}]
}]
}
}
Expected dataframe:
| id | displayName | englishName | refKey |
|---|---|---|---|
| 2345 | Nice | Pradeep | Key2 |
| 2345 | Family Sharing | Sarvendra | Key1 |
df1 = pd.DataFrame(d['getDDResponse']['attributes']).explode('keyValuePairs')
df2 = pd.concat([df1[col].apply(pd.Series) for col in df1], axis=1).assign(value=lambda x: x.value.apply(eval)).explode('value')
df = pd.concat([df2[col].apply(pd.Series) for col in df2], axis=1)
OUTPUT:
0 0 display englishName reference source
0 1234 tags Unarchived Unarchived friend monster
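Because the "value" field holds a JSON string (including the literal false, which Python's eval does not understand), json.loads may be a safer way to decode it. The following is a minimal sketch built on the dictionary from the question. It assumes the string is well-formed JSON, so the sample below uses balanced braces, and the variable names are illustrative only.

```python
import json
import pandas as pd

d = {
    "getArticleAttributesResponse": {
        "attributes": [{
            "articleId": {"id": "2345", "locale": "en_US"},
            "keyValuePairs": [{
                "key": "tags",
                # Assumed well-formed version of the string in the question.
                "value": '[{"displayName": "Nice", "englishName": "Pradeep", "refKey": "Key2"}, '
                         '{"displayName": "Family Sharing", "englishName": "Sarvendra", '
                         '"refKey": "Key1", "meta": {"customerDisplayable": [false]}}]',
            }],
        }]
    }
}

rows = []
for attr in d["getArticleAttributesResponse"]["attributes"]:
    article_id = attr["articleId"]["id"]
    for pair in attr["keyValuePairs"]:
        # Decode the embedded JSON string into a list of tag dicts.
        for tag in json.loads(pair["value"]):
            rows.append({
                "id": article_id,
                "displayName": tag["displayName"],
                "englishName": tag["englishName"],
                "refKey": tag["refKey"],
            })

tags_df = pd.DataFrame(rows)
print(tags_df)
```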

How to extract specific data from JSON and put it into CSV using Python

I have a JSON which is in nested form. I would like to extract specific data from json and put into csv using pandas python.
data = {
"class":"hudson.model.Hudson",
"jobs":[
{
"_class":"hudson.model.FreeStyleProject",
"name":"git_checkout",
"url":"http://localhost:8080/job/git_checkout/",
"builds":[
{
"_class":"hudson.model.FreeStyleBuild",
"duration":1201,
"number":6,
"result":"FAILURE",
"url":"http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class":"hudson.model.FreeStyleProject",
"name":"output",
"url":"http://localhost:8080/job/output/",
"builds":[
]
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name":"pipeline_test",
"url":"http://localhost:8080/job/pipeline_test/",
"builds":[
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":9274,
"number":85,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/85/"
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":4251,
"number":84,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
From the above JSON I want to fetch the job name values and the build result values. I am new to Python; any help will be appreciated.
So far I have tried:
main_data = data['jobs']
json_normalize(main_data, ['builds'],
               record_prefix='jobs_', errors='ignore')
which gives only the build key values and not the name of the job.
Can anyone help?
Expected Output:
If only the first build's result needs to go into the CSV column, you can achieve this using pandas.
data = {
"class": "hudson.model.Hudson",
"jobs": [
{
"_class": "hudson.model.FreeStyleProject",
"name": "git_checkout",
"url": "http://localhost:8080/job/git_checkout/",
"builds": [
{
"_class": "hudson.model.FreeStyleBuild",
"duration": 1201,
"number": 6,
"result": "FAILURE",
"url": "http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class": "hudson.model.FreeStyleProject",
"name": "output",
"url": "http://localhost:8080/job/output/",
"builds": []
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name": "pipeline_test",
"url": "http://localhost:8080/job/pipeline_test/",
"builds": [
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 9274,
"number": 85,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/85/"
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 4251,
"number": 84,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result') if len(name_dict['builds']) > 0 else 'NA'
    res['result'].append(resultval)
print(res)
import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/file_timer/jobs.csv", index=False)
Check the csv file output
name,result
git_checkout,FAILURE
output,NA
pipeline_test,SUCCESS
If you want to skip the jobs with an 'NA' result, then:
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    if len(name_dict['builds']) == 0:
        continue
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result')
    res['result'].append(resultval)
print(res)
import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/akash.pagar/shell_learning/file_timer/jobs.csv", index=False)
The output will look like:
name,result
git_checkout,FAILURE
pipeline_test,SUCCESS
Simply with the build number,
for job in data.get('jobs'):
    for build in job.get('builds'):
        print(job.get('name'), build.get('number'), build.get('result'))
gives the result
git_checkout 6 FAILURE
pipeline_test 85 SUCCESS
pipeline_test 84 SUCCESS
If you want the result of the latest build, and you are sure the builds are always listed in descending order of build number,
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), job.get('builds')[0].get('result'))
and if you are not sure of the order,
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), sorted(job.get('builds'), key=lambda k: k.get('number'))[-1].get('result'))
then the result will be:
git_checkout FAILURE
pipeline_test SUCCESS
Assuming last build is the last element of its list and you don't care about jobs with no builds, this does:
import pandas as pd
#data = ... #same format as in the question
z = [(job["name"], job["builds"][-1]["result"]) for job in data["jobs"] if len(job["builds"])]
df = pd.DataFrame(data=z, columns=["name", "result"])
#df.to_csv #TODO
Also we don't necessarily need pandas to create the csv file.
You could do:
import csv
#z = ... #see previous code block
with open("f.csv", 'w', newline='') as fp:
    csv.writer(fp).writerows([("name", "result")] + z)
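For the json_normalize attempt in the question, passing meta=["name"] is one way to carry the job name alongside each build record. The sketch below uses a trimmed copy of the question's data (only the fields needed here); the build_ prefix is an arbitrary choice for illustration. Jobs without builds simply contribute no rows.

```python
import pandas as pd

# Trimmed copy of the structure in the question.
data = {"jobs": [
    {"name": "git_checkout", "builds": [{"number": 6, "result": "FAILURE"}]},
    {"name": "output", "builds": []},
    {"name": "pipeline_test", "builds": [
        {"number": 85, "result": "SUCCESS"},
        {"number": 84, "result": "SUCCESS"},
    ]},
]}

# One row per build; the job-level "name" is repeated for each of its builds.
builds = pd.json_normalize(
    data["jobs"],
    record_path="builds",
    meta=["name"],
    record_prefix="build_",
)
summary = builds[["name", "build_number", "build_result"]]
print(summary)
```

From here, summary.to_csv("jobs.csv", index=False) would write the three columns to a file.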

An efficient way to unpack nested JSON into a dataframe

I have a nested JSON, and I want to transform it into a pandas dataframe. I was able to normalize it with json_normalize.
However, there are still JSON layers within the dataframe, which I also want to unpack. What is the best way to do it? I will likely have to deal with this a few more times within the project I am currently working on.
The JSON I have is the following:
{
"data": {
"allOpportunityApplication": {
"data": [
{
"id": "111111111",
"opportunity": {
"programme": {
"short_name": "XX"
}
},
"person": {
"home_lc": {
"name": "NAME"
}
},
"standards": [
{
"constant_name": "constant1",
"standard_option": {
"option": "true"
}
},
{
"constant_name": "constant2",
"standard_option": {
"option": "true"
}
}
]
}
]
}
}
}
I used json_normalize:
standards_df = json_normalize(
    standard_json['allOpportunityApplication']['data'],
    record_path=['standards'],
    meta=['id', 'person', 'opportunity']
)
With that I get a dataframe with the columns constant_name, standard_option, id, person, opportunity. The problem is that the values in standard_option, person and opportunity are still JSON objects, each with a single option inside.
The current output and expected output for each column are as follows:
Standard_option
Currently an item in the column "standard_option" looks like:
{'option': 'true'}
I want it to be just true
Person
Currently an item in the column "person" looks like:
{'home_lc': {'name': 'NAME'}}
I want it to look like: NAME
Opportunity
Currently an item in the column "opportunity" looks like:
{'programme': {'short_name': 'XX'}}
I want it to look like: XX
Might not be the best way, but I think it works.
standards_df['person'] = (standards_df.loc[:, 'person']
                          .apply(lambda x: x['home_lc']['name']))
standards_df['opportunity'] = (standards_df.loc[:, 'opportunity']
                               .apply(lambda x: x['programme']['short_name']))
constant_name standard_option.option id person opportunity
0 constant1 true 111111111 NAME XX
1 constant2 true 111111111 NAME XX
standard_option was already fine when I ran your code.
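Another option that may avoid the follow-up apply calls: the meta argument of json_normalize also accepts nested paths written as lists, so the inner values can be pulled out directly. This is a sketch on the JSON from the question; the variable name response and the final renaming step are assumptions added for illustration.

```python
import pandas as pd

response = {"data": {"allOpportunityApplication": {"data": [{
    "id": "111111111",
    "opportunity": {"programme": {"short_name": "XX"}},
    "person": {"home_lc": {"name": "NAME"}},
    "standards": [
        {"constant_name": "constant1", "standard_option": {"option": "true"}},
        {"constant_name": "constant2", "standard_option": {"option": "true"}},
    ],
}]}}}

standards_df = pd.json_normalize(
    response["data"]["allOpportunityApplication"]["data"],
    record_path=["standards"],
    # Each nested meta path is a list of keys leading to the wanted value.
    meta=[
        "id",
        ["person", "home_lc", "name"],
        ["opportunity", "programme", "short_name"],
    ],
)

# Optional: shorten the dotted column names produced for the nested paths.
standards_df = standards_df.rename(columns={
    "person.home_lc.name": "person",
    "opportunity.programme.short_name": "opportunity",
})
print(standards_df)
```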
