Fastest way to generate a nested JSON using pandas - python

This is a sample of a real-world problem that I cannot find a way to solve.
I need to create a nested JSON from a pandas DataFrame. Given this data, I need to create a JSON object like this:
[
    {
        "city": "Belo Horizonte",
        "by_rooms": [
            {
                "rooms": 1,
                "total price": [
                    {
                        "total (R$)": 499,
                        "details": [
                            {
                                "animal": "acept",
                                "area": 22,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 30,
                                "rent amount (R$)": 450,
                                "property tax (R$)": 13,
                                "fire insurance (R$)": 6
                            }
                        ]
                    }
                ]
            },
            {
                "rooms": 2,
                "total price": [
                    {
                        "total (R$)": 678,
                        "details": [
                            {
                                "animal": "not acept",
                                "area": 50,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 0,
                                "rent amount (R$)": 644,
                                "property tax (R$)": 25,
                                "fire insurance (R$)": 9
                            }
                        ]
                    }
                ]
            }
        ]
    },
    {
        "city": "Campinas",
        "by_rooms": [
            {
                "rooms": 1,
                "total price": [
                    {
                        "total (R$)": 711,
                        "details": [
                            {
                                "animal": "acept",
                                "area": 42,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 0,
                                "rent amount (R$)": 690,
                                "property tax (R$)": 12,
                                "fire insurance (R$)": 9
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
Each level can have one or more items.
Based on this answer, I have a snippet like this:
data = pd.read_csv("./houses_to_rent_v2.csv")
cols = data.columns
data = (
    data.groupby(['city', 'rooms', 'total (R$)'])[['animal', 'area', 'bathroom', 'parking spaces', 'furniture',
                                                   'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)']]
        .apply(lambda x: x.to_dict(orient='records'))
        .reset_index(name='details')
        .groupby(['city', 'rooms'])[['total (R$)', 'details']]
        .apply(lambda x: x.to_dict(orient='records'))
        .reset_index(name='total price')
        .groupby(['city'])[['rooms', 'total price']]
        .apply(lambda x: x.to_dict(orient='records'))
        .reset_index(name='by_rooms')
)
data.to_json('./jsondata.json', orient='records', force_ascii=False)
but all those groupbys don't look very Pythonic, and it's pretty slow.
Before using this method, I tried splitting this big dataframe into smaller ones to use individual groupbys for each level, but that was even slower.
I tried dask, with no improvement at all.
I read about numba and cython, but I have no idea how to implement them in this case. All the docs I can find use only numeric data, and I have string and date/datetime data too.
In my real-world problem, this data is processed to respond to HTTP requests. My dataframe has 30+ columns and ~35K rows per request, and it takes 45 seconds to process just this snippet.
So, is there a faster way to do this?

This can be done with list / dict comprehensions. I haven't timed it, but I'm not waiting on it.
import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib
# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on
js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]
print(len(js), len(js[0]["by_room"]))
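To write the result to a file as the original snippet does (a small sketch; json.dump's ensure_ascii=False plays the role of to_json's force_ascii=False):

import json

# serialize the nested structure built above to a UTF-8 JSON file
with open("jsondata.json", "w", encoding="utf-8") as f:
    json.dump(js, f, ensure_ascii=False)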

I needed to adapt @RobRaymond's answer, because I need the inner data grouped too. So I took his code, made some adjustments, and this is the final result:
import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib
# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on
js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]
for city in js:
    rooms_qty = list(set([r['rooms'] for r in city['by_room']]))
    newRooms = [{'rooms': x, 'total_price': []} for x in rooms_qty]
    for r in city['by_room']:
        newRooms[rooms_qty.index(r['rooms'])]['total_price'].extend(r['total_price'])
    for r in newRooms:
        prices = list(set([p['total (R$)'] for p in r['total_price']]))
        newPrices = [{'total (R$)': x, 'details': []} for x in prices]
        for price in r['total_price']:
            newPrices[prices.index(price['total (R$)'])]['details'].extend(price['details'])
        r['total_price'] = newPrices
    city['by_room'] = newRooms
And the execution time drops to 5 seconds.
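A note on the regrouping step: list.index is called inside the inner loops, which is quadratic in the number of distinct values. A minimal sketch of the same two-level regrouping with collections.defaultdict (assuming js has the shape built by the comprehension above) avoids those lookups:

from collections import defaultdict

for city in js:
    # group the one-element 'total_price' lists by room count
    rooms_groups = defaultdict(list)
    for r in city['by_room']:
        rooms_groups[r['rooms']].extend(r['total_price'])
    newRooms = []
    for rooms, prices in rooms_groups.items():
        # within each room count, group the 'details' lists by total price
        price_groups = defaultdict(list)
        for p in prices:
            price_groups[p['total (R$)']].extend(p['details'])
        newRooms.append({
            'rooms': rooms,
            'total_price': [{'total (R$)': t, 'details': d} for t, d in price_groups.items()],
        })
    city['by_room'] = newRooms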

Related

How to get JSON data in expected format using Python json.dump

I read multiple sheets from Excel files and combine them into a single JSON file.
Sample Data:
df1
Metric Value
0 salesamount 9.0
1 salespercentage 80.0
2 salesdays 56.0
3 salesconversionpercentage 0.3
df2
Metric Value
0 FromBudget 4K
1 ToBudget 5K
df3
Metric Value
0 Objective Customer Engagement
1 ExpectedOutcomesales 0.2
2 ExpectedOutcomeweeks 8 weeks
I then convert them into dictionaries using:
s = dict(zip(df1.iloc[:,0], df1.iloc[:,1]))
eb = dict(zip(df2.iloc[:,0], df2.iloc[:,1]))
eo = dict(zip(df3.iloc[:,0], df3.iloc[:,1]))
I then store the above items under a key ExpectedPlanPerformance using:
mydct = {
    'ExpectedPlanPerformance' :
    {
        'EstimatedBudget' : eb,
        'Sales' : s,
        'ExpectedOutcome' : eo
    }
}
mydct
{'ExpectedPlanPerformance': {'EstimatedBudget': {'FromBudget': '4K',
'ToBudget': '5K'},
'Sales': ({'salesamount': '9.0',
'salespercentage': '80.0',
'salesdays': '56.0',
'salesconversionpercentage': '0.3'},),
'ExpectedOutcome': {'Objective': 'Customer Engagement',
'ExpectedOutcomesales': 0.2,
'ExpectedOutcomeweeks': '8 weeks'}}}
I write this dictionary to JSON using:
outfile = open('file.json','w')
json.dump(mydct, outfile, indent = 4)
outfile.close()
The JSON file I append to already contains other elements. Those elements are actually dataframes that were converted to JSON format using:
json.loads(df.to_json(orient = 'records'))
Once such dataframes are converted to JSON format, they are stored in a dictionary as above and written to the file using the same json.dump.
But the output in the file is in the format below:
{
    "ExpectedPlanPerformance": {
        "EstimatedBudget": "{\"FromBudget\": \"4K\", \"ToBudget\": \"5K\"}",
        "Sales": "{\"salesamount\": \"9.0\", \"salespercentage\": \"80.0\", \"salesdays\": \"56.0\", \"salesconversionpercentage\": \"0.3\"}",
        "ExpectedOutcome": "{\"Objective\": \"Customer Engagement\", \"ExpectedOutcomesales\": \"20%\", \"ExpectedOutcomeweeks\": \"8 weeks\"}"
    }
}
Whereas some other elements are like below:
"TotalYield": [
"225K"
],
"TotalYieldText": [
"Lorem ipsum door sit amet"
],
Can someone please let me know how to fix this? The expected output is as below:
"ExpectedPlanPerformance": [{
"ExpectedOutcome": {
"Objective": "Customer Engagement",
"ExpectedOutcomesales": "20%",
"ExpectedOutcomeweeks": "8 weeks"
},
"Sales": {
"salesamount": "9 ",
"salespercentage": "80",
"salesdays": "56",
"salesconversionpercentage": "0.3"
},
"EstimatedBudget": {
"FromBudget": "4K",
"ToBudget": "5K"
}
}],
Try:
out = {
    "ExpectedPlanPerformance": [
        {
            "ExpectedOutcome": dict(zip(df3.Metric, df3.Value)),
            "Sales": dict(zip(df1.Metric, df1.Value)),
            "EstimatedBudget": dict(zip(df2.Metric, df2.Value)),
        }
    ]
}
print(out)
Prints:
{
    "ExpectedPlanPerformance": [
        {
            "ExpectedOutcome": {
                "Objective": "Customer Engagement",
                "ExpectedOutcomesales": "0.2",
                "ExpectedOutcomeweeks": "8 weeks",
            },
            "Sales": {
                "salesamount": 9.0,
                "salespercentage": 80.0,
                "salesdays": 56.0,
                "salesconversionpercentage": 0.3,
            },
            "EstimatedBudget": {"FromBudget": "4K", "ToBudget": "5K"},
        }
    ]
}
To save out to a file:
import json
with open("your_file.json", "w") as f_in:
json.dump(out, f_in, indent=4)
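For what it's worth, the escaped strings in the original output are what json.dump produces when a value is already a JSON string rather than a dict, e.g. the result of df.to_json() stored without a json.loads round-trip (an assumption here, since the full pipeline isn't shown). A minimal reproduction:

import json

inner = json.dumps({"FromBudget": "4K", "ToBudget": "5K"})  # a str, not a dict
print(json.dumps({"EstimatedBudget": inner}))
# {"EstimatedBudget": "{\"FromBudget\": \"4K\", \"ToBudget\": \"5K\"}"}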

JSON with Nested Array from Pandas DataFrame

Generally I'm working to query data from Snowflake, format it into JSON, and then push that JSON into an API.
I'm very close, but struggling with the format that the API requires. The fields key needs to hold a nested array of objects instead of a single nested object. The example of the JSON is for a single record. I've tried multiple things with the formatting options available in pandas' to_dict method, including to_dict('list'), but am whiffing. Any ideas are appreciated.
Current code and output:
j = (df.groupby(['text','date','channel','sentiment'], as_index=False)
       .apply(lambda x: x[[
           'product',
           'segment',
           '2b6276da-b135-4258-9971-cb08c070d859',
           '7b84b8fc-5494-4fcb-bac7-ca91dc8faa32',
           '5042388c-3144-4b5d-9aab-f0d03345646b',
           '27cf2f54-3686-48c9-bfe4-5a6c70e90854',
           '03de58c3-4ea0-4286-b5c4-1b8cef53646d',
           'edee1277-1668-4e89-b206-5de08e8b3dc5',
           'c9db7ba2-3c9f-40a8-852e-20ce5e8a5e8f',
           'cb8d1d94-8976-4b31-9844-e47857226c2d',
           '806335a9-e8ea-45b4-9904-54c52f1698e4',
           'b2dfd157-436f-43a2-8ca2-36b5fe1fae54',
           '511cfd95-8250-4796-97e1-9b02fb91e147',
           '69c06db4-cc43-4dbb-abcb-6d5f40bfef08',
           'ecdf55c5-bce9-4bc6-bca2-921d7c140dc2',
           '6b711ef9-b789-48b5-97f3-7183bc5d6fa7',
           'bfbc0bf1-49ca-4cb0-a76c-82999034e7cc',
           'ee64e90c-0116-4fba-992d-6f6df1b0cfef',
           '3c6edd01-bfa6-46c0-a9ea-5ffc01453f51'
       ]].to_dict('records'))
       .rename(columns={None:'fields'})
       .to_json(orient='records'))
data = json.dumps(json.loads(j), indent=2, sort_keys=True)
print(data)
[
  {
    "channel": "Zendesk",
    "date": 1630465892000,
    "sentiment": "predict",
    "text": "\n STACK UP TOPIC SUGGESTION\nnode\n",
    "fields": {
      "03de58c3-4ea0-4286-b5c4-1b8cef53646d": "005j000000FVPO3AAP",
      "27cf2f54-3686-48c9-bfe4-5a6c70e90854": 110010.0,
      "2b6276da-b135-4258-9971-cb08c070d859": null,
      "3c6edd01-bfa6-46c0-a9ea-5ffc01453f51": "Zendesk-959731",
      "5042388c-3144-4b5d-9aab-f0d03345646b": "Financial Services - Banking",
      "511cfd95-8250-4796-97e1-9b02fb91e147": "001j000000a7NKEAA2",
      "69c06db4-cc43-4dbb-abcb-6d5f40bfef08": "Berkadia",
      "6b711ef9-b789-48b5-97f3-7183bc5d6fa7": 1616070511039,
      "7b84b8fc-5494-4fcb-bac7-ca91dc8faa32": "North America",
      "806335a9-e8ea-45b4-9904-54c52f1698e4": 0,
      "b2dfd157-436f-43a2-8ca2-36b5fe1fae54": 1,
      "bfbc0bf1-49ca-4cb0-a76c-82999034e7cc": "Skills: Strategy-Driven",
      "c9db7ba2-3c9f-40a8-852e-20ce5e8a5e8f": "005j0000000jdqXAAQ",
      "cb8d1d94-8976-4b31-9844-e47857226c2d": 0,
      "ecdf55c5-bce9-4bc6-bca2-921d7c140dc2": "B2B",
      "edee1277-1668-4e89-b206-5de08e8b3dc5": 190.0,
      "ee64e90c-0116-4fba-992d-6f6df1b0cfef": "4602c31e-d3e0-464b-8c11-75391f4ecece",
      "product": null,
      "segment": "Commercial 2"
    }
  }
]
The needed format is as such:
[
  {
    "channel": "Zendesk",
    "date": 1630465892000,
    "sentiment": "predict",
    "text": "\n STACK UP TOPIC SUGGESTION\nnode\n",
    "fields": [
      {
        "03de58c3-4ea0-4286-b5c4-1b8cef53646d": "005j000000FVPO3AAP",
        "27cf2f54-3686-48c9-bfe4-5a6c70e90854": 110010.0,
        "2b6276da-b135-4258-9971-cb08c070d859": null,
        "3c6edd01-bfa6-46c0-a9ea-5ffc01453f51": "Zendesk-959731",
        "5042388c-3144-4b5d-9aab-f0d03345646b": "Financial Services - Banking",
        "511cfd95-8250-4796-97e1-9b02fb91e147": "001j000000a7NKEAA2",
        "69c06db4-cc43-4dbb-abcb-6d5f40bfef08": "Berkadia",
        "6b711ef9-b789-48b5-97f3-7183bc5d6fa7": 1616070511039,
        "7b84b8fc-5494-4fcb-bac7-ca91dc8faa32": "North America",
        "806335a9-e8ea-45b4-9904-54c52f1698e4": 0,
        "b2dfd157-436f-43a2-8ca2-36b5fe1fae54": 1,
        "bfbc0bf1-49ca-4cb0-a76c-82999034e7cc": "Skills: Strategy-Driven",
        "c9db7ba2-3c9f-40a8-852e-20ce5e8a5e8f": "005j0000000jdqXAAQ",
        "cb8d1d94-8976-4b31-9844-e47857226c2d": 0,
        "ecdf55c5-bce9-4bc6-bca2-921d7c140dc2": "B2B",
        "edee1277-1668-4e89-b206-5de08e8b3dc5": 190.0,
        "ee64e90c-0116-4fba-992d-6f6df1b0cfef": "4602c31e-d3e0-464b-8c11-75391f4ecece",
        "product": null,
        "segment": "Commercial 2"
      }
    ]
  }
]
Why not just wrap it in an array like this?
j = (df.groupby(['text','date','channel','sentiment'], as_index=False)
       .apply(lambda x: [x[[
           'product',
           'segment',
           '2b6276da-b135-4258-9971-cb08c070d859',
           '7b84b8fc-5494-4fcb-bac7-ca91dc8faa32',
           '5042388c-3144-4b5d-9aab-f0d03345646b',
           '27cf2f54-3686-48c9-bfe4-5a6c70e90854',
           '03de58c3-4ea0-4286-b5c4-1b8cef53646d',
           'edee1277-1668-4e89-b206-5de08e8b3dc5',
           'c9db7ba2-3c9f-40a8-852e-20ce5e8a5e8f',
           'cb8d1d94-8976-4b31-9844-e47857226c2d',
           '806335a9-e8ea-45b4-9904-54c52f1698e4',
           'b2dfd157-436f-43a2-8ca2-36b5fe1fae54',
           '511cfd95-8250-4796-97e1-9b02fb91e147',
           '69c06db4-cc43-4dbb-abcb-6d5f40bfef08',
           'ecdf55c5-bce9-4bc6-bca2-921d7c140dc2',
           '6b711ef9-b789-48b5-97f3-7183bc5d6fa7',
           'bfbc0bf1-49ca-4cb0-a76c-82999034e7cc',
           'ee64e90c-0116-4fba-992d-6f6df1b0cfef',
           '3c6edd01-bfa6-46c0-a9ea-5ffc01453f51'
       ]].to_dict('records')])
       .rename(columns={None:'fields'})
       .to_json(orient='records'))
data = json.dumps(json.loads(j), indent=2, sort_keys=True)
print(data)

Flattening nested JSON to pandas.DataFrame: Ordering and Naming Columns based on dictionary values

My question arose when I exploited this helpful answer provided by Trenton McKinney on the issue of flattening multiple nested JSON files for handling in pandas.
Following his advice, I have used the flatten_json function described here to flatten a batch of nested JSON files. However, I have run into a problem with the uniformity of my JSON files.
A single JSON file looks roughly like this made-up example data:
{
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
Utilizing the referenced code, I will end up with data looking like this:
| product | product_id | product_type | producer | currency | client_id | supplement_0_supplementtype | supplement_0_price | supplement_0_rebate | supplement_1_supplementtype | supplement_1_price | supplement_1_rebate | etc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | RTZ | 300000 | 500 | CVB | 500000 | 250 | etc |
| example_productname2 | example_productid2 | example_type2 | example_producer2 | example_currency2 | example_clientid2 | CVB | 500000 | 250 | RTZ | 300000 | 500 | etc |
There are multiple issues with this.
Firstly, in my data there is a limited list of "supplements"; however, they do not always appear, and if they do, they are not always in the same order. In the example table, you can see that the two "supplements" switched positions in the second row. I would prefer a fixed order of the "supplement" columns.
Secondly, the best option would be a table like this:
| product | product_id | product_type | producer | currency | client_id | supplement_RTZ_price | supplement_RTZ_rebate | supplement_CVB_price | supplement_CVB_rebate | etc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | 300000 | 500 | 500000 | 250 | etc |
I have tried editing the referenced flatten_json function, but I don't have an inkling of how to make this work.
The solution consists of simply editing the dictionary (thanks to Andrej Kesely). I just added a pass on exceptions in case some keys are missing:
d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
for s in d["supplement"]:
try:
d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
except:
pass
try:
d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
except:
pass
del d["supplement"]
df = pd.DataFrame([d])
print(df)
product product_id product_type producer currency client_id supplementtype_RTZ_price supplementtype_RTZ_rebate supplementtype_CVB_price supplementtype_CVB_rebate supplementtype_JKL_price supplementtype_JKL_rebate
0 example_productname example_productid example_producttype example_producer example_currency example_clientid 300000 500 500000 250 100000 750
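If some supplement entries are missing a price or rebate key, a variant with dict.get avoids the try/except blocks entirely (a sketch under the same assumptions, provided every entry has at least a supplementtype; absent values become None):

for s in d["supplement"]:
    t = s["supplementtype"]
    d["supplementtype_{}_price".format(t)] = s.get("price")    # None when absent
    d["supplementtype_{}_rebate".format(t)] = s.get("rebate")  # None when absent
del d["supplement"]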
The used/referenced code:
def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()

    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
import json
import pandas as pd

# list of files
files = ['test1.json', 'test2.json']
# list to add dataframe from each file
df_list = list()
# iterate through files
for file in files:
    with open(file, 'r') as f:
        # read with json
        data = json.loads(f.read())
        # flatten_json into a dataframe and add to the dataframe list
        df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)
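For reference, on a record shaped like the made-up example at the top, flatten_json produces the index-based keys seen in the first table (a quick sketch; the indices follow list order, which is exactly why the columns can swap between files):

flat = flatten_json({
    "product": "example_productname",
    "supplement": [
        {"supplementtype": "RTZ", "price": 300000, "rebate": "500"},
        {"supplementtype": "CVB", "price": 500000, "rebate": "250"},
    ],
})
# flat == {'product': 'example_productname',
#          'supplement_0_supplementtype': 'RTZ', 'supplement_0_price': 300000, 'supplement_0_rebate': '500',
#          'supplement_1_supplementtype': 'CVB', 'supplement_1_price': 500000, 'supplement_1_rebate': '250'}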
You can modify the dictionary before you create the DataFrame from it:
d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
for s in d["supplement"]:
d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
del d["supplement"]
df = pd.DataFrame([d])
print(df)
Prints:
product product_id product_type producer currency client_id supplementtype_RTZ_price supplementtype_RTZ_rebate supplementtype_CVB_price supplementtype_CVB_rebate supplementtype_JKL_price supplementtype_JKL_rebate
0 example_productname example_productid example_producttype example_producer example_currency example_clientid 300000 500 500000 250 100000 750

how to extract specific data from json and put in to csv using python

I have a JSON which is in nested form. I would like to extract specific data from the JSON and put it into a CSV using pandas in Python.
data = {
    "class": "hudson.model.Hudson",
    "jobs": [
        {
            "_class": "hudson.model.FreeStyleProject",
            "name": "git_checkout",
            "url": "http://localhost:8080/job/git_checkout/",
            "builds": [
                {
                    "_class": "hudson.model.FreeStyleBuild",
                    "duration": 1201,
                    "number": 6,
                    "result": "FAILURE",
                    "url": "http://localhost:8080/job/git_checkout/6/"
                }
            ]
        },
        {
            "_class": "hudson.model.FreeStyleProject",
            "name": "output",
            "url": "http://localhost:8080/job/output/",
            "builds": []
        },
        {
            "_class": "org.jenkinsci.plugins.workflow.job.WorkflowJob",
            "name": "pipeline_test",
            "url": "http://localhost:8080/job/pipeline_test/",
            "builds": [
                {
                    "_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
                    "duration": 9274,
                    "number": 85,
                    "result": "SUCCESS",
                    "url": "http://localhost:8080/job/pipeline_test/85/"
                },
                {
                    "_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
                    "duration": 4251,
                    "number": 84,
                    "result": "SUCCESS",
                    "url": "http://localhost:8080/job/pipeline_test/84/"
                }
            ]
        }
    ]
}
From the above JSON, I want to fetch the jobs' name values and the builds' result values. I am new to Python; any help will be appreciated.
Till now I have tried:
main_data = data['jobs']
json_normalize(main_data, ['builds'], record_prefix='jobs_', errors='ignore')
which gives information only about the builds' key values and not the name of the job.
Can anyone help?
Expected Output:
Considering that only the first build's result value needs to be in the CSV column, you can achieve this using pandas.
data = {
    "class": "hudson.model.Hudson",
    "jobs": [
        {
            "_class": "hudson.model.FreeStyleProject",
            "name": "git_checkout",
            "url": "http://localhost:8080/job/git_checkout/",
            "builds": [
                {
                    "_class": "hudson.model.FreeStyleBuild",
                    "duration": 1201,
                    "number": 6,
                    "result": "FAILURE",
                    "url": "http://localhost:8080/job/git_checkout/6/"
                }
            ]
        },
        {
            "_class": "hudson.model.FreeStyleProject",
            "name": "output",
            "url": "http://localhost:8080/job/output/",
            "builds": []
        },
        {
            "_class": "org.jenkinsci.plugins.workflow.job.WorkflowJob",
            "name": "pipeline_test",
            "url": "http://localhost:8080/job/pipeline_test/",
            "builds": [
                {
                    "_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
                    "duration": 9274,
                    "number": 85,
                    "result": "SUCCESS",
                    "url": "http://localhost:8080/job/pipeline_test/85/"
                },
                {
                    "_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
                    "duration": 4251,
                    "number": 84,
                    "result": "SUCCESS",
                    "url": "http://localhost:8080/job/pipeline_test/84/"
                }
            ]
        }
    ]
}
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result') if len(name_dict['builds']) > 0 else 'NA'
    res['result'].append(resultval)
print(res)

import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/file_timer/jobs.csv", index=False)
Check the CSV file output:
name,result
git_checkout,FAILURE
output,NA
pipeline_test,SUCCESS
If you want to skip the 'NA' results, then:
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    if len(name_dict['builds']) == 0:
        continue
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result')
    res['result'].append(resultval)
print(res)

import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/akash.pagar/shell_learning/file_timer/jobs.csv", index=False)
Output will be like:
name,result
git_checkout,FAILURE
pipeline_test,SUCCESS
Simply with the build number:
for job in data.get('jobs'):
    for build in job.get('builds'):
        print(job.get('name'), build.get('number'), build.get('result'))
gives the result
git_checkout 6 FAILURE
pipeline_test 85 SUCCESS
pipeline_test 84 SUCCESS
If you want to get the result of the latest build, and are pretty sure the build numbers are always in descending order:
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), job.get('builds')[0].get('result'))
and if you are not sure about the order:
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), sorted(job.get('builds'), key=lambda k: k.get('number'))[-1].get('result'))
then the result will be:
git_checkout FAILURE
pipeline_test SUCCESS
Assuming the last build is the last element of its list and you don't care about jobs with no builds, this does it:
import pandas as pd
#data = ... #same format as in the question
z = [(job["name"], job["builds"][-1]["result"]) for job in data["jobs"] if len(job["builds"])]
df = pd.DataFrame(data=z, columns=["name", "result"])
#df.to_csv #TODO
Also we don't necessarily need pandas to create the csv file.
You could do:
import csv
#z = ... #see previous code block
with open("f.csv", 'w') as fp:
csv.writer(fp).writerows([("name", "result")] + z)

extract urls from json file without data name using python

I have a JSON file that contains the metadata of 900 articles, and I want to extract the URLs from it. My file starts like this:
[
    {
        "title": "The histologic phenotypes of …",
        "authors": [
            {
                "name": "JE Armes"
            },
        ],
        "publisher": "Wiley Online Library",
        "article_url": "https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0142(19981201)83:11%3C2335::AID-CNCR13%3E3.0.CO;2-N",
        "cites": 261,
        "use": true
    },
    {
        "title": "Comparative epidemiology of pemphigus in ...",
        "authors": [
            {
                "name": "S Bastuji-Garin"
            },
            {
                "name": "R Souissi"
            }
        ],
        "year": 1995,
        "publisher": "search.ebscohost.com",
        "article_url": "http://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=0022202X&AN=12612836&h=B9CC58JNdE8SYy4M4RyVS%2FrPdlkoZF%2FM5hifWcv%2FwFvGxUCbEaBxwQghRKlK2vLtwY2WrNNl%2B3z%2BiQawA%2BocoA%3D%3D&crl=c",
        "use": true
    },
.........
I want to inspect the file with objectpath to create a JSON tree for the extraction of the URLs. This is the code I want to execute:
1. import json
2. import objectpath
3. with open("Data_sample.json") as datafile: data = json.load(datafile)
4. jsonnn_tree = objectpath.Tree(data['name of data'])
5. result_tuple = tuple(jsonnn_tree.execute('$..article_url'))
But in step 4, for the creation of the tree, I have to insert the name of the data, which I think I don't have in my file. How can I replace this line?
You can get all the article URLs using a list comprehension.
import json

with open("Data_sample.json") as fh:
    articles = json.load(fh)

article_urls = [article['article_url'] for article in articles]
You can instantiate the tree like this:
import objectpath as op  # assumed alias for the objectpath package

tobj = op.Tree(your_data)
results = tobj.execute("$.article_url")
And in the end:
results = [x for x in results]
will yield:
["url1", "url2", ...]
Did you try removing the reference and just using:
jsonnn_tree = objectpath.Tree(data)
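Putting the two answers together, the original five steps could become (a sketch; with $..article_url searching recursively, no dataset name is needed):

import json
import objectpath

with open("Data_sample.json") as datafile:
    data = json.load(datafile)

# pass the parsed list directly instead of data['name of data']
jsonnn_tree = objectpath.Tree(data)
result_tuple = tuple(jsonnn_tree.execute('$..article_url'))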
