Nested JSON file into Pandas Dataframe - python

I'm having trouble getting this nested JSON object into a pandas dataframe using python:
{
"count":275,
"calls":[
{
"connectedTo":"18885068980",
"serviceName":"",
"callGuid":"01541af0-d87c-4911-a868-f5ac573d1e31",
"origin":"+19178558701",
"stateChangedAt":"2016-04-15T18:21:23Z",
"sequence":9,
"appletName":"ACD Sales General"
}
]
}
I've tried using json_normalize and am going in circles. Any help would be very much appreciated!

I know that it includes json_normalize, but I think this is what you are trying to do.
import json
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
from pprint import pprint
j = json.dumps(  # to create the json
{'count': 275,
"calls":
[{'connectedTo': "18885068980",
"serviceName":"",
"callGuid":"01541af0-d87c-4911-a868-f5ac573d1e31",
"stateChangedAt":"2016-04-15T18:21:23Z",
"sequence":9,
"appletName":"ACD Sales General"}]})
data = json.loads(j)
pprint(json_normalize(data['calls']))
which returns
appletName callGuid connectedTo \
0 ACD Sales General 01541af0-d87c-4911-a868-f5ac573d1e31 18885068980
sequence serviceName stateChangedAt
0 9 2016-04-15T18:21:23Z
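If you also want the top-level count carried onto each row, json_normalize takes record_path and meta arguments. A minimal sketch, assuming pandas >= 1.0 (where pd.json_normalize is available at the top level) and using the payload from the question; the names payload and calls_df are only illustrative:

```python
import pandas as pd

payload = {
    "count": 275,
    "calls": [
        {
            "connectedTo": "18885068980",
            "serviceName": "",
            "callGuid": "01541af0-d87c-4911-a868-f5ac573d1e31",
            "origin": "+19178558701",
            "stateChangedAt": "2016-04-15T18:21:23Z",
            "sequence": 9,
            "appletName": "ACD Sales General",
        }
    ],
}

# record_path picks the nested list to expand into rows;
# meta copies the listed top-level keys onto every row.
calls_df = pd.json_normalize(payload, record_path="calls", meta=["count"])
print(calls_df[["callGuid", "sequence", "count"]])
```

Each element of "calls" becomes one row, so this scales to the full 275-call response without changes.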

Related

Multiple excel data storing in Dictionary using Python

I am stuck on a problem and can't make progress, so I need some help to move further.
I have an input Excel file in this format:
Name   usn         Sub     marks
dhdn   1bm15mca13  c       90
                   java    95
                   python  98
subbu  1bm15mca13  java    92
                   perl    91
paddu  1bm15mca13  c#      80
                   java    81
I am trying to build a list of dictionaries in this format:
d = [
{
"name":"dhdn",
"usn":"1bm15mca13",
"sub":["c","java","python"],
"marks":[90,95,98]
},
{
"name":"subbu",
"usn":"1bm15mca14",
"sub":["java","perl"],
"marks":[92,91]
},
{
"name":"paddu",
"usn":"1bm15mca17",
"sub":["c#","java"],
"marks":[80,81]
}
]
I tried the code below, but it only works for two columns:
import pandas as pd
existing_excel_file = 'test.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet1')
df_service = df_service.ffill()  # same as fillna(method='ffill'), which newer pandas deprecates
result = [{'name':k,'sub':g["sub"].tolist()} for k,g in df_service.groupby("name")]
print (result)
Please suggest an idea or approach to solve my problem.
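Group by both name and usn, and collect the sub and marks columns into lists:
import pandas as pd
from pprint import pprint
existing_excel_file = 'test.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet1')
# Forward-fill the blank name/usn cells so every row carries its group keys
df_service = df_service.ffill()
# sort=False keeps the groups in spreadsheet order
result = [{'name':k[0],'usn':k[1],'sub':v["sub"].tolist(),"marks":v["marks"].tolist()} for k,v in df_service.groupby(['name', 'usn'], sort=False)]
pprint(result)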
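The same idea can be tried without the Excel file. This is only a sketch: the DataFrame below is a hand-built stand-in for what read_excel would return (blank cells as missing values, lowercase column names assumed), and agg(list) plus to_dict("records") replaces the list comprehension:

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; None marks the blank cells.
df = pd.DataFrame({
    "name": ["dhdn", None, None, "subbu", None, "paddu", None],
    "usn": ["1bm15mca13", None, None, "1bm15mca13", None, "1bm15mca13", None],
    "sub": ["c", "java", "python", "java", "perl", "c#", "java"],
    "marks": [90, 95, 98, 92, 91, 80, 81],
})

# Fill the blank key cells downwards so each row knows its group.
df[["name", "usn"]] = df[["name", "usn"]].ffill()

# Collect sub and marks into one list per (name, usn) group.
records = (
    df.groupby(["name", "usn"], sort=False)[["sub", "marks"]]
    .agg(list)
    .reset_index()
    .to_dict("records")
)
print(records)
```

Adding more columns to the output only means adding them to the column selection before agg(list).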

Most efficient way of converting RESTful output to dataframe

I have output from a REST call that I've converted to JSON.
It's a highly nested collection of dicts and lists, but I'm eventually able to convert it to dataframe as follows:
import pandas as pd
from requests import get
url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
'startTime' : '2008-06',
'dimensionAtObservation' : 'TimeDimension'
}
r = get(url, params = params)
x = r.json()
d = x['dataSets'][0]['series']
a = pd.DataFrame(d['0:0:0']['observations'])
b = pd.DataFrame(d['0:1:0']['observations'])
This works, apart from needing some manipulation to make the result easier to work with. As there are multiple time series, I can do a version of the same for each, but it goes without saying that this is clunky.
Is there a better/cleaner way to do this?
The pandasdmx library makes this super-simple:
import pandasdmx as sdmx
df = sdmx.Request('OECD').data(
resource_id='MEI_FIN',
key='IR3TIB.GBR+USA.M',
params={'startTime': '2008-06', 'dimensionAtObservation': 'TimeDimension'},
).write()
Absent any responses, here's the solution I came up with. I added a list comprehension to deal with getting each series into a dataframe, and then a transpose as this source resulted in the series being aligned across rows instead of down columns.
import pandas as pd
from requests import get
url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
'startTime' : '2008-06',
'dimensionAtObservation' : 'TimeDimension'
}
r = get(url, params = params)
x = r.json()
d = x['dataSets'][0]['series']
df = [pd.DataFrame(d[i]['observations']).loc[0] for i in d]
df = pd.DataFrame(df).T
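The list comprehension plus transpose can also be written as one nested dict comprehension. A rough sketch, assuming each entry under 'observations' maps a period index to a list whose first element is the value (which is what .loc[0] above relies on). The numbers below are made up purely to show the shape and are not real OECD data:

```python
import pandas as pd

# Made-up sample mimicking the shape of x['dataSets'][0]['series'].
series = {
    "0:0:0": {"observations": {"0": [5.9, 0, None], "1": [5.8, 0, None]}},
    "0:1:0": {"observations": {"0": [2.8, 0, None], "1": [2.7, 0, None]}},
}

# Outer keys become columns (one per series), inner keys become the index
# (one row per period); only the first element of each observation is kept.
df = pd.DataFrame({
    key: {period: values[0] for period, values in entry["observations"].items()}
    for key, entry in series.items()
})
print(df)
```

This yields the series down the columns directly, so no transpose is needed.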

Elastic-Search scroll(scan) into Pandas DataFrame

I need to get a lot of data from Elasticsearch (es), so I'm using the scan helper, which is a wrapper around the native es scroll command.
As a result I get the following generator object: <generator object scan at 0x000001BF5A25E518>. Furthermore, I'd like to load all the data into a pandas DataFrame so I can process it easily.
Code goes as follows:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index-*",
query=body, request_timeout=30, size=1000)
print(response)
#<generator object scan at 0x000001BF5A25E518>
What I want to do is put all the results in a pandas DataFrame. If I print each element in the generator as follows:
for res in response:
    print(res['_source'])
# { .... }
# { .... }
# { .... }
I will get a lot of dictionaries. A naive solution of mine so far is to add them 1 by 1 like so:
df = None
for res in response:
    if df is None:
        df = pd.DataFrame([res['_source']])
    else:
        df = pd.concat([df, pd.DataFrame([res['_source']])], sort=True)
I wish to know if there's a better way in doing so (first, in terms of speed, second, in terms of clean code). For instance, would it be better to accumulate all the results from the generator into a list and then build a complete DataFrame ?
You can use pandas' json_normalize.
from collections import deque
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index",
query=body, request_timeout=30, size=1000)
# Initialize a double ended queue
output_all = deque()
# Extend deque with iterator
output_all.extend(response)
# Convert deque to DataFrame
output_df = json_normalize(output_all)
See the Python documentation for collections.deque for more info on the double-ended queue.
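To the question of accumulating first and building the DataFrame once: that is generally cheaper than calling pd.concat per hit, since each concat copies the frame built so far. A minimal sketch, where fake_scan is a hypothetical stand-in for the generator returned by elasticsearch.helpers.scan and the field names are invented:

```python
import pandas as pd


def fake_scan():
    """Hypothetical stand-in for helpers.scan: yields one hit dict at a time."""
    for i in range(3):
        yield {"_id": str(i), "_source": {"user": f"u{i}", "bytes": i * 10}}


# Pull only the document bodies out of the hits, then build the frame once.
rows = [hit["_source"] for hit in fake_scan()]
df = pd.DataFrame(rows)
print(df)
```

With the real client, the list comprehension would iterate over the response generator instead of fake_scan().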

Easiest way to get API data into str/int format python

I have read through many articles and posts on connecting to an API and formatting the result as int/str. I did manage it, but in possibly the most long-winded way ever, and it's really ugly. Could someone show me the shortest, most efficient way to accomplish what the code below does? Basically, I'm looking to print out "eos" in str format and "price" as int. Any suggestions would be greatly appreciated. Thanks!
import urllib
import json
import pandas as pd
import numpy as np
import requests
r = requests.get('https://api.coinmarketcap.com/v1/ticker/eos/')
with open('events.csv','w') as fd:
    fd.write(r.text)
data = pd.read_csv('events.csv', names=['Choose One'])
i = data.iloc[[6], [0]]
a = str(i)
name,price = a.split(":")
string = price[2:-1]
print(string)
It's simpler to use pandas' read_json to load the response straight into a data frame. read_json automatically assigns a suitable datatype to each column; then use column selection to pick the 'name' and 'price_usd' columns (in this case there is only one row, but the same code works with multiple rows), i.e.
import pandas as pd
df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
print(df[['name','price_usd']].apply(lambda row:'{}: {:.0f}'.format(row['name'],row['price_usd']),axis=1))
Using .0f in the format string displays the price_usd value rounded to the nearest integer, so the output will be:
0 EOS: 9
Alternatively, the round function rounds the float values, i.e.
In [34]: import pandas as pd
    ...: df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
    ...: print(df[['name','price_usd']].apply(lambda row:'{}: {:}'.format(row['name'],round(row['price_usd'],2)),axis=1))
0 EOS: 8.99
dtype: object
Simply use json.loads(r.text) or, even easier, r.json() directly.
Say, right now the api returns the following data:
[
{
"id": "eos",
"name": "EOS",
"symbol": "EOS",
"rank": "9",
"price_usd": "9.31992",
"price_btc": "0.00106154",
"24h_volume_usd": "596467000.0",
"market_cap_usd": "6034993504.0",
"available_supply": "647537050.0",
"total_supply": "900000000.0",
"max_supply": "1000000000.0",
"percent_change_1h": "1.3",
"percent_change_24h": "-6.81",
"percent_change_7d": "-36.4",
"last_updated": "1517755757"
}
]
r.json() returns this already parsed (here, a list containing one dict); otherwise parse it with data = json.loads(r.text). Then load it into a pandas DataFrame with df = pd.DataFrame(data), which looks like the following:
In [15]: df
Out[15]:
24h_volume_usd available_supply id last_updated market_cap_usd max_supply name percent_change_1h percent_change_24h percent_change_7d price_btc price_usd rank symbol total_supply
0 596467000.0 647537050.0 eos 1517755757 6034993504.0 1000000000.0 EOS 1.3 -6.81 -36.4 0.00106154 9.31992 9 EOS 900000000.0
Access the data with pandas indexing:
In [8]: df[['name', 'price_usd']]
Out[8]:
name price_usd
0 EOS 9.29186
Or for printing:
In [18]: print(df.loc[0, 'name'], ':', df.loc[0, 'price_usd'])
EOS : 9.31992
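Note that the API returns every field as a string, so pd.DataFrame(data) keeps price_usd as text. To get the str/int pair the question asks for, something like the following sketch should work; it uses a trimmed copy of the sample payload above instead of a live request, and int() truncates instead of rounding:

```python
import pandas as pd

# Trimmed copy of the sample response shown above (all values are strings).
data = [{"id": "eos", "name": "EOS", "price_usd": "9.31992", "rank": "9"}]

df = pd.DataFrame(data)
# Convert the price column from text to a numeric dtype.
df["price_usd"] = pd.to_numeric(df["price_usd"])

name = str(df.loc[0, "name"])
price = int(df.loc[0, "price_usd"])  # truncates 9.31992 to 9
print(name, price)
```

If rounding to the nearest integer is preferred, wrap the value in round() before int().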

import nested data into pandas from a json file

I have a generated file as follows:
[{"intervals": [{"overwrites": 35588.4, "latency": 479.52}, {"overwrites": 150375.0, "latency": 441.1485001192274}], "uid": "23"}]
I simplified the file a bit for space reasons (there are more columns besides "overwrites" and "latency"). I would like to import the data into a dataframe so I can later plot the latency. I tried the following:
import json
import os
import pandas as pd
with open(os.path.join(path, "my_file.json")) as json_file:
    curr_list = json.load(json_file)
df = pd.Series(curr_list[0]['intervals'])
print(df)
which returned:
0 {u'overwrites': 35588.4, u'latency...
1 {u'overwrites': 150375.0, u'latency...
However, I couldn't get df into a data structure that lets me access the latency field as follows:
graph = df[['latency']]
graph.plot(title="latency")
Any ideas?
Thanks for the help!
I think you can use json_normalize:
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
data = [{"intervals": [{"overwrites": 35588.4, "latency": 479.52},
{"overwrites": 150375.0, "latency": 441.1485001192274}],
"uid": "23"}]
result = json_normalize(data, 'intervals', ['uid'])
print(result)  # column order may vary with the pandas version
latency overwrites uid
0 479.5200 35588.4 23
1 441.1485 150375.0 23
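If the uid column isn't needed, a plain DataFrame built from the list of interval dicts is enough to make the graph = df[['latency']] selection from the question work. A small sketch using the sample data above, with plotting left out so it runs without matplotlib:

```python
import pandas as pd

curr_list = [{"intervals": [{"overwrites": 35588.4, "latency": 479.52},
                            {"overwrites": 150375.0, "latency": 441.1485001192274}],
              "uid": "23"}]

# A list of dicts becomes one row per dict, one column per key.
df = pd.DataFrame(curr_list[0]["intervals"])
graph = df[["latency"]]
print(graph)
```

From here, graph.plot(title="latency") should behave as intended in the question.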
