pandas json-normalize syntax issues - python

This is an example of my json payload:
{'data': [
    {'predictionValues': [
         {'value': 0.9926338328, 'label': 1.0},
         {'value': 0.0073661672, 'label': 0.0}
     ],
     'predictionThreshold': 0.5,
     'prediction': 1.0,
     'rowId': 0,
     'passthroughValues': {'Id': 'AMF012-000272'}},
    {'predictionValues': [
         {'value': 0.446989075, 'label': 1.0},
         {'value': 0.553010925, 'label': 0.0}
     ],
     'predictionThreshold': 0.5,
     'prediction': 0.0,
     'rowId': 1,
     'passthroughValues': {'Id': 'NSF008-000165'}}
]}
I am trying to get a df that looks like this and can't seem to figure it out:
passthroughValues.Id   predictionValues.Value_1.0   predictionValues.Value_0.0
AMF012-000272          0.9926338328                 0.0073661672
NSF008-000165          0.446989075                  0.553010925
Just running it without any parameters doesn't work:
df = json_normalize(finalPredictions)
This returns predictionValues as a series.
df = json_normalize(finalPredictions, ['data', 'predictionValues'])
That only returns the values and labels without the Id needed to associate them back to my data.

Found the answer:
result = json_normalize(finalPredictions['data'], 'predictionValues', [['passthroughValues', 'Id']])
I really only care about the positive result:
result = result[result['label']==1]
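For the exact wide layout shown above (one column per label per Id), the same call can be followed by a pivot. A minimal sketch, assuming a pandas version where json_normalize is exposed as pd.json_normalize:
import pandas as pd

# the payload from the question
finalPredictions = {
    'data': [
        {'predictionValues': [{'value': 0.9926338328, 'label': 1.0},
                              {'value': 0.0073661672, 'label': 0.0}],
         'predictionThreshold': 0.5, 'prediction': 1.0, 'rowId': 0,
         'passthroughValues': {'Id': 'AMF012-000272'}},
        {'predictionValues': [{'value': 0.446989075, 'label': 1.0},
                              {'value': 0.553010925, 'label': 0.0}],
         'predictionThreshold': 0.5, 'prediction': 0.0, 'rowId': 1,
         'passthroughValues': {'Id': 'NSF008-000165'}},
    ]
}

# one row per (prediction, label) pair, carrying the passthrough Id along as metadata
result = pd.json_normalize(
    finalPredictions['data'],
    record_path='predictionValues',
    meta=[['passthroughValues', 'Id']],
)

# keep only the positive class, as above ...
positives = result[result['label'] == 1.0]

# ... or pivot to one column per label, matching the desired layout
wide = result.pivot(index='passthroughValues.Id', columns='label', values='value')
print(wide)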

Related

Python json show data structure

I am working with JSON files that store thousands of entries or even more.
First, I want to understand the data I am working with.
import json

with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
print(json.dumps(data, indent=4))
This gives me an easy-to-read format, but some of the "keys" (I am not familiar with the JSON terminology, so I use the word "key" as in dict objects) have thousands of values, which makes it hard to read as a whole.
I also tried:
import json
import pandas as pd

with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient="index")
print(df.info)
but got
<bound method DataFrame.info of result error
chart [{'meta': {'currency': 'USD', 'symbol': 'AAL',... None>
This result kind of shows the structure, but it ends with ..., not showing the whole picture.
My Question:
Is there something that works like np.array.shape for JSON/dict/pandas objects, which can show the shape of the structure?
Is there a better library for interpreting a JSON file's structure?
Edit:
Sorry, perhaps the wording of my problem was misleading. I tried pprint, and it provided me with:
{ 'chart': { 'error': None,
'result': [ { 'events': { 'dividends': { '1406813400': { 'amount': 0.1,
'date': 1406813400},
'1414675800': { 'amount': 0.1,
'date': 1414675800},
'1423146600': { 'amount': 0.1,
'date': 1423146600},
'1430400600': { 'amount': 0.1,
'date': 1430400600},
'1438867800': { 'amount': 0.1,
'date': 1438867800},
'1446561000': { 'amount': 0.1,
'date': 1446561000},
'1454941800': { 'amount': 0.1,
'date': 1454941800},
'1462195800': { 'amount': 0.1,
'date': 1462195800},
'1470231000': { 'amount': 0.1,
'date': 1470231000},
'1478179800': { 'amount': 0.1,
'date': 1478179800},
'1486650600': { 'amount': 0.1,
'date': 1486650600},
'1494595800': { 'amount': 0.1,
'date': 1494595800},
'1502371800': { 'amount': 0.1,
'date': 1502371800},
'1510324200': { 'amount': 0.1,
'date': 1510324200},
'1517841000': { 'amount': 0.1,
'date': 1517841000},
'1525699800': { 'amount': 0.1,
'date': 1525699800},
'1533562200': { 'amount': 0.1,
'date': 1533562200},
'1541428200': { 'amount': 0.1,
'date': 1541428200},
'1549377000': { 'amount': 0.1,
'date': 1549377000},
'1557235800': { 'amount': 0.1,
'date': 1557235800},
'1565098200': { 'amount': 0.1,
'date': 1565098200},
'1572964200': { 'amount': 0.1,
'date': 1572964200},
'1580826600': { 'amount': 0.1,
'date': 1580826600}}},
'indicators': { 'adjclose': [ { 'adjclose': [ 18.19490623474121,
19.326200485229492,
19.05280113220215,
19.80699920654297,
20.268939971923828,
20.891149520874023,
20.928863525390625,
21.28710174560547,
20.88172149658203,
20.93828773498535,
20.721458435058594,
20.514055252075195,
20.466917037963867,
20.994853973388672,
20.81572914123535,
20.2595157623291,
20.155811309814453,
19.816425323486328,
20.702600479125977,
21.032560348510742,
20.740314483642578,
21.0419864654541,
21.26824951171875,
22.531522750854492,
23.266857147216797,
23.587390899658203,
25.9725284576416,
26.27420997619629,
27.150955200195312,
27.273509979248047,
27.7448787689209,
29.507808685302734,
30.92192840576172,
31.4404239654541,
31.817523956298828,
31.940074920654297,
31.676118850708008,
32.354888916015625,
31.157604217529297,
30.158300399780273,
30.63909339904785,
31.148174285888672,
30.969064712524414,
31.496990203857422,
31.01619529724121,
31.666685104370117,
32.31717300415039,
32.31717300415039,
30.497684478759766,
31.69496726989746,
32.006072998046875,
31.7326717376709,
31.940074920654297,
31.826950073242188,
31.346155166625977,
31.61954689025879,
...
...
...
# This goes on and on for the respective "keys" of the JSON file, which means I have to scroll down thousands of lines to find out what type of data I have.
What I am hoping to find is a solution that outputs something like the example below: it doesn't show the data itself in full, only the "keys" and maybe some additional information, since some files may literally contain many GBs of data, making it impractical to scroll through.
# This is what I am hoping to achieve.
{
"Name": {
"title": <datatype=str,len=20>,
"time_stamp":<data_type=list, len=3000>,
"closing_price":<data_type=list, len=3000>,
"high_price_of_the_day":<data_type=list, len=3000>
...
...
...
}
}
You have a few options for navigating this. If you want to render your data to make more informed decisions quickly, there are built-in libraries for rendering dictionaries (see pprint), but on a personal level I recommend something that works out of the box without much configuration. I found pprintpp to be the ideal choice for any Python data structure: https://pypi.org/project/pprintpp/
Simply run in your terminal:
pip3 install pprintpp
The library should install under C:\Users\User\AppData\Local\Programs\Python\PythonXX\Lib\site-packages\pprintpp
After that, simply do this in your code:
import json
from pprintpp import pprint

with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
pprint(data)
You can also do pprint(data, width=1) to guarantee that the next dictionary key goes on the next line, even if the key is short, e.g.:
some_dict = {'a': 'b', 'c': {'aa': 'bb'}}
pprint(some_dict, width=1)
Outputs:
{
'a': 'b',
'c': {
'aa': 'bb',
},
}
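If what you ultimately want is the keys-plus-shape summary sketched in the question, a small recursive helper over the loaded dict can print just the key names, types and lengths instead of the values. This is only a rough sketch (summarize is a made-up name, not part of pprintpp) and assumes data is the dict loaded above:
def summarize(obj, indent=0):
    """Print keys with their types and lengths instead of the full values."""
    pad = ' ' * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            if isinstance(value, (dict, list)):
                print(f"{pad}{key}: <type={type(value).__name__}, len={len(value)}>")
                summarize(value, indent + 4)
            else:
                print(f"{pad}{key}: <type={type(value).__name__}>")
    elif isinstance(obj, list) and obj:
        # only describe the first element; the rest usually share its shape
        summarize(obj[0], indent)

summarize(data)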
Hope this helped! Cheers :)

How to split JSON fields containing '_' into sub-objects?

I have the following JSON object, in which I need to post-process some labels:
{
    'id': '123',
    'type': 'A',
    'fields': {
        'device_safety': {
            'cost': 0.237,
            'total': 22
        },
        'device_unit_replacement': {
            'cost': 0.262,
            'total': 7
        },
        'software_generalinfo': {
            'cost': 3.6,
            'total': 10
        }
    }
}
I need to split the names of labels by _ to get the following hierarchy:
{
    'id': '123',
    'type': 'A',
    'fields': {
        'device': {
            'safety': {
                'cost': 0.237,
                'total': 22
            },
            'unit': {
                'replacement': {
                    'cost': 0.262,
                    'total': 7
                }
            }
        },
        'software': {
            'generalinfo': {
                'cost': 3.6,
                'total': 10
            }
        }
    }
}
This is my current version, but I got stuck and am not sure how to deal with the hierarchy of fields:
import json

json_object = json.load(raw_json)
newjson = {}
for x, y in json_object['fields'].items():
    hierarchy = x.split("_")
    if len(hierarchy) > 1:
        for k in hierarchy:
            newjson[k] = ????
newjson = json.dumps(newjson, indent=4)
Here is a recursive function that will process a dict and split the keys:
def splitkeys(dct):
    if not isinstance(dct, dict):
        return dct
    new_dct = {}
    for k, v in dct.items():
        bits = k.split('_')
        d = new_dct
        for bit in bits[:-1]:
            d = d.setdefault(bit, {})
        d[bits[-1]] = splitkeys(v)
    return new_dct
>>> splitkeys(json_object)
{'fields': {'device': {'safety': {'cost': 0.237, 'total': 22},
'unit': {'replacement': {'cost': 0.262, 'total': 7}}},
'software': {'generalinfo': {'cost': 3.6, 'total': 10}}},
'id': '123',
'type': 'A'}
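To get back the pretty-printed JSON string the question was building with json.dumps, you can feed the result of splitkeys straight to it (json_object being the dict shown at the top of the question):
import json

# serialise the restructured dict back to indented JSON
newjson = json.dumps(splitkeys(json_object), indent=4)
print(newjson)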

Python - merge nested dictionaries

I have this data structure:
playlists_user1 = {'user1': [
    {'playlist1': {
        'tracks': [
            {'name': 'Karma Police', 'artist': 'Radiohead', 'count': 1.0},
            {'name': 'Bitter Sweet Symphony', 'artist': 'The Verve', 'count': 2.0}
        ]
    }},
    {'playlist2': {
        'tracks': [
            {'name': 'We Will Rock You', 'artist': 'Queen', 'count': 3.0},
            {'name': 'Roxanne', 'artist': 'Police', 'count': 5.0}
        ]
    }},
]}

playlists_user2 = {'user2': [
    {'playlist1': {
        'tracks': [
            {'name': 'Karma Police', 'artist': 'Radiohead', 'count': 2.0},
            {'name': 'Sonnet', 'artist': 'The Verve', 'count': 4.0}
        ]
    }},
    {'playlist2': {
        'tracks': [
            {'name': 'We Are The Champions', 'artist': 'Queen', 'count': 4.0},
            {'name': 'Bitter Sweet Symphony', 'artist': 'The Verve', 'count': 1.0}
        ]
    }},
]}
I would like to merge the two nested dictionaries into a single data structure, whose first item would be a playlist key, like so:
{'playlist1': {'tracks': [{'count': 1.0, 'name': 'Karma Police', 'artist': 'Radiohead'}, {'count': 2.0, 'name': 'Bitter Sweet Symphony', 'artist': 'The Verve'}]}}
I have tried:
prefs1 = playlists_user1['user1'][0]
prefs2 = playlists_user2['user2'][0]
prefs3 = prefs1.update(prefs2)
to no avail.
How do I solve this?
Since your playlist values are already lists of dictionaries with one key each (a bit obfuscated if you ask me):
combined = {}
# list(...) so this also works on Python 3, where dict.values() is a view
for playlist in list(playlists_user1.values())[0]:
    combined.update(playlist)
for playlist in list(playlists_user2.values())[0]:
    combined.update(playlist)
or, if you have many of these, make a list and:
combined = {}
for playlistlist in user_playlists:
    for playlist in list(playlistlist.values())[0]:
        combined.update(playlist)
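Put together with the two dictionaries from the question (user_playlists is just a hypothetical name for the wrapper list), the result is keyed by playlist name; note that for duplicate playlist names the later user simply overwrites the earlier one:
user_playlists = [playlists_user1, playlists_user2]

combined = {}
for playlistlist in user_playlists:
    for playlist in list(playlistlist.values())[0]:
        combined.update(playlist)

# combined['playlist1'] now holds user2's version, since it was merged last
print(combined['playlist1'])
# {'tracks': [{'name': 'Karma Police', 'artist': 'Radiohead', 'count': 2.0},
#             {'name': 'Sonnet', 'artist': 'The Verve', 'count': 4.0}]}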

remove points from excel file line chart using python package xlsxwriter

I have this chart, created using the Python package xlsxwriter, and I want to remove all the points in the middle and keep only the first and the last one.
Before:
Final result wanted:
I tried the points attribute but unfortunately it didn't work for me.
line_chart.add_series({
    'values': '='+worksheet_name+'!$C$'+str(row_range+1)+':$I'+str(row_range+1),
    'marker': {'type': 'diamond'},
    'data_labels': {'value': True, 'category': True, 'position': 'center', 'leader_lines': True},
    'points': [
        {'fill': {'color': 'green'}},
        None,
        None,
        None,
        None,
        None,
        {'fill': {'color': 'red'}}
    ],
    'trendline': {
        'type': 'polynomial',
        'name': 'My trend name',
        'order': 2,
        'forward': 0.5,
        'backward': 0.5,
        'line': {
            'color': 'red',
            'width': 1,
            'dash_type': 'long_dash'
        }
    }
})
I also tried:
line_chart.set_x_axis({
    'major_gridlines': {'visible': False},
    'minor_gridlines': {'visible': False},
    'delete_series': [1, 6]
})
No luck.
Can anybody help me, please?
Thanks in advance!

mongodb python get the element position from an array in a document

I use Python + MongoDB to store some item ranking data in a collection called chart:
{
    date: date1,
    region: region1,
    ranking: [
        {
            item: bson.dbref.DBRef(db.item.find_one()),
            price: current_price,
            version: '1.0'
        },
        {
            item: bson.dbref.DBRef(db.item.find_another_one()),
            price: current_price,
            version: '1.0'
        },
        .... (and the array goes on)
    ]
}
Now my problem is: I want to make a history ranking chart for itemA. According to the documentation on the $ positional operator, the query should be something like this:
db.chart.find({'ranking.item': bson.dbref.DBRef('item', itemA._id)}, ['$'])
But the $ operator doesn't work.
Any other possible solution to this?
The $ positional operator is only used in update(...) calls; you can't use it to return the position within an array.
However, you can use field projection to limit the fields returned to just those you need to calculate the position in the array from within Python:
db.foo.insert({
    'date': '2011-04-01',
    'region': 'NY',
    'ranking': [
        {'item': 'Coca-Cola', 'price': 1.00, 'version': 1},
        {'item': 'Diet Coke', 'price': 1.25, 'version': 1},
        {'item': 'Diet Pepsi', 'price': 1.50, 'version': 1},
    ]})
db.foo.insert({
    'date': '2011-05-01',
    'region': 'NY',
    'ranking': [
        {'item': 'Diet Coke', 'price': 1.25, 'version': 1},
        {'item': 'Coca-Cola', 'price': 1.00, 'version': 1},
        {'item': 'Diet Pepsi', 'price': 1.50, 'version': 1},
    ]})
db.foo.insert({
    'date': '2011-06-01',
    'region': 'NY',
    'ranking': [
        {'item': 'Coca-Cola', 'price': 1.00, 'version': 1},
        {'item': 'Diet Pepsi', 'price': 1.50, 'version': 1},
        {'item': 'Diet Coke', 'price': 1.25, 'version': 1},
    ]})
def position_of(item, ranking):
    for i, candidate in enumerate(ranking):
        if candidate['item'] == item:
            return i
    return None

print([position_of('Diet Coke', x['ranking'])
       for x in db.foo.find({'ranking.item': 'Diet Coke'}, ['ranking.item'])])
# prints [1, 0, 2]
In this (admittedly trivial) example, returning just a subset of fields may not show much benefit; however, if your documents are especially large, doing so may show performance improvements.
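On MongoDB 3.4 and later you also have the option of computing the position server-side with the $indexOfArray aggregation operator instead of scanning in Python. A sketch against the same sample documents (an alternative approach, not part of the projection technique above):
pipeline = [
    {'$match': {'ranking.item': 'Diet Coke'}},
    {'$project': {
        '_id': 0,
        'date': 1,
        # index of the first ranking entry whose item matches
        'position': {'$indexOfArray': ['$ranking.item', 'Diet Coke']},
    }},
]

for doc in db.foo.aggregate(pipeline):
    print(doc)
# {'date': '2011-04-01', 'position': 1}
# {'date': '2011-05-01', 'position': 0}
# {'date': '2011-06-01', 'position': 2}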
