Extract key and value from json to new dataframe - python

I have a dataframe whose columns contain JSON values nested into multiple levels. I would like to extract the leaf keys and values into a new dataframe. A sample column value is given below:
{'shipping_assignments': [
    {'shipping': {
        'address': {
            'address_type': 'shipping', 'city': 'Calder', 'country_id': 'US',
            'customer_address_id': 1, 'email': 'roni_cost#example.com',
            'entity_id': 1, 'firstname': 'Veronica', 'lastname': 'Costello',
            'parent_id': 1, 'postcode': '49628-7978', 'region': 'Michigan',
            'region_code': 'MI', 'region_id': 33,
            'street': ['6146 Honey Bluff Parkway'], 'telephone': '(555) 229-3326'},
        'method': 'flatrate_flatrate',
        'total': {
            'base_shipping_amount': 5, 'base_shipping_discount_amount': 0,
            'base_shipping_discount_tax_compensation_amnt': 0,
            'base_shipping_incl_tax': 5, 'base_shipping_invoiced': 5,
            'base_shipping_tax_amount': 0, 'shipping_amount': 5,
            'shipping_discount_amount': 0,
            'shipping_discount_tax_compensation_amount': 0,
            'shipping_incl_tax': 5, 'shipping_invoiced': 5,
            'shipping_tax_amount': 0}},
     'items': [{
        'amount_refunded': 0, 'applied_rule_ids': '1',
        'base_amount_refunded': 0, 'base_discount_amount': 0,
        'base_discount_invoiced': 0, 'base_discount_tax_compensation_amount': 0,
        'base_discount_tax_compensation_invoiced': 0,
        'base_original_price': 29, 'base_price': 29,
        'base_price_incl_tax': 31.39, 'base_row_invoiced': 29,
        'base_row_total': 29, 'base_row_total_incl_tax': 31.39,
        'base_tax_amount': 2.39, 'base_tax_invoiced': 2.39,
        'created_at': '2019-09-27 10:03:45', 'discount_amount': 0,
        'discount_invoiced': 0, 'discount_percent': 0, 'free_shipping': 0,
        'discount_tax_compensation_amount': 0,
        'discount_tax_compensation_invoiced': 0, 'is_qty_decimal': 0,
        'item_id': 1, 'name': 'Iris Workout Top', 'no_discount': 0,
        'order_id': 1, 'original_price': 29, 'price': 29,
        'price_incl_tax': 31.39, 'product_id': 1434,
        'product_type': 'configurable', 'qty_canceled': 0, 'qty_invoiced': 1,
        'qty_ordered': 1, 'qty_refunded': 0, 'qty_shipped': 1,
        'row_invoiced': 29, 'row_total': 29, 'row_total_incl_tax': 31.39,
        'row_weight': 1, 'sku': 'WS03-XS-Red', 'store_id': 1,
        'tax_amount': 2.39, 'tax_invoiced': 2.39, 'tax_percent': 8.25,
        'updated_at': '2019-09-27 10:03:46', 'weight': 1,
        'product_option': {'extension_attributes': {
            'configurable_item_options': [
                {'option_id': '141', 'option_value': 167},
                {'option_id': '93', 'option_value': 58}]}}}]}],
 'payment_additional_info': [{'key': 'method_title',
                              'value': 'Check / Money order'}],
 'applied_taxes': [{'code': 'US-MI-*-Rate 1', 'title': 'US-MI-*-Rate 1',
                    'percent': 8.25, 'amount': 2.39, 'base_amount': 2.39}],
 'item_applied_taxes': [{'type': 'product',
                         'applied_taxes': [{'code': 'US-MI-*-Rate 1',
                                            'title': 'US-MI-*-Rate 1',
                                            'percent': 8.25, 'amount': 2.39,
                                            'base_amount': 2.39}]}],
 'converting_from_quote': True}
Above is a single row value of the dataframe column df['x'].
My conversion code is below:
sample = data['x'].tolist()
data = json.dumps(sample)
df = pd.read_json(data)
It gives a new dataframe with these columns:
Index(['applied_taxes', 'converting_from_quote', 'item_applied_taxes',
'payment_additional_info', 'shipping_assignments'],
dtype='object')
When I tried to do the same to convert a column whose rows hold such values:
m_df = df['applied_taxes'].apply(lambda x : re.sub('.?\[|$.|]',"", str(x)))
m_sample = m_df.tolist()
m_data = json.dumps(m_sample)
c_df = pd.read_json(m_data)
It doesn't work

I came across a nice ETL package in Python called petl. Convert the JSON list into a petl table with the fromdicts() function, which takes a list of dicts:
order_table = fromdicts(data_list)
If you find a nested dict in any of the columns, use unpackdict(order_table, 'nested_col') to unpack it.
In my case, I need to unpack the applied_taxes column. The code below unpacks it, appending the keys and values as columns and rows of the same table:
order_table = unpackdict(order_table, 'applied_taxes')
If you want to know more about petl, see its documentation.
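For reference, a minimal self-contained sketch of that flow (the toy rows are my own; note that in the sample order, applied_taxes actually holds a list of dicts, which may need flattening before unpackdict applies):
import petl as etl

# toy rows with one nested-dict column, standing in for the order data
rows = [
    {'id': 1, 'applied_taxes': {'code': 'US-MI-*-Rate 1', 'percent': 8.25}},
    {'id': 2, 'applied_taxes': {'code': 'US-CA-*-Rate 2', 'percent': 7.25}},
]

order_table = etl.fromdicts(rows)
# spread each key of the nested dict into its own column
order_table = etl.unpackdict(order_table, 'applied_taxes')
print(etl.look(order_table))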

It seems that your mistake was in tolist(). Try the following:
import pandas as pd
import json
import re
data = {"shipping_assignments":[{"shipping":{"address":{"address_type":"shipping","city":"Calder","country_id":"US","customer_address_id":1,"email":"roni_cost#example.com","entity_id":1,"firstname":"Veronica","lastname":"Costello","parent_id":1,"postcode":"49628-7978","region":"Michigan","region_code":"MI","region_id":33,"street":["6146 Honey Bluff Parkway"],"telephone":"(555) 229-3326"},"method":"flatrate_flatrate","total":{"base_shipping_amount":5,"base_shipping_discount_amount":0,"base_shipping_discount_tax_compensation_amnt":0,"base_shipping_incl_tax":5,"base_shipping_invoiced":5,"base_shipping_tax_amount":0,"shipping_amount":5,"shipping_discount_amount":0,"shipping_discount_tax_compensation_amount":0,"shipping_incl_tax":5,"shipping_invoiced":5,"shipping_tax_amount":0}},"items":[{"amount_refunded":0,"applied_rule_ids":"1","base_amount_refunded":0,"base_discount_amount":0,"base_discount_invoiced":0,"base_discount_tax_compensation_amount":0,"base_discount_tax_compensation_invoiced":0,"base_original_price":29,"base_price":29,"base_price_incl_tax":31.39,"base_row_invoiced":29,"base_row_total":29,"base_row_total_incl_tax":31.39,"base_tax_amount":2.39,"base_tax_invoiced":2.39,"created_at":"2019-09-27 10:03:45","discount_amount":0,"discount_invoiced":0,"discount_percent":0,"free_shipping":0,"discount_tax_compensation_amount":0,"discount_tax_compensation_invoiced":0,"is_qty_decimal":0,"item_id":1,"name":"Iris Workout Top","no_discount":0,"order_id":1,"original_price":29,"price":29,"price_incl_tax":31.39,"product_id":1434,"product_type":"configurable","qty_canceled":0,"qty_invoiced":1,"qty_ordered":1,"qty_refunded":0,"qty_shipped":1,"row_invoiced":29,"row_total":29,"row_total_incl_tax":31.39,"row_weight":1,"sku":"WS03-XS-Red","store_id":1,"tax_amount":2.39,"tax_invoiced":2.39,"tax_percent":8.25,"updated_at":"2019-09-27 10:03:46","weight":1,"product_option":{"extension_attributes":{"configurable_item_options":[{"option_id":"141","option_value":167},{"option_id":"93","option_value":58}]}}}]}],"payment_additional_info":[{"key":"method_title","value":"Check / Money order"}],"applied_taxes":[{"code":"US-MI-*-Rate 1","title":"US-MI-*-Rate 1","percent":8.25,"amount":2.39,"base_amount":2.39}],"item_applied_taxes":[{"type":"product","applied_taxes":[{"code":"US-MI-*-Rate 1","title":"US-MI-*-Rate 1","percent":8.25,"amount":2.39,"base_amount":2.39}]}],"converting_from_quote":"True"}
df = pd.read_json(json.dumps(data))
m_df = df['applied_taxes'].apply(lambda x: re.sub(r'.?\[|$.|]', "", str(x)))
c_df = pd.read_json(json.dumps(list(m_df)))
print(c_df)
prints the following:
0
0 {'code': 'US-MI-*-Rate 1', 'title': 'US-MI-*-R...
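As a side note, here is a sketch of an alternative (my own addition, assuming pandas >= 1.0, where json_normalize is exposed at the top level): pd.json_normalize can flatten the nested record directly, without round-tripping through json.dumps:
# flatten the applied_taxes list into its own dataframe
taxes = pd.json_normalize(data, record_path='applied_taxes')
print(taxes)
# or flatten the entire record; nested keys become dotted column names
wide = pd.json_normalize(data)
print(wide.columns)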

Related

Is one of the numbers in this list in between the two given integers?

I have a list with barline ticks and MIDI notes that can overlap the barlines. So I made a list of barline ticks:
barlinepos = [0, 768.0, 1536.0, 2304.0, 3072.0, 3840.0, 4608.0, 5376.0, 6144.0, 6912.0, 0, 576.0, 1152.0, 1728.0, 2304.0, 2880.0, 3456.0, 4032.0, 4608.0, 5184.0, 5760.0, 6336.0, 6912.0, 7488.0]
And a MidiFile:
{'type': 'time_signature', 'numerator': 4, 'denominator': 4, 'time': 0, 'duration': 768, 'ID': 0}
{'type': 'set_tempo', 'tempo': 500000, 'time': 0, 'ID': 1}
{'type': 'track_name', 'name': 'Tempo Track', 'time': 0, 'ID': 2}
{'type': 'track_name', 'name': 'New Instrument', 'time': 0, 'ID': 3}
{'type': 'note_on', 'time': 0, 'channel': 0, 'note': 48, 'velocity': 100, 'ID': 4, 'duration': 956}
{'type': 'time_signature', 'numerator': 3, 'denominator': 4, 'time': 768, 'duration': 6911, 'ID': 5}
{'type': 'note_on', 'time': 768, 'channel': 0, 'note': 46, 'velocity': 100, 'ID': 6, 'duration': 575}
{'type': 'note_off', 'time': 956, 'channel': 0, 'note': 48, 'velocity': 0, 'ID': 7}
{'type': 'note_off', 'time': 1343, 'channel': 0, 'note': 46, 'velocity': 0, 'ID': 8}
{'type': 'end_of_track', 'time': 7679, 'ID': 9}
And I want to check if a MIDI note overlaps a barline. Every note_on message has a 'time' and a 'duration' value. I have to check whether one of the barline ticks (in the list) falls inside the note's range, from 'time' to 'time' + 'duration'. I tried:
if barlinepos in range(0, 956):
    print(True)
Of course this doesn't work because barlinepos is a list. How can I check if one of the values in the list results in True?
Simple iteration to solve the requirement:
for msg in midifile:
    # only note_on messages carry both 'time' and 'duration'
    if msg.get("type") != "note_on":
        continue
    start, end = msg["time"], msg["time"] + msg["duration"]
    for tick in barlinepos:
        if start <= tick <= end:
            print(True)
            break
    else:
        # no barline tick fell inside this note
        print(False)
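The same check reads a little more compactly with any(), which short-circuits on the first overlapping tick (a variant of my own, not from the original answer):
for msg in midifile:
    if msg.get("type") == "note_on":
        start, end = msg["time"], msg["time"] + msg["duration"]
        # True if any barline tick falls within this note's span
        print(any(start <= tick <= end for tick in barlinepos))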

How can I add column values from a pandas dataframe as a new key value pair in another column having dictionary?

I have a dataframe, for example:
df = {'dicts': [{'id': 0, 'text': 'Willamette'},
                {'id': 1, 'text': 'Valley'}],
      'ner': ["Person", "Location"]}
df = pd.DataFrame(df)
I want the end result to look like:
{'id': 0, 'text': 'Willamette', 'ner': 'Person'}
{'id': 1, 'text': 'Valley', 'ner': 'Location'}
I am using the following logic, but it isn't working for me:
for i, rows in df["dicts"].iteritems():
    for cat in df['ner']:
        df["dicts"][i] = df["dicts"][i].update({'ner': df['ner'][cat]})
How can I solve this?
IIUC
d = pd.DataFrame(df.dicts.tolist(), index=df.index).join(df[['ner']]).to_dict('records')
[{'id': 0, 'text': 'Willamette', 'ner': 'Person'}, {'id': 1, 'text': 'Valley', 'ner': 'Location'}]
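If you prefer something more explicit than the one-liner, a plain loop over the two columns does the same job (a sketch of my own):
result = []
for d, label in zip(df['dicts'], df['ner']):
    merged = dict(d)   # copy so the original dicts aren't mutated
    merged['ner'] = label
    result.append(merged)
print(result)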

Extract values from array in python

I'm having some trouble accessing a value that is inside an array that contains a dictionary and another array.
It looks like this:
[{'name': 'Alex',
'number_of_toys': [{'classification': 3, 'count': 383},
{'classification': 1, 'count': 29},
{'classification': 0, 'count': 61}],
'total_toys': 473},
{'name': 'John',
'number_of_toys': [{'classification': 3, 'count': 8461},
{'classification': 0, 'count': 3825},
{'classification': 1, 'count': 1319}],
'total_toys': 13605}]
I want to access the 'count' number for each 'classification'. For example, for 'name' Alex, if 'classification' is 3, then the code returns the 'count' of 383, and so on for the other classifications and names.
Thanks for your help!
Not sure what your question asks, but if it's just a mapping exercise this will get you on the right track.
def get_toys(personDict):
    person_toys = personDict.get('number_of_toys')
    return [(toys.get('classification'), toys.get('count')) for toys in person_toys]

def get_person_toys(database):
    return [(personDict.get('name'), get_toys(personDict)) for personDict in database]
The result is:
[('Alex', [(3, 383), (1, 29), (0, 61)]), ('John', [(3, 8461), (0, 3825), (1, 1319)])]
This isn't as elegant as the previous answer because it doesn't iterate over the values, but if you want to select specific elements, this is one way to do that:
data = [{'name': 'Alex',
'number_of_toys': [{'classification': 3, 'count': 383},
{'classification': 1, 'count': 29},
{'classification': 0, 'count': 61}],
'total_toys': 473},
{'name': 'John',
'number_of_toys': [{'classification': 3, 'count': 8461},
{'classification': 0, 'count': 3825},
{'classification': 1, 'count': 1319}],
'total_toys': 13605}]
import pandas as pd
df = pd.DataFrame(data)
print(df.loc[0]['name'])
# label-based lookup; positional access like df.loc[0][1] on a labeled Series is deprecated
print(df.loc[0]['number_of_toys'][0]['classification'])
print(df.loc[0]['number_of_toys'][0]['count'])
which gives:
Alex
3
383
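If the goal is really just "given a name and a classification, return the count", a small lookup helper over the original list may be the most direct route (a sketch of my own; get_count is not from either answer):
def get_count(data, name, classification):
    # find the person by name, then the matching classification entry
    for person in data:
        if person['name'] == name:
            for toy in person['number_of_toys']:
                if toy['classification'] == classification:
                    return toy['count']
    return None

print(get_count(data, 'Alex', 3))  # -> 383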

Adding node elements to json object in Python from NetworkX

I have a json object that I made using networkx:
json_data = json_graph.node_link_data(network_object)
It is structured like this (mini version of my output):
>>> json_data
{'directed': False,
'graph': {'name': 'compose( , )'},
'links': [{'source': 0, 'target': 7, 'weight': 1},
{'source': 0, 'target': 2, 'weight': 1},
{'source': 0, 'target': 12, 'weight': 1},
{'source': 0, 'target': 9, 'weight': 1},
{'source': 2, 'target': 18, 'weight': 25},
{'source': 17, 'target': 25, 'weight': 1},
{'source': 29, 'target': 18, 'weight': 1},
{'source': 30, 'target': 18, 'weight': 1}],
'multigraph': False,
'nodes': [{'bipartite': 1, 'id': 'Icarus', 'node_type': 'Journal'},
{'bipartite': 1,
'id': 'A Giant Step: from Milli- to Micro-arcsecond Astrometry',
'node_type': 'Journal'},
{'bipartite': 1,
'id': 'The Astrophysical Journal Supplement Series',
'node_type': 'Journal'},
{'bipartite': 1,
'id': 'Astronomy and Astrophysics Supplement Series',
'node_type': 'Journal'},
{'bipartite': 1, 'id': 'Astronomy and Astrophysics', 'node_type': 'Journal'},
{'bipartite': 1,
'id': 'Astronomy and Astrophysics Review',
'node_type': 'Journal'}]}
What I want to do is add the following elements to each of the nodes so I can use this data as an input for sigma.js:
"x": 0,
"y": 0,
"size": 3
"centrality": 0
I can't seem to find an efficient way to do this though using add_node(). Is there some obvious way to add this that I'm missing?
While you have your data as a networkx graph, you could use the set_node_attributes method to add the attributes (e.g. stored in a python dictionary) to all the nodes in the graph.
In my example the new attributes are stored in the dictionary attr:
import networkx as nx
from networkx.readwrite import json_graph
# example graph
G = nx.Graph()
G.add_nodes_from(["a", "b", "c", "d"])
# your data
#G = json_graph.node_link_graph(json_data)
# dictionary of new attributes
attr = {"x": 0,
"y": 0,
"size": 3,
"centrality": 0}
for name, value in attr.items():
nx.set_node_attributes(G, name, value)
# check new node attributes
print(G.nodes(data=True))
You can then export the new graph in JSON with node_link_data.
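For completeness, a short sketch of that export step (the printed line assumes the toy graph and attr dictionary above):
# serialize back to node-link JSON for sigma.js; each node now carries
# the x, y, size and centrality attributes
json_data = json_graph.node_link_data(G)
print(json_data['nodes'][0])
# e.g. {'x': 0, 'y': 0, 'size': 3, 'centrality': 0, 'id': 'a'}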

cx_Oracle ignores order by clause

I've created a complex query builder in my project, and during tests I stumbled upon a strange issue: the same query with the same plan produces different results on different clients. cx_Oracle ignores the ORDER BY clause, while Oracle SQL Developer processes the query correctly; in both cases the ORDER BY is present in the plan.
Query in question is:
select *
from
(
    select
        a.*,
        ROWNUM tmp__rnum
    from
    (
        select base.*
        from
        (
            select id
            from
            (
                (
                    select
                        profile_id as id,
                        surname as sort__col
                    from names
                )
                /* here usually are several other subqueries chained by unions */
            )
            group by id
            order by min(sort__col) asc
        ) tmp
        left join (profiles) base
            on tmp.id = base.id
        where exists
        (
            select t.object_id
            from object_rights t
            where
                t.object_id = base.id
                and t.subject_id = :a__subject_id
                and t.rights in ('r','w')
        )
    ) a
    where ROWNUM < :rows_to
)
where tmp__rnum >= :rows_from
And the plan from cx_Oracle, in case I missed anything:
{'operation': 'SELECT STATEMENT', 'position': 9225, 'cardinality': 2164, 'time': 1, 'cost': 9225, 'depth': 0, 'bytes': 84396, 'optimizer': 'ALL_ROWS', 'id': 0, 'cpu_cost': 1983805801},
{'operation': 'VIEW', 'position': 1, 'filter_predicates': '"TMP__RNUM">=TO_NUMBER(:ROWS_FROM)', 'parent_id': 0, 'object_instance': 1, 'cardinality': 2164, 'object_name': 'SEL$1', 'projection': '"from$_subquery$_001"."ID"[NUMBER,22], "from$_subquery$_001"."CREATION_TIME"[TIMESTAMP,11], "TMP__RNUM"[NUMBER,22]', 'time': 1, 'cost': 9225, 'depth': 1, 'bytes': 84396, 'id': 1, 'cpu_cost': 1983805801},
{'operation': 'COUNT', 'position': 1, 'filter_predicates': 'ROWNUM<TO_NUMBER(:ROWS_TO)', 'parent_id': 1, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11], ROWNUM[8]', 'options': 'STOPKEY', 'depth': 2, 'id': 2},
{'operation': 'HASH JOIN', 'position': 1, 'parent_id': 2, 'access_predicates': '"TMP"."ID"="BASE"."ID"', 'cardinality': 2164, 'projection': '(#keys=1) "BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'time': 1, 'cost': 9225, 'depth': 3, 'bytes': 86560, 'id': 3, 'cpu_cost': 1983805801},
{'operation': 'JOIN FILTER', 'position': 1, 'parent_id': 3, 'object_owner': 'SYS', 'cardinality': 2219, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'object_name': ':BF0000', 'time': 1, 'cost': 662, 'options': 'CREATE', 'depth': 4, 'bytes': 59913, 'id': 4, 'cpu_cost': 223290732},
{'operation': 'HASH JOIN', 'position': 1, 'parent_id': 4, 'access_predicates': '"T"."OBJECT_ID"="BASE"."ID"', 'cardinality': 2219, 'projection': '(#keys=1) "BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'time': 1, 'cost': 662, 'options': 'RIGHT SEMI', 'depth': 5, 'bytes': 59913, 'id': 5, 'cpu_cost': 223290732},
{'operation': 'TABLE ACCESS', 'position': 1, 'filter_predicates': '"T"."SUBJECT_ID"=TO_NUMBER(:A__SUBJECT_ID) AND ("T"."RIGHTS"=\'r\' OR "T"."RIGHTS"=\'w\')', 'parent_id': 5, 'object_type': 'TABLE', 'object_instance': 8, 'cardinality': 2219, 'projection': '"T"."OBJECT_ID"[NUMBER,22]', 'object_name': 'OBJECT_RIGHTS', 'time': 1, 'cost': 5, 'options': 'FULL', 'depth': 6, 'bytes': 24409, 'optimizer': 'ANALYZED', 'id': 6, 'cpu_cost': 1823386},
{'operation': 'TABLE ACCESS', 'position': 2, 'parent_id': 5, 'object_type': 'TABLE', 'object_instance': 6, 'cardinality': 753862, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'object_name': 'PROFILES', 'time': 1, 'cost': 654, 'options': 'FULL', 'depth': 6, 'bytes': 12061792, 'optimizer': 'ANALYZED', 'id': 7, 'cpu_cost': 145148296},
{'operation': 'VIEW', 'position': 2, 'parent_id': 3, 'object_instance': 3, 'cardinality': 735296, 'projection': '"TMP"."ID"[NUMBER,22]', 'time': 1, 'cost': 8559, 'depth': 4, 'bytes': 9558848, 'id': 8, 'cpu_cost': 1686052619},
{'operation': 'SORT', 'position': 1, 'parent_id': 8, 'cardinality': 735296, 'projection': '(#keys=1) MIN("SURNAME")[50], "PROFILE_ID"[NUMBER,22]', 'time': 1, 'cost': 8559, 'options': 'ORDER BY', 'temp_space': 18244000, 'depth': 5, 'bytes': 10294144, 'id': 9, 'cpu_cost': 1686052619},
{'operation': 'HASH', 'position': 1, 'parent_id': 9, 'cardinality': 735296, 'projection': '(#keys=1; rowset=200) "PROFILE_ID"[NUMBER,22], MIN("SURNAME")[50]', 'time': 1, 'cost': 8559, 'options': 'GROUP BY', 'temp_space': 18244000, 'depth': 6, 'bytes': 10294144, 'id': 10, 'cpu_cost': 1686052619},
{'operation': 'JOIN FILTER', 'position': 1, 'parent_id': 10, 'object_owner': 'SYS', 'cardinality': 756586, 'projection': '(rowset=200) "PROFILE_ID"[NUMBER,22], "SURNAME"[VARCHAR2,50]', 'object_name': ':BF0000', 'time': 1, 'cost': 1202, 'options': 'USE', 'depth': 7, 'bytes': 10592204, 'id': 11, 'cpu_cost': 190231639},
{'operation': 'TABLE ACCESS', 'position': 1, 'filter_predicates': 'SYS_OP_BLOOM_FILTER(:BF0000,"PROFILE_ID")', 'parent_id': 11, 'object_type': 'TABLE', 'object_instance': 5, 'cardinality': 756586, 'projection': '(rowset=200) "PROFILE_ID"[NUMBER,22], "SURNAME"[VARCHAR2,50]', 'object_name': 'NAMES', 'time': 1, 'cost': 1202, 'options': 'FULL', 'depth': 8, 'bytes': 10592204, 'optimizer': 'ANALYZED', 'id': 12, 'cpu_cost': 190231639}
cx_Oracle output (appears to be ordered by id):
ID, Created, rownum
(1829, 2016-08-24, 1)
(2438, 2016-08-24, 2)
SQL Developer output (ordered by surname, as expected):
ID, Created, rownum
(518926, 2016-08-28, 1)
(565556, 2016-08-29, 2)
I don't see an ORDER BY clause that would affect the ordering of the results of the query. In SQL, the only way to guarantee the ordering of a result set is to have an ORDER BY clause for the outer-most SELECT.
In almost all cases, an ORDER BY in a subquery is not necessarily respected (Oracle makes an exception when there are rownum comparisons in the next level of the query -- and even that is now out of date with the support of FETCH FIRST <n> ROWS).
So, there is no reason to expect that an ORDER BY in the innermost subquery would have any effect, particularly with the JOIN that then happens.
Suggestions:
Move the ORDER BY to the outermost query.
Use FETCH FIRST syntax, if you are using Oracle 12c+.
Move the ORDER BY after the JOIN.
Use ROW_NUMBER() instead of rownum.
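As an illustration of the first two suggestions, a hedged sketch (the connection details and the simplified query are placeholders of mine, not from the question; assumes Oracle 12c+ for OFFSET/FETCH):
import cx_Oracle  # placeholder connection details below

conn = cx_Oracle.connect("user", "password", "host/service")
cur = conn.cursor()
cur.execute(
    """
    select p.id, p.creation_time
    from profiles p
    join names n on n.profile_id = p.id
    order by n.surname asc  -- ORDER BY on the outermost SELECT
    offset :rows_from rows fetch next :page_size rows only
    """,
    rows_from=0,
    page_size=2,
)
print(cur.fetchall())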
