pandas dataframe - how to extract particular values inside json object - python

My json looks like below:
json_obj = [{'extracted_value': {'other': 'Not found', 'sound': 'false', 'longterm': 'false', 'physician': 'false'}, 'page_num': '33', 'score': '0.75', 'number': 12223611, 'misc':'true'}]
df=pd.DataFrame(json_obj)[['extracted_value', 'page_num','score','number']]
I am extracting only the columns above. Now I want to ignore the 'other': 'Not found' entry in the extracted_value column and keep the remaining values as before.

You can try df['extracted_value'].apply(remove_other), i.e. apply a function to each value of the extracted_value column.
The complete code will be:
import pandas as pd
json_obj = [{'extracted_value': {'other': 'Not found', 'sound': 'false', 'longterm': 'false', 'physician': 'false'}, 'page_num': '33', 'score': '0.75', 'number': 12223611, 'misc':'true'}]
df=pd.DataFrame(json_obj)[['extracted_value', 'page_num','number']]
def remove_other(my_dict):
    # drop the 'other' entry only when its value is 'Not found' (the comparison is case-sensitive)
    return {k: v for k, v in my_dict.items() if not (k == 'other' and v == 'Not found')}
df['extracted_value']=df['extracted_value'].apply(remove_other)
The result will be:
                                     extracted_value  page_num    number
0  {'sound': 'false', 'longterm': 'false', 'physi...        33  12223611
Additional response:
df['extracted_value'].apply(remove_other) means that each value in the column is passed as the argument to the function. You can put a print(my_dict) statement inside remove_other to visualize it better.
The condition can be simplified by dropping the value check, so that the 'other' key is removed whatever its value:
def remove_other(my_dict):
    # remove the 'other' key regardless of its value
    return {e: my_dict[e] for e in my_dict if e != 'other'}
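As an alternative sketch (not part of the answer above), pd.json_normalize can expand the nested extracted_value dictionary into separate columns, after which the unwanted column can simply be dropped. The variable name flat is only an illustrative choice.

```python
import pandas as pd

json_obj = [{'extracted_value': {'other': 'Not found', 'sound': 'false',
                                 'longterm': 'false', 'physician': 'false'},
             'page_num': '33', 'score': '0.75', 'number': 12223611, 'misc': 'true'}]

# Nested keys become dotted column names such as 'extracted_value.sound'.
flat = pd.json_normalize(json_obj)

# Drop the unwanted nested entry; errors='ignore' tolerates input without it.
flat = flat.drop(columns=['extracted_value.other'], errors='ignore')

print(flat[['extracted_value.sound', 'page_num', 'number']])
```

This trades the single dictionary column for one column per nested key, which may or may not be what you want downstream.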
I would suggest getting familiar with how JSON is structured. In this case you need to go to [0]['coord'][0], so the function will look like:
# Section_Page_start and Section_End_Page
def get_start_and_end(var1):
    my_dict = var1[0]['coord'][0]
    return {ek: my_dict[ek] for ek in my_dict if ek in ['Section_Page_start', 'Section_End_Page']}

Related

Convert an uneven nested dictionary into tabular form

res = {'Head': {'Ide': 'GLE', 'ID': '7b', 'Source': 'CARS', 'Target': 'TULUM', 'Country': 'GL'},
'Load': {'Stat': {'Code': '21', 'Reason': 'invalid'}, 'SrcFilePath': '/path.xls'}}
res is the nested dictionary that needs to be converted into a tabular form.
With the following columns and respective values:
Ide ID Source Target Country Code Reason SrcFilePath
Code:
for col,data in res.items():
    final_data = dict(data.items())
    df = pd.DataFrame(final_data)
    print(df)
Error:
ValueError: If using all scalar values, you must pass an index
You can try:
pd.DataFrame.from_dict(res, orient='index')
You could try using:
pd.json_normalize(res)
The output can look a bit "ugly" because the column names carry the full key path, but it works.
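A minimal sketch of what pd.json_normalize gives for res, plus one possible way to tidy the column names. Renaming to the last path segment is my own assumption about what "less ugly" means here, and it is only safe when no two branches share a leaf key.

```python
import pandas as pd

res = {'Head': {'Ide': 'GLE', 'ID': '7b', 'Source': 'CARS', 'Target': 'TULUM', 'Country': 'GL'},
       'Load': {'Stat': {'Code': '21', 'Reason': 'invalid'}, 'SrcFilePath': '/path.xls'}}

# One row; columns are dotted paths such as 'Head.Ide' and 'Load.Stat.Code'.
flat = pd.json_normalize(res)

# Keep only the last segment of each path ('Load.Stat.Code' -> 'Code').
flat.columns = [name.split('.')[-1] for name in flat.columns]

print(flat)
```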
I assume that res isn't the only record and there's data like:
data = [
{'Head': {'Ide': 'GLE', 'ID': '7b', 'Source': 'CARS', 'Target': 'TULUM', 'Country': 'GL'}, 'Load': {'Stat': {'Code': '21', 'Reason': 'invalid'}, 'SrcFilePath': '/path.xls'}}
, {'Head': {'Ide': 'ABC', 'ID': '8b', 'Source': 'CARS', 'Target': 'TULUM', 'Country': 'AB'}, 'Load': {'Stat': {'Code': '21', 'Reason': 'invalid'}, 'SrcFilePath': '/path.xls'}}
, {'Head': {'Ide': 'EFG', 'ID': '9b', 'Source': 'CARS', 'Target': 'TULUM', 'Country': 'EF'}, 'Load': {'Stat': {'Code': '21', 'Reason': 'invalid'}, 'SrcFilePath': '/path.xls'}}
]
So we have to write a procedure to flatten records and apply it by map to the data before transforming records into a frame:
def flatten_dict(d:dict) -> dict:
    res = {}
    for k, v in d.items():
        if type(v) is dict:
            res.update(flatten_dict(v))
        else:
            res[k] = v
    return res
output = pd.DataFrame(map(flatten_dict, data))
The output:
   Ide  ID Source Target Country Code   Reason SrcFilePath
0  GLE  7b   CARS  TULUM      GL   21  invalid   /path.xls
1  ABC  8b   CARS  TULUM      AB   21  invalid   /path.xls
2  EFG  9b   CARS  TULUM      EF   21  invalid   /path.xls
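One caveat with flattening by leaf key is that two branches using the same key name overwrite each other. A hedged variant that prefixes each key with its path avoids this; flatten_with_prefix and the 'Load.ID' field in the sample record are hypothetical and only serve to show the collision case.

```python
def flatten_with_prefix(d, parent_key='', sep='.'):
    """Flatten a nested dict, joining the key path with sep."""
    flat = {}
    for key, value in d.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            # Recurse, carrying the path so far as the prefix.
            flat.update(flatten_with_prefix(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical record in which 'ID' appears in two branches.
record = {'Head': {'ID': '7b'}, 'Load': {'ID': 'x1', 'Stat': {'Code': '21'}}}
flat_record = flatten_with_prefix(record)
print(flat_record)
```

The result can be passed to pd.DataFrame in the same way as the output of flatten_dict.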

How to get a nested response data from json python

I call an API and this is my JSON response:
{'data': [{
'id': 'd-1225959',
'startTime': '2022-12-30T00:00:00.000Z',
'endTime': '2022-12-30T23:59:00.000Z',
'checkedInAt': None,
'checkedOutAt': None,
'status': 'PENDING',
'space': {
'id': 'd-4063963',
'name': '082',
'type': 'DESK',
'createdAt': '2021-07-06T11:48:57.000Z',
'updatedAt': '2021-07-06T11:48:57.000Z',
'isAvailable': False,
'assignedTo': None,
'locationId': '133778',
'floorId': '41681',
'floorName': 'Car Park',
'neighborhoodId': '92267',
'neighborhoodName': 'NEI1'}}
I'm struggling to extract the 'id' and 'name' values under 'space'. With a nested Python loop like the one below, I only get the keys such as 'id' and 'name', not the values held within.
for order in response['data']:
    print(order['id'])
    print(order['startTime'])
    print(order['endTime'])
    print(order['checkedInAt'])
    print(order['checkedOutAt'])
    print(order['status'])
    print(order['space'])
    for doc in response['space']:
        print(doc['id'], doc['name'])
Any help with this would be much appreciated!
for doc in response['space']: does not do what you expect. In the response shown, 'space' is not a top-level key of response; it sits inside each element of response['data'], so it has to be read from order. Also, iterating over a dict yields its keys, so doc would be a str.
Use doc = order['space'] inside the outer loop and then print(doc['id']), or directly print(order['space']['id']).
Note, you may want to use the dict.get() method to avoid a KeyError.
# if the order dict has no 'space' key, fall back to an empty dict
# if there is no 'id' key, return None
space_id = order.get('space', {}).get('id')
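Putting this together, a minimal sketch of the loop, assuming the response has the shape shown in the question, where 'space' sits inside each element of response['data']. The sample response is trimmed to a few fields and spaces is an illustrative name.

```python
# Trimmed copy of the response from the question.
response = {'data': [{'id': 'd-1225959',
                      'status': 'PENDING',
                      'space': {'id': 'd-4063963', 'name': '082', 'type': 'DESK'}}]}

spaces = []
for order in response.get('data', []):
    # 'space' is a dict, so index it directly instead of looping over it.
    space = order.get('space') or {}
    spaces.append((space.get('id'), space.get('name')))
    print(order['id'], space.get('id'), space.get('name'))
```

Using order.get('space') or {} also covers the case where the key is present but set to None.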

Python - How to flatten JSON message in pandas/JSON

I have a JSON message as below
JSON_MSG = {
'Header': {
'Timestamp': '2020-10-25T02:49:25.489Z',
'ID': '0422',
'msgName': 'Order',
'Source': 'OrderSys'
},
'CustomerOrderLine': [
{
'Parameter': [
{'ParameterID': 'ACTIVATION_DATE', 'ParameterValue': '2020-10-25'},
{'ParameterID': 'CYCLES', 'ParameterValue': '1'},
{'ParameterID': 'EXPIRY_PERIOD', 'ParameterValue': '30'},
{'ParameterID': 'MAX_NUMBER', 'ParameterValue': '1'}
],
'Subscription': {
'Sub': '3020611',
'LoanAcc': '',
'CustomerAcc': '2020002',
'SubscriptionCreatedDate': '2020-06-23T14:42:30Z',
'BillingAcc': '40010101',
'SubscriptionContractTerm': '12',
'ServiceAcc': '11111',
'SubscriptionStatus': 'Active'
},
'PaymentOpt': 'Upfront',
'OneTimeAmt': '8.0',
'RecurringAmt': '0.0',
'BeneficiaryID': '',
'CustomerOrderID': '111',
'OrderLineCreatedDate': '2020-10-25T02:47:18Z',
'ProductOfferingPriceId': 'PP_6GB_Booster',
'ParentCustomerOrderLineID': '',
'OrderLineRequestedDate': '2020-10-25T00:00:00.000Z',
'ProductCategoryId': 'PRODUCT_OFFER',
'OrderLinePurposeName': 'ADD',
'OrderQuantity': '1.0',
'CustomerOrderLineID': '11111',
'OrderLineDeliveryAddress': {
'OrderLineDeliveryPostCode': '',
'OrderLineDeliveryTown': '',
'OrderLineDeliveryCounty': '',
'OrderLineDeliveryCountryName': ''
},
'ProductInstanceID': '95',
'ProductOfferingId': 'OFF_6GBBOOST_MONTHLY'
}
]
}
I need to flatten the JSON message into rows and capture the row (record) count.
Alternatively, I need to find out how many elements are present under the nested array Parameter, since that would give the same result as the flattened JSON (Parameter is the innermost array).
So far I have tried the code below:
data = json.loads(JSON_MSG)
list1 = data['CustomerOrderLine']
rec_count = len(list1)
But this code gives only the outer list's length, i.e. 1, because CustomerOrderLine contains only one struct.
I need the record/row count to be 4 (the Parameter array has 4 structs).
Not the prettiest one, but you could try something like:
list1 = JSON_MSG['CustomerOrderLine'][0]['Parameter']
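To get the 'Parameter' sizes of all elements, you can use a list comprehension. Call json.loads only if JSON_MSG is a JSON string; if it is already a dict, as shown in the question, use it directly:
data = json.loads(JSON_MSG) if isinstance(JSON_MSG, str) else JSON_MSG
sizes = [len(order.get('Parameter', [])) for order in data.get('CustomerOrderLine', [])]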
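If the flattened rows themselves are needed and not only the count, one possible sketch is to build one row per Parameter entry and hand the rows to pandas. The message below is a trimmed copy of the one in the question; msg, rows and params are illustrative names, and carrying CustomerOrderLineID onto each row is my own choice.

```python
import pandas as pd

# Trimmed copy of the message from the question (already a dict, not a string).
msg = {
    'Header': {'ID': '0422', 'msgName': 'Order'},
    'CustomerOrderLine': [{
        'CustomerOrderLineID': '11111',
        'Parameter': [
            {'ParameterID': 'ACTIVATION_DATE', 'ParameterValue': '2020-10-25'},
            {'ParameterID': 'CYCLES', 'ParameterValue': '1'},
            {'ParameterID': 'EXPIRY_PERIOD', 'ParameterValue': '30'},
            {'ParameterID': 'MAX_NUMBER', 'ParameterValue': '1'},
        ],
    }],
}

# One row per Parameter entry, tagged with the order line it belongs to.
rows = [{'CustomerOrderLineID': line.get('CustomerOrderLineID'), **param}
        for line in msg.get('CustomerOrderLine', [])
        for param in line.get('Parameter', [])]

params = pd.DataFrame(rows)
row_count = len(params)
print(params)
print(row_count)
```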

How to read and convert this json into a DF?

I want to convert this nested JSON into a DataFrame.
I tried different functions but none of them works correctly.
The encoding that worked for me was:
encoding = "utf-8-sig"
[{'replayableActionOperationState': 'SKIPPED',
'replayableActionOperationGuid': 'RAO_1037351',
'failedMessage': 'Cannot replay action: RAO_1037351: com.ebay.sd.catedor.core.model.DTOEntityPropertyChange; local class incompatible: stream classdesc serialVersionUID = 7777212484705611612, local class serialVersionUID = -1785129380151507142',
'userMessage': 'Skip all mode',
'username': 'gfannon',
'sourceAuditData': [{'guid': '24696601-b73e-43e4-bce9-28bc741ac117',
'operationName': 'UPDATE_CATEGORY_ATTRIBUTE_PROPERTY',
'creationTimestamp': 1563439725240,
'auditCanvasInfo': {'id': '165059', 'name': '165059'},
'auditUserInfo': {'id': 1, 'name': 'gfannon'},
'externalId': None,
'comment': None,
'transactionId': '0f135909-66a7-46b1-98f6-baf1608ffd6a',
'data': {'entity': {'guid': 'CA_2511202',
'tagType': 'BOTH',
'description': None,
'name': 'Number of Shelves'},
'propertyChanges': [{'propertyName': 'EntityProperty',
'oldEntity': {'guid': 'CAP_35',
'name': 'DisableAsVariant',
'group': None,
'action': 'SET',
'value': 'true',
'tagType': 'SELLER'},
'newEntity': {'guid': 'CAP_35',
'name': 'DisableAsVariant',
'group': None,
'action': 'SET',
'value': 'false',
'tagType': 'SELLER'}}],
'entityChanges': None,
'primary': True}}],
'targetAuditData': None,
'conflictedGuids': None,
'fatal': False}]
This is what I have tried so far. There were more attempts, but this is the closest I got.
with open(r"Desktop\Ann's json parsing\report.tsv", encoding='utf-8-sig') as data_file:
    data = json.load(data_file)
df = json_normalize(data)
print(df)
pd.DataFrame(df)  # the nested lists are shown as a whole column; I'm trying to parse those columns: 'failedMessage' and 'sourceAuditData'
I also tried json.loads/json(df) but the output isn't correct.
pd.DataFrame.from_dict(a['sourceAuditData'][0]['data']['propertyChanges'][0])  # this line retrieves one of the outputs I need, but I don't know how to apply it to the whole file
The expected result should be a csv/xlsx file with a column and value for each row.
For your particular example:
def unroll_dict(d):
    data = []
    for k, v in d.items():
        if isinstance(v, list):
            data.append((k, ''))
            data.extend(unroll_dict(v[0]))
        elif isinstance(v, dict):
            data.append((k, ''))
            data.extend(unroll_dict(v))
        else:
            data.append((k,v))
    return data
And given the data in your question is stored in the variable example:
df = pd.DataFrame(unroll_dict(example[0])).set_index(0).transpose()
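For applying this to every record in the file, a different approach that may be worth trying is pd.json_normalize with record_path and meta, which yields one row per entry of sourceAuditData. This is a sketch under assumptions: records is a trimmed copy of the data in the question, the choice of meta columns is mine, and records whose sourceAuditData is None would need to be filtered out first.

```python
import pandas as pd

# Trimmed copy of one record from the question.
records = [{
    'replayableActionOperationGuid': 'RAO_1037351',
    'username': 'gfannon',
    'sourceAuditData': [{
        'guid': '24696601-b73e-43e4-bce9-28bc741ac117',
        'operationName': 'UPDATE_CATEGORY_ATTRIBUTE_PROPERTY',
        'data': {
            'entity': {'guid': 'CA_2511202', 'name': 'Number of Shelves'},
            'propertyChanges': [{
                'propertyName': 'EntityProperty',
                'oldEntity': {'guid': 'CAP_35', 'value': 'true'},
                'newEntity': {'guid': 'CAP_35', 'value': 'false'},
            }],
        },
    }],
}]

# One row per sourceAuditData entry; nested dicts become dotted columns,
# and the meta fields are repeated from the parent record.
audit = pd.json_normalize(
    records,
    record_path='sourceAuditData',
    meta=['replayableActionOperationGuid', 'username'],
)

print(audit.columns.tolist())
```

Lists nested further down, such as data.propertyChanges, stay as list-valued cells and would need a second pass; the resulting frame can then be written out with to_csv or to_excel.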

Accessing nested lists/dictionaries

I am extracting data from ebay using ebay-sdkpython. The end goal is to have an XLS dump of items with specific fields for each item. The output that I want to manipulate comes as a complex nested list/dict as below:
dict = {'ack': 'Success',
'timestamp': '2015-11-20T03:49:08.302Z',
'version': '1.13.0',
'searchResult':
{'item': [
{'itemId': '111827613927',
'topRatedListing': 'false',
'sellingStatus':
{'currentPrice':
{'_currencyId': 'AUD',
'value': '290.0'},
'convertedCurrentPrice': {'_currencyId': 'AUD', 'value': '290.0'},
'sellingState': 'EndedWithSales'},
'listingInfo':
{'listingType': 'FixedPrice',
'endTime': '2015-11-20T03:35:07.000Z'}
}],
'_count': '100'},
'paginationOutput':
{'totalPages': '114',
'entriesPerPage': '100',
'pageNumber': '1',
'totalEntries': '11364'}}
I know that I can pull a specific key out using the following format:
print dict['searchResult']['item'][0]['itemId']
But I want to pull all of them, with the varying levels of nesting. I also need error handling as some fields are missing, these should return a ''. My code so far is:
x=1
for item in response.dict()['searchResult']['item']:
    print x,
    for field in ['itemId','topRatedListing']:
        try:
            print item[field]
        except KeyError:
            print ''
    x=x+1
    print '\n'
How would I modify this for loop to also include, for example, ['sellingStatus']['currentPrice']['value']?
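One possible approach, sketched here for Python 3 with print as a function, is to describe each field as a path of keys and walk the path with a small helper that returns '' as soon as a key is missing. The helper name dig, the fields list and the second sample item are hypothetical; the key names use the camelCase spelling from the sample data.

```python
def dig(mapping, path, default=''):
    """Follow a sequence of keys through nested dicts, returning default if any is missing."""
    current = mapping
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

items = [
    {'itemId': '111827613927',
     'topRatedListing': 'false',
     'sellingStatus': {'currentPrice': {'_currencyId': 'AUD', 'value': '290.0'}}},
    {'itemId': '000000000000'},  # hypothetical item with missing fields
]

# Each field is a tuple of keys, so flat and nested fields are handled alike.
fields = [('itemId',), ('topRatedListing',), ('sellingStatus', 'currentPrice', 'value')]

rows = [[dig(item, path) for path in fields] for item in items]
for number, row in enumerate(rows, start=1):
    print(number, *row)
```

The rows list has one entry per item and one value per field, which is a convenient shape for writing to a spreadsheet later.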
