Pulling info from JSON data - Python

Problem:
I am having trouble pulling some info off UPS: the city and state in the shipToAddress section.
Below is the data, in an easy-to-read format, that I am pulling from the UPS website with requests:
Data:
data = {
    'statusCode': '200',
    'statusText': 'Successful',
    'isLoggedInUser': False,
    'trackedDateTime': '04/16/2019 1:33 P.M. EST',
    'isBcdnMultiView': False,
    'trackDetails': [{
        'errorCode': None,
        'errorText': None,
        'requestedTrackingNumber': '1Z3774E8YN99957400',
        'trackingNumber': '1Z3774E8YN99957400',
        'isMobileDevice': False,
        'packageStatus': 'Loaded on Delivery Vehicle',
        'packageStatusType': 'I',
        'packageStatusCode': '072',
        'progressBarType': 'InTransit',
        'progressBarPercentage': '90',
        'simplifiedText': '',
        'scheduledDeliveryDayCMSKey': 'cms.stapp.tue',
        'scheduledDeliveryDate': '04/16/2019',
        'noEstimatedDeliveryDateLabel': None,
        'scheduledDeliveryTime': 'cms.stapp.eod',
        'scheduledDeliveryTimeEODLabel': 'cms.stapp.eod',
        'packageCommitedTime': '',
        'endOfDayResCMSKey': None,
        'deliveredDayCMSKey': '',
        'deliveredDate': '',
        'deliveredTime': '',
        'receivedBy': '',
        'leaveAt': None,
        'leftAt': '',
        'shipToAddress': {
            'streetAddress1': '',
            'streetAddress2': '',
            'streetAddress3': '',
            'city': 'OCEAN',
            'state': 'NJ',
            'province': None,
            'country': 'US',
            'zipCode': '',
            'companyName': '',
            'attentionName': '',
            'isAddressCorrected': False,
            'isReturnAddress': False,
            'isHoldAddress': False,
        }}]}
Code:
data = response.text
addressinfo = json.loads(data)['trackDetails']['shipToAddress']
for entry in addressinfo:
    city = entry['city']
    state = entry['state']
    country = entry['country']
My Expected Results:
city = 'Ocean'
state = 'NJ'
etc
This is the error:
addressinfo =json.loads(data2)['trackDetails']['shipToAddress']
TypeError: list indices must be integers or slices, not str

Note the format of your JSON:
'trackDetails': [{
...
'shipToAddress': {...}
}]
The dict you're trying to index into is actually contained inside of a list (note the square brackets). The proper way to access the shipToAddress field would be to do this:
addressinfo = json.loads(data2)['trackDetails'][0]['shipToAddress']
^^^
instead of what you were doing.

Instead of data = response.text, use data = response.json(), since the response body is JSON. That gives you the parsed data directly as Python objects; converting the response to a string with .text and then loading it back with json.loads is unnecessary.
Then access city:
city = data['trackDetails'][0]['shipToAddress']['city']
state = data['trackDetails'][0]['shipToAddress']['state']
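Putting both answers together, a minimal sketch, assuming response is the requests response object you already have:
data = response.json()                              # parse the JSON body directly, no json.loads needed
ship_to = data['trackDetails'][0]['shipToAddress']  # trackDetails is a list, so take element [0]
city = ship_to['city']        # 'OCEAN'
state = ship_to['state']      # 'NJ'
country = ship_to['country']  # 'US'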

Related

How to get data from python

data = obj.generateSession("P78013","Ujhdy#2")
print(data)
The result is printed in the following format:
{'status': True, 'message': 'SUCCESS', 'errorcode': '', 'data':
{'clientcode': 'K98913', 'name': 'HPP', 'email': '',
'mobileno': '', 'exchanges': ['bse_cm', 'cde_fo', 'mcx_fo', 'ncx_fo',
'nse_cm', 'nse_fo'], 'products': ['CNC', 'NRML', 'MARGIN', 'MIS',
'BO', 'CO'], 'lastlogintime': '', 'broker': '', 'jwtToken': 'Bearer
eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI6Iko4ODkxMyIsInJvbGVzIjowLCJ1c2VydHlwZSI6IlVTRVIiLCJpYXQiOjE2NTU3NTAxNDksImV4cCI6MTc0MjE1MDE0OX0.P1Ne0T0lTgScZJ1udMYRaJ32WeNDB-bZIwMg4uSAGC4RDFnYRsdvXGRyIEx7KS1LpQ6ndRIt7UjoyIewCs7HLA',
'refreshToken':
'eyJhbGciOiJIUzUxMiJ9.eyJ0b2tlbiI6IlJFRlJFU0gtVE9LRU4iLCJpYXQiOjE2NTU3NTAxNDl9.9DM1ggWfaervPe3qCpoDywfdb8kJ6okQrqZeR_mjsbGliqM7w0DdRyxTHyB7m-742Sfj9tVsZ4qQrOK0RQ9TmQ'}}
I am trying to extract the 'jwtToken' value as a string, like below:
jwtToken='Bearer eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI.....'
Here is one way to extract it with a regular expression:
re.findall("(jwtToken).?:(.*)\'\,",s)[0]
('jwtToken',
" 'Bearer eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI6Iko4ODkxMyIsInJvbGVzIjowLCJ1c2VydHlwZSI6IlVTRVIiLCJpYXQiOjE2NTU3NTAxNDksImV4cCI6MTc0MjE1MDE0OX0.P1Ne0T0lTgScZJ1udMYRaJ32WeNDB-bZIwMg4uSAGC4RDFnYRsdvXGRyIEx7KS1LpQ6ndRIt7UjoyIewCs7HLA")
JWT tokens are just base64-encoded JSON:
import base64

token = "Bearer eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI6Iko4ODkxMyIsInJvbGVzIjowLCJ1c2VydHlwZSI6IlVTRVIiLCJpYXQiOjE2NTU3NTAxNDksImV4cCI6MTc0MjE1MDE0OX0.P1Ne0T0lTgScZJ1udMYRaJ32WeNDB-bZIwMg4uSAGC4RDFnYRsdvXGRyIEx7KS1LpQ6ndRIt7UjoyIewCs7HLA".split(" ")[1]
b64tok = token.split(".")[1]                # the payload is the second dot-separated segment
b64tok += "=" * (-len(b64tok) % 4)          # restore the padding that JWTs strip off
print(base64.urlsafe_b64decode(b64tok))     # decoded payload bytes (JSON)

DuckDuckGo API not responding when space in encoded url query

I kind of have two real questions. Both relate to this code:
import urllib.parse
import requests

def query(q):
    base_url = "https://api.duckduckgo.com/?q={}&format=json"
    resp = requests.get(base_url.format(urllib.parse.quote(q)))
    json = resp.json()
    return json
The first question: when I query something like "US Presidents", I get back something like this:
{'Abstract': '', 'AbstractSource': '', 'AbstractText': '', 'AbstractURL': '', 'Answer': '', 'AnswerType': '', 'Definition': '', 'DefinitionSource': '', 'DefinitionURL': '', 'Entity': '', 'Heading': '', 'Image': '', 'ImageHeight': '', 'ImageIsLogo': '', 'ImageWidth': '', 'Infobox': '', 'Redirect': '', 'RelatedTopics': [], 'Results': [], 'Type': '', 'meta': {'attribution': None, 'blockgroup': None, 'created_date': '2021-03-24', 'description': 'testing', 'designer': None, 'dev_date': '2021-03-24', 'dev_milestone': 'development', 'developer': [{'name': 'zt', 'type': 'duck.co', 'url': 'https://duck.co/user/zt'}], 'example_query': '', 'id': 'just_another_test', 'is_stackexchange': 0, 'js_callback_name': 'another_test', 'live_date': None, 'maintainer': {'github': ''}, 'name': 'Just Another Test', 'perl_module': 'DDG::Lontail::AnotherTest', 'producer': None, 'production_state': 'offline', 'repo': 'fathead', 'signal_from': 'just_another_test', 'src_domain': 'how about there', 'src_id': None, 'src_name': 'hi there', 'src_options': {'directory': '', 'is_fanon': 0, 'is_mediawiki': 0, 'is_wikipedia': 0, 'language': '', 'min_abstract_length': None, 'skip_abstract': 0, 'skip_abstract_paren': 0, 'skip_icon': 0, 'skip_image_name': 0, 'skip_qr': '', 'src_info': '', 'src_skip': ''}, 'src_url': 'Hello there', 'status': None, 'tab': 'is this source', 'topic': [], 'unsafe': None}}
Basically, everything is empty. Even the Heading key, which I know was sent as "US Presidents" encoded into url form. This issue seems to affect all queries I send with a space in them. Even when I go to this url: "https://api.duckduckgo.com/?q=US%20Presidents&format=json&pretty=1" in a browser, all I get is a bunch of blank json keys.
My next question is this. When I send in something like this: "1+1", the json response's "Answer" key is this:
{'from': 'calculator', 'id': 'calculator', 'name': 'Calculator', 'result': '', 'signal': 'high', 'templates': {'group': 'base', 'options': {'content': 'DDH.calculator.content'}}}
Everything else seems to be correct, but the 'result' should be '2', should it not? The entire rest of the json seems to be correct, including all 'RelatedTopics'
Any help with this would be greatly appreciated.
Basically, the DuckDuckGo API is not a real search engine; it is more like a dictionary lookup. So try US%20President instead of US%20Presidents and you'll get an answer. For encoding you can use %20 for spaces, but if the query isn't a fixed term I would prefer the plus sign, which you get with urllib.parse.quote_plus().
About the calculation you're right. But I see no real use case for calling a calculator API from Python code; it is like taking a trampoline to the moon when a rocket is available. Maybe they see it the same way and simply don't offer calculator results in their API.
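A sketch of that suggestion applied to the query() helper from the question, with quote_plus as the only change:
import urllib.parse
import requests

def query(q):
    base_url = "https://api.duckduckgo.com/?q={}&format=json"
    # quote_plus encodes spaces as '+' rather than '%20'
    resp = requests.get(base_url.format(urllib.parse.quote_plus(q)))
    return resp.json()

print(query("US President").get("AbstractText"))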

Scrapy: Extract Dictionary Stored as Text in Script Tag

Subject: Extract Dictionary Stored in Script Tag.
Hello,
I am trying to scrape this data from a script tag on the page.
The goal is to be able to extract the data dictionary and get the values for each of the key value pairs.
Example:
print(digitalData['page']['pageInfo']['language'])
>>> en
I have written the code below, and everything works until I get to step 3, where I try to convert the string to a dictionary using the ast module.
I get the following error message:
ValueError: malformed node or string: <_ast.Name object at 0x00000238C9100B48>
Scrapy Code
import scrapy
import re
import pprint
import ast

class OptumSpider(scrapy.Spider):
    name = 'optum'
    allowed_domains = ['optum.com']
    start_urls = ['http://optum.com/']

    def parse(self, response):
        # Step 1: remove all spaces, new lines and tabs
        #string = (re.sub('\s+','',response.xpath('//script/text()')[2].extract()))
        string = (re.sub('[\s+;]','',response.xpath('//script/text()')[2].extract()))
        print(string)
        # Step 2: convert string to dictionary. Creates a key as "digitalData"
        key, value = string.split("=")
        my_dict = {key: value}
        print(my_dict)
        # Step 3: extract dictionary
        print(my_dict['digitalData'])        # result is a dictionary stored as a string.
        print(type(my_dict['digitalData']))  # shows data type as string.
        #ast.literal_eval(my_dict['digitalData'])  # convert string to dictionary.
Uncommenting the ast.literal_eval call in step 3 raises the ValueError shown above.
Please advise on how to solve this. If there is another way to approach or solve it, please suggest that as well.
Your issue is that the extracted JavaScript object contains values that are expressions, not literals:
{
    page: {
        pageInfo: {
            destinationURL: window.location.href,
            error: '',
            language: 'en',
            country: 'US',
            pageName: 'tangelo2',
            articlepubdate: '',
            articleenddate: '',
            pageTitle: 'HealthServicesInnovationCompany',
            pageOwner: '',
            pageTemplate: '',
            pageCampaign: '',
            tags: '',
            pageLastPublishDate: '2020-01-08T12:15:04.032-06:00',
            pageLastPublishedBy: 'admin',
            pageLastModifiedDate: '2020-01-08T10:24:36.466-06:00',
            pageLastModifiedBy: 'katrina'
        },
        recEngine: {
            title: 'Home',
            image: '',
            description: ''
        },
        category: {
            siteName: window.location.hostname.replace("www.", ""),
            version: '3.0',
            contentType: '',
            contentTopic: '',
            contentSegment: '',
            contentInitiative: '',
            contentProduct: '',
            contentProductLine: '',
            primaryCategory: 'tangelo2'
        }
    },
    event: {}
}
Notice the page.pageInfo.destinationURL as well as the page.category.siteName values.
What is going on is that ast.literal_eval (or any other literal parser you might use to convert this JavaScript object to Python) will fail when it hits those values. You'll need to remove the window... expressions from my_dict['digitalData'] before processing it via ast, demjson, or any other tool.
A possible solution is the one below, using demjson as opposed to ast.
import scrapy
import pprint
import demjson
import re

class OptumSpider(scrapy.Spider):
    name = 'optum'
    allowed_domains = ['optum.com']
    start_urls = ['http://optum.com/']

    def parse(self, response):
        # Step 1: remove all spaces, new lines and tabs
        string = (re.sub('[\s+;]','',response.xpath('//script/text()')[2].extract()))
        # Step 2: convert string to dictionary. Creates a key as "digitalData"
        js_dict = string.split("=")[1]
        js_dict = re.sub(r"\bwindow(.*?),\b", '"",', js_dict)
        # Step 3: extract dictionary
        my_dict = demjson.decode(js_dict)
        pprint.pprint(my_dict)
        print(type(my_dict))
On running
scrapy runspider test.py -s LOG_ENABLED=False
It outputs:
{'event': {},
'page': {'category': {'contentInitiative': '',
'contentProduct': '',
'contentProductLine': '',
'contentSegment': '',
'contentTopic': '',
'contentType': '',
'primaryCategory': 'tangelo2',
'siteName': '',
'version': '3.0'},
'pageInfo': {'articleenddate': '',
'articlepubdate': '',
'country': 'US',
'destinationURL': '',
'error': '',
'language': 'en',
'pageCampaign': '',
'pageLastModifiedBy': 'katrina',
'pageLastModifiedDate': '2020-01-08T10:24:36.466-06:00',
'pageLastPublishDate': '2020-01-08T12:15:04.032-06:00',
'pageLastPublishedBy': 'admin',
'pageName': 'tangelo2',
'pageOwner': '',
'pageTemplate': '',
'pageTitle': 'HealthServicesInnovationCompany',
'tags': ''},
'recEngine': {'description': '', 'image': '', 'title': 'Home'}}}
<class 'dict'>
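With my_dict decoded, the lookup from the original goal is now a plain dictionary access:
print(my_dict['page']['pageInfo']['language'])
# en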

Accessing keys/values in a paginated/nested dictionary

I know that somewhat related questions have been asked here: Accessing key, value in a nested dictionary and here: python accessing elements in a dictionary inside dictionary among other places but I can't quite seem to apply the answers' methodology to my issue.
I'm getting a KeyError trying to access the keys within response_dict, which I know is due to it being nested/paginated and me going about this the wrong way. Can anybody help and/or point me in the right direction?
import requests
import json
URL = "https://api.constantcontact.com/v2/contacts?status=ALL&limit=1&api_key=<redacted>&access_token=<redacted>"
#make my request, store it in the requests object 'r'
r = requests.get(url = URL)
#status code to prove things are working
print (r.status_code)
#print what was retrieved from the API
print (r.text)
#visual aid
print ('---------------------------')
#decode json data to a dict
response_dict = json.loads(r.text)
#show how the API response looks now
print(response_dict)
#just for confirmation
print (type(response_dict))
print('-------------------------')
# HERE LIES THE ISSUE
print(response_dict['first_name'])
And my output:
200
{"meta":{"pagination":{}},"results":[{"id":"1329683950","status":"ACTIVE","fax":"","addresses":[{"id":"4e19e250-b5d9-11e8-9849-d4ae5275509e","line1":"222 Fake St.","line2":"","line3":"","city":"Kansas City","address_type":"BUSINESS","state_code":"","state":"OK","country_code":"ve","postal_code":"19512","sub_postal_code":""}],"notes":[],"confirmed":false,"lists":[{"id":"1733488365","status":"ACTIVE"}],"source":"Site Owner","email_addresses":[{"id":"1fe198a0-b5d5-11e8-92c1-d4ae526edd6c","status":"ACTIVE","confirm_status":"NO_CONFIRMATION_REQUIRED","opt_in_source":"ACTION_BY_OWNER","opt_in_date":"2018-09-11T18:18:20.000Z","email_address":"rsmith#fake.com"}],"prefix_name":"","first_name":"Robert","middle_name":"","last_name":"Smith","job_title":"I.T.","company_name":"FBI","home_phone":"","work_phone":"5555555555","cell_phone":"","custom_fields":[],"created_date":"2018-09-11T15:12:40.000Z","modified_date":"2018-09-11T18:18:20.000Z","source_details":""}]}
---------------------------
{'meta': {'pagination': {}}, 'results': [{'id': '1329683950', 'status': 'ACTIVE', 'fax': '', 'addresses': [{'id': '4e19e250-b5d9-11e8-9849-d4ae5275509e', 'line1': '222 Fake St.', 'line2': '', 'line3': '', 'city': 'Kansas City', 'address_type': 'BUSINESS', 'state_code': '', 'state': 'OK', 'country_code': 've', 'postal_code': '19512', 'sub_postal_code': ''}], 'notes': [], 'confirmed': False, 'lists': [{'id': '1733488365', 'status': 'ACTIVE'}], 'source': 'Site Owner', 'email_addresses': [{'id': '1fe198a0-b5d5-11e8-92c1-d4ae526edd6c', 'status': 'ACTIVE', 'confirm_status': 'NO_CONFIRMATION_REQUIRED', 'opt_in_source': 'ACTION_BY_OWNER', 'opt_in_date': '2018-09-11T18:18:20.000Z', 'email_address': 'rsmith#fake.com'}], 'prefix_name': '', 'first_name': 'Robert', 'middle_name': '', 'last_name': 'Smith', 'job_title': 'I.T.', 'company_name': 'FBI', 'home_phone': '', 'work_phone': '5555555555', 'cell_phone': '', 'custom_fields': [], 'created_date': '2018-09-11T15:12:40.000Z', 'modified_date': '2018-09-11T18:18:20.000Z', 'source_details': ''}]}
<class 'dict'>
-------------------------
Traceback (most recent call last):
File "C:\Users\rkiek\Desktop\Python WIP\Chris2.py", line 20, in <module>
print(response_dict['first_name'])
KeyError: 'first_name'
first_name = response_dict["results"][0]["first_name"]
Even though I think this question would be better answered by reading some documentation, I will explain what is going on here. The dict for the man named "Robert" is inside a list, which is the value under the key "results". So first you need to access the value under "results", which is a Python list.
Then you can loop through the elements of that list and treat each individual element as a regular dictionary object.
results = response_dict["results"]
results = response_dict.get("results", None)
# use either one of the two lines above: the first will throw a KeyError if there
# is no "results" key, the other will return None instead
# results is now a list, according to the data you mentioned
for item in results:
    # loop through the list of dictionaries and treat each item as a normal dictionary
    print(item.get("first_name", None))
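The same indexing pattern reaches deeper nesting too, for example the city of the first address in that record:
city = response_dict["results"][0]["addresses"][0]["city"]   # 'Kansas City'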

Schema not validating

I'm working with a JSON schema and I'm trying to use the Python jsonschema module to validate some JSON I output against the schema.
I get the following error, indicating that the Schema itself is not validating:
validation
Traceback (most recent call last):
File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/gc_aicep-395698294.764.py", line 814, in <module>
validate(entry,gc_schema)
File "/Library/Python/2.7/site-packages/jsonschema/validators.py", line 468, in validate
cls(schema, *args, **kwargs).validate(instance)
File "/Library/Python/2.7/site-packages/jsonschema/validators.py", line 117, in validate
raise error
jsonschema.exceptions.ValidationError: ({'website': 'www.stepUp.com', 'bio': '', 'accreditations': {'portfolio': '', 'certifications': [], 'degrees': {'degree_name': [], 'major': '', 'institution_name': '', 'graduation_distinction': '', 'institution_website': ''}}, 'description': 'A great counselor', 'photo': '', 'twitter': '', 'additionaltext': '', 'linkedin': '', 'price': {'costtype': [], 'costrange': []}, 'phone': {'phonetype': [], 'value': '1234567891'}, 'facebook': '', 'counselingtype': [], 'logourl': '', 'counselingoptions': [], 'linkurl': '', 'name': {'first_name': u'Rob', 'last_name': u'Er', 'middle_name': u'', 'title': u''}, 'email': {'emailtype': [], 'value': ''}, 'languages': 'english', 'datasource': {'additionaltext': '', 'linkurl': '', 'linktext': '', 'logourl': ''}, 'linktext': '', 'special_needs_offer': '', 'company': 'Step Up', 'location': {'city': 'New York', 'zip': '10011', 'locationtype': '', 'state': 'NY', 'address': '123 Road Dr', 'loc_name': '', 'country': 'united states', 'geo': ['', '']}},) is not of type 'object'
The ValidationError message indicates that what follows the colon is not a valid JSON object, I think, but I can't figure out why it would not be.
This JSON validates in a validator like JSONLint if you replace the single quotation marks with double quotes and remove the enclosing parentheses.
The 'u' before the name has been flagged as a possible bug.
This is the code which outputs the name:
name = HumanName(row['name'])
first_name = name.first
middle_name = name.middle
last_name = name.last
title = name.title
full_name = dict(first_name=first_name, middle_name=middle_name, last_name=last_name, title=title)
name is inserted into the JSON using the following:
gc_ieca = dict(
    name = full_name,
    twitter = twitter,
    logourl = logourl,
    linktext = linktext,
    linkurl = linkurl,
    additionaltext = additionaltext,
    datasource = datasource,
    phone = phone,
    email = email,
    price = price,
    languages = languages,
    special_needs_offer = special_needs_offer,
    # location
    location = location,
    accreditations = accreditations,
    website = website
),
That isn't what a ValidationError indicates. It indicates that validation failed :), not that the JSON is invalid (jsonschema doesn't even deal with JSON, it deals with deserialized JSON i.e. Python objects, here a dict). If the JSON were invalid you'd get an error when you called json.load.
The reason it's failing is that the value you're validating is in fact not an object: it's a tuple with a single element (the object), so it is indeed invalid. Why it is a tuple is a bug in your code: you have a stray comma after the closing parenthesis of dict(...).
(And FYI, the u prefixes are because those are unicode literals, and the single quotes are because that's the repr of str, nothing to do with JSON).
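In other words, the fix is simply to drop that trailing comma so gc_ieca stays a plain dict, which is what the schema expects (only the last line changes):
gc_ieca = dict(
    name = full_name,
    twitter = twitter,
    # ... remaining fields unchanged ...
    website = website
)  # no comma after the closing parenthesis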
I see two potential issues here:
1. Use of single quotes. Strictly speaking, the JSON spec calls for double quotes around strings. Your final note sort of implies this is not your issue, but it is worth mentioning as something to check if fixing #2 doesn't resolve the problem.
2. Values for name: these are listed as u'...', which is not valid JSON. In JSON a u escape must be written as \u followed by 4 hex digits, inside the double quotes surrounding the string.
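For reference, the name dict serialized to actual JSON (double quotes, no u prefixes) would look like this:
import json
print(json.dumps(full_name, sort_keys=True))
# {"first_name": "Rob", "last_name": "Er", "middle_name": "", "title": ""}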
