Glassdoor API Not Printing Custom Response - python

I have the following problem when I try to print something from this API. I want to access different fields in the response and print specific items from it, but when I print soup it gives me the entire API response in JSON format instead.
import requests, json, urlparse, urllib2
from BeautifulSoup import BeautifulSoup
url = "apiofsomesort"
#Create Dict based on JSON response; request the URL and parse the JSON
#response = requests.get(url)
#response.raise_for_status() # raise exception if invalid response
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response)
print soup
When it prints it looks like the below:
{
    "success": true,
    "status": "OK",
    "jsessionid": "0541E6136E5A2D5B2A1DF1F0BFF66D03",
    "response": {
        "attributionURL": "http://www.glassdoor.com/Reviews/airbnb-reviews-SRCH_KE0,6.htm",
        "currentPageNumber": 1,
        "totalNumberOfPages": 1,
        "totalRecordCount": 1,
        "employers": [{
            "id": 391850,
            "name": "Airbnb",
            "website": "www.airbnb.com",
            "isEEP": true,
            "exactMatch": true,
            "industry": "Hotels, Motels, & Resorts",
            "numberOfRatings": 416,
            "squareLogo": "https://media.glassdoor.com/sqll/391850/airbnb-squarelogo-1459271200583.png",
            "overallRating": 4.3,
            "ratingDescription": "Very Satisfied",
            "cultureAndValuesRating": "4.4",
            "seniorLeadershipRating": "4.0",
            "compensationAndBenefitsRating": "4.3",
            "careerOpportunitiesRating": "4.1",
            "workLifeBalanceRating": "3.9",
            "recommendToFriendRating": "0.9",
            "sectorId": 10025,
            "sectorName": "Travel & Tourism",
            "industryId": 200140,
            "industryName": "Hotels, Motels, & Resorts",
            "featuredReview": {
                "attributionURL": "http://www.glassdoor.com/Reviews/Employee-Review-Airbnb-RVW12111314.htm",
                "id": 12111314,
                "currentJob": false,
                "reviewDateTime": "2016-09-28 16:44:00.083",
                "jobTitle": "Employee",
                "location": "",
                "headline": "An amazing place to work!",
                "pros": "Wonderful people and great culture. Airbnb really strives to make you feel at home as an employee, and everyone is genuinely excited about the company mission.",
                "cons": "The limitations of Rails 3 and the company infrastructure make developing difficult sometimes.",
                "overall": 5,
                "overallNumeric": 5
            },
            "ceo": {
                "name": "Brian Chesky",
                "title": "CEO & Co-Founder",
                "numberOfRatings": 306,
                "pctApprove": 95,
                "pctDisapprove": 5,
                "image": {
                    "src": "https://media.glassdoor.com/people/sqll/391850/airbnb-brian-chesky.png",
                    "height": 200,
                    "width": 200
                }
            }
        }]
    }
}
I want to print out specific items, like the employers' "name", "industry", etc.

You can load the JSON response into a dict then look for the values you want like you would in any other dict.
I took your data and saved it in an external JSON file to do a test since I don't have access to the API. This worked for me.
import json

# Load JSON from external file
with open(r'C:\Temp\json\data.json') as json_file:
    data = json.load(json_file)

# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']
Since you're getting your data from an API something like this should work.
import json
import urllib2
url = "apiofsomesort"
# Load JSON from API
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.loads(response.read())
# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']

import json, urllib2
url = "http..."
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.loads(response.read())
# Print the values
print 'numberOfRatings:', data['response']['employers'][0]['numberOfRatings']
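Note that urllib2 exists only on Python 2 (as do the print statements above). On Python 3, the nested dict lookups work unchanged; the sketch below parses an inline copy of the response instead of hitting the API, so it runs without a network connection (with a live URL you would use requests.get(url, headers=hdr).json() instead of json.loads).

```python
import json

# Inline sample standing in for the API response, trimmed to the
# fields accessed below
raw = '''{
    "success": true,
    "response": {
        "employers": [{
            "id": 391850,
            "name": "Airbnb",
            "industry": "Hotels, Motels, & Resorts",
            "numberOfRatings": 416
        }]
    }
}'''

data = json.loads(raw)
employer = data['response']['employers'][0]
print('Name:', employer['name'])          # Name: Airbnb
print('Industry:', employer['industry'])  # Industry: Hotels, Motels, & Resorts
```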

Related

Execute js function in HTML page scraped by python to get json data

I have a website with products: https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01 When I inspect the HTML page I see they have all the info in JSON format in a script tag under
window.INITIAL_DATA = JSON.parse('{"pa...')
I tried to scrape the HTML with requests and get the JSON string with a regex; however, my code somehow changes the JSON structure and I cannot load it with json.loads()
import re, json
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
regex = r"JSON.parse\(.*;"
match = re.search(regex, str(soup))
json_string = match.group(0).replace("JSON.parse(", "")[1:-3]
json_data = json.loads(json_string)
It ends with a JSON error because there are multiple weird spaces and escaped quotes that Python's json library cannot handle:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 22173 (char 22172)
Is there a way to get the JSON data, or even better, to evaluate the window.INITIAL_DATA assignment directly from the HTML response in Python?
Try:
import re
import js2py
import requests
url = "https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_DATA = (.*)", html_doc)
data = js2py.eval_js(data.group(1))
print(data)
Prints:
{
    "currentCountry": {
        "englishName": "Sweden",
        "localName": "Sverige",
        "twoLetterCode": "SE",
    },
    "currentCurrency": "SEK",
    "currentLanguage": "sv-SE",
    "currentLanguageRevision": "43",
    "currentLanguageTwoLetterName": "sv",
    "dynamicData": [
        {
            "data": {},
            "type": "NordicNest.ContentApi.DynamicData.MenuApiModel,NordicNest.ContentApi",
        },
        {
            "type": "NordicNest.Core.Contentful.Model.SiteLayout.Footer,NordicNest.Core"
        },
        ...
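If you would rather avoid the js2py dependency: the argument to JSON.parse is just a single-quoted JavaScript string, so a stdlib-only approach is to capture it and undo the backslash escapes before calling json.loads. A minimal sketch on an inline sample (the real page's escaping may differ, so treat the unescaping step as an assumption to verify):

```python
import json
import re

# Inline sample mimicking the page's script tag; raw string keeps the
# \" escapes exactly as they appear in the page source
html_doc = r"""<script>window.INITIAL_DATA = JSON.parse('{\"currentCurrency\":\"SEK\",\"currentLanguage\":\"sv-SE\"}');</script>"""

# Capture the single-quoted argument passed to JSON.parse
m = re.search(r"JSON\.parse\('(.*?)'\)", html_doc)
raw = m.group(1)

# Undo the JavaScript string escapes (\" -> ") before json.loads
unescaped = raw.encode().decode("unicode_escape")
data = json.loads(unescaped)
print(data["currentCurrency"])  # SEK
```

Note that unicode_escape can mangle non-ASCII characters, so on a page with Swedish text js2py (or a proper JS-string unescaper) is the safer route.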

How to scrape data from sciencedirect

I want to scrape all data from ScienceDirect by keyword.
I know that ScienceDirect is rendered via AJAX, so the data on the page can't be extracted directly via the URL of the search results page.
The page I want to scrape
I've found the JSON data among the numerous requests in the Network tab, and in my view I should be able to get the JSON data via the URL of that request. But I get an error message and garbled output. Here is my code.
The request that contains the JSON
import requests as res
import json
from bs4 import BeautifulSoup
keyword="digital game"
url = 'https://www.sciencedirect.com/search/api?'
payload = {
    'tak': keyword,
    't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
    'hostname': 'www.sciencedirect.com'
}
r = res.get(url, params=payload)
print(r.content)  # get garbled
r = r.json()
print(r)  # get error msg
Garbled output (not the JSON data I expect)
Error msg (about .json())
Try setting the HTTP headers in the request such as user-agent to mimic a standard web browser. This will return query search results in JSON format.
import requests
keyword = "digital game"
url = 'https://www.sciencedirect.com/search/api?'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json'
}
payload = {
    'tak': keyword,
    't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
    'hostname': 'www.sciencedirect.com'
}
r = requests.get(url, headers=headers, params=payload)
# need to check if the response output is JSON
if "json" in r.headers.get("Content-Type"):
    data = r.json()
else:
    print(r.status_code)
    data = r.text
print(data)
Output:
{'searchResults': [{'abstTypes': ['author', 'author-highlights'], 'authors': [{'order': 1, 'name': 'Juliana Tay'},
..., 'resultsCount': 961}}
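Once parsed, the fields can be pulled out like any nested dict. A minimal sketch against a hand-built stand-in for the structure above (the exact shape of the real response is an assumption based on the printed output; the title is a placeholder):

```python
# Hypothetical structure modeled on the printed output: 'searchResults'
# is a list of hits, each carrying an 'authors' list of dicts
data = {
    "searchResults": [
        {"title": "Example hit", "authors": [{"order": 1, "name": "Juliana Tay"}]},
    ],
    "resultsCount": 961,
}

for hit in data["searchResults"]:
    names = [a["name"] for a in hit.get("authors", [])]
    print(", ".join(names))  # Juliana Tay
```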
I've got the same problem. The point is that sciencedirect.com uses Cloudflare, which blocks access for scraping bots. I've tried different approaches like cloudscraper, cfscrape, etc... Unsuccessful! So I made a small parser based on Selenium, which lets me take metadata from publications and put it into my own JSON file with the following schema:
schema = {
    "doi_number": {
        "metadata": {
            "pub_type": "Review article" | "Research article" | "Short communication" | "Conference abstract" | "Case report",
            "open_access": True | False,
            "title": "title_name",
            "journal": "journal_name",
            "date": "publishing_date",
            "volume": str,
            "issue": str,
            "pages": str,
            "authors": [
                "author1",
                "author2",
                "author3"
            ]
        }
    }
}
If you have any questions or ideas, feel free to contact me.

JSON key and value

I am trying to get the values from objects in the following JSON response:
[
    {
        "compositionId": "-Mkl92Mii2UF3xzi1q7L",
        "compositionName": null,
        "mainComposition": true,
        "animation": {
            "state": "Out1"
        }
    },
    {
        "compositionId": "bbbbbb",
        "compositionName": null,
        "mainComposition": true,
        "animation": {
            "state": "Out1"
        }
    }
]
What I would like to get in a loop is all the compositionIds but I don't get the correct output.
I can dump the complete JSON with the following code:
import requests
import json
url = 'http://192.168.1.33/data'
r = requests.get(url)
data = json.loads(r.content.decode())
json_str = json.dumps(data)
resp = json.loads(json_str)
print (resp)
You can simply use the requests module; it provides a built-in JSON decoder, the .json() method. With that done, you can iterate over your JSON objects with a simple for loop.
You could do something similar to this:
import requests

url = 'http://192.168.1.33/data'
r = requests.get(url)
my_json_file = r.json()
for json_object in my_json_file:
    # Do something with json_object['compositionId']
    pass
Try something like this:
import requests
import json
url = 'http://192.168.1.33/data'
r = requests.get(url)
data = json.loads(r.content.decode())
print([d['compositionId'] for d in data])
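Applied to the sample response from the question, inlined here so the snippet runs without a server:

```python
import json

# The sample response from the question, as a string
sample = '''[
    {"compositionId": "-Mkl92Mii2UF3xzi1q7L", "compositionName": null,
     "mainComposition": true, "animation": {"state": "Out1"}},
    {"compositionId": "bbbbbb", "compositionName": null,
     "mainComposition": true, "animation": {"state": "Out1"}}
]'''

data = json.loads(sample)
ids = [d['compositionId'] for d in data]
print(ids)  # ['-Mkl92Mii2UF3xzi1q7L', 'bbbbbb']
```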

How can I use a JSON file as an input to MS Azure text analytics using Python?

I've looked through many responses to variants of this problem but am still not able to get my code working.
I am trying to use the MS Azure text analytics service and when I paste the example code (including 2/3 sample sentences) it works as you might expect. However, my use case requires the same analysis to be performed on hundreds of free text survey responses so rather than pasting in each and every sentence, I would like to use a JSON file containing these responses as an input, pass that to Azure for analysis and receive back a JSON output.
The code I am using and the response it yields is shown below (note that the last bit of ID 2 response has been chopped off before the error message).
key = "xxxxxxxxxxx"
endpoint = "https://blablabla.cognitiveservices.azure.com/"
import json
with open(r'example.json', encoding='Latin-1') as f:
data = json.load(f)
print (data)
import os
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials
def authenticateClient():
    credentials = CognitiveServicesCredentials(key)
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint, credentials=credentials)
    return text_analytics_client
import requests
# pprint is used to format the JSON response
from pprint import pprint
import os
subscription_key = "xxxxxxxxxxxxx"
endpoint = "https://blablabla.cognitiveservices.azure.com/"
entities_url = "https://blablabla.cognitiveservices.azure.com/text/analytics/v2.1/entities/"
documents = data
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
response = requests.post(entities_url, headers=headers, json=documents)
entities = response.json()
pprint(entities)
[{'ID': 1, 'text': 'dog ate my homework', {'ID': 2, 'text': 'cat sat on the
{'code': 'BadRequest',
'innerError': {'code': 'InvalidRequestBodyFormat',
'message': 'Request body format is wrong. Make sure the json '
'request is serialized correctly and there are no '
'null members.'},
'message': 'Invalid request'}
According to my research, when we call the Azure Text Analytics REST API to identify entities, the request body should look like
{
    "documents": [
        {
            "id": "1",
            "text": "."
        },
        {
            "id": "2",
            "text": ""
        }
    ]
}
For example
My json file
[
    {
        "id": "1",
        "text": "dog ate my homework"
    },
    {
        "id": "2",
        "text": "cat sat on the sofa"
    }
]
My code
key = ''
endpoint = "https://<>.cognitiveservices.azure.com/"
import requests
from pprint import pprint
import os
import json
with open(r'd:\data.json', encoding='Latin-1') as f:
    data = json.load(f)
pprint(data)
entities_url = endpoint + "/text/analytics/v2.1/entities?showStats=true"
headers = {"Ocp-Apim-Subscription-Key": key}
documents=data
response = requests.post(entities_url, headers=headers, json=documents)
entities = response.json()
pprint(entities)
pprint("--------------------------------")
documents ={}
documents["documents"]=data
response = requests.post(entities_url, headers=headers, json=documents)
entities = response.json()
pprint(entities)
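The key step is the wrapping itself, which can be checked without calling the service at all. A self-contained sketch using the sample texts from the question (no network, endpoint, or key involved):

```python
import json

# Bare list, as loaded from the JSON file
data = [
    {"id": "1", "text": "dog ate my homework"},
    {"id": "2", "text": "cat sat on the sofa"},
]

# Wrap it under a "documents" key, the shape the v2.1 entities
# endpoint expects in the request body
documents = {"documents": data}
print(json.dumps(documents, indent=2))
```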

Print Specific Value from an API Request in Python

I am trying to print values from an API request. The JSON file returned is large (4,000 lines), so I am just trying to get specific values from the key-value pairs and automate a message.
Here is what I have so far:
import requests
import json
import urllib
url = "https://api.github.com/repos/<companyName>/<repoName>/issues" #url
payload = {}
headers = {
'Authorization': 'Bearer <masterToken>' #authorization works fine
}
name = (user.login) #pretty sure nothing is being looked out
url = (url)
print(hello %name, you have a pull request to view. See here %url for more information) # i want to print those keys here
The JSON file (exported from the API GET request) is as follows:
[
    {
        **"url": "https://github.com/<ompanyName>/<repo>/issues/1000",**
        "repository_url": "https://github.com/<ompanyName>/<repo>",
        "labels_url": "https://github.com/<ompanyName>/<repo>/issues/1000labels{/name}",
        "comments_url": "https://github.com/<ompanyName>/<repo>/issues/1000",
        "events_url": "https://github.com/<ompanyName>/<repo>/issues/1000",
        "html_url": "https://github.com/<ompanyName>/<repo>/issues/1000",
        "id": <id>,
        "node_id": "<nodeID>",
        "number": 702,
        "title": "<titleName>",
        "user": {
            **"login": "<userName>",**
            "id": <idNumber>,
            "node_id": "nodeID",
            "avatar_url": "https://avatars3.githubusercontent.com/u/urlName?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/<userName>",
            "html_url": "https://github.com/<userName>",
            "followers_url": "https://api.github.com/users/<userName>/followers",
            "following_url": "https://api.github.com/users/<userName>/following{/other_user}",
            "gists_url": "https://api.github.com/users/<userName>/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/<userName>/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/<userName>/subscriptions",
            "organizations_url": "https://api.github.com/users/<userName>/orgs",
            "repos_url": "https://api.github.com/users/<userName>/repos",
            "events_url": "https://api.github.com/users/<userName>/events{/privacy}",
            "received_events_url": "https://api.github.com/users/<userName>/received_events",
            "type": "User",
            "site_admin": false
        }
    }
]
(note this JSON file repeats a few hundred times)
From the API request, I am trying to get the nested "login" and the url.
What am I missing?
Thanks
Edit:
Solved:
import requests
import json
import urllib

url = "https://api.github.com/repos/<companyName>/<repoName>/issues"
payload = {}
headers = {
    'Authorization': 'Bearer <masterToken>'
}
response = requests.get(url, headers=headers).json()
for obj in response:
    name = obj['user']['login']
    url = obj['url']
    print('Hello {0}, you have an outstanding ticket to review. For more information see here: {1}.'.format(name, url))
Since it's a JSON array you have to loop over it. And JSON objects are converted to dictionaries, so you use ['key'] to access the elements.
for obj in response:
    name = obj['user']['login']
    url = obj['url']
    print(f'hello {name}, you have a pull request to view. See here {url} for more information')
You can parse it into Python lists/dictionaries and then access it like any other Python object.
response = requests.get(...).json()
login = response[0]['user']['login']
You can convert JSON-formatted data to a Python dictionary like this:
https://www.w3schools.com/python/python_json.asp
json_data = ... # response from API
dict_data = json.loads(json_data)
login = dict_data[0]['user']['login']
url = dict_data[0]['url']
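For completeness, here is a self-contained sketch of those lookups against an inline sample (the login name and issue URL are hypothetical stand-ins for the placeholders above):

```python
import json

# Minimal inline sample mirroring the issues payload: a JSON array of
# issue objects, each with a top-level "url" and a nested "user" object
raw = '''[{"url": "https://api.github.com/repos/acme/repo/issues/702",
           "user": {"login": "someUser"}}]'''

issues = json.loads(raw)
for issue in issues:
    # The array maps to a list, each object to a dict
    print(issue['user']['login'], issue['url'])
```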
