How to scrape data from sciencedirect - python

I want to scrape all search results from ScienceDirect for a given keyword.
I know that ScienceDirect loads its results via AJAX, so the data can't be extracted directly from the URL of the search results page.
The page I want to scrape
Among the many requests in the browser's Network tab I found one that returns JSON, so I assumed I could get the data by requesting that URL myself. Instead I get garbled output and an error message. Here is my code.
The request that contains the JSON
import requests as res
import json
from bs4 import BeautifulSoup
keyword="digital game"
url = 'https://www.sciencedirect.com/search/api?'
payload = {
'tak': keyword,
't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
'hostname': 'www.sciencedirect.com'
}
r = res.get(url, params = payload)
print(r.content) # get garbled
r = r.json()
print(r) # get error msg
Garbled output (not the JSON data I expected)
Error message (raised by .json())

Try setting the HTTP headers in the request such as user-agent to mimic a standard web browser. This will return query search results in JSON format.
import requests
keyword = "digital game"
url = 'https://www.sciencedirect.com/search/api?'
headers = {
'User-Agent': 'Mozilla/5.0',
'Accept': 'application/json'
}
payload = {
'tak': keyword,
't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
'hostname': 'www.sciencedirect.com'
}
r = requests.get(url, headers=headers, params=payload)
# check whether the response body is JSON before decoding it
# (default to "" so a missing Content-Type header does not raise TypeError)
if "json" in r.headers.get("Content-Type", ""):
    data = r.json()
else:
    print(r.status_code)
    data = r.text
print(data)
Output:
{'searchResults': [{'abstTypes': ['author', 'author-highlights'], 'authors': [{'order': 1, 'name': 'Juliana Tay'},
..., 'resultsCount': 961}}
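Once the response decodes, the author names can be pulled out of the structure shown in that output. This is a minimal sketch, assuming each entry of searchResults may carry an authors list of dicts with a name key, as the truncated output suggests. The author_names helper and the sample dict are hypothetical stand-ins for the real response.

```python
def author_names(data):
    """Collect author names from every entry under 'searchResults'."""
    names = []
    for result in data.get("searchResults", []):
        # 'authors' may be missing on some entries, so default to an empty list
        for author in result.get("authors", []):
            names.append(author["name"])
    return names


# Hypothetical sample shaped like the truncated output
sample = {
    "searchResults": [
        {"abstTypes": ["author"], "authors": [{"order": 1, "name": "Juliana Tay"}]},
        {"abstTypes": []},  # an entry without authors
    ]
}
print(author_names(sample))
```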

I had the same problem. The point is that sciencedirect.com uses Cloudflare, which blocks access by scraping bots. I tried different approaches such as cloudscraper and cfscrape, without success. I then wrote a small Selenium-based parser that takes the metadata from publications and writes it to my own JSON file with the following schema:
schema = {
    "doi_number": {
        "metadata": {
            "pub_type": "Review article" | "Research article" | "Short communication" | "Conference abstract" | "Case report",
            "open_access": True | False,
            "title": "title_name",
            "journal": "journal_name",
            "date": "publishing_date",
            "volume": str,
            "issue": str,
            "pages": str,
            "authors": [
                "author1",
                "author2",
                "author3"
            ]
        }
    }
}
If you have any questions or ideas, feel free to contact me.
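A record in that schema could be assembled and written out roughly like this. It is only a sketch: make_record and every sample value (including the DOI) are hypothetical, and the Selenium extraction itself is not shown.

```python
import json


def make_record(doi, pub_type, open_access, title, journal, date,
                volume, issue, pages, authors):
    """Return one entry keyed by DOI, following the schema from the answer."""
    return {
        doi: {
            "metadata": {
                "pub_type": pub_type,
                "open_access": open_access,
                "title": title,
                "journal": journal,
                "date": date,
                "volume": volume,
                "issue": issue,
                "pages": pages,
                "authors": list(authors),
            }
        }
    }


records = {}
records.update(make_record(
    "10.0000/example.doi",      # hypothetical DOI
    "Research article",
    True,
    "An example title",
    "An example journal",
    "January 2020",
    "12", "3", "45-67",
    ["author1", "author2"],
))

# Serialize the whole collection; ensure_ascii=False keeps accented names readable
text = json.dumps(records, indent=2, ensure_ascii=False)
print(text)
```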

Related

Execute js function in HTML page scraped by python to get json data

I have a website with products: https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01 When I inspect the HTML page, I see that all the info is available in JSON format in a script tag, under
window.INITIAL_DATA = JSON.parse('{"pa...')
I tried to scrape the HTML with requests and extract the JSON string with a regex, but my code somehow changes the JSON structure and I cannot load it with json.loads()
response = requests.get('https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
regex = "JSON.parse\(.*;"
match = re.search(regex, str(soup))
json_string = match.group(0).replace("JSON.parse(", "")[1:-3]
json_data = json.loads(json_string)
It ends with a JSON error, because the string contains odd whitespace and " characters that Python's json library cannot handle:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 22173 (char 22172)
Is there a way to get the JSON data or, even better, to evaluate the window.INITIAL_DATA assignment directly from the HTML response in Python?
Try:
import re
import js2py
import requests
url = "https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_DATA = (.*)", html_doc)
data = js2py.eval_js(data.group(1))
print(data)
Prints:
{
    "currentCountry": {
        "englishName": "Sweden",
        "localName": "Sverige",
        "twoLetterCode": "SE",
    },
    "currentCurrency": "SEK",
    "currentLanguage": "sv-SE",
    "currentLanguageRevision": "43",
    "currentLanguageTwoLetterName": "sv",
    "dynamicData": [
        {
            "data": {},
            "type": "NordicNest.ContentApi.DynamicData.MenuApiModel,NordicNest.ContentApi",
        },
        {
            "type": "NordicNest.Core.Contentful.Model.SiteLayout.Footer,NordicNest.Core"
        },
        ...
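If pulling in a JavaScript interpreter is not an option, the string literal can also be decoded by hand. The sketch below is an alternative to js2py, not the answer's method. It assumes the page embeds the data as a single-quoted literal inside JSON.parse('...') and that the literal only uses escapes that are also valid in JSON (plus \'); other JS escapes such as \x41 would need extra handling. The names extract_initial_data and sample_html are hypothetical, and the sample markup is made up rather than taken from the real page.

```python
import json
import re

# Matches the single-quoted JS string literal passed to JSON.parse(...)
LITERAL_RE = re.compile(r"JSON\.parse\('((?:\\.|[^'\\])*)'\)", re.S)


def _to_json_string_body(match):
    """Rewrite one token so the JS literal body becomes a JSON string body."""
    token = match.group(0)
    if token == "\\'":
        return "'"      # \' is a JS-only escape; JSON wants a bare apostrophe
    if token == '"':
        return '\\"'    # a bare double quote must be escaped inside a JSON string
    return token        # any other escape pair (\\, \", \n, \uXXXX) is kept as-is


def extract_initial_data(html):
    found = LITERAL_RE.search(html)
    if found is None:
        raise ValueError("no JSON.parse('...') call found")
    body = re.sub(r'\\.|"', _to_json_string_body, found.group(1), flags=re.S)
    # First decode: undo the JS string escaping to recover the JSON text
    inner_json = json.loads('"' + body + '"')
    # Second decode: parse the JSON text itself
    return json.loads(inner_json)


# Hypothetical markup imitating the page's script tag
sample_html = r"""<script>window.INITIAL_DATA = JSON.parse('{"name":"Lamino","size":"5\\" wide","note":"it\'s new"}');</script>"""
print(extract_initial_data(sample_html))
```

The two json.loads calls mirror what the browser does: the JavaScript parser first unescapes the string literal, then JSON.parse reads the result.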

Print Specific Value from an API Request in Python

I am trying to print values from an API request. The JSON returned is large (4,000 lines), so I just want to pull specific values from the key-value pairs and automate a message.
Here is what I have so far:
import requests
import json
import urllib
url = "https://api.github.com/repos/<companyName>/<repoName>/issues" #url
payload = {}
headers = {
'Authorization': 'Bearer <masterToken>' #authorization works fine
}
name = (user.login) #pretty sure nothing is being looked out
url = (url)
print(hello %name, you have a pull request to view. See here %url for more information) # i want to print those keys here
The JSON (exported from the API GET request) is as follows:
[
{
**"url": "https://github.com/<companyName>/<repo>/issues/1000",**
"repository_url": "https://github.com/<companyName>/<repo>",
"labels_url": "https://github.com/<companyName>/<repo>/issues/1000labels{/name}",
"comments_url": "https://github.com/<companyName>/<repo>/issues/1000",
"events_url": "https://github.com/<companyName>/<repo>/issues/1000",
"html_url": "https://github.com/<companyName>/<repo>/issues/1000",
"id": <id>,
"node_id": "<nodeID>",
"number": 702,
"title": "<titleName>",
"user": {
**"login": "<userName>",**
"id": <idNumber>,
"node_id": "nodeID",
"avatar_url": "https://avatars3.githubusercontent.com/u/urlName?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/<userName>",
"html_url": "https://github.com/<userName>",
"followers_url": "https://api.github.com/users/<userName>/followers",
"following_url": "https://api.github.com/users/<userName>/following{/other_user}",
"gists_url": "https://api.github.com/users/<userName>/gists{/gist_id}",
"starred_url": "https://api.github.com/users/<userName>/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/<userName>/subscriptions",
"organizations_url": "https://api.github.com/users/<userName>/orgs",
"repos_url": "https://api.github.com/users/<userName>/repos",
"events_url": "https://api.github.com/users/<userName>/events{/privacy}",
"received_events_url": "https://api.github.com/users/<userName>/received_events",
"type": "User",
"site_admin": false
},
]
(note this JSON file repeats a few hundred times)
From the API request, I am trying to get the nested "login" and the url.
What am I missing?
Thanks
Edit:
Solved:
import requests
url = "https://api.github.com/repos/<companyName>/<repoName>/issues"
headers = {
    'Authorization': 'Bearer <masterToken>'
}
# send the auth header with the request, then decode the JSON array
response = requests.get(url, headers=headers).json()
for obj in response:
    name = obj['user']['login']
    issue_url = obj['url']  # separate name, so the request URL is not overwritten
    print('Hello {0}, you have an outstanding ticket to review. For more information see here: {1}.'.format(name, issue_url))
Since it's a JSON array you have to loop over it. And JSON objects are converted to dictionaries, so you use ['key'] to access the elements.
for obj in response:
    name = obj['user']['login']
    url = obj['url']
    print(f'hello {name}, you have a pull request to view. See here {url} for more information')
You can parse it into Python lists and dictionaries and then access it like any other Python object.
response = requests.get(...).json()
login = response[0]['user']['login']
You can convert JSON-formatted data to Python objects (here, a list of dictionaries) like this:
https://www.w3schools.com/python/python_json.asp
import json
json_data = ... # response text from the API
dict_data = json.loads(json_data)
login = dict_data[0]['user']['login']
url = dict_data[0]['url']
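Putting the answers together, the loop can be wrapped in a small helper that works on the already-decoded list. This is a sketch under assumptions: build_messages and the sample issues are hypothetical, and entries without a user object are simply skipped rather than raising a KeyError.

```python
def build_messages(issues):
    """Return one notification string per issue that has a user login."""
    messages = []
    for issue in issues:
        # 'user' can be absent or null, so fall back to an empty dict
        user = issue.get("user") or {}
        login = user.get("login")
        if login is None:
            continue
        messages.append(
            f"Hello {login}, you have a ticket to review. "
            f"See {issue['url']} for more information."
        )
    return messages


# Hypothetical sample shaped like the JSON array in the question
sample_issues = [
    {"url": "https://example.invalid/issues/1", "user": {"login": "alice"}},
    {"url": "https://example.invalid/issues/2", "user": None},
]
for message in build_messages(sample_issues):
    print(message)
```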

Using the POST Method for Batch Geocoding with ArcGIS Server REST API?

I'm trying to hit my geocoding server's REST API:
https://locator.stanford.edu/arcgis/rest/services/geocode/USA_StreetAddress/GeocodeServer (ArcGIS Server 10.6.1)
...using the POST method (which, BTW, could use an example or two; there only seems to be this VERY brief "note" on WHEN to use POST, not HOW: https://developers.arcgis.com/rest/geocode/api-reference/geocoding-geocode-addresses.htm#ESRI_SECTION1_351DE4FD98FE44958C8194EC5A7BEF7D).
I'm trying to use requests.post(), and I think I've managed to get the token accepted, but I keep getting a 400 error.
Based on previous experience, this means something about the formatting of the data is bad, but I cut and pasted this test pair directly from the Esri support site.
# import the requests library
import requests
# Multiple address records
addresses = {
    "records": [
        {
            "attributes": {
                "OBJECTID": 1,
                "Street": "380 New York St.",
                "City": "Redlands",
                "Region": "CA",
                "ZIP": "92373"
            }
        },
        {
            "attributes": {
                "OBJECTID": 2,
                "Street": "1 World Way",
                "City": "Los Angeles",
                "Region": "CA",
                "ZIP": "90045"
            }
        }
    ]
}
# Parameters
# Geocoder endpoint
URL = 'https://locator.stanford.edu/arcgis/rest/services/geocode/USA_StreetAddress/GeocodeServer/geocodeAddresses?'
# token from locator.stanford.edu/arcgis/tokens
mytoken = '<GeneratedToken>'
# output spatial reference id
outsrid = 4326
# output format
format = 'pjson'
# params data to be sent to api
params ={'outSR':outsrid,'f':format,'token':mytoken}
# Use POST to batch geocode
r = requests.post(url=URL, data=addresses, params=params)
print(r.json())
print(r.text)
Here's what I consistently get:
{'error': {'code': 400, 'message': 'Unable to complete operation.', 'details': []}}
I had to play around with this for longer than I'd like to admit, but the trick (I guess) is to use the correct request header and to convert the raw addresses to a JSON string with json.dumps(), sent as the value of the addresses form field.
import requests
import json
url = 'http://sampleserver6.arcgisonline.com/arcgis/rest/services/Locators/SanDiego/GeocodeServer/geocodeAddresses'
headers = { 'Content-Type': 'application/x-www-form-urlencoded' }
addresses = json.dumps({ 'records': [{ 'attributes': { 'OBJECTID': 1, 'SingleLine': '2920 Zoo Dr' }}] })
r = requests.post(url, headers = headers, data = { 'addresses': addresses, 'f':'json'})
print(r.text)
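The form body that this answer sends can be inspected offline. The sketch below only builds and encodes the fields; build_geocode_form is a hypothetical helper, and the outSR and token fields are carried over from the question rather than confirmed against the server. The key point is that addresses is a single form field whose value is a JSON string, instead of the nested dict being handed to data= directly.

```python
import json
from urllib.parse import urlencode


def build_geocode_form(records, out_sr=None, token=None):
    """Build the form fields for a geocodeAddresses POST."""
    form = {
        # The whole record set travels as one JSON-encoded string
        "addresses": json.dumps({"records": records}),
        "f": "json",
    }
    if out_sr is not None:
        form["outSR"] = out_sr
    if token is not None:
        form["token"] = token
    return form


# The test pair from the question
records = [
    {"attributes": {"OBJECTID": 1, "Street": "380 New York St.",
                    "City": "Redlands", "Region": "CA", "ZIP": "92373"}},
    {"attributes": {"OBJECTID": 2, "Street": "1 World Way",
                    "City": "Los Angeles", "Region": "CA", "ZIP": "90045"}},
]
form = build_geocode_form(records, out_sr=4326)

# This is roughly what a form-encoded POST body looks like on the wire
body = urlencode(form)
print(body)
```

With requests, the same dict would be passed as requests.post(url, data=form), which form-encodes it in the same way.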

Glassdoor API Not Printing Custom Response

I have the following problem when I try to print something from this API. I'm trying to set it up so I can access different headers and then print specific items from them. Instead, when I print soup, I get the entire API response in JSON format.
import requests, json, urlparse, urllib2
from BeautifulSoup import BeautifulSoup
url = "apiofsomesort"
#Create Dict based on JSON response; request the URL and parse the JSON
#response = requests.get(url)
#response.raise_for_status() # raise exception if invalid response
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response)
print soup
When it prints it looks like the below:
{
"success": true,
"status": "OK",
"jsessionid": "0541E6136E5A2D5B2A1DF1F0BFF66D03",
"response": {
"attributionURL": "http://www.glassdoor.com/Reviews/airbnb-reviews-SRCH_KE0,6.htm",
"currentPageNumber": 1,
"totalNumberOfPages": 1,
"totalRecordCount": 1,
"employers": [{
"id": 391850,
"name": "Airbnb",
"website": "www.airbnb.com",
"isEEP": true,
"exactMatch": true,
"industry": "Hotels, Motels, & Resorts",
"numberOfRatings": 416,
"squareLogo": "https://media.glassdoor.com/sqll/391850/airbnb-squarelogo-1459271200583.png",
"overallRating": 4.3,
"ratingDescription": "Very Satisfied",
"cultureAndValuesRating": "4.4",
"seniorLeadershipRating": "4.0",
"compensationAndBenefitsRating": "4.3",
"careerOpportunitiesRating": "4.1",
"workLifeBalanceRating": "3.9",
"recommendToFriendRating": "0.9",
"sectorId": 10025,
"sectorName": "Travel & Tourism",
"industryId": 200140,
"industryName": "Hotels, Motels, & Resorts",
"featuredReview": {
"attributionURL": "http://www.glassdoor.com/Reviews/Employee-Review-Airbnb-RVW12111314.htm",
"id": 12111314,
"currentJob": false,
"reviewDateTime": "2016-09-28 16:44:00.083",
"jobTitle": "Employee",
"location": "",
"headline": "An amazing place to work!",
"pros": "Wonderful people and great culture. Airbnb really strives to make you feel at home as an employee, and everyone is genuinely excited about the company mission.",
"cons": "The limitations of Rails 3 and the company infrastructure make developing difficult sometimes.",
"overall": 5,
"overallNumeric": 5
},
"ceo": {
"name": "Brian Chesky",
"title": "CEO & Co-Founder",
"numberOfRatings": 306,
"pctApprove": 95,
"pctDisapprove": 5,
"image": {
"src": "https://media.glassdoor.com/people/sqll/391850/airbnb-brian-chesky.png",
"height": 200,
"width": 200
}
}
}]
}
}
I want to print out specific items such as the employer's "name", "industry", etc.
You can load the JSON response into a dict then look for the values you want like you would in any other dict.
I took your data and saved it in an external JSON file to do a test since I don't have access to the API. This worked for me.
import json
# Load JSON from external file
with open(r'C:\Temp\json\data.json') as json_file:
    data = json.load(json_file)
# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']
Since you're getting your data from an API something like this should work.
import json
import urllib2
url = "apiofsomesort"
# Load JSON from API
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.load(response)  # json.load() takes the file-like response, not a string
# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']
import json, urllib2
url = "http..."
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.loads(response.read())
# Print the values
print 'numberOfRatings:', data['response']['employers'][0]['numberOfRatings']
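The snippets in this thread are Python 2. A Python 3 sketch of the same lookup is shown here, working on a JSON string so that it runs without the API. The employer_summary helper is hypothetical, and raw is a trimmed copy of the response from the question.

```python
import json


def employer_summary(data):
    """Pick a few fields from the first employer in the decoded response."""
    employer = data["response"]["employers"][0]
    return {
        "name": employer["name"],
        "industry": employer["industry"],
        "numberOfRatings": employer["numberOfRatings"],
    }


# Trimmed copy of the API response shown in the question
raw = """
{
  "success": true,
  "response": {
    "employers": [
      {"name": "Airbnb", "industry": "Hotels, Motels, & Resorts", "numberOfRatings": 416}
    ]
  }
}
"""

# json.loads() turns the text into nested dicts and lists
summary = employer_summary(json.loads(raw))
for key, value in summary.items():
    print(f"{key}: {value}")
```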

post data using python-requests

I'm trying to post the following data, but I'm getting an error. Could you please take a look? Thanks a lot.
Posting the same data with Postman works.
def _build_post_data(bike_instance):
    """
    data = {
        "apikey": "XXX",
        "data": {
            "created_at": "date_XX",
            "Price": "Decimal_XX"
        }
    }
    """
    data = {}
    raw_data = serializers.serialize('python', [bike_instance])
    actual_data = [d['fields'] for d in raw_data]
    data.update(
        {
            "apikey": XXX,
            "data": actual_data[0]
        }
    )
    return data
Posting data
bike = Bike.objects.get(pk=XXX)
data = _build_post_data(bike)
dump_data = json.dumps(data, cls=DjangoJSONEncoder)
requests.post(url, data=dump_data)
error
u'{"error":{"message":"422 Unprocessable Entity","errors":[["The data field is required."],["The apikey field is required."]],"status_code":422}}'
Both data and apikey are already in the dict, so why am I getting an error? Any ideas?
Postman works
With Postman you are sending a multipart/form-data request, with requests you only send JSON (the value of the data field in Postman), and are not including the apikey field.
Use a dictionary with the JSON data as one of the values, and pass that in as the files argument. It probably also works as the data argument (sent as application/x-www-form-urlencoded):
form_structure = {'apikey': 'XXXX', 'data': dump_data}
requests.post(url, files=form_structure)
# probably works too: requests.post(url, data=form_structure)
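The shape of that form can be checked without sending anything. In this sketch, build_form is a hypothetical helper, the sample values are made up, and json.dumps(..., default=str) stands in for DjangoJSONEncoder so the example runs without Django; the two do not format dates and decimals identically in every case.

```python
import json
from datetime import date
from decimal import Decimal
from urllib.parse import urlencode, parse_qs


def build_form(apikey, payload):
    """Return two flat form fields; the nested payload is one JSON string."""
    return {
        "apikey": apikey,
        # default=str converts values json cannot serialize (date, Decimal)
        "data": json.dumps(payload, default=str),
    }


form = build_form("XXXX", {"created_at": date(2020, 1, 31), "Price": Decimal("199.90")})

# What a form-encoded body would contain: two top-level fields
encoded = urlencode(form)
print(encoded)

# Round trip: the server side sees 'apikey' and a JSON string under 'data'
fields = parse_qs(encoded)
print(json.loads(fields["data"][0]))
```

Both fields the API complained about are now present at the top level of the form, which is the difference from posting the JSON document as the raw body.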
