How can I scrape the JSON file off this website? - python

I can't find any solutions for the problem I'm having.
I want to scrape the JSON file from https://www.armadarealestate.com/Inventory.aspx
When I open the network tab and follow the URL the JSON is being loaded from, I just get sent to another HTML page, but the response section shows that it contains the information about the properties, which is what I need.
So how can I pull the JSON file from the website?
import json
import requests
resp = requests.get(url='https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=-3&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=')
print(json.loads(resp.text))
This raises the following error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
In fact, when I send the request that is supposed to return the JSON, I instead get the same response as scraping the URL at 'https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=',
which is an HTML file.
How can I fix this?

Your response object resp is not valid JSON; it is just HTML content.
You can use BeautifulSoup to scrape the content from the HTML.
The reason you are not getting a JSON object is the JavaScript in the HTML. Python requests only downloads the HTML document itself; if you want to render the JavaScript, use libraries like Selenium (a rough sketch is shown after the tested code below).
Otherwise, find the URL which loads the JSON via AJAX and use requests to get the JSON.
In your case, here is tested code to scrape the JSON:
import requests
url = "https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D="
h = {'accept': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
r = requests.get(url, headers=h)
print(r.json())
#prints the JSON data
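If you did want to take the Selenium route instead, a rough sketch might look like the following. This assumes selenium and a matching ChromeDriver are installed; the CSS selector waited on is an assumption about the rendered page, not something taken from the site:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://www.armadarealestate.com/Inventory.aspx')
    # wait until the inventory widget has been injected by JavaScript
    # (the selector is a guess; adjust it to the element that holds the listings)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#buildout'))
    )
    html = driver.page_source  # fully rendered HTML, ready for BeautifulSoup
finally:
    driver.quit()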

Related

Scrapy - How does a request sent to an API using the requests library differ from the request that is sent using Scrapy.Request?

I am a beginner with Scrapy and I was trying to scrape this website https://directory.ntschools.net/#/schools, which uses JavaScript to load its contents. So I checked the network tab and there's an API address available: https://directory.ntschools.net/api/System/GetAllSchools. If you open this address, the data is in XML format. But when you check the response tab while inspecting the network tab, the data is there in JSON format.
I first tried using Scrapy and sent the request to the API address WITHOUT any headers, and the response it returned was XML, which threw a JSONDecodeError when I used json.loads(). So I used the header 'Accept': 'application/json' and the response I got was JSON. That worked well.
import scrapy
import json
import requests

class NtseSpider_new(scrapy.Spider):
    name = 'ntse_new'
    header = {
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56',
    }

    def start_requests(self):
        yield scrapy.Request('https://directory.ntschools.net/api/System/GetAllSchools', callback=self.parse, headers=self.header)

    def parse(self, response):
        data = json.loads(response.body)  # returned json response
But then I used the requests module WITHOUT any headers and the response I got was in JSON too!
import requests
import json
res = requests.get('https://directory.ntschools.net/api/System/GetAllSchools')
js = json.loads(res.content) #returned json response
Can anyone please tell me if there's any difference between the two types of requests? Is there a default response format for the requests module when making a request to an API? Surely I am missing something?
Thanks
It's because Scrapy sets the Accept header to 'text/html,application/xhtml+xml,application/xml ...' by default; you can see that in Scrapy's DEFAULT_REQUEST_HEADERS setting.
I experimented and found that the server sends a JSON response if the request has no Accept header.
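To see where that difference comes from, here is a small illustrative snippet (not part of the original answer) that prints the default Accept header each library sends:
from scrapy.settings.default_settings import DEFAULT_REQUEST_HEADERS
import requests

print(DEFAULT_REQUEST_HEADERS['Accept'])           # Scrapy's default Accept header (the HTML/XML one quoted above)
print(requests.utils.default_headers()['Accept'])  # requests defaults to '*/*', leaving the format up to the server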

Find __INITIAL_STATE__ API url in website

I want to scrape a website that contains JSON embedded after "window.__INITIAL_STATE__=". Although I am able to scrape it by parsing the HTML, I want to know how I can find (if it exists at all) the API that data comes from. I have also checked some APIs documented on their developer site, but they are missing some information that is only available in that __INITIAL_STATE__.
My goal is to make the request directly to an API instead of having to load the entire HTML and then parse it.
Here is the website I'm trying to get info from.
That script is loaded in the initial HTML; it's not pulled from any API. You can get that script data like below:
import requests
from bs4 import BeautifulSoup
import re
import json
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
'Accept-Language' : 'en-US,en;q=0.5'}
url = 'https://www.just-eat.co.uk/restaurants-theshadse1/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
data_script = soup.find('script', string=re.compile("INITIAL_STATE"))
data = json.loads(data_script.text.split('window.__INITIAL_STATE__=')[1])
print(data['state'])
Result in terminal:
{'brazeApiKey': 'f714b0fc-6de5-4460-908e-2d9930f31339', 'menuVersion': 'VCE.1b_tZDFq8FmpZNn.ric9K4otZqBy', 'restaurantId': '17395', 'countryCode': 'uk', 'language': 'en-GB', 'localDateTime': '2022-08-04T00:41:58.9747334', 'canonicalBase': 'www.just-eat.co.uk', 'androidAppBaseUrl': [...]
You can analyze and dissect that json object further, to get what you need from it.
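For example, using keys visible in the sample output above (a small illustrative follow-up, not part of the original answer):
state = data['state']
print(list(state.keys()))      # top-level keys of the state object
print(state['restaurantId'])   # '17395' in the sample above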

Not getting any HTML Response Codes

I am new to the whole scraping thing and am trying to scrape some information off a website with Python, but when checking for an HTML response code (i.e. 200) I am not getting any results back on the terminal. Below is my code. Appreciate all sorts of help! Edit: I have fixed my rookie mistake in the print section below xD, thank you guys for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the user agent to some well-known string, like below:
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns the page variable to a variable called Print, which is probably not a good idea since print is a built-in function in Python. In order to output to the console, change your code to:
print(page)
Second, printing page just prints the response object you received after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
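Beyond status_code, the response object exposes a few other properties that are often useful when debugging a scrape (a short illustrative addition, not part of the original answer):
print(page.ok)           # True if the status code is below 400
print(page.headers)      # the response headers sent by the server
print(page.text[:200])   # the first 200 characters of the HTML body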

Logging in to Instagram to scrape user info

I need to scrape info from an Instagram user page; more specifically, I need to use this URL: "https://www.instagram.com/cristiano/?__a=1"
The problem is that I need to be logged in with my Instagram account to execute this script.
from requests import get
from bs4 import BeautifulSoup
import json
import re
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.74 Safari/537.36 Edg/79.0.309.43'}
response = get(url_user, headers=headers)
print(response)
# print(page.text)
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
jsondata=json.loads(str(soup))
I get this error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
How can I get around that connection problem to scrape the info and access the data?
Thank you
Adding the __a=1 parameter gets you a JSON response, so you do not need to go through BeautifulSoup; you can simply load the JSON directly:
response = get(url_user, headers=headers)
jsondata=json.loads(response.text)
Alternatively you can use the json() function to load the JSON:
response = get(url_user, headers=headers)
jsondata = response.json()

ValueError while scraping instagram with python

Hello, I am trying to scrape this URL: https://www.instagram.com/cristiano/?__a=1 but I get a ValueError.
from requests import get
from bs4 import BeautifulSoup
import json

url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = get(url_user, headers=headers)
print(response)  # 200
html_soup = BeautifulSoup(response.content, 'html.parser')
# print(html_soup)
jsondata = json.loads(str(html_soup))
ValueError: No JSON object could be decoded
Any idea why I get this error?
The reason you're getting the error is that you're trying to parse a JSON response as if it were HTML. You don't need BeautifulSoup for that.
Try this:
import json
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = json.loads(requests.get(url_user).text)
print(d)
However, best practice suggests using .json() from requests, as it does a better job of figuring out the encoding used.
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = requests.get(url_user).json()
print(d)
You might be getting a non-200 HTTP status code, which means the server responded with an error; e.g. the server might have banned your IP for frequent requests. The requests library doesn't throw any errors for that on its own. To check for erroneous status codes, insert this line after the get(...) call:
response.raise_for_status()
Also, it is enough to just do jsondata = response.json(); the requests library can parse JSON this way without any need for BeautifulSoup. An easy-to-read tutorial covering the main requests library features is in the requests quickstart documentation.
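Putting those two points together, a minimal sketch (reusing the URL and headers from the question):
from requests import get

url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = get(url_user, headers=headers)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
jsondata = response.json()   # parses the JSON body directly
print(jsondata)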
Also, if there is some parsing problem, save the binary content of the response to a file so you can attach it to the question, like this:
with open('response.dat', 'wb') as f:
    f.write(response.content)
