I want to scrape a website that embeds a JSON object after "window.__INITIAL_STATE__=". Although I am able to scrape it by parsing the HTML, I want to know how I can find (if it exists at all) the API where that data comes from. I have also checked some APIs documented on their developer site, but they are missing some information that is only available in that INITIAL_STATE.
My goal is to make the request directly to an API instead of having to load the entire HTML and then parse it.
Here is the website I'm trying to get info from.
That script is embedded in the initial HTML; it's not pulled from any API. You can extract that script's data like below:
import requests
from bs4 import BeautifulSoup
import re
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
           'Accept-Language': 'en-US,en;q=0.5'}
url = 'https://www.just-eat.co.uk/restaurants-theshadse1/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

# Find the <script> tag whose text mentions INITIAL_STATE,
# then parse everything after the assignment as JSON
data_script = soup.find('script', string=re.compile("INITIAL_STATE"))
data = json.loads(data_script.text.split('window.__INITIAL_STATE__=')[1])
print(data['state'])
Result in terminal:
{'brazeApiKey': 'f714b0fc-6de5-4460-908e-2d9930f31339', 'menuVersion': 'VCE.1b_tZDFq8FmpZNn.ric9K4otZqBy', 'restaurantId': '17395', 'countryCode': 'uk', 'language': 'en-GB', 'localDateTime': '2022-08-04T00:41:58.9747334', 'canonicalBase': 'www.just-eat.co.uk', 'androidAppBaseUrl': [...]
You can analyze and dissect that JSON object further to get what you need from it.
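For example, a minimal sketch pulling individual fields out of it (the keys below are taken from the terminal output above and assumed to still be present):

# Hypothetical field access, based on the keys shown in the output above
restaurant_id = data['state']['restaurantId']  # e.g. '17395'
country_code = data['state']['countryCode']    # e.g. 'uk'
print(restaurant_id, country_code)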
I am trying to scrape data from a Twitter webpage using Python, but instead of getting the data back, I keep getting "JavaScript is not available". I've enabled JavaScript in my browser (Chrome) but nothing changes.
Here is the error -->
<h1>JavaScript is not available.</h1>
<p>We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.</p>
Here is the code -->
from bs4 import BeautifulSoup
import requests
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
I've tried enabling JavaScript in my browser (Chrome); I expected to get the required data back, but instead the error "JavaScript is not available" persists.
I would never advise scraping Twitter in violation of their policies; you should use an API instead! But for the JavaScript part, just pass a user agent in the headers of your request.
from bs4 import BeautifulSoup
import requests

# A browser user agent makes Twitter serve the page content
# instead of the "JavaScript is not available" notice
user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url, headers=headers).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
I want to scrape data from a real estate website for my education project. I am using BeautifulSoup. I wrote the following code. The code works properly but returns very little data.
import requests
from bs4 import BeautifulSoup

url = "https://www.zillow.com/homes/San-Francisco,-CA_rb/"
headers = {
    "Accept-Language": "en-GB,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0"
}
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
prices = soup.find_all("span", attrs={"data-test": True})
prices_list = [price.getText().strip("+,/,m,o,1,bd, ") for price in prices]
print(prices_list)
The output of this only shows the first 9 listings' prices.
['$2,959', '$2,340', '$2,655', '$2,632', '$2,524', '$2,843', '$2,64', '$2,300', '$2,604']
It's because the content is created progressively with continuous requests (lazy loading). You could try to reverse engineer the backend of the site. I'll look into it, and if I find an easy solution I'll update the answer. :)
The API call to their backend looks something like this: https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-123.07190982226562%2C%22east%22%3A-121.79474917773437%2C%22south%22%3A37.63132659190023%2C%22north%22%3A37.918977518603874%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=3
You need to handle cookies correctly in order to see the results, but it delivers around 1000 results. Have fun :)
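A minimal sketch of calling that endpoint directly (the assumption here, not something Zillow documents, is that the cookie check can be satisfied by first visiting the regular search page with the same session):

import requests

# Reuse one session so the cookies set by the normal search page
# are sent along with the backend call
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
s.get('https://www.zillow.com/homes/San-Francisco,-CA_rb/')  # pick up cookies

api_url = 'https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-123.07190982226562%2C%22east%22%3A-121.79474917773437%2C%22south%22%3A37.63132659190023%2C%22north%22%3A37.918977518603874%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=3'
r = s.get(api_url)
print(r.json()['cat1']['searchResults']['mapResults'])  # same path as in the update below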
UPDATE:
Parsing the saved JSON response should look like this:
import json

with open("GetSearchPageState.json", "r") as f:
    a = json.load(f)

print(a["cat1"]["searchResults"]["mapResults"])
I can't find any solutions for the problem I'm having.
I want to scrape the JSON file from https://www.armadarealestate.com/Inventory.aspx
When I go to the Network tab and open the URL the JSON is being loaded from, I just get sent to another HTML page, but the Response section shows that it contains the information about the properties, which is what I need.
So how can I pull the JSON from the website?
import json
import requests
resp = requests.get(url='https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=-3&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=')
print(json.loads(resp.text))
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
In fact, when I request the URL that the JSON comes from, I instead get the response from scraping the URL at 'https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=', which is an HTML file.
How can I fix this?
Your response object "resp" is not valid JSON; it is just HTML content.
You can use BeautifulSoup to scrape the content from the HTML.
The reason you are not getting a JSON object is the JavaScript in the HTML. Python requests only downloads the HTML document; if you want to render the JavaScript, use libraries like Selenium.
Otherwise, find the URL which loads the JSON via AJAX and use requests to get the JSON.
In your case, the tested code to scrape JSON:
import requests

url = "https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D="
# Asking for 'application/json' is what makes the server return JSON instead of the HTML page
h = {'accept': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
r = requests.get(url, headers=h)
print(r.json())  # prints the JSON data
Three days ago I started learning Python to create a web scraper and collect information about new book releases. I'm stuck on one of my target websites... I know this is a really basic question, but I've watched some videos, looked at many related questions on Stack Overflow, and tried more than 10 different solutions, and nothing. If anybody could help, much appreciated:
My problem:
I can retrieve the title information but can't retrieve the price information.
Data Source:
https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25
My code:
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
source = requests.get(url, headers=headers).text
soup = BeautifulSoup(source, 'lxml')

# code to retrieve title
for productdetails in soup.find_all("div", class_='figDetails'):
    producttitle = productdetails.a.text
    print(producttitle)

# code to retrieve price
for productpricedetails in soup.find_all("div", class_='related-products-block'):
    productprice = productpricedetails.find("div", class_="new-price").span.text
    print(productprice)
There are two elements named span; I need the information in the second one but don't know how to get to it.
Also, on trying different possible solutions, I kept getting a NoneType error...
It looks like the source you're trying to scrape populates this data via JavaScript.
Viewing the source of the page, you can see the raw HTML shows that the div you're trying to target is empty.
<html>
...
<div class="related-products-block" id="reletedProduct_490420">
</div>
...
</html>
You can also see this if you update your second loop like so:
for productpricedetails in soup.find_all("div", class_="related-products-block"):
    print(productpricedetails)
Edit:
As a bonus, you can inspect the JavaScript the page uses. It is very easy to understand, and the request simply returns the HTML which you are looking for. It is a bit more involved to prepare the JSON for the request, but here's an example:
import requests

url = "https://www.bloomsbury.com/uk/catalog/RelatedProductsData"
payload = {"productId": 490420, "type": "List", "ordertype": 0, "formatType": 0}

# Send the payload as a JSON body; requests sets the Content-Type header itself
response = requests.post(url, json=payload)
print(response.text)
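Since the endpoint returns an HTML fragment rather than JSON, you can feed the response straight back into BeautifulSoup. A sketch reusing the new-price selector from the question (the fragment's markup is an assumption here):

from bs4 import BeautifulSoup
import requests

url = "https://www.bloomsbury.com/uk/catalog/RelatedProductsData"
payload = {"productId": 490420, "type": "List", "ordertype": 0, "formatType": 0}
response = requests.post(url, json=payload)

# Parse the returned fragment and pull the price, as in the original loop
soup = BeautifulSoup(response.text, "lxml")
price_div = soup.find("div", class_="new-price")
if price_div and price_div.span:
    print(price_div.span.text)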
(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to then, for example, do a Google search using the search bar presented. I've read the quickstart guide, but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (cx parameter) at Control Panel of Custom Search Engine
Then, see the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
The list of parameters you can use is explained there as well.
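A minimal sketch of issuing that same request with requests (the parameter values are copied from the example above; you must substitute your own cx ID):

import requests

params = {
    'start': 0,
    'num': 10,
    'q': 'red sox',
    'cr': 'countryCA',
    'lr': 'lang_fr',
    'client': 'google-csbe',
    'output': 'xml_no_dtd',
    'cx': '00255077836266642015:u-scht7a-8i',  # replace with your own CSE ID
}
r = requests.get('http://www.google.com/search', params=params)
print(r.text)  # XML results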
import requests
from bs4 import BeautifulSoup
headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class': 'r'}):  # this line may change in future based on Google's web page structure
        url = searchWrapper.find('a')["href"]
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)

    return output
This will return a list of Google results in {'text': text, 'url': url} format. The top result's URL would be google('search query')[0]['url'].
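A quick usage sketch of the function above (keeping in mind the h3.r selector is tied to Google's markup at the time of writing and may no longer match):

results = google('stack overflow')
for hit in results[:3]:
    print(hit['text'], '->', hit['url'])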
input:
import requests

def googleSearch(query):
    with requests.session() as c:
        url = 'https://www.google.co.in'
        query = {'q': query}
        urllink = requests.get(url, params=query)
        print(urllink.url)

googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
    'q': 'minecraft',  # search query
    'gl': 'us',        # country where to search from
    'hl': 'en',        # language
}

requests.get('URL', params=params)
But in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, more specifically a user-agent, which is needed to make the request look like a "real" user visit (a browser sends a user-agent string to announce itself as a particular client).
The reason your request might be blocked is that the default requests user agent is python-requests, and websites understand that. Check what your user agent is.
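A quick way to check it (httpbin simply echoes back the headers it receives):

import requests

# Prints the default user agent, e.g. something like 'python-requests/2.28.1'
print(requests.get('https://httpbin.org/headers').json()['headers']['User-Agent'])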
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "tesla",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer, I work for SerpApi.
In this code, by using bs4, you can get all the h3 elements and print their text:
# Import the beautifulsoup
# and requests libraries of python.
import requests
import bs4

# Make two strings with the default google search URL,
# 'https://google.com/search?q=', and
# our customized search keyword, then concatenate them.
text = "c++ linear search program"
url = 'https://google.com/search?q=' + text

# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result = requests.get(url)

# Create soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text, "html.parser")

# Find all h3 headings and print their text
headings = soup.find_all("h3")
for heading in headings:
    print(heading.get_text())
You can use 'webbrowser'; I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')