I am trying to learn how to use BS4 but I ran into this problem. I try to find the text in the Google Search results page showing the number of results for the search but I can't find no text 'results' neither in the html_page nor in the soup HTML parser. This is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(b'results' in html_page)
print('results' in soup)
Both prints return False, what am I doing wrong? How to fix that?
EDIT:
Turns out the language of the webpage was a problem, adding &hl=en to the URL almost fixed it.
url = 'https://www.google.com/search?q=stack&hl=en'
The first print is now True but the second is still False.
requests library when returning the response in form of response.content usually returns in a raw format. So to answer your second question, replace the res.content with res.text.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')
print('results' in soup)
Output: True
Keep in mind, Google is usually very active in handling scrapers. To avoid getting blocked/captcha'ed, you can add a user agent to emulate a browser. :
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
It's not because res.content should be changed to res.text as 0xInfection mentioned, it would still return the result.
However, in some cases, it will only return bytes content if it's not gzip or deflate transfer-encodings, which are automatically decoded by requests to a readable format (correct me in the comments or edit this answer if I'm wrong).
It's because there's no user-agent specified thus Google will block a request eventually because default requests user-agent is python-requests and Google understands that it's a bot/script. Learn more about request headers.
Pass user-agent into request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
request.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah definition", # query
"gl": "us", # country to make request from
"hl": "en" # language
}
response = requests.get('https://www.google.com/search',
headers=headers,
params=params).content
soup = BeautifulSoup(response, 'lxml')
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 114,000 results
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want without thinking about how to extract stuff or figure out how to bypass blocks from Google or other search engines since it's already done for the end-user.
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah definition",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
result = results["search_information"]['total_results']
print(result)
# 112000
Disclaimer, I work for SerpApi.
Related
I'm trying to make a pop-up program with mir4 draco price. But the price return None :
import requests
from bs4 import BeautifulSoup
urll = 'https://www.xdraco.com/coin/price/'
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/86.0.4240.198 Safari/537.36"}
site = requests.get(urll, headers=headers)
soup = BeautifulSoup(site.content, 'html5lib')
price = soup.find('span', class_="amount")
print(price)
You won't be able to parse a site that is dynamically loaded using JS as #jabbson mentioned.
This might be a way to get the data you want.
If you check the network requests being made by the page, you will find that it makes calls to a few different APIs. I found one that might have the info you're looking for. You can make POST requests to this API as shown below...
import requests
import json
headers = {'accept':'application/json, text/plain, */*','user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
html = requests.post('https://api.mir4global.com/wallet/prices/hydra/daily', headers=headers)
output = json.loads(html.text)
# 'output' is a dictionary. If we index the last element, we can get the latest data entry
print(output['Data'][-1])
OUTPUT:
{'CreatedDT': '2022-08-04 21:55:00', 'HydraPrice': '2.1301000000000001', 'HydraAmount': '13434', 'HydraPricePrev': '2.3336000000000001', 'HydraAmountPrev': '5972', 'HydraUSDWemixRate': '2.9401340627166839', 'HydraUSDKLAYRate': '0.29840511595654395', 'USDHydraRate': '6.2627795669928084'}
I am trying to do a simple WebScrapper to monitor Nike's site here in Brazil.
Basically i want to track products that have stock right now, to check when new products are added.
My problem is that when i navigate to the site https://www.nike.com.br/snkrs#estoque I see different products compared to what I see using python requests method.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup
headers ={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
len(soup.find_all(class_='produto produto--comprar'))
This code gives me 40, but using the browser I can see 56 products https://prnt.sc/26jeo1i
The data comes from a different source, within 3 pages.
import requests
from bs4 import BeautifulSoup
headers ={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
productList = []
for p in [1,2,3]:
url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56
I am running this code
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.bohus.no/spiseplassen/oppbevaring-1/gradino-vitrine-2')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.find('div', class_='price').text)
I am trying to get the price of the product on this site: https://www.bohus.no/spiseplassen/oppbevaring-1/gradino-vitrine-2
All I am getting is empty data when running my code. Am I doing something wrong or does the website do something special to stop me from scaping price?
As stated in the comments, you can get the product data from the store's API.
Here's how:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/78.0.3904.108 Safari/537.36',
"x-requested-with": "XMLHttpRequest"
}
product_url = "https://www.bohus.no/spiseplassen/oppbevaring-1/gradino-vitrine-2"
page_content = requests.get(product_url).content
soup = BeautifulSoup(page_content, 'lxml')
product_id = soup.find("input", {"name": "d-session-product"})["value"]
payload = {
"debug": "off",
"ajax": "1",
"product_list": product_id,
"action": "init",
"showStockStatusAndShoppingcart": "1",
"enablePickupAtNearbyStores": "yes",
}
endpoint = "https://www.bohus.no/lite.cgi/module/priceAndStock"
product_data = requests.post(endpoint, data=payload, headers=headers).json()
print(product_data["price"][0]["salesPriceNormal"])
Output:
8799
Also, you can use requests-HTML library:
import requests_html
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.bohus.no/spiseplassen/oppbevaring-1/gradino-vitrine-2')
r.html.render()
sel='div.price-data > div.price'
print(r.html.find(sel, first=True).text)
I have tried to scrape twitter account followers list. For that, authentication is required. So i used requests library for authentication purpose. The problem i am getting is, when i try to authenticate, I am getting 200 response but authentication is not done. The code is:
import requests
from bs4 import BeautifulSoup
import json
payload={
"session[username_or_email]":"*****************",
"session[password]":"**********************",
"authenticity_token":"aa3520020157738bdabb6d60f2e02894c6c85689",
"ui_metrics":'{"rf":{"a67dd0828000993f688a64a8238f647dd8ef987feb0db5979725fc7e304c3989":-250,"a4cd98aa5fd1d026bfded286fc24eb6ac9cf01a65b789ade51b68558cb0f6ae0":-21,"a88c7b5bdeb04ce3cf55df08c0f981f99df760b9348680c735fbff5b60ad054f":51,"a5e59c69fb04ab30f2f8468030c31ca1150f4265e4c2a35dbb1b67b85be6954f":-68},"s":"QdcvZJ9RhjLcVcW2N_pDt5j5AKQJCkqnh9caYV5ykW35tRpQc_RN5s_VefN2uVCONpXf-qZa-fr8VtCAFrtiOf2f6PhloU2GyxLDN38wGppFNWhb4psCr7x-kibioS9PDxWZF1pe3FM-MOz9YtIQrWxbmEAWnRTK3gUn-1nv4kTFDa159YxJoXiYt43g41sRUJWezJI2yJaECnO1ARbkNAPKrMndxRAcq_5qSFpT8CqzEUvBKPMdFMKeUrzeEecqmx632lTV1NlucVIvV9co3Y3Rk7CtURoaiCwsjTED1brU4XAY3VwsTEuNRUYZqirRNZrYQBCHqsMh5FV_UHpO2QAAAWE40pmN"}',
"scribe_log":"",
"redirect_after_login":"",
"authenticity_token":"aa3520020157738bdabb6d60f2e02894c6c85689",
"return_to_ssl":"",
"remember_me":"1",
"lang":"",
"redirect":""
}
headers={
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.9",
"cache-control":"max-age=0",
"cookie":'moments_profile_moments_nav_tooltip_self=true; syndication_guest_id=v1%3A150345116906281638; eu_cn=1; kdt=QErLcBT9OjM5gjEznmsRcHlMTK6biDyAw4gfI5ro; _ga=GA1.2.1923324433.1496571570; tfw_exp=0; __utma=43838368.1923324433.1496571570.1516764481.1516764481.1; __utmz=43838368.1516764481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_checked_on=0; personalization_id="v1_Iq7dc3Mq746/e91mchhhJg=="; guest_id=v1%3A151698504007256847; ads_prefs="HBERAAA="; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCF5%252F0jhhAToMY3NyZl9p%250AZCIlN2ZmZjExM2NkYjUzODEzZDNiNDE4YWI3NGRhZTAxOTc6B2lkIiU3YWFl%250AZjVhNDY1OWJlNzdiN2RiYjEzNjIwYWVjMGMyMQ%253D%253D--d69792331ec3a3b6c9d994a07f2159bfd5697089; ct0=ecc095f3a61b1c77279538584cb6f20e; _gid=GA1.2.253357133.1517076775; _gat=1',
"referer":"https://twitter.com/login",
"upgrade-insecure-requests":"1",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
str=payload["ui_metrics"]
x=json.dumps(str)
y=json.loads(str)
payload["ui_metrics"]=y
res = requests.post("https://twitter.com/login",data=payload,headers=headers)
r = requests.get("https://twitter.com/following")
soup = BeautifulSoup(r.text,"html.parser")
print(res.status_code)
print(r.url)
print(soup.prettify())
for item in soup.find_all({"class":"u-textInheritColor js-nav"}):
print(item.text)
I am getting 200 response for status code. How to solve this problem?
NOTE: I am not using any APIs. I want to authenticate using requests library.
Try this. It should get you there:
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
r = s.get("https://twitter.com/login")
soup = BeautifulSoup(r.text,"lxml")
token = soup.select_one("[name='authenticity_token']")['value']
payload={
'session[username_or_email]':'your_email',
'session[password]':'your_password',
'authenticity_token':token,
'ui_metrics':'{"rf":{"c6fc1daac14ef08ff96ef7aa26f8642a197bfaad9c65746a6592d55075ef01af":3,"a77e6e7ab2880be27e81075edd6cac9c0b749cc266e1cea17ffc9670a9698252":-1,"ad3dbab6c68043a1127defab5b7d37e45d17f56a6997186b3a08a27544b606e8":252,"ac2624a3b325d64286579b4a61dd242539a755a5a7fa508c44eb1c373257d569":-125},"s":"fTQyo6c8mP7d6L8Og_iS8ulzPObBOzl3Jxa2jRwmtbOBJSk4v8ClmBbF9njbZHRLZx0mTAUPsImZ4OnbZV95f-2gD6-03SZZ8buYdTDkwV-xItDu5lBVCQ_EAiv3F5EuTpVl7F52FTIykWowpNIzowvh_bhCM0_6ReTGj6990294mIKUFM_mPHCyZxkIUAtC3dVeYPXff92alrVFdrncrO8VnJHOlm9gnSwTLcbHvvpvC0rvtwapSbTja-cGxhxBdekFhcoFo8edCBiMB9pip-VoquZ-ddbQEbpuzE7xBhyk759yQyN4NmRFwdIjjedWYtFyOiy_XtGLp6zKvMjF8QAAAWE468LY"}',
'scribe_log':'',
'redirect_after_login':'',
'authenticity_token':token,
'remember_me':1
}
headers={
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'content-type':'application/x-www-form-urlencoded',
'origin':'https://twitter.com',
'referer':'https://twitter.com/login',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
res = s.post("https://twitter.com/sessions",data=payload,headers=headers)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".tweet-text"):
print(item.text)
In some pages, when I use beautifulsoup, return nothing...just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use beautifulsoup any other site except this site. and I dont know way...
This URL will require certain headers passed while requesting.
Pass this headers parameter while requesting the URL and you will get the HTML.
HTML = requests.get(URL , headers = headers).content
while
headers = {
"method":"GET",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
"Host":"gall.dcinside.com",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"Accept":"text/html,application/xhtml+xml,
application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As I can see, this site is using cookies. You can see the headers in the browser's developer tool. You can get the cookie by following:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example see this list of Firefox User-Agent strings e.g.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)