Google search using bs4, Python

I want to pick up the address returned for the query "spotlight 29 casino address" through Google search in a Python script. Why isn't my code working?
from bs4 import BeautifulSoup
# from googlesearch import search
import urllib.request
import datetime
article='spotlight 29 casino address'
url1 ='https://www.google.co.in/#q='+article
content1 = urllib.request.urlopen(url1)
soup1 = BeautifulSoup(content1,'lxml')
#print(soup1.prettify())
div1 = soup1.find('div', {'class':'Z0LcW'}) #get the div where it's located
# print (datetime.datetime.now(), 'street address: ' , div1.text)
print (div1)

Google renders that element with JavaScript, which is why you don't receive the div with urllib.request.urlopen.
As a solution you can use Selenium, a Python library for automating a real browser. Install it with the 'pip install selenium' console command; then code like this should work:
from bs4 import BeautifulSoup
from selenium import webdriver
article = 'spotlight 29 casino address'
url = 'https://www.google.co.in/#q=' + article
driver = webdriver.Firefox()
driver.get(url)
html = BeautifulSoup(driver.page_source, "lxml")
div = html.find('div', {'class': 'Z0LcW'})
print(div.text)
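One more detail worth knowing about the URL itself: everything after `#` is a fragment that the browser handles locally and never sends to the server, and raw spaces make the URL invalid for urllib. A small stdlib-only sketch of building a proper `?q=` search URL with the query percent-encoded:

```python
import urllib.parse

article = 'spotlight 29 casino address'
# The '#q=' form is a client-side fragment; the server only sees '?q=',
# and the spaces in the query must be percent-encoded.
url = 'https://www.google.co.in/search?q=' + urllib.parse.quote_plus(article)
print(url)  # https://www.google.co.in/search?q=spotlight+29+casino+address
```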

If you want to get Google search results, Selenium with Python is a simpler way. Below is some simple code:
from selenium import webdriver
import urllib.parse
from bs4 import BeautifulSoup
chromedriver = '/xxx/chromedriver'  # replace /xxx/ with the path where chromedriver is installed
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chromedriver, options=chrome_options)
article='spotlight 29 casino address'
driver.get("https://www.google.co.in/#q="+urllib.parse.quote(article))
# driver.page_source <-- html source, you can parser it later.
soup = BeautifulSoup(driver.page_source, 'lxml')
div = soup.find('div',{'class':'Z0LcW'})
print(div.text)
driver.quit()

You were getting an empty div because the requests library's default user-agent is python-requests, so your request is blocked by Google (in this case) and you receive different HTML with different elements. Setting a user-agent makes the request look like a "real" user visit.
You can achieve this without Selenium when the address is present in the HTML (which in this case it is) by passing a user-agent in the request headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
Here's the code for a full example:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get("https://www.google.com/search?q=spotlight 29 casino address", headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.select_one(".sXLaOe").text)
# 46-200 Harrison Pl, Coachella, CA 92236
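Since Google's generated CSS class names (Z0LcW, sXLaOe) change frequently, it can help to try several selectors and tolerate a miss, rather than crash on `.text` of None. A small sketch of that idea (the selector names are just the ones seen in this thread, and the miniature document stands in for a real results page):

```python
from bs4 import BeautifulSoup

def extract_first(soup, *selectors):
    """Return the text of the first selector that matches, else None."""
    for sel in selectors:
        node = soup.select_one(sel)
        if node is not None:
            return node.get_text()
    return None

# Demo with a miniature document standing in for a results page:
soup = BeautifulSoup('<div class="Z0LcW">46-200 Harrison Pl</div>', 'html.parser')
print(extract_first(soup, '.sXLaOe', '.Z0LcW'))  # 46-200 Harrison Pl
```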
P.S. I have a dedicated web scraping blog. If you need to parse search engines, you can try SerpApi.
Disclaimer, I work for SerpApi.

Related

Difficulty Scraping Product Information from Website

I am having difficulty scraping the product name and price from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571
I am looking to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the webpage. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without any success; the sample approaches I have taken are attached below.
I can view the price and product name in Chrome's "Developer Tools" tab. It seems the page uses JavaScript to load the product information dynamically, so the approaches mentioned above are not able to scrape the information properly.
I'd appreciate any assistance with this issue.
Requests_html Approach:
from requests_html import HTMLSession
import json
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
json_text=r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])
Beautiful Soup Approach:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)
Selenium Approach:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
Your approach with the embedded JSON needs some refinement; in other words, you're almost there. Also, this can be done with pure requests and bs4.
P.S. I'm using different URLs, as the one you gave returns a 404.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
for url in urls:
    product_data = json.loads(
        BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
        .find("script", type="application/ld+json")
        .string[:-1]
    )
    print(product_data["name"])
    print(product_data["offers"]["price"])
This should output:
Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
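About the `.string[:-1]` slice above: the site's `application/ld+json` payload apparently carries one stray trailing character after the JSON, so it is cut off before `json.loads`. A sturdier alternative, sketched here on a stand-in string, is `json.JSONDecoder.raw_decode`, which parses one complete JSON value and ignores whatever trails it:

```python
import json

# Stand-in for a script tag's contents that ends with a stray character:
raw = '{"name": "New Moon New Zealand Abalone", "offers": {"price": "33.8"}};'
data, _ = json.JSONDecoder().raw_decode(raw)
print(data["name"])             # New Moon New Zealand Abalone
print(data["offers"]["price"])  # 33.8
```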

Python How Can I Scrape Image URL and Image Title From This Link With BeautifulSoup

I could not reach the image URL and image title with Beautiful Soup. I would be glad if you could help me. Thanks.
The image URL I want to scrape is:
https://cdn.homebnc.com/homeimg/2017/01/29-entry-table-ideas-homebnc.jpg
The title I want to scrape is:
37 Best Entry Table Ideas (Decorations and Designs) for 2017
from bs4 import BeautifulSoup
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
}
response = requests.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large", headers=headers)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
As mentioned in the comment above by @Carst3n, BeautifulSoup only gives you the HTML as it is served, before any scripts are executed. For this reason you should try scraping the website with a combination of Selenium and BeautifulSoup.
You can download chromedriver from the official ChromeDriver downloads page.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
opts = Options()
opts.add_argument("--headless")
opts.add_argument("--log-level=3")
driver = webdriver.Chrome(PATH_TO_CHROME_DRIVER, options=opts)
driver.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large")
time.sleep(2)
soup_file = driver.page_source
driver.quit()
soup = BeautifulSoup(soup_file, 'html.parser')
print(soup.select('#mainImageWindow .nofocus')[0].get('src'))
print(soup.select('.novid')[0].string)
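As an aside, for this particular page no scraping is needed for the image URL at all: it is already embedded in the Bing link itself as the `mediaurl` query parameter, which the stdlib can decode (the URL below is shortened to the relevant parameters):

```python
import urllib.parse

bing_url = ("https://www.bing.com/images/search?view=detailV2"
            "&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01"
            "%2f29-entry-table-ideas-homebnc.jpg&q=table+ideas")
query = urllib.parse.urlparse(bing_url).query
media_url = urllib.parse.parse_qs(query)["mediaurl"][0]
print(media_url)  # https://cdn.homebnc.com/homeimg/2017/01/29-entry-table-ideas-homebnc.jpg
```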

How to scrape a part of a google search result

I want to scrape the information from AMD stock that google provide.
I have been able to scrape the whole webpage, but as soon as I try to get a specific div or class I can't find anything, and the console returns [].
When scraping the whole page I cannot find those classes either. After searching, I found that this part is possibly rendered by JavaScript and can somehow be accessed with Selenium. I tried Selenium WebDriver, but that got me nowhere.
This is what i have so far:
import requests
from bs4 import BeautifulSoup
import urllib3
from selenium import webdriver
requests.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}
url = "https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8"
source_code = requests.get(url, requests.headers)
soup = BeautifulSoup(source_code.text, "html.parser")
amd = soup.find_all('div', attrs = {'class': 'aviV4d'})
print(amd)
When printing 'soup' I get the whole page, but when printing 'amd' I get [].
Your code is OK, but pass the headers with the headers= keyword parameter in the requests.get() call; as written, the dict is bound to the params argument instead:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}
url = "https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8"
source_code = requests.get(url, headers=headers)
soup = BeautifulSoup(source_code.text, "html.parser")
amd = soup.find('div', attrs = {'class': 'aviV4d'})
print(amd.get_text(strip=True, separator='|').split('|')[:3])
Prints:
['Advanced Micro Devices', 'NASDAQ: AMD', '48,16']
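The pitfall in the original call is worth spelling out: `requests.get` has the signature `get(url, params=None, **kwargs)`, so a dict passed as the second positional argument becomes the query-string params, not the headers. A dependency-free sketch with a simplified stand-in for `requests.get`:

```python
def get(url, params=None, headers=None):
    """Simplified stand-in for requests.get, just to show argument binding."""
    return {"params": params, "headers": headers}

ua = {"User-Agent": "Mozilla/5.0"}
print(get("https://example.com", ua)["headers"])          # None -- ua was bound to params
print(get("https://example.com", headers=ua)["headers"])  # {'User-Agent': 'Mozilla/5.0'}
```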
That's a dynamic page; it's not going to give you the stock price just from the page source fetched via requests. You will have to drive a browser to do that. Try this instead:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--incognito")
chromedriver_path = './chromedriver'
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
driver.get("https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8")
time.sleep(2)
x = driver.find_element_by_xpath('//*[@id="knowledge-finance-wholepage__entity-summary"]/div/g-card-section/div/g-card-section/span[1]/span/span[1]')
print(x.text)
driver.quit()
Output:
48.16
I believe you need to print the tag's text. Note that find_all returns a list, so index into it first:
print(amd[0].text)

printing number from html tag in python

Hi, I have been trying to get the time data (hours, minutes, seconds) from this website: https://clockofeidolon.com. I tried to use BeautifulSoup to print the contents of the 'span class="big' tags, since the time information is kept there, and I came up with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
'66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon.com'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
spans = soup.find_all('<span class="big')
print([span.text for span in spans])
But the output only shows "[]" and nothing else. How would I go about printing the number in each of the 3 tags?
As mentioned, this can be achieved with Selenium. Once you have the correct geckodriver installed, the following should get you on the right track:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://clockofeidolon.com')
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
spans = soup.find_all(class_='big-hour')
for span in spans:
    print(span.text)
driver.quit()
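For reference, the asker's `find_all('<span class="big')` returns nothing because `find_all` expects a tag name or filter, not raw markup. If the goal is every span whose class starts with `big` (assuming class names like `big-hour`, as in the answer above), a regex filter works:

```python
import re
from bs4 import BeautifulSoup

# Miniature stand-in for the clock page's markup:
html = '<span class="big-hour">8</span><span class="big-minute">30</span>'
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span', class_=re.compile(r'^big'))
print([span.text for span in spans])  # ['8', '30']
```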

Why is this python web scraping code using requests package not working?

import lxml.html
import requests
l1=[]
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get('http://www.naukri.com/jobs-by-location', headers=headers)
html = r.content
root = lxml.html.fromstring(html)
urls = root.xpath('//div[4]/div/div[1]/div/a/@href') #This xpath should give the list of cities(their links)
l1.extend(urls)
This Python code is meant to scrape the list of job cities (their 'a href' tags) and store it in the list l1, but I am getting an empty list. The same XPath works in the Chrome console, but it's not working in this code. I added headers to make my code act like a browser, but it still isn't working.
I tried to achieve the same thing using Selenium WebDriver, and it succeeds. If it also succeeds from your computer, the problem is likely in one of the libraries you used.
import selenium.webdriver as driver
browser = driver.Chrome()
browser.get("http://www.naukri.com/jobs-by-location")
links = browser.find_elements_by_xpath("//div[4]/div/div[1]/div/a")
for link in links:
    href = link.get_attribute("href")
    print(href)
browser.quit()
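One likely culprit here: positional XPaths such as `//div[4]/div/div[1]/div/a` are copied from Chrome's rendered DOM, which scripts may have already reshaped, so they often fail to match the raw HTML that requests downloads. A sketch of anchoring on an attribute instead (the class name and markup below are made up for illustration; on the real page, inspect the served HTML via view-source to pick a stable attribute):

```python
import lxml.html

# Tiny stand-in document for illustration:
html = '<div><div class="cities"><a href="/jobs-in-delhi">Delhi</a></div></div>'
root = lxml.html.fromstring(html)
# Attribute-based XPaths survive layout changes that break positional ones:
links = root.xpath('//div[@class="cities"]/a/@href')
print(links)  # ['/jobs-in-delhi']
```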
