I'm trying to get the stock price back from a Google search, but the BS4 results run to more than 300 lines.
Here is my code:
import bs4, requests
exampleFile = requests.get('https://www.google.com/search?q=unip6')
exampleSoup = bs4.BeautifulSoup(exampleFile.text, features="html.parser")
elems = exampleSoup.select('div', {"class": 'IsqQVc NprOob'})  # note: select() ignores this dict (its second parameter is namespaces), so this matches every <div>, hence the 300+ results
print(len(elems))
for each in elems:
    print(each.getText())
    print(each.attrs)
    print('')
I'd like the outcome to be only the price: '23,85'
In this case, the page isn't loaded dynamically, so the target details can be found in the soup. It's also possible to avoid the issue of the changing class name (at least for the time being...) by not using a class selector:
for s in soup.select("div"):
    if 'Latest Trade' in s.text:
        print(s.text.split('Latest Trade. ')[1].split('BRL')[0])
        break
Output:
23.85
Using Yahoo Finance, you can try:
import pandas as pd
from datetime import datetime, timedelta
now = datetime.now() # time now
past = int((now - timedelta(days=30)).timestamp()) # 30 days ago
now = int(now.timestamp())
ticker = "UNIP6.SA" # https://finance.yahoo.com/quote/UNIP6.SA/
interval = "1d" # "1wk" , "1mo"
df = pd.read_csv(f"https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={past}&period2={now}&interval={interval}&events=history")
print(df.iloc[-1]['Close'])
# 23.85
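If you'd rather not build the download URL by hand, the yfinance package wraps the same Yahoo Finance endpoints. A minimal sketch (assumes pip install yfinance):
import yfinance as yf

# most recent daily close for the same ticker as above
ticker = yf.Ticker("UNIP6.SA")
history = ticker.history(period="1mo")  # daily rows for the last month
print(history["Close"].iloc[-1])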
It's much simpler: all you have to do is use select_one() to grab just one element, since there's no need for a for loop (use the SelectorGadget Chrome extension to grab CSS selectors):
soup.select_one('.wT3VGc').text
# 93,42
And don't forget about a user-agent to simulate a real user visit; otherwise, Google will treat your request as coming from python-requests.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=unip6', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
# 93,42
Alternatively, you can sidestep figuring out why the output has more than 300 lines by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "unip6",
}
search = GoogleSearch(params)
results = search.get_dict()
current_stock_price = results['answer_box']['price']
print(current_stock_price)
# 93,42
Disclaimer, I work for SerpApi.
Related
Hi everyone, I receive an error message when executing this code:
from bs4 import BeautifulSoup
import requests
import html.parser
from requests_html import HTMLSession
session = HTMLSession()
response = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all("tr")
for table in tables:
    movie_name = table.find("span", class_="secondaryInfo").text
    print(movie_name)
Output:
movie_name = table.find("span", class_ = "secondaryInfo").text
AttributeError: 'NoneType' object has no attribute 'text'
You selected the first row, which is the header; it doesn't have that class since it doesn't list a price. An alternative is to simply exclude the header with the CSS selector nth-child(n+2). You also only need requests.
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
for row in soup.select('tr:nth-child(n+2)'):
    movie_name = row.find("span", class_="secondaryInfo")
    print(movie_name.text)
Just use the SelectorGadget Chrome extension to grab a CSS selector by clicking on the desired element in your browser, without inventing anything superfluous. It doesn't work perfectly, however, if the HTML structure is terrible.
You're looking for this:
for result in soup.select(".titleColumn a"):
movie_name = result.text
Also, there's no need to use HTMLSession if you don't want to persist certain parameters across requests to the same host (website).
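For completeness, if you did want to persist parameters across requests, plain requests has its own Session object. A minimal sketch:
import requests

# a Session reuses the underlying connection to the same host and keeps
# headers/cookies across requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # sent with every request
response = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
print(response.status_code)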
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
# a user-agent makes the request look like a real user visit;
# this can reduce the chance (a little bit) of being blocked by a website
# and help avoid IP rate limits or a permanent ban
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht", headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for result in soup.select(".titleColumn a"):
    movie_name = result.text
    print(movie_name)
# output
'''
Eternals
Dune: Part One
No Time to Die
Venom: Let There Be Carnage
Ron's Gone Wrong
The French Dispatch
Halloween Kills
Spencer
Antlers
Last Night in Soho
'''
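To pair each title with the span.secondaryInfo value from the question, the two selectors above can be combined. A sketch reusing the same soup:
# skip the header row with nth-child(n+2), as in the earlier answer
for row in soup.select('tr:nth-child(n+2)'):
    title = row.select_one('.titleColumn a').text
    secondary = row.select_one('span.secondaryInfo').text
    print(title, secondary)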
P.S. I have a dedicated web scraping blog. If you need to parse search engines, give SerpApi a try.
Disclaimer, I work for SerpApi.
I was trying to scrape some URLs from the search results. I tried setting cookies and a user-agent such as Mozilla/5.0, and so on, but I still cannot get any URLs out of the search results. Is there any way to get this working?
from bs4 import BeautifulSoup
import requests
monitored_tickers = ['GME', 'TSLA', 'BTC']
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tools for what you're trying to accomplish. As balderman said in another comment, using a Google search API will be easier.
This code:
from googlesearch import search
tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    ticker_links = search(ticker, stop=25)
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker and append that list to another list. Yahoo Finance is sure to be among those links, and a simple keyword-based filter will pull out the Yahoo Finance URL for that specific ticker, as sketched below. You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance' for example.
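For instance, the keyword filter could look like this (a sketch, assuming the same googlesearch package as above):
from googlesearch import search

def yahoo_finance_links(ticker):
    # keep only the result URLs pointing at Yahoo Finance
    return [url for url in search(ticker + " yahoo finance", stop=25)
            if "finance.yahoo.com" in url]

print(yahoo_finance_links("TSLA"))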
Google News can easily be scraped with requests and beautifulsoup. Passing a user-agent is enough to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code and a fuller example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using the Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with visual representation.
Disclaimer, I work for SerpApi.
I was scraping Google weather search results with bs4, and Python can't find a <span> tag even though it is there. How can I solve this problem?
I tried to find this <span> with the class and the id, but both failed.
<div id="wob_dcp">
<span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span>
</div>
Above is the HTML I was trying to scrape from the page. Here is my code:
response = requests.get('https://www.google.com/search?hl=ja&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
soup = BeautifulSoup(response.content, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
But this code failed; the error is:
Traceback (most recent call last):
File "C:\Users\sungn_000\Desktop\weather.py", line 23, in <module>
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
AttributeError: 'NoneType' object has no attribute 'text'
Please solve this error.
This is because the weather section is rendered by the browser via JavaScript, so when you use requests you only get the HTML content of the page, which doesn't have what you need.
You should use, for example, selenium (a sketch follows the output below) or requests-html if you want to parse a page whose elements are rendered by the browser.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.google.com/search?hl=en&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
soup = BeautifulSoup(response.content, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
print(tomorrow_weather)
Output:
pawel#pawel-XPS-15-9570:~$ python test.py
Clear with periodic clouds
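And a rough selenium equivalent, if you prefer driving a real browser (a sketch; assumes Chrome and selenium 4 are installed, and uses a shorter hypothetical weather query for brevity):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.google.com/search?hl=en&q=weather+sakuragawa')  # hypothetical short query
print(driver.find_element(By.ID, 'wob_dc').text)  # the weather-condition span
driver.quit()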
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(a)
>>> a
'<div id="wob_dcp">\n <span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span> \n</div>'
>>> soup.find("span", id="wob_dc").text
'Clear with periodic clouds'
Try this out.
It's not rendered via JavaScript, contrary to what pawelbylina mentioned, and you don't have to use requests-html or selenium since everything needed is already in the HTML; page rendering would only slow down the scraping process a lot.
It could be because there's no user-agent specified, so Google blocks your request and you receive different HTML, with some sort of error, since the default requests user-agent is python-requests. Google recognizes it and blocks the request because it's not a "real" user visit. Check what your user-agent is.
Pass a user-agent into the request headers:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
You're looking for this, use select_one() to grab just one element:
soup.select_one('#wob_dc').text
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired elements in your browser.
Code and a full example that scrapes more fields, in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "일본 桜川市真壁町古城 내일 날씨",
    "hl": "en",
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
location = soup.select_one('#wob_loc').text
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Location: {location}\n'
      f'Weather condition: {weather_condition}\n'
      f'Temperature: {temperature}°F\n'
      f'Precipitation: {precipitation}\n'
      f'Humidity: {humidity}\n'
      f'Wind speed: {wind}\n'
      f'Current time: {current_time}\n')
------
'''
Location: Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Weather condition: Cloudy
Temperature: 79°F
Precipitation: 40%
Humidity: 81%
Wind speed: 7 mph
Current time: Saturday
'''
Alternatively, you can achieve the same thing by using the Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to think about how to bypass blocks from Google, or figure out why data from certain elements isn't extracted as it should be, since that's already done for the end user. The only thing that needs to be done is to iterate over the structured JSON and grab the data you want.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "engine": "google",
    "q": "일본 桜川市真壁町古城 내일 날씨",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
--------
'''
Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Saturday
Cloudy
79°F
40%
81%
7 mph
'''
Disclaimer, I work for SerpApi.
I also had this problem.
You should not import like this:
from bs4 import BeautifulSoup
You should import like this:
from bs4 import *
This should work.
I'm trying to make a script that scrapes the first link of a Google search, so that it gives me back only the first link and I can run a search in the terminal and look at the link later along with the search term. I'm struggling to get only the first result. This is the closest thing I've got so far:
import requests
from bs4 import BeautifulSoup
research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.find_all('a'):
    print(research_later + " :" + link.get('href'))
It seems Google uses the cite tag to hold the link, so we can just use soup.find('cite').text, like this:
import requests
from bs4 import BeautifulSoup
research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('cite').text)
Output is:
www.urbandictionary.com/define.php?term=hiya
You can use either the select_one() (CSS selectors) or find() bs4 methods to get just one element from the page. To grab CSS selectors, you can use the SelectorGadget extension.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating the .yuRUbf container, then its <a> tag, then the 'href' attribute
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://en.wikipedia.org/wiki/Ice_cream
Alternatively, you can do the same thing by using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The main difference is that everything (selecting, bypassing blocks, proxy rotation, and more) is already done for the end user, with JSON output.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index from the search results
link = results['organic_results'][0]['link']
print(link)
# https://en.wikipedia.org/wiki/Ice_cream
Disclaimer, I work for SerpApi.
I would like to parse Google search results with Python. Everything worked perfectly, but now I keep getting an empty list. Here is the code that used to work fine:
query = urllib.urlencode({'q': self.Tagsinput.GetValue()+footprint, 'ie': 'utf-8', 'num': searchresults, 'start': '100'})
result = url + query
myopener = MyOpener()
page = myopener.open(result)
xss = page.read()
soup = BeautifulSoup.BeautifulSoup(xss)
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
This script worked perfectly in December, now it stopped working.
As far as I understand the problem is in this line:
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
when I print contents the program returns an empty list: []
Please, anybody, help.
The API works a whole lot better, too: simple JSON that you can easily parse and manipulate.
import urllib, json
BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&'
url = BASE_URL + urllib.urlencode({'q' : SearchTerm.encode('utf-8')})
raw_res = urllib.urlopen(url).read()
results = json.loads(raw_res)
hit1 = results['responseData']['results'][0]
prettyresult = ' - '.join((urllib.unquote(hit1['url']), hit1['titleNoFormatting']))
At the time of writing this answer, you don't have to parse the <script> tag (for the most part) to get the output from Google Search. This can be achieved with the beautifulsoup, requests, and lxml libraries.
Code to get the title, link, and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
for container in soup.findAll('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Alternatively, you can also do this by using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),  # environment variable for API_KEY
    "engine": "google",
    "q": "minecraft",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Disclaimer, I work for SerpApi.