I have a simple script to analyze sold data on eBay (baseball trading cards). It seems to work fine for the first 4 pages, but on the 5th page it simply no longer loads the desired HTML content, and I am not able to figure out why this happens:
#Import statements
import requests
import time
from bs4 import BeautifulSoup as soup
from tqdm import tqdm
#FOR DEBUG
Page_1="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=1"
#Request URL working example
source=requests.get(Page_1)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]
#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])
#Works fine for Links_to_check[0] up to Links_to_check[3]
However, when I try to scrape the fifth page or further pages the following occurs:
Page_5="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=5"
source=requests.get(Page_5)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]
#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])
----> 5 Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
6 items=[]
7 #For all items on page perform desired operation
AttributeError: 'NoneType' object has no attribute 'find_all'
This seems a logical consequence of the ul class b-list__items_nofooter missing from the eBay_full soup for the later pages. The question, however, is why this information is missing. Scrolling through the soup, all items of interest seem to be absent, while on the webpage itself the information is, as expected, present. Can anyone guide me?
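For debugging, a minimal guard like the one below (just a sketch reusing the same URL and imports as above) avoids the crash and shows whether the listing container is present in the returned HTML at all:
import requests
from bs4 import BeautifulSoup as soup

Page_5 = "https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=5"
source = requests.get(Page_5)
print(source.status_code)  # note: 200 does not guarantee the expected markup
eBay_full = soup(source.text, "lxml")
Complete_container = eBay_full.find("ul", {"class": "b-list__items_nofooter"})
if Complete_container is None:
    # listing container missing -> most likely a blocked or alternate page
    print(eBay_full.title.text if eBay_full.title else "no <title> in response")
else:
    Single_item = Complete_container.find_all("div", {"class": "s-item__wrapper clearfix"})
    print(f"found {len(Single_item)} items")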
As per Sebastien D's remark, the problem has been solved.
In the headers variable, put just one of these browsers along with its current stable version number (e.g. Chrome/53.0.2785.143, the latest can be found here):
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
source= requests.get(Page_5, headers=headers, timeout=2)
As Sebastien D suggested, the main problem is that eBay detects that the request is sent by a bot/script.
But how does eBay know? Because the default requests user-agent is python-requests; eBay recognizes it and seems to block requests made with that user-agent.
By adding a custom user-agent we can, to some extent, fake a real user request. However, this is not completely reliable, and user-agents might need to be rotated and/or combined with proxies, ideally residential ones.
List of user-agents at whatismybrowser.
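For illustration only, here is a minimal sketch of rotating user-agents per request (the strings in USER_AGENTS and the get_page helper are placeholders, not part of the original fix):
import random
import requests

# Hypothetical pool of user-agent strings; replace with current ones from whatismybrowser
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
]

def get_page(url, params=None):
    # pick a different user-agent for every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, params=params, headers=headers, timeout=30)

response = get_page("https://www.ebay.com/sch/213/i.html",
                    params={"_from": "R40", "LH_Sold": "1", "_sop": "16", "_pgn": "5"})
print(response.status_code)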
As a side note, you can use the SelectorGadget Chrome extension to easily pick CSS selectors by clicking on the desired element in your browser. It does not always work perfectly if the page relies heavily on JavaScript (in this case it does work).
The example below shows how to extract listings from all pages. Code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    '_nkw': 'baseball trading cards',  # search query
    'LH_Sold': '1',                    # shows sold items
    '_pgn': 1                          # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
            "title": title,
            "price": price,
            "link": link
        })

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output
Extracting page: 1
----------
[
{
"title": "Shop on eBay",
"price": "$20.00",
"link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
},
{
"title": "Ken Griffey Jr. Seattle Mariners 1989 Topps Traded RC Rookie Card #41T",
"price": "$7.20",
"link": "https://www.ebay.com/itm/385118055958?hash=item59aad32e16:g:EwgAAOSwhgljI0Vm&amdata=enc%3AAQAHAAAAoFRRlvb50yb%2FN4cmlg5OtVDKIH0DsaMJBL3Tp67SI1dCSP1WPdZW3f16bTf4HTSUhX0g3OMmZSitEY3F3SVGg0%2FhSBF3ykE9X88Lo2EHuS2b23tA1kGiG92F9xyr73RLorcidserdH8tvUXhxmT4pJDnCfMAdfqtRzSIxcB6h4aDC1J1XvJ5IyRfYtWBGUQ60ykrA7mNlhH53cwZe5MiRSw%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
{
"title": "Ken Griffey Jr. 1989 Score Traded Rookie Card Gem 10 Auto Beckett 13604418",
"price": "$349.00",
"link": "https://www.ebay.com/itm/353982131344?hash=item526afaac90:g:9hQAAOSwvCpiQ5FY&amdata=enc%3AAQAHAAAAoOKm1SWvHtdNVIEqtE4m5%2B453xtvR75ZimUBLL16P0WwfJy%2BJJQ2Phd9crgAacTWlsqp9HB%2Ft0McttOjmCfyL0RDQB%2FYOWQK3hxj%2FoDRmybJRipjqb0JG2%2BCa1RhI04PN3R5wpH9vvYqefwY6JuAsPqDU0SmSk6h1h%2FQr7cfJqOmdCo0cqbwPcJ8OcvAyP07txigrDyO55XqFD7CHcSmUPA%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
{
"title": "Mike Jorgensen NY Mets MLB OF-1B 1972 Topps Baseball Card #16 Single Original",
"price": "$1.19",
"link": "https://www.ebay.com/itm/374255790865?hash=item5723622b11:g:KiwAAOSwz4ljI0G4&amdata=enc%3AAQAHAAAAoPVkKyeDZ7wbRNBwQppCcjVmLlOlY3ylPVwQyG7dfOy1UtPYhK7tRXtvn5v3M5n%2F35MS1LXLvWAioKRrMGPEPCmDoMkhdynuH3csaincrM%2F6JNwwIUFa3F%2FcylfPqnrxjJXF7cZ3ga9aCihTM6sfVJc1kzNkaBw2C2ewMyQ3ARgYpuDcUa6CMo4zBKF%2FGTj5KlZieLYywQm4dnzLCrFbtEM%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
# ...
]
Related
I'm using the following code to scrape papers from Google Scholar. I noticed that only the shortened descriptions of the papers are scraped, not the entire description. If you look at the Google Scholar search results page, only a short excerpt of the text is shown, ending with an ellipsis (...).
The scraper only captures this excerpt, leaving the rest of the information out. This happens for authors (especially when there are many), journal names, and abstracts.
Do you maybe know a solution to this? If you execute the code yourself you will see what I mean.
from bs4 import BeautifulSoup
import requests, lxml, os, json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samsung",
    "hl": "en",
}

html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# Scrape just PDF links
for pdf_link in soup.select('.gs_or_ggsm a'):
    pdf_file_link = pdf_link['href']
    print(pdf_file_link)

# JSON data will be collected here
data = []

# Container where all needed data is located
for result in soup.select('.gs_ri'):
    title = result.select_one('.gs_rt').text
    title_link = result.select_one('.gs_rt a')['href']
    publication_info = result.select_one('.gs_a').text
    snippet = result.select_one('.gs_rs').text
    cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
    related_articles = result.select_one('a:nth-child(4)')['href']

    try:
        all_article_versions = result.select_one('a~ a+ .gs_nph')['href']
    except:
        all_article_versions = None

    data.append({
        'title': title,
        'title_link': title_link,
        'publication_info': publication_info,
        'snippet': snippet,
        'cited_by': f'https://scholar.google.com{cited_by}',
        'related_articles': f'https://scholar.google.com{related_articles}',
        'all_article_versions': f'https://scholar.google.com{all_article_versions}',
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
I think I saw your code in the Scrape Google Scholar with Python blog post.
This happens because only part of each page's content is displayed in the search results, mostly the part related to your search query or a snippet written in advance.
Therefore, it makes no sense to display the full text in search results. If you are still interested in the full text, you can follow each of the links and scrape the information you need, as sketched below. But keep in mind that each site uses its own selectors, so the script will have to be adapted per site.
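A rough, hedged sketch of that follow-up step (the fetch_full_text helper below is only an illustration; joining all paragraph text is a crude heuristic, and real publisher pages will need their own selectors):
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

def fetch_full_text(url):
    # crude heuristic: join all <p> text on the landing page
    try:
        page = requests.get(url, headers=headers, timeout=30)
        page_soup = BeautifulSoup(page.text, "lxml")
        return " ".join(p.get_text(" ", strip=True) for p in page_soup.select("p"))
    except requests.RequestException:
        return None

# e.g. enrich every result collected earlier:
# for entry in data:
#     entry["full_text"] = fetch_full_text(entry["title_link"])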
I was trying to scrape some URLs from the search results, and I tried including both a cookies setting and a user-agent such as Mozilla/5.0 and so on. I still cannot get any URLs from the search results. Is there a way I can get this working?
from bs4 import BeautifulSoup
import requests
monitored_tickers = ['GME', 'TSLA', 'BTC']
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tools for what you're trying to accomplish. As balderman said in another comment, using the Google Search API will be easier.
This code:
from googlesearch import search
tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    ticker_links = search(ticker, stop=25)
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker and append that list to another list. Yahoo Finance is sure to be in that list of links, and a simple keyword-based filter will get the Yahoo Finance URL for that specific ticker. You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance' for example.
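For example, a minimal sketch of that keyword filter (assuming the same googlesearch package and search() signature used in the snippet above):
from googlesearch import search

tickers = ['GME', 'TSLA', 'BTC']

yahoo_links = {}
for ticker in tickers:
    # search the ticker together with "yahoo finance" and keep only Yahoo Finance URLs
    results = search(f"{ticker} yahoo finance", stop=25)
    yahoo_links[ticker] = [url for url in results if "finance.yahoo.com" in url]

print(yahoo_links)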
Google News can easily be scraped with requests and beautifulsoup. Setting a user-agent is enough to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code and example that scrape more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "yahoo finance BTC",
"hl": "en",
"gl": "us",
"tbm": "nws",
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "coca cola",
"tbm": "nws",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with a visual representation.
Disclaimer, I work for SerpApi.
I am trying to get data from a website using BeautifulSoup but I get an empty list. I also tried with "html.parser" but it does not help either. Please help me find a solution. Thank you very much.
My code:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.empireonline.com/movies/features/best-movies-2/")
movies_webpage = response.text
soup = BeautifulSoup(movies_webpage, "html.parser")
all_movies = soup.find_all(name="h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
Output:
[]
What happens?
The response does not contain the h3 elements because the content of the website is served dynamically.
How to fix?
You can use the JSON information embedded in the response, or use selenium to request the site and get the content as expected.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.empireonline.com/movies/features/best-movies-2/'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
all_movies = soup.find_all("h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
For this question you don't need to use selenium, because it slows down the scraping process; it is enough to use BeautifulSoup with regular expressions.
The movie list data is located in the page source, in inline JSON.
In order to extract data from inline JSON you need:
open the page source with CTRL + U;
find the data (title, name, etc.) with CTRL + F;
use a regular expression to extract parts of the inline JSON:
# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall("\"Author:42821\"\:{(.*)\"Article:54591\":", str(all_script))
retrieve the list of movies directly:
# https://regex101.com/r/jRgmKA/1
movie_list = re.findall("\"titleText\"\:\"(.*?)\"", str(portion_of_script))
We can also get the snippet and image using CSS selectors because they are not rendered with JavaScript. You can use the SelectorGadget Chrome extension to define CSS selectors.
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, re, json, lxml
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
html = requests.get("https://www.empireonline.com/movies/features/best-movies-2/", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
movie_data = []
movie_snippets = []
movie_images = []
all_script = soup.select("script")
# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall("\"Author:42821\"\:{(.*)\"Article:54591\":", str(all_script))
# https://regex101.com/r/jRgmKA/1
movie_list = re.findall("\"titleText\"\:\"(.*?)\"", str(portion_of_script))
for snippets in soup.select(".listicle-item"):
    movie_snippets.append(snippets.select_one(".listicle-item-content p:nth-child(1)").text)

for image in soup.select('.image-container img'):
    movie_images.append(f"https:{image['data-src']}")

# [1:] exclude first unnecessary result
for movie, snippet, image in zip(movie_list, movie_snippets, movie_images[1:]):
    movie_data.append({
        "movie_list": movie,
        "movie_snippet": snippet,
        "movie_image": image
    })
print(json.dumps(movie_data, indent=2, ensure_ascii=False))
Example output:
[
{
"movie_list": "11) Star Wars",
"movie_snippet": "George Lucas' cocktail of fantasy, sci-fi, Western and World War II movie remains as culturally pervasive as ever. It's so mythically potent, you sense in time it could become a bona-fide religion...",
"movie_image": "https://images.bauerhosting.com/legacy/media/619d/b9f5/3ebe/477b/3f9c/e48a/11%20Star%20Wars.jpg?q=80&w=500"
},
{
"movie_list": "10) Goodfellas",
"movie_snippet": "Where Coppola embroiled us in the politics of the Mafia elite, Martin Scorsese drew us into the treacherous but seductive world of the Mob's foot soldiers. And its honesty was as impactful as its sudden outbursts of (usually Joe Pesci-instigated) violence. Not merely via Henry Hill's (Ray Liotta) narrative, but also Karen's (Lorraine Bracco) perspective: when Henry gives her a gun to hide, she admits, \"It turned me on.\"",
"movie_image": "https://images.bauerhosting.com/legacy/media/619d/ba59/5165/43e0/333b/7c6f/10%20Goodfellas.jpg?q=80&w=500"
},
{
"movie_list": "9) Raiders Of The Lost Ark",
"movie_snippet": "In '81, it must have sounded like the ultimate pitch: the creator of Star Wars teams up with the director of Jaws to make a rip-roaring, Bond-style adventure starring the guy who played Han Solo, in which the bad guys are the evillest ever (the Nazis) and the MacGuffin is a big, gold box which unleashes the power of God. It still sounds like the ultimate pitch.",
"movie_image": "https://images.bauerhosting.com/legacy/media/619d/bb13/f590/5e77/c706/49ac/9%20Raiders.jpg?q=80&w=500"
},
# ...
]
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.
I was web-scraping a Google weather search with bs4, and Python can't find a <span> tag even though it is there. How can I solve this problem?
I tried to find this <span> by its class and its id, but both failed.
<div id="wob_dcp">
<span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span>
</div>
Above is the HTML I was trying to scrape from the page:
response = requests.get('https://www.google.com/search?hl=ja&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
soup = BeautifulSoup(response.content, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
But failed with this code, the error is:
Traceback (most recent call last):
File "C:\Users\sungn_000\Desktop\weather.py", line 23, in <module>
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
AttributeError: 'NoneType' object has no attribute 'text'
Please solve this error.
This is because the weather section is rendered by the browser via JavaScript, so when you use requests you only get the HTML content of the page, which doesn't have what you need.
You should use, for example, selenium (or requests-html) if you want to parse pages with elements rendered by the web browser.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.google.com/search?hl=en&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
soup = BeautifulSoup(response.content, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
print(tomorrow_weather)
Output:
pawel#pawel-XPS-15-9570:~$ python test.py
Clear with periodic clouds
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(a)
>>> a
'<div id="wob_dcp">\n <span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span> \n</div>'
>>> soup.find("span", id="wob_dc").text
'Clear with periodic clouds'
Try this out.
Contrary to what pawelbylina mentioned, it's not rendered via JavaScript, and you don't have to use requests-html or selenium since everything needed is already in the HTML; page rendering would only slow down the scraping process a lot.
It could be because no user-agent is specified, so Google blocks your request and you receive different HTML with some sort of error, since the default requests user-agent is python-requests. Google recognizes it and blocks the request because it's not a "real" user visit. Check what your user-agent is.
Pass a user-agent into the request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
You're looking for this, use select_one() to grab just one element:
soup.select_one('#wob_dc').text
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired elements in your browser.
Code and full example that scrapes more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "일본 桜川市真壁町古城 내일 날씨",
"hl": "en",
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
location = soup.select_one('#wob_loc').text
weather_condition = soup.select_one('#wob_dc').text
tempature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Location: {location}\n'
      f'Weather condition: {weather_condition}\n'
      f'Temperature: {tempature}°F\n'
      f'Precipitation: {precipitation}\n'
      f'Humidity: {humidity}\n'
      f'Wind speed: {wind}\n'
      f'Current time: {current_time}\n')
------
'''
Location: Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Weather condition: Cloudy
Temperature: 79°F
Precipitation: 40%
Humidity: 81%
Wind speed: 7 mph
Current time: Saturday
'''
Alternatively, you can achieve the same thing by using the Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to think about how to bypass blocks from Google or figure out why data from certain elements isn't extracted as it should be, since that's already done for the end user. The only thing that needs to be done is to iterate over structured JSON and grab the data you want.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "일본 桜川市真壁町古城 내일 날씨",
"api_key": os.getenv("API_KEY"),
"hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
--------
'''
Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Saturday
Cloudy
79°F
40%
81%
7 mph
'''
Disclaimer, I work for SerpApi.
I also had this problem.
You should not import like this
from bs4 import BeautifulSoup
you should import like this
from bs4 import *
This should work.
I'm trying to grab all relevant links that show up on the results page for any given query using bs4, and then open them in a new window.
The problem is, I'm not getting the relevant links. For any given query, my script returns links to things like Gmail, Google Images, etc., not links relevant to the query.
#!/usr/bin/python3
import webbrowser as wb
import requests
import bs4 as bs
search=input()
url="https://www.google.ae/?gfe_rd=cr&ei=mgSoWKmWO-aG7gTgmJ2QDA&gws_rd=ssl#q="+search
#print(url)
user_agent = {'User-Agent': 'Mozilla/5.0'}
#headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req=requests.get(url,headers=user_agent)
soup=bs.BeautifulSoup(req.text,"lxml")
print(req.status_code)
count=0
for link in soup.find_all("a"):
    print(link.get("href"))
    if search in link.text:
        wb.open(link.get("href"))
I tried changing my user-agent to a really old one in the hopes that google might revert to html, but no such luck with that.
I know it's possible to retrieve links with the Google Search API, but I'm curious to know if there's any way I can get the job done with bs4 instead.
You can use the google package which gives intuitive access to the search results of google.
from google import search
for result in search('example'):
    print(result)
It was returning random links because you were extracting all <a> tags from HTML in a for loop:
for link in soup.find_all("a"):
    # returns all <a> tags from the HTML
Instead, you're looking for this specific <a> tag from the "organic results part" in the HTML:
# container with needed data
for result in soup.select('.tF2Cxc'):
    # extracting title from container above
    title = result.select_one('.DKV0Md').text
    # extracting link from container above
    link = result.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "samurai cop what does katana mean", # query
"gl": "us", # country to search from
"hl": "en", # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
-----------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with such problems since it's done for the end-user, and pretty much the only thing that needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
"engine": "google",
"q": "samurai cop what does katana mean",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
-----------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
'''
Disclaimer, I work for SerpApi.