BeautifulSoup.select Method - python

This script is supposed to take a command-line string, run it through the Google search engine, and, if results are found, open the first 5 in different tabs. I am having some issues trying to get it to work. I think the problem is happening towards the bottom, at the line links = soup.select(".r a"): when I alter the selector, the next line sometimes prints an actual length, but running it as-is shows the length is still 0. I am trying to scrape the .r class and a tag because that seems to be where the search results start in the Google result source code.
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/#q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
    webbrowser.open("https://google.com/#q=" + links[i].get("href"))

Your logic is right, except the URL for a Google search is not right. It should be:
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
...
for i in range(numOpen):
    webbrowser.open("https://www.google.com" + links[i].get("href"))
Here is the full code:
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
    webbrowser.open("https://www.google.com" + links[i].get("href"))

You are right! The problem results from select(".r a").
I suggest you try find_all('a', {"data-uch": 1}), which will find all a tags with the attribute data-uch=1 (a sketch follows the explanation below).
Explanation:
"If you look up a little from the <a> element, though, there is an element like this: <h3 class="r">. Looking through the rest of the HTML source, it looks like the r class is used only for search result links."
The sentence above is from the book. In reality, however, if you print the soup variable (soup = bs4.BeautifulSoup(response.text, "html.parser")), you will not find any <h3 class="r"> in the HTML source code. That is why print(len(links)) always shows 0.
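As a rough sketch of how that suggestion could slot into the original script (a sketch only, assuming Google still serves a data-uch attribute on result anchors, which is worth verifying against the HTML you actually receive):
import sys
import webbrowser

import bs4
import requests

print("Googling...")
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
response.raise_for_status()

soup = bs4.BeautifulSoup(response.text, "html.parser")
# Select result anchors by attribute rather than the .r class,
# which may not appear in the HTML served to requests.
links = soup.find_all("a", {"data-uch": 1})

# Open up to 5 results; hrefs are relative /url?q= paths
for link in links[:5]:
    webbrowser.open("https://www.google.com" + link.get("href"))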

Instead of using min(5, len(links)) you can use slicing:
links = soup.select('.r a')[:5]
# or
for i in soup.select('.r a')[:5]:
    # other code..
Also, you can use the find_all() limit argument.
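For example, limit makes find_all() stop after the first n matches instead of collecting every one:
# grab at most 5 anchor tags, in document order
links = soup.find_all('a', limit=5)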
Make sure you're using a user-agent, because the default requests user-agent is python-requests; Google blocks such a request because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. A user-agent fakes a user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en",
    "num": "100"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to pick the correct selector or how to bypass blocks from search engines since it's already done for the end-user. All that really needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])
---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer, I work for SerpApi.

Related

Scrape urls from search results in Python and BeautifulSoup

I was trying to scrape some URLs from the search results, and I tried including both a cookies setting and a user-agent such as Mozilla/5.0, and so on. I still cannot get any URLs from the search results. Is there any solution to get this working?
from bs4 import BeautifulSoup
import requests
monitored_tickers = ['GME', 'TSLA', 'BTC']
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tools for what you're trying to accomplish. As balderman said in another comment, using a Google search API will be easier.
This code:
from googlesearch import search
tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    ticker_links = search(ticker, stop=25)
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker, and append that list into another list. Yahoo Finance is sure to be in that list of links, and a simple keyword-based parser will get the Yahoo Finance URL for that specific ticker. You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance' for example, as sketched below.
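A minimal sketch of that adjustment, assuming the same googlesearch package as above (search() takes a plain query string, and stop=25 caps the results per query):
from googlesearch import search

tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    # bias the query toward Yahoo Finance pages for this ticker
    ticker_links = search(ticker + ' yahoo finance', stop=25)
    links_list.append(ticker_links)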
Google News can be easily scraped with requests and beautifulsoup. It is enough to use a user-agent to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code and example that scrape more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with visual representations.
Disclaimer, I work for SerpApi.

How to print Google Search results properly with bs4?

I have working code that first prints the search titles and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like the following, and avoid printing the same URLs 10 times each:
1) Title url
2) Title url
and so on...
My code:
search = input("Search:")
page = requests.get(f"https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
heading_object = soup.find_all('h3')
for info in heading_object:
    x = info.getText()
    print(x)
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href:
        y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
        print(y)
If you get the titles and links separately then you can use zip() to group them into pairs:
for info, link in zip(heading_object, links):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(info, link)
But this may have a problem when some title or link doesn't exist on the page, because it will then create wrong pairs: it will pair a title with the link for the next element. You should rather search for elements which hold both the title and the link, and inside every such element search for the single title and single link to create the pair. If there is no title or link, you can put some default value, and it will not create wrong pairs; see the sketch after this paragraph.
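A minimal sketch of that container-first approach, assuming each organic result sits inside a div with class "g" (Google's result container class at the time; verify against the HTML you actually receive):
for result in soup.find_all('div', class_='g'):
    heading = result.find('h3')           # the single title inside this container
    anchor = result.find('a', href=True)  # the single link inside this container
    title = heading.getText() if heading else 'no title'
    link = anchor['href'] if anchor else 'no link'
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(title, link)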
You're looking for this:
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n') # prints TITLE, URL followed by a new line.
If you're using an f-string, then the appropriate way is to use it like so:
page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}") # proper f-string
Code:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "python memes",
    "hl": "en"
}
soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
One of the differences is that you only need to iterate over JSON rather than figuring out how to scrape stuff.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Disclaimer, I work for SerpApi.

How to extract the description in a Google search using Python?

I want to extract the description from a Google search. Right now I have this code:
from urlparse import urlparse, parse_qs
import urllib
from lxml.html import fromstring
from requests import get
url='https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
v=[]
for result in pg.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print url[0]
That extracts the URLs related to the search; how can I extract the description that appears under each URL?
You can scrape the Google search description text using the BeautifulSoup web scraping library.
To collect information from all pages you can use pagination with a while True loop. The while loop is an endless loop; the exit from it, in our case, is triggered by the absence of the button that switches to the next page, namely the CSS selector ".d6cvqb a[id=pnnext]":
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
You can use CSS selectors to find all the information you need (description, title, etc.), which is easy to identify on the page using the SelectorGadget Chrome extension (it does not always work perfectly if the website is rendered via JavaScript).
Make sure you're using a user-agent in the request headers to act as a "real" user visit, because the default requests user-agent is python-requests and websites understand that it's most likely a script sending the request. Check what your user-agent is; one quick way is shown below.
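A quick way to see the default user-agent that requests sends (httpbin.org is a public service that echoes request headers back; any similar echo endpoint works):
import requests

# httpbin returns the headers it received as JSON
print(requests.get('https://httpbin.org/headers').json()['headers']['User-Agent'])
# e.g. python-requests/2.28.1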
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",  # query
    "hl": "en",     # language
    "gl": "us",     # country of the search, US -> USA
    "start": 0,     # page number, starts from 0
    # "num": 100    # maximum number of results to return
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
page_num = 0
website_data = []
while True:
    page_num += 1
    print(f"page: {page_num}")
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        try:
            description = result.select_one(".lEBKkf").text
        except:
            description = None
        website_data.append({
            "website_name": website_name,
            "description": description
        })
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "website_name": "https://www.imdb.com/title/tt3749900/",
    "description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "website_name": "https://www.netflix.com/watch/80023082",
    "description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
  },
  {
    "website_name": "https://www.gothamknightsgame.com/",
    "description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
  },
  # ...
]
Or you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google; there is no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "gotham",                    # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}
search = GoogleSearch(params)  # where data extraction happens
organic_results_data = []
page_num = 0
while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })
    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
  {
    "title": "Gotham (TV Series 2014–2019) - IMDb",
    "snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "title": "Gotham (TV series) - Wikipedia",
    "snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
  },
  # ...
]

Unable to retrieve links off google search results page using BeautifulSoup

I'm trying to grab all relevant links that show up on the results page for any given query using bs4, and then open them up on a new window.
The problem is, I'm not getting the relevant links. For any given query, my script returns links to things like gmail, google images, etc -- not links relevant to the query.
#!/usr/bin/python3
import webbrowser as wb
import requests
import bs4 as bs
search=input()
url="https://www.google.ae/?gfe_rd=cr&ei=mgSoWKmWO-aG7gTgmJ2QDA&gws_rd=ssl#q="+search
#print(url)
user_agent = {'User-Agent': 'Mozilla/5.0'}
#headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req=requests.get(url,headers=user_agent)
soup=bs.BeautifulSoup(req.text,"lxml")
print(req.status_code)
count=0
for link in soup.find_all("a"):
    print(link.get("href"))
    if search in link.text:
        wb.open(link.get("href"))
I tried changing my user-agent to a really old one in the hopes that google might revert to html, but no such luck with that.
I know it's possible to retrieve links with the Google search API, but I'm curious to know whether there's any way I can get the job done with bs4 instead.
You can use the google package which gives intuitive access to the search results of google.
from google import search
for result in search('example'):
    print(result)
It was returning random links because you were extracting all <a> tags from HTML in a for loop:
for link in soup.find_all("a"):
    # returns all <a> tags from the HTML
Instead, you're looking for this specific <a> tag from the "organic results part" in the HTML:
# container with needed data
for result in soup.select('.tF2Cxc'):
    # extracting title from container above
    title = result.select_one('.DKV0Md').text
    # extracting link from container above
    link = result.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "samurai cop what does katana mean",  # query
    "gl": "us",                                # country to search from
    "hl": "en",                                # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
-----------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with such problems since it's done for the end-user, and pretty much the only thing that needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
-----------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
'''
Disclaimer, I work for SerpApi.

Not Getting proper links from google search results using mechanize and Beautifulsoup

I am using the following snippet to get links from the google search results for the "keyword" I give.
import mechanize
from bs4 import BeautifulSoup
import re
def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')]
    br.open('http://www.google.com/')
    # do the query
    br.select_form(name='f')
    br.form['q'] = 'scrapy'  # query
    data = br.submit()
    soup = BeautifulSoup(data.read())
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']

googlesearch()
Since I am parsing the search results HTML page to get links, it's getting all the 'a' tags. But what I need is to get only the links for the results. Another thing: when you see the output of the href attribute, it gives something like this
Found the URL:
/search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ
But the actual link present in the href attribute is http://scrapy.org/.
Can anyone point me to the solution for the two questions mentioned above?
Thanks in advance
Get only the links for the results
The links you're interested in are inside the h3 tags (with r class):
<li class="g">
    <h3 class="r">
        <a href="/url?q=http://scrapy.org/&sa=U&ei=XdIUU8DOHo-ElAXuvIHQDQ&ved=0CBwQFjAA&usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
            <b>Scrapy</b> | An open source web scraping framework for Python
        </a>
    </h3>
    ...
You can find the links using css selector:
soup.select('.r a')
Get the actual link
URLs are in the following format:
/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
     ^^^^^^^^^^^^^^^^^^^^
Actual url is in the q parameter.
To get the entire query string, use urlparse.urlparse:
>>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
>>> urlparse.urlparse(url).query
'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
['http://scrapy.org/']
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
'http://scrapy.org/'
Final result
for a in soup.select('.r a'):
    print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]
output:
http://scrapy.org/
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://doc.scrapy.org/
http://scrapy.org/download/
http://doc.scrapy.org/en/latest/intro/overview.html
http://scrapy.org/doc/
http://scrapy.org/companies/
https://github.com/scrapy/scrapy
http://en.wikipedia.org/wiki/Scrapy
http://www.youtube.com/watch?v=1EFnX1UkXVU
https://pypi.python.org/pypi/Scrapy
http://pypix.com/python/build-website-crawler-based-upon-scrapy/
http://scrapinghub.com/scrapy-cloud
Or you could use https://code.google.com/p/pygoogle/, which basically does the same thing, and you can get links to results as well (a sketch of the calls follows the sample output below).
A snippet of output from a sample query for 'stackoverflow':
*Found 3940000 results*
[Stack Overflow]
Stack Overflow is a question and answer site for professional and enthusiast
programmers. It's 100% free, no registration required. Take the 2-minute tour
http://stackoverflow.com/
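A rough sketch of the pygoogle calls that produce output like the above; the library is long unmaintained, so treat the exact attribute and method names here as assumptions recalled from its old README rather than a tested API (Python 2, matching the code in this thread):
from pygoogle import pygoogle

g = pygoogle('stackoverflow')
g.pages = 1                                  # number of result pages to fetch (assumed attribute)
print '*Found %s results*' % g.get_result_count()
g.display_results()                          # prints title, description and URL per result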
In your code example you were extracting all <a> tags from the HTML, not only related to organic results:
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']
You're looking for this to grab links from organic results only:
# container with needed data: title, link, etc.
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to build everything from scratch, bypass blocks, or maintain the parser over time.
Code to integrate to achieve your goal:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "minecraft",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.
