How to scrape Google News article content from the Google News RSS? - python

In the future (maybe still far away, since I'm still a novice) I want to do data analysis based on the content of the news items I get from the Google News RSS feed, but for that I need access to that content, and that is my problem.
Using the URL "https://news.google.cl/news/rss" I have access to data like the title and the URL of each news item, but the URL is in a format that does not let me scrape the article (https://news.google.com/__i/rss/rd/articles/CBMilgFod...).
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup as soup

news_url = "https://news.google.cl/news/rss"
Client = urlopen(news_url)
xml_page = Client.read()
Client.close()

soup_page = soup(xml_page, "xml")
news_list = soup_page.findAll("item")

for news in news_list:
    print(news.title.text)
    print("-" * 60)
    response = urllib.request.urlopen(news.link.text)
    html = response.read()
    article_soup = soup(html, "html.parser")  # separate name, so the `soup` alias is not overwritten
    text = article_soup.get_text(strip=True)
    print(text)
The last print(text) prints some code like:
if(typeof bbclAM === 'undefined' || !bbclAM.isAM()) {
    googletag.display('div-gpt-ad-1418416256666-0');
} else {
    document.getElementById('div-gpt-ad-1418416256666-0').style.display = 'none'
}
});(function(s, p, d) {
    var h=d.location.protocol, i=p+"-"+s,
        e=d.getElementById(i), r=d.getElementById(p+"-root"),
        u=h==="https:"?"d1z2jf7jlzjs58.cloudfront.net"
                      :"static."+p+".com";
    if (e) return;
I expect to print the title and the content of each news item from the RSS feed.

This script can give you something to start with (it prints the title, URL, short description, and content from the site). Parsing the content from the site is done in a basic form only - each site has a different format/styling, etc.:
import textwrap
import requests
from bs4 import BeautifulSoup

news_url = "https://news.google.cl/news/rss"
rss_text = requests.get(news_url).text
soup_page = BeautifulSoup(rss_text, "xml")

def get_items(soup):
    for news in soup.findAll("item"):
        s = BeautifulSoup(news.description.text, 'lxml')
        a = s.select('a')[-1]
        a.extract()  # extract the last 'See more on Google News..' link

        html = requests.get(news.link.text)
        soup_content = BeautifulSoup(html.text, "lxml")

        # perform basic sanitization:
        for t in soup_content.select('script, noscript, style, iframe, nav, footer, header'):
            t.extract()

        yield news.title.text.strip(), html.url, s.text.strip(), str(soup_content.select_one('body').text)
width = 80

for (title, url, shorttxt, content) in get_items(soup_page):
    title = '\n'.join(textwrap.wrap(title, width))
    url = '\n'.join(textwrap.wrap(url, width))
    shorttxt = '\n'.join(textwrap.wrap(shorttxt, width))
    content = '\n'.join(textwrap.wrap(textwrap.shorten(content, 1024), width))

    print(title)
    print(url)
    print('-' * width)
    print(shorttxt)
    print()
    print(content)
    print()
Prints:
WWF califica como inaceptable y condenable adulteración de información sobre
salmones de Nova Austral - El Mostrador
https://m.elmostrador.cl/dia/2019/06/30/wwf-califica-como-inaceptable-y-
condenable-adulteracion-de-informacion-sobre-salmones-de-nova-austral/
--------------------------------------------------------------------------------
El MostradorLa organización pide investigar los centros de cultivo de la
salmonera de capitales noruegos y abrirá un proceso formal de quejas. La empresa
ubicada en la ...
01:41:28 WWF califica como inaceptable y condenable adulteración de información
sobre salmones de Nova Austral - El Mostrador País PAÍS WWF califica como
inaceptable y condenable adulteración de información sobre salmones de Nova
Austral por El Mostrador 30 junio, 2019 La organización pide investigar los
centros de cultivo de la salmonera de capitales noruegos y abrirá un proceso
formal de quejas. La empresa ubicada en la Patagonia chilena es acusada de
falsear información oficial ante Sernapesca. 01:41:28 Compartir esta Noticia
Enviar por mail Rectificar Tras una investigación periodística de varios meses,
El Mostrador accedió a abundante información reservada, que incluye correos
electrónicos de la gerencia de producción de la compañía salmonera Nova Austral
–de capitales noruegos– a sus jefes de área, donde se instruye manipular las
estadísticas de mortalidad de los salmones para ocultar las verdaderas cifras a
Sernapesca –la entidad fiscalizadora–, a fin de evitar multas y ver disminuir
las [...]
...and so on.

Clone this project:
git clone git@github.com:philipperemy/google-news-scraper.git gns
cd gns
sudo pip install -r requirements.txt
python main_no_vpn.py
Output will be:
{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
},
{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
}
Source: the google-news-scraper project on GitHub.

In order to access data such as the title and others, you first need to collect all the news items in a list. Each news item is located in an item tag, and the item tags sit inside the channel tag. So let's use this selector:
soup.channel.find_all('item')
After that, you can extract the necessary data for each news item.
for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source = result.source.get("url")

    print(title, link, date, source, sep='\n', end='\n\n')
Also, make sure you pass a user-agent in the request headers to act like a "real" user visit. The default requests user-agent is python-requests, and websites understand that it's most likely a script sending the request. Check what your user-agent is.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "hl": "en-US",   # language
    "gl": "US",      # country of the search, US -> USA
    "ceid": "US:en",
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://news.google.com/rss", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "xml")

for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source = result.source.get("url")

    print(title, link, date, source, sep='\n', end='\n\n')
Output:
UK and Europe Heat Wave News: Live Updates - The New York Times
https://news.google.com/__i/rss/rd/articles/CBMiRGh0dHBzOi8vd3d3Lm55dGltZXMuY29tL2xpdmUvMjAyMi8wNy8xOS93b3JsZC91ay1ldXJvcGUtaGVhdC13ZWF0aGVy0gEA?oc=5
Tue, 19 Jul 2022 11:56:58 GMT
https://www.nytimes.com
... other results
Another way to achieve the same thing is to scrape Google News from the HTML instead.
I want to demonstrate how to scrape Google News using pagination. One of the ways is to use the start URL parameter, which is 0 by default: 0 means the first page, 10 the second, and so on.
Also, a default search returns only about ~10-15 pages of results. To increase the number of returned pages, you need to set the filter parameter to 0 and pass it in the URL, which will return 10+ pages. Basically, this parameter defines the filters for Similar Results and Omitted Results.
While the next button exists, you need to increment the ["start"] parameter value by 10 to access the next page; otherwise, you need to break out of the while loop.
And here is the code:
from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Elon Musk",
    "hl": "en-US",  # language
    "gl": "US",     # country of the search, US -> USA
    "tbm": "nws",   # google news
    "start": 0,     # page number, starts from 0
    # "filter": 0   # shows more than 10 pages. By default up to ~10-15 if filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0

while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".WlydOe"):
        source = result.select_one(".NUnG9d").text
        title = result.select_one(".mCBkyc").text
        link = result.get("href")
        try:
            snippet = result.select_one(".GI74Re").text
        except AttributeError:
            snippet = None
        date = result.select_one(".ZE0LJd").text

        print(source, title, link, snippet, date, sep='\n', end='\n\n')

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Output:
1 page:
BuzzFeed News
Elon Musk’s Viral Shirtless Photos Have Sparked A Conversation Around
Body-Shaming After Some People Argued That He “Deserves” To See The Memes
Mocking His Physique
https://www.buzzfeednews.com/article/leylamohammed/elon-musk-shirtless-yacht-photos-memes-body-shaming
None
18 hours ago
People
Elon Musk Soaks Up Sun While Spending Time with Pals Aboard Luxury Yacht in
Greece
https://people.com/human-interest/elon-musk-spends-time-with-friends-aboard-luxury-yacht-in-greece/
None
2 days ago
New York Post
Elon Musk jokes shirtless pictures in Mykonos are 'good motivation' to hit
gym
https://nypost.com/2022/07/21/elon-musk-jokes-shirtless-pics-in-mykonos-are-good-motivation/
None
14 hours ago
... other results from the 1st and subsequent pages.
10 page:
Vanity Fair
A Reminder of Just Some of the Terrible Things Elon Musk Has Said and Done
https://www.vanityfair.com/news/2022/04/elon-musk-twitter-terrible-things-hes-said-and-done
... yesterday's news with “shock and dismay,” a lot of people are not
enthused about the idea of Elon Musk buying the social media network.
Apr 26, 2022
CNBC
Elon Musk is buying Twitter. Now what?
https://www.cnbc.com/2022/04/27/elon-musk-just-bought-twitter-now-what.html
Elon Musk has finally acquired Twitter after a weekslong saga during which
he first became the company's largest shareholder, then offered...
Apr 27, 2022
New York Magazine
11 Weird and Upsetting Facts About Elon Musk
https://nymag.com/intelligencer/2022/04/11-weird-and-upsetting-facts-about-elon-musk.html
3. Elon allegedly said some pretty awful things to his first wife · While
dancing at their wedding reception, Musk told Justine, “I am the alpha...
Apr 30, 2022
... other results from 10th page.
If you need more information about Google News, have a look at the Web Scraping Google News with Python blog post.

Related

Python Web Scraping: Get HTML links from within a specific Div and from sub-pages also

I need to scrape data from a link. The required data is hidden within another link on the webpage.
The webpage I am working on is similar to this link - College List. Say I need to get data about each college listed on this site. First, I land on this page. Then I extract all relevant links on this page and on subsequent pages. Then I go to each link and get the relevant data.
I am not able to get the desired list of links, and I don't know how to go to the next page and do the same thing.
What I have tried so far is -
import requests
import lxml.html as lh

url = 'https://www.indiacollegeshub.com/colleges/'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//*[@id="ContentPlaceHolder1_pnl_collegelist"]/ul/li[1]')

col = []
for t in tr_elements[0]:  # starting from 2nd row for column headers
    name = t.text_content()
    col.append(name)

print(col)  # gives me string values and not links
print(tr_elements[0].xpath('//a/@href'))  # Gives me all links. I need links within the div [@id="ContentPlaceHolder1_pnl_collegelist"] only.
I am not able to get the required link list by page. I think there are some 2K+ pages in this site.
Thanks in advance.
I used Beautiful Soup to scrape the site.
import requests
from bs4 import BeautifulSoup

data = []
url = f"https://www.indiacollegeshub.com/colleges/"
print(f"Scraping {url} ...")

page = requests.get(url)
page.raise_for_status()

soup = BeautifulSoup(page.content, "html.parser")
table = soup.find("div", class_="clg-lists").find("ul")
assert table, "table not found"

for item in table.find_all("a", href=True):
    data.append({
        "link": item["href"],
        "text": item.text.strip(),
    })

print(data)

# Returns in format
# [
#     {
#         "link": "https://www.indiacollegeshub.com/colleges/iifa-lancaster-degree-college-bangalore.aspx",
#         "text": "IIFA Lancaster Degree College, Bangalore\n#5,14/2,Suvarna Jyothi Layout,\xa0Jnanabharathi post, Nagadevanahalli, Bangalore - Karnataka, India\nPhone : +91 9845984211,+91 7349241005, Landline No:08023241999",
#     }, {
#         "link": "https://www.indiacollegeshub.com/colleges/iifa-multimedia-bangalore.aspx",
#         "text": "IIFA Multimedia, Bangalore\n#262 80 feet main road srinivasa nagar,\xa09th main corner, Bangalore - Karnataka, India\nPhone : 080 48659176, +91 7349241004,+91 9845006824",
#     },
#     ...
# ]
Outputs:
Scraping https://www.indiacollegeshub.com/colleges/ ...
[{'link': 'https://www.indiacollegeshub.com/colleges/iifa-lancaster-degree-college-bangalore.aspx', 'text': 'IIFA Lancaster Degree College, Bangalore\n#5,14/2,Suvarna Jyothi Layout,\xa0Jnanabharathi post, Nagadevanahalli, Bangalore - Karnataka, India\nPhone : +91 9845984211,+91 7349241005, Landline No:08023241999'}, {'link': 'https://www.indiacollegeshub.com/colleges/iifa-multimedia-bangalore.aspx', 'text': 'IIFA Multimedia, Bangalore\n#262 80 feet main road srinivasa nagar,\xa09th main corner, Bangalore - Karnataka, India\nPhone : 080 48659176, +91 7349241004,+91 9845006824'}, {'link': 'https://www.indiacollegeshub.com/colleges/3-berhampur-college-berhampur.aspx', 'text': '+3 Berhampur College, Berhampur\nRaj Berhampur Berhampur - Orissa, India\nPhone : N/A'}, {'link': 'https://www.indiacollegeshub.com/colleges/3-panchayat-samiti-mahavidyalaya-balangir.aspx', 'text': '+3 Panchayat Samiti Mahavidyalaya, Balangir\nGyana Vihar Deogaon Balangir - Orissa, India\nPhone : N/A'}, {'link': 'https://www.indiacollegeshub.com/colleges/21st-century-international-school-trust-sivagangai.aspx', 'text': '21St Century International School Trust, Sivagangai\nRani Velu Nachiar Nagar, Kangirangal Post, Sivagangai Sivagangai - Tamil Nadu, India\nPhone : 04575 - 244930'}, {'link': 'https://www.indiacollegeshub.com/colleges/3dfx-animation-school-kochi.aspx', 'text': '3DFX Animation School, Kochi\nNear MP Office, Kattuparambil Towers, Old Market Road, Angamaly, Kochi - Kerala, India\nPhone : 91 0484 2455799'}, {'link': 'https://www.indiacollegeshub.com/colleges/4-g-fire-college-sonipat.aspx', 'text': '4 G Fire College, Sonipat\nK.C. Plaza, 1ST FLOWER,Above Eye Q Hospita , Atlas Road, Near State Bank of India, Sonipat - Haryana, India\nPhone : 9466769467, 7206220706'}, {'link': 'https://www.indiacollegeshub.com/colleges/5-gates-multimedia-solutions-indore.aspx', 'text': '5 Gates Multimedia Solutions, Indore\n102 Krtgya Tower, 8, Janki Nagar, A.b. Road, Indore - Madhya Pradesh, India\nPhone : (0731) 2400656'}, {'link': 'https://www.indiacollegeshub.com/colleges/a-a-arts-and-science-college-chennai.aspx', 'text': 'A A Arts And Science College, Chennai\n42/1, Srinivasan Nagar, Iind Street, Koyambedu Chennai - Tamil Nadu, India\nPhone : 044-28553109, 28526202'}, {'link': 'https://www.indiacollegeshub.com/colleges/a-a-govt-arts-college-attur-salem.aspx', 'text': 'A A Govt Arts College Attur, Salem\nSalem Salem - Tamil Nadu, India\nPhone : N/A'}]
If you want to scrape all of the 2k+ pages, you need to use multithreading to scrape the site faster. I used the code from this article. Don't forget to replace the variable NUM_THREADS with your number of threads. I also highly recommend writing the output to a file while the program is scraping; a sketch of that follows the code below.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import time

# REPLACE WITH YOUR NUMBER OF THREADS
NUM_THREADS = 8

links = [f"https://www.indiacollegeshub.com/colleges/page-{index}.aspx" for index in range(1, 2480)]  # 1 - 2479
data = []

def scrape(url):
    print(f"Scraping {url} ...")
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find("div", class_="clg-lists").find("ul")
    assert table, "table not found"
    for item in table.find_all("a", href=True):
        data.append({
            "link": item["href"],
            "text": item.text.strip(),
        })

start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape, links)
total_time = time.time() - start_time
print(total_time)
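Since writing the output to a file while scraping is recommended above, here is a minimal sketch of one way to do it. The scrape_and_save helper and the colleges.jsonl filename are my own illustrative names, not part of the original code; each scraped record is appended as one JSON line, and a lock guards the file because several threads write to it.

import json
import threading
import requests
from bs4 import BeautifulSoup

write_lock = threading.Lock()  # file writes from multiple threads need a lock

def scrape_and_save(url, outfile="colleges.jsonl"):
    # same parsing as scrape() above, but each record is appended to a file immediately
    print(f"Scraping {url} ...")
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find("div", class_="clg-lists").find("ul")
    for item in table.find_all("a", href=True):
        record = {"link": item["href"], "text": item.text.strip()}
        with write_lock:
            with open(outfile, "a", encoding="utf-8") as f:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

You could then pass scrape_and_save to executor.map() in place of scrape, and the already-scraped records are not lost if the run is interrupted partway through.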

How to scrape the URL, Title, and Description of Google Search Results

I'm using selenium to first ask Google a question and then scrape the first few results. I'm trying to add all URLs, Titles, and Descriptions to a Dict which I can then access later. Unfortunately, I can't get it to work - returns 'No Data Found'. Does anyone have an idea of what may be the issue?
Here is what I'm doing:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common import exceptions
import re

options = Options()
options.add_argument("--headless")

links = []  # collected hrefs (used below)

def googleSearch(query):
    # specifying browser web driver
    driver = webdriver.Chrome(options=options, executable_path='chromedriver')

    # search query
    search_engine = "https://www.google.com/search?q="
    query = query.replace(" ", "+")
    driver.get(search_engine + query + "&start=" + "0")

    # stored data
    # which will be returned by this function
    data = {}

    # number of search results on the first page
    s_len = 5

    for s_block in range(s_len):
        # result block
        content_block_xpath = f'''//*[@id="yuRUbf"]/div[{s_block}]/div/div'''

        # xpaths
        xpath_url = f"""{content_block_xpath}/div[1]/a"""
        xpath_title = f"""{content_block_xpath}/div[1]/a/h3"""
        xpath_description = f"""{content_block_xpath}/div[2]/span/span"""

        try:
            # store data collected for each s_block in block {}
            block = {}

            # find url of content
            url = driver.find_element(By.XPATH, xpath_url)
            url = url.get_attribute('href')
            links.append(url.get('href'))

            # find domain name of web having content
            pattern = r"""(https?:\/\/)?(([a-z0-9-_]+\.)?([a-z0-9-_]+\.[a-z0-9-_]+))"""
            domain = re.search(pattern, url)[0]
            print(links)

            # find title of content
            # title = driver.find_element_by_xpath(xpath_title)
            title = driver.find_element(By.XPATH, xpath_title)
            title = title.get_attribute("innerText")

            # find description of content
            # description = driver.find_element_by_xpath(xpath_description)
            description = driver.find_element(By.XPATH, xpath_description)
            description = description.get_attribute("innerText")

            # save all data to block {}
            block["domain"] = domain
            block["url"] = url
            block["title"] = title
            block["description"] = description

            # save block dictionary to main dictionary
            data[f'{s_block}'] = block
        except exceptions.NoSuchElementException:
            continue

    if len(data) == 0:
        raise Exception("No data found")

    driver.close()
    return data

def getQuery():
    query = str('How to change a car tire')
    link = googleSearch(query)
    print(link)

getQuery()
I see two problems:
a mix-up with class and id regarding the use of "yuRUbf"
indexing in xpath starts at 1 and not 0
I also don't get the same hierarchical structure as you, but that's just a tweak.
The following produces reasonable results for me:
content_block_xpath = f'''(//*[@class="yuRUbf"])[{s_block}]'''
xpath_url = f"""{content_block_xpath}/a"""
xpath_title = f"""{content_block_xpath}/a/h3"""
xpath_description = f"""{content_block_xpath}/a//cite/span"""
You can use the BeautifulSoup web scraping library on its own to scrape Google Search, without a Selenium web driver, since the data is not rendered through JS; this will also speed up the script.
Here's how you can extract title, link and a snippet (description) from Google search results using bs4 and requests packages:
params = {
    "q": "How to change a car tire",  # query example
    "hl": "en",   # language
    "gl": "uk",   # country of the search, UK -> United Kingdom
    "start": 0,   # page number, starts from 0
    # "num": 100  # parameter defines the maximum number of results to return
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select(".tF2Cxc"):
    title = result.select_one(".DKV0Md").text
    try:
        snippet = result.select_one(".lEBKkf span").text
    except:
        snippet = None
    links = result.select_one(".yuRUbf a")["href"]
You can also extract not only the first page but all the rest using pagination with an infinite while loop.
In this case, pagination is possible as long as the next button exists (determined by the presence of the button's CSS selector on the page, in our case .d6cvqb a[id=pnnext]). You need to increase the value of ["start"] by 10 to access the next page (this may be called non-token pagination) if it's present; otherwise, we need to exit the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Check code in the online IDE
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "How to change a car tire",  # query example
    "hl": "en",   # language
    "gl": "uk",   # country of the search, UK -> United Kingdom
    "start": 0,   # page number, starts from 0
    # "num": 100  # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "How Long Do Tires Last and When Should I Replace Them?",
"snippet": "As a general rule, we recommend every 5,000-7,000 miles, but it depends on numerous factors, including your car's alignment. You can read more on The Drive's ...",
"links": "https://www.thedrive.com/cars-101/35041/how-long-do-tires-last"
},
{
"title": "Car Tire Valve Stem Replacement - iFixit Repair Guide",
"snippet": "Step 1 Car Tire Valve Stem · Locate the stem valve and remove the cap. · Using the Schrader valve core bit in your 1/4\" driver, unscrew the valve core from the ...",
"links": "https://www.ifixit.com/Guide/Car+Tire+Valve+Stem+Replacement/121415"
},
other results ...
]
Also, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, so there's no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": "...",                 # serpapi key, https://serpapi.com/manage-api-key
    "engine": "google",               # serpapi parser engine
    "q": "How to change a car tire",  # search query
    "gl": "uk",                       # country of the search, UK -> United Kingdom
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Today: can you safely change a tire with passengers on board?",
"snippet": "RAY: In any case, the primary danger during a tire change is that the vehicle will slip off the jack and injure the tire changer.",
"link": "https://www.cartalk.com/content/today-can-you-safely-change-tire-passengers-board"
},
{
"title": "How to Change a Flat Tire - Mercedes-Benz Burlington",
"snippet": "How to Switch a Tire in 5 Steps · Secure the wheel wedges against the tires on the opposite side of the flat tire. · Remove the hubcap or wheel ...",
"link": "https://www.mercedes-benz-burlington.ca/how-to-change-a-flat-tire/"
},
other results...
]

.findAll() not finding things consistently

I tried to make a little Beautiful Soup script to analyze prices on eBay. The problem is that my soup.findAll() call that should find the prices sometimes works and sometimes doesn't, and I am wondering why. Here is my code:
import requests
from bs4 import BeautifulSoup
from requests.models import encode_multipart_formdata
article = input("Product:")
keywords = article.strip().replace(" ", "+")
URL_s = "https://www.ebay.de/sch/i.html?_dmd=1&_fosrp=1&LH_SALE_CURRENCY=0&_sop=12&_ipg=50&LH_Complete=1&LH_Sold=1&_sadis=10&_from=R40&_sacat=0&_nkw=" + keywords + "&_dcat=139971&rt=nc&LH_ItemCondition=3"
source = requests.get(URL_s).text
soup = BeautifulSoup(source)
prices = soup.findAll('span', class_='bold bidsold')
# ^ this line sometimes finds the prices, sometimes it just produces an empty list ^
help would be very welcome, hope you are doing well, bye bye :)
If you look at the variable soup and open the result as an HTML page, you would see that eBay has some sort of filtering mechanism to prevent scraping and requires you to somehow confirm your identity. This is why your query for prices returns empty.
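To check that, a small sketch like the following (the ebay_response.html filename is arbitrary) dumps the fetched HTML from your script whenever the price list comes back empty, so you can open it in a browser and see what eBay actually returned:

# Save the raw response for inspection; an empty `prices` list usually means
# eBay served something other than the normal results page.
if not prices:
    with open("ebay_response.html", "w", encoding="utf-8") as f:
        f.write(source)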
Maybe the prices are rendered by JavaScript, and requests does not wait for the JavaScript to load.
That's why you should use other modules, such as Selenium or DryScrape.
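For example, here is a rough Selenium sketch, assuming chromedriver is installed and on your PATH, and reusing the search URL and the span.bold.bidsold selector from the question (which may or may not still match eBay's current markup):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

keywords = "iphone"  # example search term
url = ("https://www.ebay.de/sch/i.html?_dmd=1&_fosrp=1&LH_SALE_CURRENCY=0&_sop=12"
       "&_ipg=50&LH_Complete=1&LH_Sold=1&_sadis=10&_from=R40&_sacat=0&_nkw="
       + keywords + "&_dcat=139971&rt=nc&LH_ItemCondition=3")

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get(url)  # the browser executes the page's JavaScript before we read the DOM
prices = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "span.bold.bidsold")]
print(prices)
driver.quit()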
When using requests, the request may be blocked because the default user-agent in the requests library is python-requests. In order for the website to understand that this is not a bot or script, you need to pass your real User-Agent in the headers.
You can also read the Reducing the chance of being blocked while web scraping blog post to learn about other options for solving this problem.
If you want to collect all the information from all pages, you can use a while loop that dynamically paginates through all pages.
The while loop will run until a stop condition is met; in our case, the condition is the absence of the next page, which is checked via the CSS selector ".pagination__next".
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

query = input('Your query is: ')  # "shirt" for example

params = {
    '_nkw': query,   # search query
    '_pgn': 1,       # page number
    'LH_Sold': '1'   # shows sold items
}

data = []
page_limit = 10  # page limit (if you need)

while True:
    page = requests.get('https://www.ebay.de/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text

        data.append({
            "title": title,
            "price": price
        })

    if params['_pgn'] == page_limit:
        break

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "CECIL Pullover Damen Hoodie Sweatshirt Gr. L (DE 44) Baumwolle flieder #7902fa2",
"price": "EUR 17,64"
},
{
"title": "Shirt mit Schlangendruck & Strass \"cyclam\" Gr. 40 UVP: 49,99€ 5.65",
"price": "EUR 6,50"
},
{
"title": "Fender Guitars Herren T-Shirt von Difuzed - Größe Medium blau auf blau - Sehr guter Zustand",
"price": "EUR 10,06"
},
other results ...
]
As an alternative, you can use Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code with pagination:
from serpapi import EbaySearch
import os, json

query = input('Your query is: ')  # "shirt" for example

params = {
    "api_key": "...",          # serpapi key, https://serpapi.com/manage-api-key
    "engine": "ebay",          # search engine
    "ebay_domain": "ebay.com", # ebay domain
    "_nkw": query,             # search query
    "_pgn": 1,                 # page number, needed below for pagination
    # "LH_Sold": "1"           # shows sold items
}

search = EbaySearch(params)  # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()  # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        title = organic_result.get("title")
        price = organic_result.get("price")

        data.append({
            "price": price,
            "title": title
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "EUR 17,50",
"extracted": 17.5
},
"title": "Mensch zweiter Klasse Gesund und ungeimpft T-Shirt"
},
{
"price": {
"raw": "EUR 14,90",
"extracted": 14.9
},
"title": "Sprüche Shirt Lustige T-Shirts für Herren oder Unisex Kult Fun Gag Handwerker"
},
# ...
]
There's also a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.

How to scrape all results from Google search results pages (Python/Selenium ChromeDriver)

I am working on a Python script using selenium chromedriver to scrape all google search results (link, header, text) off a specified number of results pages.
The code I have seems to only be scraping the first result from all pages after the first page.
I think this has something to do with how my for loop is set up in the scrape function, but I have not been able to tweak it into working the way I'd like it to. Any suggestions on how to fix this, or a better approach, are appreciated.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# create instance of webdriver
driver = webdriver.Chrome()
url = 'https://www.google.com'
driver.get(url)

# set keyword
keyword = 'cars'

# we find the search bar using its name attribute value
searchBar = driver.find_element_by_name('q')

# first we send our keyword to the search bar followed by the enter key
searchBar.send_keys(keyword)
searchBar.send_keys('\n')

def scrape():
    pageInfo = []
    try:
        # wait for search results to be fetched
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "g"))
        )
    except Exception as e:
        print(e)
        driver.quit()

    # contains the search results
    searchResults = driver.find_elements_by_class_name('g')
    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
        return pageInfo

# Number of pages to scrape
numPages = 5

# All the scraped data
infoAll = []

# Scraped data from page 1
infoAll.extend(scrape())

for i in range(0, numPages - 1):
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())

print(infoAll)
You have an indentation problem:
You should have return pageInfo outside the for loop, otherwise you're returning results after the first loop iteration:
for result in searchResults:
    element = result.find_element_by_css_selector('a')
    link = element.get_attribute('href')
    header = result.find_element_by_css_selector('h3').text
    text = result.find_element_by_class_name('IsZvec').text
    pageInfo.append({
        'header': header, 'link': link, 'text': text
    })
    return pageInfo
Like this:
for result in searchResults:
    element = result.find_element_by_css_selector('a')
    link = element.get_attribute('href')
    header = result.find_element_by_css_selector('h3').text
    text = result.find_element_by_class_name('IsZvec').text
    pageInfo.append({
        'header': header, 'link': link, 'text': text
    })
return pageInfo
I've run your code and got results:
[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d’origine : États-Unis\nDurée : 116 minutes\nSociétés de production : Pixar Animation Studios\nGenre : Animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventuras; Comedia; Infa...\nHistoria : John Lasseter Joe Ranft Jorgen Klubi...\nProductora : Walt Disney Pictures; Pixar Animat...'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},
Suggestions:
Use a timer to control your for loop, otherwise you could be banned by Google due to suspicious activity
Steps:
1.- Import sleep from time: from time import sleep
2.- On your last loop add a timer:
for i in range(0, numPages - 1):
    sleep(5)  # It'll wait 5 seconds for each iteration
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())
Google Search can be parsed with the BeautifulSoup web scraping library without Selenium, since the data is not loaded dynamically via JavaScript; it will also execute much faster than Selenium, as there's no need to render the page and drive a browser.
In order to get information from all pages, you can paginate using an infinite while loop. Try to avoid for i in range() pagination, as it is a hardcoded approach and thus not reliable: if the number of pages changes (from 5 to 20, say), the pagination breaks.
Since the while loop is infinite, you need to set conditions for exiting it. You can use two conditions:
the exit condition is the presence of a button to switch to the next page (it is absent on the last page); its presence can be checked via its CSS selector (in our case, ".d6cvqb a[id=pnnext]"):
# condition for exiting the loop in the absence of the next page button
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
another solution is to add a limit on the number of pages to scrape if there is no need to extract all of them:
# condition for exiting the loop when the page limit is reached
if page_num == page_limit:
    break
When a site receives a request, it may decide it is coming from a bot. To prevent this, you need to send headers containing a user-agent in the request; the site will then assume that you are a user and display the information.
The next step could be to rotate the user-agent, for example, to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on (a minimal sketch follows). The most reliable way is to use rotating proxies, user-agents, and a CAPTCHA solver.
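As a minimal illustration of rotating user-agents (the list below is just an example; a real script would keep a larger, regularly updated pool and could rotate proxies the same way):

import random
import requests

# a few example User-Agent strings covering desktop and mobile browsers
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}  # pick a different UA on every request
html = requests.get("https://www.google.com/search", params={"q": "cars"}, headers=headers, timeout=30)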
Check full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",  # query example
    "hl": "en",   # language
    "gl": "uk",   # country of the search, UK -> United Kingdom
    "start": 0,   # page number, starts from 0
    # "num": 100  # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit for example
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # condition for exiting the loop in the absence of the next page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Cars (2006) - IMDb",
"snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
"links": "https://www.imdb.com/title/tt0317219/"
},
{
"title": "Cars (film) - Wikipedia",
"snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
"links": "https://en.wikipedia.org/wiki/Cars_(film)"
},
{
"title": "Cars - Rotten Tomatoes",
"snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
"links": "https://www.rottentomatoes.com/m/cars"
},
other results ...
]
Also, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, so there's no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": "...",   # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google", # serpapi parser engine
    "q": "cars",        # search query
    "gl": "uk",         # country of the search, UK -> United Kingdom
    "num": "100"        # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Rally Cars - Page 30 - Google Books result",
"snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
"link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
},
{
"title": "Independent Sports Cars - Page 5 - Google Books result",
"snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
"link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
}
other results...
]

List returns empty even when path is correct for screen scraper

So I'm trying to get all the URLs from the free games page on the Ubisoft website, but it keeps returning an empty list. I don't know what I'm doing wrong here; the image below shows the path.
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
}

result = requests.get("https://free.ubisoft.com/", headers=headers)
soup = BeautifulSoup(result.content, 'lxml')
print(result.content)

links = []
urls = soup.find('div', {'class': 'free-events'}).find_all("a")
for url in urls:
    link = url.attrs['data-url']
    if "https" in link:
        links.append(link)
return links
The data is loaded dynamically, so if you print result.content you will see that there is only some simple HTML and JavaScript.
Using Selenium you can load the page and retrieve the links like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
browser = webdriver.Chrome(chrome_options=options)
browser.get("https://free.ubisoft.com/")

for link in browser.find_elements_by_css_selector("div.free-event-button a[data-type='freegame']"):
    print(link.get_attribute("data-url"))

# https://register.ubisoft.com/aco-discovery-tour
# https://register.ubisoft.com/acod-discovery-tour
# https://register.ubisoft.com/might_and_magic_chess_royale
# https://register.ubisoft.com/rabbids-coding
The content is loaded dynamically via JavaScript, but you can replicate the JavaScript requests with the requests module.
For example:
import re
import requests
configuration_url = 'https://free.ubisoft.com/configuration.js'
configuration_js = requests.get(configuration_url).text
app_id = re.search(r"appId:\s*'(.*?)'",configuration_js).group(1)
url = re.search(r"prod:\s*'(.*?)'",configuration_js).group(1)
data = requests.get(url, headers={'ubi-appid': app_id,'ubi-localecode': 'en-US'}).json()
# pretty print all data:
import json
print(json.dumps(data, indent=4))
Prints:
{
    "news": [
        {
            "spaceId": "6d0af36b-8226-44b6-a03b-4660073a6349",
            "newsId": "ignt.21387",
            "type": "freegame",
            "placement": "freeevents",
            "priority": 1,
            "displayTime": 0,
            "publicationDate": "2020-05-14T17:01:00",
            "expirationDate": "2020-05-21T18:01:00",
            "title": "Assassin's Creed Origins Discovery Tour",
            "body": "Assassin's Creed Origins Discovery Tour",
            "mediaURL": "https://ubistatic2-a.akamaihd.net/sitegen/assets/img/ac-odyssey/ACO_DiscoveryTour_logo.png",
            "mediaType": null,
            "profileId": null,
            "obj": {},
            "links": [
                {
                    "type": "External",
                    "param": "https://register.ubisoft.com/aco-discovery-tour",
                    "actionName": "goto"
                }
            ],
            "locale": "en-US",
            "tags": null
        },
        ... and so on.
EDIT: To iterate over this data, you can use this example:
import re
import requests
configuration_url = 'https://free.ubisoft.com/configuration.js'
configuration_js = requests.get(configuration_url).text
app_id = re.search(r"appId:\s*'(.*?)'",configuration_js).group(1)
url = re.search(r"prod:\s*'(.*?)'",configuration_js).group(1)
data = requests.get(url, headers={'ubi-appid': app_id,'ubi-localecode': 'en-US'}).json()
for no, news in enumerate(data['news'], 1):
    print('{:<5}{:<45}{}'.format(no, news['title'], news['links'][0]['param']))
Prints:
1 Assassin's Creed Origins Discovery Tour https://register.ubisoft.com/aco-discovery-tour
2 Assassin's Creed Odyssey Discovery Tour https://register.ubisoft.com/acod-discovery-tour
3 Uno Demo https://register.ubisoft.com/uno-trial
4 The Division 2 Trial https://register.ubisoft.com/the-division-2-trial
5 Ghost Recon Breakpoint Trial https://register.ubisoft.com/ghost-recon-breakpoint-trial
6 Might and Magic Chess Royale https://register.ubisoft.com/might_and_magic_chess_royale
7 Rabbids Coding https://register.ubisoft.com/rabbids-coding
8 Trials Rising Demo https://register.ubisoft.com/trials-rising-demo
9 The Crew 2 Trial https://register.ubisoft.com/tc2-trial
10 Ghost Recon Wildlands Trial https://register.ubisoft.com/ghost-recon-wildlands-trial
11 The Division Trial https://register.ubisoft.com/the-division-trial
EDIT 2: To filter only free games, you can do:
no = 1
for news in data['news']:
    if news['type'] != 'freegame':
        continue
    print('{:<5}{:<45}{}'.format(no, news['title'], news['links'][0]['param']))
    no += 1
Prints:
1 Assassin's Creed Origins Discovery Tour https://register.ubisoft.com/aco-discovery-tour
2 Assassin's Creed Odyssey Discovery Tour https://register.ubisoft.com/acod-discovery-tour
3 Might and Magic Chess Royale https://register.ubisoft.com/might_and_magic_chess_royale
4 Rabbids Coding https://register.ubisoft.com/rabbids-coding
