I'm very new to Python and trying to web scrape soccer matches for 'today' from the Fox Sports website: https://www.foxsports.com/scores/soccer. Unfortunately, I keep running into issues with
'AttributeError: 'NoneType' object has no attribute 'find_all''
and can't seem to get the teams for that day. This is what I have so far:
import bs4
import requests
res = requests.get('foxsports.com/scores/soccer')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = soup.find("div", class_="scores-date")
games = results.find("div", class_="scores")
print(games)
What happens?
The content is not static; it is served dynamically by the website, so requests won't get the information you can see in your dev tools.
How to fix?
Use an API if one is provided, or selenium, which handles the content like a browser and can provide the page_source you are looking for.
Because not all of the content is rendered immediately, you have to use selenium waits to locate the presence of the <span> with class "title-text".
Example
Note: the example uses selenium 4, so check your version and update, or adapt the required dependencies to a lower version yourself.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
service = ChromeService(executable_path='ENTER YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://www.foxsports.com/scores/soccer')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[contains(@class, "title-text") and text() = "Today"]')))
soup = BeautifulSoup(driver.page_source, 'lxml')
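# The selector below takes each .scores-date header with no inner <div>,
# steps to the <div> immediately after it, and grabs the score chips inside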
for g in soup.select('.scores-date:not(:has(div)) + div .score-chip-content'):
    print(list(g.stripped_strings))
Output
['SERIE A', 'JUVENTUS', '9-4-5', 'JUV', '9-4-5', 'CAGLIARI', '1-7-10', 'CAG', '1-7-10', '8:45PM', 'Paramount+', 'JUV -455', 'CAG +1100']
['LG CUP', 'ARSENAL', '0-0-0', 'ARS', '0-0-0', 'SUNDERLAND', '0-0-0', 'SUN', '0-0-0', '8:45PM', 'ARS -454', 'SUN +1243']
['LA LIGA', 'SEVILLA', '11-4-2', 'SEV', '11-4-2', 'BARCELONA', '7-6-4', 'BAR', '7-6-4', '9:30PM', 'ESPN+', 'SEV +155', 'BAR +180']
You have to provide a link with the http protocol. This code works:
import bs4
import requests
res = requests.get('https://foxsports.com/scores/soccer')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = soup.find("div", class_="scores-date")
games = results.find("div", class_="scores")
print(results)
print(games)
However, games is None because bs4 cannot find any div with class scores inside results.
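As a side note, the AttributeError from the question appears whenever a find() that returned None is chained into another lookup. A minimal defensive sketch, reusing the same selectors (which may still match nothing, since the page is rendered dynamically):
import bs4
import requests

res = requests.get('https://foxsports.com/scores/soccer')
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Guard against find() returning None before chaining another lookup,
# which is exactly what raises the "'NoneType' object has no attribute ..." error
results = soup.find("div", class_="scores-date")
if results is None:
    print("no div with class 'scores-date' in the fetched HTML")
else:
    games = results.find("div", class_="scores")
    print(games if games is not None else "no div with class 'scores' inside it")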
It is far more efficient to go through the API. All the data is there, including much more (but I only pulled out the scores to print). You'll first have to access the site to grab the apikey to be used as a parameter.
I've also added the option to choose your group/league, so you'll need to pip install choice.
import requests
import datetime
from bs4 import BeautifulSoup
import re
#pip install choice
import choice
# Get the apikey
url = 'https://www.foxsports.com/scores/soccer'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
apikey = soup.find_all('div', {'data-scoreboard':re.compile("^https://")})[0]['data-scoreboard'].split('apikey=')[-1]
# Get the group Ids and corresponding titles
url = 'https://api.foxsports.com/bifrost/v1/soccer/scoreboard/main'
payload = {'apikey':apikey}
jsonData = requests.get(url, params=payload).json()
groupsTitle_list = [ x['title'] for x in jsonData['groupList']]
groupsId_list = [ x['id'] for x in jsonData['groupList']]
groups_dict = dict(zip(groupsTitle_list, groupsId_list))
user_input = choice.Menu(groups_dict.keys()).ask()
groupId = groups_dict[user_input]
# Get the date of the score you are after
date_param = input('Enter date in YYYYMMDD format\nEx: 20220109\n-> ')
# If you prefer to always just grab todays score, use line below
#date_param = datetime.datetime.now().strftime("%Y%m%d")
# Pull the score for the date and group
url = f'https://api.foxsports.com/bifrost/v1/soccer/scoreboard/segment/c{groupId}d{date_param}'
payload = {
    'apikey': apikey,
    'groupId': groupId}
jsonData = requests.get(url, params=payload).json()
if len(jsonData['sectionList']) == 0:
    print(f'No score available on {date_param} for {user_input}')
else:
    returnDate = jsonData['sectionList'][0]['menuTitle']
    print(f'\n {returnDate} - {user_input}')
    events = jsonData['sectionList'][0]['events']
    for event in events:
        lowerTeamName = event['lowerTeam']['longName']
        lowerTeamScore = event['lowerTeam']['score']
        upperTeamName = event['upperTeam']['longName']
        upperTeamScore = event['upperTeam']['score']
        print(f'\t{upperTeamName} {upperTeamScore}')
        print(f'\t{lowerTeamName} {lowerTeamScore}\n')
Output:
Make a choice:
0: FEATURED MATCHES
1: ENGLISH PREMIER LEAGUE
2: MLS
3: LA LIGA
4: LIGUE 1
5: BUNDESLIGA
6: UEFA CHAMPIONS LEAGUE
7: LIGA MX
8: SERIE A
9: WCQ - CONCACAF
Enter number or name; return for next page
? 0
Enter date in YYYYMMDD format
Ex: 20220109
-> 20220109
SUN, JAN 9 - FEATURED MATCHES
LIVERPOOL 4
SHREWSBURY 1
TOTTENHAM 3
MORECAMBE 1
WOLVES 3
SHEFFIELD UTD 0
WEST HAM 2
LEEDS UNITED 0
NOTTINGHAM 1
ARSENAL 0
ROMA 3
JUVENTUS 4
LYON 1
PARIS SG 1
VILLARREAL 2
ATLÉTICO MADRID 2
GUADALAJARA 3
MAZATLÁN FC 0
Related
I'm doing a scraping exercise on a job search webpage. I want to get the link, name of the company, job title, salary, location and posting date. I've run the same code multiple times; sometimes it gives the expected results in the salary item (the salary if the info is displayed, "N/A" otherwise), and sometimes it gives me something different: the salary if the info is displayed, "N/A", and some random character values in columns whose values should be "N/A". I have no problems with the other elements. Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import pandas as pd
import requests
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://ca.indeed.com/')
#Inputs a job title and location into the input boxes
input_box = driver.find_element(By.XPATH,'//*[@id="text-input-what"]')
input_box.send_keys('data analyst')
location = driver.find_element(By.XPATH,'//*[@id="text-input-where"]')
location.send_keys('toronto')
#Clicks on the search button
button = driver.find_element(By.XPATH,'//*[@id="jobsearch"]/button').click()
#Creates a dataframe
df = pd.DataFrame({'Link':[''], 'Job Title':[''], 'Company':[''], 'Location':[''],'Salary':[''], 'Date':['']})
#This loop goes through every page and grabs all the details of each posting
#Loop will only end when there are no more pages to go through
while True:
    #Imports the HTML of the current page into python
    soup = BeautifulSoup(driver.page_source, 'lxml')
    #Grabs the HTML of each posting
    postings = soup.find_all('div', class_ = 'slider_container css-g7s71f eu4oa1w0')
    len(postings)
    #grabs all the details for each posting and adds it as a row to the dataframe
    for post in postings:
        link = post.find('a').get('href')
        link_full = 'https://ca.indeed.com'+link
        name = post.find('h2', tabindex = '-1').text.strip()
        company = post.find('span', class_ = 'companyName').text.strip()
        try:
            location = post.find('div', class_ = 'companyLocation').text.strip()
        except:
            location = 'N/A'
        try:
            salary = post.find('div', attrs = {'class':'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}).text.strip()
        except:
            salary = 'N/A'
        date = post.find('span', class_ = 'date').text.strip()
        df = df.append({'Link':link_full, 'Job Title':name, 'Company':company, 'Location':location,'Salary':salary, 'Date':date},
                       ignore_index = True)
    #checks if there is a button to go to the next page, and if not will stop the loop
    try:
        button = soup.find('a', attrs = {'aria-label': 'Next'}).get('href')
        driver.get('https://ca.indeed.com'+button)
    except:
        break
Can I fix my code to get the expected results every time I run it? Also, an additional issue: I'm scraping around 60 pages, but usually the program stops between 20 and 30 pages before the last page. Is there a way to fix the code so that it scrapes until the last page every time?
Here is a simplified example with the requests library:
import requests
from bs4 import BeautifulSoup
cookies = {}
headers = {}
params = {
    'q': 'data analyst',
    'l': 'toronto',
    'from': 'searchOnHP',
}
response = requests.get('https://ca.indeed.com/jobs', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
postings = soup.find_all('div', class_ = 'slider_container css-g7s71f eu4oa1w0')
len(postings)
prints
15
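Building on that, one way to keep the salary column from picking up stray values is to route every lookup through a small helper that falls back to 'N/A' whenever an element is missing. A hedged sketch reusing the request above; the text_or_na helper is hypothetical and the class names are copied from the question, so they may have changed on Indeed's side:
import requests
from bs4 import BeautifulSoup

# Hypothetical helper: return the stripped text of the first match, or 'N/A' if missing
def text_or_na(post, name, attrs=None):
    tag = post.find(name, attrs=attrs or {})
    return tag.text.strip() if tag else 'N/A'

params = {'q': 'data analyst', 'l': 'toronto', 'from': 'searchOnHP'}
response = requests.get('https://ca.indeed.com/jobs', params=params)
soup = BeautifulSoup(response.text, 'html.parser')
postings = soup.find_all('div', class_='slider_container css-g7s71f eu4oa1w0')

rows = []
for post in postings:
    rows.append({
        'Company': text_or_na(post, 'span', {'class': 'companyName'}),
        'Location': text_or_na(post, 'div', {'class': 'companyLocation'}),
        'Salary': text_or_na(post, 'div', {'class': 'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}),
        'Date': text_or_na(post, 'span', {'class': 'date'}),
    })
print(rows[:3])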
I have problems trying to scrape a website with multiple pages using Spyder: the site has pages 1 to 6 and also a next button. Also, each of the six pages has 30 results. I've tried two solutions without success.
This is the first one:
#SOLUTION 1#
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')
#Imports the HTML of the webpage into python
soup = BeautifulSoup(driver.page_source, 'lxml')
postings = soup.find_all('div', class_ = 'isp_grid_product')
#Creates data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})
#Scrape the data
for i in range(1, 7): #I've also tried with range (1,6), but it gives 5 pages instead of 6.
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num="+str(i)+""
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)
The output of this code is a data frame with 180 rows (30 x 6), but it repeats the results
of the first page. Thus, my first 30 rows are the first 30 results of the first page, and the rows 31-60 are again the same results of the first page and so on.
Here is the second solution I tried:
### SOLUTION 2 ###
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')
#Imports the HTML of the webpage into python
soup = BeautifulSoup(driver.page_source, 'lxml')
soup
#Create data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})
#Scrape data
i = 0
while i < 6:
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    len(postings)
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)
    #Imports the next page's HTML into python
    next_page = 'https://store.unionlosangeles.com'+soup.find('div', class_ = 'page-item next').get('href')
    page = requests.get(next_page)
    soup = BeautifulSoup(page.text, 'lxml')
    i += 1
The problem with this second solution is that the program cannot recognize the attribute "get" in next_page, for reasons I cannot grasp (I haven't had this problem on other sites with pagination). Thus, I get only the first page and not the others.
How can I fix the code to properly scrape all 180 elements?
The data you see is loaded from an external URL via JavaScript. You can simulate these calls with the requests module. For example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1"
api_url = "https://cdn-gae-ssl-premium.akamaized.net/categories_navigation"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
params = {
    "page_num": 1,
    "store_id": "",
    "UUID": "",
    "sort_by": "creation_date",
    "facets_required": "0",
    "callback": "",
    "related_search": "1",
    "category_url": "/collections/outerwear",
}
q = parse_qs(
    urlparse(soup.select_one("#isp_search_result_page ~ script")["src"]).query
)
params["store_id"] = q["store_id"][0]
params["UUID"] = q["UUID"][0]
all_data = []
for params["page_num"] in range(1, 7):
    data = requests.get(api_url, params=params).json()
    for i in data["items"]:
        link = i["u"]
        vendor = i["v"]
        title = i["l"]
        price = i["p"]
        all_data.append([link, vendor, title, price])
df = pd.DataFrame(all_data, columns=["link", "vendor", "title", "price"])
print(df.head(10).to_markdown(index=False))
print("Total items =", len(df))
Prints:
| link | vendor | title | price |
|---|---|---|---|
| /products/barn-jacket | Essentials | BARN JACKET | 250 |
| /products/work-vest-2 | Essentials | WORK VEST | 120 |
| /products/tailored-track-jacket | Martine Rose | TAILORED TRACK JACKET | 1206 |
| /products/work-vest-1 | Essentials | WORK VEST | 120 |
| /products/60-40-cloth-bug-anorak-1tone | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
| /products/smooth-jersey-stand-man-woman-track-jkt | Kapital | Smooth Jersey STAND MAN & WOMAN Track JKT | 423 |
| /products/supersized-sports-jacket | Martine Rose | SUPERSIZED SPORTS JACKET | 1695 |
| /products/pullover-vest | Nicholas Daley | PULLOVER VEST | 267 |
| /products/flannel-polkadot-x-bandana-reversible-1st-jkt-1 | Kapital | FLANNEL POLKADOT X BANDANA REVERSIBLE 1ST JKT | 645 |
| /products/60-40-cloth-bug-anorak-1tone-1 | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
Total items = 175
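As an aside, the repetition in the question's first selenium attempt happens because the loop builds a new url on each pass but never loads it, so the same page_source is parsed six times. If you prefer to stay with selenium, here is a rough sketch of that fix using the question's selectors (whether page_num renders all six pages this way in the browser is an assumption):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
all_rows = []
for i in range(1, 7):
    # Actually load each page before re-parsing it
    driver.get(f"https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num={i}")
    soup = BeautifulSoup(driver.page_source, "lxml")
    for post in soup.find_all("li", class_="isp_grid_product"):
        all_rows.append({
            "Link": "https://store.unionlosangeles.com" + post.find("a", class_="isp_product_image_href").get("href"),
            "Vendor": post.find("div", class_="isp_product_vendor").text.strip(),
            "Title": post.find("div", class_="isp_product_title").text.strip(),
            "Price": post.find("div", class_="isp_product_price_wrapper").text.strip(),
        })
driver.quit()
print(len(all_rows))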
I am currently trying to extract all the reviews of the Spiderman Homecoming movie, but I am only able to get the first 25 reviews. I was able to load more reviews on IMDB, as originally it only shows the first 25, but for some reason I am unable to mine all the reviews after every review has been loaded. Does anyone know what I am doing wrong?
Below is the code I am running:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")
#Go to Google
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")
#Loop load more button
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break
#Scrape IMBD review
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")
#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')
#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')
#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')
#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,
    "Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')
print(title)
print(review)
Well, actually, there's no need to use Selenium. The data is available by sending a GET request to the website's API in the following format:
https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY
where you have to provide a key for the paginationKey in the URL (...&paginationKey=MY-KEY)
The key is found in the class load-more-data:
<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
</div>
So, to scrape all the reviews into a DataFrame, try:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = (
"https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}
while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break
    # Update the `key` variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())
df = pd.DataFrame(data)
print(df)
Output (truncated):
title review
0 Terrific entertainment Spiderman: Far from Home is not intended to be...
1 THe illusion of the identity of Spider man. Great story in continuation of spider man home...
2 What Happened to the Bad Guys I believe that Quinten Beck/Mysterio got what ...
3 Spectacular One of the best if not the best Spider-Man mov...
...
...
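If you also want the CSV file the question builds, appending one line to the snippet above writes out the same file (the filename comes from the question):
df.to_csv('Spiderman_Reviews.csv', index=False)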
I'm a complete beginner and I'm running into some issues in web scraping. I'm able to scrape the picture, title, and price, and was successful at grabbing the index at [0]. However, whenever I try to run a loop or hardcode the index at a higher value than 0, it states that it's out of range, and it won't scrape any of the other <li> tags. Is there any other way to go about this problem? Also, I incorporated selenium in order to load the entire page. Any help would be highly appreciated.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://ca.octobersveryown.com/collections/all")
scrolls = 22
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.2)
    if scrolls < 0:
        break
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
bodies = (soup.find(id='content'))
clothing = bodies.find_all('ul', class_='grid--full product-grid-items')
for span_tag in soup.findAll(class_='visually-hidden'):
    span_tag.replace_with('')
print(clothing[0].find('img')['src'])
print(clothing[0].find(class_='product-title').get_text())
print(clothing[0].find(class_='grid-price-money').get_text())
time.sleep(8)
driver.quit()
If you want to use only BeautifulSoup without selenium, you can simulate the Ajax requests the page is making. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'
page = 1
while True:
    soup = BeautifulSoup( requests.get(url.format(page=page)).content, 'html.parser' )
    li = soup.find_all('li', recursive=False)
    if not li:
        break
    for l in li:
        print(l.select_one('p a').get_text(strip=True))
        print('https:' + l.img['src'])
        print(l.select_one('.grid-price').get_text(strip=True, separator=' '))
        print('-' * 80)
    page += 1
Prints:
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-dark-red-1_large.jpg?v=1598583974
£178.00
--------------------------------------------------------------------------------
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-black-1_large.jpg?v=1598583976
£178.00
--------------------------------------------------------------------------------
ALL COUNTRY HOODIE
https://cdn.shopify.com/s/files/1/1605/0171/products/all-country-hoodie-white-1_large.jpg?v=1598583978
£148.00
--------------------------------------------------------------------------------
...and so on.
EDIT (To save as CSV):
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'
page = 1
all_data = []
while True:
    soup = BeautifulSoup( requests.get(url.format(page=page)).content, 'html.parser' )
    li = soup.find_all('li', recursive=False)
    if not li:
        break
    for l in li:
        d = {'name': l.select_one('p a').get_text(strip=True),
             'link': 'https:' + l.img['src'],
             'price': l.select_one('.grid-price').get_text(strip=True, separator=' ')}
        all_data.append(d)
        print(d)
        print('-' * 80)
    page += 1
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Prints:
name ... price
0 LIGHTWEIGHT RAIN SHELL ... £178.00
1 LIGHTWEIGHT RAIN SHELL ... £178.00
2 ALL COUNTRY HOODIE ... £148.00
3 ALL COUNTRY HOODIE ... £148.00
4 ALL COUNTRY HOODIE ... £148.00
.. ... ... ...
271 OVO ESSENTIALS LONGSLEEVE T-SHIRT ... £58.00
272 OVO ESSENTIALS POLO ... £68.00
273 OVO ESSENTIALS T-SHIRT ... £48.00
274 OVO ESSENTIALS CAP ... £38.00
275 POM POM COTTON TWILL CAP ... £32.00 SOLD OUT
[276 rows x 3 columns]
and saves data.csv.
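On the original IndexError: find_all('ul', class_='grid--full product-grid-items') most likely matches a single <ul>, so only index 0 exists; the individual products are the <li> items inside it. A rough sketch iterating those instead (selectors copied from the question and not re-tested against the current page):
from bs4 import BeautifulSoup

# 'html' is the JS-rendered page_source captured by the question's selenium code
soup = BeautifulSoup(html, 'html.parser')
content = soup.find(id='content')
grid = content.find('ul', class_='grid--full product-grid-items')
if grid:
    # Walk the product <li> items inside the single grid <ul> instead of
    # indexing multiple <ul> elements that don't exist
    for item in grid.find_all('li'):
        img = item.find('img')
        title = item.find(class_='product-title')
        price = item.find(class_='grid-price-money')
        if img and title and price:
            print(img['src'], title.get_text(strip=True), price.get_text(strip=True))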
I am trying to write a Python program to gather data from Google Trends (GT); specifically, I want to automatically open URLs and access the specific values that are displayed in the title.
I have written the code and I am able to scrape data successfully. But when I compare the data returned by the code with the data present at the URL, the results are only partially returned.
For example, in the below image, the code returns the first title "Manchester United F.C. • Tottenham Hotspur F.C.", but the actual website has 4 results: "Manchester United F.C. • Tottenham Hotspur F.C., International Champions Cup, Manchester".
google trends image
screenshot output of code
We have currently tried all possible ways of locating the elements on the page, but we are still unable to find a fix for this. We didn't want to use Scrapy or Beautiful Soup for this.
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

links=["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]
for link in links:
    Title_temp = []
    Title = ''
    seleniumDriver = r"C:/Users/Downloads/chromedriver_win32/chromedriver.exe"
    chrome_options = Options()
    brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
    try:
        brow.get(link)  ## getting the url
        try:
            content = brow.find_elements_by_class_name("details-top")
            for element in content:
                Title_temp.append(element.text)
            Title = ' '.join(Title_temp)
        except:
            Title = ''
        brow.quit()
    except Exception as error:
        print(error)
        break

Final_df = pd.DataFrame(
    {'Title': Title_temp
    })
From what I see, the data is retrieved from an API endpoint you can call directly. I show how to call it and then extract only the title (note that the API call returns more info than just the title). You can explore the breadth of what is returned (which includes article snippets, urls, image links etc.) here.
import requests
import json
r = requests.get('https://trends.google.com/trends/api/realtimetrends?hl=en-GB&tz=-60&cat=s&fi=0&fs=0&geo=DE&ri=300&rs=20&sort=0')
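# r.text starts with a short anti-JSON (XSSI) prefix, which the [5:] slice below strips before parsing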
data = json.loads(r.text[5:])
titles = [story['title'] for story in data['storySummaries']['trendingStories']]
print(titles)
Here is the code which printed all the information.
url = "https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"
driver.get(url)
WebDriverWait(driver,30).until(EC.presence_of_element_located((By.CLASS_NAME,'details-top')))
Title_temp = []
try:
    content = driver.find_elements_by_class_name("details-top")
    for element in content:
        Title_temp.append(element.text)
    Title = ' '.join(Title_temp)
except:
    Title = ''
print(Title_temp)
driver.close()
Here is the output.
['Hertha BSC • Fenerbahçe S.K. • Bundesliga • Ante Čović • Berlin', 'Eintracht Frankfurt • UEFA Europa League • Tallinn • Estonia • Frankfurt', 'FC Augsburg • Galatasaray S.K. • Martin Schmidt • Bundesliga • Stefan Reuter', 'Austria national football team • FIFA • Austria • FIFA World Rankings', 'Lechia Gdańsk • Brøndby IF • 2019–20 UEFA Europa League • Gdańsk', 'Alexander Zverev • Hamburg', 'Julian Lenz • Association of Tennis Professionals • Alexander Zverev', 'UEFA Europa League • Diego • Nairo Quintana • Tour de France']
We were able to find a fix for this. We had to scrape data from the inner HTML and then do a bit of data cleaning to get the required records.
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
#html parser
def parse_html(content):
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    soup = BeautifulSoup(content, 'html.parser')
    text_elements = soup.findAll(text=True)
    tag_blacklist = ['style', 'script', 'head', 'title', 'meta', '[document]','img']
    clean_text = []
    for element in text_elements:
        if element.parent.name in tag_blacklist or isinstance(element, Comment):
            continue
        else:
            text_ = element.strip()
            clean_text.append(text_)
    result_text = " ".join(clean_text)
    result_text = result_text.replace(r'[\r\n]','')
    tag_remove_pattern = re.compile(r'<[^>]+>')
    result_text = tag_remove_pattern.sub('', result_text)
    result_text = re.sub(r'\\','',result_text)
    return result_text
seleniumDriver = r"./chromedriver.exe"
chrome_options = Options()
brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
links=["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]
title_temp = []
for link in links:
    try:
        brow.get(link)
        try:
            elements = brow.find_elements_by_class_name('details-top')
            for element in elements:
                html_text = parse_html(element.get_attribute("innerHTML"))
                title_temp.append(html_text.replace('share','').strip())
        except Exception as error:
            print(error)
        time.sleep(1)
        brow.quit()
    except Exception as error:
        print(error)
        break

Final_df = pd.DataFrame(
    {'Title': title_temp
    })
print(Final_df)