Iterating over click while scraping data using selenium and python

I am trying to scrape data from this webpage
http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting
I need to copy the contents from the table into a CSV file, then go to the next page and append the contents of that page to the same file. I am able to scrape the table, but when I try to loop over clicking the Next button using Selenium WebDriver's click, it goes to the next page and stops. This is my code.
driver = webdriver.Chrome(executable_path='path')
url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting'
def data_from_cricinfo(url):
    driver.get(url)
    pgsource = str(driver.page_source)
    soup = BeautifulSoup(pgsource, 'html5lib')
    data = soup.find_all('div', class_='engineTable')
    for tr in data:
        info = tr.find_all('tr')
        # grab data
    next_link = driver.find_element_by_class_name('PaginationLink')
    next_link.click()
data_from_cricinfo(url)
Is there any way to click Next for all pages using a loop and copy the contents of all pages into the same file? Thanks in advance.

You can do something like the following to traverse all the pages (through the Next button) and parse the data from the table:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
URL = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting'
driver = webdriver.Chrome()
driver.get(URL)
while True:
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    # the third engineTable div holds the batting statistics
    table = soup.find_all(class_='engineTable')[2]
    for info in table.find_all('tr'):
        data = [item.text for item in info.find_all("td")]
        print(data)
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except NoSuchElementException:
        # the last page has no Next link, so we are done
        break
driver.quit()
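To also write every page's rows into one CSV file, as the question asks, you can swap the print for a csv writer. Here is a minimal sketch reusing the driver, imports, and loop from the answer above; the output filename batting_stats.csv and the empty-row check are my additions, not part of the original answer:
import csv
with open('batting_stats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    while True:
        soup = BeautifulSoup(driver.page_source, 'html5lib')
        table = soup.find_all(class_='engineTable')[2]
        for info in table.find_all('tr'):
            data = [item.text for item in info.find_all("td")]
            if data:  # header rows have no td cells, so skip them
                writer.writerow(data)
        try:
            driver.find_element_by_partial_link_text('Next').click()
        except NoSuchElementException:
            break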

Related

Web Scraping dynamic content with Python

I would simply like to get the open price of a stock; BeautifulSoup or Selenium is okay, but I keep getting just the HTML tag for it and not the actual price I want:
# <div class="tv-fundamental-block__value js-symbol-open">33931.0</div>
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.tradingview.com/symbols/PEPPERSTONE-US30/')
response = url.content
soup = BeautifulSoup(response, 'html.parser')
# print(soup.prettify())
open = soup.find('div', {'class': 'js-symbol-open'})
print(open)
The 33931.0 is the price I'd like to see in my terminal, but I still don't get it.
Using Selenium, I've only gotten the page source that I already know I am getting the data from.
To extract the text content of the element using BeautifulSoup, use the .text property of the element:
open_price = soup.find('div', {'class': 'js-symbol-open'}).text  # renamed from open to avoid shadowing the built-in
print(open_price)
In Selenium:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.tradingview.com/symbols/PEPPERSTONE-US30/')
open_price = driver.find_element_by_css_selector('.js-symbol-open').text
print(open_price)
driver.quit()
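Note that this page fills in the price with JavaScript, so the element text can still be empty right after the page loads. A hedged sketch using Selenium's explicit waits (the 10-second timeout is an arbitrary choice of mine, not something the answer above specifies):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('https://www.tradingview.com/symbols/PEPPERSTONE-US30/')
# wait up to 10 seconds for the open-price element to be present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.js-symbol-open'))
)
print(element.text)
driver.quit()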

Is there a way to web scrape a website with unchanging URLs?

I am trying to web-scrape a dynamic page using Selenium, BeautifulSoup, and Python, and I am able to scrape the first page. But when I try to get to the next page, the URL doesn't change, and when I Inspect, I am unable to see Form Data either. Can someone help me?
import time
from selenium import webdriver
from parsel import Selector
from bs4 import BeautifulSoup
import random
import re
import csv
import requests
import pandas as pd
companies = []
overview = []
people = []
driver = webdriver.Chrome(executable_path=r'C:\\Users\\rahul\Downloads\\chromedriver_win32 (1)\\chromedriver.exe')
driver.get('https://coverager.com/data/companies/')
driver.maximize_window()
src = driver.page_source
soup = BeautifulSoup(src, 'lxml')
table = soup.find('tbody')
descrip = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    #print(td)
    row = [i.text.strip() for i in td]
    descrip.append(row)
    #print(row)
#file = open('gag.csv','w')
#with file:
#    write = csv.writer(file)
#    write.writerows(descrip)
url = ('https://coverager.com')
a_tags = table.find_all('a', href=True)
for link in a_tags:
    ol = link.get('href')
    pl = link.string.strip()
    #companies.append(row)
    #print(pl)
    #print(ol)
    driver.get(url + ol)
    driver.implicitly_wait(1000)
    data1 = driver.find_element_by_class_name('tab-details').text
    overview.append(data1.strip())
    data2 = driver.find_element_by_link_text('People').click()
    p_tags = driver.find_element_by_class_name('tab-details').text
    people.append(p_tags)
In your case of https://coverager.com/data/companies/ it would be much easier to scrape the API call instead of the HTML on the page.
Open dev tools (in Chrome, right-click and hit Inspect) and go to the Network tab. When you hit the "Next" button, a row should show up in the Network tab. Click on this row and then go to Preview; you should see the companies in this tab.
The API is accessed through links which look like the following:
https://coverager.com/wp-json/ath/v1/coverager-data/companies?per_page=20&page=2&draw=4&column=3&dir=desc&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],%22company_type%22:[],%22company_category%22:[],%22region%22:[],%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D
It seems like all the pages call the same API URL but change page= and draw=, which are 2 apart.
So, simply use requests to call this class of links and loop through as many pages as you need! You could also change per_page to return as many companies as you need; you will have to test that, though.
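A minimal sketch of that loop with requests; the JSON layout (a data key holding the rows) is my assumption, so check the Preview tab for the real structure before relying on it:
import requests
# URL taken from the Network tab above, with page and draw left as placeholders
BASE = ('https://coverager.com/wp-json/ath/v1/coverager-data/companies'
        '?per_page=20&page={page}&draw={draw}&column=3&dir=desc'
        '&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],'
        '%22company_type%22:[],%22company_category%22:[],%22region%22:[],'
        '%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D')
for page in range(1, 6):  # first five pages; widen the range as needed
    resp = requests.get(BASE.format(page=page, draw=page + 2))  # draw runs 2 ahead of page
    resp.raise_for_status()
    payload = resp.json()
    for row in payload.get('data', []):  # the 'data' key is an assumption
        print(row)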

BeautifulSoup- Accessing More Reviews

I am trying to web-scrape reviews from an IMDb movie link and extract usernames for the reviews, but I am only getting 25 usernames, since that's what the page shows until you press "Show More". I need a way to access all reviews. Is there a way to do this besides using Selenium? For some reason I get an SSL cert error when trying to import that.
import requests
from time import sleep
url='https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv'
response= requests.get(url,verify=False)
response
import bs4
soup=bs4.BeautifulSoup(response.content, 'html5lib')
name=soup.find_all('span', class_='display-name-link')
len(name)
To scrape all usernames (a total of 4041), send a GET request to simulate clicking the button:
import requests
from bs4 import BeautifulSoup
main_url = "https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv"
ajax_url = "https://www.imdb.com/title/tt0068646/reviews/_ajax?ref_=undefined&paginationKey={}"
soup = BeautifulSoup(requests.get(main_url).content, "html5lib")
while True:
    for tag in soup.select(".display-name-link"):
        print(tag.text)
    print("-" * 30)
    button = soup.select_one(".load-more-data")
    if not button:
        break
    key = button["data-key"]
    soup = BeautifulSoup(requests.get(ajax_url.format(key)).content, "html5lib")
Output:
CalRhys
gogoschka-1
SJ_1
andrewburgereviews
alexkolokotronis
MR_Heraclius
b-a-h
TNT-6
danielfeerst
mattrochman
Godz365
winnantonio
Trevizolga
DaveDiggler
ks4
...
... All the way until
Steven Bray
Castor-5
BLDJ
pinky67
dean keaton
rejoefrankel
Timothy
I cannot think of a way to click elements without Selenium. You can add options in the browsers to ignore SSL certificate errors.
Firefox
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Chrome
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
IE
from selenium import webdriver
capabilities = webdriver.DesiredCapabilities().INTERNETEXPLORER
capabilities['acceptSslCerts'] = True
driver = webdriver.Ie(capabilities=capabilities)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Here's what you're looking for:
import requests as r
from time import sleep
from bs4 import BeautifulSoup
# We'll use this link that gets only the reviews
url_reviews = 'https://www.imdb.com/title/tt0068646/reviews/_ajax'
# We need to get a key every time to scrape the next page
url_reviews_next = 'https://www.imdb.com/title/tt0068646/reviews/_ajax?paginationKey='
response = r.get(url_reviews)
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.find_all('span', class_='display-name-link')
# Get your data here
# If there's only one page, there won't be a load-more-data element
paginationKey = ''
try:
    paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
except IndexError:
    paginationKey = ''
print(paginationKey)
while paginationKey != '':
    response = r.get(url_reviews_next + paginationKey)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Get your data here
    try:
        paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
    except IndexError:
        # The last page has no more pagination, so the key stays empty
        paginationKey = ''
    print(paginationKey)
Every time, we just extract the paginationKey to scrape the next page; no need for Selenium.

Scraping a table with BeautifulSoup

I am trying to scrape a table with info on football players from https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1
It works fine when I try to get information manually like this:
url = 'https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019'
response = requests.get(url, headers={'User-Agent': 'Custom5'})
data = response.text
soup = BeautifulSoup(data, 'html.parser')
players_table = soup.find("table", attrs={"class": "items"})
Players = soup.find_all("a", {"class": "spielprofil_tooltip"})
Players[5].text
Values = soup.find_all("td", {"class": "rechts hauptlink"})
Values[9].text
Birthdays = soup.find_all("td", {"class": "zentriert"})
Birthdays[1].text
But to actually get the data into a table, I think I need to use a for loop with td and tr tags. I have looked for solutions but cannot find anything that works with this particular website.
When I try this, for example, the list remains empty:
data = []
for tr in players_table.find_all("tr"):
    # remove any newlines and extra spaces from left and right
    data.append
print(data)
You don't actually append anything to the list.
Change data.append to data.append(tr).
That way you tell your program what to append to the list, assuming players_table.find_all("tr") returns at least one item.
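Putting that fix together with the cell extraction the question's comment hints at, here is a minimal sketch (reusing players_table from the question; pulling the stripped text per cell is my addition, not tailored to this table's columns):
data = []
for tr in players_table.find_all("tr"):
    # collect the stripped text of every cell in the row
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    if row:  # header rows contain no td cells, so skip them
        data.append(row)
print(data)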
The website uses JavaScript, but requests doesn't support it, so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait 5 seconds for the page to load
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
players_table = soup.find("table", attrs={"class": "items"})
for tr in players_table.find_all('tr'):
    tds = ' '.join(td.get_text(strip=True) for td in tr.select('td'))
    print(tds)
driver.quit()

Webscraping - Python - Can't find links in html

I'm trying to scrape all links from https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en; however, without even selecting an element, my code retrieves no links. Please see my code below.
import bs4,requests as rq
Link = 'https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en'
RQOBJ = rq.get(Link)
BS4OBJ = bs4.BeautifulSoup(RQOBJ.text)
print(BS4OBJ)
Hope you want the links of the courses on the page; this code will help:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
baseurl = 'https://www.udemy.com'
url = "https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
courseLink = soup.findAll("a", {"class": "card__title", 'href': True})
for link in courseLink:
    print(baseurl + link['href'])
driver.quit()
It will print:
https://www.udemy.com/the-complete-sql-bootcamp/
https://www.udemy.com/the-complete-oracle-sql-certification-course/
https://www.udemy.com/introduction-to-sql23/
https://www.udemy.com/oracle-sql-12c-become-an-sql-developer-with-subtitle/
https://www.udemy.com/sql-advanced/
https://www.udemy.com/sql-for-newbs/
https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data/
https://www.udemy.com/sql-for-punk-analytics/
https://www.udemy.com/sql-basics-for-beginners/
https://www.udemy.com/oracle-sql-step-by-step-approach/
https://www.udemy.com/microsoft-sql-for-beginners/
https://www.udemy.com/sql-tutorial-learn-sql-with-mysql-database-beginner2expert/
The website uses JavaScript to fetch data, so you should use Selenium.
