How to retrieve all pages of the table until the last page? - python

In the code below I don't know where the last page is; the code only works up to page 25, which I hard-coded manually. Sometimes there are 60 or 70 pages. How can I change the code so it scrapes the table through the last page?
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome(r'C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)

Canada_Result = []
for J in range(25):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[8]/abbr')
    for i in range(25):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text,
                          'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    df_data = pd.DataFrame(Canada_Result)
    df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)

I used the variable pages to count how many pagination buttons there were, and then worked out how many pages there are by reading the text of the last non-"Next" button on the page. This handles multi-page selections, and I also added a fallback for selections that fit on a single page, i.e. fewer than 25 rows retrieved.
I added an if statement near the start for categories that return fewer than 25 rows, for example this one https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=9&r3=1&r4=02&r5=0&r7=0&r8=2022-01-01&r9=2022-04-01 which only returns 19 rows.
The variable period_entries just counts how many row entries there are on each page. I added it mainly because the last page had only 22 entries instead of 25, which initially broke the program when it reached the end.
The last if statement ensures the program still scrapes the very last page but does not try to click the Next button there, since it is not available.
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)

Canada_Result = []

# count the pagination buttons, then read the page count from the text of the last one
pages = len(driver.find_elements_by_xpath('//a[@class="paginate_button" or @class="paginate_button current" and @title]'))
pages = driver.find_element_by_xpath('//a[@onclick and @class="paginate_button" or @class="paginate_button current" and @title][%d]' % (pages)).text.strip("Page\n")
if pages == '':
    pages = 1  # single-page selections have no numbered buttons

for J in range(int(pages)):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr/td[8]/abbr')
    period_entries = len(commodities)  # rows on the current page (the last page may have fewer than 25)
    for i in range(period_entries):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text,
                          'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    df_data = pd.DataFrame(Canada_Result)
    df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
    if J == int(pages) - 1:
        print("Done")
        break  # last page: do not try to click the unavailable Next button
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)
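If counting the pagination buttons ever proves brittle, an alternative is to keep clicking Next until it is disabled. A minimal sketch, assuming the pager marks the Next link (id="report_results_next") with a "disabled" CSS class on the last page, which is standard DataTables behaviour but should be verified against the live page:

while True:
    # ... scrape the rows of the current page here (same XPaths as above) ...
    next_button = driver.find_element_by_xpath('//*[@id="report_results_next"]')
    if 'disabled' in next_button.get_attribute('class'):
        break  # last page reached; Next can no longer be clicked
    next_button.click()
    time.sleep(1)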

Related

Condition based scraping using selenium python

I want to scrape the dates and the respective news headlines/articles for a period of 6 days - e.g. if the script runs today (10th August), it should scrape headlines/articles from today back to 4th August.
As of now I am able to scrape the dates and headlines/URLs for all dates from here.
Here is the code:
websites = ['https://www.thespiritsbusiness.com/tag/rum/']
for spirits in websites:
    browser.get(spirits)
    time.sleep(1)
    news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
    n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
    dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
    n_dates = [ele.text for ele in dates]
    print(n_links)
    print(n_dates)
But how do I scrape only the period of the last 6 days from today? Any ideas?
See, the page 2 URL is
https://www.thespiritsbusiness.com/tag/rum/page/2/
which means that for the next iteration you need to append /page/2/ to the URL.
You can build the websites list as:
websites = ['https://www.thespiritsbusiness.com/tag/rum/', 'https://www.thespiritsbusiness.com/tag/rum/page/2/', 'https://www.thespiritsbusiness.com/tag/rum/page/3/']
and so on, to achieve this.
Or you can do this programmatically:
page_number = 1
websites = ['https://www.thespiritsbusiness.com/tag/rum/']
for spirits in websites:
    browser.get(spirits + f"page/{page_number}/")
    page_number = page_number + 1
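To then restrict the results to the last 6 days, one option is to parse each scraped date and keep only entries on or after the cutoff. A minimal sketch, assuming n_links and n_dates come from the scraping loop above and that the site's dates parse with the format string below (an assumption - print one scraped date and adjust):

from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=6)
recent = []
for link, date_text in zip(n_links, n_dates):
    # '%d %B, %Y' is a guess at the site's date format, e.g. '10 August, 2022'
    parsed = datetime.strptime(date_text, '%d %B, %Y')
    if parsed >= cutoff:
        recent.append((parsed.date(), link))
print(recent)

Once a page yields no dates inside the window, you can stop requesting further pages.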

BeautifulSoup: how to iterate over a table

I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute of first goal.
It is a table of soccer match results, with about 10 matches per table and one table per matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am trying web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over the table row by row and sends each piece of data to a list, so that I would end up with, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to build a complete dataframe with all the extracted data. However, I have a problem with matches that end 0-0: the 'Minute first goal' column is empty, nothing gets extracted, I cannot fill that value in my dataframe in any way, and I get an error. To continue with the previous example, imagine that the second game ended 0-0: the first-goal-minutes list would then hold only one value (17').
In my mind, the solution is a loop that takes the data cell by cell, with a condition on 'Score': if it is 0-0, a placeholder value such as 'No goals' is appended to the first-goal-minutes list.
This is the code I am using; I paste only the part where I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser')  # I have to use Selenium first because I have to expand some buttons on the page
table = page.find('div', class_='contentitem').find_all('tr', class_='vevent')

teams1 = []
teams2 = []
scores = []
for cell in table:
    team1 = cell.find('td', class_='team1')
    for name in team1:
        nteam1 = name.text
        teams1.append(nteam1)
    team2 = cell.find('td', class_='team2')
    for name in team2:
        nteam2 = name.text
        teams2.append(nteam2)
    score = cell.find('span', class_='clase')
    for name in score:
        nscore = name.text
        scores.append(nscore)
It is not clear to me how to iterate over the table so that the content of each cell ends up in the right list, and it is essential to include the condition "when the score cell is 0-0, add a no-goals entry to the list".
If someone could help me, I would be very grateful. Best regards.
You are close to your goal, but you can optimize your script a bit.
Do not use several different lists; just use one:
data = []
Try to get all the information in one loop - there is a td that contains all of it - and push a dict to your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Push your data into a DataFrame:
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Selenium is needed first to expand some buttons on the page

data = []
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })

pd.DataFrame(data)

How do I stop print(val.text) from printing the same table 8 times?

I am scraping tables from a Matchday dropdown. When print(val.text) executes, it prints each table 8 times (once per row) instead of just once, then moves on to the next Matchday and prints that table 8 times as well. I just want each table printed once; can someone help me find where the problem is?
I would also like to append the scraped tables to an Excel file; any assistance would be appreciated.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome(r'C:\Chrome\chromedriver.exe')
driver.get('https://www.betika.com/mobile/#/virtuals/results')

# Selecting the Season dropdown box
Seasons = Select(driver.find_element_by_css_selector('td:nth-of-type(1) > .virtuals__results__header__select'))
# Selecting the Matchday dropdown box
Matchdays = Select(driver.find_element_by_css_selector('td:nth-of-type(2) > .virtuals__results__header__select'))

# selecting all the options in the Seasons dropdown
Season = len(Seasons.options)
for items in reversed(range(Season)):
    Seasons.select_by_index(items)
    time.sleep(2)
    # selecting all the options in the Matchdays dropdown
    Matchday = len(Matchdays.options)
    for items in range(Matchday):
        Matchdays.select_by_index(items)
        time.sleep(2)
        rows = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr'))  # count the rows in the table
        columns = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr[2]/td'))  # count the columns in the table
        for r in range(3, rows + 1):
            for c in range(1, columns + 1):
                value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
                for val in value:
                    print(val.text)
It looks like this line is your problem.
value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
The string is built with single quotes, so the "+str(r)+" parts are never concatenated; they are passed to XPath literally. .format() is what I would suggest using. See below; I commented out the problematic line and added my updated line.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome(r'C:\Chrome\chromedriver.exe')
driver.get('https://www.betika.com/mobile/#/virtuals/results')

# Selecting the Season dropdown box
Seasons = Select(driver.find_element_by_css_selector('td:nth-of-type(1) > .virtuals__results__header__select'))
# Selecting the Matchday dropdown box
Matchdays = Select(driver.find_element_by_css_selector('td:nth-of-type(2) > .virtuals__results__header__select'))

# selecting all the options in the Seasons dropdown
Season = len(Seasons.options)
for items in reversed(range(Season)):
    Seasons.select_by_index(items)
    time.sleep(2)
    # selecting all the options in the Matchdays dropdown
    Matchday = len(Matchdays.options)
    for items in range(Matchday):
        Matchdays.select_by_index(items)
        time.sleep(2)
        rows = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr'))  # count the rows in the table
        columns = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr[2]/td'))  # count the columns in the table
        for r in range(3, rows + 1):
            for c in range(1, columns + 1):
                # value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
                value = driver.find_elements_by_xpath("//*[@id='app']/main/div/div[2]/table[{}]/tbody/tr[{}]".format(r, c))
                for val in value:
                    print(val.text)
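For the second part of the question (appending the scraped tables to an Excel file), a hedged sketch is to accumulate every cell in a list of dicts while the loops run and write it out once at the end with pandas; the keys below are illustrative placeholders, not the site's real column headers:

import pandas as pd

scraped_rows = []  # accumulate across all Seasons/Matchdays

# inside the innermost loop, instead of print(val.text):
#     scraped_rows.append({'matchday': items, 'row': r, 'col': c, 'text': val.text})

# after all loops finish, write everything in one go
df = pd.DataFrame(scraped_rows)
df.to_excel('betika_results.xlsx', index=False)

Writing once at the end avoids reopening the workbook on every iteration.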

Pass two parameters into a url element while running the loop to webscrape data

import requests

for i in range(len(lat_lon_df)):
    lat, lon = lat_lon_df.iloc[i]
    try:
        page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}&unit=0&lg=english&FcstType=graphical".format(lat).format(lon))
        print(page)
    except:
        continue
         Lat         Long
0  55.999722  -161.207778
1  60.891854  -161.392330
2  60.890632  -161.199325
3  54.143012  -165.785368
4  62.746967  -164.602280
I am trying to run a loop to scrape data while taking new latitude and longitude parameters for every iteration.
Use format only once, passing both values (see the str.format documentation):
page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}&unit=0&lg=english&FcstType=graphical".format(lat, lon))
Alternative:
page = requests.get(f"https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}&unit=0&lg=english&FcstType=graphical")

How can I run multiple Python Selenium loops on one element?

I want to extract price data from one and the same element via two concurrent while loops: a slow loop (with time.sleep(1)) that reads the price every second, and at the same time a fast loop (with time.sleep(0.2)) that reads it every 0.2 seconds.
Code that extracts the price data with only one loop, and works:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import time

# ~~~ Cookies saver:
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(chrome_options=chrome_options)

# ~~~ Go to the link:
driver.get('https://coininfoline.com/currencies/ETH/ethereum/')
input('ENTER AFTER PAGE LOADED:')

# ~~~ Get price data:
while True:
    price_extractor = driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]')
    for price_raw in price_extractor:
        price = price_raw.text
    # ~~~ Time stamp:
    timestamper = datetime.now()
    # ~~~ Date and price printer:
    print(timestamper, str(price))
    time.sleep(1)
I expect a second while loop that extracts price data AT THE SAME TIME, but faster, with time.sleep(0.2). How can this be done, and is it even possible?
Maybe trying with multiprocessing would be possible?
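No answer is recorded here, but here is a minimal sketch of the concurrency idea the question raises. Because both loops would share the single driver from the code above, threads are the lighter choice (a WebDriver cannot easily be shared across processes with multiprocessing), and since WebDriver calls are not thread-safe the sketch serializes them with a lock. All names are illustrative; datetime and time come from the imports above:

import threading

driver_lock = threading.Lock()  # WebDriver is not thread-safe: serialize access

def poll_price(interval):
    while True:
        with driver_lock:
            elements = driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]')
            price = elements[0].text if elements else None
        print(datetime.now(), interval, price)
        time.sleep(interval)

# slow loop every 1 s, fast loop every 0.2 s, running concurrently
threading.Thread(target=poll_price, args=(1,), daemon=True).start()
threading.Thread(target=poll_price, args=(0.2,), daemon=True).start()
input('Press ENTER to stop:')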
