Python web scraping: I keep getting StaleElementReferenceException - python

I am trying to scrape a site.
I want to loop through all the dates in dateelements and, if one matches day, click on that date to select it, collect the data and add it to the dataframe. I want to do it in one go as each day has unique data.
# Select the Display by splitting the data by AM/PM, split at 12 pm
browser.find_element_by_xpath("//*[contains(@id,'react-select-4--value-item')]").click()
Display = browser.find_element_by_xpath("//*[contains(@id,'react-select-4--option-2')]").click()

###### Select the date for scraping from the planning search calendar
browser.find_element_by_id("room-planning-from-date").click()
planning_From_Dates = browser.find_elements_by_xpath("//*[contains(@class,'DayPicker-Day')]")

for dateelements in itertools.islice(planning_From_Dates, None, None, n):
    date = dateelements.text
    for day in days:
        time.sleep(0.1)
        print('This is the date:', date, '| This is the day', day)
        if date == day:
            dateelements.click()
            time.sleep(3)
            child_groups = browser.find_elements_by_xpath("//*[contains(@class,'roomPlanning__attendanceInfoHeader')]")
            for group in child_groups:
                # time.sleep(0.2)
                Group = group.text
                group.click()
                # Let's bring the page source into the tool to make sure we have the correct page
                page = browser.page_source
                dfs = pd.read_html(page, header=0)
                df = dfs[0]
                ## Add dfs together
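The snippet stops before combining the per-day tables, and it keeps using the day elements found before the first click, which is exactly where StaleElementReferenceException tends to appear: once the calendar re-renders, the previously located WebElements go stale. Below is a minimal sketch of one way around it, re-locating the day cells for each target day and concatenating the per-day frames at the end; the locators and the days list come from the snippet above, and reopening the date picker for every day is an assumption.

import time
import pandas as pd

all_days_data = []  # one DataFrame per selected day

for day in days:  # 'days' is the list of target day labels from the snippet above
    # Re-open the date picker and re-locate the day cells on every pass:
    # after a click the calendar re-renders, and any WebElement found
    # earlier raises StaleElementReferenceException when used again.
    browser.find_element_by_id("room-planning-from-date").click()
    day_cells = browser.find_elements_by_xpath("//*[contains(@class,'DayPicker-Day')]")
    for cell in day_cells:
        if cell.text == day:
            cell.click()
            time.sleep(3)  # crude wait; WebDriverWait with an expected condition is more robust
            df = pd.read_html(browser.page_source, header=0)[0]
            df['Day'] = day  # remember which day these rows belong to
            all_days_data.append(df)
            break  # the remaining cell references are stale now, so stop using them

# add the per-day frames together into one DataFrame
combined = pd.concat(all_days_data, ignore_index=True)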

Related

Condition based scraping using selenium python

I want to scrape the dates and the respective news headlines/articles for a period of 6 days: when the Python script runs today, it should scrape headlines/articles from today (10th August) back to 4th August.
At the moment I am able to scrape the dates and headlines/URLs for all dates from here.
Here is the code for the same:
websites = ['https://www.thespiritsbusiness.com/tag/rum/']

for spirits in websites:
    browser.get(spirits)
    time.sleep(1)
    news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
    n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
    dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
    n_dates = [ele.text for ele in dates]

print(n_links)
print(n_dates)
But how do I scrape the last 6 days from today? Any ideas?
See that the page 2 URL is
https://www.thespiritsbusiness.com/tag/rum/page/2/
which basically means that for the next iteration you would need to add /page/2/ to the URL.
You can have a websites list such as:
websites = ['https://www.thespiritsbusiness.com/tag/rum/', 'https://www.thespiritsbusiness.com/tag/rum/page/2/', 'https://www.thespiritsbusiness.com/tag/rum/page/3/']
and so on, to achieve this.
Or you can do this programmatically as well:
page_number = 1
websites = ['https://www.thespiritsbusiness.com/tag/rum/']

for spirits in websites:
    browser.get(spirits + f"page/{page_number}/")
    page_number = page_number + 1
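To walk several pages in one run, the page number can drive the loop instead of a hand-maintained list. A rough sketch combining that with the element collection from the question (the number of pages to visit is an assumption; the locators are the ones already used above):

import time

base_url = 'https://www.thespiritsbusiness.com/tag/rum/'
all_links, all_dates = [], []

for page_number in range(1, 4):  # pages 1..3, adjust as needed
    # page 1 is the plain tag URL, later pages get /page/N/ appended
    url = base_url if page_number == 1 else f"{base_url}page/{page_number}/"
    browser.get(url)
    time.sleep(1)
    news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
    all_links += [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
    dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
    all_dates += [ele.text for ele in dates]

print(all_links)
print(all_dates)

The six-day window can then be applied by parsing the entries in all_dates and keeping only those newer than today minus six days.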

Python web scraping, only collects 80 to 90% of intended data rows. Is there something wrong with my loop?

I'm trying to collect the 150 rows of data from the text that appears at the bottom of a given Showbuzzdaily.com web page (example), but my script only collects 132 rows.
I'm new to Python. Is there something I need to add to my loop to ensure all records are collected as intended?
To troubleshoot, I created a list (program_count) to verify this is happening in the code before the CSV is generated, which shows there are only 132 items in the list, rather than 150. Interestingly, the final row (#132) ends up being duplicated at the end of the CSV for some reason.
I experience similar issues scraping Google Trends (using pytrends), where only about 80% of the data I try to scrape ends up in the CSV. So I suspect there's something wrong with my code, or that I'm overwhelming my target with requests.
Adding time.sleep(0.1) to the for and while loops in this code didn't produce different results.
import time
import requests
import datetime
from bs4 import BeautifulSoup
import pandas as pd # import pandas module
from datetime import date, timedelta

# creates empty 'records' list
records = []

start_date = date(2021, 4, 12)
orig_start_date = start_date # Used for naming the CSV
end_date = date(2021, 4, 12)
delta = timedelta(days=1) # Defines delta as +1 day

print(str(start_date) + ' to ' + str(end_date)) # Visual reassurance

# begins while loop that will continue for each daily viewership report until end_date is reached
while start_date <= end_date:
    start_weekday = start_date.strftime("%A") # define weekday name
    start_month_num = int(start_date.strftime("%m")) # define month number
    start_month_num = str(start_month_num) # convert to string so it is ready to be put into address
    start_month_day_num = int(start_date.strftime("%d")) # define day of the month
    start_month_day_num = str(start_month_day_num) # convert to string so it is ready to be put into address
    start_year = int(start_date.strftime("%Y")) # define year
    start_year = str(start_year) # convert to string so it is ready to be put into address

    # define address (URL)
    address = 'http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-'+start_weekday.lower()+'-cable-originals-network-finals-'+start_month_num+'-'+start_month_day_num+'-'+start_year+'.html'
    print(address) # print for visual reassurance

    # read the web page at the defined address (URL)
    r = requests.get(address)
    soup = BeautifulSoup(r.text, 'html.parser')

    # we're going to deal with results that appear within <td> tags
    results = soup.find_all('td')

    # reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
    date_line = results[0].text.split(": ",1)[1] # reads the text after the colon and space (': '), which is where the date information is located
    weekday_name = date_line.split(' ')[0] # stores the weekday name
    month_name = date_line.split(' ',2)[1] # stores the month name
    day_month_num = date_line.split(' ',1)[1].split(' ')[1].split(',')[0] # stores the day of the month
    year = date_line.split(', ',1)[1] # stores the year

    # concatenates and stores the full date value
    mmmmm_d_yyyy = month_name+' '+day_month_num+', '+year

    del results[:10] # deletes the first 10 results, which contained the date information and column headers

    program_count = [] # empty list for program counting

    # (within the while loop) begins a for loop that appends data for each program in a daily viewership report
    for result in results:
        rank = results[0].text # stores P18-49 rank
        program = results[1].text # stores program name
        network = results[2].text # stores network name
        start_time = results[3].text # stores program's start time
        mins = results[4].text # stores program's duration in minutes
        p18_49 = results[5].text # stores program's P18-49 rating
        p2 = results[6].text # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
        program_count.append(program) # adds each program name to the list.
        del results[:7] # deletes the first 7 results remaining, which contained the data for 1 row (1 program) which was just stored in 'records'

    print(len(program_count)) # Troubleshooting: prints to screen the number of programs counted. Should be 150.
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
    print(str(start_date)+' collected...') # Visual reassurance one page/day is finished being collected
    start_date += delta # at the end of while loop, advance one day

df = pd.DataFrame(records, columns=['Date','Weekday','P18-49 Rank','Program','Network','Start time','Mins','P18-49','P2+']) # Creates DataFrame using the columns listed
df.to_csv('showbuzz '+ str(orig_start_date) + ' to '+ str(end_date) + '.csv', index=False, encoding='utf-8') # generates the CSV file, using start and end dates in filename
It seems like you're making debugging a lot tougher on yourself by pulling all the table data (<td>) individually like that. After stepping through the code and making a couple of changes, my best guess is the bug is coming from the fact that you're deleting entries from results while iterating over it, which gets messy. As a side note, you're also never using result from the loop which would make the declaration pointless. Something like this ends up a little cleaner, and gets you your 150 results:
results = soup.find_all('tr')

# reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
date_line = results[0].select_one('td').text.split(": ", 1)[1] # Selects the first td it finds under the first tr
weekday_name = date_line.split(' ')[0]
month_name = date_line.split(' ', 2)[1]
day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]
year = date_line.split(', ', 1)[1]
mmmmm_d_yyyy = month_name + ' ' + day_month_num + ', ' + year

program_count = [] # empty list for program counting

for result in results[2:]:
    children = result.find_all('td')
    rank = children[0].text # stores P18-49 rank
    program = children[1].text # stores program name
    network = children[2].text # stores network name
    start_time = children[3].text # stores program's start time
    mins = children[4].text # stores program's duration in minutes
    p18_49 = children[5].text # stores program's P18-49 rating
    p2 = children[6].text # stores program's P2+ viewer count (in thousands)
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))
    program_count.append(program) # adds each program name to the list.
You also shouldn't need a second list to get the number of programs you've retrieved (appending programs to program_count). Both lists end up with the same number of entries no matter what, since you append a program name for every row. So instead of print(len(program_count)) you could have used print(len(records)). I'm assuming it was just for debugging purposes, though.
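To see why deleting from results while iterating over it drops rows, here is a toy example (not from the original post) with ten "cells" grouped two per "row":

cells = list(range(10))   # pretend these are 5 rows of 2 cells each
rows_seen = []
for cell in cells:
    rows_seen.append(cells[0])  # read the "current" row
    del cells[:2]               # delete it while still iterating over the same list
print(rows_seen)  # [0, 2, 4, 6] - only 4 of the 5 rows are visited

The iterator keeps advancing its index while the list shrinks underneath it, so the loop ends early. With 150 rows of 7 cells each, the same effect stops the loop after roughly 132 iterations, which matches the count observed in the question.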

BeautifulSoup: how to iterate over a table

I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute first goal.
It is a table of soccer match results; there are about 10 matches per table and one table for each matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am doing web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over each row of the table, cell by cell, and appends each value to a list, so that I would have, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to build a complete dataframe with all the extracted data. However, I have a problem with matches that end 0-0. In that case the 'Minute first goal' column is empty and nothing is extracted, so I can't fill that value in my dataframe and I get an error. To continue with the previous example, imagine that the second game ended 0-0 and the 'First goal minutes list' therefore contains only one value (17').
In my mind the solution would be a loop that reads the data cell by cell, with a condition on 'Score': if it is 0-0, a value such as 'No goals' is added to the first-goal-minutes list.
This is the code I am using. I paste only the part in which I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser') # I have to use Selenium previously cos I have to expand some buttons in the web
table = page.find('div', class_= 'contentitem').find_all('tr', class_= 'vevent')

teams1 = []
teams2 = []
scores = []

for cell in table:
    team1 = cell.find('td', class_='team1')
    for name in team1:
        nteam1 = name.text
        teams1.append(nteam1)
    team2 = cell.find('td', class_='team2')
    for name in team2:
        nteam2 = name.text
        teams2.append(nteam2)
    score = cell.find('span', class_='clase')
    for name in score:
        nscore = name.text
        scores.append(nscore)
It is not clear to me how to iterate over the table to be able to store in the list the content of each cell and it is essential to include a condition "when the score cell is 0-0 create a non-goals entry in the list".
If someone could help me, I would be very grateful. Best regards
You are close to your goal, but can optimize your script a bit.
Do not use these different lists, just use one:
data = []
Try to get all the information in one loop: there is a td that contains everything, so push a dict to your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Push your data into a DataFrame:
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser') # I have to use Selenium previously cos I have to expand some buttons in the web

data = []
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })

pd.DataFrame(data)

How can I scrape the daily price changes from Coinmarketcap with requests_html?

I want to get the daily price changes from coinmarketcap for all coins available on the website.
I have tried to scrape the daily changes and put them into a list, but somehow I'm getting the hourly, daily and weekly changes in the list. The code I used:
import requests
from requests_html import HTML, HTMLSession

r = HTMLSession().get('https://coinmarketcap.com/all/views/all/')
table = r.html.find('tbody')
delta_list = []

for row in table:
    change = row.find('.percent-change')
    for d in change:
        delta = d.text
        delta_list.append(delta)

print(delta_list)
How can I scrape only the daily changes?
Since requests_html supports xpath...
from requests_html import HTMLSession

r = HTMLSession().get('https://coinmarketcap.com/all/views/all/')

# get the table by id
table = r.html.xpath('//*[@id="currencies-all"]')
# filter table rows to tr elements with an id
rows = table[0].xpath('//tr[@id]')
# your list of results
delta_list = []

# iterate over the rows
for row in rows:
    # get the cryptocurrency name
    name = row.xpath('//*[@class="no-wrap currency-name"]')[0].text.replace('\n', ' ')
    # get the element which contains the 24h change data
    val_elem = row.xpath('//*[@data-timespan="24h"]')
    # some currencies are too fresh to have a 24h result, they contain '?'
    # Such elements don't have the @data-timespan="24h" attribute
    # So if the result is empty something should be done; I decided to add 0
    val = val_elem[0].text if val_elem else 0
    # just a debug print
    print(f"Change of {name} in the past 24h is {val}")
    # add the result to your list
    delta_list.append(val)
On a side note, using a list to store the results is not the best choice. The currencies are sorted by market cap, and the order of some currencies may change from day to day. A dict/OrderedDict would be a better choice, because that way you can pair currencies with their values...
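A rough sketch of that idea, reusing the loop above but storing name → 24h change pairs (assuming the rendered names are unique per row):

delta_by_name = {}

for row in rows:
    name = row.xpath('//*[@class="no-wrap currency-name"]')[0].text.replace('\n', ' ')
    val_elem = row.xpath('//*[@data-timespan="24h"]')
    # pair the currency name with its 24h change (0 when the cell is missing)
    delta_by_name[name] = val_elem[0].text if val_elem else 0

print(delta_by_name)  # lookups by name no longer depend on the row order of the page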

How can Python auto-scrape on a daily basis given certain URL change rules?

Hi, I'm a newcomer to Python. As a salesman I travel a lot and would like to save some money on hotel bookings, so I am using Python to scrape prices for certain hotels on certain days for personal use.
I can use Python to scrape a specific webpage, but I'm having trouble making a serial search.
The single webpage scrape goes like this:
import requests
from bs4 import BeautifulSoup

url = "http://hotelname.com/arrivalDate=05%2F23%2F2016&departureDate=05%2F24%2F2016" # means arrive on May 23 and leave on May 24
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
names = soup.select('.PropertyName')
prices = soup.select('.RateSection ')

for name, price in zip(names, prices):
    data = {
        "name": name.get_text(),
        "price": price.get_text()
    }
    print(data)
By doing this I can get the prices of the hotels on that day. But I would like to know the prices over a longer period (say 15 days), so I can arrange my travel and save some money. The question is: how can I make the search loop automatically?
e.g. hotelname('') price(200USD) May 1 check-in (CI) and May 2 check-out (CO)
hotelname('') price(150USD) May 2 CI May 3 CO
..........
hotelname('') price(170USD) May 30 CI May 31 CO
I hope I've made my intention clear. Can someone guide me on how to achieve this automatic search? It is too much work to change the dates in the URLs manually. Thanks
You can use the datetime lib to get the dates and increment a day at a time in the loop for n days:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

def n_booking(n):
    # start tomorrow
    bk = datetime.now() + timedelta(days=1)
    # check the next n days
    for i in range(n):
        mon, day, year = bk.month, bk.day, bk.year
        # go to the next day for the departure date
        bk = bk + timedelta(days=1)
        d_mon, d_day, d_year = bk.month, bk.day, bk.year
        url = "http://hotelname.com/arrivalDate={mon}%2F{day}%2F{year}&departureDate={d_mon}%2F{d_day}%2F{d_year}"\
            .format(mon=mon, day=day, year=year, d_day=d_day, d_mon=d_mon, d_year=d_year)
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        names = soup.select('.PropertyName')
        prices = soup.select('.RateSection ')
        for name, price in zip(names, prices):
            yield {
                "name": name.get_text(),
                "price": price.get_text()
            }
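Since n_booking yields one dict per hotel as it scrapes, it can be consumed directly; a small usage sketch for a 15-day window:

# print every hotel/price pair for the next 15 nights as it is scraped
for offer in n_booking(15):
    print(offer["name"], offer["price"])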
