I am currently writing a Python Selenium script to scrape information from a website.
I have successfully got the data from page 1 of 100+ in the format I want. Unfortunately, I can't get the program to go on and collect the information from the succeeding pages.
When I inspect the target page, it shows me that the "Next" button resolves to the XPath below:
/body/div[@id='main-content']/div[@class='t6a-grid']/div[@class='mmargin-bottom-30']/div[@id='grid']/div[@class='row-margin-bottom-10']/div[@class='col-md-12 padding-left-0 padding-right-20']/ul[@class='pagination']/li[11]/a
Part of the script I have written is below. The "# this is to navigate to the next page" comment marks the area that isn't currently working.
def get_links(driver, target):
    # this is to collect the links for all the profiles on the Freshfields website
    driver.get(target)
    # get the links to the profiles on each result page
    list_links = []
    while True:
        list_ppl_link = driver.find_elements_by_xpath('//div[@class=" mix item col-xs-6 col-sm-4"]')
        for item in list_ppl_link:
            emp_name_obj = item.find_element_by_tag_name('a')
            emp_name = emp_name_obj.text
            emp_link = emp_name_obj.get_attribute('href')
            list_links.append({'emp_name': emp_name, 'emp_link': emp_link})
        try:
            # this is to navigate to the next page
            driver.find_element_by_xpath('//ul[@class="pagination"]/li').click()
            time.sleep(1)
        except NoSuchElementException:
            break
    return list_links
Could somebody please help me understand how I can loop through the pages and collect all 1,960 records?
Try using something like the below, which clicks the link for each page number in turn (note the page counter is incremented once per page, outside the inner for loop, and the contains() predicate needs a closing bracket):
list_links = []
i = 1
while True:
    list_ppl_link = driver.find_elements_by_xpath('//div[@class=" mix item col-xs-6 col-sm-4"]')
    for item in list_ppl_link:
        emp_name_obj = item.find_element_by_tag_name('a')
        emp_name = emp_name_obj.text
        emp_link = emp_name_obj.get_attribute('href')
        list_links.append({'emp_name': emp_name, 'emp_link': emp_link})
    i = i + 1
    try:
        # navigate to the next page by clicking its page number
        driver.find_element_by_xpath('//ul[@class="pagination"]//li/a[contains(text(),"' + str(i) + '")]').click()
        time.sleep(1)
    except NoSuchElementException:
        break
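Alternatively, since the XPath you inspected shows the "Next" link as the last li in the pagination list, you could click it directly instead of computing page numbers. A minimal sketch, assuming the "Next" link is always the last pagination item and is absent (raising NoSuchElementException) on the final page:
from selenium.common.exceptions import NoSuchElementException

while True:
    # ... collect the profile links on the current page ...
    try:
        # li[last()] targets the last entry in the pagination bar,
        # which your inspected XPath (li[11]/a) suggests is "Next"
        driver.find_element_by_xpath('//ul[@class="pagination"]/li[last()]/a').click()
        time.sleep(1)
    except NoSuchElementException:
        break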
I'm trying to scrape matches and their respective odds from local bookie sites, but for every site I try, my web scraper doesn't return anything and just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get at the contents?
I have tried all the sites below for almost a month, but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")

for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents.
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
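You can also confirm this programmatically. A quick sketch: fetch the raw HTML with requests and check whether the class name appears before any JavaScript has run:
import requests

html = requests.get("https://www.betpawa.ug/", timeout=5).text
# the events-container div is injected by JavaScript afterwards,
# so it should be absent from the raw server response
print("events-container" in html)  # expected: False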
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You should consider using urllib.request from the standard library (not the third-party requests package):
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
# build your request; some sites block the default urllib User-Agent
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# retrieve the document
res = urlopen(req)
# parse it using bs4
html = BeautifulSoup(res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript on the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser through a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)

url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')

for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the last line of the snippet simply extracts the text of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
which of the available information you want,
how to identify this information inside the match div's structure,
and which data type you need this information in.
To make it easy, run the program and open Chrome's developer tools with F12. In the top left corner you will see the icon for "select an element ...". If you click on the icon and then click the desired element in the browser, you will see the equivalent source in the area under the icon.
Analyse it carefully to get the info you need, for example:
The title of the football match is the first h3 tag in the match div and is a string.
The odds shown are span tags with the class event-odds and a number (float/double).
Search for the function you need on Google or in the reference for the package you use (BeautifulSoup4).
Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so we don't grab the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")      # use on the match variable
if len(title_tags) > 0:               # at least one found?
    title = title_tags[0].getText()   # get the text of the first one
    print("Title: ", title)           # show it
else:
    print("no h3 tags found")
    exit()

# (2) let's try to get some odds as numbers, in the order they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                # at least three found?
    odds = []                         # create a list
    for tag in odds_tags:             # loop over the odds_tags we found
        odd = tag.getText()           # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number
        # in Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()       # remove surrounding spaces
        odd = float(clean_odd)        # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)              # keep the cleaned number
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
I have a program that goes to reddit.com and grabs an HTML element from it. However, about 1/10th of the time the old Reddit site shows up, and I have to restart the program. Is there any shorter way to handle this error (basically restart from the top again)? I couldn't seem to figure it out with a try/except.
browser = webdriver.Chrome(executable_path=r'C:\Users\jacka\Downloads\chromedriver_win32\chromedriver.exe')
browser.get("https://www.reddit.com/")
# grabs the html tag for the subreddit name
elem = browser.find_elements_by_css_selector("a[data-click-id='subreddit']")

# in the case that old reddit loads, it restarts the browser
if len(elem) == 0:
    browser.close()
    browser = webdriver.Chrome(executable_path=r'C:\Users\jacka\Downloads\chromedriver_win32\chromedriver.exe')
    browser.get("https://www.reddit.com/")
    # grabs the html tag for the subreddit name
    elem = browser.find_elements_by_css_selector("a[data-click-id='subreddit']")
Like @HSK mentioned in the comment, you can use an infinite while loop to keep trying until you get what you want without an exception. Do add a finally clause to close the browser handle before leaving.
while True:
    browser = webdriver.Chrome(executable_path=r'C:\Users\jacka\Downloads\chromedriver_win32\chromedriver.exe')
    try:
        browser.get("https://www.reddit.com/")
        elem = browser.find_elements_by_css_selector("a[data-click-id='subreddit']")
        # find_elements_* returns an empty list rather than raising,
        # so only stop once the element was actually found
        if elem:
            break
    except Exception:
        pass
    finally:
        browser.close()
Solved thanks to @HSK. I put the code in a while loop that runs until it gets the right version of Reddit.
# had to initialize elem so the loop would run
elem = ""
while len(elem) == 0:
    browser = webdriver.Chrome(executable_path=r'C:\Users\jacka\Downloads\chromedriver_win32\chromedriver.exe')
    browser.get("https://www.reddit.com/")
    # grabs the html tag for the subreddit name
    elem = browser.find_elements_by_css_selector("a[data-click-id='subreddit']")
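A leaner variant of the same idea, as a sketch: reuse a single browser instance and just reload the page until the redesign shows up, so each retry doesn't leave the previous Chrome window open:
browser = webdriver.Chrome(executable_path=r'C:\Users\jacka\Downloads\chromedriver_win32\chromedriver.exe')
elem = []
while len(elem) == 0:
    browser.get("https://www.reddit.com/")
    # find_elements_* returns an empty list when old reddit loads
    elem = browser.find_elements_by_css_selector("a[data-click-id='subreddit']")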
total_link = []
temp = ['a']
total_num = 0

while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs = driver.find_element_by_tag_name('div').text
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    my_titles = soup.select(
        'div._6d3hm > div._mck9w'
    )
    for title in my_titles:
        try:
            if title in temp:
                # print('duplicate')
                pass
            else:
                # print('not a duplicate')
                link = str(title.a.get("href"))  # grab the address!
                total_link.append(link)
                # print(link)
        except:
            pass
    print("number collected so far: " + str(len(total_link)))
    temp = my_titles
    time.sleep(2)
    if 'End of Results' in Divs:
        print('end')
        break
    else:
        continue
Hello, I was scraping Instagram data with tags in Korean.
My code consists of the following steps:
scroll down the page
by using bs4 and requests, get the HTML
locate the time log, picture src, text, tags, and ID
select them all and crawl them
after it is done with the HTML on the page, scroll down again
do the same thing until the end
By doing this, and using code from other people on this site, it seemed to work... but after a few scrolls down, at certain points, scrolling stops with an error message showing '읽어드리지 못합니다', or in English, 'Unable to read'.
Can I know the reason why the error pops up and how to solve the problem?
I am using Python and Selenium.
Thank you for your answer.
Instagram tries to protect itself against malicious attacks, such as scraping or any other automated access. The error often occurs when you try to access Instagram pages abnormally fast, so you have to call time.sleep() more frequently or with longer delays.
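As a small sketch of that idea, you could pause for a randomized interval between scrolls so the access pattern looks less mechanical (the 2-5 second range here is a guess, not a documented limit):
import random
import time

# sleep between 2 and 5 seconds after each scroll; tune the range
# depending on how aggressively you are scraping
time.sleep(random.uniform(2, 5))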
I'm trying to make a Python app that extracts all of the video titles from a YouTube channel.
I'm currently attempting to do it using Selenium.
def getVideoTitles():
    driver = webdriver.Chrome("/Users/{username}/PycharmProjects/YoutubeChannelVideos/chromedriver")
    driver.get(googleYoutubePage())
    titleElement = driver.find_element_by_class_name("yt-lockup-content")
    print(titleElement.text)  # it prints out the title, plus views, hours ago, and "CC"
    # I suck at selenium so let's just store the title and cut everything after it
The class name yt-lockup-content is the class name for each video on a YouTube channel's /videos page.
In the code above I am able to get the title of the first video on that page, but I want to iterate through all of the titles (in other words, through every single yt-lockup-content element) in order to store each .text.
So I was wondering how I can access, say, the second yt-lockup-content, which in other words is the second video on that page with the same class name.
Here is my full code. Feel free to play with it:
import selenium
from selenium import webdriver

def getChannelName():
    print("Please enter the channel that you would like to scrape video titles...")
    channelName = input()
    googleSearch = "https://www.google.ca/search?q=%s+youtube&oq=%s+youtube&aqs=chrome..69i57j0l5.2898j0j4&sourceid=chrome&ie=UTF-8#q=%s+youtube&*" % (channelName, channelName, channelName)
    print(googleSearch)
    return googleSearch

def googleYoutubePage():
    driver = webdriver.Chrome("/Users/{username}/PycharmProjects/YoutubeChannelVideos/chromedriver")
    driver.get(getChannelName())
    element = driver.find_element_by_class_name("s")  # this is where the link to the proper youtube page lives
    keys = element.text  # this grabs the link to the youtube page + other crap that will be cut
    splitKeys = keys.split(" ")  # aside from the link this grabs the page description, which we need to truncate
    linkToPage = splitKeys[0]  # this is where the link lives
    for index, char in enumerate(linkToPage):  # loop over the link to find where the stuff beside the link begins (which is unnecessary)
        if char == "\n":
            extraCrapStartsHere = index  # it starts here; we know everything beyond here can be cut
    link = ""
    for i in range(extraCrapStartsHere):  # the official link is everything in linkToPage up to the cut point
        link = link + linkToPage[i]
    videosPage = link + "/videos"
    print(videosPage)
    return videosPage

def getVideoTitles():
    driver = webdriver.Chrome("/Users/{username}/PycharmProjects/YoutubeChannelVideos/chromedriver")
    driver.get(googleYoutubePage())
    titleElement = driver.find_element_by_class_name("yt-lockup-content")
    print(titleElement.text)  # it prints out the title, plus views, hours ago, and "CC"
    # I suck at selenium so let's just store the title and cut everything after it

def main():
    getVideoTitles()

main()
This seems like an overly complicated way to do this. You can just navigate directly to the videos page using the URL, https://www.youtube.com/user/{ChannelName}/videos, loop through the titles, and print them.
print("Please enter the channel that you would like to scrape video titles...")
channelName = input()
videosUrl = "https://www.youtube.com/user/%s/videos" % channelName
driver = webdriver.Chrome("/Users/{username}/PycharmProjects/YoutubeChannelVideos/chromedriver")
driver.get(videosUrl)
for title in driver.find_elements_by_class_name("yt-uix-tile-link")
print(title.text)
Rather than using driver.find_element_by_class_name, you can use driver.find_elements_by_class_name, which will return a list of all the elements with the specified class name.
From there you can iterate through the list and get the title of each YouTube video.
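A minimal sketch of that approach, reusing the yt-lockup-content class from your code:
# find_elements_* (plural) returns a list instead of a single element
titleElements = driver.find_elements_by_class_name("yt-lockup-content")
for titleElement in titleElements:
    print(titleElement.text)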
Have you tried driver.find_elements_by_css_selector(".yt-lockup-content")?
I'm trying to take some information from an HTML element using Selenium with Python, and I'm unsure how to save it. I'm kind of new to programming: literate enough to write code, but it's hard to research answers and adapt them to my code. I've looked on Google and can't seem to find anything that would help me with what I specifically need.
Here is the HTML element I need to get information from:
<span id="ctl00_plnMain_rptAssigmnetsByCourse_ctl00_lblOverallAverage">99.05</span>
I need to retrieve the 99.05 and store it in a variable named "avg."
Here is the code I have for the Selenium test.
username = raw_input("Username: ")
password = raw_input("Password: ")
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://hac.mckinneyisd.net/homeaccess/default.aspx") # Load page
elem = browser.find_element_by_name("ctl00$plnMain$txtLogin") # Find the query box
elem.send_keys(username)
elem = browser.find_element_by_name("ctl00$plnMain$txtPassword") # Find the password box
elem.send_keys(password + Keys.RETURN)
time.sleep(0.2) # Let the page load
elem = browser.find_element_by_link_text("Classwork").click()
time.sleep(0.2)
???????????????
browser.close()
What should I put in the ??? section to take the 99.05 from the element and save it as "avg"? I have tried:
content = elem.text("td[#id='ctl00....lblOverallAverage']"
...but I get an error saying that I can't do that because it has no type.
Try:
elem = browser.find_element_by_id("ctl00_plnMain_rptAssigmnetsByCourse_ctl00_lblOverallAverage")
avg = elem.text
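Note that .text returns a string; if you want to use the average numerically, convert it (a sketch, assuming the element always contains a plain number):
avg = float(elem.text)  # "99.05" -> 99.05; raises ValueError if it isn't numeric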