Web scraping with Python and Selenium

New to Stack, and I have been learning Python for a couple of months now. I am in the process of writing a script which logs on to a website (which I am a subscriber of) and scrapes article titles and text.
So far I have been able to log on to the website and get to the page with the article titles, and pull the titles for the first page. However, I am having trouble cycling through the pages.
from selenium import webdriver
chrome_path = r"C:\Users\user.name\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.WEBSITE.co.uk/")
driver.find_element_by_name("ctl00$LoginView1$Login1$UserName").send_keys('USERNAME') # Enters username
driver.find_element_by_name("ctl00$LoginView1$Login1$Password").send_keys('PASSWORD') # Enters password
driver.find_element_by_name("ctl00$LoginView1$Login1$Submit").click() # Submits username/password
driver.find_element_by_xpath('//*[@id="middle_col"]/div[2]/div[1]/a[1]').click() # Clicks on more articles
def title_scraper(max_pages): # A loop to cycle through xpaths of various pages (?)
    page = 2 # Set at 2 for test, circa 40 in total
    while page < max_pages:
        newPage = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(page) + ']/a' # xpath = //*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[1]/a - it is td[1] which increases depending on page number
        driver.find_element_by_xpath(newPage).click()
        # Scrapes article titles, currently only does the first page
        titles = driver.find_elements_by_class_name("articletitle")
        for title in titles:
            print(title.text)
Sorry if this has already been answered, I have had no luck with online resources so far!
Update:
def title_scraper(max_pages):
    page = 2
    while page < max_pages:
        path = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(
            max_pages) + ']/a'
        driver.find_element_by_xpath(path)
        titles = driver.find_elements_by_class_name("articletitle")
        for title in titles:
            print(title.text)
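For reference, a minimal sketch (an assumption, not a verified fix) of how that loop could advance through the pages: the td index tracks the page number, the located link actually gets clicked, and page is incremented so the while loop moves on.

def title_scraper(max_pages):
    page = 2
    while page < max_pages:
        # assumes td[N] in the pager row is the link for page N, as noted in the question
        path = ('//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]'
                '/tbody/tr[11]/td/table/tbody/tr/td[' + str(page) + ']/a')
        driver.find_element_by_xpath(path).click()  # go to the next page
        titles = driver.find_elements_by_class_name("articletitle")
        for title in titles:
            print(title.text)
        page += 1  # without this the loop never advances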

Related

Selenium - HTML doesn't always update after a click. Content in the browser changes, but I often get the same HTML from prior to the click

I'm trying to set up a simple web scraping script to pull every hyperlink from the discover cards on Bandcamp.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import time

browser = webdriver.Chrome()
all_links = []
page = 1
url = "https://bandcamp.com/?g=all&s=new&p=0&gn=0&f=digital&w=-1"
browser.get(url)
while page < 6:
    page += 1
    # wait until discover cards are loaded
    test = WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="discover"]/div[9]/div[2]/div/div[1]/div/table/tbody/tr[1]/td[1]/a/div')))
    # scrape hyperlinks for each of the 8 albums shown
    titles = browser.find_elements(By.CLASS_NAME, "item-title")
    links = [title.get_attribute('href') for title in titles[-8:]]
    all_links = all_links + links
    print(links)
    # pagination - click through the page buttons as the links are scraped
    page_nums = browser.find_elements(By.CLASS_NAME, 'item-page')
    for page_num in page_nums:
        if page_num.text.isnumeric():
            if int(page_num.text) == page:
                page_num.click()
                time.sleep(20)  # I've tried multiple long wait times as well as WebDriverWaits on different elements to see if the HTML will update, but I haven't seen a positive effect
                break
I'm using print(links) to see where this is going wrong. In the Selenium browser, it clicks through the pages fine. Note that pagination via the URL parameters doesn't seem possible, as the discover cards often won't load unless you click the page buttons towards the bottom of the page. BeautifulSoup and Requests don't work either, for the same reason. The print function is returning the following:
['https://neubauten.bandcamp.com/album/stimmen-reste-musterhaus-7?from=discover-new', 'https://cirka1.bandcamp.com/album/time?from=discover-new', 'https://futuramusicsound.bandcamp.com/album/yoga-meditation?from=discover-new', 'https://deathsoundbatrecordings.bandcamp.com/album/real-mushrooms-dsbep092?from=discover-new', 'https://riacurley.bandcamp.com/album/take-me-album?from=discover-new', 'https://terracuna.bandcamp.com/album/el-origen-del-viento?from=discover-new', 'https://hyper-music.bandcamp.com/album/hypermusic-vol-4?from=discover-new', 'https://defisis1.bandcamp.com/album/priceless?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://finitysounds.bandcamp.com/album/kreme?from=discover-new', 'https://mylittlerobotfriend.bandcamp.com/album/amen-break?from=discover-new', 'https://electrinityband.bandcamp.com/album/rise?from=discover-new', 'https://abyssal-void.bandcamp.com/album/ritualist?from=discover-new', 'https://plataformarecs.bandcamp.com/album/v-a-david-lynch-experience?from=discover-new', 'https://hurricaneturtles.bandcamp.com/album/industrial-synth?from=discover-new', 'https://blackwashband.bandcamp.com/album/2?from=discover-new', 'https://worldwide-bitchin-records.bandcamp.com/album/wack?from=discover-new']
Each time it correctly pulls the first 8 albums on page 1, then for pages 2-4 it repeats the 8 albums on page 2, for pages 5-7 it repeats the 8 albums on page 5, and so on. Even though the page is updating (and the url changes) in the selenium browser, for some reason selenium is not recognizing any changes to the html so it repeats the same titles. Any idea where I've gone wrong?
Your definition of titles, i.e.
titles = browser.find_elements(By.CLASS_NAME, "item-title")
is a bad idea, because item-title is the class of many elements on the page. Another bad idea is to pick titles[-8:]. It may sound good because you think "each time I click a page, the new elements are added at the end", but this is not always the case. Your case is one of those where elements are not added sequentially.
So let's start by considering a class exclusive to the cards, for example discover-item. Open DevTools, press CTRL+F and enter .discover-item. When the URL is first loaded, it finds 8 results. Click next page and it finds 16 results; click again and it finds 24 results. To better see what's going on, I suggest you run the following each time you click the "next" button.
el = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
for i, e in enumerate(el):
    print(i, e.get_attribute('innerText').replace('\n', ' - '))
In particular, when arriving at page 3, you will see that the first item shown on page 1 (which in my case is 0 and friends - Jacob Stanley - alternative) is now printed at a different position (in my case 8 and friends - Jacob Stanley - alternative). What happened is that the items of page 3 were added at the beginning of the list, and so you can see why titles[-8:] was a bad choice.
So a better choice is to consider all cards each time you go to the next page, instead of the last 8 only (notice that the HTML of this site can contain no more than 24 cards), and then add all current cards to a set (since a set cannot contain duplicates, only new elements will be added).
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# driver is assumed to be a webdriver already on the Bandcamp discover page from the question

# scroll to cards
cards = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".discover-item")))
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[0])
time.sleep(1)

items = set()
while 1:
    links = driver.find_elements(By.CSS_SELECTOR, '.discover-item .item-title')
    # extract info and store it
    for idx, card in enumerate(cards):
        tit_art_gen = card.get_attribute('innerText').replace('\n', ' - ')
        href = links[idx].get_attribute('href')
        # print(idx, tit_art_gen)
        items.add(tit_art_gen + ' - ' + href)
    # click 'next' button if it is not disabled
    next_button = driver.find_element(By.XPATH, "//a[.='next']")
    if 'disabled' in next_button.get_attribute('class'):
        print('last page reached')
        break
    else:
        next_button.click()
        # wait until new elements are loaded
        cards_new = cards.copy()
        while cards_new == cards:
            cards_new = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
            time.sleep(.5)
        cards = cards_new.copy()

Creating POST request to scrape website with python where no network form data changes

I am scraping a website that renders dynamically with JavaScript. The URLs don't change when hitting the > button, so I have been looking in the inspector's Network section, specifically at the "General" section for the "Request URL" and "Request Method", as well as at the "Form Data" section, looking for any sort of unique ID that could distinguish each successive page. However, when recording a log of clicking the > button from page to page, the "Form Data" seems to be the same each time.
Currently my code doesn't incorporate this method, because I can't see it helping until I can find a unique identifier in the "Form Data" section. However, I can show my code if helpful. In essence, it just pulls the first page of data over and over again in my while loop, even though I'm using a Selenium driver and calling driver.find_elements_by_xpath("xpath of > button").click() before trying to get the data with BeautifulSoup.
(Updated code, see comments)
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
from pandas import *

masters_list = []

def extract_info(html_source):
    # html_source will be the inner HTML of the table
    global lst
    soup = BeautifulSoup(html_source, 'html.parser')
    lst = soup.find('tbody').find_all('tr')[0]
    masters_list.append(lst)
    # only the id is needed here because the id is set to the crypto name; you have to do more scraping to get more info

chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute(
        'innerHTML')  # this is the crypto data table
    extract_info(crypto_table)
    paginate = driver.find_element(
        By.ID, "DataTables_Table_0_paginate")  # all table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # we click on the next arrow at the end, not on the 2, 3, ... anchor links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')
    # check whether there is a next page available
    if "disabled" in next_page_link.get_attribute('class'):
        loop = False
    pages_list[-1].click()  # if there is a next page available then click on it

df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()
I am using my own code to show how I am getting the table; I've added explanations as comments on the important lines.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def extract_info(html_source):
    soup = BeautifulSoup(html_source, 'html.parser')  # html_source will be the inner HTML of the table
    lst = soup.find('tbody').find_all('tr')
    for i in lst:
        print(i.get('id'))  # printing just the id because the id is set to the crypto name; you have to do more scraping to get more info

driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute('innerHTML')  # this is the crypto data table
    print(extract_info(crypto_table))
    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate")  # all table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')  # we click on the next arrow at the end, not on the 2, 3, ... anchor links
    if "disabled" in next_page_link.get_attribute('class'):  # check whether a next page is available
        loop = False
    pages_list[-1].click()  # if there is a next page available then click on it
So the main answer to your question: when you click the button, Selenium updates the page, and you can then use driver.page_source to get the updated HTML. Sometimes (not with this URL) a page fires an AJAX request that takes a while, so you have to wait until Selenium has loaded the full page.
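For illustration, one way to make that wait explicit (a sketch only, not from the original answer; it assumes the table rows are replaced in place when the page changes) is to keep a handle on the old first row and wait for it to go stale after clicking the pagination arrow:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://cryptoli.st/lists/fixed-supply')

while True:
    # keep a reference to the current first row before paginating
    first_row = driver.find_element(By.CSS_SELECTOR, '#DataTables_Table_0 tbody tr')
    # driver.page_source can be handed to BeautifulSoup here, as described above

    paginate = driver.find_element(By.ID, 'DataTables_Table_0_paginate')
    next_page_link = paginate.find_elements(By.TAG_NAME, 'li')[-1].find_element(By.TAG_NAME, 'a')
    if 'disabled' in next_page_link.get_attribute('class'):
        break
    next_page_link.click()
    # wait until the old first row has been replaced, i.e. the table was redrawn
    WebDriverWait(driver, 10).until(EC.staleness_of(first_row))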

Is there a way to loop through pages using selenium webdriver in python?

I am gathering listing information from Craigslist and I am able to get all of the listings on the first page, save them as a .csv file, and export them to my MongoDB collection. I would like to know how to go to the next page of the website after all of the listings are collected on the first page, then get all of that page's listings, and so on until the script gets all of the listings on the last page and there are no more pages left.
I noticed that by default Craigslist shows 120 listings on the first page, then page two shows listings 121-240, and so on. The format on the website is "1-120 / #num of total listings". Also, the URL has an element "s=" that updates every time you click "next page" and go to the new page. For example, on the first page "s=" is not in the URL, so I put "s=0" in its place and the page loaded up normally. Go to the next page and "s=120", next page "s=240", and so on.
I was thinking about getting the total number of listings after the search (n) and setting MAX_PAGES = n/120 (rounded up). Then, in the main section, put a for loop "for i in range(MAX_PAGES)" around the function that gets the URL, and make sure all of the listings are collected and written to the .csv file. I just do not know how to get the number of total listings from the Craigslist page.
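One possible way to read that total (a sketch only; it assumes the "1-120 / total" counter is exposed in an element such as span.totalcount, which is worth confirming in DevTools):

import math
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://philadelphia.craigslist.org/search/sss?query=graphics+card&s=0")

# assumption: the total listing count is shown in a span with class "totalcount"
total_count = int(driver.find_element(By.CSS_SELECTOR, "span.totalcount").text)
listings_per_page = 120
max_pages = math.ceil(total_count / listings_per_page)
print(total_count, "listings across", max_pages, "pages")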
Update
I realized my proposal will just get the contents of the first page again and again. I need a Selenium way to physically go to the next page, something like "while next_page != null".
Craigslist next page button contents while inspecting in chrome
next >
Constructor:
def __init__(self, location, postal_code, max_price, query, radius, s):
    self.location = location
    self.postal_code = postal_code
    self.max_price = max_price
    self.query = query
    self.radius = radius
    self.s = s
    # MAX_PAGE_NUM =
    self.url = f"https://{location}.craigslist.org/search/sss?s={s}&max_price={max_price}&postal={postal_code}&query={query}&20card&search_distance={radius}"
    self.driver = webdriver.Chrome('/usr/bin/chromedriver')
    self.delay = 5
Gets the URL:
def load_craigslist_url(self):
    self.driver.get(self.url)
    try:
        wait = WebDriverWait(self.driver, self.delay)
        wait.until(EC.presence_of_element_located((By.ID, "searchform")))
        print("Page is ready")
    except TimeoutException:
        print("Loading took too long")
Extracts the listings from the URL:
def extract_post_urls(self):
    url_list = []
    html_page = urllib.request.urlopen(self.url)
    soup = BeautifulSoup(html_page)
    for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
        print(link["href"])
        url_list.append(link["href"])
    return url_list
Main:
if __name__ == "__main__":
    filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of the written csv file
    location = "philadelphia"  # Location Craigslist searches
    postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    max_price = "700"  # Max price Craigslist limits the items to
    query = "graphics+card"  # Type of item you are looking for
    radius = "400"  # Radius from the postal code Craigslist limits the search to
    s = 0
    scraper = CraigslistScraper(location, postal_code, max_price, query, radius, s)
    scraper.load_craigslist_url()
    titles, prices, dates = scraper.extract_post_information()
What I expect is to get every listing from one page, then go to the next page and get its listings, and so on until I get all of the listings on the last page and there are no more pages left.
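For the "physically go to the next page" part, here is a minimal sketch (not from the original thread) that assumes the "next >" link shown in the inspector above can be located by its partial link text and is no longer found after the last page:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://philadelphia.craigslist.org/search/sss?query=graphics+card&s=0")

while True:
    # scrape this page's listings from driver.page_source here,
    # rather than re-opening self.url with urllib (which always returns the first page)
    try:
        # assumption: the pagination link's visible text contains "next", as shown in the inspector output above
        driver.find_element_by_partial_link_text("next").click()
    except NoSuchElementException:
        break  # no "next >" link left, so this was the last page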

WebScraping Next pages with Selenium

When I navigate to the below link and locate the pagination at the bottom of the page:
https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted
I am only able to scrape the first 4 or so pages, then my script stops.
I have tried with xpath, css_selector, and with the WebDriverWait options
pages_remaining = True
page = 2  # starts at page 2 since page one is scraped already with the first loop
while pages_remaining:
    # scrape code
    try:
        wait = WebDriverWait(browser, 20)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page)))).click()
        print(browser.current_url)
        page += 1
    except TimeoutException:
        pages_remaining = False
Current Results from console:
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories- shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=2&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=3&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=4&sort=Boosted
This solution is a BeautifulSoup one, because I am not too familiar with Selenium.
Try to create a new variable with your number of pages. As you can see, when you enter the next page the URL changes, thus just manipulate the given URL. See my code example below.
from requests import get  # assuming `get` here is requests.get

# Define variable pages first
pages = [str(i) for i in range(1, 53)]  # 53 because you have 52 pages
for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    # Rest of your code
This snippet should do the job for the rest of the pages. Hope that helps, although this might not be exactly what you have been looking for.
If you have any questions, just post below. ;)
Cheers.
You could loop through page numbers until no more results are shown, just by changing the url:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"
driver = webdriver.Chrome()
page = 1
soup = BeautifulSoup("", "html.parser")
# Will loop until there are no more results
while "Looks like we don’t have exactly what you’re looking for." not in soup.text:
    print(base_url.format(page))
    # Go to page
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    ### your extracting code
    page += 1

Instagram crawling with scrolling down...with python selenium

import time
from bs4 import BeautifulSoup
# driver is assumed to be a selenium webdriver already on the Instagram tag page

total_link = []
temp = ['a']
total_num = 0
while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs = driver.find_element_by_tag_name('div').text
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    my_titles = soup.select(
        'div._6d3hm > div._mck9w'
    )
    for title in my_titles:
        try:
            if title in temp:
                # print('duplicate')
                pass
            else:
                # print('not a duplicate')
                link = str(title.a.get("href"))  # grab the address!
                total_link.append(link)
                # print(link)
        except:
            pass
    print("Number collected so far: " + str(len(total_link)))
    temp = my_titles
    time.sleep(2)
    if 'End of Results' in Divs:
        print('end')
        break
    else:
        continue
Hello, I was scraping Instagram data for posts tagged in Korean.
My code consists of the following steps:
scroll down the page
using bs4 and requests, get the HTML
locate the time log, picture src, text, tags, and ID
select them all and crawl them
once done with the HTML currently on the page, scroll down
do the same thing until the end
By doing this, and using code from people on this site, it seemed to work...
but after a few scrolls down, at certain points the scrolling stops with an error message showing
'읽어드리지 못합니다', or in English 'Unable to read'.
Can I know the reason why the error pops up and how to solve the problem?
I am using Python and Selenium.
Thank you for your answers.
Instagram tries to protect itself against malicious activity such as scraping and other automated access. This often happens when you access Instagram pages abnormally fast, so you have to use time.sleep() more frequently or with longer delays.
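As a small illustration of that advice (a sketch only, with hypothetical pause values to tune yourself), the scrolling loop can pause for a random, longer interval between scrolls instead of a fixed time.sleep(2):

import random
import time

def polite_scroll(driver, min_pause=3.0, max_pause=8.0):
    # scroll once, then sleep for a randomized interval so requests are not fired at a constant, rapid rate
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(random.uniform(min_pause, max_pause))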
