Web scraping issues in Python using Selenium

I'm trying to scrape data from this website called Anhembi.
But when I try all of Selenium's options for finding elements, I get nothing. Does anyone know why this happens?
I've already tried:
driver.find_element_by_xpath('//*[@class="agenda_result_laco_box"]')
I also made a for-loop over those results to click each one and get the info I need, which consists of the day, website, and name of each event. How can I do that?

Clearly, there is an iframe involved. You need to switch the focus of your WebDriver in order to interact with elements that are inside an iframe/frameset/frame.
You can try this code:
driver.get("http://www.anhembi.com.br/agenda/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[src='http://intranet.spturis.com.br/intranet/modulos/booking/anhembisite_busca.php']"))
all_data = driver.find_elements_by_css_selector("div.agenda_result_laco_box")
print(len(all_data))
for data in all_data:
print(data.text)
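If you also want the links behind each box without clicking through every one, a minimal sketch along these lines may work (this assumes each box contains ordinary <a> tags, which you should confirm by inspecting the page):

for data in all_data:
    # the visible text of each box holds the day and event name
    print(data.text)
    # any links inside the box can be read without clicking it
    for link in data.find_elements_by_tag_name("a"):
        print(link.get_attribute("href"))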

Related

Get the current url when it's not valid with Selenium Python

I'm a beginner learning web scraping with Selenium. Recently I faced the problem that sometimes there are button elements that do not have an "href" attribute linking to the website they lead to. In order to obtain the link, or useful information from that link, I need to click the button and get the current URL in the new window using the "current_url" method. However, it doesn't always work when the new URL is not valid. I'm asking for help on a solution.
To give you an example, say one wants to obtain the Spotify link to the song listed on https://www.what-song.com/Tvshow/100242/BoJack-Horseman/e/116712. After clicking on the Spotify button, instead of being directed to spotify web player, I see a new window popping up with this url "spotify:track:6ta5yavnnEfCE4faU0jebM". It's not valid probably due to some errors made by the website, but the identifier "6ta5yavnnEfCE4faU0jebM" is still useful so I want to obtain it.
However, when I try using the "current_url" method, it gives me the original link "https://www.what-song.com/Tvshow/100242/BoJack-Horseman/e/116712" instead of the invalid URL. My code is attached below. Note that I already have a time.sleep.
Specs: macOS 12.6, Chrome and ChromeDriver version 106.something, Python 3.
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait

s = Service('/web_scraping/chromedriver')
driver = webdriver.Chrome(service=s)
wait = WebDriverWait(driver, 3)
driver.get('https://www.what-song.com/Tvshow/100242/BoJack-Horseman/e/116712')
spotify_button_element = driver.find_element("xpath", '/html/body/div/div[2]/main/div[2]/div/div[1]/div[5]/div[1]/div[2]/div/div/div[2]/div/div[1]/button[3]')
driver.execute_script("arguments[0].click();", spotify_button_element)
time.sleep(3)
print(driver.current_url)
Any idea why this happened and how to fix it? Huge thanks in advance!
Instead of finding the button to click and opening a new tab, you could do the following:
import json

# get the data stored in the script tag with id '__NEXT_DATA__'
spotify_data_request = driver.find_element("id", '__NEXT_DATA__')
# convert the JSON string into a dict-like object
temp = json.loads(spotify_data_request.get_attribute('innerHTML'))
# read the id directly instead of clicking the Spotify button and parsing the URL
print(temp['props']['pageProps']['episode']['songs'][0]['song']['spotifyId'])
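If you want a browser-friendly link rather than the bare id, a spotify:track:<id> URI maps onto the web player URL like this (assuming the item is a track rather than an album or playlist):

track_id = temp['props']['pageProps']['episode']['songs'][0]['song']['spotifyId']
print('https://open.spotify.com/track/' + track_id)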

SELENIUM Can't access payment form/fields

https://squidindustries.co/checkout
checkout_cc_number = driver.find_element_by_id("number")
checkout_cc_number.send_keys(card_number)
When I try to input information into the card number field, I get an error saying the element could not be located. I tried using time.sleep and driver.implicitly_wait when I first got to the page, but both failed. Any ideas?
The element is in a frame (i.e. a webpage within a webpage). Selenium will look for elements in the page it has loaded and not within frames. That's the problem.
To solve this we just need a bit more code, which will tell Selenium to look in the frame.
The example you've given is several pages deep into a shopping cart, so I'm going to use a much more accessible example instead: the Mozilla guide to iframes.
Here is some code to open that page and then click the CSS button within the frame:
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get(r"https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
time.sleep(5)
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
css_button = browser.find_element_by_id("css")
css_button.click()
browser.switch_to.default_content()
There are two lines that are important. The first one is:
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
That finds the frame and then switches to it. Once we have done that, any code that looks for elements will be looking in the frame and not in the page that we navigated to. That is what you need to do to access the number element. In your example the class of the frame is card-fields-iframe, so use that instead of interactive.
The second important line is:
browser.switch_to.default_content()
That reverts the previous line. So now Selenium will be looking for elements within the page that we navigated to. You'll want to do that after interacting with the frame, so that you can continue through the shopping cart.
Have you tried getting the input element through the DOM? What happens if you do document.getElementById('number')?
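For reference, running that DOM lookup through Selenium looks like the sketch below. It returns None as long as the driver is still focused on the top-level page, which is another way to confirm that the field lives inside an iframe:

# executes in the current browsing context; document here is the outer page
element = driver.execute_script("return document.getElementById('number');")
print(element)  # None until you switch into the card-fields iframe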
I ran into the same issue, and with checkouts, as you mentioned, all the iframe class names are the same. What I did was get all the iframes with the same class name as a list:
from selenium.webdriver.common.by import By

iframes = driver.find_elements(By.CLASS_NAME, "card-fields-iframe")
I then switched through the iframes referencing each one by its place in the list. Since there are only four fields in the checkout, the list is only 4 elements long, starting with [0].
driver.switch_to.frame(iframes[0])
number = driver.find_element(By.ID, "number")
if number.is_displayed():
    number.send_keys("4000300040005000")
driver.switch_to.default_content()
It's important to note that switching back to the default content with driver.switch_to.default_content() before switching to the next frame was the only way I could make this work. The is_displayed() method just checks whether the element is on the page or not (note it has to be called, with parentheses).
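Putting that together, a sketch of the whole loop might look like this (the field ids other than "number" are assumptions, so check each iframe on the page first):

from selenium.webdriver.common.by import By

# one (id, value) pair per card-fields iframe, in page order
fields = [("number", "4000300040005000"), ("name", "J DOE"),
          ("expiry", "12/25"), ("verification_value", "123")]
for frame, (field_id, value) in zip(iframes, fields):
    driver.switch_to.frame(frame)
    field = driver.find_element(By.ID, field_id)
    if field.is_displayed():
        field.send_keys(value)
    driver.switch_to.default_content()  # switch back before the next frame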

Click an element to dynamically change content with python web scraping

So I'm scraping all the data off my work's website to get all my shifts and data about those shifts with Python and Beautiful Soup. Scraping the shifts is fine, since they are just static elements. But to get info like who is working on a shift, you have to click an element, which displays a hidden element and also changes the info depending on which day you clicked. This is driven by a JavaScript function, showFloorPlan('N','N','N','20200624').
I need to be able to scrape a week's worth of shifts showing who is working on which shift. I've tried adding javascript:showFloorPlan('N','N','N','20200624') to the URL I am scraping from, but with no luck.
Any help is greatly appreciated.
You can try the Selenium library for browser-like actions such as clicking, scrolling, navigating, and much more. BeautifulSoup's functionality is limited when it comes to dynamic scraping.
Selenium official documentation
Selenium for Python
If you want to scrape without opening a browser, you should also look into running a headless browser, as in the sketch below.
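As a rough sketch of how that could look for your case (the URL is a placeholder for your work's website; the showFloorPlan call and date format come straight from your question):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # scrape without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://your-works-website.example")  # placeholder URL

# call the page's own JavaScript to load the floor plan for a given day
driver.execute_script("showFloorPlan('N','N','N','20200624')")
html = driver.page_source  # now holds the updated shift info for BeautifulSoup
driver.quit()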

How to scrape information from websites with subframes

I am trying to build a simple web scraper to extract flight information from Student Universe.
I used Selenium to navigate the webpages to get the flight information for my desired location and date. There was no problem getting to the right page with all the information.
However, I have difficulties extracting the information from the webpage. I used XPath to locate the elements that contain the desired info, but extracting the information is unsuccessful unless I manually scroll up and down the webpage. It seems that this has something to do with subframes embedded in the website. I tried iterating over all the iframes with driver.switch_to.frame() to see if I could get the information, but the problem remains.
It would be great if anyone could offer some help on how to scrape information from websites like this; the problem may not be caused by the existence of subframes. Any input is appreciated.
The code I used to extract the flight info is shown below. An article tag contains all the info (carrier name, departure time, arrival time, and so on), so the first thing I did was locate that element.
import re
import datetime

import lxml.html

def parseprice(driver):
    driver.maximize_window()
    parser = lxml.html.fromstring(driver.page_source, driver.current_url)
    flights = parser.xpath('//article[@class="itin activeCheck"]')
    driver.quit()
    carriername = flights[0].xpath('//p[@id="airlineName0"]/text()')
    duration = flights[0].xpath('//strong[@id="duration0"]/text()')
    depttime = flights[0].xpath('//span[@id="departureTime0"]/text()')
    arrtime = flights[0].xpath('//span[@id="arrivalTime0"]/text()')
    price = flights[0].xpath('//p[@ng-click="pricePoint()"]//text()')
    stops = flights[0].xpath('//p[@id="stops0"]//text()')
    stoplis = []
    for st in stops:
        # keep only the leading digit, e.g. "1 stop" -> 1
        res1 = re.search(r'^(\d)+\D*', st)
        if res1 is not None:
            stoplis.append(int(res1.group(1)))
    now = datetime.datetime.now().timetuple()
    for i in range(20):
        yield {'current time': str(now[1]) + '/' + str(now[2]) + '/' + str(now[0]),
               'carrier': carriername[i], 'duration': duration[i], 'price': price[i],
               'numstops': stoplis[i], 'departure_time': depttime[i],
               'arrival_time': arrtime[i]}
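Since the information only shows up after you scroll by hand, the page is probably rendering results lazily as they come into view. Forcing a scroll before grabbing page_source may be enough; this is a sketch, not a confirmed fix:

import time

# scroll to the bottom so lazily rendered results are added to the DOM
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # give the page a moment to render the new content
html = driver.page_source  # parse this instead of the pre-scroll source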

How to scrape javascript dynamic website

I've been trying to scrape the website below but am having some problems. I cannot find out how they build the list of empresas (in English: companies) that they show. When I select a category and submit the form, the URL doesn't change; I've tried looking at the requests, but with no success (not a web developer here).
http://www.vitrinedoexportador.gov.br
I first tried to go through all the links on the webpage. The first approach I tried was brute-forcing all the URLs. They have this syntax:
"http://www.vitrinedoexportador.gov.br/bens/ve/br/detalhes/index/cdEmpresa/" + 6 digit code + "#inicio".
But I think trying out 999999 possibilities would be the wrong way to approach the problem.
The next approach I'm trying is navigating through the pages using Selenium WebDriver, with the code below:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')

# navigate to the page
select = Select(browser.find_element_by_id('cdSetor'))
print(select.options)
for opt in select.options:
    print(opt.text)
    opt.click()
    if opt.text != 'Escolha':
        opt.submit()
        time.sleep(5)  # needed so the page can load
        listaEmpresas = browser.find_elements_by_tag_name("h6")
        for link in listaEmpresas:
            print(link)
        print(listaEmpresas)
        listaEmpresas[0].click()
But it seems incredibly slow, and I could still only get one company. Is there a smarter way to do this?
Another approach I've tried is using Scrapy; I can already parse an entire company page with all the fields I want. So if you can help me find a way to get all the IDs, I can parse them in my existing Scrapy project.
Thank you.
I've done something very similar to this already, and there is no super easy way. There is usually no list with all companies, because that belongs to the backend. You have to use the frontend to navigate to a page where you can build a loop to scrape what you want.
For example: I clicked the main URL, then I changed the filter 'Valor da empresa', which has only five options. I chose the first, which gave me 3436 companies. Now it depends whether you want to scrape details of each company or only the main info, like telephone, CEP, and address, which are already on this page. If you want details, you have to build a loop that clicks each link, scrapes the company page, goes back to the search, and clicks the next link. If you need only the main information, you can already get that on the search page by grabbing class=resultitem with Beautiful Soup and looping through the data to get the first page (see the sketch below).
In any case, the next step (after all links of the first page are scraped) is pressing the second page and doing it again.
After you scrape all 3436 companies of the first filter, do it again for the other 4 filters, and you will get all the companies.
You can use other filters, but they have many options, and to go through all companies you would have to go through all of them, which is more work.
Hope that helps!
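A minimal sketch of the main-info-only route described above, assuming the search results page is already loaded in the Selenium browser from the earlier snippet (the resultitem class comes from the page; the inner tags are left as raw text to inspect):

from bs4 import BeautifulSoup

soup = BeautifulSoup(browser.page_source, "html.parser")
for item in soup.find_all(class_="resultitem"):
    # each result block holds a company's main info (name, phone, CEP, address)
    print(item.get_text(separator=" | ", strip=True))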
