How to scrape information from websites with subframes - Python

I am trying to build a simple web scraper to extract flight information from StudentUniverse.
I used Selenium to navigate the web pages and get the flight information for my desired location and date. There is no problem getting to the right page with all the information.
However, I have difficulty extracting the information from the page. I used XPath to locate the elements that contain the desired info, but extraction fails unless I manually scroll up and down the page. It seems this has something to do with subframes embedded in the website. I tried iterating over all the iframes with driver.switch_to.frame() to see if I could get the information that way, but the problem remains.
It would be great if anyone could offer some help on how to scrape information from websites like this. The problem may not be caused by the existence of subframes; any input is appreciated.
The code I used to extract the flight info is shown below; an article tag contains all the info (carrier name, departure time, arrival time and so on). What I did first was to locate this element.
import datetime
import re

import lxml.html


def parseprice(driver):
    driver.maximize_window()
    # Parse the rendered page source with lxml
    parser = lxml.html.fromstring(driver.page_source, driver.current_url)
    flights = parser.xpath('//article[@class="itin activeCheck"]')
    driver.quit()
    carriername = flights[0].xpath('//p[@id="airlineName0"]/text()')
    duration = flights[0].xpath('//strong[@id="duration0"]/text()')
    depttime = flights[0].xpath('//span[@id="departureTime0"]/text()')
    arrtime = flights[0].xpath('//span[@id="arrivalTime0"]/text()')
    price = flights[0].xpath('//p[@ng-click="pricePoint()"]//text()')
    stops = flights[0].xpath('//p[@id="stops0"]//text()')
    # Keep only the leading number of each "stops" label (e.g. "1 stop" -> 1)
    stoplis = []
    for st in stops:
        res1 = re.search(r'^(\d+)\D*', st)
        if res1 is not None:
            stoplis.append(int(res1.group(1)))
    now = datetime.datetime.now().timetuple()
    for i in range(20):
        yield {
            'current time': str(now[1]) + '/' + str(now[2]) + '/' + str(now[0]),
            'carrier': carriername[i],
            'duration': duration[i],
            'price': price[i],
            'numstops': stoplis[i],
            'departure_time': depttime[i],
            'arrival_time': arrtime[i],
        }
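Since the flight cards appear to render only as they scroll into view, one possible workaround is to scroll the page from Selenium and wait for the cards before calling parseprice(). This is only a rough sketch, not verified against StudentUniverse: the article.itin selector is taken from the XPath above, and the timings and scroll counts are assumptions.

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def load_all_results(driver, pause=1.0, max_scrolls=20):
    # Wait until at least one result card is present (class taken from the question's XPath)
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.itin"))
    )
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        # Scroll to the bottom so lazily rendered cards get a chance to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content was added
        last_height = new_height

Calling load_all_results(driver) before parseprice(driver) should leave page_source fully populated, assuming the missing data really is a lazy-loading issue rather than an iframe one.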

Related

Selenium scraping with HTML changing after refresh

I am using Selenium along with Python to scrape some pages. I have many web pages that represent the same type of object (football player information), but each of them has a slightly different HTML layout. In particular, my main issue is that the div class identifiers change when refreshing or changing the web page, in an unpredictable way.
In this specific case I would like to get the data in the div whose class identifier is "jss176", but when I get to another player this will change to "jss450", for example, with no meaningful pattern to be found.
Is there a way I can get around this? I was thinking of navigating through the children starting from the div with id = "root", but I can't seem to find a good piece of code to achieve this.
Thank you very much!
If only the IDs change, but not the page structure, you could scrape the info by XPath.
https://www.tutorialspoint.com/what-is-xpath-in-selenium-with-python
You can directly access the div you want by right-clicking it in Chrome DevTools and selecting the "Copy XPath" option in the browser.
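For instance, a sketch of that idea (the URL and the structural path below are hypothetical; the real path would be copied from DevTools on the actual page): walk down from the stable id="root" container by position instead of relying on the volatile jss* class names.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/player/123")  # placeholder URL

# Hypothetical structural XPath copied via "Copy XPath" in Chrome DevTools;
# it depends on element positions under the stable id="root" div,
# not on the auto-generated jss* class names.
stats_div = driver.find_element_by_xpath('//*[@id="root"]/div/div[2]/div[1]')
print(stats_div.text)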

Script cannot fetch data from a web page

I am trying to write a program in Python that takes the name of a stock and its price and prints them. However, when I run it, nothing is printed; the data seems to fail to be fetched from the website. I double-checked that the XPath from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests

page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
# Grab the price text by its (dynamically generated) class name
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page you see when you visit the website manually. The website was smart enough to see that your request to this URL came from a script rather than a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special captcha page. I get the same thing when I try scraping with urllib2.
As others have suggested, Selenium (actual browser automation) might be able to load the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I look at the page manually. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works, though of course you'll need a different XPath to find the cell you need.
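A rough sketch of that fallback, using requests and lxml as in the question; the User-Agent header and the XPath are assumptions, and the real selector would have to be read from the tradingeconomics page with the browser's inspector.

from lxml import html
import requests

# Some sites also block the default python-requests User-Agent,
# so a browser-like header is sent here (an assumption, not always needed).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
page = requests.get('https://tradingeconomics.com/ukx:ind', headers=headers)
tree = html.fromstring(page.content)

# Hypothetical XPath: locate the actual quote cell with the inspector first.
quote = tree.xpath('//table//tr[1]/td[2]/text()')
print('UKX quote:', quote)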

getting links from table in web page

I am trying to go to a website, use their search tool to query a database, and grab all of the links from the table of search results displayed below the search tool. The problem is that the page source only shows the HTML for the search tool. Can anyone help me figure out how to get the links from the table? The address of the search tool is:
https://wagyu.digitalbeef.com/
I was hoping to use BeautifulSoup and Python 3.6 on a Windows 10 machine to read the pages behind those links and grab the names of the cows and their parents, to create a more advanced pedigree chart than what is available on the site. Thanks for the help.
Just to clarify: I can manually grab a single link, use BeautifulSoup to grab the HTML for that page, and pull out the pedigree info. I just don't know how to grab the links from the search results page.
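One way to approach this, sketched under the assumption that the results table is filled in by JavaScript after the search runs (so plain requests never sees it): drive the search with Selenium and then hand the rendered HTML to BeautifulSoup. The search steps and the table selector below are placeholders and would need to be checked against the actual page.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://wagyu.digitalbeef.com/')

# ...fill in the search form and submit it here with find_element_... calls...

soup = BeautifulSoup(browser.page_source, 'html.parser')
# Assumed: the result links live inside a table below the search tool
links = [a['href'] for a in soup.select('table a[href]')]
print(len(links), 'links found')
browser.quit()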

How to scrape javascript dynamic website

I've been trying to scrape the website below but am having some problems. I cannot find out how they build the list of empresas (in English: companies) that they show. When I select a category and submit the form, the URL doesn't change; I've tried to look at the requests but had no success (not a web developer here).
http://www.vitrinedoexportador.gov.br
I first tried to go through all the links on the page. The first approach I tried was brute-forcing all the URLs. They have this syntax:
"http://www.vitrinedoexportador.gov.br/bens/ve/br/detalhes/index/cdEmpresa/" + 6-digit code + "#inicio".
But I think trying out 999999 possibilities would be the wrong way to approach the problem.
The next approach I'm trying is navigating through the pages using Selenium WebDriver, with the code below:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')

# navigate to the page and walk through the sector dropdown options
select = Select(browser.find_element_by_id('cdSetor'))
print(select.options)
for opt in select.options:
    print(opt.text)
    opt.click()
    if opt.text != 'Escolha':
        opt.submit()
        time.sleep(5)  # needed so the page has time to load
        listaEmpresas = browser.find_elements_by_tag_name("h6")
        for link in listaEmpresas:
            print(link)
        print(listaEmpresas)
        listaEmpresas[0].click()
But this seems incredibly slow, and so far I could only get one company. Is there a smarter way to do this?
Another approach I have tried is using Scrapy: I can already parse an entire company page with all the fields I want. So if you could help me find a way to get all the IDs, I can feed them into my already-built Scrapy project.
Thank you.
I've done something very similar to this already, and there is no super easy way. There is usually no list with all the companies, because that belongs to the backend. You have to use the frontend to navigate to a page where you can build a loop to scrape what you want.
For example: I opened the main URL, then changed the filter 'Valor da empresa', which has only five options. I chose the first, which gave me 3436 companies. Now it depends on whether you want to scrape the details of each company or only the main info, like telephone, CEP and address, which are already on this page. If you want the details, you have to build a loop that clicks each link, scrapes the detail page, goes back to the search and clicks the next link. If you need only the main information, you can already get it from the search page by grabbing class=resultitem with BeautifulSoup and looping through the data of the first page (see the sketch below).
In any case, the next step (after all links of the first page are scraped) is pressing the second page and doing it again.
After you scrape all 3436 companies of the first filter, do the same for the other 4 filters, and you will have all the companies.
You can use other filters instead, but they have many options, and to go through all companies you would have to go through all of them, which is more work.
Hope that helps!
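A minimal sketch of the "main info only" path described above. The resultitem class comes from this answer; how the filter is applied and how the links are read out are assumptions and would need checking against the live page.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')

# ...apply the 'Valor da empresa' filter here, as described above...

soup = BeautifulSoup(browser.page_source, 'html.parser')
company_links = []
for item in soup.find_all(class_='resultitem'):
    # assumed: each result item contains an <a href> pointing at the company page
    for a in item.find_all('a', href=True):
        company_links.append(a['href'])

print(len(company_links), 'company links collected from the first results page')
# Pagination (pressing the second page, and so on) repeats the same parse per page.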

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a div indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
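For example, instead of a fixed sleep, an explicit wait can be used before the click (a sketch reusing the same hypothetical div#your-link-to-follow selector from above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until the dynamically generated element exists, then click it via JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")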
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them once it has navigated to the site, since doing so engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[#title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the website and save them to a file you can look through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, a situation where multiple links share the same title/text, etc., then you may want to use the find_elements (plural) methods to get lists of all the elements satisfying your criteria, specify the XPath more explicitly, and so on.
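For instance (a sketch reusing the same hypothetical title from earlier), the plural form returns every match, so a specific link can be picked out by index or by inspecting its text:

candidates = driver.find_elements_by_xpath("//*[@title='Link Title']")
for element in candidates:
    print(element.text)
candidates[0].click()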
