I'm fairly new to Python so excuse me if the problem isn't clear or if the answer is obvious.
I want to scrape the web page http://jassa.fr/. I generated some random input (sequences) and want to see how it holds up against my own data. I tried scraping the page using Selenium, but the HTML of the page doesn't use any IDs, and I don't know how to navigate the DOM without them (is that impossible with Selenium?).
Does anyone have any ideas for me how to tackle this problem, especially regarding that I want to scrape the results which are generated server side?
Thanks in advance!
[edit]
Thanks for the quick response!
How do I access this textarea using Selenium?
<textarea style="border:1px solid #999999;" tabindex="1" name="sequence" cols="70" rows="4" onfocus="if(this.value=='Type or paste your sequence')this.value='';">Type or paste your sequence</textarea>
Edit: After the clarification that you need to access the <textarea> with the name sequence, I suggest using find_element_by_name; see the Selenium documentation for more details on selecting elements.
from selenium import webdriver

url = "http://jassa.fr/"
browser = webdriver.Firefox()
browser.get(url)
# Locate the form, then the textarea inside it by its name attribute
form = browser.find_element_by_tag_name("form")
sequence = form.find_element_by_name("sequence")
sequence.clear()  # remove the placeholder text
sequence.send_keys("ATTTAATTTA")
form.submit()
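Once the form is submitted, the browser loads the results page, and you can read it straight from the rendered DOM. A minimal sketch (the <pre> tag is an assumption about where jassa.fr puts its output; check the actual markup in your browser's inspector):
# After form.submit(), the browser has navigated to the results page.
# The <pre> tag below is an assumption about jassa.fr's results markup.
result = browser.find_element_by_tag_name("pre")
print(result.text)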
Selenium has the ability to navigate the tree and select elements not only by ID but also by class, tag name, link text and so on (see the docs), but I find myself more comfortable with the following scenario: I use Selenium to grab the page content (so the browser renders the JavaScript parts), then feed it to BeautifulSoup and navigate it with BeautifulSoup methods. It looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://example.com/"
browser = webdriver.Firefox()
browser.get(url)
page = BeautifulSoup(browser.page_source, "lxml")
# Let's find some tables and then print all their rows
for table in page("table"):
    for row in table("tr"):
        print(row)
However, I'm not sure that you really need Selenium. The site you are going to parse doesn't seem to rely on JavaScript heavily, so it may be easier to use a simpler solution like RoboBrowser or MechanicalSoup (or mechanize for Python 2).
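For example, a minimal MechanicalSoup sketch might look like this (the form and field name are taken from the question above; treat the rest as a starting point rather than tested code):
import mechanicalsoup

# No real browser needed: MechanicalSoup fetches and submits the form itself
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://jassa.fr/")
browser.select_form("form")          # select the first <form> on the page
browser["sequence"] = "ATTTAATTTA"   # the textarea is named "sequence"
response = browser.submit_selected()
print(response.text)                 # server-rendered results, no JavaScript needed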
Related
I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the HTML code, not the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
# just importing stuff
import requests
from bs4 import BeautifulSoup

# getting the html from the website as text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')

# here it only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always get only the first version, but I want the populated one from 'inspect-mode'.
You are going to need a method that lets JavaScript run, such as Selenium, as this number is set up by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that JavaScript, but I wouldn't recommend it.
I didn't need an explicit wait condition for the Selenium script, but one could be added, as in the sketch below.
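A sketch with an explicit wait added (standard WebDriverWait usage; the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# Wait up to 10 seconds for the counter element to be present before reading it
elem = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[rel="current_population"]'))
)
print(elem.text)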
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
1. Inspect the JavaScript code or the site's network calls and figure out which HTTP URL the page calls to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from its response.
2. Use a full browser engine. This Stack Overflow answer provides a solution: Web-scraping JavaScript page with Python
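A minimal sketch of the first option, assuming you have already found the endpoint in your browser's network tab (the URL below is a made-up placeholder, not the real one):
import requests

# Placeholder endpoint: substitute the real URL found in the network tab
api_url = "https://example.com/api/current-population"
response = requests.get(api_url)
print(response.json())  # or response.text, depending on what the endpoint returns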
JavaScript renders into the DOM after the page loads, so BeautifulSoup alone will not work as you want it to.
You will need something that lets JavaScript run (e.g. a browser); you can even build your own browser using Qt4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # crude wait for the JavaScript counter to populate
html = driver.page_source
I am interested in downloading financial statements from the website Morningstar. Here there is an example of a page:
http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US
At the top right there is the export-to-CSV button, and I would like to click it with Python. Inspecting it, I see this HTML:
<div class="exportButton">
<span class="icon_1_span">
<a href="javascript:SRT_stocFund.Export()" class="rf_export">
</a>
My idea was to use bs4 (BeautifulSoup) to parse the page (not sure at all whether I even need to parse it) and find the button so I can click it. Something like:
from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = pageURL  # the Morningstar URL above
page = urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
bs = soup.find(href="javascript:SRT_stocFund.Export()", attrs={"class": "rf_export"})
Obviously, this returns nothing. Do you have any suggestions on how I could tell Python to export the data in the table? I.e. how to automate downloading the CSV file instead of going to the web page and doing it by hand.
Thank you very much!!
With the Google Chrome extension "HTTP Trace", you can see that the Export button is just a link, which you can fetch with the requests library.
I think this is the easiest way (and if you modify the URL parameters, you can shape the exported file as you want).
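A hedged sketch of that idea (the endpoint and parameters below are assumptions based on the page's URL; use the trace from your own browser to find the real export link):
import requests

# Assumed export endpoint: replace it with the URL captured in your HTTP trace
export_url = "http://financials.morningstar.com/ajax/ReportProcess4CSV.html"
params = {"t": "PIRC", "region": "ita", "culture": "en-US", "reportType": "cf"}

response = requests.get(export_url, params=params)
with open("cash_flow.csv", "wb") as f:
    f.write(response.content)  # save the CSV exactly as the server sends it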
Regards!!!
I would do it with Selenium WebDriver in "headless" mode. Try Selenium, it's quite easy to understand and use. :)
Let's use the URL https://www.google.cl/#q=stackoverflow as an example. Using Chrome Developer Tools on the first link given by the search, we see this HTML code:
Now, if I run this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen("https://www.google.cl/#q=stackoverflow")
soup = BeautifulSoup(url, "html.parser")
print(soup.prettify())
I won't find the same elements. In fact, I won't find any link from the results given by the Google search. The same goes if I use the requests module. Why does this happen? Can I do something to get the same results as if I were requesting from a web browser?
Since the HTML is generated dynamically, likely by a modern single-page JavaScript framework like Angular or React (or even just plain JavaScript), you will need to actually drive a browser to the site, using Selenium or PhantomJS, before parsing the DOM.
Here is some skeleton code.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://google.com")
html = driver.execute_script("return document.documentElement.innerHTML")
soup = BeautifulSoup(html, "html.parser")
Here is the Selenium documentation, with more info on running Selenium, configuration, etc.:
http://selenium-python.readthedocs.io/
Edit:
You will likely need to add a wait before grabbing the HTML, since it may take a second or so to load certain elements of the page. See below for the explicit-wait documentation for Python Selenium:
http://selenium-python.readthedocs.io/waits.html
Another source of complication is that certain parts of the page might be hidden until AFTER user interaction. In this case you will need to code your Selenium script to interact with the page in certain ways before grabbing the HTML, as in the sketch below.
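A sketch of that pattern (the .show-more selector is a made-up example; substitute whatever control your target page actually uses):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://example.com/")
# Hypothetical control that reveals hidden content; the selector is an assumption
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".show-more"))
)
button.click()
# Only now grab the HTML, after the interaction has run
soup = BeautifulSoup(driver.page_source, "html.parser")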
I am trying to use Selenium to (1) submit a query on a website and then (2) copy out the contents of the result using BeautifulSoup. This is my script for the first part...
from selenium import webdriver
browser = webdriver.Chrome(r'C:\Users\XXX\Scripts\MyPythonScripts\chromedriver.exe')  # raw string: '\U' in a plain string is an escape error in Python 3
browser.get('http://www.ars-grin.gov/cgi-bin/npgs/html/tax_search.pl?language=en')
elem = browser.find_element_by_name('search')
elem.send_keys('Syzygium polyanthum')
elem.submit()
For the second part, I realised that I have to somehow copy the new URL of the result into a variable before I can use BeautifulSoup to grab the contents, but after googling extensively I have no idea how to do that.
Does anyone know this, or any alternative methods to achieve the same result?
From what I understand, you want to feed the page source to BeautifulSoup after submitting a form. If this is the case, use browser.page_source:
soup = BeautifulSoup(browser.page_source)
If your question is how to get the current browser URL, then:
browser.current_url
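Putting the two together, a sketch of the full flow (continuing from the script in the question; the parser choice is just a sensible default):
from bs4 import BeautifulSoup

# After elem.submit(), the browser has navigated to the results page
result_url = browser.current_url  # the new URL, if you need it
soup = BeautifulSoup(browser.page_source, "html.parser")  # the rendered results
print(result_url)
print(soup.title)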
I am attempting to write a program that, as an example, will scrape the top price off this web page:
http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults
First, I am easily able to retrieve the HTML by doing the following:
# Python 2: mechanize plus the original BeautifulSoup (bs3)
from BeautifulSoup import BeautifulSoup
import mechanize
webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
br = mechanize.Browser()
data = br.open(webpage).get_data()
soup = BeautifulSoup(data)
print soup
However, the raw HTML does not contain the price. The browser does... its thing (clarification here might help me also)... and retrieves the price from elsewhere while it constructs the DOM tree.
I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page. (If I'm incorrect about this, how do I go about getting whatever that price information is stored in?) Is there something I need to tell mechanize to do in order to see the DOM tree?
Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!
Mechanize and BeautifulSoup are unbeatable tools for web scraping in Python.
But you need to understand what each one is for:
Mechanize: mimics browser functionality on a web page.
BeautifulSoup: an HTML parser that works well even when the HTML is not well-formed.
Your problem seems to be JavaScript. The price is populated via an AJAX call using JavaScript. Mechanize, however, does not do JavaScript, so any content that results from JavaScript will remain invisible to it.
Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master
It provides a wrapper around mechanize and BeautifulSoup with JavaScript execution.
Answering my own question because in the years since asking this I have learned a lot. Today I would use Selenium Webdriver to do this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project.
https://www.seleniumhq.org/download/
http://chromedriver.chromium.org/