I am interested in downloading financial statements from the Morningstar website. Here is an example page:
http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US
At the top right there is an export-to-CSV button, and I would like to click it with Python. Inspecting it, I get this HTML tag:
<div class="exportButton">
<span class="icon_1_span">
<a href="javascript:SRT_stocFund.Export()" class="rf_export">
</a>
My idea was to use bs4/BeautifulSoup to parse the page (not sure at all whether I even need to parse it), find the button and click it. Something like:
from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = "http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US"
page = urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
bs = soup.find(href="javascript:SRT_stocFund.Export()", attrs={"class": "rf_export"})
Obviously, this returns nothing. Do you have any suggestion on how I could tell Python to export the data in the table, i.e. automate downloading the CSV file instead of going to the webpage and doing it myself?
Thank you very much!!
With the Google Chrome extension "HTTP Trace" (or the browser's Network tab) you can see that the Export button is really just a link that requests a URL, so the same file can be downloaded with the requests library.
I think this is the easiest way (and if you modify the URL parameters you can get the Excel/CSV file exactly as you want).
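A minimal sketch of that idea, assuming you have already copied the export URL out of the HTTP trace (the value below is only a placeholder, not the real address):

import requests

# Paste here the URL that the Export button requests, as seen in the
# HTTP trace / DevTools Network tab (placeholder, not the real address).
export_url = "EXPORT_URL_FROM_HTTP_TRACE"

response = requests.get(export_url)
response.raise_for_status()

# Save the CSV the server returns.
with open("cash_flow.csv", "wb") as f:
    f.write(response.content)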
Regards!!!
I would do it with Selenium WebDriver in "headless" mode. Try Selenium, it's quite easy to understand and use. :)
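For example, a rough sketch with headless Chrome (assuming chromedriver is installed; the download directory preference and the a.rf_export selector from the snippet above may need adjusting):

import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Hypothetical download folder; headless Chrome often needs this preference
# set for the exported CSV to actually be written to disk.
options.add_experimental_option("prefs", {"download.default_directory": "/tmp/morningstar"})

driver = webdriver.Chrome(options=options)
driver.get("http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US")
driver.implicitly_wait(10)
# Click the export link from the HTML snippet in the question.
driver.find_element_by_css_selector("a.rf_export").click()
time.sleep(5)  # give the download a moment to finish before closing the browser
driver.quit()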
I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the html code, not the data of the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
# just importing stuff
import urllib.request
import requests
from bs4 import BeautifulSoup
# getting html from website to text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text,'html.parser')
# here it only finds the one object that's listed below
current_population = soup.find('div',{'class':'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
(span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
(span class="rts-counter" rel="current_population">(span class="rts-nr-sign"></span>(span class="rts-nr-int rts-nr-10e9">7</span>(span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e6">703</span>(span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.
Here is a picture of the inspect-mode.
You are going to need something that lets JavaScript run, such as Selenium, because this number is set by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that javascript script but I wouldn't recommend it.
I didn't need an explicit wait condition for the Selenium script, but one could be added.
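If the counter loads slowly for you, a sketch of such a wait (reusing the same [rel="current_population"] selector) could look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# Wait up to 10 seconds for the counter element to appear before reading it.
counter = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[rel="current_population"]'))
)
print(counter.text)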
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
The JavaScript is rendered into the DOM at runtime, so BeautifulSoup alone will not work as you want it to.
You will have to use something that lets JavaScript run (e.g. a browser engine), so you could build your own browser with Qt4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # give the JavaScript counter time to run
html = driver.page_source
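From there you could hand the rendered source to BeautifulSoup and pull the number out, for example (a sketch reusing the rel="current_population" attribute shown in the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# the counter is the span carrying rel="current_population"
population = soup.find('span', {'rel': 'current_population'}).get_text()
print(population)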
I'm new to web scraping and have been using BeautifulSoup to scrape daily mortgage rates. However, a lot of the servicer sites that I am trying to scrape return 'none' or an empty list. The tags that I'm trying to scrape definitely exist and I just don't understand why this is happening. One example is below.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.popular.com/en/mortgages/')
bs0bj = BeautifulSoup(html, 'html.parser')
rate = bs0bj.find('span', {'class': 'text-md text-popular-medium-blue'}).div
print(rate)
If you check the page source (for example via view-source: in Chrome or Firefox, or by writing your html string to a file) you'll see that the element you are looking for is not there. In fact, the rates are loaded dynamically:
<td>
<span class="text-md text-popular-medium-blue">
<script type="text/javascript" src = "https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1"></script>
</span>
</td>
You can follow the script URL and you'll see that the response is something like the following:
document.write('<div>2.75%</div>')
This response is probably regular enough to be able to use regexes on it.
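For example, something along these lines (a sketch, assuming the script URL and the document.write('<div>2.75%</div>') format stay as shown above):

import re
import requests

script_url = ("https://apps.popular.com/navs/rates_wm_modx.php"
              "?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1")
js = requests.get(script_url).text  # e.g. document.write('<div>2.75%</div>')

# Pull the percentage out of the document.write(...) payload.
match = re.search(r"(\d+(?:\.\d+)?)%", js)
if match:
    print(match.group(1))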
To get the data you are after, you can use Selenium in combination with BeautifulSoup, something like below:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.popular.com/en/mortgages/')
soup = BeautifulSoup(driver.page_source,"lxml")
item = soup.select('.table-responsive')[0].select("span div")[0].text
print(item)
driver.quit()
Result:
2.75%
As rubik said, the rates are loaded dynamically using JS. Luckily, the structure of the content is relatively simple; here is how I analyzed it:
Open a new tab in Chrome (or another browser), right-click and choose Inspect to open the developer tools. Switch to the Network tab and check the Preserve log option.
Now open the website https://www.popular.com/en/mortgages/. The loaded requests can be seen in the left panel.
Check each item and look at its Preview content until you find the one you want to scrape. Here is what I found: the 2.75% matches the mortgage rate shown on the website.
Now, switch to Headers tab and check the Request URL, this is the final request sent to the server.
The next step is to analyze the Request URL https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1
I guessed that textcolor and backgroundcolor indicate css information, so I removed them and found the url is still valid.
Now we have a simpler url: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1
It is fairly clear that id_rates indicates which mortgage in the rates list you get, without further analysis needed. The question is: what does t mean?
This can be answered by analyzing the other Preview contents to find the pattern. I will skip that process here and just give the conclusion:
t=1 indicates the annual interest rate, t=2 the APR, t=6 the P&I payment, etc.
After doing this, you can scrape the content from the request URL directly:
from urllib2 import urlopen
import re
file=urlopen('https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1')
annual_interest = re.findall(r"\d+\.\d+", str(file.read()))[0]
# the annual interest is 2.75
Use pip install html5lib, though I think pip install bs4 (BeautifulSoup) should install everything automatically. If you are using PyCharm like me, after pip install bs4 on the command line, open PyCharm, go to the interpreter settings and add beautifulsoup4 and html5lib. html5lib is a parser, the same kind of thing as html.parser. For more info, here is the BeautifulSoup documentation on installing a parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
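Once a parser is installed you just name it when building the soup, for example:

from bs4 import BeautifulSoup

html = "<p>Hello <b>world"  # deliberately sloppy markup
print(BeautifulSoup(html, "html.parser").p.get_text())  # built-in parser
print(BeautifulSoup(html, "html5lib").p.get_text())     # needs: pip install html5lib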
The sample site I am using is: http://stats.jenkins.io/jenkins-stats/svg/svgs.html
There are a ton of CSVs linked on this site. Now obviously I can go through each link click and download, but I know there is a better way.
I was able to put together the following Python script using BeautifulSoup but all it does is print the soup:
from bs4 import BeautifulSoup
import urllib2
jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
page = urllib2.urlopen(jenkins)
soup = BeautifulSoup(page)
print soup
Below is a sample of what I get when I print the soup, but I am still missing how to actually download the multiple CSV files from it.
<td>
<a alt="201412-jobs.svg" class="info" data-content="<object data='201412-jobs.svg' width='200' type='image/svg+xml'/>" data-original-title="201412-jobs.svg" href="201412-jobs.svg" rel="popover">SVG</a>
<span>/</span>
<a alt="201412-jobs.csv" class="info" href="201412-jobs.csv">CSV</a>
</td>
Just use BeautifulSoup to parse this webpage, get all the URLs of the CSV files, and then download each one using urllib.request.urlretrieve().
This is a one-time task, so I don't think you need anything like Scrapy for it.
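A rough sketch of that (Python 3; note that the CSV hrefs on that page are relative, so they get joined onto the page URL first):

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(urlopen(base), "html.parser")

# every <a> whose href ends in .csv, e.g. <a href="201412-jobs.csv">CSV</a>
for link in soup.find_all("a", href=lambda h: h and h.endswith(".csv")):
    csv_url = urljoin(base, link["href"])
    filename = csv_url.rsplit("/", 1)[-1]
    urlretrieve(csv_url, filename)
    print("saved", filename)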
I totally get where you're coming from, I've wanted to do the same myself; luckily, if you are a Linux user, there is a super easy way to do what you want. On the web-scraping side, I'm familiar with bs4 but Scrapy is my life (sadly), and as far as I recall bs4 has no real built-in way to download files without using urllib/requests, but all the same!!
As to your current bs4 spider: first you should probably pick out only the links that are .csv and extract them cleanly. I imagine it would look like
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.fileformatetcetc']):
        continue
This is like doing find_all but limiting the results to... well, only the ones with .csv or the desired extension.
Then you would join those hrefs onto the base URL (if they are incomplete). Using the csv module you would read out the csv files (from the responses, right!!?) then write them out to a new file, as sketched below.
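A sketch of that join-then-read step, using the 201412-jobs.csv link from the question's snippet as the example href:

import csv
import io
from urllib.parse import urljoin
import requests

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
href = "201412-jobs.csv"            # one relative link picked out by the loop above

csv_url = urljoin(base, href)       # join it onto the base URL
text = requests.get(csv_url).text

# read the rows with the csv module, then write them out to a new file
rows = list(csv.reader(io.StringIO(text)))
with open(href, "w", newline="") as out:
    csv.writer(out).writerows(rows)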
For the lols I'm going to create a Scrapy version.
As for that easy method... why not just use wget?
Found this, which sums up the whole CSV read/write process: https://stackoverflow.com/a/21501574/3794089
I'm fairly new to Python so excuse me if the problem isn't clear or if the answer is obvious.
I want to scrape the web page http://jassa.fr/. I generated some random input (sequences) and want to see how it holds up against my own data. I tried scraping the page using Selenium, but the HTML of the webpage doesn't use any ids, and I don't know how to navigate the DOM without ids (impossible with Selenium?).
Does anyone have any ideas for me how to tackle this problem, especially regarding that I want to scrape the results which are generated server side?
Thanks in advance!
[edit]
Thanks for the quick response!
How do I access this text area using selenium:
<textarea style="border:1px solid #999999;" tabindex="1" name="sequence" cols="70" rows="4" onfocus="if(this.value=='Type or paste your sequence')this.value='';">Type or paste your sequence</textarea>
Edit: After clarification that you need to access <textarea> with the name sequence I suggest using find_element_by_name, see here for more details on selecting elements in Selenium.
from selenium import webdriver
url = "http://jassa.fr/"
browser = webdriver.Firefox()
browser.get(url)
form = browser.find_element_by_tag_name("form")
sequence = form.find_element_by_name("sequence")
sequence.clear()
sequence.send_keys("ATTTAATTTA")
form.submit()
Selenium has the ability to navigate the tree and select elements not only by ID but also by class, tag name, link text and so on (see the docs), but I find myself more comfortable with the following scenario: I use Selenium to grab the webpage content (so the browser renders the page, including all the JavaScript bits) and then feed it to BeautifulSoup and navigate it with BeautifulSoup methods. It looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://example.com/"
browser = webdriver.Firefox()
browser.get(url)
page = BeautifulSoup(browser.page_source, "lxml")
# Let's find some tables and then print all their rows
for table in page("table"):
for row in table("tr"):
print(row)
However, I'm not sure that you really need Selenium. The site you are going to parse doesn't seem to rely on JavaScript heavily, so it may be easier to use simpler solutions like RoboBrowser or MechanicalSoup (or mechanize for Python 2).
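For instance, a rough MechanicalSoup sketch, assuming the form field is the sequence textarea shown in your edit:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://jassa.fr/")
browser.select_form("form")          # the page's form, no id needed
browser["sequence"] = "ATTTAATTTA"   # fill the textarea named "sequence"
browser.submit_selected()

# the server-side results are in the page we landed on after submitting
print(browser.get_current_page().get_text())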
I run this program but it only gives me "[]" instead of the web page data. Please help.
import urllib
import re
import lxml.html
start_link= "http://aepcindia.com/ApparelMarketplaces/detail"
html_string = urllib.urlopen(start_link)
dom = lxml.html.fromstring(html_string.read())
side_bar_link = dom.xpath("//*[@id='show_cont']/div/table/tr[2]/td[2]/text()")
print side_bar_link
file = open("next_page.txt","w")
for link in side_bar_link:
file.write(link)
print link
file.close()
The HTML source you are downloading contains an empty content area: <div id="show_cont"></div>. This div is populated later by a JavaScript function, showData(). When you look at the page in a browser, that JavaScript is executed first, which is not the case when you just download the HTML source using urllib.
To get the data you want, you can try to mimic the POST request in the showData() function or, preferably, scrape the website using a scriptable headless browser.
Update: While a headless browser would be a much more generally applicable approach, in this case it might be overkill. You will actually be better off reverse engineering the showData() function. The ajax call in it is obvious enough: it delivers a plain HTML table and you can also limit the searches :)
http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:
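A rough sketch of fetching that URL directly (assuming a plain GET returns the HTML table; switch to a POST with the parameters used in showData() if it does not):

import lxml.html
import requests

url = "http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:"
html = requests.get(url).text  # the ajax call returns a plain HTML table

dom = lxml.html.fromstring(html)
# same idea as the xpath in the question, but now the table is actually present
for cell in dom.xpath("//table//tr/td/text()"):
    print(cell.strip())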