Beautiful Soup - data not in HTML file - Python

I am new to Python. I am trying to scrape data from a website, but the data I want cannot be seen in view > source in the browser; it comes from another file. Is it possible to scrape the actual data shown on the screen with BeautifulSoup and Python?
Example site: www.catleylakeman.co.uk/cds_banks.php
If not, is this possible using another route?
Thanks

The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using Chrome's developer tools, Network tab (or the equivalent in your browser).
This format is easier to parse than the final HTML would be; in general, HTML scrapers should be a last resort, used only when the website does not publish its raw data like the above.

My guess is, the URL you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (built into Chrome and Firefox, also available via Firebug). It gets pulled in with Ajax. You do not even need Beautiful Soup to parse that one, as it seems to be one long string with fields separated by *| and sometimes **|. The following should get you initial access to that data:
import urllib2

# Fetch the raw data file directly (Python 2; in Python 3 use urllib.request).
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    # Split the response on the '*|' field separator.
    data = f.read().split('*|')
finally:
    f.close()

print data
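If the **| sequence marks record boundaries rather than just another field separator (an assumption; inspect the raw response to confirm), a minimal follow-up sketch in the same Python 2 style would split into records first:

import urllib2

f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    raw = f.read()
finally:
    f.close()

# Assumed layout: '**|' separates records, '*|' separates fields within one.
records = [rec.split('*|') for rec in raw.split('**|')]
print records[0]  # the first record's fields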

Related

Beautiful Soup not picking up some data from the website

I have been trying to scrape some data using Beautiful Soup from https://www.eia.gov/coal/markets/. However, when I parse the contents, some of the data does not show up at all. Those data fields are visible in the Chrome inspector but not in the soup. They do not seem to be text elements; I think they are fed from an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
[Screenshot: Chrome inspector]
[Screenshot: Beautiful Soup parsed content]
#DMart is correct. The data you are looking for is being populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. It looks like Selenium is your best bet.
See this thread for more information.
There is not enough detail in your question, but this information is probably dynamically loaded and you are not fetching the entire page source.
Without your code it's tough to tell whether you're using Selenium (you tagged the question as such), which may mean you're using page_source; that does not guarantee you the entire completed source of the page you're looking at.
If you're using requests, it's even more unlikely that you're capturing the page's completed source code.
The data is loaded via Ajax, so it is not available in the initial document. If you go to the Network tab in Chrome dev tools, you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response, and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
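A minimal sketch of that approach (the endpoint URL comes from the answer above; the shape of the returned JSON is an assumption to verify in the dev tools):

import requests

# Endpoint observed in the Network tab, per the answer above.
url = 'https://www.eia.gov/coal/markets/coal_markets_json.php'
resp = requests.get(url, timeout=30)
resp.raise_for_status()

data = resp.json()  # parse the JSON payload
print(type(data))   # inspect the structure before picking out fields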
Thank you all!
Opening the page with Selenium via a webdriver and then parsing the page source with Beautiful Soup worked:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # any webdriver works, e.g. Firefox
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source  # the JS-rendered source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'snl_dpst'})
rows = table.find_all('tr')
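Continuing from the snippet above, the text of each cell can then be pulled out of the rows (the table layout is assumed from the id used above):

for row in rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)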

URL request does not parse all the information in the HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
import time
import requests

time.sleep(15)  # wait before the request
data = requests.get(url).text

I used time.sleep since the website does not connect directly to the exchange's main page, but I am not sure it is necessary.
The thing is, I cannot find anything under <body style> in the HTML text (the data variable in this case). How can I reach the full HTML code and then start extracting the price information from this website?
I know Python, but I am not very familiar with websites/HTML, so I would appreciate it if you could explain the website-related parts as if to a beginner. Thanks!
There could be a few reasons for this.
From what I can tell, the website runs behind a proxy server, which interferes with your request's loading time. This is why it is not connecting directly to the main page.
It might also be the case that the elements are rendered using JavaScript AFTER the page has loaded, so you only get the initial page and not the JavaScript-rendered parts. You can try increasing your sleep() time, but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers, and you can use its page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver

browser = webdriver.Firefox()      # launches a real browser
browser.get("http://example.com")  # loads the page and runs its JavaScript
html_source = browser.page_source  # the fully rendered HTML
With Selenium you can also use an XPath to obtain the data for 'extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you have the rendered HTML, you can use a parser such as bs4 to extract the required data.
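A minimal sketch of the XPath route (the XPath below is a made-up example; replace it with one matching the actual price element on chiliz.net, found via the browser inspector):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://chiliz.net')

# Hypothetical XPath: adjust to the real element holding the price.
price = browser.find_element_by_xpath("//span[contains(@class, 'price')]").text
print(price)

browser.quit()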

How to access the subtags within a tag using BeautifulSoup in Python?

I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the tag, but it does not; it results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, the datagrid div is actually empty; the stats are inserted dynamically as JSON from this URL. Maybe you can use that instead. To figure this out, I looked at the page source to see that the div had no children, then used the Chrome developer tools' Network tab to find the request that pulled in the data:
1. Open the web page.
2. Open the Chrome developer tools: Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
3. Refresh the web page with the tools open so it captures the network requests, then wait for the page to load.
4. (Optional) Type xml in the Network tab's filter bar to narrow the results to requests likely to carry data.
5. Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and found yours on the first try, since it has stats in the name.
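Once you have found the right request (step 5), a minimal sketch of consuming it looks like this (the URL is deliberately a placeholder, and the response being JSON is an assumption to verify in the preview pane):

import requests

# Placeholder: paste the request URL copied from the Network tab.
JSON_URL = 'PASTE_REQUEST_URL_HERE'

resp = requests.get(JSON_URL)
resp.raise_for_status()
stats = resp.json()

# Explore the structure before writing extraction code.
print(list(stats.keys()) if isinstance(stats, dict) else type(stats))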

How to print a div data-reactid?

I'm doing a project in my spare time where I have hit a problem with getting data from a webpage into the program.
This is my current code:
import urllib
import re

htmlfile = urllib.urlopen("http://www.superliga.dk/klub/aab?sub=squad")
htmltext = htmlfile.read()

regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
pattern = re.compile(regex)
goal = re.findall(pattern, htmltext)
print goal
And it's working okay except this part:
regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
I can't make it display all values on the webpage with this reactid, and I can't find any solution to this problem.
Any suggestions for how I can get Python to print it?
You are trying to match a tag you saw in the developer console of your browser, right?
Unfortunately, the HTML you saw there is only the "final form" of a dynamic page: what you downloaded with urlopen is just the skeleton of the webpage, which the browser then fills dynamically with other elements via JavaScript, using data fetched from some backend server.
If you print the actual value stored in htmltext, you will find nothing like what you are trying to match with the regex, because it is missing all the further processing normally performed by the JavaScript.
What you can try to do is monitor (through the dev console) the fetched resources and reverse-engineer the API calls in order to recover the desired info. Chances are the responses of those API calls are in JSON format or have a structure far easier to parse than the HTML body.
UPDATE: for example, in Chrome's dev tools I can see async calls like:
http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q=&seasonId=10392&teamId[]=8470
Maybe this returns the info you are looking for.
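A quick way to test that, in the same Python 2 style as the question (whether the endpoint actually returns JSON is an assumption to check first):

import urllib
import json

url = ('http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players'
       '?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q='
       '&seasonId=10392&teamId[]=8470')

# If the response is JSON (an assumption), parse and inspect it.
data = json.loads(urllib.urlopen(url).read())
print data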

Efficient web page scraping with Python/Requests/BeautifulSoup

I am trying to grab information from the Chicago Transit Authority bus tracker website. In particular, I would like to quickly output the arrival ETAs for the top two buses. I can do this rather easily with Splinter; however, I am running this script on a headless Raspberry Pi Model B, and Splinter plus pyvirtualdisplay adds a significant amount of overhead.
Something along the lines of
from bs4 import BeautifulSoup
import requests

url = 'http://www.ctabustracker.com/bustime/eta/eta.jsp?id=15475'
r = requests.get(url)
s = BeautifulSoup(r.text, 'html.parser')
does not do the trick. All of the data fields are empty (well, they contain &nbsp;). For example, when the page looks like this:
[Screenshot: the rendered ETA page]
The code snippet s.find(id='time1').text gives me u'\xa0', whereas the analogous search with Splinter returns "12 MINUTES".
I'm not wedded to BeautifulSoup/requests; I just want something that avoids the overhead of Splinter/pyvirtualdisplay, since the project only needs to obtain a short list of strings (e.g. for the image above, [['9','104th/Vincennes','1158','12 MINUTES'],['9','95th','1300','13 MINUTES']]) and then exit.
The bad news
So the bad news is that the page you are trying to scrape is rendered via JavaScript. While tools like Splinter, Selenium, and PhantomJS can render this for you and give you output that is easy to scrape, Python + Requests + BeautifulSoup don't give you this out of the box.
The good news
The data pulled in by the JavaScript has to come from somewhere, and it usually arrives in an easier-to-parse format, since it is designed to be read by machines.
In this case your example loads this XML.
Now, an XML response is not as nice as JSON, so I'd recommend reading this answer about integrating XML parsing with the requests library. But it will still be a lot more lightweight than Splinter.
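A minimal sketch of that route using the standard library's XML parser (the URL is a placeholder, since the real request comes out of the dev tools, and the element names in the response are unknown here, hence the generic tree walk):

import requests
import xml.etree.ElementTree as ET

# Placeholder: copy the XML request URL from the browser's Network tab.
XML_URL = 'PASTE_XML_REQUEST_URL_HERE'

resp = requests.get(XML_URL)
resp.raise_for_status()

root = ET.fromstring(resp.content)  # parse the XML payload
for elem in root.iter():            # walk the tree to locate the ETA fields
    print(elem.tag, (elem.text or '').strip())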
