I'm doing a project in my spare time where I have hit a problem with getting data from a webpage into the program.
This is my current code:
import urllib
import re
htmlfile = urllib.urlopen("http://www.superliga.dk/klub/aab?sub=squad")
htmltext = htmlfile.read()
regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
pattern = re.compile(regex)
goal = re.findall(pattern,htmltext)
print goal
And it's working okay except this part:
regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
I can't make it display all values on the webpage with this reactid, and I can't find any solution to this problem.
Any suggestions to how I can get Python to print it?
You are trying to match a tag you saw in the developer console of your browser, right?
Unfortunately, the HTML you saw is only the "final form" of a dynamic page: what you downloaded with urlopen is just the skeleton of the webpage, which the browser then fills dynamically with other elements via JavaScript, using data fetched from some backend server.
If you print the actual value stored in htmltext, you will find nothing like what you are trying to match with the regex, because it is missing all the further processing normally performed by the JavaScript.
What you can try to do is monitor (through the dev console) the fetched resources and reverse-engineer the API calls in order to recover the desired info. Chances are the response of these API calls is in JSON format, or at least has a structure far easier to parse than the HTML body.
UPDATE: for example, in Chrome's dev tools I can see async calls like:
http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q=&seasonId=10392&teamId[]=8470
Maybe this returns the info you are looking for.
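If that call does return JSON, the standard library can consume it directly. A minimal sketch of the parsing step, using a made-up payload (the field names here are hypothetical; read the real ones off the actual response in your dev tools):

```python
import json

# Made-up payload standing in for the body returned by the API call;
# the real field names must be taken from the actual response.
sample = '{"players": [{"name": "Player A", "goals": 12}, {"name": "Player B", "goals": 9}]}'

data = json.loads(sample)
for player in data["players"]:
    print("%s: %d goals" % (player["name"], player["goals"]))
```

With the real endpoint you would feed json.loads() the body of urlopen(...).read() instead of the sample string.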
Related
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website does not connect directly to the exchange main page, but I am not sure it is necessary.
The thing is, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but I am not that familiar with websites/HTML. So I would appreciate it if you explain the website-related info as if you were talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use XPath to obtain the data you need to 'extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you have extracted the HTML code, you can use a parser such as bs4 to pull out the required data.
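For instance, assuming the page exposes the price in an element with a class like price (a made-up selector; use the one you find in the dev tools), the bs4 step might look like this, with an inline sample standing in for Selenium's page_source:

```python
from bs4 import BeautifulSoup

# In practice html_source would be browser.page_source; this inline
# sample just stands in for it, and the "price" class is hypothetical.
html_source = '<html><body><span class="price">1.2345</span></body></html>'

soup = BeautifulSoup(html_source, "html.parser")
price = soup.find("span", class_="price").text
print(price)  # prints "1.2345"
```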
After following this tutorial on finding a CSS class and copying the text on a website, I tried to implement this in a small test script, but sadly it didn't work.
I followed the tutorial exactly on the same website and did get the headline of the webpage, but I can't get this process to work for any other class on that, or any other, webpage. Am I missing something? I am a beginner programmer and have never used Requests-HTML or anything similar before.
Here is an example of the code I'm using, the purpose being to grab the random fact that appears in the "af-description" class when you load the webpage.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://mentalfloss.com/amazingfactgenerator')
r.html.find('.af-description', first=True)
description = r.html.find('.af-description', first=True)
print("Fun Fact:" + description.text)
No matter how hard I try and no matter how I rearrange things or try different code, I can't get it to work. It seems unable to find the class or the text the class contains. Please help.
What you are trying to do requires that the HTML source contains an element with such a class. A browser does much more than just download HTML; it also downloads CSS and Javascript code when referenced by the page, and executes any scripts attached to the page, which can trigger further network activity. If the content you are looking for was generated by Javascript, you can see the elements in the browser development tools inspector, but that doesn't make the element accessible to the r.html object!
In the case of the URL you tried to scrape, if you look at the network console you'll see that an AJAX GET request to http://mentalfloss.com/api/facts is made to fill the <div af-details> structures, so if you wanted to scrape that data you could just get it as JSON directly from the API:
r = session.get('http://mentalfloss.com/api/facts')
description = r.json()[0]['fact']
print("Fun Fact:" + description)
You can make the requests_html session render the page with Javascript too by calling r.html.render().
This then uses a headless browser to render the HTML, execute the JavaScript code embedded in it, fetch the AJAX request and render the additional DOM elements, then reflect the whole page back to HTML for your code to mine. The first time you do this the required libraries for the headless browser infrastructure are downloaded for you:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('http://mentalfloss.com/amazingfactgenerator')
>>> r.html.render()
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
# .... a lot more information elided
[W:pyppeteer.chromium_downloader] chromium extracted to: /Users/mj/.pyppeteer/local-chromium/533271
>>> r.html.render()
>>> r.html.find('.af-description', first=True)
<Element 'div' class=('af-description',)>
>>> _.text
'The cubicle did not get its name from its shape, but from the Latin “cubiculum” meaning bed chamber.'
However, this requires your computer to do a lot more work; for this specific example, it's easier to just call the API directly.
The div that includes the class 'af-description' is not in the DOM but in a JS script, so it's normal that you can't find it.
If you test your script on a class that is in the DOM, like 'afg-page row', you should be fine.
I have a router that I want to log in to and retrieve information from using a Python script. I'm a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print (soup.prettify())
I have two questions which are:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to retrieve only the code I require from the script. I can't see a tag to point BS at to scrape. At the start of the HTML there is a list of variables whose data I want to scrape, for example:
var Device Pin = '12345678';
It's much easier to retrieve the information with a single script than to log on to the web interface each time. The variable sits within a script type="text/javascript" block.
Is BS the correct tool for the job? Can I just scrape that one line from the list of variables?
Any help, as always, is very much appreciated.
As far as I know, BeautifulSoup does not handle javascript. In this case, it's simple enough to just use a regular expression:
import re
# Capture the digits inside the quotes of the "var Device Pin = '...'" line.
# Note: pass response.text (a str) rather than response.content (bytes) as html.
m = re.search(r"var Device Pin\s+= '(\d+)'", html)
pin = m.group(1)
Regarding the authentication problem, you can wrap your call in a try/except and redo the call if it doesn't work the first time.
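As a sketch, the retry can be factored into a small helper (the three-attempt limit is arbitrary), demonstrated here with a deliberately flaky function:

```python
def retry(call, attempts=3):
    # Call `call` until it succeeds, re-raising the last error if all
    # attempts fail.
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except Exception as e:
            last_error = e
    raise last_error

# Demo: fails twice, then succeeds on the third attempt.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry(flaky))  # prints "ok"
```

For the router you would wrap the request itself, e.g. retry(lambda: requests.get(url, auth=HTTPBasicAuth('Username', 'Password'))).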
I'd run a packet sniffer, tcpdump or wireshark, to see the interaction between your script and your router. Viewing the interactions may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a for loop which will try N number of times to authenticate before failing.
Regarding scraping, you may want to consider lxml with the beautiful soup parser so you can use XPath. See can we use xpath with BeautifulSoup?
XPath would allow you to easily pull a single value, text, attribute, etc. from the HTML, provided lxml can parse it.
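A sketch of the XPath route with plain lxml, parsing an inline sample modeled on the router page from the question (lxml's own HTML parser usually suffices; the Beautiful Soup parser variant is a fallback for very broken markup):

```python
import lxml.html

# Inline sample modeled on the router page described in the question.
sample = """
<html><body>
<script type="text/javascript">
var Device Pin  = '12345678';
</script>
</body></html>
"""

root = lxml.html.fromstring(sample)
# XPath can address the script element directly and return its text.
script_text = root.xpath("//script[@type='text/javascript']/text()")[0]
print(script_text)
```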
I run this program but it gives me only "[]" instead of the web page data. Please help.
import urllib
import re
import lxml.html
start_link= "http://aepcindia.com/ApparelMarketplaces/detail"
html_string = urllib.urlopen(start_link)
dom = lxml.html.fromstring(html_string.read())
side_bar_link = dom.xpath("//*[@id='show_cont']/div/table/tr[2]/td[2]/text()")
print side_bar_link
file = open("next_page.txt","w")
for link in side_bar_link:
    file.write(link)
    print link
file.close()
The HTML source you are downloading contains an empty content area: <div id="show_cont"></div>. This div is populated later by a javascript function showData(). When you look at the page in a browser, the javascript is executed before, which is not the case when you just download the HTML source using urllib.
To get the data you want, you can try to mimic the POST request in the showData() function or, preferably, scrape the website using a scriptable headless browser.
Update: While a headless browser would be a much more generally applicable approach, in this case it might be overkill. You will actually be better off reverse-engineering the showData() function. The Ajax call in it is all too obvious: it delivers a plain HTML table, and you can also limit searches :)
http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:
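The two trailing path segments look like CakePHP-style named parameters, so the search can presumably be narrowed by filling them in. A small helper to build such URLs (a sketch; which values search_type and search_value accept is an assumption to verify against the site):

```python
BASE = "http://aepcindia.com/ApparelMarketplaces/ajax_detail"

def build_ajax_url(search_type="", search_value=""):
    # Named parameters are embedded in the path, as in the URL above.
    return "%s/search_type:%s/search_value:%s" % (BASE, search_type, search_value)

print(build_ajax_url())
```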
I am new to Python. I am trying to scrape data from a website, and the data I want cannot be seen with View > Source in the browser; it comes from another file. Is it possible to scrape the actual data shown on the screen with BeautifulSoup and Python?
example site www[dot]catleylakeman[dot]co(dot)uk/cds_banks.php
If not, is this possible using another route?
Thanks
The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final html would be; generally HTML scrapers should be used as a last resort if the website does not publish raw data like the above.
My guess is, the url you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (built into Chrome and Firefox, also available via Firebug). It gets called in with Ajax. You do not even need Beautiful Soup to parse that one, as it seems to be one long string separated with *| and sometimes **|. The following should get you initial access to that data:
import urllib2
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    data = f.read().split('*|')
finally:
    f.close()
print data
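Splitting on '*|' alone mixes the two separator levels; a sketch of a fuller parse, using a made-up sample string in the reported format (records separated by '**|', fields within a record by '*|'):

```python
# Made-up sample in the reported format; the real response comes from
# the bankCDS.php URL above.
sample = "Bank A*|5Y*|123.4**|Bank B*|5Y*|98.7"

# Split into records first, then into fields within each record.
records = [record.split('*|') for record in sample.split('**|')]
for fields in records:
    print(fields)
```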