I am trying to get the data in the table at this website which is updated via jquery after the page loads (I have permission) :
http://whichchart.com/
I currently use selenium and beautifulsoup to get data, however because this data is not visible in the html source, I can't access it. I have tried PyQt4 but it likewise does not get the updated html source.
The values are visible in firebug and chrome developer, so are there any python packages out there which can exploit this and feed it to beautifulsoup?
I'm not a massive techie so ideally I would like a solution which would work in Python or the next easiest software type.
I'm aware I can get it via proprietary "screen-scraper" software, but that is expensive.
Page is making AJAX call to get a data to http://whichchart.com/service.php?action=NewcastleCoal which returns values in JSON. So you can do the following:
Use urllib to get data using HTTP
Parse that data with json library reads method
Now you have a python object to process
If you need to process HTML page content I would suggest to use library like BeautifulSoup or scrapy
Related
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The things is that, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With selenium, you can also set the XPATH to obtain the data of -' extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.
When I do the crawl, I usually utilize scripts before parsing with python. Since this allows to get JSON which can be easily structured and parsed.
>>> import requests
>>> r = requests.get('~.json')
>>> r.json()
However, encountering this page, https://www.eiganetflix.jp/%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA
It seems there's no interaction to call JSON to show materials on the page.
And it is hard to find pagination javascript functions. (Actually, there is, but I mean it seems hard to execute. )
In this case, how can I utilize existing requests and json method?
Or is there any easy way to crawl this?
If I understand correctly, you want to scrape a webpage which does not have a JSON response. Check to be sure that the website does not have an API that allows you to get JSON data. Or even any other structured data such as XML would also be helpful. If there is no way, you would have to screen scrape, which is not the easiest method to do. Check scrapy which is a framework for doing this, or you can use a library like beautifulsoup for a custom solution.
If the page uses Javascript, you would somehow need to run it on the page to get content and browse pages. You can spynner or Selenium to do that.
http://www.indymini.com/p/mini-marathon/miniresults
I want to scrap table available on this url with python BS4 but when i change the table size or change page, url does not chang.
When navigating through the table, the URL does not change because the table seems to be implemented using javascript (DataTables library in particular) - and uses AJAX to get relevant data to display.
So, basically, I don't see a way you could scrape the page using BS4 and get data other than those displayed by default, when the page loads.
On the other hand, as the data is retrieved using AJAX, you could try to figure out the format of the AJAX request (what parameter does what with respect to the results you want, for example using Firebug) and retrieve the data directly in JSON format by calling the AJAX URL that supplies the data table.
But, depending on your intended use of the data, you might want to consider asking the owner of the website for permission to download and use the data. And, who knows - they might be willing to help.
Well its a ajax call that is sent to server via GET, here is quick and dirty scrapping code in python
ajax url is
import requests,time
c=0
data=list()
for i in range(1,2278):
url='http://results.xacte.com/json/search?eventId=1387&callback=jQuery18309972286304579958_1494520029659&sEcho=8&iColumns=13&sColumns=&iDisplayStart='+str(c)+'&iDisplayLength=10&mDataProp_0=&mDataProp_1=bib&mDataProp_2=firstname&mDataProp_3=lastname&mDataProp_4=sex&mDataProp_5=age&mDataProp_6=city&mDataProp_7=state&mDataProp_8=country&mDataProp_9=&mDataProp_10=&mDataProp_11=&mDataProp_12=&sSearch=&bRegex=false&sSearch_0=&bRegex_0=false&bSearchable_0=false&sSearch_1=&bRegex_1=false&bSearchable_1=true&sSearch_2=&bRegex_2=false&bSearchable_2=true&sSearch_3=&bRegex_3=false&bSearchable_3=true&sSearch_4=&bRegex_4=false&bSearchable_4=true&sSearch_5=&bRegex_5=false&bSearchable_5=true&sSearch_6=&bRegex_6=false&bSearchable_6=true&sSearch_7=&bRegex_7=false&bSearchable_7=true&sSearch_8=&bRegex_8=false&bSearchable_8=true&sSearch_9=&bRegex_9=false&bSearchable_9=true&sSearch_10=&bRegex_10=false&bSearchable_10=true&sSearch_11=&bRegex_11=false&bSearchable_11=false&sSearch_12=&bRegex_12=false&bSearchable_12=false&iSortCol_0=0&sSortDir_0=asc&iSortingCols=1&bSortable_0=false&bSortable_1=true&bSortable_2=true&bSortable_3=true&bSortable_4=true&bSortable_5=true&bSortable_6=true&bSortable_7=true&bSortable_8=true&bSortable_9=false&bSortable_10=false&bSortable_11=false&bSortable_12=false&_='+str(time.time())
r=requests.get(url)
c+=1
print (r.text,'-------------',)
#do whatever you want to do with it, r.text will give the raw data.
I am new to Python. I am trying to scrape data from a website and the data I want can not be seen on view > source in the browser. It comes from another file. It is possible to scrape the actual data on the screen with Beautifulsoup and Python?
example site www[dot]catleylakeman[dot]co(dot)uk/cds_banks.php
If not, is this possible using another route?
Thanks
The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final html would be; generally HTML scrapers should be used as a last resort if the website does not publish raw data like the above.
My guess is, the url you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (builtin to chrome and firefox, also using firebug). It gets called in with Ajax. You do not even need beatiful soup to parse that one as it seems to be a long string separated with *| and sometimes **|. The following should get you initial access to that data:
import urllib2
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
data = f.read().split('*|')
finally:
f.close()
print data
I'm trying to get the content of a HTML table generated dynamically by JavaScript in a webpage & parse it using BeautifulSoup to use certain values from the table.
Since the content is generated by JavaScript it's not available in source (driver.page_source).
Is there any other way to obtain the content and use it? It's table containing list of tasks, I need to parse the table and identify whether specific task I'm searching for is available.
As mentioned by Julian, i'd rather check my "Net" tab in Firebug (or similar tool in other browsers) and get the data like this. If the data is JSON, just use json.loads(), if it's html, you can parse it using BS or any other lib as you say. Maybe you would like to try my dummy lib, which simplifies this and returns tables as tablib objects, which you can get as csv, excel, json etc.
You'd need to figure out what HTTP requests the Javascript is making, and make the same ones in your Python code. You can do this by using your favorite browser's development tools, or wireshark if forced.