Downloading URL in Python without urllib - python

I have a problem in downloadig URL.
I need to download webpage with the table. When I get .html file with the help of urllib or urllib2, it has some problems connected with javascript (or same languages). There's only source code such as id_name e.t.c, but it don't have any table information (columns and rows).
Nevertheless, when I save .html in Google Chrome, it actually has information in table (not source code, but columns and rows). So what should I do to make it in Python?

You can use selenium to simulate browser. It will execute javascript then you can get the information you want

Related

Why does a list appear as a comment with Python Beautiful Soup?

I am trying to scrape the addresses of Dunkin' locations using this website: https://www.dunkindonuts.com/en/locations?location=10001. However, when trying to access the list of each Dunkin' on the web page, it shows up as comment. How do I access the list? I've never done web scraping before.
Here's my current code, I'm expecting a list of Dunkin' stores which I can then extract the addresses from.
requests.get() will return the raw HTML for a web page. This is only the beginning of the journey when you view this page in the browser. Your browser will parse that HTML to create the DOM. It will load other resources, such as images and scripts from other files. Then it will execute those scripts. In the modern web, those scripts will modify the DOM to give the page that you finally see in the browser. requests alone doesn't give you all that.
One solution is to use a library that loads the HTML into a browser and does all of the magic. selenium is one such library.

Scrape multiple table pages with BeautifulSoup and Python

http://www.indymini.com/p/mini-marathon/miniresults
I want to scrap table available on this url with python BS4 but when i change the table size or change page, url does not chang.
When navigating through the table, the URL does not change because the table seems to be implemented using javascript (DataTables library in particular) - and uses AJAX to get relevant data to display.
So, basically, I don't see a way you could scrape the page using BS4 and get data other than those displayed by default, when the page loads.
On the other hand, as the data is retrieved using AJAX, you could try to figure out the format of the AJAX request (what parameter does what with respect to the results you want, for example using Firebug) and retrieve the data directly in JSON format by calling the AJAX URL that supplies the data table.
But, depending on your intended use of the data, you might want to consider asking the owner of the website for permission to download and use the data. And, who knows - they might be willing to help.
Well its a ajax call that is sent to server via GET, here is quick and dirty scrapping code in python
ajax url is
import requests,time
c=0
data=list()
for i in range(1,2278):
url='http://results.xacte.com/json/search?eventId=1387&callback=jQuery18309972286304579958_1494520029659&sEcho=8&iColumns=13&sColumns=&iDisplayStart='+str(c)+'&iDisplayLength=10&mDataProp_0=&mDataProp_1=bib&mDataProp_2=firstname&mDataProp_3=lastname&mDataProp_4=sex&mDataProp_5=age&mDataProp_6=city&mDataProp_7=state&mDataProp_8=country&mDataProp_9=&mDataProp_10=&mDataProp_11=&mDataProp_12=&sSearch=&bRegex=false&sSearch_0=&bRegex_0=false&bSearchable_0=false&sSearch_1=&bRegex_1=false&bSearchable_1=true&sSearch_2=&bRegex_2=false&bSearchable_2=true&sSearch_3=&bRegex_3=false&bSearchable_3=true&sSearch_4=&bRegex_4=false&bSearchable_4=true&sSearch_5=&bRegex_5=false&bSearchable_5=true&sSearch_6=&bRegex_6=false&bSearchable_6=true&sSearch_7=&bRegex_7=false&bSearchable_7=true&sSearch_8=&bRegex_8=false&bSearchable_8=true&sSearch_9=&bRegex_9=false&bSearchable_9=true&sSearch_10=&bRegex_10=false&bSearchable_10=true&sSearch_11=&bRegex_11=false&bSearchable_11=false&sSearch_12=&bRegex_12=false&bSearchable_12=false&iSortCol_0=0&sSortDir_0=asc&iSortingCols=1&bSortable_0=false&bSortable_1=true&bSortable_2=true&bSortable_3=true&bSortable_4=true&bSortable_5=true&bSortable_6=true&bSortable_7=true&bSortable_8=true&bSortable_9=false&bSortable_10=false&bSortable_11=false&bSortable_12=false&_='+str(time.time())
r=requests.get(url)
c+=1
print (r.text,'-------------',)
#do whatever you want to do with it, r.text will give the raw data.

Issues downloading full HTML of webpage with Python

I'm working on a project where I require the all of the game ID #'s found in the current scores section of http://www.nhl.com/ to download content/ parse stats for each game. I want to be able to get all current game ID's in one go, but for some reason, I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
return css_class is not None and css_class == 'scrblk'
so, when I actually went to the web page in Firefox and saved it, then loaded the saved file in beautifulsoup4, I did the following:
>>>soup = bs(open('nhl.html'))
>>>soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
using requests.get() and saving the .text attribute in a file
using the iter_content() and iter_lines() methods of the request
object to write to the file piece by piece
using wget to download the page (through subprocess.call())
and open the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so I downloaded (or so I thought)
all the necessary data.
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to re-think your approach. What you see in the browser is not what the response contains. The site uses JavaScript to load the information you are after so you should look more carefully in the result what you get to find what you are looking for.
In the future to handle such problems try out Chrome's developer console and disable JavaScript and open a site such way. Then you will see if you are facing JS or the site would contain the values you are looking for.
And by the way what you do is against the Terms of Service of the NHL website (according to Section 2. Prohibited Content and Activities)
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;

Get dynamic html table using selenium & parse it using beautifulsoup

I'm trying to get the content of a HTML table generated dynamically by JavaScript in a webpage & parse it using BeautifulSoup to use certain values from the table.
Since the content is generated by JavaScript it's not available in source (driver.page_source).
Is there any other way to obtain the content and use it? It's table containing list of tasks, I need to parse the table and identify whether specific task I'm searching for is available.
As mentioned by Julian, i'd rather check my "Net" tab in Firebug (or similar tool in other browsers) and get the data like this. If the data is JSON, just use json.loads(), if it's html, you can parse it using BS or any other lib as you say. Maybe you would like to try my dummy lib, which simplifies this and returns tables as tablib objects, which you can get as csv, excel, json etc.
You'd need to figure out what HTTP requests the Javascript is making, and make the same ones in your Python code. You can do this by using your favorite browser's development tools, or wireshark if forced.

Scraping Ajax using Python

I am trying to get the data in the table at this website which is updated via jquery after the page loads (I have permission) :
http://whichchart.com/
I currently use selenium and beautifulsoup to get data, however because this data is not visible in the html source, I can't access it. I have tried PyQt4 but it likewise does not get the updated html source.
The values are visible in firebug and chrome developer, so are there any python packages out there which can exploit this and feed it to beautifulsoup?
I'm not a massive techie so ideally I would like a solution which would work in Python or the next easiest software type.
I'm aware I can get it via proprietary "screen-scraper" software, but that is expensive.
Page is making AJAX call to get a data to http://whichchart.com/service.php?action=NewcastleCoal which returns values in JSON. So you can do the following:
Use urllib to get data using HTTP
Parse that data with json library reads method
Now you have a python object to process
If you need to process HTML page content I would suggest to use library like BeautifulSoup or scrapy

Categories

Resources