How to scrape an AJAX website? - python

In the past, I've used the urllib2 library to get the source code of websites. However, for a recent website I've been trying to play with, I've noticed that I can't find the information I need in the source code.
http://www.wgci.com/playlist is the site that I've been looking at, and I want to get the most recently played song and the playlist of recent songs. I essentially want to copy and paste the visible, displayed text on the website and put it in a string. Alternatively, being able to access the element that holds these values in plaintext and get them using urllib2 normally would be nice. Is there any way to do either of these things?
Thanks kindly.

The website you want to scrape is using ajax calls to populate its pages with data.
You have two ways of scraping data from it:
Use a headless browser that supports javascript (ZombieJS for instance) and scrape the generated output, but that's complicated and overkill.
Understand how their API works and call it directly, which is way simpler.
Use Chrome dev tools (network tab) to see the calls while browsing their website.
For example, the list of last played songs for a given stream is available in JSON at
http://www.wgci.com/services/now_playing.html?streamId=841&limit=12
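For example, a minimal sketch of calling that endpoint with the requests module (assuming the endpoint still exists and returns JSON; inspect the response before relying on any particular fields):

import requests

# Endpoint discovered via the Network tab; parameters may change over time
url = "http://www.wgci.com/services/now_playing.html"
params = {"streamId": 841, "limit": 12}

resp = requests.get(url, params=params)
resp.raise_for_status()

data = resp.json()  # assuming the response is JSON, as observed in dev tools
print(data)         # inspect the structure to locate song titles and artists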

Related

Suitable Python modules for navigating a website

I am looking for a python module that will let me navigate the search bars, links, etc. of a website.
For context I am looking to do a little webscraping of this website [https://www.realclearpolitics.com/]
I simply want to take information on each state (polling data etc.) in relation to the 2020 election and organize it all in a database.
Obviously there are a lot of states to go through, and each is on a separate webpage. So I'm looking for a method in Python with which I could quickly navigate the site and take the data of each page, as well as update and add to existing data. A way of quickly navigating links and search bars with my input data would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]

for state in states:
    # code to look up the state in the search bar of the website
    # figures being taken from the website etc.
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As #LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI via a headless browser, e.g. clicking a button or entering text into a search bar. If your needs aren't that complex, for instance if you just need to quickly scrape all the raw content from a web page and process it, then you can use the requests module (a third-party library, not part of Python's standard library).
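For illustration, a minimal Selenium sketch of that kind of interaction; the locator name "q" and the search term are placeholder assumptions, so inspect the page for the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.realclearpolitics.com/")

# Hypothetical locator: inspect the page to find the real search box
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("Alabama")
search_box.send_keys(Keys.RETURN)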
For processing raw content from a crawl, I would recommend Beautiful Soup.
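For example, a short requests + Beautiful Soup sketch (the URL is the one from the question; which tags you pull out depends on the page):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.realclearpolitics.com/")
soup = BeautifulSoup(resp.text, "html.parser")

# Print every link on the page as a starting point for navigation
for link in soup.find_all("a"):
    print(link.get("href"))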
Hope that helps!

How to get visible text from a webpage using Selenium & python?

I am trying to grab a bunch of numbers that are presented in a table on a web page that I've accessed using python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source; rather, they are deeply embedded in complex html served by several URLs called by the main page (the numbers update every few seconds). I know I could parse the html to get the numbers I want, but the numbers are already sitting on the front page in perfect format all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use python and get Selenium webdriver to get me those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source but the text returned does not contain the numbers). Or is there a way to essentially copy text and numbers from a table visible on the screen using python and Selenium? (I’ve looked into xdotool but didn’t find enough documentation to help). I’m just learning Selenium so any suggestions will be much appreciated!
Well, I figured out the answer to my question. It's embarrassingly easy. This line gets just what I need - all the text that is visible on the web page:
page_text = driver.find_element_by_tag_name('body').text
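A side note in case you are on a newer Selenium: the find_element_by_* helpers were removed in Selenium 4, where the equivalent is:

from selenium.webdriver.common.by import By

# Selenium 4 equivalent of the line above
page_text = driver.find_element(By.TAG_NAME, 'body').text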
So, there are a few different reasons why you might not be able to get some info on a page:
The information hasn't loaded yet. You have to wait some time for your information to be ready; reading up on waits will give you a better understanding (see the sketch below). Sometimes page elements are added dynamically with JS and so on, and they can load very slowly.
The information may consist of a different type of data. For example, you are expecting text with numbers, but you may get a picture with the numbers on the page. In this situation you must change your tactics and use other functions to get what you need.
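For the first case, a minimal sketch of an explicit wait; the URL and the table locator are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds for the table holding the numbers to appear
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))
print(table.text)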

How to get dynamically refreshed information from a web page with python3?

I want to get some information from a web page. I use requests.get to fetch the page, but I cannot find what I want in it. Checking carefully, I found that the info I want is in a list with a scrollbar: when I drag the scrollbar down, more and more info is loaded. So I guess not all the info in the list has been loaded yet when I get the page with the requests module. I want to know what happens in this process and how I can gather the information I want. (I am not familiar with the HTML language.)
I want to know what happens in this process
It sounds like when the user scrolls, the scrolling causes some javascript (js) to execute, and the js makes repeated requests to the server for more data. Unfortunately, the requests module cannot cause the javascript on an html page to execute; all you get back is the text of the js. The inability to execute the javascript on an html page in order to retrieve what the user actually sees has been a problem for a long time. Fortunately, smart programmers have largely solved that problem. You need to use a different module: check out the selenium module.
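For instance, a rough sketch of simulating that scrolling with selenium so the js actually fires (the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Keep scrolling to the bottom until the page height stops growing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the js-triggered requests time to finish
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print(driver.find_element(By.TAG_NAME, "body").text)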
I am not familiar with the HTML language
Scraping web pages can get really tricky really fast, and some web pages proactively try to prevent computer programs from scraping their content, so you need to know both html and js in order to figure out what is going on.

Issues downloading full HTML of webpage with Python

I'm working on a project where I require all of the game IDs found in the current scores section of http://www.nhl.com/ in order to download content and parse stats for each game. I want to be able to get all the current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
So, when I actually went to the web page in Firefox and saved it, then loaded the saved file in beautifulsoup4, I did the following:
>>> soup = bs(open('nhl.html'))
>>> soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
using requests.get() and saving the .text attribute to a file
using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
using wget to download the page (through subprocess.call()) and opening the resultant file; for this option, I was sure to use the --page-requisites and --convert-links flags so that I downloaded (or so I thought) all the necessary data
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to re-think your approach. What you see in the browser is not what the response contains: the site uses JavaScript to load the information you are after, so you have to look more carefully at the result you actually get back to find what you are looking for.
In the future, to diagnose such problems, open the site in Chrome with JavaScript disabled via the developer tools. Then you will see whether you are up against JS-rendered content or whether the raw response contains the values you are looking for.
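A quick way to check the same thing from Python, using the 'scrblk' class from the question:

import requests

resp = requests.get("http://www.nhl.com/")

# If this prints False, the 'scrblk' divs are injected by JavaScript
# and will never appear in the raw HTML that requests receives
print('scrblk' in resp.text)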
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, Prohibited Content and Activities):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;

How to crawl a web site where page navigation involves dynamic loading

I want to crawl a website that has multiple pages, where a page is dynamically loaded when its page number is clicked. How do I screen scrape it?
i.e., since the url is not present in an href or an a tag, how do I crawl to the other pages?
Would be grateful if someone helped me with this.
PS: The URL remains the same when a different page is clicked.
You should also consider Ghost.py, since it allows you to run arbitrary javascript commands, fill forms and take snapshots very quickly.
If you are using google chrome, you can check the url which is dynamically being called in network->headers of the developer tools. Based on that, you can identify whether it is a GET or a POST request.
If it is a GET request, you can find the parameters straight away from the url.
If it is a POST request, you can find the parameters from the form data in network->headers of the developer tools.
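Once you know the method and its parameters, you can replay the call directly with requests. A minimal sketch, assuming a hypothetical endpoint and a "page" parameter (both placeholders you would replace with what the Network tab shows):

import requests

# Hypothetical endpoint and parameter discovered in the Network tab
url = "http://example.com/ajax/results"

# For a GET request, pass the parameters seen in the url
resp = requests.get(url, params={"page": 2})

# For a POST request, send the form data instead
resp = requests.post(url, data={"page": 2})

print(resp.text)  # or resp.json() if the endpoint returns JSON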
You could look for the data you want in the javascript code instead of the HTML. This is usually a pain but you can do fun things with regular expressions.
Alternatively, some of the browser testing libraries like splinter work by loading the page up in an actual browser like firefox or chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
You cannot do that easily, since it is ajax pagination (even with mechanize). Instead, view the source of the page and try to find out which url request is used for the ajax pagination. Then you can recreate that request yourself and process the returned data in your own way.
If you don't mind using gevent, GRobot is another good choice.
