Suitable Python modules for navigating a website - python

I am looking for a python module that will let me navigate searchbars, links etc of a website.
For context I am looking to do a little webscraping of this website [https://www.realclearpolitics.com/]
I simply want to take information on each state (polling data etc) in relation to the 2020 election and organize it all in a collection of a database.
Obviously there are a lot of states to go through and each is on a seperate webpage. So im looking for a method in python in which i could quickly navigate the site and take the data of each page etc aswell as update and add to existing data. So finding a method of quickly navigating links and search bars with my inputted data would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska" ,"Arizona", "....."]
for state in states:
#code to look up the state in the searchbar of website
#figures being taken from website etc
break
Here is the rough idea i have

There are many options to accomplish this with Python. As #LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a websites UI via a headless browser. E.g clicking a button, entering text into a search bar, etc. If your needs aren't that complex, for instance if you just need to quickly scrape all the raw content from a web page and process it, than you should use the requests module from Python's standard library.
For processing raw content from a crawl, I would recommend beautiful soup.
Hope that helps!

Related

scrape data from URL into pandas

I am trying to scrape date from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is:
https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table gender, age, time for the past 5k races (name is not really important). The data is presented in the web page 50 at a time for around 25 pages.
It uses various javascript frameworks for the UI (node.js, react). Found this out using the "What Runs" ad-on in chrome browser.
Here's the real reason I'd like to get this data. I'm a new runner and will be participating in this 5k next weeked and would like to explore some of the distribution statistics for past faces (its an annual race, and data goes back to 1980's).
Thanks in advance!
The data comes from socket.io, and there are python packages for it. How did I find it?
If you open Network panel in your browser and choose XHR filter, you'll find something like
https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look into content it is what we need.
Luckily this site has a source maps.
Now you can go to More tools -> Search and find this domain.
And then find resultsHubUrl in settings.
This property used inside setUpSocket.
And setUpSocket used inside IndividualResultsStream.js and RaseStreams.js.
Now you can press CMD + P and go deep down to this files.
So... I've spent around five minutes to find it. You can go ahead! Now you have all the necessary tools. Feel free to use breakpoints and read more about chrome developer tools.
You actually need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler ? You can also try to inspect the page (F12 -> Networking) while is loading the data relevant to you (from a restful api, I suppose), and then make the same calls from command line using curl or the requests python library.

How to get renewable information on a web by python3?

I want to get some information on a web page. I use requests.get to abstract the page. But I cannot find what I want. Checking it carefully, I found the info I want is in a list with a scrollbar. When I drag scrollbar down, more and more info is loaded. So I guess all the info in the list is not loaded yet when I get the page by module requests. I want to know what is happened in this process and How can I gather the information I want. (I am not familiar with Html language).
I want to know what is happened in this process
It sounds like when the user scrolls, the scrolling causes some javascript(js) to execute, and the js makes repeated requests to the server for more data. Unfortunately, the requests module cannot cause the javascript on an html page to execute--all you get back is the text of the js. The unable to execute javascript on an html page in order to retrieve what the user actually sees has been a problem for a long time. Fortunately, smart programmers have largely solved that problem. You need to use a different module. Check out the selenium module.
I am not familiar with Html language
Scraping web pages can get really tricky really fast, and some web pages proactively try to prevent computer programs from scraping their content, so you need to know both html and js in order to figure out what is going on.

How to scrape AJAX website?

In the past, I've used the urllib2 library to get source codes from websites. However, I've noticed that for a recent website I've been trying to play with, I can't find the information I need in the source code.
http://www.wgci.com/playlist is the site that I've been looking at, and I want to get the most recently played song and the playlist of recent songs. I essentially want to copy and paste the visible, displayed text on a website and put it in a string. Alternatively, being able to access the element that holds these values in plaintext and get them using urllib2 normally would be nice. Is there anyway to do either of these things?
Thanks kindly.
The website you want to scrap is using ajax calls to populate it's pages with data.
You have 2 ways to scrapping data from it:
Use a headless browser that supports javascript (ZombieJS for instance), and scrap the generated output, but that's complicated and overkill
Understand how their API work, and call that directly, which is way simpler.
Use Chrome dev tools (network tab) to see the calls while browsing their website.
For example, the list of last played songs for a given stream is available in JSON at
http://www.wgci.com/services/now_playing.html?streamId=841&limit=12

Is a Web Crawler more suitable?

TL;DR Version :
I have only heard about web crawlers in intelluctual conversations Im not part of. All I want to know that can they follow a specific path like:
first page (has lot of links) -->go to links specified-->go to
links(specified, yes again)-->go to certain link-->reach final page
and download source.
I have googled a bit and came across Scrappy. But I am not sure if I fully understand web crawlers to begin with and if scrappy can help me follow the specific path I want.
Long Version
I wanted to extract some text of a group of static web pages. These web pages are very simple with just basic HTML. I used python and the urllib to access the URL,extract the text and work with it. Pretty soon I realized that I will have to basically visit all these pages and copy paste the URL into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this
page. Then select only a few organisms (I have a list of those). On Clicking on of them you can see this page. If you look under the table - MTases active in the genome there are Enzymes which are hyperlinks. Clinking on those lead to this page. On the right hand side there is link named Sequence Data. Once clicked it leads to the page which has a small table on the lower right with yellow headers. under it it has an entry DNA (FASTA STYLE. Clicking on view will lead to the page im interested in and want to download the page source from.
I think you are definitely on the right track for looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector which I know can let you follow links on a page without storing that page if is is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether it is based on crawl depth, URL pattern, content pattern, etc).

Suggestion on creation app for getting information from webpage

First want to say that I have experience with python and some web libraries like mechanize, beautiful soup, urllib2.
The idea is to create an app that will grab information from webpage, that I currently looking on in webbrowser. And than store it.
For example:
I manually go to the website, create a user.
Than run my app, that will grab some details from webpage, that I'm currently looking on. like user name, first name, last name and so on.
Problems:
I don't know how to make a program to run kinda on top of my webbrowser. I can't simply make a scipt to login to this webpage and do the rest with Beautiful Soup because it has a very good protection from web-crawlers and web bots.
Need some place to start. So the main question is is it possible to grab information that currently on my web browser? if yes hope to hear some suggestions on how to make my program look at the browser?
Please fill free to ask me if you not kinda understand what I'm asking, or you have some suggestions, some libraries that I can use.
The easiest thing to do is probably to save the HTML content of the current page to a file (using File -> Save Page As or whatever it is in your browser) and then running Beautiful Soup / lxml.html / whatever on that file.
You could probably also get Selenium to do what you want, though I've never used it and am not sure.

Categories

Resources