I am trying to scrape data from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is:
https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table of gender, age, and time for the past 5k races (the name is not really important). The data is presented on the web page 50 rows at a time, across around 25 pages.
The site uses various JavaScript frameworks for the UI (Node.js, React); I found this out using the "What Runs" add-on in the Chrome browser.
Here's the real reason I'd like to get this data: I'm a new runner and will be participating in this 5k next weekend, and I would like to explore some of the distribution statistics for past races (it's an annual race, and the data goes back to the 1980s).
Thanks in advance!
The data comes from socket.io, and there are Python packages for it. How did I find it?
If you open the Network panel in your browser and choose the XHR filter, you'll find something like
https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look at its content; it is what we need.
Luckily, this site ships source maps.
Now you can go to More tools -> Search and find this domain.
Then find resultsHubUrl in the settings.
This property is used inside setUpSocket.
And setUpSocket is used inside IndividualResultsStream.js and RaseStreams.js.
Now you can press CMD + P and dig into these files.
So... it took me around five minutes to find this. You can take it from here! Now you have all the necessary tools; feel free to use breakpoints and read more about the Chrome developer tools.
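For illustration, here is a minimal sketch of what the Python side could look like using the python-socketio package. The base URL comes from the request captured above, but the event name is a guess you would confirm with the breakpoint work described here, and since the captured URL uses EIO=3 (an older Engine.IO protocol) you may need python-socketio 4.x rather than the latest release.
# Sketch using python-socketio (pip install python-socketio).
# "race_results" is a hypothetical event name; use breakpoints in setUpSocket
# to find the real event names the server emits.
import socketio

sio = socketio.Client()

@sio.event
def connect():
    print("connected to results hub")

@sio.on("race_results")  # hypothetical event name
def on_results(data):
    print(data)

sio.connect("https://results-hub.athlinks.com", transports=["polling"])
sio.wait()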
You actually need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler? You can also inspect the page (F12 -> Network) while it is loading the data relevant to you (from a RESTful API, I suppose), and then make the same calls from the command line using curl or the requests Python library.
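As a rough sketch of the Splash option (assuming you have a Splash instance running locally, e.g. with docker run -p 8050:8050 scrapinghub/splash), you can ask Splash to render the page and hand back the final HTML over its HTTP API; the wait value here is a guess.
# Sketch: render the JS-heavy page through a local Splash instance.
import requests

rendered = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results",
        "wait": 2,  # seconds to let the JS populate the results; increase if rows are missing
    },
)
html = rendered.text  # now parseable with BeautifulSoup, lxml, etc.
print(len(html))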
Related
I'd like to ask somebody with experience with headless browsers and Python whether it's possible to extract the box info with the distance from the closest strike on the webpage below. Until now I was using Python's bs4, but since everything here is driven by jQuery, a simple download of the webpage doesn't work. I found PhantomJS, but I wasn't able to extract it with that either, so I am not sure if it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can be easily done with Selenium. Selenium has both a headless mode and a heady mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website, and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as searching for a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out which Selenium features you can use to do what you want; that depends on luck and on whether what you want happens to be easy or hard for that particular website.
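To make that concrete, here is a minimal headless-Selenium sketch for the page you linked; the CSS selector is a placeholder I made up, so replace it with whatever element actually holds the closest-strike box once you inspect the page.
# Minimal headless Selenium sketch (Selenium 4.x, Chrome).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless")  # drop this line to watch the browser ("heady" mode)
driver = webdriver.Chrome(options=opts)
driver.get("https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0")
driver.implicitly_wait(10)  # give the jQuery-driven UI time to render

# ".closest-strike" is a hypothetical selector; find the real one with the inspector.
box = driver.find_element(By.CSS_SELECTOR, ".closest-strike")
print(box.text)
driver.quit()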
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
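A hedged sketch of what that could look like with requests is below. The parameter names come from the captured request above, but every value here is a placeholder; in particular authid, timestamp, and hash would be copied from a request captured in your own browser, and figuring out which parameters are actually required is exactly what the trial and error is for.
# Sketch: call the lightning API directly with placeholder parameter values.
import requests

params = {
    "location": "49.13688,16.56522",  # placeholder value
    "units": "metric",                # placeholder value
    "verbose": "true",
    # "authid": "...", "timestamp": "...", "hash": "...",  # copy from a captured request
}
resp = requests.get(
    "https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark",
    params=params,
    headers={"User-Agent": "Mozilla/5.0"},
)
print(resp.status_code)
print(resp.text[:500])  # inspect the payload to see which fields you actually need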
I want to know if it is possible to extract data from an interactive JS graph like the one here:
https://www.eurocontrol.int/Economics/DailyTrafficVariation-States.html
A problem here is that in order to get the data I need, I first have to select a gray bar on the first table in order to generate the second table I need, as shown:
[Screenshot: selecting a bar in the Daily Air Traffic first table generates the second table with the needed percentage data]
Also, I only want to extract the percentages displayed for four countries.
I tried a few Python packages, but they were not that effective for interactive JS graphs; most seem good only for static tables like those found on Wikipedia. I tried BeautifulSoup, Pandas, Requests, and Selenium, inspected the webpage to see its XHR data, and tried to find whether there was a CSV file attached. None of them could capture an interactive JS graph in order to extract its data.
Is it possible? And could I download the data to an Excel file?
Thanks!
You must use a tool that is able to render and execute JavaScript, which basically means a web browser. There are several available, some based on Firefox and some based on Chrome.
Given the links you provided, I think Puppeteer from Google (https://pptr.dev/) will allow you to do what you need.
But it seems to me you are underestimating the complexity of scraping a website. Be prepared to overcome many difficulties, the most important one being that the site you are scraping may not like having its data scraped and may take drastic countermeasures. This approach will probably work for scraping a few pages, but hardly for a lot of data.
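Since the rest of this thread uses Python, here is the same idea sketched with pyppeteer (a Python port of Puppeteer). Both selectors are hypothetical; you would have to inspect the page to find the real element for a gray bar and for the second table before this can work.
# Sketch with pyppeteer: click a bar, wait for the second table, read its rows.
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(
        "https://www.eurocontrol.int/Economics/DailyTrafficVariation-States.html",
        waitUntil="networkidle2",
    )
    await page.click("#daily-chart rect")         # hypothetical selector for a gray bar
    await page.waitForSelector("#country-table")  # hypothetical selector for the second table
    rows = await page.querySelectorAllEval(
        "#country-table tr",
        "rows => rows.map(r => r.innerText)",
    )
    print(rows)  # filter down to the four countries / percentage columns you need
    await browser.close()

asyncio.run(main())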
I am looking for a Python module that will let me navigate the search bars, links, etc. of a website.
For context, I am looking to do a little web scraping of this website [https://www.realclearpolitics.com/].
I simply want to take the information on each state (polling data, etc.) in relation to the 2020 election and organize it all in a database.
Obviously there are a lot of states to go through, and each is on a separate webpage, so I'm looking for a method in Python with which I could quickly navigate the site, take the data from each page, and update and add to existing data. Finding a way of quickly navigating links and search bars with my input data would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]

for state in states:
    # code to look up the state in the search bar of the website
    # figures being taken from the website etc.
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As #LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI via a headless browser, e.g. clicking a button, entering text into a search bar, etc. If your needs aren't that complex, for instance if you just need to quickly scrape all the raw content from a web page and process it, then you should use the requests library (a third-party package rather than part of the standard library).
For processing the raw content from a crawl, I would recommend Beautiful Soup.
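For example, a rough requests + Beautiful Soup sketch might look like the following; the URL is the site from your question, and the link filter is just a placeholder you would adapt once you know how the per-state pages are actually named.
# Sketch: fetch the landing page and collect candidate links to per-state pages.
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://www.realclearpolitics.com/",
    headers={"User-Agent": "Mozilla/5.0"},  # some sites serve less to unknown clients
)
soup = BeautifulSoup(resp.content, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if "2020" in href and "poll" in href:  # placeholder filter; adjust to the real URL scheme
        print(link.get_text(strip=True), href)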
Hope that helps!
I am trying to extract data from a website, and the data is in a table:
import requests
from bs4 import BeautifulSoup

url = requests.get("xxxxx")
soup = BeautifulSoup(url.content, "html.parser")
table = soup.find_all("table")[0]
rows = table.find_all("tr")
I tried this code and it works, but only 42 rows are extracted while the source table contains 220 rows.
Can someone tell me how to fix this?
Welcome.
There are two possibilities: JavaScript or website security.
requests is JavaScript-agnostic and doesn't execute any JavaScript code. You'll want a headless browser solution (Selenium is popular) that more closely mimics a browser, especially when it comes to JavaScript.
Many websites don't want to be scraped and employ different methods to prevent it. The simplest forms are checking the User-Agent value of the client (your Python script) or rate-throttling (20k refreshes a second isn't human); e.g., if the User-Agent is anything other than a known value, the site will behave differently (little or no data). Other forms of defense are more complex, such as trying to play audio in your "browser" or polling your "browser"'s resolution. For those you'll need to investigate the site's behavior, which can take time. You can start off with either the Network tab of your browser's developer tools (F12 on Firefox) or ZAP Proxy for more refined control.
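To illustrate the User-Agent point, a quick sketch: fetch the same page with and without a browser-like User-Agent and compare how much content comes back. The URL here is a placeholder for the site you are scraping.
# Sketch: compare responses with and without a browser-like User-Agent.
import requests

url = "https://example.com/"  # substitute the page you are scraping

plain = requests.get(url)
browser_like = requests.get(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
)
print(len(plain.text), len(browser_like.text))  # a big difference suggests User-Agent filtering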
In the past, I've used the urllib2 library to get the source code of websites. However, I've noticed that for a recent website I've been trying to play with, I can't find the information I need in the source code.
http://www.wgci.com/playlist is the site that I've been looking at, and I want to get the most recently played song and the playlist of recent songs. I essentially want to copy and paste the visible, displayed text on a website and put it in a string. Alternatively, being able to access the element that holds these values in plaintext and get them using urllib2 normally would be nice. Is there any way to do either of these things?
Thanks kindly.
The website you want to scrape is using AJAX calls to populate its pages with data.
You have two ways of scraping data from it:
Use a headless browser that supports JavaScript (ZombieJS, for instance) and scrape the generated output, but that's complicated and overkill.
Understand how their API works and call it directly, which is way simpler.
Use the Chrome dev tools (Network tab) to see the calls while browsing their website.
For example, the list of last played songs for a given stream is available in JSON at
http://www.wgci.com/services/now_playing.html?streamId=841&limit=12
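A small sketch of calling that endpoint with the requests library is below. The URL and parameters are the ones above; whether the response is plain JSON or JSONP wrapped in a callback is something to check in the Network tab first.
# Sketch: hit the now_playing endpoint directly and inspect what comes back.
import requests

resp = requests.get(
    "http://www.wgci.com/services/now_playing.html",
    params={"streamId": 841, "limit": 12},
)
print(resp.status_code)
print(resp.text[:500])  # if this is plain JSON, resp.json() will parse it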