Web scraping with Python - with interactive website

Can anybody recommend a Python package to extract data from the Dutch met office website:
https://www.knmi.nl/nederland-nu/weer/waarschuwingen-en-verwachtingen/weer-en-klimaatpluim
The site shows graphs with forecasts of temperature, rainfall, etc. You can click on a graph and choose to have the underlying data shown in a table.
Which Python package can I use to go to the site and extract the table data for the different forecasts into a dataframe?
Thanks

You can simply use BeautifulSoup.

Do not listen to the others: the data on this particular page is loaded dynamically with JavaScript, so BeautifulSoup will not be able to scrape it.
Tip
Only scrape with BeautifulSoup if you have to; your first port of call should be to find the underlying API endpoint.
You can send a request to the API endpoint and get back JSON containing the page data.
If you are using a Chromium browser, press CTRL + SHIFT + I and select the Network tab. Click Clear, then click Record.
When you refresh the page, you will notice the table below fill up with requests.
Search the Name column for the JSON requests, then use the request URL and the code below to return the JSON:
import requests
import json

# Request URL copied from the browser's Network tab
target_url = "https://cdn.knmi.nl/knmi/json/page/weer/waarschuwingen_verwachtingen/ensemble/iPluim/260_99999.json"

r = requests.get(target_url)
weather_data = json.loads(r.text)  # parse the JSON response
print(weather_data['series'])
If you can work out the API parameters, you can construct your own requests.
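To get the data into a dataframe, you can flatten the JSON with pandas. This is a minimal sketch, assuming each entry in weather_data['series'] is a Highcharts-style dict with 'name' and 'data' keys - check the actual structure of the JSON in your browser first:
import pandas as pd
import requests

# Same endpoint as above, found via the Network tab
target_url = "https://cdn.knmi.nl/knmi/json/page/weer/waarschuwingen_verwachtingen/ensemble/iPluim/260_99999.json"
weather_data = requests.get(target_url).json()

# Assumption: each series entry has a 'name' and a 'data' list of rows
frames = []
for series in weather_data["series"]:
    df = pd.DataFrame(series.get("data", []))
    df["series_name"] = series.get("name", "unnamed")
    frames.append(df)

forecasts = pd.concat(frames, ignore_index=True)
print(forecasts.head())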

Selenium can be useful for dynamic sites https://www.selenium.dev/documentation/
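If you do go the Selenium route for a page like this, a rough sketch would be to let the browser render the JavaScript and then hand the resulting HTML to BeautifulSoup. The five-second sleep and the idea that the data ends up in table elements are assumptions here, not verified against the page:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get("https://www.knmi.nl/nederland-nu/weer/waarschuwingen-en-verwachtingen/weer-en-klimaatpluim")
time.sleep(5)  # crude wait for the JavaScript charts to render

# Hand the fully rendered HTML to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find_all("table"))  # inspect which tables are actually present

driver.quit()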

Related

scrape live scores from oddsportal live odds page using requests

I want to scrape in-play odds and scores.
I managed to get the live odds data using the code below, but I cannot find the live scores:
import requests, re, time
from bs4 import BeautifulSoup
url = f"https://fb.oddsportal.com/feed/livegames/live/1/0.dat?_{int(time.time() * 1000)}"
headers = {'User-Agent': 'curl/7.64.0','Referer': 'https://www.oddsportal.com/inplay-odds/live-now/soccer/'}
r = requests.get(url, headers=headers)
live_html = re.findall(r'<table class=.*table>', r.text)[0].replace("\\","")
soup = BeautifulSoup(live_html, 'html.parser')
I tried searching in Developer Tools > Sources > Page, but can't find any source that provides the live scores.
Live odds and scores on most websites come through WebSockets, so they cannot be scraped with the normal request/response approach, but there are some tricks to do it, depending on how strict the website's authentication protocol is. You can refer to the link below and try it for your use case.
https://towardsdatascience.com/websocket-retrieve-live-data-f539b1d9db1e
Where there is a will there is a way.
Have you tried using Selenium? (I've heard it might be slower, however.)
If not, try OCR: you can decipher the text on every frame change.
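A rough sketch of the OCR idea, assuming you have the Tesseract binary installed and are using pytesseract and Pillow (those library choices are my assumption, not from the question):
from selenium import webdriver
from PIL import Image
import pytesseract

driver = webdriver.Chrome()
driver.get("https://www.oddsportal.com/inplay-odds/live-now/soccer/")

# Capture the rendered frame and run OCR over it
driver.save_screenshot("frame.png")
text = pytesseract.image_to_string(Image.open("frame.png"))
print(text)

driver.quit()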
Use Python's websocket-client package to retrieve the live data.
First, you need to copy your web browser's headers and use json.dumps to convert the handshake payload into string format.
Additionally, you have to perform a handshake, i.e. send messages to the website and receive messages when connecting to the websocket.
After that, create the connection to the server using create_connection. Then perform the handshake by sending the message, and you will be able to see the data on your side.
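A minimal sketch with websocket-client, where the WebSocket URL, headers, and handshake message are all hypothetical - copy the real values from the Network > WS tab in your browser:
import json
from websocket import create_connection

# Hypothetical values - copy the real ones from your browser's Network > WS tab
ws_url = "wss://example.com/live-feed"
headers = {
    "Origin": "https://www.oddsportal.com",
    "User-Agent": "Mozilla/5.0",
}
handshake = json.dumps({"action": "subscribe", "channel": "live-scores"})  # hypothetical message

ws = create_connection(ws_url, header=headers)
ws.send(handshake)        # perform the handshake
for _ in range(10):       # read a few live messages
    print(ws.recv())
ws.close()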

How do I get this information out of this website?

I found this link:
https://search.roblox.com/catalog/json?Category=2&Subcategory=2&SortType=4&Direction=2
The original is:
https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4
I am trying to scrape the prices of all the items in the whole catalog with Python, but I can't seem to locate the prices of the items. The URL does not change whenever I go to the next page. I have tried inspecting the website itself but I can't manage to find anything.
The first URL is somehow scrapeable and I found it randomly on a forum. How did the user get all this text data in there?
Note: I know the website is for children, but I earn money by selling limiteds on there. No harsh judgement please. :)
You can scrape all the item information without BeautifulSoup or Selenium - you just need requests. That being said, it's not super straight-forward, so I'll try to break it down:
When you visit a URL, your browser makes many requests to external resources. These resources are hosted on a server (or, nowadays, on several different servers), and they make up all the files/data that your browser needs to properly render the webpage. Just to list a few, these resources can be images, icons, scripts, HTML files, CSS files, fonts, audio, etc. Just for reference, loading www.google.com in my browser makes 36 requests in total to various resources.
The very first resource you make a request to will always be the actual webpage itself, so an HTML-like file. The browser then figures out which other resources it needs to make requests to by looking at that HTML.
For example's sake, let's say the webpage contains a table containing data we want to scrape. The very first thing we should ask ourselves is "How did that table get on that page?". What I mean is, there are different ways in which a webpage is populated with elements/html tags. Here's one such way:
1. The server receives a request from our browser for the page.html resource.
2. That resource contains a table, and that table needs data, so the server communicates with a database to retrieve the data for the table.
3. The server takes that table-data and bakes it into the HTML.
4. Finally, the server serves that HTML file to you.
5. What you receive is HTML with baked-in table-data. There is no way that you can communicate with the previously mentioned database - this is fine and desirable.
6. Your browser renders the HTML.
When scraping a page like this, using BeautifulSoup is standard procedure. You know that the data you're looking for is baked into the HTML, so BeautifulSoup will be able to see it.
Here's another way in which webpages can be populated with elements:
1. The server receives a request from our browser for the page.html resource.
2. That resource requires another resource - a script, whose job it is to populate the table with data at a later point in time.
3. The server serves that HTML file to you (it contains no actual table-data).
When I say "a later point in time", that time interval is negligible and practically unnoticeable for actual human beings using actual browsers to view pages. However, the server only served us a "bare-bones" HTML. It's just an empty template, and it's relying on a script to populate its table. That script makes a request to a web API, and the web API replies with the actual table-data. All of this takes a finite amount of time, and it can only start once the script resource is loaded to begin with.
When scraping a page like this, you cannot use BeautifulSoup, because it will only see the "bare-bones" template HTML. This is typically where you would use Selenium to simulate a real browsing session.
To get back to your Roblox page: it is the second type.
The approach I'm suggesting (which is my favorite, and in my opinion, should be the approach you always try first), simply involves figuring out what web API potential scripts are making requests to, and then imitating a request to get the data you want. The reason this is my favorite approach is because these web APIs often serve JSON, which is trivial to parse. It's super clean because you only need one third-party module (requests).
The first step is to log all the traffic/requests to resources that your browser makes. I'll be using Google Chrome, but other modern browsers probably have similar features:
1. Open Google Chrome and navigate to the target page (https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4).
2. Hit F12 to open the Chrome Developer Tools menu.
3. Click on the "Network" tab.
4. Click on the "Filter" button (the icon is funnel-shaped), and then change the filter selection from "All" to "XHR" (XMLHttpRequest or XHR resources are objects which interact with servers; we only want to look at XHR resources because they potentially communicate with web APIs).
5. Click on the round "Record" button (or press CTRL + E) to enable logging - the icon should turn red once enabled.
6. Press CTRL + R to refresh the page and begin logging traffic.
7. After refreshing the page, you should see the resource log start to fill up. This is a list of all resources our browser requested - we'll only see XHR objects, though, since we set up our filter (if you're curious, you can switch the filter back to "All" to see a list of all requests made).
Click on one of the items in the list. A panel should open on the right with several tabs. Click on the "Headers" tab to view the request URL, the request and response headers, as well as any cookies (view the "Cookies" tab for a prettier view). If the request URL contains any query string parameters, you can also view them in a prettier format in this tab.
This tab tells us everything we want to know about imitating our request. It tells us where we should make the request, and how our request should be formulated in order to be accepted. An ill-formed request will be rejected by the web API - not all web APIs care about the same header fields. For example, some web APIs desperately care about the "User-Agent" header, but in our case, this field is not required. The only reason I know that is because I copy and pasted request headers until the web API wouldn't reject my request anymore - in my solution I'll use the bare minimum to make a valid request.
However, we need to actually figure out which of these XHR objects is responsible for talking to the correct web API - the one that returns the actual information we want to scrape. Select any XHR object from the list and then click on the "Preview" tab to view a parsed version of the data returned by the web API. The assumption is that the web API returned JSON to us - you may have to expand and collapse the tree structure for a bit before you find what you're looking for, but once you do, you know this XHR object is the one whose request we need to imitate. I happen to know that the data we're interested in is in the XHR object named "details".
As you can see, the response we got from this web API (https://catalog.roblox.com/v1/catalog/items/details) contains all the interesting data we want to scrape!
This is where things get sort of esoteric, and specific to this particular webpage (everything up until now you can use to scrape stuff from other pages via web APIs). Here's what happens when you visit https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4:
1. Your browser gets some cookies that persist, and a CSRF/XSRF token is generated and baked into the HTML of the page.
2. Eventually, one of the XHR objects (the one whose name starts with "items?") makes an HTTP GET request (cookies required!) to the web API https://catalog.roblox.com/v1/search/items?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles (notice the query string parameters). The response is JSON containing a list of item-descriptor objects.
3. Then, some time later, another XHR object ("details") makes an HTTP POST request to the web API https://catalog.roblox.com/v1/catalog/items/details. This request is only accepted by the web API if it contains the right cookies and the previously mentioned CSRF/XSRF token. In addition, this request also needs a payload containing the asset ids whose information we want to scrape - failure to provide this also results in rejection.
So, it's a bit tricky. The request of one XHR object depends on the response of another.
So, here's the script. It first creates a requests.Session to keep track of cookies. We define a dictionary params (which is really just our query string) - you can change these values to suit your needs. The way it's written now, it pulls the first 60 items from the "Collectibles" category. Then, we get the CSRF/XSRF token from the HTML body with a regular expression. We get the ids of the first 60 items according to our params, and generate a dictionary/payload that the final web API request will accept. We make the final request, create a list of items (dictionaries), and print the keys and values of the first item of our query.
def get_csrf_token(session):
    # The CSRF/XSRF token is baked into the catalog page's HTML via setToken('...')
    import re
    url = "https://www.roblox.com/catalog/"
    response = session.get(url)
    response.raise_for_status()
    token_pattern = "setToken\\('(?P<csrf_token>[^\\)]+)'\\)"
    match = re.search(token_pattern, response.text)
    assert match
    return match.group("csrf_token")

def get_assets(session, params):
    # GET the first batch of item descriptors (ids) matching the query string params
    url = "https://catalog.roblox.com/v1/search/items"
    response = session.get(url, params=params, headers={})
    response.raise_for_status()
    return {"items": [{**d, "key": f"{d['itemType']}_{d['id']}"} for d in response.json()["data"]]}

def get_items(session, csrf_token, assets):
    # POST the asset ids (with cookies and CSRF token) to get the detailed item data
    import json
    url = "https://catalog.roblox.com/v1/catalog/items/details"
    headers = {
        "Content-Type": "application/json;charset=UTF-8",
        "X-CSRF-TOKEN": csrf_token
    }
    response = session.post(url, data=json.dumps(assets), headers=headers)
    response.raise_for_status()
    items = response.json()["data"]
    return items

def main():
    import requests
    session = requests.Session()  # keeps track of cookies across requests
    params = {
        "category": "Collectibles",
        "limit": "60",
        "sortType": "4",
        "subcategory": "Collectibles"
    }
    csrf_token = get_csrf_token(session)
    assets = get_assets(session, params)
    items = get_items(session, csrf_token, assets)
    first_item = items[0]
    for key, value in first_item.items():
        print(f"{key}: {value}")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
id: 76692143
itemType: Asset
assetType: 8
name: Chaos Canyon Sugar Egg
description: This highly collectible commemorative egg recalls that *other* classic ROBLOX level, the one that was never quite as popular as Crossroads.
productId: 11837951
genres: ['All']
itemStatus: []
itemRestrictions: ['Limited']
creatorType: User
creatorTargetId: 1
creatorName: ROBLOX
lowestPrice: 400
purchaseCount: 7714
favoriteCount: 2781
>>>
You can use Selenium to control a browser: https://www.selenium.dev/. It can give you the content of an item and sub-item (and much more). On Firefox you can press Alt, I think, then Developer -> Inspector, and hover over the item on the webpage; it shows you the corresponding HTML.
The Python binding: selenium-python.readthedocs.io
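A quick sketch of that approach - note that the CSS selector for the catalog items is a placeholder/assumption, so inspect the rendered page to find the real one:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4")
time.sleep(5)  # crude wait for the JavaScript to render the catalog

# Placeholder selector - inspect the page to find the real item/price elements
for item in driver.find_elements(By.CSS_SELECTOR, ".item-card"):
    print(item.text)

driver.quit()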

How to download a video when I get the URL of the MP4 file in selenium python? (WITHOUT URLLIB)

I can get to the point where the video is right in front of me. I need to loop through the urls and download all of these videos. The issue is that the request is stateful and I cannot use urllib because then authorization issues occur. How do I just target the three dots in chrome video viewer and download the file?
All I need now is to be able to download by clicking on the download button. I do not know if it can be done without the specification of coordinates. Like I said, the urls follow a pattern and I can generate them. The only issue is the authorization. Please help me get the videos through selenium.
Note that the video player is rendered with JavaScript, so I cannot really target the three dots or the download button.
You can get the cookies from the Selenium driver and pass them to a requests Session; then you can download the file with the requests library:
import requests

# Copy the authenticated cookies from the Selenium driver into a requests session
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])

response = s.get(urlDownload, stream=True)
print(response.status_code)

# Write the video to disk in chunks so it is not held in memory all at once
with open(fileName, 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
You can use Selenium in Python. As you have only posted pictures, I cannot give you real code, but something like the snippet below will help. You can find the XPath by inspecting the HTML.
from selenium import webdriver

driver = webdriver.Chrome()
# Replace the placeholder with the real XPath of the three-dots/download button
driver.find_element_by_xpath('xpath of 3 dots').click()

Extracting a table from a website

I've tried many times to retrieve the table at this website:
http://www.whoscored.com/Players/845/History/Tomas-Rosicky
(the one under "Historical Participations")
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.whoscored.com/Players/845/').read())
This is the Python code I am using to retrieve the table html, but I am getting an empty string. Help me out!
The desired table is formed via an asynchronous API call to the http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics endpoint, which returns a JSON response. urllib2 only gives you the initial HTML content of the page, without the "dynamic" part; in other words, urllib2 is not a browser.
You can study the request using browser developer tools:
Now, you need to simulate this request in your code. requests package is something you should consider using.
Here is a similar question about whoscored.com I've answered before; it has sample working code you can use as a starting point:
XHR request URL says does not exist when attempting to parse it's content
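As a rough illustration of simulating this kind of XHR call with requests - the query parameters and headers below are placeholders/assumptions, so copy the real ones from your browser's Network tab:
import requests

url = "http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"

# Placeholder values - copy the real query string and headers from the Network tab
params = {"playerId": 845}  # hypothetical parameter
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # commonly required by XHR endpoints
    "Referer": "http://www.whoscored.com/Players/845/History/Tomas-Rosicky",
}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
print(response.json())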

How to submit a form to a server and get a csv file back from the server with Python?

I need to submit a form to the server and get a csv file back from the server with Python.
The server website is http://222.158.245.253/obweb/data/c1/c1_output6.aspx?LocationNo=012, which publishes observation data of the sea in Japan.
So far, I manually select the item and the date and click the button.
Then, when a file-save dialog box is displayed, I save the csv file from the server.
I would like to automate this manual work with Python.
I have studied Python and web scraping and have used Python modules (like BeautifulSoup).
However, this website is difficult to scrape because it is an ASPX page.
So, please help me.
You can avoid scraping if you can find out what URL the form is POSTing to. Inspect the source code of the page and see if the form tag has an action attribute. This is the URL that the form sends all of your fields to (including the item and date you specify).
You're going to want to use the requests library to make your POST request. It'll be something like this example from the requests quickstart:
import requests

payload = {'item': '<your item>', 'date': '<your date>'}
r = requests.post("<form post url>", data=payload)
You can then likely access the csv file that's returned with:
print(r.content)
Though you may have to process r.content for it to be meaningful.
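One extra wrinkle with ASPX/WebForms pages is that the POST usually also needs the hidden __VIEWSTATE and __EVENTVALIDATION fields from the form. A rough sketch of handling that - the field names for the item and date are placeholders you would read from the page's HTML:
import requests
from bs4 import BeautifulSoup

url = "http://222.158.245.253/obweb/data/c1/c1_output6.aspx?LocationNo=012"
session = requests.Session()

# Load the form page first and collect the ASP.NET hidden fields (__VIEWSTATE etc.)
soup = BeautifulSoup(session.get(url).text, "html.parser")
payload = {
    tag["name"]: tag.get("value", "")
    for tag in soup.find_all("input", {"type": "hidden"})
    if tag.get("name")
}

# Placeholder field names - inspect the form's <select>/<input> names for the real ones
payload["item_field_name"] = "<your item>"
payload["date_field_name"] = "<your date>"

response = session.post(url, data=payload)
print(response.content[:500])  # the csv (or the next page) comes back in the body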
