How do I get this information out of this website? - python

I found this link:
https://search.roblox.com/catalog/json?Category=2&Subcategory=2&SortType=4&Direction=2
The original is:
https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4
I am trying to scrape the prices of all the items in the whole catalog with Python, but I can't seem to locate the prices of the items. The URL does not change whenever I go to the next page. I have tried inspecting the website itself but I can't manage to find anything.
The first URL is somehow scrapeable and I found it randomly on a forum. How did the user get all this text data in there?
Note: I know the website is for children, but I earn money by selling limiteds on there. No harsh judgement please. :)

You can scrape all the item information without BeautifulSoup or Selenium - you just need requests. That being said, it's not super straight-forward, so I'll try to break it down:
When you visit a URL, your browser makes many requests to external resources. These resources are hosted on a server (or, nowadays, on several different servers), and they make up all the files/data that your browser needs to properly render the webpage. Just to list a few, these resources can be images, icons, scripts, HTML files, CSS files, fonts, audio, etc. Just for reference, loading www.google.com in my browser makes 36 requests in total to various resources.
The very first resource you make a request to will always be the actual webpage itself, so an HTML-like file. The browser then figures out which other resources it needs to make requests to by looking at that HTML.
For example's sake, let's say the webpage contains a table containing data we want to scrape. The very first thing we should ask ourselves is "How did that table get on that page?". What I mean is, there are different ways in which a webpage is populated with elements/html tags. Here's one such way:
The server receives a request from our browser for the page.html
resource
That resource contains a table, and that table needs data, so the
server communicates with a database to retrieve the data for the
table
The server takes that table-data and bakes it into the HTML
Finally, the server serves that HTML file to you
What you receive is an HTML with baked in table-data. There is no way
that you can communicate with the previously mentioned database -
this is fine and desirable
Your browser renders the HTML
When scraping a page like this, using BeautifulSoup is standard procedure. You know that the data you're looking for is baked into the HTML, so BeautifulSoup will be able to see it.
Here's another way in which webpages can be populated with elements:
The server receives a request from our browser for the page.html
resource
That resource requires another resource - a script, whose job it is
to populate the table with data at a later point in time
The server serves that HTML file to you (it contains no actual
table-data)
When I say "a later point in time", that time interval is negligible and practically unnoticeable for actual human beings using actual browsers to view pages. However, the server only served us a "bare-bones" HTML. It's just an empty template, and it's relying on a script to populate its table. That script makes a request to a web API, and the web API replies with the actual table-data. All of this takes a finite amount of time, and it can only start once the script resource is loaded to begin with.
When scraping a page like this, you cannot use BeautifulSoup, because it will only see the "bare-bones" template HTML. This is typically where you would use Selenium to simulate a real browsing session.
To get back to your roblox page, this page is the second type.
The approach I'm suggesting (which is my favorite, and in my opinion, should be the approach you always try first), simply involves figuring out what web API potential scripts are making requests to, and then imitating a request to get the data you want. The reason this is my favorite approach is because these web APIs often serve JSON, which is trivial to parse. It's super clean because you only need one third-party module (requests).
The first step is to log all the traffic/requests to resources that your browser makes. I'll be using Google Chrome, but other modern browsers probably have similar features:
Open Google Chrome and navigate to the target page
(https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4)
Hit F12 to open the Chrome Developer Tools menu
Click on the "Network" tab
Click on the "Filter" button (The icon is funnel-shaped), and then
change the filter selection from "All" to "XHR" (XMLHttpRequest or
XHR resources are objects which interact with servers. We want to
only look at XHR resources because they potentially communicate with
web APIs)
Click on the round "Record" button (or press CTRL + E) to enable
logging - the icon should turn red once enabled
Press CTRL + R to refresh the page and begin logging traffic
After refreshing the page, you should see the resource-log start to
fill up. This is a list of all resources our browser requested -
we'll only see XHR objects though since we set up our filter (if
you're curious, you can switch the filter back to "All" to see a
list of all requests to resources made)
Click on one of the items in the list. A panel should open on the right with several tabs. Click on the "Headers" tab to view the request URL, the request- and response headers as well as any cookies (view the "Cookies" tab for a prettier view). If the Request URL contains any query string parameters you can also view them in a prettier format in this tab. Here's what that looks like (sorry for the large image):
This tab tells us everything we want to know about imitating our request. It tells us where we should make the request, and how our request should be formulated in order to be accepted. An ill-formed request will be rejected by the web API - not all web APIs care about the same header fields. For example, some web APIs desperately care about the "User-Agent" header, but in our case, this field is not required. The only reason I know that is because I copy and pasted request headers until the web API wouldn't reject my request anymore - in my solution I'll use the bare minimum to make a valid request.
However, we need to actually figure out which of these XHR objects is responsible for talking to the correct web API - the one that returns the actual information we want to scrape. Select any XHR object from the list and then click on the "Preview" tab to view a parsed version of the data returned by the web API. The assumption is that the web API returned JSON to us - you may have to expand and collapse the tree-structure for a bit before you find what you're looking for, but once you do, you know this XHR object is the one whose request we need to imitate. I happen to know that the data we're interested in is in the XHR object named "details". Here's what part of the expanded JSON looks like in the "Preview" tab:
As you can see, the response we got from this web API (https://catalog.roblox.com/v1/catalog/items/details) contains all the interesting data we want to scrape!
This is where things get sort of esoteric, and specific to this particular webpage (everything up until now you can use to scrape stuff from other pages via web APIs). Here's what happens when you visit https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4:
Your browser gets some cookies that persist and a CSRF/XSRF Token is generated and baked into the HTML of the page
Eventually, one of the XHR objects (the one that starts with
"items?") makes an HTTP GET request (cookies required!) to the web API
https://catalog.roblox.com/v1/search/items?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles
(notice the query string parameters)
The response is JSON. It contains a list of item-descriptor things, it looks like this:
Then, some time later, another XHR object ("details") makes an HTTP POST request to the web API https://catalog.roblox.com/v1/catalog/items/details (refer to first and second screenshots). This request is only accepted by the web API if it contains the right cookies and the previously mentioned CSRF/XSRF token. In addition, this request also needs a payload containing the asset ids whose information we want to scrape - failure to provide this also results in a rejection.
So, it's a bit tricky. The request of one XHR object depends on the response of another.
So, here's the script. It first creates a requests.Session to keep track of cookies. We define a dictionary params (which is really just our query string) - you can change these values to suit your needs. The way it's written now, it pulls the first 60 items from the "Collectibles" category. Then, we get the CSRF/XSRF token from the HTML body with a regular expression. We get the ids of the first 60 items according to our params, and generate a dictionary/payload that the final web API request will accept. We make the final request, create a list of items (dictionaries), and print the keys and values of the first item of our query.
def get_csrf_token(session):
import re
url = "https://www.roblox.com/catalog/"
response = session.get(url)
response.raise_for_status()
token_pattern = "setToken\\('(?P<csrf_token>[^\\)]+)'\\)"
match = re.search(token_pattern, response.text)
assert match
return match.group("csrf_token")
def get_assets(session, params):
url = "https://catalog.roblox.com/v1/search/items"
response = session.get(url, params=params, headers={})
response.raise_for_status()
return {"items": [{**d, "key": f"{d['itemType']}_{d['id']}"} for d in response.json()["data"]]}
def get_items(session, csrf_token, assets):
import json
url = "https://catalog.roblox.com/v1/catalog/items/details"
headers = {
"Content-Type": "application/json;charset=UTF-8",
"X-CSRF-TOKEN": csrf_token
}
response = session.post(url, data=json.dumps(assets), headers=headers)
response.raise_for_status()
items = response.json()["data"]
return items
def main():
import requests
session = requests.Session()
params = {
"category": "Collectibles",
"limit": "60",
"sortType": "4",
"subcategory": "Collectibles"
}
csrf_token = get_csrf_token(session)
assets = get_assets(session, params)
items = get_items(session, csrf_token, assets)
first_item = items[0]
for key, value in first_item.items():
print(f"{key}: {value}")
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
id: 76692143
itemType: Asset
assetType: 8
name: Chaos Canyon Sugar Egg
description: This highly collectible commemorative egg recalls that *other* classic ROBLOX level, the one that was never quite as popular as Crossroads.
productId: 11837951
genres: ['All']
itemStatus: []
itemRestrictions: ['Limited']
creatorType: User
creatorTargetId: 1
creatorName: ROBLOX
lowestPrice: 400
purchaseCount: 7714
favoriteCount: 2781
>>>

You can use selenium to controll a browser. https://www.selenium.dev/ It can give you the contend of a item and subitem (and much more). You can press alt i think on firefox and then developer -> Inspector and hover over the Item of the webpage. It shows you the responding html text
the python binding: selenium-python.readthedocs.io

Related

Unable to get complete source code of web page using Python [duplicate]

I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Python - Capture auto-downloading file from aspx web page

I'm trying to export a CSV from this page via a python script. The complicated part is that the page opens after clicking the export button on this page, begins the download, and closes again, rather than just hosting the file somewhere static. I've tried using the Requests library, among other things, but the file it returns is empty.
Here's what I've done:
url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=True&amp%3bexportFormat=CSV&amp%3bisExport=True%22+id%3d%22M_C_sCDTransactions_csfFilter_ExportDialog_hlAllCSV?exportAll=True&exportFormat=CSV&isExport=True'
with open('CD_Transactions_02-27-2017.CSV', "wb") as file:
# get request
response = get(url)
# write to file
file.write(response.content)
I'm sure I'm missing something obvious, but I'm pulling my hair out.
It looks like the file is being generated on demand, and the url stays only valid as long as the session lasts.
There are multiple requests from the browser to the webserver (including POST requests).
So to get those files via code, you would have to simulate the browser, possibly including session state etc (and in this case also __VIEWSTATE ).
To see the whole communication, you can use developer tools in the browser (usually F12, then select NET to see the traffic), or use something like WireShark.
In other words, this won't be an easy task.
If this is open government data, it might be better to just ask that government for the data or ask for possible direct links to the (unfiltered) files (sometimes there is a public ftp server for example) - or sometimes there is an API available.
The file is created on demand but you can download it anyway. Essentially you have to:
Establish a session to save cookies and viewstate
Submit a form in order to click the export button
Grab the link which lies behind the popped-up csv-button
Follow that link and download the file
You can find working code here (if you don't mind that it's written in R): Save response from web-scraping as csv file

Filter out image/file links for Python mechanize web crawler

I'm writing a simple Python web crawler using the mechanize library.
Right now, I just want to do the following:
Accept a list of startURLs as input
For each URL in startURLs, grab all the links on the page
Then, do an HTTP request for each of those links, and grab all of the links from them ...
Repeat this to the specified depth from the startURL.
So my problem is that when it is in step 3, I want it to skip downloading any links that point to image files (so if there is a URL http://www.example.com/kittens.jpg) then I want it to not add that to the list of URLs to fetch.
Obviously I could do this by just using a regex to match various file extensions in the URL path, but I was wondering if there is a cleaner way to determine whether or not a URL points to an image file, rather than an HTML document. Is there some sort of library function (either in mechanize, or some other library) that will let me do this?
Your suggested approach of using a regex on the url is probably the best way to do this, the only way to see for sure what the url points to would be to make a request to the server and examine the Content-Type header of the response to see if it starts with 'image/'.
If you don't mind the overhead of making additional server requests then you should send a HEAD request for the resource rather than the usual GET request - this will cause the server to return information about the resource (including its content type) without actually returning the file itself, saving you some bandwidth.

How to submit a form to server and get csv file from server via the internet wtih python?

I need to submit a form to the server and get csv file from the server via the internet with python.
The server website is (http://
222.158.245.253/obweb/data/c1/c1_output6.aspx?LocationNo=012), which publishes the observation data of sea in Japan.
So far, I always select the item and the date and click the button.
Then, When a file save dialog box is displayed, I preserve the csv file from the server.
I would like to automate these manual labors with python.
I have studied about python and web scraping and have used python modules(like BeautifulSoup).
However, This website is difficult to do web scraping due to aspx.
So, please help me.
You can avoid scraping if you can find out what URL the form is POSTing to. Inspect the source code of the page and see if the form tag has an action attribute. This is the URL that the form sends all of your fields to (including the item and date you specify).
You're going to want to use the requests library to make your POST request. It'll be something like this example from the requests quickstart:
payload = {'item': '<your item>', 'date': '<your date>'}
r = requests.post("<form post url>", data=payload)
You can then likely access the csv file that's returned with
print r.content
Though you may have to process r.content for it to be meaningful.

Python - Direct linking blocking via iFrames, can I still get the binaries?

I have a scraper script that pulls binary content off publishers websites. Its built to replace the manual action of saving hundreds of individual pdf files that colleagues would other wise have to undertake.
The websites are credential based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that has the pdf file inside an iFrame.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something, I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the refering URL, and saw these different URLs that point to the same article that I collected over a day of testing (I have scrubbed identifers from the URL):-
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something in the URL that is unique, and needs associating to something else to circumvent the direct link detector.
Any suggestions on how to get round this problem?
OK. The answer was Cookies and headers. I collected the get header info via httpfox and made a identical header object in my script, and i grabbed the session ID from request.cookie and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.

Categories

Resources