I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the JavaScript that the page sends to your browser makes a separate request to an API to get the JSON info about the movies.
You could either send the request directly to their API (see edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with scrapy. Because it issues requests asynchronously, it is usually much faster than requests when you crawl many pages.
Edit:
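If the page uses dynamically loaded content, which I think is the case, you'd have to drive a real browser with selenium instead of using requests. PhantomJS used to be the usual headless choice, but Selenium 4 dropped support for it, so headless Chrome is used here. Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
browser = webdriver.Chrome(options=options)
browser.get(url)
html = browser.page_source
browser.quit()
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here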
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their API you can just reproduce the request you see. In Google Chrome, you can see the request details if you click on it in the Network tab and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# response.content is a byte string
# (it looks like b'data' instead of 'data' when you print it),
# so decode it to a regular string before parsing
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit. For example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, and so on.
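Since the request in the question is a POST, a variant of the snippet above might look like the following. This is only a sketch: the URL and payload are placeholders you would copy from the 'Headers' tab, the helper names `decode_json_body` and `fetch_json` are made up for illustration, and sending the payload with `json=` assumes the endpoint expects a JSON body. It also guards against the server sending back the HTML page instead of JSON, which is the situation that produces the JSONDecodeError.

```python
import json


def decode_json_body(body: bytes, content_type: str = ""):
    """Parse a response body as JSON, or return None if it is clearly not JSON."""
    text = body.decode("utf-8", errors="replace").lstrip()
    looks_like_json = "json" in content_type.lower() or text.startswith(("{", "["))
    if not looks_like_json:
        # Probably the HTML page rather than the API response.
        return None
    return json.loads(text)


def fetch_json(url: str, payload: dict):
    """POST the payload and return the parsed JSON (hypothetical helper)."""
    import requests  # third-party; imported here so decode_json_body works without it

    headers = {"Accept": "application/json, text/plain, */*"}
    # json= assumes a JSON request body; use data= for a form-encoded payload.
    response = requests.post(url, json=payload, headers=headers, timeout=10)
    response.raise_for_status()
    return decode_json_body(response.content, response.headers.get("Content-Type", ""))


print(decode_json_body(b"\r\n\r\n<!DOCTYPE html><html></html>", "text/html"))  # None
print(decode_json_body(b'{"movies": []}', "application/json"))  # {'movies': []}
```

If `fetch_json` returns None, compare the request headers and body in the developer tools with what you are sending, because the server is not treating your request like the one the page makes.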
Related
So I am trying to create a small code that gets the views from a youtube video and prints them. However using this code when printing the text var I just get the response "None". Is there a way to get a response of the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
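A quick way to confirm this without reading the whole page is to list the class names that actually occur in the HTML you received. The sketch below uses only the standard library; `ClassCollector` and `classes_in` are illustrative names, and the sample string stands in for `url.text` from the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html: str) -> set:
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Stand-in for a page whose content is filled in later by JavaScript.
sample = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
print("view-count" in classes_in(sample))  # False
```

If the class you are searching for is missing from that set, no BeautifulSoup selector will find it in the same HTML.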
I'm learning BeautifulSoup and I want to make a list of all image urls from a webpage (https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection).
import requests
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
The code above doesn't yield any image urls.
And when I print(soup), I can't see any image urls either.
But when I right click on one of the images and manually copy the link, I find out the url starts with https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/.
So I try setting url = 'https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/' for the above code, but that doesn't yield any image urls either.
Thanks for any help!
Since it is not in the page source, it is likely loaded by JavaScript. You could look at the JavaScript to see where it is generated or, if you are just using this to learn BeautifulSoup, you can get the page source with Selenium. Selenium drives Chrome through ChromeDriver, so you will need a chromedriver executable in your repo or on your PATH (for example https://chromedriver.storage.googleapis.com/92.0.4515.107/chromedriver_win32.zip, where 92.0.4515.107 is the version you want, or the latest version, which you can see here).
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# parse the soup as you normally would from here
If you open Chrome DevTools you can see the requests for these images.
If you click on one of these requests you can see that it has query string parameters such as X-Goog-Signature and X-Goog-Algorithm.
To get this full URL you would need to replicate the POST requests to https://www.kaggle.com/requests/GetDataViewExternalRequest which returns a JSON object like so:
{"result":{"dataView":{"type":"url","dataTable":null,"dataUrl":{"url":"https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/no/1%20no.jpeg?X-Goog-Algorithm=GOOG4-RSA-SHA256\u0026X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210801%2Fauto%2Fstorage%2Fgoog4_request\u0026X-Goog-Date=20210801T194627Z\u0026X-Goog-Expires=345599\u0026X-Goog-SignedHeaders=host\u0026X-Goog-Signature=2f3672e41a5821b19eb88a8452237a36943ca0cb54874ec47e47c832480870f1ae29ba4cab3e3717ab1decdb74012135bdb1324b85fd8159084dd9587f5504dbf60f6890f12277e418ddbbf61c720083ce7cca6b8936fa45cb9132a396c12136106c6dcfca8574475156f199169b2eecee7fd51fd784d7ddec3f8e3b80b75a17216893ffa22248e98e9bb5cae7cd5b3598e7f3fbbc6e51c24c864c8746c9fe202d1f6a221baea2f300dedf4ba62eb510d9369607ab2f6e659e3b4e4a18e763943632b110c57e223ffb9f1c09db8dac32da6e273f6248c5146dce8d5633ba38787394852b4bcc240dfa62210f042902e84833cf8817a050fc64655b0ed5f43ac9","type":"image"},"dataRaw":null,"dataCase":"dataUrl"}},"wasSuccessful":true}
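This is easily parsed with a few lines of code:
import requests
# json=data assumes the endpoint expects a JSON body, as the nested payload suggests
r = requests.post("https://www.kaggle.com/requests/GetDataViewExternalRequest", json=data)
url = r.json()["result"]["dataView"]["dataUrl"]["url"]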
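To make the parsing step concrete without touching the network, here is a sketch that works on a shortened copy of the response shown above. The signature value is a placeholder, and `extract_signed_url` is an illustrative name rather than anything Kaggle provides.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Shortened version of the response above; the signature is a placeholder.
sample_body = (
    r'{"result":{"dataView":{"type":"url","dataUrl":{"url":'
    r'"https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/no/1%20no.jpeg'
    r'?X-Goog-Algorithm=GOOG4-RSA-SHA256\u0026X-Goog-Expires=345599'
    r'\u0026X-Goog-Signature=PLACEHOLDER","type":"image"}}},"wasSuccessful":true}'
)


def extract_signed_url(body_text: str) -> str:
    """Return the signed image URL from a GetDataViewExternalRequest-style body."""
    payload = json.loads(body_text)
    if not payload.get("wasSuccessful"):
        raise ValueError("request was not successful")
    return payload["result"]["dataView"]["dataUrl"]["url"]


signed_url = extract_signed_url(sample_body)
query = parse_qs(urlsplit(signed_url).query)
print(sorted(query))  # ['X-Goog-Algorithm', 'X-Goog-Expires', 'X-Goog-Signature']
```

The JSON parser turns each `\u0026` into `&`, so the extracted URL can be passed directly to `requests.get` while the signature is still valid.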
The hard part is generating the data payload for that request, which looks something like this:
data = {
    "verificationInfo": {
        "competitionId": None,
        "datasetId": 165566,
        "databundleVersionId": 391742,
        "datasetHashLink": None
    },
    "firestorePath": "hIPSqqCWJs6oriNI20r6/versions/kKBcaXwa0lr8cvBuOMna/directories/no/files/1 no.jpeg",
    "tableQuery": None
}
I would expect most of those values to be static for this page; the firestorePath is the one likely to change. From a very quick search, it looks like all those values can be scraped from the page using either regex or BeautifulSoup.
The request also has some headers, including __requestverificationtoken and x-xsrf-token, which look like they are there to validate your request. They may or may not be scrapeable, but they are equal in value, and you would need to add them to your request as well. I recommend this site to help with creating requests easily. You just need to check the requests and delete any values which are not constants.
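As a rough sketch of how those two headers could be attached, assuming you have already obtained the token somehow (for instance by copying it from the developer tools): `build_headers` and `post_data_view` are hypothetical helpers, and the `content-type` value and the use of `json=` are assumptions about what the endpoint expects.

```python
def build_headers(token: str) -> dict:
    """Return request headers carrying the same token under both names."""
    return {
        "content-type": "application/json",  # assumption, not confirmed above
        "__requestverificationtoken": token,
        "x-xsrf-token": token,
    }


def post_data_view(session, data: dict, token: str):
    """POST the payload with the token headers; `session` is a requests.Session."""
    url = "https://www.kaggle.com/requests/GetDataViewExternalRequest"
    response = session.post(url, json=data, headers=build_headers(token), timeout=10)
    response.raise_for_status()
    return response.json()


headers = build_headers("token-copied-from-devtools")
print(headers["x-xsrf-token"] == headers["__requestverificationtoken"])  # True
```

Using a `requests.Session` keeps any cookies the site sets between the initial page load and the POST, which such tokens often depend on.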
In summary, it's not easy! Use selenium if speed is not an issue, use requests and work all this out if it is.
After all that, the best option is using their API as Phorys said. It will be very easy to use.
The scraper looks good, so the problem is usually the server trying to protect itself from... scrapers! BeautifulSoup is a great tool for parsing static pages, so the trouble tends to come at the point where you request the page:
Pass a user-agent to your request
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/88.0'  # for example
response = requests.get(url, headers={"user-agent": user_agent})
If this doesn't work, investigate the response (cookies, encoding, ...)
response = requests.get(url)
for key, value in response.headers.items():
    print(key, ":", value)
If you only want links, this may be better:
for link in soup.find_all('a', href=True):  # href=True skips <a> tags without an href, so link['href'] cannot raise KeyError
    print(link['href'])
Personally, I would look into using Kaggle's official API to access the data. They don't seem particularly restrictive and once you have it figured out it should also be possible to upload solutions as well: https://www.kaggle.com/docs/api
Just to push you in the right direction:
"The Kaggle API and CLI tool provide easy ways to interact with Datasets on Kaggle. The commands available can make searching for and downloading Kaggle Datasets a seamless part of your data science workflow.
If you haven’t installed the Kaggle Python package needed to use the command line tool or generated an API token, check out the getting started steps first.
Some of the commands for interacting with Datasets via CLI include:"
kaggle datasets list -s [KEYWORD]: list datasets matching a search term
kaggle datasets download -d [DATASET]: download files associated with a dataset
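If you would rather trigger the download from a Python script, one option is to call the CLI through `subprocess`. This is a minimal sketch that assumes the kaggle package is installed and an API token is configured; the helper names are made up, and the dataset slug is inferred from the URL in the question.

```python
import subprocess


def download_command(dataset: str) -> list:
    """Build the argument list for `kaggle datasets download -d [DATASET]`."""
    return ["kaggle", "datasets", "download", "-d", dataset]


def download_dataset(dataset: str) -> None:
    """Run the CLI; requires the kaggle package and an API token to be set up."""
    subprocess.run(download_command(dataset), check=True)


# Dataset slug assumed from the URL in the question (owner/dataset-name).
print(download_command("navoneel/brain-mri-images-for-brain-tumor-detection"))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` raises an exception if the CLI exits with an error.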
I'm trying to scrape a website that has a table in it using bs4, but the element of the content I'm getting is not as complete compared to the one I get from inspect. I cannot find the tag <tr> and <td> in it. How can I get the full content of that site especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the tags <tr> and <td> in it because they do exist when I inspect the page, but I found none in the output.
Here's the image of the page where there is the tag <tr> and <td>
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
with open("/tmp/content.html", "wb") as f:  # link.content is bytes, so open in binary mode
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, and then look at the file "/tmp/content.html" (use a different path, obviously, if you're on Windows) to see what is actually in it. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is a https endpoint. Debugging proxies usually make that pretty easy. We use Charles. The standard browser toolboxes might do this too...allow you to see each request and response that is generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
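One detail about the URL in the question supports this: everything after the # is a fragment, which the browser keeps for the page's own JavaScript and never sends to the server. A small standard-library sketch shows what actually gets requested.

```python
from urllib.parse import urldefrag

page = "https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/"
requested, fragment = urldefrag(page)

print(requested)  # https://pemilu2019.kpu.go.id/
print(fragment)   # /ppwp/hitung-suara/
```

So requests only ever fetches the site's root page, and the route in the fragment is resolved later by JavaScript in the browser, which is why the table rows are missing from the response.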
I am new to webscraping. So I have been given a task to extract data from : Here
I am choosing dataset of "comments". Below is my code for scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/hacker-news/hacker-news'
headers = {'User-Agent' : 'Mozilla/5.0'}
response = requests.get(url, headers = headers)
response.status_code
response.content
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('tbody', class_ = 'TableBody-kSbjpE jGqIxa')
When I try to execute the last command it returns : [].
So, I am stuck here. I know we can get the data from kernel, but just for practice purpose where am I going wrong? Am I choosing wrong class? I want to scrape the data and probably save it to a CSV file or to a No-SQL Database, preferred Cassandra.
You are getting [] because the data you want to scrape comes from an API that is called after the web page loads, so the page you are requesting does not contain that class.
You can open your browser's developer tools and check the Network tab, as shown in the screenshot. There you will find the request that carries the data you want to scrape, so you have to make a request to that URL to get the data.
You can retrieve the data from this URL; in the Preview tab you can see all of it.
Also, if you have good knowledge of Python, you can use this to scrape the data:
https://doc.scrapy.org/en/latest/intro/overview.html
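Once you have the JSON from that URL, saving it as CSV takes only the standard library. The sketch below assumes the response can be reduced to a list of flat dictionaries; the field names and rows are invented for illustration, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io


def rows_to_csv(rows, out) -> int:
    """Write a list of dicts to `out` as CSV, using the first row's keys as the header."""
    if not rows:
        return 0
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Hypothetical rows, standing in for whatever the API response contains.
rows = [{"id": 1, "text": "first comment"}, {"id": 2, "text": "second, with a comma"}]
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("comments.csv", "w", newline="", encoding="utf-8")` in place of the `StringIO` buffer.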
Even though you can see the tbody with class 'TableBody-kSbjpE jGqIxa' in the element inspector, the response to the request you make does not contain this class. See for yourself with print(soup.prettify()). This is most likely because you're not requesting the correct URL.
This may be not something you're aware of, but as a fyi:
You don't actually need to scrape using BeautifulSoup, you can get a list of all the available datasets from the API. Once you have it installed and configured, you can get the dataset: kaggle datasets download -d . Here's more info if you wish to proceed with the API instead: https://github.com/Kaggle/kaggle-api
I'm just starting out in Python and I'm trying to request the html source code of a site using urllib2. However when I try and get the html content from a site I'm not getting the full html content - there are tags missing. I know they're missing as when I view the site in firebug the code shows up. Is this due to the way I'm requesting the data - or due to the site? If so is there a way in which I can get the full source code of the site in python, and then parse it?
Currently the code I'm using to request the content and the site I'm trying is:
import urllib2
url = 'http://marinetraffic.com/ais/'
response = urllib2.urlopen(url)
html = response.read()
print(html)
Specifically the content between the - div id="map_area" - is missing. Any help/pointers greatly appreciated!
You are getting incomplete data because most of the content on this page is dynamically generated via Javascript...
read on a descriptor returned by urlopen will only return what has already been downloaded. So you're liable to get a short read. You're better off using urllib.urlretrieve(), which tries to fetch the entire file, checks the Content-Length header, and raises an error if it fails.
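For anyone on Python 3, where urllib2 became urllib.request, the length check described here can be sketched as follows. `is_short_read` and `fetch` are illustrative names, and the standard library may already raise an error on a truncated body, so treat this as a demonstration of the comparison rather than a required step.

```python
from urllib.request import urlopen


def is_short_read(body: bytes, content_length) -> bool:
    """True if the server announced more bytes than were actually received."""
    if content_length is None:
        return False  # nothing to compare against
    return len(body) < int(content_length)


def fetch(url: str) -> bytes:
    """Download a page and fail loudly on a truncated body."""
    with urlopen(url, timeout=10) as response:
        body = response.read()
        if is_short_read(body, response.headers.get("Content-Length")):
            raise IOError("incomplete download")
        return body


print(is_short_read(b"abc", "10"))  # True
print(is_short_read(b"abc", None))  # False
```

Note that a complete download still will not contain content that the page generates with JavaScript after loading.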