I would like to collect data from this page
xxx
I am a beginner with Python and BeautifulSoup. However, I don't think this task requires anything very advanced, except for the issue I describe below.
The page that I need to collect data from lists the active properties for sale listed on MLS for the Greater Toronto Area.
At the right side of the map there are some checkboxes that you must select in order to get your data, and this is where my problem is. If I use a browser, a local cookie remembers the previous selections and the data is displayed.
I would like to know either of these:
1) how I can pass all the params (selections) in my initial request from Python
2) how to use the Chrome cookie with Python so I can get a page return that actually contains data
A code example would be great but sending me to links that I should read would also work.
Thanks a lot
PF
If you insist on using urllib2 over Requests, I suggest looking into cookielib.
Here is an example:
import urllib2
import cookielib
from BeautifulSoup import BeautifulSoup
cookiejar = cookielib.CookieJar()
opener = urllib2.build_opener(
urllib2.HTTPRedirectHandler(),
urllib2.HTTPHandler(debuglevel=0),
urllib2.HTTPSHandler(debuglevel=0),
urllib2.HTTPCookieProcessor(cookiejar),
)
This creates a cookiejar to hold cookies, builds an opener, and registers a cookie processor that uses the jar. That should take care of your cookie issue. At this point, instead of calling urllib2.urlopen(url), just use your custom opener: opener.open(url)
url = 'http://www.somesite.com/'
fp = opener.open(url)
html_object = BeautifulSoup(fp)
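The snippet above is Python 2. On Python 3, where urllib2 and cookielib were merged into urllib.request and http.cookiejar, a minimal equivalent sketch might look like this:

```python
import urllib.request
import http.cookiejar

# Jar to hold any cookies the server sets across requests
cookiejar = http.cookiejar.CookieJar()

# Opener that stores and resends cookies automatically
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookiejar)
)

# opener.open(url) now behaves like urlopen(url), but with cookie support
```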
I'm learning BeautifulSoup and I want to make a list of all image urls from a webpage (https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection).
import requests
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('a'):
print(link.get('href'))
The code above doesn't yield any image urls.
And when I print(soup), I can't see any image urls either.
But when I right click on one of the images and manually copy the link, I find out the url starts with https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/.
So I try setting url = 'https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/' for the above code, but that doesn't yield any image urls either.
Thanks for any help!
Since it is not in the page source, it is likely loaded in by JavaScript. You could look at the JavaScript to see where it is generated, or alternatively, if you are just using this to learn BeautifulSoup, you can get the page source with Selenium. Selenium drives Chrome through ChromeDriver, so you will need a chromedriver executable available (extract it from https://chromedriver.storage.googleapis.com/92.0.4515.107/chromedriver_win32.zip, where 92.0.4515.107 is the version you want; the latest version is listed on the ChromeDriver site).
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# parse the soup as you normally would from here
If you go to Chrome Dev Tools you can see the requests to these images:
If you click on one of these requests, you can see they have some query-string parameters with values like X-Goog-Signature and X-Goog-Algorithm.
To get this full URL you would need to replicate the POST request to https://www.kaggle.com/requests/GetDataViewExternalRequest, which returns a JSON object like so:
{"result":{"dataView":{"type":"url","dataTable":null,"dataUrl":{"url":"https://storage.googleapis.com/kagglesdsdata/datasets/165566/377107/no/1%20no.jpeg?X-Goog-Algorithm=GOOG4-RSA-SHA256\u0026X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210801%2Fauto%2Fstorage%2Fgoog4_request\u0026X-Goog-Date=20210801T194627Z\u0026X-Goog-Expires=345599\u0026X-Goog-SignedHeaders=host\u0026X-Goog-Signature=2f3672e41a5821b19eb88a8452237a36943ca0cb54874ec47e47c832480870f1ae29ba4cab3e3717ab1decdb74012135bdb1324b85fd8159084dd9587f5504dbf60f6890f12277e418ddbbf61c720083ce7cca6b8936fa45cb9132a396c12136106c6dcfca8574475156f199169b2eecee7fd51fd784d7ddec3f8e3b80b75a17216893ffa22248e98e9bb5cae7cd5b3598e7f3fbbc6e51c24c864c8746c9fe202d1f6a221baea2f300dedf4ba62eb510d9369607ab2f6e659e3b4e4a18e763943632b110c57e223ffb9f1c09db8dac32da6e273f6248c5146dce8d5633ba38787394852b4bcc240dfa62210f042902e84833cf8817a050fc64655b0ed5f43ac9","type":"image"},"dataRaw":null,"dataCase":"dataUrl"}},"wasSuccessful":true}
Which is easily parsed with code like:
r = requests.post("https://www.kaggle.com/requests/GetDataViewExternalRequest", data=data)
url = r.json()["result"]["dataView"]["dataUrl"]["url"]
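Assuming the response has the shape shown above, pulling out the signed image URL is just a matter of walking the nested keys. A self-contained sketch with a stubbed response body (the URL here is a made-up placeholder):

```python
import json

# Stubbed response body mimicking the structure returned by the endpoint
raw = '''{"result": {"dataView": {"type": "url",
          "dataUrl": {"url": "https://storage.googleapis.com/kagglesdsdata/example.jpeg",
                      "type": "image"},
          "dataCase": "dataUrl"}},
          "wasSuccessful": true}'''

payload = json.loads(raw)
image_url = payload["result"]["dataView"]["dataUrl"]["url"]
print(image_url)  # the signed storage URL
```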
The hard bit is generating this data payload, something like this:
data = {
    "verificationInfo": {
        "competitionId": None,
        "datasetId": 165566,
        "databundleVersionId": 391742,
        "datasetHashLink": None
    },
    "firestorePath": "hIPSqqCWJs6oriNI20r6/versions/kKBcaXwa0lr8cvBuOMna/directories/no/files/1 no.jpeg",
    "tableQuery": None
}
I would expect most of those values to be static for this page; it's likely only the firestorePath changes. From a very quick search, it looks like all those values can be scraped from the page using either regex or BeautifulSoup.
The request also has some headers, including __requestverificationtoken and x-xsrf-token, which look like they are there to validate your request. They may be scrapeable, they may not, but they are equal in value, and you would need to add these headers to your request as well. I recommend this site to help with creating requests easily; you just need to check the request and delete any values which are not constant.
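As a sketch of what attaching those headers might look like, using only the standard library and stopping short of actually sending the request. The token value and the trimmed-down payload here are made-up placeholders; in practice you would scrape the real token from the page first:

```python
import json
import urllib.request

token = "PLACEHOLDER_TOKEN"  # hypothetical; scrape the real value from the page
body = json.dumps({"tableQuery": None}).encode("utf-8")  # trimmed-down payload

req = urllib.request.Request(
    "https://www.kaggle.com/requests/GetDataViewExternalRequest",
    data=body,
    headers={
        "Content-Type": "application/json",
        "__requestverificationtoken": token,
        "x-xsrf-token": token,  # the two tokens carry the same value
    },
)
# urllib.request.urlopen(req) would send it; we stop before the network call
```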
In summary: it's not easy! Use Selenium if speed is not an issue; use requests and work all this out if it is.
After all that, the best option is using their API, as Phorys said. It will be very easy to use.
The scraper looks good, so the problem is usually the server, which tries to protect itself from... the scrapers! BeautifulSoup is a great tool for scraping static pages, so the problem usually lies in how you request the page:
Pass a user-agent to your request
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/88.0'  # for example
response = requests.get(url, headers={"user-agent": user_agent})
If this doesn't work, investigate the response (cookies, encoding, ...):
response = requests.get(url)
for key, value in response.headers.items():
    print(key, ":", value)
If you only want the links, this may be better:
for link in soup.find_all('a', href=True):  # so you are sure you don't get an error when retrieving the value
    print(link['href'])
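To see what the href=True filter does, here is a self-contained snippet on a small inline document (note the anchor without an href):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/page1">one</a>
  <a>no destination</a>
  <a href="/page2">two</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Without href=True, the bare <a> tag would also match and link['href']
# would raise a KeyError for it
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['/page1', '/page2']
```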
Personally, I would look into using Kaggle's official API to access the data. They don't seem particularly restrictive, and once you have it figured out it should also be possible to upload solutions: https://www.kaggle.com/docs/api
Just to push you in the right direction:
"The Kaggle API and CLI tool provide easy ways to interact with Datasets on Kaggle. The commands available can make searching for and downloading Kaggle Datasets a seamless part of your data science workflow.
If you haven’t installed the Kaggle Python package needed to use the command line tool or generated an API token, check out the getting started steps first.
Some of the commands for interacting with Datasets via CLI include:"
kaggle datasets list -s [KEYWORD]: list datasets matching a search term
kaggle datasets download -d [DATASET]: download files associated with a dataset
I have been trying to use web scraping on a website using the requests and Beautifulsoup python libraries.
The problem is that I'm getting the HTML data of the web page, but the body tag content is empty, while in the inspect panel on the website it isn't.
Can anyone explain why this is happening, and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Here is the inspect panel of the website:
And here is the output of my code:
Thank you :)
There are two reasons your code might not work. The first one is that the website requires additional header or cookie information, which you can try to find using the browser's Inspect tool and add via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
Another reason, which I believe is the case here, is that the content is dynamically loaded via JavaScript after the site is built, so what you get is the initially loaded website.
To also provide you with a solution, I attach an example using Selenium, which simulates a whole browser and therefore serves the full website. However, Selenium has a bit of setup overhead, which you can easily google.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
If you want the browser simulation to be invisible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
Alternatively to Firefox, you can use pretty much any browser using the appropriate driver.
A Linux-based setup example can be found here.
Even though I find the use of Selenium easier for beginners, that site bothered me, so I figured out a pure-requests way that I also want to share.
Process:
When you look at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products that are loaded, I found a call, right above the product images being loaded from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
the important part being
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Upon clicking the URL, I found it to be indeed a JSON of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Now, investigating the initially loaded URL and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly as such:
FINAL CODE:
import requests
import re
import json
s = requests.Session()
initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(url=initial_url).content.decode('utf-8')
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)
product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe="+ jwe[0]
products_site = s.get(url=product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
This of course is the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.
I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a POST method) to include "accept: application/json, text/plain, */*", but I didn't see a difference in the response. As it stands, I can't parse any JSON from the response I get with requests.post, and I get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the JavaScript the page sends to your browser is making a request to an API to get the JSON info about the movies.
You could either try sending the request directly to their API (see Edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with Scrapy; it is much faster than requests.
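The JSONDecodeError from the question is what json.loads raises when it is handed HTML instead of JSON. One way to make that failure mode explicit is to check what you actually received before trusting it; a minimal sketch with stubbed response bodies (the movie data here is made up for illustration):

```python
import json

def try_parse_json(body: str):
    """Return parsed JSON, or None if the body is not valid JSON (e.g. an HTML page)."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None

html_body = "<!DOCTYPE html>\n<html><body>pages of markup</body></html>"
json_body = '{"movies": [{"title": "Example", "showtimes": ["19:00"]}]}'

assert try_parse_json(html_body) is None  # the HTML response fails to parse
movies = try_parse_json(json_body)["movies"]
print(movies[0]["title"])  # Example
```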
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use Selenium with the PhantomJS browser instead of requests. Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their api you can just reproduce the request you see. Using google chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is you case, convert it to string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, etc.
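Rewriting a query-string parameter like that can be done without string surgery using the standard library (the URL here is the hypothetical one from above):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

url = "http://api.movies.com/?page=1&movietype=3"

parts = urlparse(url)
params = parse_qs(parts.query)   # {'page': ['1'], 'movietype': ['3']}
params["movietype"] = ["2"]      # swap the movie type

new_url = urlunparse(parts._replace(query=urlencode(params, doseq=True)))
print(new_url)  # http://api.movies.com/?page=1&movietype=2
```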
My question might be a bit weird.
I have some pages with different URLs that all end up on the same page. Can I get that main URL from the old URL in Python? For example:
1) https://www.verisk.com/insurance/products/iso-forms/
2) https://www.verisk.com/insurance/products/forms-library-on-isonet/
Both will end up on same page that is:
https://www.verisk.com/insurance/products/iso-forms/
So for each URL, can I find out, using Python, the final URL where it lands? (I have a list of 1k URLs.) I want another list of the corresponding landing URLs.
Here's one way of doing it, using requests library.
import requests

def get_redirected_url(url):
    response = requests.get(url, stream=True)  # stream=True prevents fetching the actual content
    return response.url
This is a very simplified example; in a real implementation you'd want to handle errors, probably retry with a delay, and possibly check what kind of redirect you're getting (permanent redirects only?).
Simple approach with urllib.request:
from urllib.request import urlopen
resp = urlopen("http://sitey.com/redirect")
print(resp.url)
Might want to use threads if you're doing 1,000 URLs...
I am relatively new (as in a few days) to Python - I am looking for an example that would show me how to post a form to a website (say www.example.com).
I already know how to use curl. In fact, I have written C++ code that does exactly the same thing (i.e. POSTs a form using curl), but I would like a starting point (a few lines I can build on) showing how to do this in Python.
Here is an example using urllib and urllib2 for both POST and GET:
POST - If urlopen() is given a second parameter, it makes a POST request.
import urllib
import urllib2
url = 'http://www.example.com'
values = {'var' : 500}
data = urllib.urlencode(values)
response = urllib2.urlopen(url, data)
page = response.read()
GET - If urlopen() is given a single parameter, it makes a GET request.
import urllib
import urllib2
url = 'http://www.example.com'
values = {'var' : 500}
data = urllib.urlencode(values)
fullurl = url + '?' + data
response = urllib2.urlopen(fullurl)
page = response.read()
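The GET variant above hinges on urlencode. In Python 3 it lives in urllib.parse, and the same URL can be built like this:

```python
from urllib.parse import urlencode

url = 'http://www.example.com'
values = {'var': 500}

# Encode the form values and append them as a query string
fullurl = url + '?' + urlencode(values)
print(fullurl)  # http://www.example.com?var=500
```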
You could also use curl if you call it using os.system().
Here are some helpful links:
http://docs.python.org/library/urllib2.html#urllib2.urlopen
http://docs.python.org/library/os.html#os.system
curl -d "birthyear=1990&press=AUD" www.site.com/register/user.php
http://curl.haxx.se/docs/httpscripting.html
There are two major Python packages for automating web interactions:
Mechanize
Twill
Twill has apparently not been updated for a couple of years and seems to have been at version 0.9 since December 2007. Mechanize's changelog shows a release from just a few days ago: version 0.2.1 on 2010-05-16.
Of course, you'll find examples listed on their respective web pages. Twill essentially provides a simple shell-like interpreter, while Mechanize provides a class and API in which you set form values using Python dictionary-like statements (via the __setattr__() method), for example. Both use BeautifulSoup for parsing "real world" (sloppy tag-soup) HTML. (This is highly recommended for dealing with HTML you encounter in the wild, and strongly discouraged for your own HTML, which should be written to pass standards-conforming, validating parsers.)
Of course you'll find examples listed in their respective web pages. Twill essentially provides a simple shell like interpreter while Mechanize provides a class and API in which you set form values using Python dictionary-like (__setattr__() method) statements, for example. Both use BeautifulSoup for parsing "real world" (sloppy tag soup) HTML. (This is highly recommended for dealing with HTML that you encounter in the wild, and strongly discouraged for your own HTML which should be written to pass standards conforming, validating, parsers).