Problems accessing page with Python requests

I'm trying to extract the sector of a stock for an ML classification project. If I go to the following page:
https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4
I get (on screen) some information about this stock (it changes with the id code; I just picked the first one in the list). However, none of the information is available with a regular request. (The HTML page contains mostly JavaScript functions.)
What I need is on the "Shares Details" tab (ICB Supersector at the bottom of the page). Once again, nothing is available with a regular request. I looked into what happens when I click this tab, and the desired data comes from this URL:
http://www.six-swiss-exchange.com/shares/info_details_en.html?id=CH0210483332CHF4&portalSegment=EQ&dojo.preventCache=1520360103852 HTTP/1.1
However, if I use this URL directly, I get a 403 error from requests, although it works from a browser. I usually don't have any problems with this sort of thing, but in this case, do I have to submit cookies or any other information to access that page? No login is required and it can be easily accessed from any browser.
I am thinking 1) make a first request to the URL that works, 2) store the cookie they send (I don't really know how to do that) and 3) make a second request to the desired URL. Would this work?
I tried using requests.Session() but I'm not sure if this is the solution or if I implemented it properly.
If anyone has dealt with that sort of problem, I would love any pointers in solving this. Thanks.
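A rough sketch of that idea, untested: let a Session carry whatever cookies the first page sets, then request the details URL with the same session. The Referer header is an assumption on my part, not something the site is known to require.
import requests

session = requests.Session()
# 1) Hit the page that works so the server can set its cookies on the session.
session.get('https://www.six-swiss-exchange.com/shares/security_info_en.html',
            params={'id': 'CH0012221716CHF4'})
# 2) Whatever cookies come back are stored automatically in session.cookies.
# 3) Request the details URL with the same session (the Referer header is an assumption).
r = session.get('https://www.six-swiss-exchange.com/shares/info_details_en.html',
                params={'id': 'CH0012221716CHF4', 'portalSegment': 'EQ'},
                headers={'Referer': 'https://www.six-swiss-exchange.com/shares/security_info_en.html'})
print(r.status_code)
The snippet below (apparently the source of the console output further down) instead requests info_details_en.html directly with the id and portalSegment parameters and pulls the ICB Supersector cell out of the parsed HTML: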

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.six-swiss-exchange.com'

def get_page_html(isin):
    params = {
        'id': isin,
        'portalSegment': 'EQ'
    }
    r = requests.get(
        '{}/shares/info_details_en.html'.format(BASE_URL),
        params=params
    )
    r.raise_for_status()
    return r.text

def get_supersector_info(soup):
    # The supersector link sits in the cell next to the "ICB Supersector" label.
    supersector = soup.find('td', text='ICB Supersector').next_sibling.a
    return {
        'link': urljoin(BASE_URL, supersector['href']),
        'text': supersector.text
    }

if __name__ == '__main__':
    page_html = get_page_html('CH0012221716CHF4')
    soup = BeautifulSoup(page_html, 'lxml')
    supersector_info = get_supersector_info(soup)
    print(supersector_info['link'])
    print(supersector_info['text'])
Console:
https://www.six-swiss-exchange.com/search/quotes_en.html?security=C2700T
Industrial Goods & Services

Related

How can I get the source of chat without using selenium?

So my issue is that I want to get users' id info from the chat.
The chat area I'm looking for looks like this...
<div id="chat_area" class="chat_area" style="will-change: scroll-position;">
<dl class="" user_id="asdf1234"><dt class="user_m"><em class="pc"></em> :</dt><dd id="1">blah blah</dd></dl>
asdf1234
...
What I want to do is
get the part starting with <a href='javascript:'' user_id='asdf1234' ...
so that I can parse it and do some other stuff.
But this webpage is the one I'm currently using, and it cannot be proxied (driven by a Selenium webdriver).
How can I extract that data from the chat?
It looks like you've got two separate problems here. I'd use both the requests and BeautifulSoup libraries to accomplish this.
Use your browser's developer tools, the network tab, to refresh the page and look for the request which responds with the HTML you want. Use the requests library to emulate this request exactly.
import requests
headers = {"name": "value"}
# Get case example.
response = requests.get("some_url", headers=headers)
# Post case example.
data = {"key": "value"}
response = requests.post("some_url", headers=headers, data=data)
Web scraping is always finicky; if this doesn't work, you're most likely going to need to use a requests session. Or, a one-time hacky solution is just to set your cookies from the browser, as sketched below.
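As a rough illustration of that hacky option (the cookie name and value here are placeholders; copy the real ones from your browser's developer tools):
import requests

# Hypothetical cookie copied from the browser's dev tools (Application/Storage tab).
cookies = {"session_cookie_name": "value-copied-from-browser"}
response = requests.get("some_url", cookies=cookies)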
Once you have made the request you can use BeautifulSoup to scrape your user id very easily.
from bs4 import BeautifulSoup
# Create BS parser.
soup = BeautifulSoup(response.text, 'lxml')
# Find all elements with the attribute "user_id".
find_results = soup.findAll("a", {"user_id" : True})
# Iterate results. Could also just index if you want the single user_id.
for result in find_results:
    user_id = result["user_id"]

Python requests not redirecting

I'm trying to scrape word definitions, but can't get Python to redirect to the correct page. For example, I'm trying to get the definition for the word 'agenesia'. When you load https://www.lexico.com/definition/agenesia in a browser, the page which loads is https://www.lexico.com/definition/agenesis; however, in Python the page doesn't redirect and gives a 200 status code.
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)
This is how I'm currently retrieving the page content. I've also tried using requests.get,
but that also doesn't work.
EDIT: Because it isn't clear, I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I am scraping a list of words and would rather follow the redirect automatically than search for it by hand in a browser first.
EDIT 2: I realised it might be easier to check solutions with the rest of my code; so far this works with agenesis but not agenesia:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
The other answers mentioned before don't make your request redirect. The cause is that you didn't use the correct request header. Try the code below:
import requests
from bs4 import BeautifulSoup
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
And it prints:
https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body.
noun
You are doing a HEAD request.
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.
You want to do
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
If you don't mind a pop-up window, Selenium is really good for scraping at a more user-friendly level. If you know the selector of the page element, you can scrape it with driver.find_element_by_css_selector('theselector').text,
where driver = webdriver.Chrome('file path').
This is a pretty radical circumvention of the problem so I understand if it's not applicable to your specific situation but hopefully you find this answer useful. :)
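For reference, a minimal Selenium sketch along those lines, untested; it assumes chromedriver is installed, and it reuses the span.ind / span.pos selectors from the question:
from selenium import webdriver

driver = webdriver.Chrome('/path/to/chromedriver')  # hypothetical driver path
driver.get('https://www.lexico.com/definition/agenesia')

# The real browser follows the redirect, so the definition page's elements are present.
definition = driver.find_element_by_css_selector('span.ind').text
part_of_speech = driver.find_element_by_css_selector('span.pos').text
print(definition, part_of_speech)
driver.quit()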
This works as expected:
>>> import requests
>>> url = 'https://www.lexico.com/definition/agenesia'
>>> requests.get(url, allow_redirects=True).url
'https://www.lexico.com/search?filter=en_dictionary&query=agenesia'
This is the URL that other commands find. For example, with curl:
$ curl -v https://www.lexico.com/definition/agenesia 2>&1 | grep location:
< location: https://www.lexico.com/search?filter=en_dictionary&query=agenesia

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
(Screenshots in the original post: one of the JSON response, one of the HTML response.)
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a POST method) to include "accept: application/json, text/plain, */*", but didn't see a difference in the response. As it stands, I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
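For reference, a rough sketch of that attempt (the URL and payload below are placeholders, not the theatre chain's real endpoint); checking the Content-Type first shows whether the response that came back is the HTML one or the JSON one:
import requests

url = "https://example.com/GetNowPlayingByCity"  # placeholder endpoint, not the real one
headers = {"Accept": "application/json, text/plain, */*"}
payload = {"city": "some-city"}  # placeholder request body

response = requests.post(url, headers=headers, json=payload)
# If this prints text/html, the JSON never arrived and response.json() will fail.
print(response.headers.get("Content-Type"))
if "application/json" in response.headers.get("Content-Type", ""):
    data = response.json()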
This is probably because the JavaScript the page sends to your browser is making a request to an API to get the JSON info about the movies.
You could either try sending the request directly to their API (see Edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with scrapy; it is much faster than requests.
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use selenium with the PhantomJS browser instead of requests. Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their API, you can just reproduce the request you see. Using Google Chrome, you can see the request details if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is you case, convert it to string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, etc. A small sketch of that is below.
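Using the hypothetical API URL from the example above, requests can build the query string from a params dict, so the values are easy to tweak:
import requests

# Hypothetical endpoint from the example above, not a real API.
url = "http://api.movies.com/"
params = {"page": 1, "movietype": 2}  # tweak these values to get different listings
response = requests.get(url, params=params)
data = response.json()
print(data)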

How to extract the Coronavirus cases from a website?

I'm trying to extract the Coronavirus case numbers from a website (https://www.trackcorona.live) but I got an error.
This is my code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days back) because it is returning an empty list (li):
IndexError
Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
2 data=BeautifulSoup(response.text,'html.parser')
3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
5 countries = li[1].get_text()
6 dead = int(li[3].get_text())
IndexError: list index out of range
Well, actually the site is generating a redirection behind Cloudflare, and the content is then loaded dynamically via JavaScript once the page loads. Therefore we could use several approaches, such as selenium or requests_html, but I will mention the quickest solution, as we will render the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text
soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
A tip about a 503 response code:
Basically, that code means "service unavailable".
More technically, the GET request which you sent couldn't be served. The reason is that the request got stuck between the receiver of the request, https://www.trackcorona.live/, and another source on the same host that it hands off to, https://www.trackcorona.live/?__cf_chl_jschl_tk__=...,
where __cf_chl_jschl_tk__= holds a token to be authenticated.
So you usually have to follow up in your code and serve the host the required data.
Something like the following shows the end URL:
import requests
from bs4 import BeautifulSoup

def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)

Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
Now, to be able to call the end URL, you need to pass the required form data.
Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)

Main()
Here you will end up with text that lacks your desired values, because those values are rendered via JS.
That site is covered by Cloudflare DDoS protection, so the HTML returned is a Cloudflare page stating this, not the content you want. You will need to navigate that first, presumably by getting and setting some cookies, etc.
As an alternative, I recommend taking a look at Selenium. It drives a browser, will execute any JS on the page, and should get you past this much more easily if you are just starting out. A rough sketch follows.
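A minimal sketch of that approach, untested; it assumes chromedriver is installed, and the valueTot id is borrowed from the cloudscraper answer above (it may change):
import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.trackcorona.live")
time.sleep(10)  # crude wait for the Cloudflare challenge and the page's JS to finish

# Once the challenge resolves, the fully rendered DOM is available to the driver.
confirmed = driver.find_element_by_id("valueTot").text
print(confirmed)
driver.quit()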
Hope that helps!
The website is now protected with Cloudflare DDoS protection, so it cannot be directly accessed with Python requests.
You can try https://github.com/Anorov/cloudflare-scrape, which bypasses this page. The pip package is named cfscrape.
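A minimal sketch with that package (the valueTot id below is the same element used in the cloudscraper answer above):
import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.create_scraper()  # behaves like a requests.Session
html = scraper.get("https://www.trackcorona.live/").content

soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", id="valueTot").text)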

Scraping second page in Python Yields Different Data Than Browsing to Second Page

I'm attempting to scrape some data from www.ksl.com/auto/ using Python Requests and Beautiful Soup. I'm able to get the results from the first search page but not subsequent pages. When I request the second page using the same URL Chrome constructs when I click the "Next" button on the page, I get a set of results that no longer matches my search query. I've found other questions on Stack Overflow that discuss Ajax calls that load subsequent pages, and using Chrome's Developer tools to examine the request. But, none of that has helped me with this problem -- which I've had on other sites as well.
Here is an example query that returns only Acuras on the site. When you advance in the browser to the second page, the URL is simply this: https://www.ksl.com/auto/search/index?page=1. When I use Requests to hit those two URLs, the second set of search results is not Acuras. Is there, perhaps, a cookie that my browser is passing back to the server to preserve my filters?
I would appreciate any advice someone can give about how to get subsequent pages of the results I searched for.
Here is the simple code I'm using:
from requests import get
from bs4 import BeautifulSoup
page1 = get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = get('https://www.ksl.com/auto/search/index?page=2').text
soup = BeautifulSoup(page1, 'html.parser')
listings = soup.findAll("div", { "class" : "srp-listing-body-right" })
listings[0] # An Acura - success!
soup2 = BeautifulSoup(page2, 'html.parser')
listings2 = soup2.findAll("div", { "class" : "srp-listing-body-right" })
listings2[0] # Not an Acura. :(
Try this. Create a Session object and then call the links. This will maintain your session with the server when you send a call to the next link.
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Add this line
page1 = s.get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = s.get('https://www.ksl.com/auto/search/index?page=1').text
Yes, the website uses cookies so that https://www.ksl.com/auto/search/index shows or extends your last search. More specifically, the search parameters are stored on the server for your particular session cookie, that is, the value of the PHPSESSID cookie.
However, instead of passing that cookie around, you can simply always do full queries (in the sense of the request parameters), each time using a different value for the page parameter, as sketched after the example URLs below.
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=1&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
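A sketch of that idea, with the long query trimmed down to a params dict (add back any other fields from the URLs above that matter for your search):
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ksl.com/auto/search/index'
# Trimmed-down parameter set; 'make[]' is URL-encoded to make%5B%5D automatically.
params = {'keyword': '', 'make[]': 'Acura', 'miles': '25', 'newUsed[]': 'All'}

for page in range(0, 3):
    params['page'] = page
    html = requests.get(base_url, params=params).text
    soup = BeautifulSoup(html, 'html.parser')
    listings = soup.find_all('div', {'class': 'srp-listing-body-right'})
    print(page, len(listings))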
