How to check website title changes with BeautifulSoup? - python

I'm trying to check this site for changes on the title:
https://allestoringen.nl/storing/kpn/
source = requests.get(url).text
soup = bs4.BeautifulSoup(source,'html.parser')
event_string = str(soup.find(text='Er is geen storing bij KPN'))
print (event_string)
However, event_string returns None every time.

The reason you don't get a result might be that the website doesn't accept your request. I got this result.
page = requests.get(url)
page.status_code # 403
page.reason # 'Forbidden'
You might want to take a look at this post for a solution.
It is always a good idea to check the return status of your request in your code.
But to solve your problem. You might want to check the <title> element instead of a specific string.
# stolen from the post I mentioned
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)
page.status_code # 200. Adding a header solved the problem.
soup = bs4.BeautifulSoup(page.text,'html.parser')
# get title.
print(soup.find("title").text)
'KPN storing? Actuele storingen en problemen | Allestoringen'

Related

requests.post() returns the search page, not the results page

What I am trying to do is go onto https://www.ssrn.com/index.cfm/en/, search for an author name, and then click on that author's name to take me to the author page. Right now, I am stuck on that first step, as requests.post() just returns the original URL input, not the results page. I've searched around for a while but I have no idea what the problem is; if I've misidentified the key or something else.
Here is my code:
url = "https://www.ssrn.com/index.cfm/en/"
header = {}
header['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
obj = requests.post(url, headers = header, data = {'txtKey_Words': 'auth_name'}) # make POST request for each name
print(obj.url)
So print(obj.url) just returns https://www.ssrn.com/index.cfm/en/ instead of https://papers.ssrn.com/sol3/results.cfm
Much appreciated!
Edit: Using https://papers.ssrn.com/sol3/results.cfm returns https://papers.ssrn.com/sol3/displayabstractsearch.cfm

BeautifulSoup is returing none

I am trying to get a value from a website using beautiful soup but it keeps returning none. This is what I have for my code so far
def getmarketchange():
source = requests.get("https://www.coinbase.com/price").text
soup = bs4.BeautifulSoup(source, "lxml")
marketchange = soup.get("MarketHealth__Percent-sc-1a64a42-2.bEMszd")
print(marketchange)
getmarketchange()
and attached is a screenshot of html code I was trying to grab.
Thank you for your help in advance!
Have a look at the HTML source returned from your get() request - it's a CAPTCHA challenge. You won't be able to get to the Coinbase pricing without passing this challenge.
Excerpt:
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">
Please complete the security check to access</span> www.coinbase.com
</h2>
<div class="cf-section cf-highlight cf-captcha-container">
Coinbase is recognizing that the HTTP request isn't coming from a standard browser-based user, and it is challenging the requester. BeautifulSoup doesn't have a way to pass this check on its own.
Passing in User-Agent headers (to mimic a browser request) also doesn't resolve this issue.
For example:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get("https://www.coinbase.com/price", headers=headers).text
You might find better luck with Selenium, although I'm not sure about that.
To prevent a captcha page, try to specify User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/price'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.select_one('[class^="MarketHealth__Percent"]').text)
Prints:
0.65%

BeautifulSoup using Python keep returning null even though the element exists

I am running the following code to parse an amazon page using beautiful soup in Python but when I run the print line, I keep getting None. I am wondering whether I am doing something wrong or if theres an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-
Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
Your code is absolutely correct.
There seems to be some issue with the the parser that you have used (html.parser)
I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
More Info not directly related to the answer:
For the other answer given to this question, I wasn't asked for a captcha when visiting the page.
However Amazon does change the response content if it detects that a bot is visiting the website: Remove the headers from requests.get() method, and try page.text
The default headers added by requests library lead to the identification of the request as being form a bot.
When requesting that page outside of a normal browser environment it asked for a captcha, I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages, I suggest to look at their APIs to see if there's anything helpful instead of scraping the webpages directly.

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me to extract the number and store it's value into a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the network tab of the page, i saw this XMLHTTPRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request.
To access the network tab: Click right->inspect element->Network tab->Select XHR and find the second request.
The final code would be like this:
headers = {'x-fsign' : 'SW9D1eZo'}
page =
requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu',
headers=headers)
You should check if the x=fisgn value is different based on your browser/ip.

What I scrape from python is not the same as what I see from firebug

I am practicing to write a web crawler to crawl some interesting information from a website. I try this block of code on my personal website. It works as what I expect, but when I try to implement this code on a real website, it does not show what it should show. Does anyone have any ideas? The following is my code and results.
import requests
from bs4 import BeautifulSoup
url = 'https://angel.co/parkwhiz/jobs/284942-product-manager'
page = requests.get(url).text
soup = BeautifulSoup(page,'lxml')
print soup.prettify()
Result from print
Result from firebug(or chrome inspect)
The title shows in the print is "Page not found - 404 - AngelList", but the title shows in firebug is "Product Manager Job at Parkwhiz - AngelList". Is there anything wrong with my code? Shouldn't these two be match?
The website is blocking the script as you're passing default User-Agent which tells the website that it is an automated Python script.
If you check the status code, you'll see that you're getting 404.
>>> r = requests.get('https://angel.co/parkwhiz/jobs/284942-product-manager')
>>> r.status_code
404
To overcome this, change the User-Agent to look like a real browser:
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> r = requests.get('https://angel.co/parkwhiz/jobs/284942-product-manager', headers=headers)
>>> r.status_code
200

Categories

Resources