Python - Web scraping with Beautiful Soup

Python - Web scraping with Beautiful Soup - python

I am currently trying to reproduce a web scraping example with Beautiful Soup. However, I have to say I find it pretty unintuitive, which of course might alse be due to lack of experience. In case anyone could help me with an example I'd appreciate it. I cannot find much relevant information online. I would like to extract the first value (Dornum) of the following website: http://flow.gassco.no/
I only got this far:
import requests
page = requests.get("http://flow.gassco.no/")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
Thank you in advance!

Another way is to use current requests module.
You can pass user-agent like this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36'
}
page = requests.get("http://flow.gassco.no/", headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
EDIT: To make this version work straightforward you can make a workaround with browser sessions.
You need to pass with requests.get a cookie that tells the site a session number, where Terms and Conditions are already accepted.
Run this code:
import requests
from bs4 import BeautifulSoup
url = "http://flow.gassco.no"
s = requests.Session()
r = s.get(url)
action = BeautifulSoup(r.content, 'html.parser').find('form').get('action') #this gives a "tail" of url whick indicates acceptance of Terms
s.get(url+action)
page = s.get(url).content
soup = BeautifulSoup(page, 'html.parser')

You need to learn how to use urllib, urllib2 first.
Some website shield spiders.
something like:
urllib2.request.add_header('User-Agent','Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
Let website think you are Browser, not robot.

Related

JavaScript is not available. (Python)

I am trying to scrape data from a Twitter webpage using Python but instead of getting the data back, I keep getting "Javascript is not available". I've enabled Javascript in my browser(chrome) but nothing changes.
Here is the error -->
<h1>JavaScript is not available.</h1>
<p>We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.</p>
Here is the code -->
from bs4 import BeautifulSoup
import requests
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
I've tried enabling Javascript in my browser(chrome), I expected to get the required data back instead the error "Javascript is not availble" persists.

I will never advise scraping twitter by violating their policies, you should use an API instead! But for the Javascript part, just pass user agent in headers in your requests.
from bs4 import BeautifulSoup
import requests
user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url, headers=headers).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)

Getting the page source of imgur

I'm trying to get the page source of an imgur website using requests, but the results I'm getting are different from the source. I understand that these pages are rendered using JS, but that is not what I am searching for.
It seems I'm getting redirected because they detect I'm using an automated browser, but I'd prefer not to use selenium here. For example, see the following code to scrape the page source of two imgur ID's (one valid ID and one invalid ID) with different page sources.
import requests
from bs4 import BeautifulSoup
url1 = "https://i.imgur.com/ssXK5" #valid ID
url2 = "https://i.imgur.com/ssXK4" #invalid ID
def get_source(url):
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
return soup
page1 = get_source(url1)
page2 = get_source(url2)
print(page1==page2)
#True
The scraped page sources are identical, so I presume it's an anti-scraping thing. I know there is an imgur API, but I'd like to know how to circumvent such a redirection, if possible. Is there any way to get the actual source code using the requests module?
Thanks.

BeautifulSoup using Python keep returning null even though the element exists

I am running the following code to parse an amazon page using beautiful soup in Python but when I run the print line, I keep getting None. I am wondering whether I am doing something wrong or if theres an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-
Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)

Your code is absolutely correct.
There seems to be some issue with the the parser that you have used (html.parser)
I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
More Info not directly related to the answer:
For the other answer given to this question, I wasn't asked for a captcha when visiting the page.
However Amazon does change the response content if it detects that a bot is visiting the website: Remove the headers from requests.get() method, and try page.text
The default headers added by requests library lead to the identification of the request as being form a bot.

When requesting that page outside of a normal browser environment it asked for a captcha, I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages, I suggest to look at their APIs to see if there's anything helpful instead of scraping the webpages directly.

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me to extract the number and store it's value into a variable?

You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the network tab of the page, i saw this XMLHTTPRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request.
To access the network tab: Click right->inspect element->Network tab->Select XHR and find the second request.
The final code would be like this:
headers = {'x-fsign' : 'SW9D1eZo'}
page =
requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu',
headers=headers)
You should check if the x=fisgn value is different based on your browser/ip.

python requests & beautifulsoup bot detection

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't do my further work with product details.
Any help on this?
EDIT 1:
From the given answer, It shows the markup of the bot detection page. I researched a bit & found two ways to breach it :
I might need to add a header in the requests, but I couldn't understand what should be the value of header.
Use Selenium.
Now my question is, do both of the ways provide equal support?

It is better to use fake_useragent here for making things easy. A random user agent sends request via real world browser usage statistic. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.
import requests
from fake_useragent import UserAgent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
ua=UserAgent()
hdr = {'User-Agent': ua.random,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print response.content
Selenium is used for browser automation and high level web scraping for dynamic contents.

As some of the comments already suggested, if you need to somehow interact with Javascript on a page, it is better to use selenium. However, regarding your first approach using a header:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
These headers are a bit old, but should still work. By using them you are pretending that your request is coming from a normal webbrowser. If you use requests without such a header your code is basically telling the server that the request is coming from python, which most of the servers are rejecting right away.
Another alternative for you could also be fake-useragent maybe you can also have a try with this.

try this:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
r = r.text
##options #1
# print r.text
soup = BeautifulSoup( r.encode("utf-8") , "html.parser")
### options 2
print(soup)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Web scraping with Beautiful Soup - python

Related

JavaScript is not available. (Python)

Getting the page source of imgur

BeautifulSoup using Python keep returning null even though the element exists

I get nothing when trying to scrape a table

python requests & beautifulsoup bot detection

Categories

Resources