I'm creating a bot that should retrieve the status of an order.
I started with this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://footlocker.narvar.com/footlocker/tracking/de-mail?order_number=31900491219XXXXXXX"

def getstatus(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for EachPart in soup.select('div[class="tracking-status-container status_reposition"]'):
        print(EachPart)

getstatus(url)
But even after several tries, "EachPart" is empty.
Then I noticed that the information I need is not in the HTML body; it is in the <head>.
So if I just print the soup, I receive:
<head>
var translation = {"comm_sms_auto_response_msg":"........... "widgets_tracking_status_justshipped_status":"Ready to go" }
var xxxxxx
var xxxxxx
var xxxxxx
</head>
<body>
..................
</body>
In the "var translation", there is "widgets_tracking_status_justshipped_status":"Ready to go"
And thats what i need to extractm the "widgets_tracking_status_justshipped_status" and the text of the field, so "Ready to go".
For the JavaScript string, use a regex:
import re
import requests

def getstatus(url):
    response = requests.get(url, headers=headers)
    status = re.search(r'_justshipped_status":"([^"]+)', response.text).group(1)
    print(status)
    # Ready to go
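If you need more than one field, a slightly more robust variant is to capture the whole translation object and parse it with json.loads. This is a sketch, assuming the object is valid JSON and sits on a single line (worth verifying against the actual page source); the helper name getstatus_all is hypothetical:

import json
import re
import requests

def getstatus_all(url):
    # Grab the whole "var translation = {...}" object from the inline script.
    # Assumption: the object is valid JSON and ends at the line's closing brace.
    response = requests.get(url, headers=headers)
    match = re.search(r'var translation = ({.*?})\s*$', response.text, re.MULTILINE)
    if not match:
        return None
    translation = json.loads(match.group(1))
    return translation.get("widgets_tracking_status_justshipped_status")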
I would like to access and scrape the data from this link.
where:
new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'
Accessing new_url via the following line
req = Request(url, headers={'User-Agent': 'Mozilla/5.9'})
produced the error
HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
A new set of lines was drafted:
import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup as soup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
req = urllib.request.Request(new_url, None, headers)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())
No error is thrown, but
print(page_soup.prettify())
outputs some unrecognized text:
6�>�.�t1k�e�LH�.��]WO�?m�^#�
څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2�
�7�F8�XA��W\�ɚ��^8w��38�#'
SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra�
;o�Q�a#ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$�
n�&9u%ҵ*b���w�j�V��P�D�'z[��������)
with a warning
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I suspected this could be resolved by decoding it as UTF-8, as below:
# Same imports, headers, cookie jar, and opener as above
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
    print(page_soup.prettify())
However, the interpreter returned an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
May I know what the problem is? I appreciate any insight.
Try using the requests library:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.get_text())
You can still use cookies here.
Edit: Updated the code to show the use of headers; this tells the website you are a browser instead of a program. For further login operations I would suggest using Selenium instead of requests.
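For completeness, cookies set by the server persist across calls within the same Session. A minimal sketch of inspecting them, reusing new_url and headers from above:

import requests

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    print(s.cookies.get_dict())  # cookies collected during the request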
If you want to use the urllib library, remove Accept-Encoding from the headers (and, for simplicity, specify just utf-8 in Accept-Charset):
req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
})
The result is:
<!DOCTYPE html>
<!-- Form Name: START -->
<html lang="en">
<!-- Template_Component_Name: id.start.vm -->
<head>
<meta charset="utf-8"/>
...etc.
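Alternatively, if you want to keep Accept-Encoding: gzip, you can decompress the body yourself. The 0x8b in the UnicodeDecodeError is the second byte of the gzip magic number, a strong hint that the raw response is compressed. A minimal sketch, reusing the req and opener from above:

import gzip

raw = opener.open(req).read()
if raw[:2] == b'\x1f\x8b':  # gzip magic bytes
    raw = gzip.decompress(raw)
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())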
I am trying to scrape a website, and when I do, I get the output below.
Is there a way I can scrape this website?
url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
The output of the above code is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
The website's server expected a User-Agent header to be passed:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
httpx = requests.get(URL, headers=headers)
print(httpx.text)
By passing the header, we told the server that we are Mozilla. :)
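To catch this kind of block early instead of parsing an error page, you can check the status code before doing anything else. A small sketch:

import requests

response = requests.get(URL, headers=headers)
response.raise_for_status()  # raises requests.exceptions.HTTPError on 403, 404, etc.
print(response.status_code)  # 200 once the server accepts the header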
I am trying to do web scraping with the help of requests and BeautifulSoup, but the desired outcome is empty.
My code is as follows:
def urlscrape(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html')
    text = [p.text for p in soup.find(class_='bg-white').find_all('p')]
    print(url)
    return text
The website is: https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/
I want all the <p> tags containing paragraphs to be extracted as texts.
You can try this:
from bs4 import BeautifulSoup
import requests

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
url = 'https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html')
text = [p.text for p in soup.find_all('p')]
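If you still want to scope the search to the bg-white container from the original code, guard against soup.find returning None when that block is missing, which is what makes the original attempt come back empty. A sketch, reusing soup from above:

container = soup.find(class_='bg-white')
text = [p.text for p in container.find_all('p')] if container else []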
Try this ...
url="https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
response = requests.get(url,headers=headers)
html = response.content
print(response.content)
I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've encountered a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing seems to change; no matter what I do I get the exact same result. Here's my code; can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; I fixed that for you.
A better way to open connections is the with urllib.request.urlopen(url) as req: syntax, as this closes the connection for you.
from bs4 import BeautifulSoup
import urllib.request

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # read() is a method call, not an attribute
        soup = BeautifulSoup(html, "html5lib")  # pass the bytes directly, not str(html)
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # Important: close the connection
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples
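As mentioned above, the with form might look like this, a sketch of the same method using a context manager:

def free_proxy_list(self):
    print("Connecting to free-proxy-list.net ...")
    url = "https://free-proxy-list.net"
    req = urllib.request.Request(url, None, self.header)
    with urllib.request.urlopen(req) as response:  # connection is closed automatically
        html = response.read()
    soup = BeautifulSoup(html, "html5lib")
    print(soup.prettify())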