I'm creating a bot that should retrieve the status of an order.
I started with this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://footlocker.narvar.com/footlocker/tracking/de-mail?order_number=31900491219XXXXXXX"

def getstatus(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for EachPart in soup.select('div[class="tracking-status-container status_reposition"]'):
        print(EachPart)

getstatus(url)
But even after several tries, "EachPart" is empty.
Then I noticed that the information I need is not in the HTML body; it is in the <head>.
So if I just print the soup, I receive:
<head>
var translation = {"comm_sms_auto_response_msg":"........... "widgets_tracking_status_justshipped_status":"Ready to go" }
var xxxxxx
var xxxxxx
var xxxxxx
</head>
<body>
..................
</body>
In the "var translation", there is "widgets_tracking_status_justshipped_status":"Ready to go"
And thats what i need to extractm the "widgets_tracking_status_justshipped_status" and the text of the field, so "Ready to go".
For the JavaScript string, use a regex:
import re
import requests

def getstatus(url):
    response = requests.get(url, headers=headers)
    status = re.search(r'_justshipped_status":"([^"]+)', response.text).group(1)
    print(status)
    # Ready to go
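If you need more than one field, a slightly more robust variant is to capture the whole translation object and parse it with json.loads. This is a sketch, assuming the object is valid JSON and sits on a single line (worth verifying against the actual page source); the helper name getstatus_all is hypothetical:

import json
import re
import requests

def getstatus_all(url):
    # Grab the whole "var translation = {...}" object from the inline script.
    # Assumption: the object is valid JSON and ends at the line's closing brace.
    response = requests.get(url, headers=headers)
    match = re.search(r'var translation = ({.*?})\s*$', response.text, re.MULTILINE)
    if not match:
        return None
    translation = json.loads(match.group(1))
    return translation.get("widgets_tracking_status_justshipped_status")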
I would like to access and scrape the data from this link.
where:
new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'
Accessing new_url via the following line
req = Request(url, headers={'User-Agent': 'Mozilla/5.9'})
produced the error
HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
A new set of lines was drafted:
import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup as soup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
req = urllib.request.Request(new_url, None, headers)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())
No error is thrown, but
print(page_soup.prettify())
outputs some unrecognized text:
6�>�.�t1k�e�LH�.��]WO�?m�^#�
څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2�
�7�F8�XA��W\�ɚ��^8w��38�#'
SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra�
;o�Q�a#ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$�
n�&9u%ҵ*b���w�j�V��P�D�'z[��������)
with a warning
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I suspected this could be resolved by decoding it as UTF-8, as below:
# Same imports, headers, cookie jar, and opener as above
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
    print(page_soup.prettify())
However, the interpreter returned an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
May I know what the problem is? I appreciate any insight.
Try using the requests library:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.get_text())
You can still use cookies here.
Edit: Updated the code to show the use of headers; this tells the website you are a browser instead of a program. For further login operations I would suggest using Selenium instead of requests.
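For completeness, cookies set by the server persist across calls within the same Session. A minimal sketch of inspecting them, reusing new_url and headers from above:

import requests

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    print(s.cookies.get_dict())  # cookies collected during the request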
If you want to use the urllib library, remove Accept-Encoding from the headers (and, for simplicity, specify just utf-8 in Accept-Charset):
req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
})
The result is:
<!DOCTYPE html>
<!-- Form Name: START -->
<html lang="en">
<!-- Template_Component_Name: id.start.vm -->
<head>
<meta charset="utf-8"/>
...etc.
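Alternatively, if you want to keep Accept-Encoding: gzip, you can decompress the body yourself. The 0x8b in the UnicodeDecodeError is the second byte of the gzip magic number, a strong hint that the raw response is compressed. A minimal sketch, reusing the req and opener from above:

import gzip

raw = opener.open(req).read()
if raw[:2] == b'\x1f\x8b':  # gzip magic bytes
    raw = gzip.decompress(raw)
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())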
I am trying to scrape a website, and when I do, I get the output below.
Is there a way I can scrape this website?
url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
The output of the above code is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
The website's server expected a User-Agent header to be passed:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
httpx = requests.get(URL, headers=headers)
print(httpx.text)
By passing the header, we told the server that we are Mozilla. :)
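To catch this kind of block early instead of parsing an error page, you can check the status code before doing anything else. A small sketch:

import requests

response = requests.get(URL, headers=headers)
response.raise_for_status()  # raises requests.exceptions.HTTPError on 403, 404, etc.
print(response.status_code)  # 200 once the server accepts the header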
I am trying to do web scraping with the help of requests and BeautifulSoup, but the desired outcome is empty.
My code is as follows:
def urlscrape(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html')
    text = [p.text for p in soup.find(class_='bg-white').find_all('p')]
    print(url)
    return text
The website is: https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/
I want all the <p> tags containing paragraphs to be extracted as texts.
You can try this:
from bs4 import BeautifulSoup
import requests

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
url = 'https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html')
text = [p.text for p in soup.find_all('p')]
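If you still want to scope the search to the bg-white container from the original code, guard against soup.find returning None when that block is missing, which is what makes the original attempt come back empty. A sketch, reusing soup from above:

container = soup.find(class_='bg-white')
text = [p.text for p in container.find_all('p')] if container else []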
Try this ...
url="https://www.afghanistan-analysts.org/en/reports/war-and-peace/taleban-prisoners-release-are-the-latest-proposals-legal/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
response = requests.get(url,headers=headers)
html = response.content
print(response.content)
I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've encountered a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing seems to change; no matter what I do I get the exact same result. Here's my code; can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; I fixed that for you.
A better way to open connections is the with urllib.request.urlopen(url) as req: syntax, as this closes the connection for you.
from bs4 import BeautifulSoup
import urllib.request

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # read() is a method call, not an attribute
        soup = BeautifulSoup(html, "html5lib")  # pass the bytes directly, not str(html)
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # Important: close the connection
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples
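As mentioned above, the with form might look like this, a sketch of the same method using a context manager:

def free_proxy_list(self):
    print("Connecting to free-proxy-list.net ...")
    url = "https://free-proxy-list.net"
    req = urllib.request.Request(url, None, self.header)
    with urllib.request.urlopen(req) as response:  # connection is closed automatically
        html = response.read()
    soup = BeautifulSoup(html, "html5lib")
    print(soup.prettify())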