I'm creating a bot that should retrieve the status of an order.
I started with this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://footlocker.narvar.com/footlocker/tracking/de-mail?order_number=31900491219XXXXXXX"

def getstatus(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for EachPart in soup.select('div[class="tracking-status-container status_reposition"]'):
        print(EachPart)

getstatus(url)
But even after several tries, "EachPart" stays empty.
Then I noticed that the information I want is not in the HTML body; it is in the head.
So if I just print the soup, I receive:
<head>
var translation = {"comm_sms_auto_response_msg":"........... "widgets_tracking_status_justshipped_status":"Ready to go" }
var xxxxxx
var xxxxxx
var xxxxxx
</head>
<body>
..................
</body>
In the "var translation", there is "widgets_tracking_status_justshipped_status":"Ready to go"
And thats what i need to extractm the "widgets_tracking_status_justshipped_status" and the text of the field, so "Ready to go".
The status sits inside a JavaScript string, so extract it with a regex:
import re

def getstatus(url):
    response = requests.get(url, headers=headers)
    # The value is embedded in an inline script, so search the raw page text.
    status = re.search(r'_justshipped_status":"([^"]+)', response.text).group(1)
    print(status)

getstatus(url)  # Ready to go
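If you need more than one key from the same object, a slightly sturdier variant (a sketch, reusing url and headers from above, and assuming the object assigned to var translation is valid JSON on a single line) is to capture the whole object and decode it:
import json
import re
import requests

def get_translation(url):
    response = requests.get(url, headers=headers)
    # Capture the object literal assigned to "var translation"
    # (assumed to be one line of valid JSON; "." does not cross newlines).
    match = re.search(r'var translation = (\{.*\})', response.text)
    if match is None:
        return None
    return json.loads(match.group(1))

translation = get_translation(url)
if translation:
    print(translation.get("widgets_tracking_status_justshipped_status"))  # Ready to go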
I am attempting to read and parse a website that returns JSON. Every attempt I have made either gives me a timeout error or no error at all (I have to stop it manually).
URL:
https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089
Code I have tried:
import json
import requests
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Trial 1
BASE_URL = 'https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
}
response = requests.get(BASE_URL, headers=headers)

# Trial 2
url = 'https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089'
req = Request(url, headers=headers)
webpage = urlopen(req).read()
page_soup = BeautifulSoup(webpage, "html.parser")
obj = json.loads(str(page_soup))

# Trial 3
import dload
j = dload.json('https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089')
print(j)
So far none of these attempts, or any variation similar to them, has been able to open the website and read it. Any help would be appreciated.
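For what it's worth, here is a consolidated sketch of the requests attempt (my assumptions: the endpoint returns JSON, the Accept header is worth sending, and a timeout at least turns the silent hang into an exception; note the site may sit behind bot protection that no header tweak will satisfy):
import requests

url = 'https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Accept': 'application/json',  # assumption: the API serves JSON
}

try:
    # timeout makes a stalled connection raise instead of blocking forever
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    print(response.json())
except requests.exceptions.Timeout:
    print("Timed out; the server is probably filtering non-browser traffic.")
except requests.exceptions.HTTPError as err:
    print("Blocked or failed:", err)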
I am trying to web scrape a website, and when I do I get the output below.
Is there a way I can scrape this website?
url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
The output of the above code is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
The website's server expects a User-Agent header to be passed:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
httpx = requests.get(URL, headers=headers)
print(httpx.text)
By passing the header, we tell the server that we are Mozilla. :)
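As a quick sanity check (a generic sketch, nothing specific to this site), you can confirm the header actually changed the outcome before parsing:
import requests
from bs4 import BeautifulSoup

response = requests.get(URL, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title)  # quick check that we got the real page, not an error page
else:
    print("Still blocked:", response.status_code)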
I want to scrape news from different sources. I found a way to generate a URL for scraping multiple pages from Google, but I think there is a way to generate a much shorter link.
Can you please tell me how to generate the URLs for scraping multiple pages of Bing and Yahoo news, and also, is there a way to make the Google URL shorter?
This is the code for Google:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'

for page in range(1, 5):
    page = page * 10
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
These are the URLs for Yahoo and Bing, but only for one page:
yahoo: url = 'https://news.search.yahoo.com/search?q={}'.format(term)
bing: url = 'https://www.bing.com/news/search?q={}'.format(term)
I am not sure if this shortened news URL is what you are looking for.
#Google:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'

for page in range(1, 5):
    page = page * 10
    url = 'https://www.google.com/search?q={}&tbm=nws&start={}'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers, verify=False)
    soup = BeautifulSoup(response.text, 'html.parser')
#Yahoo:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'
page = 1

while True:
    url = 'https://news.search.yahoo.com/search?q={}&pz=10&b={}'.format(term, page)
    print(url)
    page = page + 10
    response = requests.get(url, headers=headers, verify=False)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')
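The question also asked about Bing. A sketch along the same lines (my assumption, not verified against the live site: Bing News accepts a first offset parameter for paging, as Bing web search does):
#Bing:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'

for offset in range(1, 41, 10):  # pages of 10 results: 1, 11, 21, 31
    url = 'https://www.bing.com/news/search?q={}&first={}'.format(term, offset)
    print(url)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')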
I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've encountered a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing seems to change; no matter what I do I get the exact same result. Here's my code; can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; I fixed that for you.
A better way to open connections is the with urllib.request.urlopen(url) as response: syntax, as this closes the connection for you.
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # read() with parentheses actually reads the bytes
        soup = BeautifulSoup(str(html), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # Important to close the connection
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples
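For completeness, a standalone sketch of the same fetch using the with form mentioned above (same behavior, but the connection closes automatically when the block exits):
from bs4 import BeautifulSoup
import urllib.request

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
url = "https://free-proxy-list.net"
req = urllib.request.Request(url, None, header)
with urllib.request.urlopen(req) as response:
    html = response.read()  # connection is closed when the with block exits
soup = BeautifulSoup(html, "html5lib")
print(soup.prettify())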
I can easily get the data when I put html = urllib.request.urlopen(req) inside the while loop, but then it takes about 3 seconds per fetch. So I thought that if I moved it outside the loop, the data would come faster, since the URL would not have to be opened every time; but this throws AttributeError: 'str' object has no attribute 'read'. Maybe it doesn't recognize the html variable name. How can I speed up the processing?
import os
import datetime
import urllib.request
from bs4 import BeautifulSoup

def soup():
    url = "http://www.investing.com/indices/major-indices"
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
            'Connection': 'keep-alive'}
    )
    global Ltp
    global html
    html = urllib.request.urlopen(req)
    while True:
        html = html.read().decode('utf-8')
        bsobj = BeautifulSoup(html, "lxml")
        Ltp = bsobj.find("td", {"class": "pid-169-last"})
        Ltp = Ltp.text
        Ltp = Ltp.replace(',', '')
        os.system('cls')
        Ltp = float(Ltp)
        print(Ltp, datetime.datetime.now())

soup()
If you want to fetch live data, you need to re-request the URL periodically:
html = urllib.request.urlopen(req)
This line should be inside the loop.
import os
import urllib.request
import datetime
from bs4 import BeautifulSoup
import time

def soup():
    url = "http://www.investing.com/indices/major-indices"
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
            'Connection': 'keep-alive'}
    )
    global Ltp
    global html
    while True:
        html = urllib.request.urlopen(req)  # re-open the URL on every iteration
        ok = html.read().decode('utf-8')
        bsobj = BeautifulSoup(ok, "lxml")
        Ltp = bsobj.find("td", {"class": "pid-169-last"})
        Ltp = Ltp.text
        Ltp = Ltp.replace(',', '')
        os.system('cls')
        Ltp = float(Ltp)
        print(Ltp, datetime.datetime.now())
        time.sleep(3)

soup()
Result:
sh: cls: command not found
18351.61 2016-08-31 23:44:28.103531
sh: cls: command not found
18351.54 2016-08-31 23:44:36.257327
sh: cls: command not found
18351.61 2016-08-31 23:44:47.645328
sh: cls: command not found
18351.91 2016-08-31 23:44:55.618970
sh: cls: command not found
18352.67 2016-08-31 23:45:03.842745
You reassign html to the UTF-8 string response, then keep calling it as if it were an IO object. This code does not fetch new data from the server on every loop; read() simply reads the bytes from the IO object, it doesn't make a new request.
You can speed up the processing with the Requests library and persistent connections (or urllib3 directly).
Try this (you will need to pip install requests):
import os
import datetime
from requests import Session
from bs4 import BeautifulSoup

s = Session()  # a Session reuses the underlying TCP connection across requests
while True:
    resp = s.get("http://www.investing.com/indices/major-indices")
    bsobj = BeautifulSoup(resp.text, "html.parser")
    Ltp = bsobj.find("td", {"class": "pid-169-last"})
    Ltp = Ltp.text
    Ltp = Ltp.replace(',', '')
    os.system('cls')
    Ltp = float(Ltp)
    print(Ltp, datetime.datetime.now())
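One caveat (my addition, not part of the original answer): the loop above issues requests back to back. Adding a short pause at the end of each iteration, as the earlier version did, is kinder to the server and less likely to get you rate-limited:
import time
# at the end of each loop iteration:
time.sleep(3)  # throttle: one request every few seconds is plenty for a ticker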