Fetching website HTML using Python

I am very new to Python. I had working code that fetched an HTML page and parsed text out of it, but it recently stopped working. Perhaps the website changed, but I am no longer able to retrieve the data.
Any help is appreciated. Here is the code that used to work:
from cookielib import CookieJar
from urllib2 import build_opener, HTTPCookieProcessor
from lxml import etree
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
url = 'https://www.nasdaq.com/markets/stocks/symbol-change-history.aspx?sortby=EFFECTIVE&d'
cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', user_agent),('Accept','text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
page = opener.open(url, timeout=2)
parser = etree.HTMLParser()
rootDOM = etree.parse(page, parser)
html = etree.tostring(rootDOM.getroot(), pretty_print=True, method='html')
Now I get the error: SSLError: ('The read operation timed out',)
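The two-second timeout in opener.open(url, timeout=2) is easy for a slow server to exceed, which matches the "read operation timed out" error. A first step is to port the snippet to Python 3 (cookielib became http.cookiejar, urllib2 became urllib.request) and allow a more generous timeout. A sketch, with the fetch itself left commented out since the site may still block or stall:

```python
from http.cookiejar import CookieJar            # Python 3 home of cookielib
from urllib.request import build_opener, HTTPCookieProcessor

def make_opener(user_agent):
    """Cookie-aware opener with browser-like headers, as in the original code."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    opener.addheaders = [
        ('User-Agent', user_agent),
        ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ]
    return opener

user_agent = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36')
opener = make_opener(user_agent)

# The fetch itself, with a far more generous timeout than the original 2 s:
# page = opener.open(url, timeout=30)
# root = etree.parse(page, etree.HTMLParser()).getroot()
```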

Related

server sends 403 status code when using requests library in python, but works with browser

I'm trying to automate a login using Python's requests module, but whenever I send a POST or GET request the server returns a 403 status code; the weird part is that I can access the same URL with any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or ideas!
It's probably the server rejecting the default requests user agent (robots.txt itself can't block a request; it is only advisory).
Try overriding the User-Agent header with a custom one:
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file.
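As a side note, the open/write/close sequence can be written as a with block, which closes the file even if write() raises. A small sketch, using a stand-in string in place of the real soup.prettify() output:

```python
import os
import tempfile

ready = "<html><body>example</body></html>"  # stand-in for soup.prettify()

# 'with' closes the file automatically, even if write() raises
path = os.path.join(tempfile.gettempdir(), "usvisa.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(ready)
```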

BeautifulSoup doesn't seem to parse anything

I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've run into a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing changes; no matter what I do I get the exact same result. Here's my code; can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; I fixed that for you.
A better way to open connections is the with urllib.request.urlopen(url) as req: syntax, as this closes the connection for you.
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()
        soup = BeautifulSoup(html, "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # important to close the connection
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples
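The original symptom (the &lt;bound ...&gt; text in the parsed output) comes from handing BeautifulSoup the repr of a method object rather than the page. The difference can be reproduced without any network access, since io.BytesIO exposes the same read interface as the HTTP response:

```python
import io

resp = io.BytesIO(b"<html><body>hi</body></html>")  # stand-in for the HTTP response

method = resp.read     # a bound method object: this is what got parsed above
data = resp.read()     # calling it returns the actual bytes

print(method)   # <built-in method read of _io.BytesIO object at 0x...>
print(data)     # b'<html><body>hi</body></html>'
```

str(method) is what BeautifulSoup received, which is why the method's repr ended up inside the &lt;body&gt; of the prettified output.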

Web scraping: HTTPError: HTTP Error 403: Forbidden, python3

Hi, I need to scrape a web page and extract the data-id attributes using a regular expression.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])
When I run my program in Jupyter I get:
HTTPError: HTTP Error 403: Forbidden
It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
As for the code, simply changing the User Agent string to something they like better seems to work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request

request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
html = urlopen(request).read().decode()
(unrelated, you have another mistake in your code: bsObj ≠ bsObg)
EDIT added code below to answer additional question from the comments:
What you seem to need is to find the value of the data-id attribute, no matter to which tag it belongs. The code below does just that:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = Request(url, headers={'User-Agent': agent})
html = urlopen(request).read().decode()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])
The key is to simply use a lambda expression as the parameter to the findAll function of BeautifulSoup.
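Since the lambda works on any markup, its behaviour can be checked against a small static snippet with no network access; the table rows below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page
snippet = """
<table>
  <tr data-id="a1f"><td>first</td></tr>
  <tr data-id="b2e"><td>second</td></tr>
  <tr><td>no attribute here</td></tr>
</table>
"""

soup = BeautifulSoup(snippet, "html.parser")
ids = [tag["data-id"]
       for tag in soup.findAll(lambda tag: tag.get("data-id") is not None)]
print(ids)  # ['a1f', 'b2e']
```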
The server is likely blocking your requests because of the default user agent. You can change this so that you will appear to the server to be a web browser. For example, a Chrome User-Agent is:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
To add a User-Agent you can create a request object with the url as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.
See:
import urllib.request
r = urllib.request.Request(url, headers= {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r)
You could try this:
#!/usr/bin/env python
from bs4 import BeautifulSoup
import requests

url = 'your url here'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))
This should work!

Broken Link Checker Fails Head Requests

I am building a broken-link checker in Python 3.4 to help ensure the quality of a large collection of articles that I manage. Initially I used GET requests to check whether a link was viable, but I am trying to be as nice as possible when pinging the URLs I check, so I make sure never to re-check a URL that has already tested as working, and I have switched to HEAD requests.
However, I have found a site that causes the checker to simply stop. It neither throws an error nor opens:
https://www.icann.org/resources/pages/policy-2012-03-07-en
The link itself is fully functional. So ideally I'd like to find a way to process similar links. This code in Python 3.4 will reproduce the issue:
import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
req = urllib.request.Request(URL, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}, method='HEAD')
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
As it does not throw an error, I really do not know how to troubleshoot this further beyond narrowing it down to the link that halted the entire checker. How can I check if this link is valid?
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

def getStatus(url):
    a = requests.get(url, verify=False)
    return str(a.status_code)

alllinks = []
passlinks = []
faillinks = []

html_page = urllib.request.urlopen("https://link")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    status = getStatus(link.get('href'))
    link = 'URL---->', link.get('href'), 'Status---->', status
    alllinks.append(link)
    if status == '200':
        passlinks.append(link)
    else:
        faillinks.append(link)

print(alllinks)
print(passlinks)
print(faillinks)
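The GET-based answer above sidesteps HEAD entirely. The original question can also be handled by putting a timeout on the HEAD request and retrying with GET when HEAD stalls or is rejected. The sketch below is an illustration, not the asker's code; check_link takes an injectable open_fn only so the fallback logic can be exercised without touching the network:

```python
import socket
import urllib.error
import urllib.request

def check_link(url, open_fn=urllib.request.urlopen, timeout=10):
    """Try HEAD first; fall back to GET if HEAD times out or is rejected."""
    for method in ('HEAD', 'GET'):
        req = urllib.request.Request(url, method=method)
        try:
            with open_fn(req, timeout=timeout) as resp:
                return resp.status < 400
        except socket.timeout:
            continue                      # HEAD stalled: retry with GET
        except urllib.error.HTTPError as e:
            if method == 'HEAD' and e.code in (403, 405):
                continue                  # server dislikes HEAD: retry with GET
            return False
    return False
```

In the checker itself the default urllib.request.urlopen is used; the timeout guarantees a stalled HEAD can no longer halt the whole run.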

Unable to scrape google news accurately

I'm trying to scrape Google News headlines for a given keyword (e.g. Blackrock) over a given period (e.g. 7-Jan-2012 to 14-Jan-2012).
I'm doing this by constructing the URL and then fetching it with urllib2, as shown in the code below. If I put the constructed URL in a browser, it gives me the correct result; however, if I fetch it through Python, I get news results for the right keyword but for the current period.
Here's the code. Can someone tell me what I'm doing wrong and how I can correct it?
import urllib
import urllib2
import json
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=Blackrock&hl=en&gl=uk&authuser=0&source=lnt&tbs=cdr%3A1%2Ccd_min%3A7%2F1%2F2012%2Ccd_max%3A14%2F1%2F2012&tbm=nws'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html)
text = soup.text
start = text.index('000 results')+11
end = text.index('NextThe selection')
text = text[start:end]
print text
The problem is with your user agent; it works for me with:
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')
You are using a user-agent for Firefox 3, which is about 6 years old.
