BeautifulSoup doesn't seem to parse anything - python

I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've encountered a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing seems to change; no matter what I do, I get the exact same result. Here's my code; can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib
import html5lib

class Websites:

    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())

You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection. Fixed that for you below.
A better way to open connections is the with urllib.request.urlopen(url) as req: syntax, as this closes the connection for you.
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:

    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # read() with parentheses actually reads the response bytes
        soup = BeautifulSoup(html, "html5lib")  # BeautifulSoup accepts the bytes directly
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # Important to close the connection
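For completeness, here is a minimal sketch of the with-statement variant mentioned above; it reuses the same URL and header, and the with block closes the connection automatically:

from bs4 import BeautifulSoup
import urllib.request

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
req = urllib.request.Request("https://free-proxy-list.net", None, header)

with urllib.request.urlopen(req) as response:
    html = response.read()  # connection is closed when the block exits

soup = BeautifulSoup(html, "html5lib")
print(soup.prettify())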
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples

Related

Beautiful soup text returns blank

I'm trying to scrape a website, but it returns blank. Can you please help? What am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://ks.wjx.top/jq/50921280.aspx'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.text)
To get a response, add the User-Agent header to requests.get(); otherwise, the website thinks that you're a bot and will block you.
import requests
from bs4 import BeautifulSoup
URL = "https://ks.wjx.top/jq/50921280.aspx"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())
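If you want to confirm that the missing header really is what triggers the block, a small check is to compare status codes with and without it (a sketch reusing the same URL and header; if the site rejects the default python-requests client, the first status code will typically be an error such as 403):

import requests

URL = "https://ks.wjx.top/jq/50921280.aspx"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

default_client = requests.get(URL)                 # sent with the default python-requests User-Agent
browser_like = requests.get(URL, headers=headers)  # sent with a browser-like User-Agent
print(default_client.status_code, browser_like.status_code)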

requests.get(url) not returning for this specific url

I'm trying to use requests.get(url).text to get the HTML from this website. However, when requests.get(url) is called with this specific url, it never returns no matter how long I wait. This works with other urls, but this one specifically is giving me trouble. Code is below
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.carmax.com/cars/all', allow_redirects=True).text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify().encode('utf-8'))
Thanks for any help!
Try:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
html = requests.get("https://www.carmax.com/cars/all", headers=headers)
soup = BeautifulSoup(html.content, 'html.parser')
print(soup.prettify())
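Since the original complaint is that the call never returns, it may also help to pass requests' timeout argument so a hanging request fails fast instead of blocking forever; a small sketch with a trimmed-down headers dict (the 10-second value is just an example):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}

try:
    html = requests.get("https://www.carmax.com/cars/all", headers=headers, timeout=10)
    soup = BeautifulSoup(html.content, 'html.parser')
    print(soup.prettify())
except requests.exceptions.Timeout:
    print("The request timed out after 10 seconds")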

server sends 403 status code when using requests library in python, but works with browser

I'm trying to automate a login using Python's requests module, but whenever I send a POST or GET request the server returns a 403 status code. The weird part is that I can access the same URL with any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or idea!
It's probably blocking you because the site doesn't want bots (see its /robots.txt); the server rejects the default requests user agent.
Try overriding the user-agent with a custom one.
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file.
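As a side note, a with block is the usual way to make sure the file gets closed even if something raises; the same answer with only that change (a sketch, everything else as above):

import requests
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"

req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, "lxml")

with open("usvisa.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())  # the file is closed automatically when the block exits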

Access denied while scraping

I want to create a script that goes to https://www.size.co.uk/featured/footwear/ and scrapes the content, but somehow when I run the script I get access denied. Here is the code:
from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')
The output is
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>
When I try it with other websites, the code works fine, and when I use Selenium nothing happens, but I still want to know how to bypass this error without using Selenium. However, when I use Selenium on a different website like http://www.footpatrol.co.uk/shop, I get the same Access Denied error. Here is the code for footpatrol:
from selenium import webdriver
driver = webdriver.PhantomJS('C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup
Output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.footpatrol.co.uk/" on this
server.<p>
Reference #18.6202655f.1498945644.110590db
</p></body></html>
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.size.co.uk/'
agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
page = requests.get(url, headers=agent)
print (BS(page.content, 'lxml'))
Try this:
import requests

url = 'https://www.size.co.uk/featured/footwear/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
source = requests.get(url, headers=headers).text
print(source)
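Since the question involves two sites (size.co.uk and footpatrol), one option is a requests.Session that sends the same browser-like User-Agent on every request; a rough sketch (URLs taken from the question, the header value is just an example):

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
})

for url in ("https://www.size.co.uk/featured/footwear/", "http://www.footpatrol.co.uk/shop"):
    page = session.get(url)
    soup = BS(page.content, "lxml")
    print(url, page.status_code, soup.title)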

Why can't I scrape Amazon by BeautifulSoup?

Here is my python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
It works for google.com and many other websites, but it doesn't work for amazon.com.
I can open amazon.com in my browser, but the resulting "soup" is still None.
Besides, I find that it cannot scrape appannie.com either. However, rather than returning None, the code raises an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I suspect that Amazon and App Annie block scraping.
Add a header, then it will work.
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
I just ran into this and found that setting any user agent will work; you don't need to lie about your user agent. (This example is Ruby, using HTTParty.)
response = HTTParty.get(url, headers: { 'User-Agent' => 'Httparty' })
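The same idea in Python with requests would look roughly like this (a sketch; per the answer above, the User-Agent value itself is arbitrary):

import requests

# Any non-default User-Agent string; it does not have to imitate a browser.
response = requests.get("http://www.amazon.com/", headers={"User-Agent": "my-scraper"})
print(response.status_code)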
Add a header:
import urllib2
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
req = urllib2.Request("http://www.amazon.com/", headers=headers)  # pass the headers along with the request
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
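Note that urllib2 only exists on Python 2; under Python 3 the same approach would look roughly like this, using urllib.request (a sketch, with the parser named explicitly):

from bs4 import BeautifulSoup
import urllib.request

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
req = urllib.request.Request("http://www.amazon.com/", headers=headers)
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())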
