BeautifulSoup Not Parsing HTML Correctly inside Try/Except Loop - python

I've got a problem parsing a document with BS4, and I'm not sure what's happening. The response code is OK, the URL is fine, the proxies work, proxy shuffling works as expected, but the soup comes back blank with any parser other than html5lib, and the soup that html5lib produces stops at the <body> tag.
I'm working in Colab and I've been able to run pieces of this function successfully in another notebook, and have gotten as far as being able to loop through a set of search results, make soup out of the links, and grab my desired data, but my target website eventually blocks me, so I have switched to using proxies.
check(proxy) is a helper function that checks a list of proxies before attempting to make a request of my target site. The problem seems to have started when I wrapped the request in try/except. I'm speculating that it might have something to do with the try/except being inside a for loop, but I'm not sure.
What's confounding is that I know the site isn't blocking scrapers/robots generally, as I can use BS4 in another notebook piecemeal and get what I'm looking for.
from bs4 import BeautifulSoup as bs
from itertools import cycle
import time
from time import sleep
import requests
import random

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

ips = []                           # list of proxy addresses to cycle through
proxy_pool = cycle(ips)
search_results_page_attempt = []   # tracks which offsets have been attempted

def scrape_boxrec():
    search_result_pages = [num for num in range(0, 22700, 20)]
    random.shuffle(search_result_pages)
    for i in search_result_pages:
        search_results_page_attempt.append(i)
        proxy = next(proxy_pool)
        proxies = {
            'http': proxy,
            'https': proxy
        }
        if check(proxy) == True:
            url = 'https://boxrec.com/en/locations/people?l%5Brole%5D=proboxer&l%5Bdivision%5D=&l%5Bcountry%5D=&l%5Bregion%5D=&l%5Btown%5D=&l_go=&offset=' + str(i)
            try:
                results_source = requests.get(url, headers=head, timeout=5, proxies=proxies)
                results_content = results_source.content
                res_soup = bs(results_content, 'html.parser')
                # WHY IS IT NOT PARSING THIS PAGE CORRECTLY!!!!
            except Exception as ex:
                print(ex)
        else:
            print("Bad Proxy. Moving On")

def check(proxy):
    check_url = 'https://httpbin.org/ip'
    check_proxies = {
        'http': proxy,
        'https': proxy
    }
    try:
        response = requests.get(check_url, proxies=check_proxies, timeout=5)
        if response.status_code == 200:
            return True
        return False
    except:
        return False

Since nobody took a crack at it, I thought I would come back and post the solution: my use of "X-Requested-With": "XMLHttpRequest" in my head variable was what caused the error. I'm still new to programming, especially to making HTTP requests, but I know it has something to do with Ajax. Anyway, when I removed that entry from the headers I passed with my request, BeautifulSoup parsed the document in full.
This answer, as well as this one, explains in a lot more detail that checking for this header is a common approach to preventing Cross-Site Request Forgery, which is why my request was always coming back empty.
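For anyone who runs into the same thing, here is a stripped-down sketch of the request as it worked for me after dropping that header (proxy rotation omitted, and offset 0 used just as an example):

import requests
from bs4 import BeautifulSoup as bs

# Plain browser-style headers; note there is no 'X-Requested-With' key,
# so the server no longer treats the request as an Ajax call.
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'}

url = 'https://boxrec.com/en/locations/people?l%5Brole%5D=proboxer&l%5Bdivision%5D=&l%5Bcountry%5D=&l%5Bregion%5D=&l%5Btown%5D=&l_go=&offset=0'
results_source = requests.get(url, headers=head, timeout=5)
res_soup = bs(results_source.content, 'html.parser')
print(res_soup.title)   # should now show the page title rather than an empty document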

Related

Site parsing myip.ms

I'm writing a parser for the site https://myip.ms/, specifically for this page: https://myip.ms/browse/sites/1/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714. Everything works fine with that link, but if I go to another page, https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, it does not output any data, although the site structure is the same. I think this may be because the site has a limit on views, or because you need to register, but I can't find out what request I need to send to log in to my account. What should I do?
import requests
from bs4 import BeautifulSoup
import time

link_list = []
URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 YaBrowser/20.12.2.105 Yowser/2.5 Safari/537.36',
    'accept': '*/*'
}
#HOST =

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('td', class_='row_name')
    for item in items:
        links = item.find('a').get('href')
        link_list.append({'link': links})

def parser():
    print(URL)
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parser()
print(link_list)
Use a session ID with your request. It will allow you at least 50 requests per day.
If you use a proxy that supports cookies, this number might be even higher.
So the process is as follows:
Load the page in your browser.
Find the session ID in the request inside your browser's Dev Tools.
Use this session ID in your request (see the sketch below); no headers or additional info are required.
Enjoy the results for 50 requests per day.
Repeat in 24 hours.
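Roughly what the session-ID step looks like with requests. The cookie name 'PHPSESSID' below is only a placeholder; use whichever cookie name your Dev Tools actually shows:

import requests
from bs4 import BeautifulSoup

URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'

# Paste the session cookie copied from your browser's Dev Tools here.
# 'PHPSESSID' is a placeholder name; substitute whatever your browser shows.
cookies = {'PHPSESSID': 'your-session-id-from-dev-tools'}

r = requests.get(URL, cookies=cookies)
soup = BeautifulSoup(r.text, 'html.parser')
links = [td.find('a').get('href') for td in soup.find_all('td', class_='row_name')]
print(links)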

Web scraping with Python using BeautifulSoup 429 error

First I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page, but another HTML page that says:
Misbehaving Content Scraper
Please use robots.txt
Your IP has been rate limited
To check the problem I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Then I get 429 (too many requests).
What can I do to handle this problem? Does it mean I cannot print the Html of the page and does it prevent me to scrape any content of the page? Should I rotate the IP address ?
If you are only hitting the page once and getting a 429, it's probably not that you are hitting them too much. You can't be sure the 429 error is accurate; it's simply what their web server returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on legitimately missing pages, just from a misconfigured server. They may simply return 429 to any bot, so try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:
requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})
to declare yourself as a bot, or
requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
if you wish to mimic a browser more closely. Note that the version numbers in the browser user agent were current at the time of this writing; you may need later ones. Just find the user agent of the browser you use; this page will tell you what it is:
https://www.whatismybrowser.com/detect/what-is-my-user-agent
Some sites return more easily searchable code if you just say you are a bot; for others it's the opposite. It's basically the wild wild west, so you just have to try different things.
Another pro tip: you may have to write your code to have a 'cookie jar', or a way to accept a cookie. Usually it is just an extra line in your request, but I'll leave that for another Stack Overflow question :)
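If it helps, that extra line is often just switching to a requests.Session, which keeps any cookies the site sets between calls; a minimal sketch, using the same placeholder site as above:

import requests

baseurl = 'https://name_of_the_website.com'   # same placeholder as in the question

session = requests.Session()                  # a Session persists cookies across requests automatically
session.headers.update({'User-agent': 'Super Bot Power Level Over 9000'})

# The first request may set cookies; later requests in the same session send them back.
first = session.get(baseurl)
second = session.get(baseurl)
print(second.status_code)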
If you are indeed hitting them a lot, you need to sleep between calls. The 429 is a server-side response completely controlled by them. You will also want to investigate how your code interacts with robots.txt; that's a file usually at the root of the web server with the rules it would like your spider to follow.
You can read about that here: Parsing Robots.txt in python
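Python's standard library also has a parser for it; a small sketch using urllib.robotparser, again with the placeholder site from the question:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://name_of_the_website.com/robots.txt')   # placeholder site
rp.read()

# Check whether a given user agent may fetch a path before actually requesting it.
print(rp.can_fetch('Super Bot Power Level Over 9000', 'https://name_of_the_website.com/some/page'))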
Spidering the web is fun and challenging; just remember that you could be blocked at any time, by any site, for any reason, as you are their guest. So tread nicely :)

Why does Python's requests transform parts of the source code?

I am trying to scrape through Google News search results using Python's requests to get links to different articles. I get the links by using Beautiful Soup.
The problem is that although all the links look normal in the browser's source view, after the operation they are changed: all of them start with "/url?q=", and after the "core" of the link there is a string of characters starting with "&". Also, some characters inside the link are changed; for example the URL:
http://www.azonano.com/news.aspx?newsID=35576
changes to:
http://www.azonano.com/news.aspx%newsID%35576
I'm using standard "getting started" code:
import requests, bs4

url_list = list()
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.select('h3 > a'):
    url_list.append(link.get('href'))

# First link on google news page is:
# https://www.theengineer.co.uk/graphene-sensor-could-speed-hepatitis-diagnosis/
print(url_list[0])   # this line will print the url modified by requests
I know it's possible to get around this problem by using Selenium, but I'd like to know where the root cause of this problem lies with requests (or, more plausibly, not with requests but with the way I'm using it).
Thanks for any help!
You're comparing what you see in a browser with what requests generates, but requests does not send a browser's User-Agent header by default. If you specify one before making the initial request, the response will reflect what you would see in a web browser; it looks like Google serves those requests differently:
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} # I just used a general Chrome 41 user agent header
res = requests.get(url, headers=headers)
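If you still end up with "/url?q=..." style links, you can also unwrap them yourself with the standard library. A small sketch; the example href below is constructed from the wrapping described in the question, not copied from a live result:

from urllib.parse import urlparse, parse_qs

# Example of a wrapped link as described in the question (trailing parameters shortened).
href = '/url?q=http://www.azonano.com/news.aspx%3FnewsID%3D35576&sa=U'

query = urlparse(href).query          # everything after the '?'
real_url = parse_qs(query)['q'][0]    # the 'q' parameter holds the target URL, percent-decoded
print(real_url)                       # http://www.azonano.com/news.aspx?newsID=35576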

Broken Link Checker Fails Head Requests

I am building a broken link checker using Python 3.4 to help ensure the quality of a large collection of articles that I manage. Initially I was using GET requests to check if a link was viable; however, I am trying to be as nice as possible when pinging the URLs I'm checking, so I make sure I never re-check a URL that has already tested as working, and I have attempted to use just HEAD requests.
However, I have found a site that causes this to simply stop. It neither throws an error, nor opens:
https://www.icann.org/resources/pages/policy-2012-03-07-en
The link itself is fully functional. So ideally I'd like to find a way to process similar links. This code in Python 3.4 will reproduce the issue:
import urllib
import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
req = urllib.request.Request(
    URL,
    None,
    {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'
    },
    method='HEAD')
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
As it does not throw an error, I really do not know how to troubleshoot this further beyond narrowing it down to the link that halted the entire checker. How can I check if this link is valid?
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
import ssl

# Skip certificate verification so expired/self-signed certificates don't stop the checker.
ssl._create_default_https_context = ssl._create_unverified_context

def getStatus(url):
    a = requests.get(url, verify=False)
    report = str(a.status_code)
    return report

alllinks = []
passlinks = []
faillinks = []

html_page = urllib.request.urlopen("https://link")   # placeholder for the page whose links you want to check
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    status = getStatus(link.get('href'))
    link = 'URL---->', link.get('href'), 'Status---->', status
    alllinks.append(link)
    if status == '200':
        passlinks.append(link)
    else:
        faillinks.append(link)

print(alllinks)
print(passlinks)
print(faillinks)
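The code above checks every link with a full GET. Since the question is specifically about HEAD requests stalling, here is a sketch of an alternative (not tested against the ICANN page in question): use requests.head with a timeout and redirect following, and fall back to GET when the server doesn't support HEAD:

import requests

def link_status(url, timeout=5):
    """Return the status code for url, preferring a lightweight HEAD request."""
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
    try:
        r = requests.head(url, headers=headers, timeout=timeout, allow_redirects=True)
        # Some servers answer HEAD with 405/501; retry those with GET.
        if r.status_code in (405, 501):
            r = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
        return r.status_code
    except requests.RequestException as ex:
        return str(ex)

print(link_status('https://www.icann.org/resources/pages/policy-2012-03-07-en'))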

Python's Requests or Mechanize to login to a site

I want to start off by apologizing. I know this has more than likely been done more than enough times, and I'm just beating a dead horse, but I'd really like to know how to get this to work. I am trying to use the Requests module for Python in order to log in to a website and verify that it works. I'm also using BeautifulSoup in the code in order to find some strings that I have to use to process the request.
I'm getting hung up on how to properly form the header. What exactly is necessary in the header information?
import requests
from bs4 import BeautifulSoup

session = requests.session()
requester = session.get('http://iweb.craven.k12.nc.us/index.php')
soup = BeautifulSoup(requester.text)
ps = soup.find_all('input')

def getCookieInfo():
    result = []
    for item in ps:
        if (item.attrs['name'] == 'return' and item.attrs['type'] == 'hidden'):
            strcom = item.attrs['value']
            sibling = item.next_sibling.next_sibling.attrs['name']
            result.append(strcom)
            result.append(sibling)
    return result

cookiedInfo = getCookieInfo()

payload = [('username', 'myUsername'),
           ('password', 'myPassword'),
           ('Submit', 'Log in'),
           ('option', 'com_users'),
           ('task', 'user.login'),
           ('return', cookiedInfo[0]),
           (cookiedInfo[1], '1')]

headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://iweb.craven.k12.nc.us',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'
}

r = session.post('http://iweb.craven.k12.nc.us/index.php', data=payload, headers=headers)
r = session.get('http://iweb.craven.k12.nc.us')
soup = BeautifulSoup(r.text)
Also, if it would be better or more Pythonic to utilize the mechanize module, I would be open to suggestions.

Categories

Resources