I want to check whether any content is available on more than 500 webpages, using Beautiful Soup. Below is the script I wrote. It works, but at some point it stops; if I fix one error, a different one appears. I just want to be sure each page has a body. I'm also unsure how to handle timeouts, since the website may need more time to respond.
Method 1:

res = requests.get(full_https_url, timeout=40)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems == '':
    pass
else:
    print('body found')

Method 2:

soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems != '':
    print('body found')
else:
    pass
select() returns a list, not a string, so it will always compare unequal to '' whether the page has a body or not. Just test whether the result is non-empty (an empty list is falsy).
Use try/except to catch the timeout error:
try:
    res = requests.get(full_https_url, timeout=40)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body')
    if elems:
        pass  # do stuff
    else:
        print(f"No body in {full_https_url}")
except requests.exceptions.Timeout:
    print(f"Timeout on {full_https_url}, skipping")
I have a list of links, and for each link I want to check whether it contains a specific sublink and, if so, add this sublink to the initial list. I have this code:
def getAllLinks():
    i = 0
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    while i < len(sourcePaths)+1:
        for path in sourcePaths:
            res = requests.get(f'{baseUrl}{path}')
            soup = BeautifulSoup(res.text)
            next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
            if next_btn:
                for a in next_btn.find_all('a', href=True):
                    linkNextPage = a['href']
                    sourcePaths.append(linkNextPage)
                    i += 1
                break
            else:
                i += 1
                continue
            break
    return sourcePaths

print(getAllLinks())
The first link in the list does not contain the sublink, so it falls into the else case; the code handles that fine. However, the second link in the list does contain the sublink, but the code gets stuck here:
for a in next_btn.find_all('a', href=True):
    linkNextPage = a['href']
    sourcePaths.append(linkNextPage)
    i += 1
The third link also contains the sublink, but my code never gets to look at it. At the end I get a list containing the initial links plus the sublink of the second link repeated 4 times.
I think I'm breaking out of the loops incorrectly somewhere, but I can't figure out how to fix it.
Remove the while loop; it's not needed. Change the selectors as well:
import requests
from bs4 import BeautifulSoup

def getAllLinks():
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    for path in sourcePaths:
        res = requests.get(f'{baseUrl}{path}')
        soup = BeautifulSoup(res.text, "html.parser")
        next_btn = soup.find("p", class_="headline").find("table", {"align": "center"})
        if next_btn:
            anchor = next_btn.find_all("td")[-1].find("a")
            if anchor:
                sourcePaths.append(anchor["href"])
    return sourcePaths

print(getAllLinks())
Output:
['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=100', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0&nrc=100']
Your second break statement never gets executed: the first break already exits the "for" loop, so execution never reaches the second break statement. Put a condition in place that breaks the while loop.
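If you do want to keep a while loop, one way to give it a terminating condition is to walk the (growing) list by index and stop once every path has been processed. This is only a rough sketch of that idea (my addition, not the original poster's code), reusing the '1..99' lookup from the question:

import requests
from bs4 import BeautifulSoup

def getAllLinks():
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0']
    i = 0
    # Process each path exactly once; appending new paths extends the loop,
    # and the condition becomes false once everything has been visited.
    while i < len(sourcePaths):
        res = requests.get(f'{baseUrl}{sourcePaths[i]}')
        soup = BeautifulSoup(res.text, 'html.parser')
        next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
        if next_btn:
            for a in next_btn.find_all('a', href=True):
                if a['href'] not in sourcePaths:  # avoid re-adding known paths
                    sourcePaths.append(a['href'])
        i += 1
    return sourcePaths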
I'm having an issue. The code loops through the list of URLs, but it's not adding the text content of each scraped page to the presults list.
I haven't gotten to the raw text processing yet; I'll probably post a question for that once I get there, if I can't figure it out.
What is wrong here? The length of presults stays at 1 even though the code does seem to be looping through the list of URLs for the scrape...
Here's part of the code I'm having an issue with:
counter=0
for xa in range(0, len(qresults)):
    pageURL = qresults[xa].format()
    pageresp = requests.get(pageURL, headers=headers)
    if pageresp.status_code == 200:
        print(pageURL)
        psoup = BeautifulSoup(pageresp.content, 'html.parser')
        presults = []
        para = psoup.text
        presults.append(para)
        print(len(presults))
    else:
        print("Could not reach domain")
print(len(presults))
Your immediate problem is here:
presults=[]
para=psoup.text
presults.append(para)
On every iteration of the for loop, you replace the existing presults list with a new empty list and add one item to it; on the next iteration you wipe out that result again.
The initialization must be done only once, before the loop:
presults = []
for xa in range(0,len(qresults)):
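Put together, the corrected fragment might look like this (a sketch assuming the same imports and the qresults and headers variables from the question):

presults = []  # create the list once, before the loop

for xa in range(0, len(qresults)):
    pageURL = qresults[xa].format()
    pageresp = requests.get(pageURL, headers=headers)
    if pageresp.status_code == 200:
        print(pageURL)
        psoup = BeautifulSoup(pageresp.content, 'html.parser')
        presults.append(psoup.text)  # append to the one shared list
    else:
        print("Could not reach domain")

print(len(presults))  # now grows by one per successfully fetched page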
Ok, I don't even see you looping through any URLs here, but below is a generic example of how this kind of request can be achieved.
import requests
from bs4 import BeautifulSoup

# The page number is appended below, so base_url ends with "page=".
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.findAll('div', {'class': 'jobs-item'})

    for title in firme:
        title1 = title.findAll('h6')[0].text
        print(title1)
        adresa = title.findAll('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.findAll('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )  # page_line is built here but not yet stored anywhere

    current_page += 1
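Since the original question was about accumulating scraped text, here is a small variant (my addition, with a hypothetical helper name scrape_all_pages) that collects one entry per listing into a list created before the loops, under the same page-structure assumptions as the example above:

import requests
from bs4 import BeautifulSoup

def scrape_all_pages(base_url, last_page):
    """Hypothetical helper: collect one text entry per listing across pages."""
    all_lines = []  # created once, so results accumulate across pages
    for current_page in range(1, last_page + 1):
        r = requests.get(base_url + str(current_page))
        soup = BeautifulSoup(r.text, 'html.parser')
        for item in soup.find_all('div', {'class': 'jobs-item'}):
            title1 = item.find('h6').text
            descriptions = item.find_all('div', {'class': 'description'})
            all_lines.append(f"{title1}\n{descriptions[0].text}\n{descriptions[1].text}")
    return all_lines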
I am making a request to a server... for whatever reason (beyond my comprehension), the server gives me a status code of 200, but when I use Beautiful Soup to grab a list from the HTML, nothing is returned. It only happens on the first page of pagination.
To get around a known bug, I have to loop until the list is not empty.
This works, but it's clunky. Is there a better way to do this, given that I have to keep forcing the request until the list contains an item?
# look for attractions
attraction_list = soup.find_all(attrs={'class': 'listing_title'})

while not attraction_list:
    print('the list is empty')
    try:
        t = requests.Session()
        t.cookies.set_policy(BlockAll)
        page2 = t.get(search_url)
        print(page2.status_code)
        soup2 = BeautifulSoup(page2.content, 'html.parser')
        attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
    except:
        pass
I came up with this.
attraction_list = soup.find_all(attrs={'class': 'listing_title'})

while not attraction_list:
    print('the list is empty')
    for q in range(0, 4):
        try:
            t = requests.Session()
            t.cookies.set_policy(BlockAll)
            page2 = t.get(search_url)
            print(page2.status_code)
            soup2 = BeautifulSoup(page2.content, 'html.parser')
            attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
        except Exception as str_error:
            print('FAILED TO FIND ATTRACTIONS')
            time.sleep(3)
            continue
        else:
            break
It'll try up to 4 times to pull the attractions; if attraction_list ends up as a valid (non-empty) list, it breaks. Good enough.
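For what it's worth, the retry cap can also be expressed without the outer while, using a single bounded loop that stops as soon as the list is non-empty. A minimal sketch (my addition), assuming search_url, BlockAll, and the listing_title class from the snippets above:

import time
import requests
from bs4 import BeautifulSoup

attraction_list = []
for attempt in range(4):  # at most 4 attempts
    try:
        t = requests.Session()
        t.cookies.set_policy(BlockAll)
        page2 = t.get(search_url)
        soup2 = BeautifulSoup(page2.content, 'html.parser')
        attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
    except requests.exceptions.RequestException:
        print('request failed, retrying')
    if attraction_list:
        break          # got a non-empty list, stop retrying
    time.sleep(3)      # wait a bit before the next attempt
else:
    print('no attractions found after 4 attempts')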
I'm trying to scrape data from Brazil's Supreme Court using Python's BeautifulSoup and Requests.
Every one of the 144 links has a number between 1 and 3 at the end (e.g.: http://www.stf.jus.br/portal/remuneracao/listarRemuneracao.asp?periodo=012007&ano=2007&mes=01&folha=3).
There's no pattern in the 'folha' (sheet, in Portuguese) part. Some months are 1, others are 2 or 3; it seems random. When a URL with the wrong number is accessed, the site loads, but with the message 'A folha solicitada não é válida' (the requested sheet is invalid).
In my code (below), after creating a list of the links without the 'sheet' number, I load each page and check whether that message is there. If it is, using a try block, the code appends the next number (2 or 3) to the URL.
But the code doesn't run. Is there a way to use try/except for 3 possible outcomes in the code?
records=[]
for x in links:
    r = requests.get(x+'1')
    soup = BeautifulSoup(r.text, 'html.parser')
    if BeautifulSoup(r.text, 'html.parser') == 'A folha solicitada não é válida':
        try:
            r = requests.get(x+'2')
            soup = BeautifulSoup(r.text, 'html.parser')
            if BeautifulSoup(r.text, 'html.parser') == 'A folha solicitada não é válida':
                try:
                    r = requests.get(x+'3')
                    soup = BeautifulSoup(r.text, 'html.parser')
                else:
                    continue
        else:
            continue
    mes = x[-30:-28]+'/'+x[-28:-24]
    ativos = soup.find_all('table', {'id':'ministros_ativos'})
    ativos = ativos[0]
    for x in range(0,11):
        nome = ativos.find_all('a', {'class':'exibirServidor'})[x].text
        salarios = ativos.contents[3].findAll('td', {'align':'right'})
        salarios_brutos = salarios[::2]
        salarios_liquidos = salarios[1::2]
        for x in salarios_liquidos:
            liquido = x.text
        for x in salarios_brutos:
            bruto = x.text
        records.append((nome, bruto, liquido, mes))
You can use range to create a list of numbers between 1 and 3, and iterate over that list to produce a url. If the response is valid break the loop and continue with your code.
for x in links:
    for i in range(1, 4):
        try:
            r = requests.get(x + str(i))
        except Exception as e:
            continue
        if 'A folha solicitada não é válida' not in r.text:
            break
    else:
        continue
    soup = BeautifulSoup(r.text, 'html.parser')
Notes:
For Python 2 you'll have to turn the error message into unicode (use the u prefix).
requests won't raise an exception for a 404 response, so you don't need a try/except for that; however, other exceptions may occur.
Use except to catch exceptions. else is used after except in a try/except/else block, and is executed if no exceptions occur.
The else clause of the for/else block is executed if the loop doesn't break. Basically it means "continue with the next x if no valid response is received".
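As a small illustration of that for/else behaviour (my addition, not part of the original answer):

# The else clause runs only when the for loop finishes without hitting break.
for i in range(1, 4):
    if i == 2:
        print('found 2, breaking')
        break
else:
    print('never printed, because the loop broke')

for i in range(1, 4):
    pass
else:
    print('printed, because the loop ran to completion')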
I am new to Python and trying to write a scraper to get all the links on a page with multiple pagination. I am calling the following code in a while loop.
page = urllib2.urlopen(givenurl,"",10000)
soup = BeautifulSoup(page, "lxml")
linktags = soup.findAll('span',attrs={'class':'paginationLink pageNum'})
page.close()
BeautifulSoup.clear(soup)
return linktags
It always returns the results of the first URL I passed. Am I doing something wrong?
#uncollected probably had the right answer for you in the comment, but I wanted to expand on it.
If you are calling your exact code nested in a while block, it is going to return right away with the first result. You can do two things here.
I am not sure how you are using the while in your own context, so I am using a for loop here.
Extend a results list, and return the whole list:
def getLinks(urls):
    """ processes all urls, and then returns all links """
    links = []
    for givenurl in urls:
        page = urllib2.urlopen(givenurl, "", 10000)
        soup = BeautifulSoup(page, "lxml")
        linktags = soup.findAll('span', attrs={'class': 'paginationLink pageNum'})
        page.close()
        BeautifulSoup.clear(soup)
        links.extend(linktags)
        # dont return here or the loop is over
    return links
Or instead of returning, you can make it a generator, using the yield keyword. A generator will return each result and pause until the next one is requested:
def getLinks(urls):
    """ generator yields links from one url at a time """
    for givenurl in urls:
        page = urllib2.urlopen(givenurl, "", 10000)
        soup = BeautifulSoup(page, "lxml")
        linktags = soup.findAll('span', attrs={'class': 'paginationLink pageNum'})
        page.close()
        BeautifulSoup.clear(soup)
        # this will return the current results,
        # and pause the state, until the next
        # iteration is requested
        yield linktags
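A short usage sketch (my addition, with a hypothetical urls list): the generator version is consumed by iterating over it, so each pass of the caller's loop fetches exactly one URL:

urls = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical

# Each pass through this loop triggers exactly one urlopen call inside getLinks.
for linktags in getLinks(urls):
    for tag in linktags:
        print(tag.text)

# Or flatten everything into a single list:
all_links = [tag for linktags in getLinks(urls) for tag in linktags]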