I am making a request to a server... for whatever reason (beyond my comprehension), the server will give me a status code of 200, but when I use Beautiful Soup to grab a list from the html, nothing is returned. It only happens on the first page of pagination.
To get around a known bug, I have to loop until the list is not empty.
This works, but it's clunky. Is there a better way to do this, given that I have to keep repeating the request until the list contains an item?
# look for attractions
attraction_list = soup.find_all(attrs={'class': 'listing_title'})
while not attraction_list:
    print('the list is empty')
    try:
        t = requests.Session()
        t.cookies.set_policy(BlockAll)
        page2 = t.get(search_url)
        print(page2.status_code)
        soup2 = BeautifulSoup(page2.content, 'html.parser')
        attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
    except:
        pass
I came up with this.
attraction_list = soup.find_all(attrs={'class': 'listing_title'})
while not attraction_list:
    print('the list is empty')
    for q in range(0, 4):
        try:
            t = requests.Session()
            t.cookies.set_policy(BlockAll)
            page2 = t.get(search_url)
            print(page2.status_code)
            soup2 = BeautifulSoup(page2.content, 'html.parser')
            attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
        except Exception as str_error:
            print('FAILED TO FIND ATTRACTIONS')
            time.sleep(3)
            continue
        else:
            break
It retries the request up to 4 times when an exception is raised, sleeping 3 seconds between attempts, and the outer while loop ends once attraction_list is no longer empty. Good enough.
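One way to make the retry less clunky is to move it into a small helper with a hard cap on attempts, so an endlessly empty page cannot loop forever. This is only a sketch: fetch_until_found, fetch and extract are hypothetical names, and the two callables stand in for the request and the Beautiful Soup lookup from the question.

```python
import time


def fetch_until_found(fetch, extract, max_attempts=4, delay=3.0):
    """Call fetch() and extract() until extract() returns a non-empty list.

    fetch   -- hypothetical callable returning the page body
    extract -- hypothetical callable turning that body into a list of items
    Returns the items, or an empty list if every attempt came back empty.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            items = extract(fetch())
        except Exception as error:  # narrow this to the errors you expect
            print(f'attempt {attempt} failed: {error}')
            items = []
        if items:
            return items
        if attempt < max_attempts:
            time.sleep(delay)  # give the server a moment before retrying
    return []
```

With the code from the question, fetch could be a lambda returning requests.get(search_url).content and extract a lambda calling BeautifulSoup(html, 'html.parser').find_all(attrs={'class': 'listing_title'}). The caller then only has to check whether the returned list is empty.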
Related
I want to check whether any content is available on more than 500 webpages, using Beautiful Soup. This is the script I wrote. It works, but it stops somewhere along the way, and when I fix one error it shows a different one. Below is the code I tried. I just want to be sure the page has a body. I'm unsure how to handle timeouts. Maybe the website needs more time.
method 1:
res = requests.get(full_https_url, timeout=40)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems == '':
    pass
else:
    print('body found')
method 2:
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems != '':
    print('body found')
else:
    pass
select() returns a list, not a string, so it will always compare not equal to '', whether it's successful or not. Just test if the result is not empty.
Use try/except to catch the timeout error.
try:
    res = requests.get(full_https_url, timeout=40)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body')
    if elems:
        print('body found')  # do stuff
    else:
        print(f"No body in {full_https_url}")
except requests.exceptions.Timeout:
    print(f"Timeout on {full_https_url}, skipping")
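To see why comparing against '' can never detect a missing body, here is a tiny illustration that needs no network access. The empty list stands in for what select() returns when nothing matches, and has_match is a hypothetical helper name.

```python
elems = []  # stands in for what select() gives back when nothing matches

print(elems == '')   # False: a list never equals a string
print(elems != '')   # True, even though nothing was found
print(bool(elems))   # False: truthiness reflects whether anything matched


def has_match(result):
    """Return True when a select()-style result contains at least one item."""
    return len(result) > 0
```

This is why method 2 always prints 'body found' while method 1 never takes its first branch; testing the list itself, as in if elems:, is the check that actually varies with the page.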
I have a list of links and for each link I want to check if it contains a specific sublink and add this sublink to the initial list. I have this code:
def getAllLinks():
    i = 0
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    while i < len(sourcePaths)+1:
        for path in sourcePaths:
            res = requests.get(f'{baseUrl}{path}')
            soup = BeautifulSoup(res.text)
            next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
            if next_btn:
                for a in next_btn.find_all('a', href=True):
                    linkNextPage = a['href']
                    sourcePaths.append(linkNextPage)
                    i += 1
                    break
            else:
                i += 1
                continue
            break
    return sourcePaths
print(getAllLinks())
The first link in the list does not contain the sublink, so it's an else case. The code does this OK. However, the second link in the list does contain the sublink, but it gets stuck here:
for a in next_btn.find_all('a', href=True):
    linkNextPage = a['href']
    sourcePaths.append(linkNextPage)
    i += 1
The third link contains the sublink but my code does not get to look at that link. At the end I am getting a list containing the initial links plus 4 times the sublink of the second link.
I think I'm breaking incorrectly somewhere but I can't figure out how to fix it.
Remove the while loop; it's not needed. Change the selectors as well:
import requests
from bs4 import BeautifulSoup
def getAllLinks():
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    for path in sourcePaths:
        res = requests.get(f'{baseUrl}{path}')
        soup = BeautifulSoup(res.text, "html.parser")
        next_btn = soup.find("p",class_="headline").find("table", {"align":"center"})
        if next_btn:
            anchor = next_btn.find_all("td")[-1].find("a")
            if anchor:
                sourcePaths.append(anchor["href"])
    return sourcePaths
print(getAllLinks())
Output:
['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=100', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0&nrc=100']
Your first break statement only exits the inner for loop over the anchors. The second break then exits the for loop over sourcePaths, so the third link is never reached and the while loop starts again from the first path each time. Either give the while loop a condition that actually ends it, or drop it.
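The underlying rule is that break only leaves the loop it sits directly inside. A minimal illustration with made-up data (first_match_per_row and rows are hypothetical names, not part of the question's code):

```python
def first_match_per_row(rows, wanted):
    """Return the first wanted item of each row, scanning every row."""
    found = []
    for row in rows:
        for item in row:
            if item in wanted:
                found.append(item)
                break  # leaves only the inner loop; the next row is still scanned
    return found


rows = [['a', 'b'], ['c', 'x', 'y'], ['z', 'x']]
print(first_match_per_row(rows, {'x', 'y'}))  # ['x', 'x']
```

Because the outer loop has no break of its own, every row is visited, which is the behaviour wanted for the list of source paths.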
I'm having an issue. It loops through the list of URLS, but it's not adding the text content of each page scraped to the presults list.
I haven't gotten to the raw text processing yet. I'll probably post a separate question for that once I get there, if I can't figure it out.
What is wrong here? The length of presults remains at 1 even though it seems to be looping through the list of urls for the scrape...
Here's part of the code I'm having an issue with:
counter=0
for xa in range(0,len(qresults)):
    pageURL=qresults[xa].format()
    pageresp= requests.get(pageURL, headers=headers)
    if pageresp.status_code==200:
        print(pageURL)
        psoup=BeautifulSoup(pageresp.content, 'html.parser')
        presults=[]
        para=psoup.text
        presults.append(para)
        print(len(presults))
    else: print("Could not reach domain")
print(len(presults))
Your immediate problem is here:
presults=[]
para=psoup.text
presults.append(para)
On every for iteration, you replace your existing presults list with the empty list and add one item. On the next iteration, you again wipe out the previous result.
Your initialization must be done only once, before the loop:
presults = []
for xa in range(0,len(qresults)):
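The difference is easy to reproduce without any scraping. In this small sketch the page texts are made-up strings, and wiped and kept are hypothetical names for the buggy and the fixed list:

```python
pages = ['first page text', 'second page text', 'third page text']  # made-up data

# Buggy: the list is recreated on every pass, so only the last item survives.
for text in pages:
    wiped = []
    wiped.append(text)

# Fixed: create the list once, then append inside the loop.
kept = []
for text in pages:
    kept.append(text)

print(len(wiped), len(kept))  # 1 3
```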
Ok, I don't even see you looping through any URLs here, but below is a generic example of how this kind of request can be achieved.
import requests
from bs4 import BeautifulSoup
# the page number is appended below, so the base URL ends with "page="
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1
while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.find_all('div', {'class': 'jobs-item'})
    for title in firme:
        title1 = title.find_all('h6')[0].text
        print(title1)
        adresa = title.find_all('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.find_all('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
    current_page += 1
I have a loop inside a loop, wrapped in try/except. When an error occurs the except block handles it fine, but the loop then continues with the next value. What I need is for the loop to retry the same value where it failed instead of moving on to the next one. How can I do that with my code? [In other languages, such as C++, I would use i--.]
for
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop
for abc in urls:
    try:
        r = urllib2.urlopen(abc)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout) #here is the problem once get error then switch from next value
I need a more pythonic way to do this.
Waiting for a reply. Thank you.
Unfortunately, there is no simple way to step backwards with an iterator in Python:
http://docs.python.org/2/library/stdtypes.html
You may be interested in this Stack Overflow thread:
Making a python iterator go backwards?
For your particular case, I would use a simple while loop:
url = []
i = 0
while i < len(url):  # url is the list of all URLs; it keeps growing because it is updated every day
    data = url[i]
    try:
        # getting data from there
        i += 1
    except:
        # show the error received; i is unchanged, so the loop retries the same position
        pass
The problem with handling it this way is that you risk an infinite loop. For example, if a link is broken, r = urllib2.urlopen(abc) will always raise an exception and you will stay at the same position forever. You should consider doing something like this:
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop
NUM_TRY = 3
for abc in urls:
    for _ in range(NUM_TRY):
        try:
            r = urllib2.urlopen(abc)
            encoding = r.info().getparam('charset')
            html = r.read()
            break #if we arrive to this line, it means no error occur so we don't need to retry again
            #this is why we break the inner loop
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout) #wait, then retry the same url (at most NUM_TRY times)
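The same bounded-retry idea can be written for Python 3 with a for/else, which also records the URLs that never succeeded. This is a sketch under assumptions: fetch_all and fetch are hypothetical names, and fetch stands in for whatever call downloads a page (urllib.request.urlopen, for instance) and raises on failure.

```python
import time


def fetch_all(urls, fetch, num_try=3, retry_timeout=1.0):
    """Fetch every URL, retrying each up to num_try times.

    fetch is a hypothetical callable that returns a page body or raises.
    Returns (pages, failed): bodies keyed by URL, plus the URLs given up on.
    """
    pages, failed = {}, []
    for url in urls:
        for attempt in range(num_try):
            try:
                pages[url] = fetch(url)
                break  # success: stop retrying this URL
            except Exception as error:
                last_error = error
                if attempt < num_try - 1:
                    time.sleep(retry_timeout)
        else:
            # the inner loop ran out of attempts without hitting break
            failed.append((url, last_error))
    return pages, failed
```

The else branch of a for loop runs only when the loop finishes without break, so it is a natural place to note that all attempts were used up.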
I'm scraping all the URL of my domain with recursive function.
But it outputs nothing, without any error.
#!/usr/bin/python
from bs4 import BeautifulSoup
import requests
import tldextract
def scrape(url):
    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
            if link_domain.domain == main_domain.domain :
                problem.append(href)
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.'+ main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)
    return len(problem)
    return scrape(problem)
problem = ["http://xyzdomain.com"]
print(scrape(problem))
When I create a new list, it works, but I don't want to make a list every time for every loop.
You need to structure your code so that it follows the pattern for recursion, which your current code doesn't. You should also avoid rebinding a name to a different kind of object, e.g. href = href.get('href'), because after that line href no longer refers to the tag. As it stands, your code will only ever return len(problem), because that return is reached unconditionally before return scrape(problem):
def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))
for example:
def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
Hello, I've made a non-recursive version of this that appears to get all the links on the same domain.
I've tested the code below using the problem list included in it. Once I'd solved the problems with the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run iteratively. The code and result are below:
from bs4 import BeautifulSoup
import requests
import tldextract
def print_domain_info(d):
    print("Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain,d.subdomain,d.suffix))
SEARCHED_URLS = []
problem = [ "http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]
while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except requests.exceptions.RequestException:
        # If its not working i don't care for it
        print("borked website found: {0}".format(link))
        continue
    # Now we get to this point worth printing something
    print("Trying to parse:{0}".format(link))
    print("Status Code:{0} Thats: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMETHINGS UP" ))
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print("Length Of Data Retrieved:{0}".format(len(data))) # More info
    soup = BeautifulSoup(data, "html.parser") # This was here before so i left it.
    print("Found {0} link{1}".format(len(soup.find_all('a')),"s" if len(soup.find_all('a')) > 1 else ""))
    FOUND_THIS_ITERATION = [] # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS] # Find me all the links i don't got
    for href in found_links:
        href = href.get('href') # You wrote this seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href)
        if link_domain.domain == dInfo.domain: # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
            if href not in FOUND_THIS_ITERATION: # I'ma check you out next time
                print("Check out this link: {0}".format(href))
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else: # I got you already
                print("DUPE LINK!")
        else:
            print("Not on same domain moving on")
    # Count down
    print("We have {0} more sites to search".format(len(problem)))
    if problem:
        continue
    else:
        print("Its been fun")
        print("Lets see the URLS we've visited:")
        for url in SEARCHED_URLS:
            print(url)
Which prints, after a lot of other logging, loads of neocities websites!
What's happening is that the script pops a value off the list of websites yet to visit, then gets all the links on that page which are on the same domain. If those links point to pages we haven't visited, we add them to the list of links to be visited. After that we pop the next page and do the same thing again, until there are no pages left to visit.
I think this is what you're looking for. Get back to us in the comments if this doesn't work the way you want, and if anyone can improve it, please leave a comment.
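Stripped of the logging, the pop-and-visit loop described above comes down to a stack of pending pages plus a record of pages already seen. The sketch below is not the answer's code: crawl and get_links are hypothetical names, get_links stands in for the download-and-parse step, and a set replaces the list of searched URLs so membership checks stay fast as the crawl grows.

```python
def crawl(start_urls, get_links):
    """Visit every page reachable from start_urls exactly once.

    get_links is a hypothetical callable returning the same-domain links
    found on a page; in a real crawler it would download and parse the page.
    """
    to_visit = list(start_urls)   # stack of pages still to visit
    visited = set()               # pages already handled
    order = []
    while to_visit:
        link = to_visit.pop()
        if link in visited:
            continue
        visited.add(link)
        order.append(link)
        for found in get_links(link):
            if found not in visited:
                to_visit.append(found)
    return order


# A made-up site: each page maps to the links it contains.
site = {'/': ['/a', '/b'], '/a': ['/', '/b'], '/b': ['/c'], '/c': []}
print(crawl(['/'], lambda page: site.get(page, [])))  # ['/', '/b', '/c', '/a']
```

Because the loop is iterative, it never runs into the recursion depth limit, no matter how many pages link to each other.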