Python requests-html session GET correct usage

I'm working on a web scraper that needs to open several thousand pages and extract some data.
Since the data field I need most is only loaded after all of the site's JavaScript has run, I'm using requests-html to render the page and then extract the data I need.
I want to know what the best way to do this is:
1- Open a session at the start of the script, do the whole scrape, and close the session when the script finishes, after thousands of "clicks" and several hours?
2- Or open a session every time I open a link, render the page, get the data, close the session, and repeat n times in a cycle?
Currently I'm doing the second option, but I'm running into a problem. This is the code I'm using:
def getSellerName(listingItems):
    for item in listingItems:
        builtURL = item['href']
        try:
            session = HTMLSession()
            r = session.get(builtURL, timeout=5)
            r.html.render()
            sleep(1)
            sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
            ##
            ## Do some stuff with sellerInfo
            ##
            session.close()
        except requests.exceptions.Timeout:
            log.exception("TimeOut Ex: ")
            continue
        except:
            log.exception("Gen Ex")
            continue
        finally:
            session.close()
        break
This works pretty well and is quite fast. However, after about 1.5 to 2 hours, I start getting OS exceptions like this one:
OSError: [Errno 24] Too many open files
And then that's it, I just get this exception over and over again, until I kill the script.
I'm guessing I need to close something else after every get and render, but I'm not sure what or if I'm doing it correctly.
Any help and/or suggestions, please?
Thanks!

You should create the session object outside the loop:
def getSellerName(listingItems):
    session = HTMLSession()
    for item in listingItems:
        # code
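For what it's worth, here is a minimal sketch of that pattern (it reuses the names from the question, such as getSellerName, listingItems, and log, and assumes they exist as shown there): one HTMLSession is created for the whole run and closed once at the end, so render() keeps reusing a single headless browser instead of spawning a new one per page and leaking file handles.
from time import sleep

import requests
from requests_html import HTMLSession

def getSellerName(listingItems):
    # One session (and therefore one headless browser) for the whole scrape.
    session = HTMLSession()
    try:
        for item in listingItems:
            builtURL = item['href']
            try:
                r = session.get(builtURL, timeout=5)
                r.html.render()  # reuses the session's browser
                sleep(1)
                sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
                ##
                ## Do some stuff with sellerInfo
                ##
            except requests.exceptions.Timeout:
                log.exception("TimeOut Ex: ")
                continue
            except Exception:
                log.exception("Gen Ex")
                continue
    finally:
        # Closes the HTTP session and the browser that render() started.
        session.close()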

Related

How can I keep a Python app running 24/7 on a host?

I wrote a mini-app that scrapes my school's website, looks for the title of the latest post, and compares it to the old title; if it's not the same, it sends me an email.
In order for the app to work properly it needs to keep running 24/7 so that the value of the title variable is correct.
Here's the code:
import requests
from bs4 import BeautifulSoup
import schedule, time
import sys
import smtplib

#Mailing Info
from_addr = ''
to_addrs = ['']
message = """From: sender
To: receiver
Subject: New Post
A new post has been published
visit the website to view it:
"""

def send_mail(msg):
    try:
        s = smtplib.SMTP('localhost')
        s.login('email',
                'password')
        s.sendmail(from_addr, to_addrs, msg)
        s.quit()
    except smtplib.SMTPException as e:
        print(e)

#Scraping
URL = ''
title = 'Hello World'

def check():
    global title
    global message
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    main_section = soup.find('section', id='spacious_featured_posts_widget-2')
    first_div = main_section.find('div', class_='tg-one-half')
    current_title = first_div.find('h2', class_='entry-title').find('a')['title']
    if current_title != title:
        send_mail(message)
        title = current_title
    else:
        send_mail("Nothing New")

schedule.every(6).hours.do(check)

while True:
    schedule.run_pending()
    time.sleep(0.000001)
So my question is: how do I keep this code running on the host using cPanel?
I know I can use cron jobs to run it every 2 hours or so, but I don't know how to keep the script itself running; using a terminal doesn't work, because the app gets terminated when I close the page.
So - generally, to run programs for an extended period, they need to be daemonised: essentially disconnected from your terminal with a double fork and a setsid. That said, I've never actually done it myself, since it was usually either (a) the wrong solution, or (b) re-inventing the wheel (https://github.com/thesharp/daemonize).
In this case, I think a better course of action would be to invoke the script every 6 hours, rather than have it internally do something every 6 hours. Making your program resilient to restarts, and putting it in a 'cradle' that automatically restarts it, is pretty much how most systems are kept reliable.
In your case, I'd suggest saving the title to a file, and reading from and writing to that file when the script is invoked. It would make your script simpler and more robust, and you'd be using battle-hardened tools for the job.
A couple of years down the line, when you're writing code that needs to survive the whole machine crashing and being replaced (within 6 hours, with everything installed), you can use some external form of storage (like a database) instead of a file, to make your system even more resilient.
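As a rough sketch of that file-based approach (the state file name last_title.txt is my own assumption; the URL, selectors, send_mail and message come from the question's code), the script does one check per invocation and exits, leaving the every-6-hours schedule to cron:
# Hypothetical run-once version of the checker, meant to be started by cron.
import os
import requests
from bs4 import BeautifulSoup

URL = ''
STATE_FILE = 'last_title.txt'  # assumed state file holding the last seen title

def read_last_title():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return f.read().strip()
    return ''

def write_last_title(title):
    with open(STATE_FILE, 'w') as f:
        f.write(title)

def check_once():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    main_section = soup.find('section', id='spacious_featured_posts_widget-2')
    first_div = main_section.find('div', class_='tg-one-half')
    current_title = first_div.find('h2', class_='entry-title').find('a')['title']
    if current_title != read_last_title():
        send_mail(message)  # send_mail and message as defined in the question
        write_last_title(current_title)

if __name__ == '__main__':
    check_once()
A crontab entry along the lines of 0 */6 * * * /usr/bin/python /path/to/check.py (paths assumed) would then replace the in-process schedule loop.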

Why can't I scrape that large XML file using Python?

Does anyone know why this code doesn't do the job? It works perfectly when I want to scrape smaller files with data from a certain date, e.g. only from 2017, but not with this one. Is this file too big or something? There's no error or anything like that. Every time I run this script with the smaller file mentioned, it takes about 30 seconds to download everything and save it into a database, so I don't think there are mistakes in the code. With this file I just get "Process finished with exit code 0" and nothing more.
from bs4 import BeautifulSoup
import urllib.request
from app import db
from models import CveData
from sqlalchemy.exc import IntegrityError

url = "https://cve.mitre.org/data/downloads/allitems.xml"
r = urllib.request.urlopen(url)
xml = BeautifulSoup(r, 'xml')

vuln = xml.findAll('Vulnerability')

for element in vuln:
    note = element.findAll('Notes')
    title = element.find('CVE').text
    for element in note:
        desc = element.find(Type="Description").text
        test_date = element.find(Title="Published")
        if test_date is None:
            pass
        else:
            date = test_date.text
            data = CveData(title, date, desc)
            try:
                db.session.add(data)
                db.session.commit()
                print("adding... " + title)
            # don't stop the stream, ignore the duplicates
            except IntegrityError:
                db.session.rollback()
I downloaded the file you said didn't work, and the one you said did, and ran these two greps with different results:
grep -c "</Vulnerability>" allitems-cvrf-year-2019.xml
21386
grep -c "</Vulnerability>" allitems.xml
0
The program is not stopping on opening the file; it is running to completion. You aren't getting any output because there are no Vulnerability tags in the XML file. (My grep is not technically accurate, as I believe there could be spaces in the Vulnerability closing tag, but I doubt that is the case here.)
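If you'd rather do the same sanity check from Python than with grep, a small sketch like this streams the response line by line (so the very large file never has to fit in memory; it assumes a closing tag is never split across lines) and shows whether findAll('Vulnerability') could ever match anything:
# Rough Python equivalent of the grep check above.
import urllib.request

url = "https://cve.mitre.org/data/downloads/allitems.xml"
count = 0
with urllib.request.urlopen(url) as r:
    for line in r:  # stream the response; don't load the whole file
        count += line.count(b"</Vulnerability>")
print(count)  # 0 means the findAll('Vulnerability') loop never runs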

Selenium Python: wait until all the HTML of a page is loaded [duplicate]

I don't really have much of an idea about this, so I'd like some advice if you can give it.
Generally, when I use Selenium, I search for the element I'm interested in, but now I'm thinking of developing some kind of performance test to check how much time a specific webpage (HTML, scripts, etc.) takes to load.
Do you have any idea how to measure the load time of the HTML, scripts, etc. without searching for a specific element of the page?
P.S. I use IE or Firefox.
You could check the underlying JavaScript framework for active connections. When there are no active connections, you could then assume the page has finished loading.
That, however, requires that you either know what framework the page uses, or systematically check for different frameworks and then check for connections.
import logging
import time

def get_js_framework(driver):
    frameworks = [
        'return jQuery.active',
        'return Ajax.activeRequestCount',
        'return dojo.io.XMLHTTPTransport.inFlight.length'
    ]
    for f in frameworks:
        try:
            driver.execute_script(f)
        except Exception:
            logging.debug("{0} didn't work, trying next js framework".format(f))
            continue
        else:
            return f
    else:
        return None

def load_page(driver, link):
    timeout = 5
    begin = time.time()
    driver.get(link)
    js = get_js_framework(driver)
    if js:
        while driver.execute_script(js) and time.time() < begin + timeout:
            time.sleep(0.25)
    else:
        time.sleep(timeout)
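Since the goal is a performance test, a minimal usage sketch might wrap load_page and report the elapsed time (the driver construction and URL below are placeholders, not part of the answer above):
# Hypothetical usage: time how long load_page() takes for a given URL.
import time
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Ie(), per the question
start = time.time()
load_page(driver, "https://example.com")  # placeholder URL
print("Page finished loading in {:.2f}s".format(time.time() - start))
driver.quit()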

Python Selenium: Unable to Find Element After First Refresh

I've seen a few instances of this question, but I was not sure how to apply the changes to my particular situation. I have code that monitors a webpage for changes and refreshes every 30 seconds, as follows:
import sys
import ctypes
from time import sleep
from Checker import Checker

USERNAME = sys.argv[1]
PASSWORD = sys.argv[2]

def main():
    crawler = Checker()
    crawler.login(USERNAME, PASSWORD)
    crawler.click_data()
    crawler.view_page()
    while crawler.check_page():
        crawler.wait_for_table()
        crawler.refresh()
    ctypes.windll.user32.MessageBoxW(0, "A change has been made!", "Attention", 1)

if __name__ == "__main__":
    main()
The problem is that Selenium will always show an error stating it is unable to locate the element after the first refresh has been made. The element in question, I suspect, is a table from which I retrieve data using the following function:
def get_data_cells(self):
    contents = []
    table_id = "table.datadisplaytable:nth-child(4)"
    table = self.driver.find_element(By.CSS_SELECTOR, table_id)
    cells = table.find_elements_by_tag_name('td')
    for cell in cells:
        contents.append(cell.text)
    return contents
I can't tell if the issue is in the above function or in the main(). What's an easy way to get Selenium to refresh the page without returning such an error?
Update:
I've added a wait function and adjusted the main() function accordingly:
def wait_for_table(self):
    table_selector = "table.datadisplaytable:nth-child(4)"
    delay = 60
    try:
        wait = ui.WebDriverWait(self.driver, delay)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, table_selector)))
    except TimeoutError:
        print("Operation timeout! The requested element never loaded.")
Since the same error is still occurring, either my timing function is not working properly or it is not a timing issue.
I've run into the same issue while doing web scraping before and found that re-sending the GET request (instead of refreshing) seemed to eliminate it.
It's not very elegant, but it worked for me.
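If you want to try that, a minimal sketch (assuming the same Checker class and its self.driver) would replace the refresh with a fresh GET of the current URL:
# Sketch: re-issue the GET instead of refreshing (assumes self.driver exists).
def refresh(self):
    self.driver.get(self.driver.current_url)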
I appear to have fixed my own problem.
My refresh() function was written as follows:
def refresh(self):
    self.driver.refresh()
All I did was switch frames right after the refresh() call. That is:
def refresh(self):
    self.driver.refresh()
    self.driver.switch_to.frame("content")
This took care of it. I can see that the page is now refreshing without issues.

Downloading rss content in python

I am creating a Python application that downloads a list of RSS content from the internet. I have a list of 10 URLs I need to download.
I am using the urllib2 library provided by Python. The code I am using is:
for url in urls:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
    with open(self.SERVER_PATH + "\\feeds\\" + str(feedID) + str(extension), "w") as requiredData:
        requiredData.write(str(data))
        requiredData.close()
Here the first URL is downloaded, but while downloading the next URL I get an error:
<urlopen error [Errno 66] unknown>
Is there any event which can notify me for the completion of the downloading of the first URL? Or is there any other way with the help of which I can avoid this issue?
Thanks in advance.
Is there any event which can notify me for the completion of the downloading of the first URL?
The raised exception is the notification that the URL cannot be downloaded.
Or is there any other way with the help of which I can avoid this issue?
Yes, you can catch the exception.
try:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
except URLError:
    # do stuff which handles the error
    pass
I'm not perfectly sure that's the error you need to catch, but hopefully you have the skills to work out exactly what to catch (if it's not URLError).
As a follow-up to John Mee's answer, and after reading your comments, you can try something like the following:
def complete_urlopen(url):
    complete = False
    while not complete:
        try:
            obj = urllib2.urlopen(url)
            complete = True
        except URLError, e:
            pass
    return obj.read()
And then use it by replacing:
rssObject = urllib2.urlopen(url)
data = rssObject.read()
with:
data = complete_urlopen(url)
Be aware that this code assumes that the urlopen() will eventually succeed. If it never does, your process will hang in that while loop forever. A more sophisticated version could cap the number of iterations, so that when the limit is reached, the process exits.
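A hedged sketch of that bounded version might look like the following (the max_attempts and delay values are arbitrary choices for illustration, not anything urllib2 requires):
# Sketch: retry with an upper bound instead of looping forever.
import time
import urllib2
from urllib2 import URLError

def complete_urlopen(url, max_attempts=5, delay=2):
    for attempt in range(max_attempts):
        try:
            return urllib2.urlopen(url).read()
        except URLError:
            time.sleep(delay)  # brief pause before retrying
    raise RuntimeError("Giving up on %s after %d attempts" % (url, max_attempts))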
