Kill Python's sub-processes

I'm using requests_html to scrape some site :
from requests_html import HTMLSession

for i in range(0, 30):
    session = HTMLSession()
    r = session.get('https://www.google.com')
    r.html.render()
    del session
This code creates more than 30 chromium sub-processes under the Python process, and they hold on to memory. How can I remove them?
I don't want to use psutil, since it would add one more dependency; Python may have a built-in way to kill its own sub-processes, and I'd like to know if it does.
I can't even use exit(), because this runs inside a method that has to return, and of course I can't exit and then return.

You might want to try closing the session:
session = HTMLSession()
session.close()
See requests_html.HTMLSession.close.
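Applied to the loop from the question, a minimal sketch of that change (close() replaces the del, so the chromium process started by render() can be shut down) might look like this:
from requests_html import HTMLSession

for i in range(0, 30):
    session = HTMLSession()
    r = session.get('https://www.google.com')
    r.html.render()
    session.close()   # closes the session and the browser process it spawned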

Related

threading: function seems to run as a blocking loop although I am using threading

I am trying to speed up web scraping by running my http requests in a ThreadPoolExecutor from the concurrent.futures library.
Here is the code:
import concurrent.futures
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            pass
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)
However, looking at how the script prints in the CLI, it seems that the requests are sent in a blocking loop.
Additionally, if I run the code with the plain loop below, I can see that it takes roughly the same time.
for u in urls:
    get_url(u)
I have had some success implementing concurrency with this library before, and I am at a loss as to what is going wrong here.
I am aware of the asyncio library as an alternative, but I would be keen on using threading instead.
You're not actually running your get_url calls as tasks; you call them in the main thread, and pass the result to executor.submit, experiencing the concurrent.futures analog to this problem with raw threading.Thread usage. Change:
results = {executor.submit( get_url(url)) : url for url in urls}
to:
results = {executor.submit(get_url, url) : url for url in urls}
so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
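As a rough sketch of the corrected block (the rest of the question's code unchanged), calling future.result() inside the loop also lets any exception raised in a worker propagate, so the except branch can actually do something:
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit(get_url, url): url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            future.result()  # re-raises any exception from the worker thread
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)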

Unable to let a script populate results using concurrent.futures in a customized manner

I've created a script in Python to scrape the user_name from a site's landing page and the title from its inner page. I'm trying to use the concurrent.futures library to perform parallel tasks. I know how to use executor.submit() within the script below, so I'm not interested in going that way. I would like to use executor.map(), which I've already set up (perhaps in the wrong way) within the following script.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

URL = "https://stackoverflow.com/questions/tagged/web-scraping"
base = "https://stackoverflow.com"

def get_links(s, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".summary"):
        user_name = item.select_one(".user-details > a").get_text(strip=True)
        post_link = urljoin(base, item.select_one(".question-hyperlink").get("href"))
        yield s, user_name, post_link

def fetch(s, name, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    title = soup.select_one("h1[itemprop='name'] > a").text
    return name, title

if __name__ == '__main__':
    with requests.Session() as s:
        with futures.ThreadPoolExecutor(max_workers=5) as executor:
            link_list = [url for url in get_links(s, URL)]
            for result in executor.map(fetch, *link_list):
                print(result)
I get the following error when I run the above script as is:
TypeError: fetch() takes 3 positional arguments but 50 were given
If I run the script after modifying that line to link_list = [url for url in get_links(s,URL)][0], I get the following error:
TypeError: zip argument #1 must support iteration
How can I successfully execute the above script keeping the existing design intact?
Because fetch takes 3 arguments (s, name, url), you need to pass 3 iterables to executor.map().
When you do this:
executor.map(fetch, *link_list)
*link_list unpacks into 49 or so tuples, each with 3 elements (the Session object, username, and URL). That's not what you want.
What you need to do is first transform link_list into 3 separate iterables (one for the Session objects, another for the usernames, and one for the urls). Instead of doing this manually, you can use zip() and the unpacking operator twice, like so:
for result in executor.map(fetch, *zip(*link_list)):
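To see why this works: zip(*link_list) transposes the list of (session, user_name, post_link) tuples into three parallel tuples, which is exactly the shape executor.map() expects. A tiny illustration with made-up values:
link_list = [('session', 'alice', 'url1'), ('session', 'bob', 'url2')]   # hypothetical values
columns = list(zip(*link_list))
# columns == [('session', 'session'), ('alice', 'bob'), ('url1', 'url2')]
# executor.map(fetch, *columns) then calls fetch('session', 'alice', 'url1'), fetch('session', 'bob', 'url2')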
Also, when I tested your code, an exception was raised in get_links:
user_name = item.select_one(".user-details > a").get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'get_text'
item.select_one returned None, which obviously doesn't have a get_text() method, so I just wrapped that part in a try/except block, caught the AttributeError, and continued the loop.
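That change would look roughly like this inside get_links (the try/except is my addition, not part of the original script):
def get_links(s, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".summary"):
        try:
            user_name = item.select_one(".user-details > a").get_text(strip=True)
            post_link = urljoin(base, item.select_one(".question-hyperlink").get("href"))
        except AttributeError:
            continue  # select_one() returned None for this item, skip it
        yield s, user_name, post_link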
Also note that Requests' Session class isn't thread-safe. Luckily, the script returned sane responses when I ran it, but if you need your script to be reliable, you need to address this. A comment in the second link shows how to use one Session instance per thread via thread-local data (a sketch of that pattern follows the links below). See:
Document threading contract for Session class
Thread-safety of FutureSession
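A minimal sketch of that per-thread Session pattern (using threading.local(); the helper name is my own), which fetch() could use instead of receiving s as an argument:
import threading
import requests

thread_local = threading.local()

def get_session():
    # each worker thread lazily creates and then re-uses its own Session
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

# inside fetch(): res = get_session().get(url)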

Load URL without graphical interface

In Python 3, I need to load a URL at a set interval, but without a graphical interface or browser window. There is no JavaScript; all it needs to do is load the page and then quit. This needs to run as a console application.
Is there any way to do this?
You could use threading and create a Timer that calls your function after every specified interval of time.
import time, threading, urllib.request

def fetch_url():
    threading.Timer(10, fetch_url).start()   # re-schedule this function to run again in 10 seconds
    req = urllib.request.Request('http://www.stackoverflow.com')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

fetch_url()
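If you ever need to stop the polling cleanly, one option (a sketch; the timer variable name is my own) is to keep a reference to the most recent Timer so it can be cancelled, and mark it as a daemon so it doesn't keep the process alive:
import urllib.request, threading

timer = None

def fetch_url():
    global timer
    timer = threading.Timer(10, fetch_url)
    timer.daemon = True      # a daemon timer won't keep the interpreter alive on exit
    timer.start()
    req = urllib.request.Request('http://www.stackoverflow.com')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

fetch_url()
# ... later, to stop re-scheduling:
# timer.cancel()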
The requests library may have what you're looking for.
import requests, time
url = "url.you.need"
website_object = requests.get(url)
# Repeat as necessary
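Since the question asks for loading the page at a set interval from a console application, a minimal sketch building on that (URL and interval are placeholders):
import time
import requests

URL = "https://www.example.com"   # placeholder: the page you need to load
INTERVAL = 60                     # placeholder: seconds between loads

while True:
    response = requests.get(URL)  # loads the page without any browser window
    # nothing else to do with the response; just wait for the next round
    time.sleep(INTERVAL)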

Ghost.py: keep window shown

I'm currently using ghost.py, which has a show() function. When I call it, it shows the website but instantly closes it. How do I keep it open?
from ghost import Ghost
import PySide

ghost = Ghost()
with ghost.start() as session:
    page, resources = session.open("https://www.instagram.com/accounts/login/?force_classic_login")
    session.set_field_value("input[name=username]", "joe")
    session.set_field_value("input[name=password]", "test")
    session.show()
    session.evaluate("alert('test')")
The session preview will remain open until the session exits; when you leave the session context, session.exit() is implicitly called. To keep it open you need to either not exit the session context or not use a session context at all.
The former can be achieved like so:
from ghost import Ghost
import PySide

ghost = Ghost()
with ghost.start() as session:
    page, resources = session.open("https://www.instagram.com/accounts/login/?force_classic_login")
    session.set_field_value("input[name=username]", "joe")
    session.set_field_value("input[name=password]", "test")
    session.show()
    session.evaluate("alert('test')")
    # other python code
The latter can be achieved like so:
from ghost import Ghost
import PySide

ghost = Ghost()
session = ghost.start()
page, resources = session.open("https://www.instagram.com/accounts/login/?force_classic_login")
session.set_field_value("input[name=username]", "joe")
session.set_field_value("input[name=password]", "test")
session.show()
session.evaluate("alert('test')")
# other python code
The session will, however, inevitably exit when the Python process ends. Also worth noting is that some operations return as soon as the initial HTTP request has completed. If you wish to wait until other resources have loaded, you may need to call session.wait_for_page_loaded(). I have also found that some form submissions require a call to session.sleep() to behave as expected.
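For example, a minimal sketch of that waiting step, using only the calls already mentioned above and the question's own URL:
page, resources = session.open("https://www.instagram.com/accounts/login/?force_classic_login")
session.wait_for_page_loaded()   # block until the page and its resources have finished loading
session.show()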

Fire off function without waiting for answer (Python)

I have a stream of links coming in, and I want to check them for RSS every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary, and I'd like to just fire and forget the get_rss() function (it stores its results elsewhere).
My code looks like this:
self.ff.get_rss(url)  # not async
print 'im back!'
(...)

def get_rss(url):
    page = urllib2.urlopen(url)  # not async
    soup = BeautifulSoup(page)
I'm thinking that if I can fire and forget the first call, then I can even use urllib2 without worrying about it not being async. Any help is much appreciated!
Edit:
Trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The Greenlet seems to be registered, but the function self.ff.do_url(url) doesn't seem to run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
from multiprocessing import Process

def fire_and_forget(arg_one):
    # do stuff
    ...

def main_function(arg_one):
    p = Process(target=fire_and_forget, args=(arg_one,))
    # set daemon to True so we don't have to wait for the process to join
    p.daemon = True
    p.start()
    return "doing stuff in the background"
Here is sample code for thread-based method invocation; additionally, a desired threading.stack_size can be set to boost performance.
import threading
import requests

# The stack size set by threading.stack_size is the amount of memory
# to allocate for the call stack in new threads.
threading.stack_size(524288)

def alpha_gun(url, json, headers):
    # r = requests.post(url, data=json, headers=headers)
    r = requests.get(url)
    print(r.text)

def trigger(url, json, headers):
    threading.Thread(target=alpha_gun, args=(url, json, headers)).start()

url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload = "{}"
headers = {
    'Content-Type': 'application/json'
}

for i in range(10):
    print(i)
    # fire the request in a background thread when the condition is met
    if i == 5:
        trigger(url=url, json=payload, headers=headers)
        print('invoked')
You want to use the threading module or the multiprocessing module and save the result in a database, a file, or a queue.
You can also use gevent.
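Regarding the gevent attempt in the question's edit: a spawned greenlet only runs once the current greenlet yields control (for example via gevent.sleep() or a monkey-patched blocking call), so a hedged sketch along those lines might look like this (names and URL taken from the question; this is an illustration, not a drop-in fix):
import gevent
from gevent import monkey
monkey.patch_all()          # make blocking socket/urllib2 calls cooperative

import urllib2

def do_url(url):            # stand-in for self.ff.do_url from the question
    page = urllib2.urlopen(url).read()
    print 'fetched %d bytes' % len(page)

g = gevent.spawn(do_url, 'http://nyti.ms/SuVBCl')
print 'back'                # returns immediately; the fetch runs in the greenlet
gevent.sleep(0)             # yield control so the spawned greenlet can start
# before the program exits, g.join() makes sure the fetch actually finished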
