Load a URL without a graphical interface - Python

In Python 3, I need to load a URL at a set interval, but without a graphical interface / browser window. There is no JavaScript involved; all it needs to do is load the page and then quit. This needs to run as a console application.
Is there any way to do this?

You could use threading and create a Timer that calls your function after every specified interval of time.
import time, threading, urllib.request

def fetch_url():
    # schedule the next call before doing the work
    threading.Timer(10, fetch_url).start()
    req = urllib.request.Request('http://www.stackoverflow.com')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

fetch_url()

The requests library may have what you're looking for.
import requests, time
url = "url.you.need"
website_object = requests.get(url)
# Repeat as necessary
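If you want the fetch to repeat on a fixed interval, a minimal sketch (assuming a 10-second period and a placeholder URL) could look like this:
import time
import requests

url = "http://www.example.com"  # placeholder; use the URL you need

while True:
    response = requests.get(url)  # load the page
    # nothing else to do with the response, so just discard it
    time.sleep(10)  # wait before the next fetch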

Related

threading: function seems to run as a blocking loop although I am using threading

I am trying to speed up web scraping by running my http requests in a ThreadPoolExecutor from the concurrent.futures library.
Here is the code:
import concurrent.futures
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            pass
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)
However, looking at how the script prints in the CLI, it seems that the requests are sent in a blocking loop.
Additionally, if I run the code as below, I can see that it takes roughly the same time.
for u in urls:
    get_url(u)
I have had some success implementing concurrency using this library before, and I am at a loss as to what is going wrong here.
I am aware of the existence of the asyncio library as an alternative, but I would be keen on using threading instead.
You're not actually running your get_url calls as tasks; you call them in the main thread, and pass the result to executor.submit, experiencing the concurrent.futures analog to this problem with raw threading.Thread usage. Change:
results = {executor.submit( get_url(url)) : url for url in urls}
to:
results = {executor.submit(get_url, url) : url for url in urls}
so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
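For reference, a minimal sketch of the corrected submission loop (reusing your get_url and urls, and calling future.result() instead of pass so exceptions raised in the workers are surfaced) might look like this:
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    # pass the callable and its argument separately; submit() schedules the call in a worker thread
    results = {executor.submit(get_url, url): url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            future.result()  # re-raises any exception from the worker
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)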

Why does requests.post get no response from the Clustal Omega service?

import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script. I send this request to the website, but the result looks as if I did nothing; the web service doesn't seem to receive my request. This method has worked fine with other websites. Could the problem be that this page shows a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
you can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to access the right page
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
this will submit a job with your input data, it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then generate another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programmatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data = q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]
#check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)
#it might take some time for the job to finish
time.sleep(10)
#download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))
#prints the full response
#print(r.text)
#isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)

Python: Get pywebview Current URL

I am using the pywebview library to open a page that will redirect the user to another URL. What I would like to do is get the URL the user is directed to.
My code so far:
import urllib.request
import urllib.parse
import webview
import threading
import time

def openwebview():
    time.sleep(1)
    page = webview.create_window("URL_that_redirects_user")

def geturl():
    pass  # what goes here?

t = threading.Thread(target = openwebview)
t.start()
I am using Windows, thanks!
Author of pywebview here. There is currently no built-in way to get the current URL; you would have to dig into the underlying webview to get it.
Thanks for the suggestion, I will look into introducing this feature.
Now you can do it:
def geturl():
    print(webview.get_current_url())
See here:
https://github.com/r0x0r/pywebview/blob/master/examples/get_current_url.py
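Putting the pieces together, a minimal sketch in the spirit of that example (assuming an older pywebview API where create_window blocks and get_current_url is a module-level function; the redirecting URL is a placeholder) might look like this:
import threading
import time
import webview

def print_current_url():
    # give the page time to load and redirect before reading the URL
    time.sleep(5)
    print(webview.get_current_url())

t = threading.Thread(target=print_current_url)
t.start()

# create_window blocks until the window is closed, so start the reader thread first
webview.create_window("Redirect example", "http://example.com/redirecting-page")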

Fire off function without waiting for answer (Python)

I have a stream of links coming in, and I want to check them for RSS every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary; I'd like to just fire-and-forget the get_rss() call (it stores its results elsewhere).
My code looks like this:
self.ff.get_rss(url) # not async
print 'im back!'
(...)

def get_rss(url):
    page = urllib2.urlopen(url) # not async
    soup = BeautifulSoup(page)
I'm thinking that if I can fire-and-forget the first call, then I can even use urllib2 without worrying about it not being async. Any help is much appreciated!
Edit:
I'm trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The Greenlet seems to be registered, but the function self.ff.do_url(url) doesn't seem to run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
from multiprocessing import Process

def fire_and_forget(arg_one):
    # do stuff
    ...

def main_function():
    p = Process(target=fire_and_forget, args=(arg_one,))
    # set daemon=True so we don't have to wait for the process to join
    p.daemon = True
    p.start()
    return "doing stuff in the background"
Here is sample code for thread-based method invocation. Additionally, a desired threading.stack_size can be set to boost performance.
import threading
import requests

# The stack size set by threading.stack_size is the amount of memory to
# allocate for the call stack in each thread.
threading.stack_size(524288)

def alpha_gun(url, json, headers):
    #r=requests.post(url, data=json, headers=headers)
    r = requests.get(url)
    print(r.text)

def trigger(url, json, headers):
    threading.Thread(target=alpha_gun, args=(url, json, headers)).start()

url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload = "{}"
headers = {
    'Content-Type': 'application/json'
}

for i in range(10):
    print(i)
    # fire the request only when the condition is met
    if i == 5:
        trigger(url=url, json=payload, headers=headers)
        print('invoked')
You want to use the threading module or the multiprocessing module and save the result either in a database, a file, or a queue.
You can also use gevent.
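For instance, a minimal sketch of the threading-plus-queue idea (Python 3 naming and a hypothetical feed URL; your get_rss would put its result on the queue instead of returning it):
import threading
import queue
import urllib.request

results = queue.Queue()

def get_rss(url):
    # fetch in a background thread and hand the result back via the queue
    page = urllib.request.urlopen(url).read()
    results.put((url, page))

threading.Thread(target=get_rss, args=('http://example.com/feed',), daemon=True).start()
print('im back!')  # the main flow continues immediately
# later, drain the queue with results.get() whenever convenient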

Script to connect to a web page

Looking for a Python script that would simply connect to a web page (maybe with some query-string parameters).
I am going to run this script as a batch job on Unix.
urllib2 will do what you want and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
It's also nice because it's easy to modify the above code to do all sorts of other things like POST requests, Basic Authentication, etc.
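For instance, a minimal sketch of adding Basic Authentication with urllib2 (placeholder URL and credentials) might look like this:
import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means the credentials apply to any realm at this URL
password_mgr.add_password(None, "http://someurl", "user", "password")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
data = opener.open("http://someurl").read()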
Try this:
import urllib2

aResp = urllib2.urlopen("http://google.com/")
print aResp.read()
If you need your script to actually function as a user of the site (clicking links, etc.) then you're probably looking for the python mechanize library.
Python Mechanize
A simple wget called from a shell script might suffice.
in python 2.7:
import urllib2
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urllib2.urlopen(url+"?"+params).read()
print html
more info at https://docs.python.org/2.7/library/urllib2.html
in python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urlopen(url+"?"+params).read()
print(html)
more info at https://docs.python.org/3.6/library/urllib.request.html
to encode params into GET format:
def myEncode(dictionary):
    result = ""
    for k in dictionary: #k is the key
        result += k+"="+dictionary[k]+"&"
    return result[:-1] #all but the last `&`
I'm pretty sure this should work in either python2 or python3...
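For example, calling it with a small dict of hypothetical values:
params = myEncode({'param1': 'value1', 'param2': 'value2'})
print(params)  # param1=value1&param2=value2 (dict order may vary on older Pythons)
Note that the standard library's urllib.urlencode (Python 2) / urllib.parse.urlencode (Python 3) does the same job and also percent-escapes special characters.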
What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com
You might want to simply use httplib from the standard library.
import httplib

myConnection = httplib.HTTPConnection('www.example.com')
you can find the official reference here: http://docs.python.org/library/httplib.html
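A minimal sketch of actually issuing a GET request with httplib (Python 2 naming; the host and path are placeholders) might be:
import httplib

conn = httplib.HTTPConnection('www.example.com')
conn.request('GET', '/?param1=value1')
resp = conn.getresponse()
print resp.status, resp.reason
body = resp.read()
conn.close()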
