I am working on an open-source project called RubberBand, which does what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem. If a comma is located in a string (e.g. "http:"), it will return an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband
CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse

def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if (CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests

def execute_from_url(url):
    exec(requests.get(url).content)
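Note the difference from the question's code: this execs the downloaded file's content, whereas exec(url) in the original executes the URL string itself. A URL is not valid Python, so punctuation such as the colon in "http:" (or a comma) would raise a SyntaxError, which would explain the reported error.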
You should use a return statement in your if (url == None): block, as there is no point in carrying on with your function.
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
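For instance, a quick sanity check (hypothetical URL) shows commas pass through urlparse untouched:

import urlparse  # Python 2, as in the question

# host and path come back intact, commas and all
print urlparse.urlparse('http://example.com/path,with,commas')[1:3]
# -> ('example.com', '/path,with,commas')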
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Never mind that error message; that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest checking this question.
Avoid commas in URLs, that's my suggestion.
Can I use commas in a URL?
This seems to work well for me:
import urllib

# urlretrieve downloads the file to a local temp file;
# fn is its filename, hd the response headers
(fn, hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using Python bundled with third-party software (Abaqus), which makes it a real headache to add packages.
Related
I am trying to load a torrent file into rtorrent using XML-RPC with the following Python 3 code:
import xmlrpc.client

server_url = "https://%s:%s#%s/xmlrpc" % ('[REDACTED]', '[REDACTED]', '[REDACTED]')
server = xmlrpc.client.Server(server_url)

with open("test.torrent", "rb") as torrent:
    server.load.raw_verbose(xmlrpc.client.Binary(torrent.read()),
                            "d.delete_tied=", "d.custom1.set=Test",
                            "d.directory.set=/home/[REDACTED]/files")
The load_raw command returns without an error (return code 0), but the torrent does not appear in the ruTorrent UI. I seem to be experiencing the same thing as this reddit post describes, but I am using the Binary class without any luck.
I am using a Whatbox seedbox.
EDIT:
After enabling logging I am seeing
1572765194 E Could not create download, the input is not a valid torrent.
when trying to load the torrent file, however manually loading the torrent file through the rutorrent UI works fine.
I needed to add "" as the first argument:
server.load.raw_verbose("",xmlrpc.client.Binary(torrent.read()),"d.delete_tied=","d.custom1.set=Test","d.directory.set=/home/[REDACTED]/files")
Not sure why, the docs don't seem to show this is needed.
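A plausible explanation (not from the original answer): newer rtorrent versions expect a "target" as the first argument of every XML-RPC command, and commands that do not act on an existing download take an empty string as that target, which would be why the extra "" makes load.raw_verbose work.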
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle them).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the built-in step debugger (pdb), as in the sketch after this list
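A minimal sketch of the third option (drop these lines wherever you want execution to pause, here before the suspect call):

import pdb; pdb.set_trace()  # execution stops here with an interactive prompt
download_book(book_url[0], book_name[0])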
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well; see the sketch after this list).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc).
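For the first point, a minimal sketch of failing loudly instead of silently parsing an error page (url is the question's start_url):

import requests

response = requests.get(url)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
html = response.text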
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup); a sketch follows the quote below. For why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
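A minimal sketch of the BeautifulSoup route (the start URL is from the question; extracting every link is a starting point, since the exact filtering depends on the wiki page's markup):

import requests
from bs4 import BeautifulSoup

start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')

# list every link on the page; filter from here instead of regexing raw HTML
for a in soup.find_all('a', href=True):
    print(a.get_text(), a['href'])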
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>; try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error. I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
I'm extremely new to coding in general; I delved into this project in order to help my friend tag her fifteen thousand and some-odd posts on Tumblr. We've finally finished, but she wants to be sure that we haven't missed anything...
So, I've scoured the internet trying to find a coding solution. I came across a script, found here, that allegedly does exactly what we need -- so I downloaded Python, and... it doesn't work.
More specifically, when I click on the script, a black box appears for about half a second and then disappears. I haven't been able to screenshot the box to find out exactly what it says, but I believe it says there's a syntax error. At first, I tried with Python 2.4; it didn't seem to find the json module the creator uses, so I switched to Python 3.3 -- the most recent version for Windows -- and this is where the syntax errors occur.
#!/usr/bin/python
import urllib2
import json

hostname = "(Redacted for Privacy)"
api_key = "(Redacted for Privacy)"
url = "http://api.tumblr.com/v2/blog/" + hostname + "/posts?api_key=" + api_key

def api_response(url):
    req = urllib2.urlopen(url)
    return json.loads(req.read())

jsonresponse = api_response(url)
post_count = jsonresponse["response"]["total_posts"]
increments = (post_count + 20) / 20

for i in range(0, increments):
    jsonresponse = api_response(url + "&offset=" + str((i * 20)))
    posts = jsonresponse["response"]["posts"]
    for i in range(0, len(posts)):
        if not posts[i]["tags"]:
            print posts[i]["post_url"]

print("All finished!")
So, um, my question is this: if this code has a syntax error that could be fixed so it can find the untagged posts on Tumblr, what might that error be?
If this code is outdated (either via Tumblr or via Python updates), might someone with a little free time be willing to help create a new script to find untagged posts on Tumblr? Searching Tumblr, this seems to be a semi-common problem.
In case it matters, Python is installed in C:\Python33.
Thank you for your assistance.
when I click on the script, a black box appears for about half a second and then disappears
At the very least, you should be able to run a Python script from the command line, e.g., do Exercise 0 from "Learn Python The Hard Way".
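For example, from a Command Prompt, so the window (and any traceback) stays visible; the install path is from the question, and untagged.py is a stand-in for whatever the downloaded script is called:

C:\> C:\Python33\python.exe untagged.py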
"Finding Untagged Posts on Tumblr" blog post contains Python 2 script (look at import urllib2 in the source. urllib2 is renamed to urllib.request in Python 3). It is easy to port the script to Python 3:
#!/usr/bin/env python3
"""Find untagged tumblr posts.

Python 3 port of the script from
http://www.alexwlchan.net/2013/08/untagged-tumblr-posts/
"""
import json
from itertools import count
from urllib.request import urlopen

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
url = "https://api.tumblr.com/v2/blog/{blog}/posts?api_key={key}".format(
    blog=hostname, key=api_key)

for offset in count(step=20):
    r = json.loads(urlopen(url + "&offset=" + str(offset)).read().decode())
    posts = r["response"]["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Here's the same functionality implemented using the official Python Tumblr API v2 client (a Python 2-only library):
#!/usr/bin/env python
from itertools import count

import pytumblr  # $ pip install pytumblr

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
client = pytumblr.TumblrRestClient(api_key, host="https://api.tumblr.com")

for offset in count(step=20):
    posts = client.posts(hostname, offset=offset)["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Tumblr has an API. You probably would have much better success using it.
https://code.google.com/p/python-tumblr/
I am trying to fetch and parse an XML file into a database. The XML is compressed in GZIP. The GZIP file is ~8MB. When I run the code locally, the memory used by pythonw.exe builds up to a level where the entire system (Windows 7) stops responding, and when I run it online it exceeds the memory limit on Google App Engine. Not sure if the file is too big or if I am doing something wrong. Any help would be very much appreciated!
from google.appengine.ext import webapp
from google.appengine.api.urlfetch import fetch
from xml.dom.minidom import parseString
import gzip
import base64
import StringIO

class ParseCatalog(webapp.RequestHandler):
    user = xxx
    password = yyy
    catalog = fetch('url',
                    headers={"Authorization":
                             "Basic %s" % base64.b64encode(user + ':' + password)},
                    deadline=600)
    xmlstring = StringIO.StringIO(catalog.content)
    gz = gzip.GzipFile(fileobj=xmlstring)
    gzcontent = gz.read()
    contentxml = parseString(gzcontent)
    items = contentxml.getElementsByTagName("Product")
    for product in items:
        item = DatabaseEntry()
        item.name = str(product.getElementsByTagName("Manufacturer")[0].firstChild.data)
        item.put()
UPDATE
So I tried to follow BasicWolf's suggestion to switch to lxml, but am having problems importing it. I downloaded the lxml 2.3 library and put it in the folder of my app (I know this is not ideal, but it's the only way I know how to include a third-party library). Also, I added the following to my app.yaml:
libraries:
- name: lxml
  version: "2.3"
Then I wrote the following code to test if it parses:
import lxml

class ParseCatalog(webapp.RequestHandler):
    user = xxx
    password = yyy
    catalog = fetch('url',
                    headers={"Authorization":
                             "Basic %s" % base64.b64encode(user + ':' + password)},
                    deadline=600)
    items = etree.iterparse(catalog.content)

    def get(self):
        for elem in items:
            self.response.out.write(str(elem.tag))
However this is resulting in the following error:
ImportError: cannot import name etree
I have checked other questions on this error, and it seems that the fact that I run on Windows 7 might play a role. I also tried to install the pre-compiled binary packages from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, but that didn't change anything either.
What do you expect? First, you read a string into memory; then you unzip it into memory; then you construct a DOM tree, still in memory.
Here are some improvements:
del every buffer variable the moment you don't need it.
Get rid of the DOM XML parser and use event-driven lxml parsing to save memory, as in the sketch after this list.
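A minimal sketch of the event-driven approach (the file handling and tag names are assumptions based on the question):

import gzip
from lxml import etree

with gzip.open('catalog.xml.gz', 'rb') as f:
    # iterparse yields each <Product> element as soon as it is complete,
    # so the whole document never sits in memory at once
    for event, product in etree.iterparse(f, tag='Product'):
        manufacturer = product.findtext('Manufacturer')
        # ... create and put() the datastore entity here ...
        product.clear()  # free the element's children
        while product.getprevious() is not None:
            del product.getparent()[0]  # drop already-processed siblings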
The current URL is
http://myapp.appspot.com/something/<user-id>
or
http://127.0.0.1:8080/something/<user-id>
How can I get http://myapp.appspot.com/ or http://127.0.0.1:8080/ in my Python code?
This is needed for dynamic link generation, e.g. to http://myapp.appspot.com/somethingelse.
self.request.path returns the whole path.
self.request.host_url
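A minimal sketch of using it for link generation (the handler name is hypothetical):

from google.appengine.ext import webapp

class SomethingElseLink(webapp.RequestHandler):
    def get(self):
        # 'http://myapp.appspot.com' in production, 'http://127.0.0.1:8080' locally
        base = self.request.host_url
        self.response.out.write(base + '/somethingelse')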
I think you want app_identity.get_default_version_hostname().
If an app is served from a custom domain, it may be necessary to
retrieve the entire hostname component. You can do this using the
app_identity.get_default_version_hostname() method.
This code:
logging.info(app_identity.get_default_version_hostname())
prints localhost:8080 on the development server.
If self.request.path returns the whole path, can't you just do:
import urlparse

def get_domain(url):
    return urlparse.urlparse(url).netloc

>>> get_domain("http://myapp.appspot.com/something/")
'myapp.appspot.com'