I'm trying to make a universal script in Python that can be used by anybody to import/export all sorts of information from/to Work Etc CRM platform. It has all the documentation here: http://admin.worketc.com/xml.
However, I am now a bit stuck. Authentication works, I can call different API methods, but only the ones without parameters. I am new to Python and that's why I can't figure out how to pass the parameters onto that specific method in the API. Specifically I need to export all time sheets. I'm trying to call this method specifically: http://admin.worketc.com/xml?op=GetDraftTimesheets. For obvious reasons I cannot disclose the login information so it might be a bit hard to test for you.
The code itself:
import xml.etree.ElementTree as ET
import urllib2
import sys
email = 'email#domain.co.uk'
password = 'pass'
#service = 'GetEmployee?EntityID=1658'
#service = 'GetEntryID?EntryID=23354'
#service = ['GetAllCurrenciesWebSafe']
#service = ['GetEntryID', 'EntryID=23354']
service = ['GetDraftTimesheets','2005-08-15T15:52:01+00:00','2014-08-15T15:52:01+00:00' ]
class workEtcUniversal():
sessionkey = None
def __init__(self,url):
if not "http://" in url and not "https://" in url:
url = "http://%s" % url
self.base_url = url
else:
self.base_url = url
def authenticate(self, user, password):
try:
loginurl = self.base_url + email + '&pass=' + password
req = urllib2.Request(loginurl)
response = urllib2.urlopen(req)
the_page = response.read()
root = ET.fromstring(the_page)
sessionkey = root[1].text
print 'Authentication successful!'
try:
f = self.service(sessionkey, service)
except RuntimeError:
print 'Did not perform function!'
except RuntimeError:
print 'Error logging in or calling the service method!'
def service(self, sessionkey, service):
try:
if len(service)<2:
retrieveurl = 'https://domain.worketc.com/xml/' + service[0] + '?VeetroSession=' + sessionkey
else:
retrieveurl = 'https://domain.worketc.com/xml/' + service[0,1,2] + '?VeetroSession=' + sessionkey
except TypeError as err:
print 'Type Error, which means arguments are wrong (or wrong implementation)'
print 'Quitting..'
sys.exit()
try:
responsefile = urllib2.urlopen(retrieveurl)
except urllib2.HTTPError as err:
if err.code == 500:
print 'Internal Server Error: Permission Denied or Object (Service) Does Not Exist'
print 'Quitting..'
sys.exit()
elif err.code == 404:
print 'Wrong URL!'
print 'Quitting..'
sys.exit()
else:
raise
try:
f = open("ExportFolder/worketcdata.xml",'wb')
for line in responsefile:
f.write(line)
f.close()
print 'File has been saved into: ExportFolder'
except (RuntimeError,UnboundLocalError):
print 'Could not write into the file'
client = workEtcUniversal('https://domain.worketc.com/xml/AuthenticateWebSafe?email=')
client.authenticate(email, password)
Writing a code Consuming API requires resolving few questions:
what methods on API are available (get their list with names)
how does a request to such method looks like (find out url, HTTP method to use, requirements to body if used, what headers are expected)
how to build up all the parts to make the request
What methods are available
http://admin.worketc.com/xml lists many of them
How does a request looks like
GetDraftTimesheet is described here http://admin.worketc.com/xml?op=GetDraftTimesheets
and it expects you to create following HTTP request:
POST /xml HTTP/1.1
Host: admin.worketc.com
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "http://schema.veetro.com/GetDraftTimesheets"
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetDraftTimesheets xmlns="http://schema.veetro.com">
<arg>
<FromUtc>dateTime</FromUtc>
<ToUtc>dateTime</ToUtc>
</arg>
</GetDraftTimesheets>
</soap:Body>
</soap:Envelope>
Building up the request
The biggest task it to build properly shaped XML document as shown above and having elements FromUtc and ToUtc filled with proper values. I guess, the values shall be in format of ISO datetime, this you shall find yourself.
You shall be able building such an XML by some Python library, I would use lxml.
Note, that the XML document is using namespaces, you have to handle them properly.
Making POST request with all the headers shall be easy. The library you use to make HTTP requests shall fill in properly Content-Length value, but this is mostly done automatically.
Veerto providing many alternative methods
E.g. for "http://admin.worketc.com/xml?op=FindArticlesWebSafe" there is set of different methods for the same service:
SOAP 1.1
SOAP 1.2
HTTP GET
HTTP POST
Depending on your preferences, pick the one which fits your needs.
The simplest is mostly HTTP GET.
For HTTP requests, I would recommend using requests, which are really easy to use, if you get through tutorial, you will understand what I mean.
Related
I'm in the process of writing a very simple Python application for a friend that will query a service and get some data in return. I can manage the GET requests easily enough, but I'm having trouble with the POST requests. Just to get my feet wet, I've only slightly modified their example JSON data, but when I send it, I get an error. Here's the code (with identifying information changed):
import urllib.request
import json
def WebServiceClass(Object):
def __init__(self, user, password, host):
self.user = user
self.password = password
self.host = host
self.url = "https://{}/urldata/".format(self.host)
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, "https://{}".format(self.host), self.user, self.password)
self.opener = urllib.request.build_opener(urllib.request.HTTPDigestAuthHandler(mgr))
username = "myusername"
password = "mypassword"
service_host = "thisisthehostinfo"
web_service_object = WebServiceClass(username, password, service_host)
user_query = {"searchFields":
{
"First Name": "firstname",
"Last Name": "lastname"
},
"returnFields":["Entity ID","First Name","Last Name"]
}
user_query = json.dumps(user_query)
user_query = user_query.encode("ascii")
the_url = web_service_object.url + "modules/Users/search"
try:
user_data = web_service_object.opener.open(the_url, user_query)
user_data.read()
except urllib.error.HTTPError as e:
print(e.code)
print(e.read())
I got the class data from their API documentation.
As I said, I can do GET requests fine, but this POST request gives me a 500 error with the following text:
Content type 'application/x-www-form-urlencoded;charset=UTF-8' not supported
In researching this error, my assumption has become that the above error means I need to submit the data as multipart/form-data. Whether or not that assumption is correct is something I would like to test, but stock Python doesn't appear to have any easy way to create multipart/form-data - there are modules out there, but all of them appear to take a file and convert it to multipart/form-data, whereas I just want to convert this simple JSON data to test.
This leads me to two questions: does it seem as though I'm correct in my assumption that I need multipart/form-data to get this to work correctly? And if so, do I need to put my JSON data into a text file and use one of those modules out there to turn it into multipart, or is there some way to do it without creating a file?
Maybe you want to try the requests lib, You can pass a files param, then requests will send a multipart/form-data POST instead of an application/x-www-form-urlencoded POST. You are not limited to using actual files in that dictionary, however:
import requests
response = requests.post('http://httpbin.org/post', files=dict(foo='bar'))
print response.status_code
If you want to know more about the requests lib, and specially in sending multipart forms take a look at this:
http://docs.python-requests.org/en/master/
and
http://docs.python-requests.org/en/master/user/advanced/?highlight=Multipart-Encoded%20Files
We use a custom scraper that have to take a separate website for a language (this is an architecture limitation). Like site1.co.uk, site1.es, site1.de etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
def handle_request(self, r):
url = r.get_url()
# replace URLs
if 'blabla' in url:
r.set_url(url.replace('something', 'another'))
But the target host generates 301 redirect with the response from the webserver - 'the page has been moved here' and the link to the site2.com/en
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
if key.lower() == 'host':
r.headers[key] = ['site2.com']
also I tried to replace the Referrer - all of that didn't help.
How can I finally spoof that request from the subdomain to the main domain? If it generates a HTTP(s) client warning it's ok since we need that for the scraper (and the warnings there can be turned off), not the real browser.
Thanks!
You need to replace the content of the response and craft the header with just a few fields.
Open a new connection to the redirected url and craft your response :
def handle_request(self, flow):
newUrl = <new-url>
retryCount = 3
newResponse = None
while True:
try:
newResponse = requests.get(newUrl) # import requests
except:
if retryCount == 0:
print 'Cannot reach new url ' + newUrl
traceback.print_exc() # import traceback
return
retryCount -= 1
continue
break
responseHeaders = Headers() # from netlib.http import Headers
if 'Date' in newResponse.headers:
responseHeaders['Date'] = str(newResponse.headers['Date'])
if 'Connection' in newResponse.headers:
responseHeaders['Connection'] = str(newResponse.headers['Connection'])
if 'Content-Type' in newResponse.headers:
responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
if 'Content-Length' in newResponse.headers:
responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
if 'Content-Encoding' in newResponse.headers:
responseHeaders['Content-Encoding'] = str(inetResponse.headers['Content-Encoding'])
response = HTTPResponse( # from libmproxy.models import HTTPResponse
http_version='HTTP/1.1',
status_code=200,
reason='OK',
headers=responseHeaders,
content=newResponse.content)
flow.reply(response)
I am using Python's requests library in one method of my application. The body of the method looks like this:
def handle_remote_file(url, **kwargs):
response = requests.get(url, ...)
buff = StringIO.StringIO()
buff.write(response.content)
...
return True
I'd like to write some unit tests for that method, however, what I want to do is to pass a fake local url such as:
class RemoteTest(TestCase):
def setUp(self):
self.url = 'file:///tmp/dummy.txt'
def test_handle_remote_file(self):
self.assertTrue(handle_remote_file(self.url))
When I call requests.get with a local url, I got the KeyError exception below:
requests.get('file:///tmp/dummy.txt')
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.pyc in connection_from_host(self, host, port, scheme)
76
77 # Make a fresh ConnectionPool of the desired type
78 pool_cls = pool_classes_by_scheme[scheme]
79 pool = pool_cls(host, port, **self.connection_pool_kw)
80
KeyError: 'file'
The question is how can I pass a local url to requests.get?
PS: I made up the above example. It possibly contains many errors.
As #WooParadog explained requests library doesn't know how to handle local files. Although, current version allows to define transport adapters.
Therefore you can simply define you own adapter which will be able to handle local files, e.g.:
from requests_testadapter import Resp
import os
class LocalFileAdapter(requests.adapters.HTTPAdapter):
def build_response_from_file(self, request):
file_path = request.url[7:]
with open(file_path, 'rb') as file:
buff = bytearray(os.path.getsize(file_path))
file.readinto(buff)
resp = Resp(buff)
r = self.build_response(request, resp)
return r
def send(self, request, stream=False, timeout=None,
verify=True, cert=None, proxies=None):
return self.build_response_from_file(request)
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
requests_session.get('file://<some_local_path>')
I'm using requests-testadapter module in the above example.
Here's a transport adapter I wrote which is more featureful than b1r3k's and has no additional dependencies beyond Requests itself. I haven't tested it exhaustively yet, but what I have tried seems to be bug-free.
import requests
import os, sys
if sys.version_info.major < 3:
from urllib import url2pathname
else:
from urllib.request import url2pathname
class LocalFileAdapter(requests.adapters.BaseAdapter):
"""Protocol Adapter to allow Requests to GET file:// URLs
#todo: Properly handle non-empty hostname portions.
"""
#staticmethod
def _chkpath(method, path):
"""Return an HTTP status for the given filesystem path."""
if method.lower() in ('put', 'delete'):
return 501, "Not Implemented" # TODO
elif method.lower() not in ('get', 'head'):
return 405, "Method Not Allowed"
elif os.path.isdir(path):
return 400, "Path Not A File"
elif not os.path.isfile(path):
return 404, "File Not Found"
elif not os.access(path, os.R_OK):
return 403, "Access Denied"
else:
return 200, "OK"
def send(self, req, **kwargs): # pylint: disable=unused-argument
"""Return the file specified by the given request
#type req: C{PreparedRequest}
#todo: Should I bother filling `response.headers` and processing
If-Modified-Since and friends using `os.stat`?
"""
path = os.path.normcase(os.path.normpath(url2pathname(req.path_url)))
response = requests.Response()
response.status_code, response.reason = self._chkpath(req.method, path)
if response.status_code == 200 and req.method.lower() != 'head':
try:
response.raw = open(path, 'rb')
except (OSError, IOError) as err:
response.status_code = 500
response.reason = str(err)
if isinstance(req.url, bytes):
response.url = req.url.decode('utf-8')
else:
response.url = req.url
response.request = req
response.connection = self
return response
def close(self):
pass
(Despite the name, it was completely written before I thought to check Google, so it has nothing to do with b1r3k's.) As with the other answer, follow this with:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
r = requests_session.get('file:///path/to/your/file')
The easiest way seems using requests-file.
https://github.com/dashea/requests-file (available through PyPI too)
"Requests-File is a transport adapter for use with the Requests Python library to allow local filesystem access via file:// URLs."
This in combination with requests-html is pure magic :)
packages/urllib3/poolmanager.py pretty much explains it. Requests doesn't support local url.
pool_classes_by_scheme = {
'http': HTTPConnectionPool,
'https': HTTPSConnectionPool,
}
In a recent project, I've had the same issue. Since requests doesn't support the "file" scheme, I'll patch our code to load the content locally. First, I define a function to replace requests.get:
def local_get(self, url):
"Fetch a stream from local files."
p_url = six.moves.urllib.parse.urlparse(url)
if p_url.scheme != 'file':
raise ValueError("Expected file scheme")
filename = six.moves.urllib.request.url2pathname(p_url.path)
return open(filename, 'rb')
Then, somewhere in test setup or decorating the test function, I use mock.patch to patch the get function on requests:
#mock.patch('requests.get', local_get)
def test_handle_remote_file(self):
...
This technique is somewhat brittle -- it doesn't help if the underlying code calls requests.request or constructs a Session and calls that. There may be a way to patch requests at a lower level to support file: URLs, but in my initial investigation, there didn't seem to be an obvious hook point, so I went with this simpler approach.
To load a file from a local URL, e.g. an image file you can do this:
import urllib
from PIL import Image
Image.open(urllib.request.urlopen('file:///path/to/your/file.png'))
I think simple solution for this will be creating temporary http server using python and using it.
Put all your files in temporary folder eg. tempFolder
Go to that directory and create a temporary http server in terminal/cmd as per your OS using command python -m http.server 8000 (Note 8000 is port no.)
This will you give you a link to http server. You can access it from http://127.0.0.1:8000/
Open your desired file in browser and copy the link to your url.
I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshortening youtube links. Since unshort.me is used readily, this returns almost 90% of the results with captchas which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
resolvedURL = urllib2.urlopen(url)
print resolvedURL.url
#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
#c.setopt(c.WRITEFUNCTION, t.body_callback)
#c.perform()
#c.close()
#dom = xml.dom.minidom.parseString(t.contents)
#resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
one line functions, using requests library and yes, it supports recursion.
def unshorten_url(url):
return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse
def unshorten_url(url):
parsed = urlparse.urlparse(url)
h = httplib.HTTPConnection(parsed.netloc)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
h.request('HEAD', resource )
response = h.getresponse()
if response.status/100 == 3 and response.getheader('Location'):
return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
else:
return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and doing a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get. Beside tweaking your kernel's TCP/IP stack.
Here a src code that takes into account almost of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The src code is on github # https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10
class UnShortenUrl:
def process(self, url, previous_url=None):
logging.info('Init url: %s'%url)
import urlparse
import httplib
try:
parsed = urlparse.urlparse(url)
if parsed.scheme == 'https':
h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
else:
h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
try:
h.request('HEAD',
resource,
headers={'User-Agent': 'curl/7.38.0'}
)
response = h.getresponse()
except:
import traceback
traceback.print_exec()
return url
logging.info('Response status: %d'%response.status)
if response.status/100 == 3 and response.getheader('Location'):
red_url = response.getheader('Location')
logging.info('Red, previous: %s, %s'%(red_url, previous_url))
if red_url == previous_url:
return red_url
return self.process(red_url, previous_url=url)
else:
return url
except:
import traceback
traceback.print_exc()
return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
I am having problems with pycurl in conjunction with Twitter's Streaming API filter stream.
What is happening when I run the code below it seems to barf on the perform call. I know this because I placed print statements before and after the perform call. I am using Python 2.6.1 and I am on a Mac if that matters.
#!/usr/bin/python
print "Content-type: text/html"
print
import pycurl, json, urllib
STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json?follow=1&count=100"
USER = "user"
PASS = "password"
print "<html><head></head><body>"
class Client:
def __init__(self):
self.buffer = ""
self.conn = pycurl.Curl()
self.conn.setopt(pycurl.POST,1)
self.conn.setopt(pycurl.USERPWD, "%s:%s" % (USER,PASS))
self.conn.setopt(pycurl.URL, STREAM_URL)
self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
try:
self.conn.perform()
self.conn.close()
except BaseException:
traceback.print_exc()
def on_receive(self,data):
self.buffer += data
if data.endswith("\r\n") and self.buffer.strip():
content = json.loads(self.buffer)
self.buffer = ""
print content
if "text" in content:
print u"{0[user][name]}: {0[text]}".format(content)
client = Client()
print "</body></html>"
First, try turning on verbosity to help debug:
self.conn.setopt(pycurl.VERBOSE ,1)
It looks like you aren't setting the basic auth mode:
self.conn.setopt(pycurl.HTTPAUTH, pycurl.HTTPAUTH_BASIC)
Also according to the documentation, you need to provide a POST of the parameters to the API, not pass them in as a GET parameter:
data = dict( track='stack overflow' )
self.conn.setopt(pycurl.POSTFIELDS,urlencode(data))
You are trying to use basic authentication.
Basic Authentication sends user
credentials in the header of the HTTP
request. This makes it easy to use,
but insecure. OAuth is the Twitter
preferred method of authentication
moving forward - come August 2010,
we'll be turning off Basic Auth from
the API. --Authentication, Twitter