I have a library which third-party developers use for obtaining information off a few specific websites. The library is responsible for connecting to the website, grabbing pages, parsing necessary information, and returning it to the developer.
However, I'm having issues coming up with an acceptable way to handle storing potentially malformed HTML. Since I can only account for so many things when testing, parsing may fail in the future and it would be helpful if I could find a way to store the HTML that failed parsing for future bug fixing.
Right now I'm using Python's built-in logging module to handle logging in my library, and I'm allowing the third-party developer to supply a configuration dictionary to control how logging outputs error data. However, printing HTML to the console, or even to a file, doesn't seem ideal to me, since it would clutter the terminal or the error log. I considered storing HTML files on the local hard drive, but that seems extremely intrusive.
I've determined how I'm going to pass HTML internally. My plan is to pass it via the parameters of an exception and then catch it with a Filter. However, what to do with it is really troubling me.
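Roughly, what I have in mind looks like this (just a sketch; the class and attribute names are placeholders):
import logging

class ParseError(Exception):
    """Raised when a page fails to parse; carries the offending HTML."""
    def __init__(self, message, html):
        super(ParseError, self).__init__(message)
        self.html = html

class HtmlCaptureFilter(logging.Filter):
    """Pulls the HTML off the exception attached to the log record."""
    def filter(self, record):
        if record.exc_info:
            exc = record.exc_info[1]
            if isinstance(exc, ParseError):
                record.html = exc.html  # now some handler could persist it
        return True  # never suppress the record itself
The library would call logger.exception(...) inside an except ParseError block so that exc_info is populated on the record; what the filter (or a handler) should then do with record.html is the part I'm stuck on.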
Any feedback on a method to accomplish this is appreciated.
Services based on websites that you don't control are likely to be somewhat fragile, so storing the HTML to avoid recrawling in the event of parsing problems makes perfect sense to me. Since uncompressed HTML can consume a lot of space on the disk, you might want to store it in a compressed form in a database.
I've found MongoDB to be convenient for this. The underlying storage format is BSON (i.e. binary JSON). It's also easy to install and use.
Here's a toy example using PyMongo to store this page in MongoDB:
from pymongo import MongoClient
import urllib2
import time
# what will be stored in the document
ts = time.time()
url = 'http://stackoverflow.com/questions/26683772/logging-html-content-in-a-library-environment-with-python'
html = urllib2.urlopen(url).read()
# create a dict and store it in MongoDB
htmlDict = {'url':url, 'ts':ts, 'html':html}
client = MongoClient()
db = client.html_log
collection = db.html
collection.insert(htmlDict)
Check to see that the document is stored in MongoDB:
$ mongo
> use html_log;
> db.html.find()
{ "_id" : ObjectId("54544d96164a1b22d3afd887"), "url" : "http://stackoverflow.com/questions/26683772/logging-html-content-in-a-library-environment-with-python", "html" : "<!DOCTYPE html> [...] </html>", "ts" : 1414810778.001168 }
Related
I want to get some data out of the browser's cache. Chrome's cache filenames look like f_00001, which is meaningless on its own. ChromeCacheView can recover the request URL corresponding to a cache file name.
ChromeCacheView is a small utility that reads the cache folder of Google Chrome Web browser, and displays the list of all files currently stored in the cache. For each cache file, the following information is displayed: URL, Content type, File size, Last accessed time, Expiration time, Server name, Server response, and more.
You can easily select one or more items from the cache list, and then extract the files to another folder, or copy the URLs list to the clipboard.
But this is a GUI program that can only run on Windows. I want to know how it works.
In other words, how can I get more information about the cached files, especially the request URLs?
After a long search, I found the answer.
Instructions for Chrome disk cache format can be found on the following pages:
Disk cache
Chrome Disk Cache Format
By reading these documents, you can implement a parser in any programming language.
Fortunately, I found two python libraries to do this.
dtformats
pyhindsight
The first one doesn't seem to work correctly under Python 3, but the second one is fantastic and does the job perfectly. There are detailed instructions on the pyhindsight home page for how to use it; here is how to integrate it into a project.
import logging
import os

from pyhindsight.analysis import AnalysisSession
from pyhindsight.browsers.chrome import CacheEntry

analysis_session = AnalysisSession()

# use a raw string for the Windows path and expand '~' to the user profile
cache_dir = os.path.expanduser(r'~\AppData\Local\Microsoft\Edge\User Data\Default')
analysis_session.input_path = cache_dir
analysis_session.cache_path = os.path.join(cache_dir, r'Cache\Cache_Data')
analysis_session.browser_type = 'Chrome'
analysis_session.no_copy = True
analysis_session.timezone = None

logging.basicConfig(filename=analysis_session.log_path, level=logging.FATAL,
                    format='%(asctime)s.%(msecs).03d | %(levelname).01s | %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

run_status = analysis_session.run()
for p in analysis_session.parsed_artifacts:
    if isinstance(p, CacheEntry):
        print('Url: {}, Location: {}'.format(p.url, p.location))
That's all. Thanks to pyhindsight for doing the heavy lifting.
I have a Python script that does some complicated statistical calculations on data, and I want to sell it. I thought about compiling it, but I have read that Python compiled with py2exe can be easily decompiled. So I tried another approach, namely running the script from the Internet.
Now my code, in file local.py looks like:
local.py
#imports
#define variables and arrays
#very complicated code i want to protect
I want to obtain something like:
File local.py:
#imports
#define variables and arrays
#call to online.py
print(result)
File online.py:
#code protected
#exit(result)
Is there a better way to do this?
I want client to run the code without being able to see it.
A web API would work great for this, and it's a super simple thing to build. There are numerous Python frameworks for creating simple HTTP APIs, and even writing one from scratch with just low-level sockets wouldn't be too hard. If you only want to protect the code itself, you don't even need security features.
If you only want people who've paid for the usage to have access to your API, then you'll need some sort of authentication, like an API key. That's a pretty easy thing to do too, and may come nearly for free from some of the aforementioned frameworks.
So your client's code might look something like this:
File local.py:
import requests
inputdata = get_data_to_operate_on()
r = requests.post('https://api.gustavo.com/somemagicfunction?apikey=23423h234h2g432j34', data=inputdata)
if r.status_code == 200:
    result = r.json()
    # operate on the resulting JSON here
    ...
This code does an HTTP POST request, passing whatever data is returned by the get_data_to_operate_on() call in the body of the request. The body of the response is the result of the processing, which in this code is assumed to be JSON data, but could be in some other format as well.
There are all sorts of options you could add, by changing the path of the request (the '/somemagicfunction' part) or by adding additional query parameters.
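To make the server side concrete, here is a rough sketch of what it could look like with Flask (the framework choice, the endpoint name, the key check and the placeholder function are all just illustrative assumptions, not a finished design):
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
VALID_API_KEYS = {'23423h234h2g432j34'}  # in practice, load the keys of paying clients from a database


@app.route('/somemagicfunction', methods=['POST'])
def somemagicfunction():
    if request.args.get('apikey') not in VALID_API_KEYS:
        abort(401)  # reject callers without a valid key
    payload = request.get_json(silent=True) or request.form.to_dict()
    result = run_complicated_statistics(payload)  # the code you want to protect stays server-side
    return jsonify(result)


def run_complicated_statistics(data):
    # placeholder for the logic you want to keep private
    return {'status': 'ok', 'received': len(data)}


if __name__ == '__main__':
    app.run()
The client never sees anything beyond the HTTP interface, which is the whole point.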
This might help you to get started on the server side: https://nordicapis.com/8-open-source-frameworks-for-building-apis-in-python. And here's one way to host your Python server code: https://devcenter.heroku.com/articles/getting-started-with-python
I have a huge collection of .json files, each containing hundreds or thousands of documents, that I want to import into ArangoDB collections. Can I do it using Python, and if the answer is yes, can anyone send an example of how to do it from a list of files? i.e.:
for i in filelist:
import i to collection
I have read the documentation, but I couldn't find anything even resembling that.
After a lot of trial and error I found out that I had the answer in front of me all along. I didn't need to import the .json file itself, I just needed to read it and then do a bulk import of the documents. The code looks like this:
import json

# "db" is a python-arango database handle (see the connection sketch below)
a = db.collection('collection_name')
for x in list_of_json_files:
    with open(x, 'r') as json_file:
        data = json.load(json_file)
        a.import_bulk(data)
So actually it was quite simple. In my implementation I am collecting the .json files from multiple folders and importing them to multiple collections. I am using the python-arango 5.4.0 driver
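For completeness, the db handle above comes from an ordinary python-arango connection, roughly like this (the host, credentials and database name are placeholders):
from arango import ArangoClient

client = ArangoClient(hosts='http://localhost:8529')
db = client.db('mydatabase', username='root', password='passwd')
a = db.collection('collection_name')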
I had this same problem. Though your implementation will be slightly different, the answer you need (maybe not the one you're looking for) is to use the "bulk import" functionality.
Since ArangoDB doesn't have an "official" Python driver (that I know of), you will have to peruse other sources to give you a good idea on how to solve this.
The HTTP bulk import/export docs provide curl commands, which can be neatly translated to Python web requests. Also see the section on headers and values.
ArangoJS has a bulk import function, which works with an array of objects, so there's no special processing or preparation required.
I have also used the arangoimport tool to great effect. It's command-line, so it could be controlled from Python, or used stand-alone in a script. For me, the key here was making sure my data was in JSONL or "JSON Lines" format (each line of the file is a self-contained JSON object, no bounding array or comma separators).
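For the HTTP route mentioned above, a rough sketch of the curl command translated to requests (the URL, credentials and collection name are assumptions, and the file is expected to be in JSONL format):
import requests

with open('documents.jsonl', 'rb') as f:
    body = f.read()

# type=documents tells ArangoDB that each line of the body is one JSON document
resp = requests.post(
    'http://localhost:8529/_db/mydatabase/_api/import',
    params={'collection': 'collection_name', 'type': 'documents'},
    data=body,
    auth=('root', 'passwd'),
)
print(resp.json())  # reports how many documents were created and how many errored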
I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (eg, hide certain infoboxes etc).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page%api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
The APIs you are using are designed to return raw content, so it's up to you to style how that raw content is displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. Luckily for you, there are modules that already parse the wiki markup and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
    'coverimage': {
        'param': 'FILENAME',
        'help': 'filename of an image for the cover page',
    },
}
I don't know exactly what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, you should have control over modifications as well.
I would go with HTML parsing: the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content that should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
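For example, hiding infoboxes could look roughly like this with requests and BeautifulSoup (the URL and selector are just the obvious choices; adjust to taste):
import requests
from bs4 import BeautifulSoup

html = requests.get('https://en.wikipedia.org/wiki/Samuel_Pepys').text
soup = BeautifulSoup(html, 'html.parser')

# drop every infobox before reusing the markup
for box in soup.find_all(class_='infobox'):
    box.decompose()

cleaned_html = str(soup)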
That said, if you really want to manipulate wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and use the parse API to get the modified HTML. Or use the Parsoid API, a partial reimplementation of the parser that returns XHTML/RDFa, which is richer in semantic elements.
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
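If you do go the wikitext route, the pipeline could look roughly like this (the endpoints and the template name are only for illustration):
import mwparserfromhell
import requests

# fetch the raw wikitext
wikitext = requests.get(
    'https://en.wikipedia.org/w/index.php',
    params={'title': 'Samuel_Pepys', 'action': 'raw'},
).text

# drop whichever templates you don't want
code = mwparserfromhell.parse(wikitext)
for template in code.filter_templates():
    if template.name.matches('Infobox person'):
        code.remove(template)

# hand the modified wikitext to the parse API to get HTML back
resp = requests.post(
    'https://en.wikipedia.org/w/api.php',
    data={'action': 'parse', 'text': str(code), 'contentmodel': 'wikitext',
          'prop': 'text', 'format': 'json'},
)
html = resp.json()['parse']['text']['*']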
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way. There's a good example of just using requests to call the API to "parse" (aka render) a page given its title.
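Something along these lines (a minimal sketch; the prop and formatversion choices are just the usual ones):
import requests

resp = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={'action': 'parse', 'page': 'Samuel Pepys', 'prop': 'text',
            'format': 'json', 'formatversion': 2},
)
html = resp.json()['parse']['text']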
I am using suds to request data from a third party using a WSDL. I am only saving some of the data returned for now, but I am paying for the data that I get, so I would like to keep all of it. I have decided that the best way to save this data is by capturing the raw XML response into a database field, both for future use should I decide that I want to start using different parts of the data, and as a paper trail in the event of discrepancies.
So I have a two-part question:
Is there a simple way to output the raw received XML from the suds.client object? In my searches for the answer to this I have learned it can be done through logging, but I was hoping not to have to dig that information back out of the logs to put into the database field. I have also looked into the MessagePlugin.received() hook, but could not really figure out how to access this information after it has been parsed, only that I can override that function and have access to the raw XML as it is being parsed (which is before I have decided whether or not it is actually worth saving). I have also explored the retxml option, but I would like to use the parsed version as well, and making two separate calls, one with retxml and one parsed, will cost me twice. I was hoping for a simple function built into the suds client (like response.as_xml() or something equally simple) but have not found anything like that yet. The option bubbling around in my head is to extend the client object using the received() plugin hook to save the XML as an object attribute before it is parsed, to be referenced later... but the execution of that seems a little tricky to me right now, and I have a hard time believing that the suds client doesn't already have this built in somewhere, so I thought I would ask first.
The other part of my question is: what type of Django model field would be best suited to handle up to ~100 kB of text data as raw XML? I was going to just use a CharField with a stupidly long max_length, but that feels wrong.
Thanks in advance.
I solved this by using the flag retxml on client initialization:
client = Client(settings.WSDL_ADDRESS, retxml=True)
raw_reply = client.service.PersonSearch(soapified_search_object)
I was then able to save raw_reply as the raw XML into a Django models.TextField(), and then inject the raw XML to get a suds-parsed result without having to re-submit my search, like so:
parsed_result = client.service.PersonSearch(__inject={'reply': raw_reply})
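For the record, the model field itself can be as simple as this (a sketch; the model and field names are just illustrative):
from django.db import models

class SoapResponse(models.Model):
    # TextField imposes no length limit at the Django level, so ~100 kB of XML is fine
    raw_xml = models.TextField()
    created = models.DateTimeField(auto_now_add=True)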
I suppose if I had wanted to strip off the suds envelope stuff from raw reply I could have used a python xml library for further usage of the reply, but as my existing code was already taking the information I wanted from the suds client result I just used that.
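If you do want to strip the envelope, it's only a few lines with lxml (a sketch, assuming the standard SOAP 1.1 envelope namespace):
from lxml import etree

root = etree.fromstring(raw_reply)
body = root.find('{http://schemas.xmlsoap.org/soap/envelope/}Body')
payload = body[0]  # the first child of Body is the actual response element
print(etree.tostring(payload, pretty_print=True))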
Hope this helps someone else.
I have used kyrayzk's solution for a while, but have always found it a bit hackish, as I had to create a separate dummy client just for when I needed to process the raw XML.
So I sort of reimplemented .last_received() and .last_sent() methods (which were (IMHO, mistakenly) removed in suds-jurko 0.4.1) through a MessagePlugin.
Hope it helps someone:
from suds.plugin import MessagePlugin

class MyPlugin(MessagePlugin):
    def __init__(self):
        self.last_sent_raw = None
        self.last_received_raw = None

    def sending(self, context):
        self.last_sent_raw = str(context.envelope)

    def received(self, context):
        self.last_received_raw = str(context.reply)
Usage:
plugin = MyPlugin()
client = Client(TRTH_WSDL_URL, plugins=[plugin])
client.service.SendSomeRequest()
print(plugin.last_sent_raw)
print(plugin.last_received_raw)
And as an extra, if you want nicely indented XML, try this:
from lxml import etree
def xmlpprint(xml):
    return etree.tostring(etree.fromstring(xml), pretty_print=True)