single py file for convert rst to html - python

I have a blog written in reStructuredText which I currently have to manually convert to HTML when I make a new post.
I'm writing a new blog system using Google App Engine and need a simple way of converting rst to HTML.
I don't want to use docutils because it is too big and complex. Is there a simpler (ideally single python file) way I can do this?

docutils is a library that you can install. It also installs front end tools to convert from rest to various formats including html.
http://docutils.sourceforge.net/docs/user/tools.html#rst2html-py
This is a stand alone tool that can be used.
Most converters will exploit the docutils library for this.

The Sphinx documentation generator Python library includes many restructured text (RST) command-line converters.
Install Sphinx:
$ pip install sphinx
Then use one of the many rst2*.py helpers:
$ rst2html.py in_file.rst out_file.html

Have a look at the instructions for hacking docutils. You don't need the whole docutils to produce a html from rst, but you do need a reader, parser, transformer and writer. With some effort you could combine all of these to a single file from the existing docutils files.

Well you could try it with the following piece of code, usage would be:
compile_rst.py yourtext.rst
or
compile_rst.py yourtext.rst desiredname.html
# compile_rst.py
from __future__ import print_function
from docutils import core
from docutils.writers.html4css1 import Writer,HTMLTranslator
import sys, os
class HTMLFragmentTranslator( HTMLTranslator ):
def __init__( self, document ):
HTMLTranslator.__init__( self, document )
self.head_prefix = ['','','','','']
self.body_prefix = []
self.body_suffix = []
self.stylesheet = []
def astext(self):
return ''.join(self.body)
html_fragment_writer = Writer()
html_fragment_writer.translator_class = HTMLFragmentTranslator
def reST_to_html( s ):
return core.publish_string( s, writer = html_fragment_writer )
if __name__ == '__main__':
if len(sys.argv)>1:
if sys.argv[1] != "":
rstfile = open(sys.argv[1])
text = rstfile.read()
rstfile.close()
if len(sys.argv)>2:
if sys.argv[2] != "":
htmlfile = sys.argv[2]
else:
htmlfile = os.path.splitext(os.path.basename(sys.argv[1]))[0]+".html"
result = reST_to_html(text)
print(result)
output = open(htmlfile, "wb")
output.write(result)
output.close()
else:
print("Usage:\ncompile_rst.py docname.rst\nwhich results in => docname.html\ncompile_rst.py docname.rst desiredname.html\nwhich results in => desiredname.html")

Building the doc locally
Install Python.
Clone the forked repository to your computer.
Open the folder that contains the repository.
Execute: pip install -r requirements.txt --ignore-installed
Execute: sphinx-build -b html docs build
The rendered documentation is now in the build directory as HTML.

If Pyfunc's answer doesn't fit your needs, you could consider using the Markdown language instead. The syntax is similar to rst, and markdown.py is fairly small and easy to use. It's still not a single file, but you can import it as a module into any existing scripts you may have.
http://www.freewisdom.org/projects/python-markdown/

Related

Python How can I get the version number from a .whl file

Because of some custom logging I would like to get the version number from a wheel file.
I could of course just parse the filename, but my guess is there would be a better way to do this.
You can also use pkginfo e.g.
from pkginfo import Wheel
w = Wheel(r'C:\path\to\package.whl')
print(w.version)
What I did for now is:
Get the package info and put it into a dict:
def get_package_info():
info_dict = {}
with open(os.path.join(glob.glob("./*.egg-info")[0], "PKG-INFO"), "r") as info:
for i in info:
i = i.split(":")
info_dict[i[0].strip()] = i[1].strip()
return info_dict
Now I can get the Verion from this dictionary.
If someone does have a better approach at this, please let me know.
A simple parsing of the wheel's filename should be enough:
>>> 'torch-1.8.1-cp39-cp39-manylinux1_x86_64.whl'.split('-')[1]
'1.8.1'
If you want to be more sophisticated, you can also use wheel-filename to do this parsing for you:
>>> from wheel_filename import parse_wheel_filename
>>> pwf = parse_wheel_filename('pip-18.0-py2.py3-none-any.whl')
>>> pwf.version
'18.0'
If you just need to compute it quickly for a single package, you can use this codepen page I created, to get version and more details from a wheel's filename: https://codepen.io/chaitan94/full/mdWEeBp

How to use Pygments in Pelican with Markdown?

TLDR: I am trying to do CSS line numbering in pelican, while writing in markdown. Pygments is used indirectly and you can't pass options to it, so I can't separate the lines and there is no CSS selector for "new line".
Using Markdown in Pelican, I can generate code blocks using the CodeHilite extension. Pelican doesn't support using pygments directly if you are using Markdown...only RST(and ... no to converting everything to RST).
So, what I have tried:
MD_EXTENSIONS = [
'codehilite(css_class=highlight,linenums=False,guess_lang=True,use_pygments=True)',
'extra']
And:
:::python
<div class="line">import __main__ as main</div>
And:
PYGMENTS_RST_OPTIONS = {'classprefix': 'pgcss', 'linenos': 'table'}
Can I get line numbers to show up? Yes.
Can I get them to continue to the next code block? No.
And that is why I want to use CSS line numbering...its way easier to control when the numbering starts and stops.
Any help would be greatly appreciated, I've been messing with this for a few hours.
The only way I'm aware of is to fork the CodeHilite Extension (and I'm the developer). First you will need to make a copy of the existing extension (this file), make changes to the code necessary to effect your desired result, and save the file to your PYTHONPATH (probably in the "sitepackages" directory, the exact location of which depends on which system you are on and how Python was installed). Note that you will want to create a unique name for your file so as not to conflict with any other Python packages.
Once you have done that, you need to tell Pelican about it. As Pelican's config file is just Python, import your new extension (use the name of your file without the file extension: yourmodule.py => yourmodule) and include it in the list of extensions.
from yourmodule import CodeHiliteExtension
MD_EXTENSIONS = [
CodeHiliteExtension(css_class='highlight', linenums=False),
'extra']
Note that the call to CodeHiliteExtension is not a string but actually calling the class and passing in the appropriate arguments, which you can adjust as appropriate.
And that should be it. If you would like to set up a easier way to deploy your extension (or distribute it for others to use), you might want to consider creating a setup.py file, which is beyond the scope of this question. See this tutorial for help specific to Markdown extensions.
If you would like specific help with the changes you need to make to the code within the extension, that depends on what you want to accomplish. To get started, the arguments are passing to Pygments on line 117. The simplest approach would be to hardcode your desired options there.
Be ware that if you are trying to replicate the behavior in reStructuredText, you will likely be disappointed. Docutils wraps Pygments with some of its own processing. In fact, a few of the options never get passed to Pygments but are handled by the reStructeredText parser itself. If I recall correctly, CSS line numbering is one such feature. In fact, Pygments does not offer that as an option.
That being the case, you would need to modify your fork of the CodeHilite Extension by having Pygments return non-numbered code, then applying the necessary hooks yourself before the extension returns the highlighted code block. To do so, you would likely need to split on line breaks and then loop through the lines wrapping each line appropriately. Finally, join the newly wrapped lines and return.
I suspect the following (untested) changes will get you started:
diff --git a/markdown/extensions/codehilite.py b/markdown/extensions/codehilite.py
index 0657c37..fbd127d 100644
--- a/markdown/extensions/codehilite.py
+++ b/markdown/extensions/codehilite.py
## -115,12 +115,18 ## class CodeHilite(object):
except ValueError:
lexer = get_lexer_by_name('text')
formatter = get_formatter_by_name('html',
- linenos=self.linenums,
+ linenos=self.linenums if self.linenumes != 'css' else False,
cssclass=self.css_class,
style=self.style,
noclasses=self.noclasses,
hl_lines=self.hl_lines)
- return highlight(self.src, lexer, formatter)
+ result = highlight(self.src, lexer, formatter)
+ if self.linenums == 'css':
+ lines = result.split('\n')
+ for i, line in enumerate(lines):
+ lines[i] = '<div class="line">%s</div>' % line
+ result = '\n'.join(lines)
+ return result
else:
# just escape and build markup usable by JS highlighting libs
txt = self.src.replace('&', '&')
You may have better success in attaining what you want by disabling Pygments and using a JavaScript library to do the highlighting. That depends on which JavaScript Library you choose and what features it has.
TL; DR
in the pelicanconf.py, add this:
# for highlighting code-segments
# PYGMENTS_RST_OPTIONS = {'cssclass': 'codehilite', 'linenos': 'table'} # disable RST options
MD_EXTENSIONS = ['codehilite(noclasses=True, pygments_style=native)', 'extra'] # enable MD options
Obviously, you need to have these properly installed
pip install pygments markdown

How to parse doc files on a mac with python? [duplicate]

for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux?
Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexs invokes cthulu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getouput function to run a couple of shell scripts, namely wvText (which extracts text from a Word document, and cat to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]>//g'*: Remove everything inside tags
grep -v '^[[:space:]]$'*: Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)
If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
if '>' in item:
bad_good = item.split('>')
if bad_good[-1] != '':
fullyclean.append(bad_good[-1])
else:
pass
else:
pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv
To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, lacks.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as this is not enough, each contains an information about other one, so you have to read at least two of them and unravel stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
Code below is very hastily composed and tested on small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of original algorithm used uint32 and int32 when unpacking correctly?
I copied each occurence as in original algo.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack
__all__ = ["doc2text"]
def doc2text (path):
text = u""
cr = CompoundFileReader(path)
# Load WordDocument stream:
try:
f = cr.open("WordDocument")
doc = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupted or it is not a Word document at all."
# Extract file information block and piece table stream informations from it:
fib = doc[:1472]
fcClx = unpack("L", fib[0x01a2l:0x01a6l])[0]
lcbClx = unpack("L", fib[0x01a6l:0x01a6+4l])[0]
tableFlag = unpack("L", fib[0x000al:0x000al+4l])[0] & 0x0200l == 0x0200l
tableName = ("0Table", "1Table")[tableFlag]
# Load piece table stream:
try:
f = cr.open(tableName)
table = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupt. '%s' piece table stream is missing." % tableName
cr.close()
# Find piece table inside a table stream:
clx = table[fcClx:fcClx+lcbClx]
pos = 0
pieceTable = ""
lcbPieceTable = 0
while True:
if clx[pos]=="\x02":
# This is piece table, we store it:
lcbPieceTable = unpack("l", clx[pos+1:pos+5])[0]
pieceTable = clx[pos+5:pos+5+lcbPieceTable]
break
elif clx[pos]=="\x01":
# This is beggining of some other substructure, we skip it:
pos = pos+1+1+ord(clx[pos+1])
else: break
if not pieceTable: raise CompoundFileError, "The file is corrupt. Cannot locate a piece table."
# Read info from pieceTable, about each piece and extract it from WordDocument stream:
pieceCount = (lcbPieceTable-4)/12
for x in xrange(pieceCount):
cpStart = unpack("l", pieceTable[x*4:x*4+4])[0]
cpEnd = unpack("l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
fcValue = unpack("L", pieceDescriptor[2:6])[0]
isANSII = (fcValue & 0x40000000) == 0x40000000
fc = fcValue & 0xbfffffff
cb = cpEnd-cpStart
enc = ("utf-16", "cp1252")[isANSII]
cb = (cb*2, cb)[isANSII]
text += doc[fc:fc+cb].decode(enc, "ignore")
return "\n".join(text.splitlines())
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!
Just an option for reading 'doc' files without using COM: miette. Should work on any platform.
Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert RTF to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)

Using multiple keywords in xattr via _kMDItemUserTags or kMDItemOMUserTags

While reorganizing my images, in anticipation of OSX Mavericks I am writing a script to insert tags into the xattr fields of my image files, so I can search them with Spotlight. (I am also editing the EXIF just to be safe.)
My questions are:
Which attribute is the best to use? _kMDItemUserTags seems to be the OSX version, but kMDItemOMUserTags is already in use by OpenMeta. I would ideally like something that will be Linux and OSX forward compatible.
How do I set multiple tags? Are the comma- or space-delimited or something else?
As an example, using the python xattr module, I am issuing these commands:
xattr.setxattr(FileName, "_kMDItemUserTags", "Name - Sample")
xattr.setxattr(FileName, "kMDItemOMUserTags", "Name,Institution,Sample")
I have also seen mention of these tags: kOMUserTags and kMDItemkeywords but don't know if they are likely to be implemented...
EDIT: Further investigation has shown that for things to be searchable in 10.8,
You need to preface the kMD with com.apple.metadata:
You have to either hex-encode or wrap in a plist.
This python code will generate the tag for kMDItemFinderComment which is searchable in spotlight...
def writexattrs(F,TagList):
""" writexattrs(F,TagList):
writes the list of tags to three xattr field:
'kMDItemFinderComment','_kMDItemUserTags','kMDItemOMUserTags'
This version uses the xattr library """
plistFront = '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"><plist version="1.0"><array>'
plistEnd = '</array></plist>'
plistTagString = ''
for Tag in TagList:
plistTagString = plistTagString + '<string>{}</string>'.format(Tag)
TagText = plistFront + plistTagString + plistEnd
OptionalTag = "com.apple.metadata:"
XattrList = ["kMDItemFinderComment","_kMDItemUserTags","kMDItemOMUserTags"]
for Field in XattrList:
xattr.setxattr (F,OptionalTag+Field,TagText.encode('utf8'))
# Equivalent shell command is xattr -w com.apple.metadata:kMDItemFinderComment [PLIST value] [File name]
I could not get it to work recursively on a folder with reliable results.
If you are worried about compatibility you have to set both of the attributes _kMDItemUserTags and kMDItemOMUserTags. I don't think there's a different solution since all the new OS X apps will use the former attribute, while the old apps still use the latter. This is just my speculation, but I guess OpenMeta will eventually be discontinued in favor of the new native API. Looking to the future you can use the _kMDItemUserTags attribute for your new apps/scripts even in Linux environment.
The tags are set as a property list-encoded array of strings as you have figured out. I don't know if it is a requirement but the OS X encodes the property list in the binary format and not in XML as you did.
I adapted your code to use binary property list as attribute values and everything worked. Here's my code. I am using biplist library which you can get with easy_install biplist.
import xattr
import biplist
def write_xattr_tags(file_path, tags):
bpl_tags = biplist.writePlistToString(tags)
optional_tag = "com.apple.metadata:"
map(lambda a: xattr.setxattr(file_path, optional_tag + a, bpl_tags),
["kMDItemFinderComment", "_kMDItemUserTags", "kMDItemOMUserTags"])
Tested with files and directories using tag:<some_tag> in Spotlight.
Hope this helps.
Note: I am using OS X Lion in this answer but it should work on Mavericks without any problems.
Edit: If you want to apply the tags to the contents of a directory it has to be done individually for every file since the xattr python module doesn't have the recursive option.

How to create fake text file in Python

How can I create a fake file object in Python that contains text? I'm trying to write unit tests for a method that takes in a file object and retrieves the text via readlines() then do some text manipulation. Please note I can't create an actual file on the file system. The solution has to be compatible with Python 2.7.3.
This is exactly what StringIO/cStringIO (renamed to io.StringIO in Python 3) is for.
Or you could implement it yourself pretty easily especially since all you need is readlines():
class FileSpoof:
def __init__(self,my_text):
self.my_text = my_text
def readlines(self):
return self.my_text.splitlines()
then just call it like:
somefake = FileSpoof("This is a bunch\nOf Text!")
print somefake.readlines()
That said the other answer is probably more correct.
In Python3
import io
fake_file = io.StringIO("your text goes here") # takes string as arg
fake_file.read() # you can use fake_file object to do whatever you want
In Python2
import io
fake_file = io.StringIO(u"your text goes here") # takes unicode as argument
fake_file.read() # you can use fake_file object to do whatever you want
For more info check docs here
It's easy using faker-file (supported formats BIN, CSV, DOCX, ICO, JPEG, PDF, PNG, PPTX, SVG, TXT, WEBP and ZIP).
Installation
pip install faker-file[common]
Usage
from faker import Faker
from faker_file.providers.txt_file import TxtFileProvider
FAKER = Faker()
FAKER.add_provider(TxtFileProvider)
file = FAKER.txt_file()
Check the quick start and recipes for more.

Categories

Resources