Is there any way to convert epub2 to epub3 using Python?

As the title suggests, I am looking for a way to convert EPUB2 to EPUB3 using Python.
So far I have found how to convert PDF documents to EPUB2 using Aspose.PDF for Java (through its Python wrapper) and Aspose.Words Cloud. The next step would be to convert the result to EPUB3 manually using Sigil, Calibre, or epub3-itzer. However, I would like a Python script that does this automatically, similar to the following:
import asposewordscloud
from asposewordscloud import WordsApi
from asposewordscloud.models.requests import ConvertDocumentRequest
app_sid = 'my_app_id'
app_secret = 'my_secret'
words_api = WordsApi(app_sid, app_secret)
with open('rainer_docs.pdf', 'rb') as f:
    request = ConvertDocumentRequest(f, format='epub')
    result = words_api.convert_document(request)
print('Output filename : {}'.format(result))
I know this is a bit of a hacky method, but it worked for me. I would like something similar, but simply to convert from EPUB2 to EPUB3 using Python.
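A possible first step for any such script is to check which EPUB version a file declares. The sketch below is not a converter; it only reads the version attribute from the package document, assuming a well-formed EPUB whose META-INF/container.xml points at the OPF file. The function name epub_package_version is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in EPUB files.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def epub_package_version(epub_file):
    """Return the version attribute of the OPF <package> element, e.g. '2.0'."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the path of the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        # The root <package> element carries the EPUB version.
        package = ET.fromstring(zf.read(opf_path))
        return package.get("version")
```

Calling epub_package_version('book.epub') on an EPUB2 file would normally return '2.0'. An actual conversion would still need to rewrite the package metadata and navigation, which this sketch does not attempt.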

Related

Python Dictionary result parsing in mac

Hello, I am exploring the AWS Python library boto3.
I am running a sample script that lists the S3 buckets in my AWS account:
import boto3
aws = boto3.session.Session(profile_name="sand")
s3_client = aws.client("s3")
response = s3_client.list_buckets()
print(response)
The result I am getting is a dictionary.
It is a bit difficult to read in that form. As I am new to Python, can anyone tell me how I can display this dictionary in the terminal in a more readable way? I know I can paste the result into an online parser, but is there another way to print it more readably?
Any other suggestions are welcome too.
I would like the output to be laid out in a more structured way.

Python lets you pretty-print dicts:
import json
print(json.dumps({'a':2, 'b':{'x':3, 'y':{'t1': 4, 't2':5}}},
                 sort_keys=True, indent=4))
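One caveat when applying this to a boto3 response: the dictionary returned by list_buckets() contains datetime objects, which json.dumps() cannot serialize by default. A minimal sketch of two ways around that follows; the response dictionary here is a hypothetical stand-in, not real output.

```python
import json
from datetime import datetime, timezone
from pprint import pprint

# Hypothetical stand-in for the dictionary returned by list_buckets().
response = {
    "Buckets": [
        {"Name": "example-bucket",
         "CreationDate": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    ],
    "Owner": {"DisplayName": "example-owner", "ID": "example-id"},
}

# default=str converts values json cannot serialize (here, datetime) to strings.
pretty = json.dumps(response, indent=4, sort_keys=True, default=str)
print(pretty)

# pprint handles arbitrary Python objects without any conversion step.
pprint(response)
```

The json.dumps() route gives output that other tools can parse, while pprint() needs no conversion but prints Python syntax rather than JSON.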

How to save Python dataset (previously exported from IDL) back to IDL format

I have an IDL file that I import into Python using readsav from scipy. I change a parameter in it and want to save it back in the original, IDL-readable format.
This is how I import it:
from scipy.io.idl import readsav
input = readsav('Original_file.inp')

I haven't tested any of this, but here are a few options to try:
Python-to-IDL/IDL-to-Python Bridge
The Python to IDL bridge provides a way to run IDL routines within Python. You could try the following:
from idlpy import *
from scipy.io.idl import readsav
input = readsav('Original_file.inp')
# ... change parameter here ...
IDL.run("SAVE, /VARIABLES, FILENAME = 'New_file.sav'")
There is also an IDL to Python bridge, which might allow you to perform your desired Python operation within IDL, and skip all the loading and saving of files...
Read/Write JSON
It looks like readsav() just returns a dictionary of the contents of the IDL save file. I'm not sure of the contents of your file, so I don't know if this would work, but perhaps you could just write it as a JSON string,
import json
from scipy.io.idl import readsav
input = readsav('Original_file.inp')
# ... change parameter here, producing modified_input ...
with open('New_file.txt', 'w') as outfile:
    json.dump(modified_input, outfile)
and then read it back into IDL with JSON_PARSE() (documentation here).
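One caveat with the JSON route: readsav() returns NumPy arrays and scalars, which json.dump() rejects by default. A minimal sketch of a workaround, assuming the values can be converted with their tolist() method; the helper names to_jsonable and save_as_json are hypothetical and not part of SciPy.

```python
import json


def to_jsonable(value):
    """Fallback for json.dump: convert values the json module rejects."""
    if hasattr(value, "tolist"):   # NumPy arrays and scalars
        return value.tolist()
    if isinstance(value, bytes):   # strings may come back as bytes
        return value.decode("utf-8", "replace")
    return str(value)              # last resort: store the text form


def save_as_json(data, path):
    """Write a dict-like object (e.g. the readsav result) as JSON."""
    with open(path, "w") as outfile:
        json.dump(dict(data), outfile, default=to_jsonable)
```

With this, save_as_json(modified_input, 'New_file.txt') would replace the plain json.dump() call. Array dtypes and IDL structure layouts are not preserved, so the IDL side may need to rebuild them after JSON_PARSE().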
Write your own hack
If all else fails, you could look at Craig Markwardt's Unofficial Format Specification of the IDL "SAVE" File, and write some custom code to write an IDL save file directly from Python. If nothing else, it would be an interesting exercise.

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print(cb.server_nodes)
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print("Couldn't upsert", e)
    raise
print(cb.get('healthRec').value)
I know that the first commented out line that loads the json is incorrect because it is expecting a string not an actual json... Can anyone help?
Thanks!

Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)
try:
    result = cb.upsert('healthRec', {'record': data})
except CouchbaseError as e:
    print("Couldn't upsert", e)
    raise
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!

I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as input and decodes the JSON string into a dictionary (or whatever custom object you specify via object_hook). You were passing it a file handle, which is why you saw the error.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it handles all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.
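The difference between json.load() and json.loads() described in this answer can be sketched as follows. The temporary file is only a stand-in for myData.json so that the example runs on its own.

```python
import json
import os
import tempfile

# Write a small JSON file to work with (stand-in for myData.json).
path = os.path.join(tempfile.mkdtemp(), "myData.json")
with open(path, "w") as f:
    f.write('{"record": "bob"}')

# json.load() reads from an open file object.
with open(path) as f:
    from_file = json.load(f)

# json.loads() parses a string that already holds the JSON text.
with open(path) as f:
    from_string = json.loads(f.read())

print(from_file == from_string)
```

Both calls produce the same dictionary; the only difference is whether the input is a file object or a string.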

How to parse doc files on a mac with python? [duplicate]

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux?
Is there any library?

Use the native Python docx module. Here's how to extract all the text from a .docx file:
import docx
# filename is the path to your .docx file
document = docx.Document(filename)
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexes invokes Cthulhu. Don't do it!

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word doc. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.

benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)

OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it from Python is pretty easy:
import subprocess
# word_file and output_txt_file are paths you supply
subprocess.check_call(['wvText', word_file, output_txt_file])
with open(output_txt_file) as f:
    out = f.read()
And that's it. We run wvText, which extracts the text from a Word document into a file, and then read that output file. After that, the entire text from the Word document is in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.

Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove everything inside tags
grep -v '^[[:space:]]*$': Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)

If your intention is to use pure Python modules without calling a subprocess, you can use the zipfile Python module.
import zipfile
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the word/document.xml file in the package and assign it to a variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes, so decode them to text
        content = docx.read(item.orig_filename).decode('utf-8')
Your content string, however, needs to be cleaned up. One way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        # Keep only the text that follows the closing '>' of a tag
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
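As an alternative to stripping tags by hand, the text can be pulled out with an XML parser from the standard library. The following is a minimal sketch, assuming a regular .docx file in which text sits in w:t elements inside w:p paragraphs; the function name docx_text is hypothetical. Paragraphs nested inside other paragraphs (for example, in text boxes) may be reported more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:t, ...).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(docx_file):
    """Return the text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Usage would be print(docx_text('/home/whateverdocument.docx')). Unlike the split-based cleanup, this keeps paragraph boundaries and decodes XML entities.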

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
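A rough sketch of that approach, assuming the LibreOffice launcher is on PATH under the name soffice (some installations call it libreoffice) and that the converted file is written as the input's base name with a .txt extension. The function names are hypothetical, and the encoding of the output file may differ between systems.

```python
import os
import subprocess


def libreoffice_command(input_path, output_dir):
    """Build the headless conversion command (launcher name is an assumption)."""
    return ["soffice", "--headless", "--convert-to", "txt:Text",
            "--outdir", output_dir, input_path]


def convert_to_text(input_path, output_dir):
    """Convert a document with LibreOffice and return the extracted text."""
    subprocess.run(libreoffice_command(input_path, output_dir), check=True)
    base = os.path.splitext(os.path.basename(input_path))[0]
    txt_path = os.path.join(output_dir, base + ".txt")
    # errors="replace" guards against an unexpected output encoding.
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.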

Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000) using Python only are lacking.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
An MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store several files within one file, such as pictures.
The same is done in *.docx by using a ZIP archive instead.
There are packages available on PyPI that can read OLE files, such as olefile and compoundfiles.
I used the compoundfiles package to open the *.doc file.
However, in MS Word 97-2000, the internal subfiles are not XML or HTML, but binary files.
And as if this were not enough, each contains information about the other, so you have to read at least two of them and unravel the stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
The code below is very hastily composed and tested on a small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is a Python implementation of the C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
  I copied each occurrence as in the original algorithm.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
  I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]

def doc2text(path):
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # The "<" prefix forces little-endian, 4-byte integers on every platform:
    fib = doc[:1472]
    fcClx = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6+4])[0]
    tableFlag = unpack("<L", fib[0x000a:0x000a+4])[0] & 0x0200 == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx+lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        if clx[pos:pos+1] == b"\x02":
            # This is the piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos+1:pos+5])[0]
            pieceTable = clx[pos+5:pos+5+lcbPieceTable]
            break
        elif clx[pos:pos+1] == b"\x01":
            # This is the beginning of some other substructure, we skip it:
            pos = pos+1+1+clx[pos+1]
        else:
            break
    if not pieceTable:
        raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info about each piece from pieceTable and extract it from the WordDocument stream:
    pieceCount = (lcbPieceTable-4)//12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x*4:x*4+4])[0]
        cpEnd = unpack("<l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
        ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
        pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        # Bit 30 of fcValue marks a piece stored as 8-bit text
        isANSII = (fcValue & 0x40000000) == 0x40000000
        fc = fcValue & 0xbfffffff
        cb = cpEnd-cpStart
        enc = ("utf-16", "cp1252")[isANSII]
        cb = (cb*2, cb)[isANSII]
        text += doc[fc:fc+cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())
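Since *.doc files are OLE2 compound files and *.docx files are ZIP archives, the two can be told apart by their first bytes before choosing an extraction method. A minimal sketch; the function name word_container_type is hypothetical, and the check identifies only the container, not whether the content really is a Word document.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header signature


def word_container_type(path):
    """Return 'ole2', 'zip' or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A caller could use the result to dispatch to doc2text() for 'ole2' files and to a zipfile-based reader for 'zip' files.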

I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!

Just an option for reading 'doc' files without using COM: miette. Should work on any platform.

Aspose.Words Cloud SDK for Python is a platform-independent solution to convert MS Word/Open Office files to text. It is a commercial product, but the free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
# Convert the document to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)

Opening a geojson file in RhinoPython

I'm hoping my problem can be solved with some geojson expertise. The problem I'm having has to do with RhinoPython, the embedded IronPython engine in McNeel's Rhino 5 (more info here: http://python.rhino3d.com/). I don't think it's necessary to be an expert on RhinoPython to answer this question.
I'm trying to load a geojson file in RhinoPython. Because you can't import the geojson module into RhinoPython like in Python I'm using this custom module GeoJson2Rhino provided here: https://github.com/localcode/rhinopythonscripts/blob/master/GeoJson2Rhino.py
Right now my script looks like this:
import rhinoscriptsyntax as rs
import sys
rp_scripts = "rhinopythonscripts"
sys.path.append(rp_scripts)
import rhinopythonscripts
import GeoJson2Rhino as geojson
layer_1 = rs.GetLayer(layer='Layer 01')
layer_color = rs.LayerColor(layer_1)
f = open('test_3.geojson')
gj_data = geojson.load(f,layer_1,layer_color)
f.close()
In particular:
f = open('test_3.geojson')
gj_data = geojson.load(f)
works fine when I'm trying to extract geojson data from regular python 2.7. However in RhinoPython I'm getting the following error message: Message: expected string for parameter 'text' but got 'file'; in reference to gj_data = geojson.load(f).
I've been looking at the GeoJson2Rhino script linked above and I think I've set the parameters for the function correctly. As far as I can tell it doesn't seem to recognize my geojson file, and wants it as a string. Is there an alternative file open function I can use to get the function to recognize it as a geojson file?

Judging by the error message, it looks like the load method requires a string as the first input but in the above example a file object is being passed instead. Try this...
f = open('test_3.geojson')
g = f.read()  # read contents of 'f' into a string
gj_data = geojson.load(g)
...or, if you don't actually need the file object...
g = open('test_3.geojson').read() # get the contents of the geojson file directly
gj_data = geojson.load(g)
See here for more information about reading files in python.
