Uploading a structured HTML/formatted document to Google Docs API - python

I am uploading content to Google Docs using the API. The document is plain HTML, and I want to convert the tags into the formatting that Docs uses, i.e. H1 becomes Heading 1, H2 becomes Heading 2, etc.
I had thought this conversion would happen naturally, but when you look at the created document in Docs it has lost all of its structure. I can't see anything in the documentation about maintaining formatting, or about a syntax we might use to tell Docs the intended status of a piece of text.
Has anyone else had any success with this?
Code used to generate the document (Python):
this_document = gdata.data.MediaSource(file_handle=this_html_string,
                                       content_type='text/html',
                                       content_length=this_html_string_length)
this_new_doc = this_gdoc.Upload(this_document, this_title, content_type='text/html')
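For reference, a slightly fuller sketch of that upload. This is a minimal sketch, not a confirmed fix: client stands in for an already-authenticated gdata docs client (the question's this_gdoc), and the HTML string is wrapped in a file-like object, since file_handle generally expects one rather than a raw string.

import StringIO

import gdata.data

html = '<html><body><h1>Heading</h1><p>Body text.</p></body></html>'

# wrap the string: MediaSource's file_handle generally expects a file-like object
media = gdata.data.MediaSource(
    file_handle=StringIO.StringIO(html),
    content_type='text/html',
    content_length=len(html),
)

# client is assumed to be an authenticated gdata docs client (this_gdoc above)
new_doc = client.Upload(media, 'My document', content_type='text/html')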

Related

Open source html and css

Is HTML or CSS open source like PHP or Python?
For example, PHP itself:
PHP source link on GitHub
You must understand the difference between the specification (documentation, idea) and the implementation (interpreter, compiler, engine). Refer to https://softwareengineering.stackexchange.com/questions/238724/what-license-is-html-released-under for more detailed answers.
HTML
HTML only provides tags around text; other than that, HTML files are practically just text files with extra markers. These tags are parsed by your browser's parser and displayed in a certain way. The easiest examples are probably the <strong></strong>, <i></i> and <b></b> tags, whose only purpose is to make the text within them look different. But the strong tag looks different in different browsers: some make the text bold while others apply other changes. This shows that the browser really is the interpreter, and there are not always strict rules on what tags must do. The current standard (HTML5) was created by a group called the WHATWG. See https://www.w3.org/html/ and their own website https://html.spec.whatwg.org/multipage/. They have a GitHub repository at https://github.com/whatwg/html, and I believe you can make changes if they accept your merge requests.
CSS
CSS is also not really a language. CSS is prescribed by the CSS Working Group, whose members are publicly known: https://www.w3.org/Style/CSS/members. If you scroll through the list you will see that most of these people are engineers from big tech companies. There is a GitHub repository for CSS, used for people to post issues: https://github.com/w3c/csswg-drafts. I believe you can create a merge request there too, although I am unsure.
Summary
In short, yes, they are basically open source. However, changes to either of these repositories, even when accepted and merged, will not work until browsers have implemented them.
Note
I am not an expert by any means; I googled most of my information, so I could be wrong about multiple things.

python-docx How to get content/body of a section

I am using Word's sections so that each page can have a different header, and I mark each page with some markup like {page1}.
Using python-docx I get sections by:
doc = Document(my_file)
doc_sections = doc.sections
doc_page_one = doc_sections[0]
I am able to get the header and footer of each page and their text:
doc_page_one.header.paragraphs[0].text
But I can't see the actual page content/body or shapes; while debugging I was not able to find where they live.
Does python-docx have this possibility?
At present, python-docx does not have API support for getting what I would imagine would be the "block-items" (paragraphs + tables) that are "contained" in a certain section.
You would have to navigate the underlying XML if you want it badly enough, probably starting at document._body._body.xml. You can get an idea of what that looks like with:
print(document._body._body.xml)
Basically you'd be looking for w:sectPr elements, each of which ends a section. There is some more detail on the XML schema involved on the python-docx analysis page: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
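A minimal sketch of that navigation, leaning on the private _body attribute (so it may break across python-docx versions; the namespace URI is the standard WordprocessingML one):

from docx import Document

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

doc = Document('my_file.docx')
body = doc._body._body  # the underlying <w:body> lxml element (private API)

sections = []
current = []
for child in body:
    if child.tag == W + 'sectPr':
        # a body-level sectPr carries the properties of the final section
        sections.append(current)
        current = []
    else:
        current.append(child)
        # a paragraph with a w:pPr/w:sectPr ends a section
        if child.find('.//' + W + 'sectPr') is not None:
            sections.append(current)
            current = []

# sections now holds one list of block-level elements (w:p, w:tbl) per section;
# you would still have to read their text or wrap them yourself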

python-docx - replacing characters

I am trying to build a small program that opens a docx document and replaces characters with others, to do some old-school Caesar-style encrypting. After checking the documentation (https://python-docx.readthedocs.io) I am afraid I can't find the object methods and attributes; the documentation kind of explains how to do certain things, like creating paragraphs and sections, but I can't find anything on retrieving document data and parsing it. I would like a list of the objects in the document so I can iterate through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
    paragraph.append(i)
for i in paragraph:
    for y in i:
        y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
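For the replacement itself, a minimal sketch using only the public API (it works run by run, which preserves each run's formatting, but it only covers the body text; headers, footers and tables need separate handling):

from docx import Document

doc = Document('essay.docx')
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        # replacing inside each run preserves that run's formatting
        run.text = run.text.replace('a', 'y')
doc.save('essay-caesar.docx')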
I think I found something useful, in case future readers are interested. The problem with python-docx is that I could only get paragraphs individually, which would take a lot of time, and I don't even know whether titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files; it integrates with python-docx, or at least that's what the short documentation says. What I can do is save my docx file as a PDF and use:
import textract

text = textract.process(
    'path/to/norwegian.pdf',
    method='pdftotext',  # assuming 'pdftofile' was a typo; textract's PDF methods include pdftotext and pdfminer
    language='nor',
)
This lets you get all the text as a string while preserving the layout of the PDF. I haven't tested it yet; I will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

Convert Wikipedia/MediaWiki's code into HTML using python

I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (e.g. hide certain infoboxes etc.).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page % api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
The APIs you are using are designed to return raw content, so it's up to you to style how you want the raw content to be displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. Luckily for you, there are modules that already parse wikitext and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
    'coverimage': {
        'param': 'FILENAME',
        'help': 'filename of an image for the cover page',
    },
}
I don't know what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, I'm sure you have control over modifications as well.
I would go with HTML parsing; the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content that should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
That said, if you really want to manipulate the wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and then use the parse API to get the modified HTML. Or use the Parsoid API, a partial reimplementation of the parser that returns XHTML/RDFa and is richer in semantic elements.
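A minimal sketch of the template-dropping step (the template name is just an example; page is the mwclient page object from the question):

import mwparserfromhell

wikicode = mwparserfromhell.parse(page.text())
for template in wikicode.filter_templates():
    # name.matches() ignores case and surrounding whitespace
    if template.name.matches('Infobox person'):
        wikicode.remove(template)
cleaned_wikitext = str(wikicode)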
At any rate, trying to set up a local wikitext-to-HTML converter is by far the hardest way to approach this task.
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way; there's a good example of just using requests to call the API to "parse" (i.e. render) a page given its title.
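For instance, a minimal sketch of that requests call (with action=parse, the rendered HTML comes back under parse.text in the JSON response):

import requests

resp = requests.get('https://en.wikipedia.org/w/api.php', params={
    'action': 'parse',
    'page': 'Samuel Pepys',
    'prop': 'text',
    'format': 'json',
})
html = resp.json()['parse']['text']['*']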

How to perform a Google search and take the text result?

I'm wondering how to use Python 3 to get Google to build a dictionary of some words: say I enter a word, I want Python to take the definition that Google gives, then store or display it.
I haven't done much coding, but I know how to handle the words afterwards. I'm just a bit confused by urllib and the like, and I have only been able to find help for other versions of Python, which I have not been able to replicate on Python 3.3.
EDIT: Yes, I want to use Google because I like the way it defines words and phrases, and I plan to use the define protocol you mentioned, icedtrees.
Edit: it appears that Google Search grabs its definitions using AJAX calls or something. The below solution will not work.
If you are having trouble using urllib2, I suggest the nice Python Requests package, which is a lot easier to use.
If you are absolutely committed to getting the Google definition and no other definition, I would suggest doing a HTTP request to a page using the Google Search "define" protocol.
For example:
https://www.google.com.au/search?q=define:test
You would then save the HTML result, and then parse it for the definitions that you require. Some examples of Python HTML parsers are the HTMLParser module, and also BeautifulSoup. However, this parsing operation seems pretty simple, so a basic regex should be more than enough. All definitions are stored as follows:
<div style="display:inline" data-dobid="dfn">  <!-- the order of the style and data-dobid attributes can change -->
  <span>definition goes here</span>
</div>
An example of a regex to grab the definitions of "test" from the HTML page:
import re

definitions = re.findall(r'data-dobid="dfn".*?>.*?<span>(.*?)</span>.*?</div>', html, re.DOTALL)

>>> len(definitions)
18
>>> definitions[0]
'a\n procedure intended to establish the quality, performance, or \nreliability of something, especially before it is taken into widespread \nuse.'
# Looks like you might need to remove the newlines
>>> definitions[5]
'the result of a medical examination or analytical procedure.'
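If you would rather use a parser than a regex, here is a minimal sketch with Requests and BeautifulSoup (subject to the same caveat as the edit above: it assumes the definitions actually appear in the fetched HTML):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.google.com/search',
                    params={'q': 'define:test'},
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')
# collect the text of every <div data-dobid="dfn"> block
definitions = [div.get_text(' ', strip=True)
               for div in soup.find_all('div', attrs={'data-dobid': 'dfn'})]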
As a sidenote, there also exists a Google Dictionary API, which can give you definition results in JSON format in response to a request.
