Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have created a word cloud from a text file in Python using the pytagcloud module, but it is saved as a .jpeg file. I want the words in the cloud to be active when I hover the cursor over them.
When I click a particular word in the word cloud, the corresponding sentence in the passage should be highlighted. How can I do that? Please help; this is for my project work.
I created a GUI with options to import and read a text file. After reading the text file, I need to produce the word cloud from the passage in the .txt file.
def wordcloud(self):
    from pytagcloud import (create_tag_image, create_html_data, make_tags,
                            LAYOUT_HORIZONTAL, LAYOUTS, LAYOUT_MIX, LAYOUT_VERTICAL,
                            LAYOUT_MOST_HORIZONTAL, LAYOUT_MOST_VERTICAL)
    from pytagcloud.lang.counter import get_tag_counts
    from pytagcloud.colors import COLOR_SCHEMES
    import webbrowser
    #import Tkinter
    #from tkFileDialog import askopenfilename
    #filename=askopenfilename()
    #with open(filename,'r') as f:
    #    text=f.read()
    #def create_tag_cloud(text):
    words = nltk.word_tokenize(self._contents)
    doc = " ".join(d for d in words[:70])
    tags = make_tags(get_tag_counts(doc), maxsize=100)
    create_tag_image(tags, 'sid.jpeg', size=(1600, 1200), fontname='Philosopher', layout=LAYOUT_MIX, rectangular=True)
    webbrowser.open('sid.jpeg')
Without seeing your code there is nothing to correct.
However, the best approach is to output your tag cloud as HTML and CSS so you end up with something like their demo.
Once you have your HTML code, one approach is to use JavaScript to react to a clicked word and highlight every occurrence of that word in your body text.
There are many other approaches that may be better suited, but without any context it's impossible to comment, I'm afraid. Regardless of this, don't render your tag cloud as a JPEG. An image is static and cannot be made interactive.
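The click-to-highlight idea can be sketched in Python by generating the page yourself. This is only a minimal sketch and not pytagcloud's own HTML output: the function names (split_sentences, words_of, build_page), the CSS class names and the naive sentence splitter are all assumptions made for illustration.

```python
import html
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words_of(sentence):
    """Lower-cased words of a sentence, without punctuation."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def build_page(text, cloud_words):
    """Return an HTML page with clickable cloud words and highlightable sentences."""
    # Each sentence becomes a <span> that lists the words it contains.
    sentence_html = " ".join(
        '<span class="sentence" data-words="{}">{}</span>'.format(
            html.escape(" ".join(sorted(words_of(s))), quote=True),
            html.escape(s),
        )
        for s in split_sentences(text)
    )
    # Each cloud word becomes a link that carries the word it stands for.
    cloud_html = " ".join(
        '<a href="#" class="tag" data-word="{0}">{0}</a>'.format(
            html.escape(w.lower(), quote=True)
        )
        for w in cloud_words
    )
    # On click, mark every sentence whose word list contains the clicked word.
    script = """
document.querySelectorAll('.tag').forEach(function (tag) {
  tag.addEventListener('click', function (event) {
    event.preventDefault();
    var word = tag.dataset.word;
    document.querySelectorAll('.sentence').forEach(function (s) {
      var hit = s.dataset.words.split(' ').indexOf(word) !== -1;
      s.classList.toggle('hit', hit);
    });
  });
});
"""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>.tag:hover{text-decoration:underline}.hit{background:yellow}</style>"
        "</head><body><div id='cloud'>" + cloud_html + "</div>"
        "<p id='passage'>" + sentence_html + "</p>"
        "<script>" + script + "</script></body></html>"
    )
```

Writing the returned string to a .html file and opening it with webbrowser.open() would replace the JPEG step; the hover effect comes from the .tag:hover CSS rule.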
Edit1: Code provided
Have a look at the test_create_html_data(self) function in the PyTagCloud tests on GitHub to get an idea of how to output HTML and CSS.
Just a quick note on your code: Python re-runs all those import statements on every call of your wordcloud() method. The modules themselves are cached after the first import, so the cost is small, but imports conventionally belong at the top of the module. Pull them out to something like this (I started the adaptation for you):
from pytagcloud import (create_tag_image, create_html_data,
                        make_tags, LAYOUT_MIX)
from pytagcloud.lang.counter import get_tag_counts
from pytagcloud.colors import COLOR_SCHEMES
import webbrowser
# ...the rest of your code...
def wordcloud(self):
    words = nltk.word_tokenize(self._contents)
    doc = " ".join(d for d in words[:70])
    tags = make_tags(get_tag_counts(doc), maxsize=100)
    data = create_html_data(tags, (1600,1200), layout=LAYOUT_MIX, fontname='Philosopher', rectangular=True)
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 5 days ago.
import aspose.words as aw
import os
import glob
import openpyxl
import json
import aspose.cells
from aspose.cells import License,Workbook,FileFormatType
workbook = Workbook("bookwithChart.xlsx")
os.chdir(os.path.join(r"C:\Users\13216\Downloads\pythontests"))
docx_files = glob.glob("*.docx")
for files in docx_files:
    doc = aw.Document(files)
    doc.save("document1.docx")
    doc = aw.Document("document1.docx")
    doc.save("html_output.html", aw.SaveFormat.HTML)
    book = Workbook("html_output.html")
    book.save("word-to-json.json", FileFormatType.JSON)
I need to convert a set of Word documents to JSON. It works great. However, when I change the path and document name to test other documents, the output doesn't change. The JSON file contains the same output as for the initial test document.
I tried changing the save command for the JSON and HTML files. It didn't work. I assume the program is storing the output from the very first test document ("document1.docx"). I have tried a different input document many times, and the output does not change.
There is no need to open and re-save the DOCX document in your code. Also, you save all the documents to the same output file, so it is overwritten on each iteration. You can modify your code like this:
import aspose.words as aw
import aspose.cells as ac
import os
import glob
os.chdir(os.path.join(r"C:\Temp"))
docx_files = glob.glob("*.docx")
i = 0
for files in docx_files:
    doc = aw.Document(files)
    doc.save("tmp.html", aw.SaveFormat.HTML)
    book = ac.Workbook("tmp.html")
    # A different file name per iteration, so earlier results are kept
    book.save("word-to-json_" + str(i) + ".json", ac.FileFormatType.JSON)
    i += 1
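As a variation on the counter, each output could be named after its source document so the JSON files are easier to trace back. A minimal sketch using only the standard library; json_name_for is a hypothetical helper, not part of Aspose:

```python
from pathlib import Path


def json_name_for(docx_path):
    """Return a JSON file name based on the source document's own name."""
    return Path(docx_path).with_suffix(".json").name


# Hypothetical file names, only to show the mapping
for name in ["report.docx", "notes 2024.docx"]:
    print(name, "->", json_name_for(name))
```

Inside the loop above, book.save(json_name_for(files), ac.FileFormatType.JSON) would then take the place of the counter-based name.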
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 1 year ago.
How can I convert a PDF to DOCX, with or without Python? I want to automate the conversion of a large number of files, so I need an API.
I have used online websites like:
https://pdf2docx.com/
https://online2pdf.com/pdf2docx
https://www.zamzar.com/convert/pdf-to-docx/
I was unable to get access to use their APIs directly.
pdf2docx
Install the pdf2docx package.
Installation
Clone or download pdf2docx
pip install pdf2docx
or
# download the package and install it into your environment
python setup.py install
Option 1
from pdf2docx import Converter
pdf_file = r'C:\Users\ABCD\Desktop\XYZ/Document1.pdf'# source file
docx_file = r'C:\Users\ABCD\Desktop\XYZ/sample.docx' # destination file
# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file, start=0, end=None)
cv.close()
#Output
Parsing Page 53: 53/53...
Creating Page 53: 53/53...
--------------------------------------------------
Terminated in 6.258919400000195s.
Option 2
from pdf2docx import parse
pdf_file = r'C:\Users\ABCD\Desktop\XYZ/Document2.pdf' # source file
docx_file = r'C:\Users\ABCD\Desktop\XYZ/sample_2.docx' # destination file
# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=None)
# output
Parsing Page 53: 53/53...
Creating Page 53: 53/53...
--------------------------------------------------
Terminated in 5.883666100000482s.
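Since the question is about converting many files, the Option 1 calls can be wrapped in a loop over a folder. This is a minimal sketch under stated assumptions: plan_conversions and convert_all are hypothetical helper names, and the folder layout (all PDFs in one directory, outputs in another) is assumed.

```python
from pathlib import Path


def plan_conversions(src_dir, dst_dir):
    """Pair every PDF in src_dir with a .docx path of the same name in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [
        (pdf, dst_dir / (pdf.stem + ".docx"))
        for pdf in sorted(src_dir.glob("*.pdf"))
    ]


def convert_all(src_dir, dst_dir):
    """Convert each planned pair, reusing the Converter calls from Option 1."""
    # Imported here so the planning helper works without pdf2docx installed.
    from pdf2docx import Converter

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for pdf, docx in plan_conversions(src_dir, dst_dir):
        cv = Converter(str(pdf))
        try:
            cv.convert(str(docx), start=0, end=None)
        finally:
            cv.close()  # release the file even if the conversion fails
```

Keeping the path planning separate from the conversion makes it easy to check which files would be written before running a long batch.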
You can try pdftohtml, then use Pandoc to convert HTML to docx.
Actually, PDF is not really a document format but rather a page layout format, so conversion can be problematic.
I'm the CTO at Zamzar and we have an API to do just this available at https://developers.zamzar.com/
We have a Test account you can use for free to try out the service, and code samples for Python in our docs which would enable you to convert a PDF file to DOCX quite simply:
import requests
from requests.auth import HTTPBasicAuth
api_key = 'YOUR_API_KEY'
endpoint = "https://sandbox.zamzar.com/v1/jobs"
source_file = "/tmp/my.pdf"
target_format = "docx"
data_content = {'target_format': target_format}
# Open the source file in a with block so it is closed after the upload
with open(source_file, 'rb') as f:
    file_content = {'source_file': f}
    res = requests.post(endpoint, data=data_content, files=file_content, auth=HTTPBasicAuth(api_key, ''))
print(res.json())
You can then poll the job to see when it has finished before downloading your converted file.
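The polling step can be sketched generically. The helper below is a minimal sketch, not Zamzar's client code: poll_until is a hypothetical name, and the assumption that the job JSON carries a "status" field that eventually reads "successful" or "failed" should be checked against the API documentation.

```python
import time


def poll_until(fetch_job, is_finished, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_job() repeatedly until is_finished(job) is true or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if is_finished(job):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within %.0f seconds" % timeout)
        sleep(interval)  # wait before asking again


# Stand-in for real HTTP calls: three successive job states (assumed field name)
states = iter([{"status": "converting"}, {"status": "converting"}, {"status": "successful"}])
job = poll_until(
    fetch_job=lambda: next(states),
    is_finished=lambda j: j["status"] in ("successful", "failed"),
    interval=0,
)
print(job)
```

In real use, fetch_job would be a small function that issues a GET request for the job with the same HTTPBasicAuth credentials and returns the decoded JSON.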
Try PDF.to. It has a PDF API with cURL, PHP, Python, and NodeJS examples, and good documentation.
Converting PDFs to documents can be problematic; going the other way is much easier.
One possible solution is to "Save As" the PDF file in the desired location with a ".docx" extension. This might work if the PDF was originally saved from a DOCX file, and vice versa.
What I've got
I've built a GUI that populates a MS Word file with text and saves it. I was asked to include a function that makes it possible to directly print the output from the GUI.
When printing just a string object the following code works as intended:
from PyQt5.QtWidgets import QDialog, QTextEdit
from PyQt5.QtPrintSupport import QPrintDialog
def func_print_short(obj_str):
    # Put the string into a QTextEdit so Qt can lay it out for printing
    print_content = QTextEdit()
    print_content.setText(obj_str)
    dlg_print = QPrintDialog()
    # Print only if the user confirms the dialog
    if dlg_print.exec_() == QDialog.Accepted:
        print_content.document().print_(dlg_print.printer())
What I'm trying to achieve
Trying to use the same routine with a MS Word file I ended up with the following code snippet, not being able to figure out how to properly send the document to the printer.
import docx
def send_to_printer(doc):
    # -- datatype conversions --
    print_content = doc
    dlg_print = QPrintDialog()
    if dlg_print.exec_() == QDialog.Accepted:
        print_content.document().print_(dlg_print.printer())
As expected, this doesn't work as the printer can't handle the data it receives. Unfortunately, I found a similar question on printing a PDF file from a GUI here, which is why I suppose that what I want to achieve might not be possible either without some workaround.
I've also found a post on printing MS Word documents here. However, I don't want to save it somewhere first to be able to print it.
My questions
Is there a way to print the document directly from the GUI? Any suggestions on how to convert the document into the right format? Or is there a better solution than temporarily saving it and using subprocess for the rest?
Thank you in advance!
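One possible workaround for the datatype conversion, offered only as a sketch: if doc is a python-docx Document, its body text can be pulled out as a plain string and handed to the existing func_print_short(). The helper name docx_to_plain_text is hypothetical, and all Word formatting, tables, headers and images are lost this way.

```python
def docx_to_plain_text(doc):
    """Join the text of all body paragraphs of a python-docx Document."""
    # python-docx exposes body paragraphs as doc.paragraphs, each with a .text string.
    # Tables, headers, footers and images are not included here.
    return "\n".join(paragraph.text for paragraph in doc.paragraphs)
```

With this helper, func_print_short(docx_to_plain_text(doc)) would print the text without saving the file first, at the cost of the document's layout.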
I am developing an infrastructure where developers can document their verification tests using Jupyter notebooks. One part of the infrastructure will be a python script that can convert their .ipynb files to .html files to provide public-facing documentation of their tests.
Using the nbconvert module does most of what I want, but I would like to allow citations and references in the final HTML file. I can use pypandoc to generate HTML text that converts the citations to proper inlined syntax and adds a References section:
import nbformat
import pypandoc
from nbconvert import MarkdownExporter
# Read the notebook from disk
with open('SimpleExample.ipynb', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=4)
# Export the notebook to Markdown first, so pandoc can process the citations
exporter = MarkdownExporter()
(body, resources) = exporter.from_notebook_node(notebook)
filters = ['pandoc-citeproc']
extra_args = ['--bibliography="ref.bib"',
              '--reference-links',
              '--csl=MWR.csl']
new_body = pypandoc.convert_text(body,
                                 'html',
                                 'md',
                                 filters=filters,
                                 extra_args=extra_args)
The problem is that this generated HTML loses all of the considerable formatting and other capabilities provided by nbconvert.HTMLExporter.
My question is, is there a straightforward way to merge the results of nbconvert.HTMLExporter and pypandoc.convert_text() such that I get mostly the former, with inline citations and a Reference section added from the latter?
I don't know that this necessarily counts as "straightforward" but I was able to come up with a solution. It involves writing a class that inherits from nbconvert.preprocessors.Preprocessor and implements the preprocess(self, nb, resources) method. Here is what preprocess() does:
1. Loop over every cell in the notebook and store a set of citation keys (these are of the form [#bibtex_key]).
2. Create a short body of text consisting of only these citation keys, each separated by '\n\n'.
3. Use the pandoc conversion above to generate HTML text from this short body of text. If num_cite is the number of citations, the first num_cite lines of the generated text will be the inline versions of the citations (e.g. '(Author, Year)'); the remaining lines will be the content of the references section.
4. Go back through each cell and substitute the inline text of each citation for its key.
5. Add a cell to the notebook with ## References.
6. Add a cell to the notebook with the content of the references section.
Now, when an HTMLExporter, using this Preprocessor, converts a notebook, the results will have inline citations, a reference section, and all of the formatting you expect from the HTMLExporter.
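The key-collection and substitution steps can be sketched with the standard library alone. This is not the author's Preprocessor: the function names and the regular expression for the [#bibtex_key] form are assumptions, the cell sources are plain strings here, and an ordered list stands in for the set so the keys can be matched to pandoc's output lines by position.

```python
import re

# Assumed key syntax, taken from the description above: [#bibtex_key]
CITATION = re.compile(r"\[#([^\]\s]+)\]")


def collect_citation_keys(sources):
    """Return citation keys in order of first appearance across all cell sources."""
    keys = []
    for source in sources:
        for key in CITATION.findall(source):
            if key not in keys:
                keys.append(key)
    return keys


def substitute_citations(source, inline_by_key):
    """Replace each [#key] with its inline text; unknown keys are left untouched."""
    return CITATION.sub(lambda m: inline_by_key.get(m.group(1), m.group(0)), source)


# Made-up cells and inline texts, only to show the two steps together
cells = ["See [#smith2010] and [#lee2015].", "Again [#smith2010]."]
keys = collect_citation_keys(cells)
inline = {"smith2010": "(Smith, 2010)", "lee2015": "(Lee, 2015)"}
print(keys)
print([substitute_citations(c, inline) for c in cells])
```

Inside a real preprocess(self, nb, resources) method, the strings would come from each cell's source, and the inline dictionary would be built from the first num_cite lines of the pandoc output.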
I am working on a script that needs to grab all of the documents out of a directory and merge them together with comments and formatting.
I know VBS can do this, but VBS is very limited and slow when it comes to parsing documents, especially in Word 2013.
I've looked over the documentation of pywin32 but couldn't find anything.
I would think it would be something as simple as
word = win32.Dispatch("Word.Application")
doc = word.AddDocument()
doc.InsertDocument(Filename)
There isn't much code to show because all the code I have currently is used after the documents are merged.
The code will look like this:
import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
output = word.Documents.Add()
output.Application.Selection.Range.InsertFile('second.doc')
output.Application.Selection.Range.InsertBreak()
output.Application.Selection.Range.InsertFile('first.doc')
output.SaveAs('output.doc')
output.Close()
This question also may be useful.
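To cover a whole directory, the same calls can be put in a loop. This is a rough sketch under stated assumptions: list_word_files and merge_documents are hypothetical helpers, the files are inserted last-to-first on the assumption (suggested by the order used in the answer) that each insert lands in front of the existing content, and whether comments survive InsertFile is not verified here.

```python
from pathlib import Path


def list_word_files(directory):
    """Absolute paths of .doc/.docx files in directory, sorted by name."""
    return [
        str(p.resolve())
        for p in sorted(Path(directory).iterdir())
        # Skip Word's "~$" lock files, which are not real documents
        if p.suffix.lower() in (".doc", ".docx") and not p.name.startswith("~$")
    ]


def merge_documents(paths, output_path):
    """Insert the given files into a new document using the calls shown above."""
    # Imported lazily: win32com needs Windows, pywin32 and an installed Word.
    import win32com.client as win32

    word = win32.gencache.EnsureDispatch("Word.Application")
    word.Visible = False
    output = word.Documents.Add()
    # Last file first, mirroring the second.doc / first.doc order in the answer.
    for index, path in enumerate(reversed(paths)):
        if index:
            output.Application.Selection.Range.InsertBreak()
        output.Application.Selection.Range.InsertFile(path)
    output.SaveAs(output_path)
    output.Close()
```

A call such as merge_documents(list_word_files(folder), target) would tie the two together; full paths are used because Word resolves relative paths against its own working directory.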