I am trying to allow user to upload MS Word file and then I run a certain function that takes a string as input argument. I am uploading Word file through FileUpload however I am getting a coded object. I am unable to decode using byte UTF-8 and using upload.value or upload.data just returns coded text
Any ideas how I can extract content from uploaded Word File?
> upload = widgets.FileUpload()
> upload
#I select the file I want to upload
> upload.value #Returns coded text
> upload.data #Returns coded text
> #Previously upload['content'] worked, but I read this no longer works in IPYWidgets 8.0
Modern ms-word files (.docx) are actually zip-files.
The text (but not the page headers) are actually inside an XML document called word/document.xml in the zip-file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.
>>> import docx
>>> gkzDoc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
... fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs. Not e.g. the text from tables.
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a dictionary, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed over different version of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat un-elegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
Tweaking with #Roland Smith great suggestions, following code finally worked:
import io
import docx
from docx import Document
upload = widgets.FileUpload()
upload
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
print (p.text)
I am trying to extract text from a pdf file I usually have to deal with at work, so that I can automize it.
When using PyPDF2, it works for my CV for instance, but not for my work-document. The problem is, that the text is then like that: "Helloworldthisisthetext". I then tried to use .join(" "), but this is not working.
I read that this is a known problem with PyPDF2 - it seems to depend on the way the pdf was built.
Does anyone know another approach how to extract text out of it which I then can use for further steps?
Thank you in advance
I can suggest you to try another tool - pdfreader. You can extract the both plain strings and "PDF markdown" (decoded text strings + operators). "PDF markdown" can be parsed as a regular text (with regular expressions for example).
Below you find the code sample for walking pages and extracting PDF content for further parsing.
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
try:
while True:
viewer.render()
pdf_markdown = viewer.canvas.text_content
result = my_text_parser(pdf_markdown)
# The one below will probably be the same as PyPDF2 returns
plain_text += "".join(viewer.canvas.strings)
viewer.next()
except PageDoesNotExist:
pass
...
def my_text_parser(text):
""" Code your parser here """
...
pdf_markdown variable contains all texts including PDF commands (positioning, display): all strings come in brackets followed by Tj or TJ operator.
For more on PDF text operators see PDF 1.7 sec. 9.4 Text Objects
You can parse it with regular expressions for example.
I had a similar requirement at work for which I used PyMuPDF. They also have a collection of recipes which cover typical scenarios of text extraction.
I did python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get perfect output only some HTML code.
I want save all webpage text content in a text file.
I used urllib2 or bs4 but I didn't get results.
I don't want output as a html structure.
I want all text data from webpage
What do you mean with "webpage text"?
It seems you don't want the full HTML-File. If you just want the text you see in your browser, that is not so easily solvable, as the parsing of a HTML-document can be very complex, especially with JavaScript-rich pages.
That starts with assessing if a String between "<" and ">" is a regular tag and includes analyzing the CSS-Properties changed by JavaScript-behavior.
That is why people write very big and complex rendering-Engines for Webpage-Browsers.
You dont need to write any hard algorithms to extract data from search result. Google has a API to do this.
Here is an example:https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, first you must to register in google for API Key.
All information you can find here: https://developers.google.com/api-client-library/python/start/get_started
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
Is it possible to embed rendered HTML output into IPython output?
One way is to use
from IPython.core.display import HTML
HTML('link')
or (IPython multiline cell alias)
%%html
link
Which return a formatted link, but
This link doesn't open a browser with the webpage itself from the console. IPython notebooks support honest rendering, though.
I'm unaware of how to render HTML() object within, say, a list or pandas printed table. You can do df.to_html(), but without making links inside cells.
This output isn't interactive in the PyCharm Python console (because it's not QT).
How can I overcome these shortcomings and make IPython output a bit more interactive?
This seems to work for me:
from IPython.core.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))
The trick is to wrap it in display as well.
Source: http://python.6.x6.nabble.com/Printing-HTML-within-IPython-Notebook-IPython-specific-prettyprint-tp5016624p5016631.html
Edit:
from IPython.display import display, HTML
In order to avoid:
DeprecationWarning: Importing display from IPython.core.display is
deprecated since IPython 7.14, please import from IPython display
Some time ago Jupyter Notebooks started stripping JavaScript from HTML content [#3118]. Here are two solutions:
Serving Local HTML
If you want to embed an HTML page with JavaScript on your page now, the easiest thing to do is to save your HTML file to the directory with your notebook and then load the HTML as follows:
from IPython.display import IFrame
IFrame(src='./nice.html', width=700, height=600)
Serving Remote HTML
If you prefer a hosted solution, you can upload your HTML page to an Amazon Web Services "bucket" in S3, change the settings on that bucket so as to make the bucket host a static website, then use an Iframe component in your notebook:
from IPython.display import IFrame
IFrame(src='https://s3.amazonaws.com/duhaime/blog/visualizations/isolation-forests.html', width=700, height=600)
This will render your HTML content and JavaScript in an iframe, just like you can on any other web page:
<iframe src='https://s3.amazonaws.com/duhaime/blog/visualizations/isolation-forests.html', width=700, height=600></iframe>
Related: While constructing a class, def _repr_html_(self): ... can be used to create a custom HTML representation of its instances:
class Foo:
def _repr_html_(self):
return "Hello <b>World</b>!"
o = Foo()
o
will render as:
Hello World!
For more info refer to IPython's docs.
An advanced example:
from html import escape # Python 3 only :-)
class Todo:
def __init__(self):
self.items = []
def add(self, text, completed):
self.items.append({'text': text, 'completed': completed})
def _repr_html_(self):
return "<ol>{}</ol>".format("".join("<li>{} {}</li>".format(
"☑" if item['completed'] else "☐",
escape(item['text'])
) for item in self.items))
my_todo = Todo()
my_todo.add("Buy milk", False)
my_todo.add("Do homework", False)
my_todo.add("Play video games", True)
my_todo
Will render:
☐ Buy milk
☐ Do homework
☑ Play video games
Expanding on #Harmon above, looks like you can combine the display and print statements together ... if you need. Or, maybe it's easier to just format your entire HTML as one string and then use display. Either way, nice feature.
display(HTML('<h1>Hello, world!</h1>'))
print("Here's a link:")
display(HTML("<a href='http://www.google.com' target='_blank'>www.google.com</a>"))
print("some more printed text ...")
display(HTML('<p>Paragraph text here ...</p>'))
Outputs something like this:
Hello, world!
Here's a link:
www.google.com
some more printed text ...
Paragraph text here ...
First, the code:
from random import choices
def random_name(length=6):
return "".join(choices("abcdefghijklmnopqrstuvwxyz", k=length))
# ---
from IPython.display import IFrame, display, HTML
import tempfile
from os import unlink
def display_html_to_frame(html, width=600, height=600):
name = f"temp_{random_name()}.html"
with open(name, "w") as f:
print(html, file=f)
display(IFrame(name, width, height), metadata=dict(isolated=True))
# unlink(name)
def display_html_inline(html):
display(HTML(html, metadata=dict(isolated=True)))
h="<html><b>Hello</b></html>"
display_html_to_iframe(h)
display_html_inline(h)
Some quick notes:
You can generally just use inline HTML for simple items. If you are rendering a framework, like a large JavaScript visualization framework, you may need to use an IFrame. Its hard enough for Jupyter to run in a browser without random HTML embedded.
The strange parameter, metadata=dict(isolated=True) does not isolate the result in an IFrame, as older documentation suggests. It appears to prevent clear-fix from resetting everything. The flag is no longer documented: I just found using it allowed certain display: grid styles to correctly render.
This IFrame solution writes to a temporary file. You could use a data uri as described here but it makes debugging your output difficult. The Jupyter IFrame function does not take a data or srcdoc attribute.
The tempfile
module creations are not sharable to another process, hence the random_name().
If you use the HTML class with an IFrame in it, you get a warning. This may be only once per session.
You can use HTML('Hello, <b>world</b>') at top level of cell and its return value will render. Within a function, use display(HTML(...)) as is done above. This also allows you to mix display and print calls freely.
Oddly, IFrames are indented slightly more than inline HTML.
to do this in a loop, you can do:
display(HTML("".join([f"<a href='{url}'>{url}</a></br>" for url in urls])))
This essentially creates the html text in a loop, and then uses the display(HTML()) construct to display the whole string as HTML
I would like to get the plain text (e.g. no html tags and entities) from a given URL.
What library should I use to do that as quickly as possible?
I've tried (maybe there is something faster or better than this):
import re
import mechanize
br = mechanize.Browser()
br.open("myurl.com")
vh = br.viewing_html
//<bound method Browser.viewing_html of <mechanize._mechanize.Browser instance at 0x01E015A8>>
Thanks
you can use HTML2Text if the site isnt working for you you can go to HTML2Text github Repo and get it for Python
or maybe try this:
import urllib
from bs4 import*
html = urllib.urlopen('myurl.com').read()
soup = BeautifulSoup(html)
text = soup.get_text()
print text
i dont know if it gets rid of all the js and stuff but it gets rid of the HTML
do some Google searches there are multiple other questions similar to this one
also maybe take a look at Read2Text
In Python 3, you can fetch the HTML as bytes and then convert to a string representation:
from urllib import request
text = request.urlopen('myurl.com').read().decode('utf8')