How to display <IPython.core.display.HTML object>? - python

I try to run the code below, but I have a problem showing the results. Also, I am using the PyCharm IDE.
from fastai.text import *
data = pd.read_csv("data_elonmusk.csv", encoding='latin1')
data.head()
data = (TextList.from_df(data, cols='Tweet')
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=48))
data.show_batch()
The output when I run the line data.show_batch() is:
<IPython.core.display.HTML object>

If you don't want to work within a Jupyter Notebook, you can save the data as an HTML file and open it in a browser:
with open("data.html", "w") as file:
    file.write(data)

You can only render HTML in a browser, not in a Python console/editor environment.
Hence it works in Jupyter Notebook, JupyterLab, etc.
At best you can call .data to see the HTML source, but again it will not render.
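For example, any IPython HTML object exposes its markup through the .data attribute; this small sketch (the html_obj variable is hypothetical, just to illustrate the attribute) shows what you get in a plain console:
from IPython.core.display import HTML

# hypothetical HTML object, standing in for whatever fastai builds behind the scenes
html_obj = HTML("<table><tr><td>xxbos example tweet</td></tr></table>")

print(html_obj)       # in a plain console this prints <IPython.core.display.HTML object>
print(html_obj.data)  # this prints the raw HTML markup, but it is still not rendered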

I solved my problem by running the code in a Jupyter Notebook.

You could add this code after data.show_batch():
plt.show()

Another option besides writing it to a file is to use an HTML parser in Python to programmatically edit the HTML. The most commonly used tool in Python is Beautiful Soup. You can install it via
pip install beautifulsoup4
Then in your program you could do
from bs4 import BeautifulSoup

html_string = data.show_batch().data
soup = BeautifulSoup(html_string, "html.parser")
# do some manipulation of the parsed HTML object
# then do whatever else you want with the object
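As a hypothetical follow-up, once the HTML is parsed you could, for example, pull the visible text out of every table cell:
# example manipulation: print the text of each table cell in the parsed batch
for cell in soup.find_all("td"):
    print(cell.get_text(strip=True))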

Just use the data attribute of the HTML object.
with open("data.html", "w") as file:
    file.write(data.data)
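If you then want to look at the saved file without leaving the script, a small optional addition using the standard-library webbrowser module can open it in your default browser:
import os
import webbrowser

# open the freshly written data.html in the system's default browser
webbrowser.open("file://" + os.path.abspath("data.html"))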

Related

Why can I not get local files to parse using BeautifulSoup4 in Jupyterlab

I'm following a web tutorial, trying to use BeautifulSoup4 to extract data from an HTML file (stored on my local PC) in JupyterLab, as follows:
from bs4 import BeautifulSoup

with open('simple.html') as html_file:
    simple = BeautifulSoup('html_file', 'lxml')

print(simple.prettify())
I'm getting the following output irrespective of what is in the HTML file, instead of the expected HTML:
<html>
 <body>
  <p>
   html_file
  </p>
 </body>
</html>
I've also tried it using the html.parser parser, and I simply get html_file as the output.
I know it can find the file, because when I run the code after removing the file from the directory I get a FileNotFoundError.
It works perfectly well when I run Python interactively from the same directory, and I'm able to run other BeautifulSoup code to parse web pages.
I'm using Fedora 32 Linux with Python 3, JupyterLab, BeautifulSoup4, requests, and lxml installed in a virtual environment using pipenv.
Any help to get to the bottom of the problem is welcome.
Your problem is in this line:
simple = BeautifulSoup('html_file','lxml')
In particular, you're telling BeautifulSoup to parse the literal string 'html_file' instead of the contents of the variable html_file.
Changing it to:
simple = BeautifulSoup(html_file,'lxml')
(note the lack of quotes surrounding html_file) should give the desired result.
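Putting it together, the corrected version of the snippet from the question (assuming simple.html is in the notebook's working directory) would be:
from bs4 import BeautifulSoup

with open('simple.html') as html_file:
    simple = BeautifulSoup(html_file, 'lxml')  # pass the file object, not the string 'html_file'

print(simple.prettify())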

Fatal error reading PNG image file: Not a PNG file in Ubuntu 20.04 LTS

I am trying to download an image using the requests module in Python. It works, but when I try to open the image it shows "Fatal error reading PNG image file: Not a PNG file". The code I used to download it is:
import requests

img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.png", "wb") as f:
    f.write(r.content)
I run my code from the terminal with python3 s.py (s.py is the name of the file).
Is something wrong with my code, or is it something else in my operating system (Ubuntu 20.04 LTS)?
import requests
response = requests.get("https://devnote.in/wp-content/uploads/2020/04/devnote.png")
file = open("sample_image.png", "wb")
file.write(response.content)
print (response.content)
file.close()
The URL https://devnote.in/wp-content/uploads/2020/04/devnote.png is protected by mod_security, so the request returns an error page like:
<html><head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>.
mod_security can be disabled with the help of .htaccess on an Apache server:
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>
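If you cannot change the server configuration, a common client-side workaround worth trying (whether it works depends on the server's mod_security rules, so treat this as an assumption) is to send a browser-like User-Agent header:
import requests

# a browser-like User-Agent; some mod_security rule sets block the default python-requests agent
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://devnote.in/wp-content/uploads/2020/04/devnote.png",
                        headers=headers)

with open("sample_image.png", "wb") as f:
    f.write(response.content)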
It's because you tried to save javaWeb.jpg (a JPG file) as java_book.png (a PNG file).
In an attempt to see what we are working with, I've tried replicating the issue; please see below what I found out.
1.) The file you are attempting to open is the ENTIRE HTML document. I can support this because !DOCTYPE html appears at the beginning of what your 'wb' (write binary) call wrote out.
From here we have a few options to solve the problem.
a.) We could simply download the image from the web page, placing it in a local folder/directory or wherever you want it. This is by far the easiest option, because it lets us open the file later without having to do too much. While I'm on a Windows machine, Ubuntu should have no problem doing this either (unless you are on an Ubuntu without a GUI, which can be fixed with startx IF SUPPORTED).
b.) If you have to pull directly from the site itself, you could try something like this using BeautifulSoup, from this answer here; a rough sketch follows below. Honestly, I've never really used the latter option, since downloading and moving the file is usually more effective.
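A rough sketch of option b.), assuming the image is exposed in an ordinary <img> tag (the page URL and the selector here are hypothetical):
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dimik.pub/")           # hypothetical page that links to the image
soup = BeautifulSoup(page.text, "html.parser")

img_tag = soup.find("img")                         # adjust the selector for the image you want
if img_tag is not None:
    img_bytes = requests.get(img_tag["src"]).content
    with open("downloaded_image.jpg", "wb") as f:
        f.write(img_bytes)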
You just need to save the image as a JPG.
import requests

img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.jpg", "wb") as f:
    f.write(r.content)
Yeah, it's a full HTML document.
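One quick way to confirm this for yourself before writing the file is to inspect the response headers and the first bytes of the body:
import requests

r = requests.get("http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg")
print(r.status_code)                   # 200 alone does not guarantee you got an image
print(r.headers.get("Content-Type"))   # "image/jpeg" for a real image, "text/html" for an error page
print(r.content[:20])                  # a JPEG starts with b'\xff\xd8\xff'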

Boxes are displayed instead of text in pdfkit - Python3

I am trying to convert an HTML file to PDF using the pdfkit Python library.
I followed the documentation from here.
Currently, I am trying to convert plain text to PDF instead of a whole HTML document. Everything is working fine, but instead of text I am seeing boxes in the generated PDF.
This is my code:
import pdfkit
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
pdfkit.from_string(content, 'test.pdf', configuration=config)
This is the output: instead of the text 'This is a paragraph which I am trying to convert to pdf.', the converted PDF contains boxes.
Any help is appreciated.
Thank you :)
I was unable to reproduce the issue with Python 2.7 on Ubuntu 16.04; it works fine with the specs mentioned. From my understanding, this problem comes from your operating system not having the font or encoding in which the file is being generated by pdfkit.
Maybe try doing this:
import pdfkit

config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
options = {
    'encoding': 'utf-8',
}
pdfkit.from_string(content, 'test.pdf', configuration=config, options=options)
Options to modify the PDF can be added as a dictionary and assigned to the options argument of from_string. The list of options can be found here.
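For example, a slightly fuller (hypothetical) options dictionary that combines the encoding fix with common wkhtmltopdf layout options might look like this:
import pdfkit

options = {
    'encoding': 'utf-8',      # avoids missing-glyph boxes for non-ASCII text
    'page-size': 'A4',
    'margin-top': '10mm',
    'margin-bottom': '10mm',
}
# configuration=... can be passed as well, exactly as in the snippets above
pdfkit.from_string('This is a paragraph which I am trying to convert to pdf.',
                   'test.pdf', options=options)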
This issue is also discussed in Include custom fonts in AWS Lambda.
If you are using pdfkit on Lambda, you will have to set the environment variables
"FONT_CONFIG_PATH": '/opt/fonts/'
"FONTCONFIG_FILE": '/opt/fonts/fonts.conf'
If this problem occurs in a local environment, a fresh installation of wkhtmltopdf should resolve it.
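If you prefer to set those variables from your handler code rather than in the Lambda console, a minimal sketch (the wkhtmltopdf binary path here is hypothetical and depends on how you package it) would be:
import os
import pdfkit

# wkhtmltopdf runs as a subprocess, so it inherits these variables from the Python process
os.environ["FONT_CONFIG_PATH"] = "/opt/fonts/"
os.environ["FONTCONFIG_FILE"] = "/opt/fonts/fonts.conf"

# hypothetical binary location; adjust to wherever your layer/package puts wkhtmltopdf
config = pdfkit.configuration(wkhtmltopdf="/opt/bin/wkhtmltopdf")
pdfkit.from_string("This is a paragraph which I am trying to convert to pdf.",
                   "/tmp/test.pdf", configuration=config)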

Python AttributeError NoneType 'text'

I'm trying to make a Python script that can scrape the RAW Paste Data section of saved Pastebin pages. But I'm running into AttributeError: 'NoneType' object has no attribute 'text'. I'm using BeautifulSoup in my project. I tried to install spider-egg with pip so I could use that as well, but there were issues downloading the package from the server.
I need to be able to grab multiple lines from the RAW Paste Data section and then print them back out.
first_string = raw_box.text.strip()
second_string = raw_box2.text.strip()
From the Pastebin page I have the element for the RAW Paste Data section, which is:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
Taking the class name paste_code, I then have this:
raw_box = soup.find('first_string ', attrs={'class': 'paste_code'})
raw_box2 = soup.find('second_string ', attrs={'class': 'paste_code'})
I thought that should have been it, but apparently not, because I get the error I mentioned before. After parsing the stripped data I need to be able to redirect it into a file after printing it. I also want to make this Python 3 compatible, but that would take a little more work, I think, since there are a lot of differences between Python 2.7.12 and 3.5.2.
The following approach should help to get you started:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://pastebin.com/hGeHMBQf')
soup = BeautifulSoup(r.text, "html.parser")
raw = soup.find('textarea', id='paste_code').text
print(raw)
Which for this example should display:
hello world
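Since the question also asks about redirecting the scraped text into a file, a small Python 3 compatible extension of the snippet above could be (the output filename is arbitrary):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://pastebin.com/hGeHMBQf')
soup = BeautifulSoup(r.text, "html.parser")
raw = soup.find('textarea', id='paste_code').text

print(raw)
# write the scraped paste to a file as well
with open("paste_output.txt", "w") as out_file:
    out_file.write(raw)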

Show output of .html file in Jupyter notebooks

I can create an html file using this code:
with open(file_loc + 'file.html', 'w') as html:
    html.write(s.set_table_attributes("border=1").render())
How can I show the output in a Jupyter notebook without creating the file?
If I simply try to render it in Jupyter (shown below), it shows the HTML code instead of displaying the desired output that I would see in my browser:
from IPython.core.display import display, HTML
s.set_table_attributes("border=1").render()
Use this:
from IPython.display import HTML
HTML(filename="profiling/z2pDecisionTreeProfiling.html")
You need to invoke IPython's HTML function:
HTML(s.set_table_attributes("border=1").render())
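If the call is not the last expression in the cell (for example inside a loop or a function), wrapping it in display() forces the output; a minimal sketch, assuming s is the object from the question that provides .render():
from IPython.display import display, HTML

# s is assumed to come from the question's code and to expose .set_table_attributes().render()
display(HTML(s.set_table_attributes("border=1").render()))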
