Weird HTML code looks like this b'\xff\xd8\xff\xe0 - python

I'm using Python to retrieve a page's HTML source, but what comes out looks like this. What is this, and why am I not getting the actual page source?
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C

This is an image, specifically a JPEG. Since it's a byte string, Python prints it with the b'...' prefix.
A JPEG starts with the magic bytes \xff\xd8\xff.
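To confirm what came back before treating it as HTML, a minimal sketch (with a hypothetical URL) can check both the Content-Type header and the magic bytes:

import urllib.request

resp = urllib.request.urlopen("http://example.com/page")  # hypothetical URL
data = resp.read()

# The server states what it sent in the Content-Type header.
print(resp.headers.get_content_type())  # e.g. "image/jpeg" rather than "text/html"

# JPEG data begins with the magic bytes FF D8 FF.
if data.startswith(b"\xff\xd8\xff"):
    print("That response is a JPEG image, not HTML.")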

Try using BeautifulSoup. Here's an example:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Basically, what you're seeing is encoded characters that need to be decoded.
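As a minimal sketch of that suggestion (assuming beautifulsoup4 is installed), BeautifulSoup detects the encoding of a byte string and gives you Unicode back; note this only helps when the payload really is HTML rather than an image:

from bs4 import BeautifulSoup

raw = b"<html><body><p>caf\xc3\xa9</p></body></html>"  # UTF-8 encoded bytes
soup = BeautifulSoup(raw, "html.parser")
print(soup.p.get_text())  # café, now a decoded Unicode string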

Related

Python encoding character stress

I have some scraped data that I have output into an HTML file, held locally as a 'raw' version before I do some data manipulation.
The issue is that when I process the website I have a troublesome time dealing with a "'" character.
After much research I am getting to the end of my tether. I have seen much about that apostrophe causing issues; I have tried many versions of encoding and decoding, chardet, etc., and still cannot get it to work.
A word in a few tables is: CA’BELLAVISTA
When I run the script, the IDE prints it correctly once I get the right encoding/decoding pattern; however, when I view the output HTML file I get CA\x92BELLAVISTA every time.
The script is simply a urllib.response.read() then encoding.
Is it the web browser doing it, or is the script genuinely not getting the correct character?
The next step involves me loading in the HTML file for further manipulation and output to JSON/csv so I thought nailing the encoding on html file output would be the best option.
I think it's an ISO-8859-1/Latin-1 charset, although that seems to change on the odd webpage.
I hope I'm doing the correct thing in trying to put it into UTF-8.
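For what it's worth, \x92 is the right single quotation mark in Windows-1252 (a common superset of ISO-8859-1/Latin-1), so a minimal sketch of the read-decode-write flow, assuming the pages really are Windows-1252 and using a hypothetical URL, would be:

import urllib.request

raw = urllib.request.urlopen("http://example.com/tables").read()  # hypothetical URL
text = raw.decode("cp1252")  # b"CA\x92BELLAVISTA" becomes "CA’BELLAVISTA"
with open("raw.html", "w", encoding="utf-8") as f:
    f.write(text)  # the file on disk is now genuine UTF-8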

HTML Decoding in Python

I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in Python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of images in the value part of input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth logging to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters, like the sketch below.
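A sketch of such a function (written for Python 3; the question's snippet is Python 2, but the idea is the same):

import sys

def printable(s):
    # Round-trip through the console's encoding, replacing anything it
    # can't represent with "?", so print never raises UnicodeEncodeError.
    enc = sys.stdout.encoding or "ascii"
    return s.encode(enc, "replace").decode(enc)

print("OLD_DATA:", printable(data))  # 'data' is the string from the question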
I also found a couple of other Stack Overflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this article and python manual entry:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

html DOM: good test webpages to test encoding/decoding

What I'm doing is:
via JavaScript, reading the DOM of the webpage
converting it to a JSON string
sending it to Python via AJAX
in Python, JSON-decoding the string into an object
What I want is for any text that is part of the JSON to be in Unicode, to avoid any character issues. I used to use BeautifulSoup for this:
from bs4.dammit import UnicodeDammit

# Guess the byte string's encoding and return Unicode markup.
text_unicode = UnicodeDammit(text, [None, None], "html", True).unicode_markup
But that doesn't work with the JSON string; running the string through UnicodeDammit causes an error when I try to JSON-decode it.
The thing is, I'm not even sure that collecting the DOM doesn't handle this issue automatically.
For starters, I would therefore like a series of test webpages to test this, where one is encoded with UTF-8, another with something else, etc., and that use characters which will look wrong if, for example, you think it's UTF-8 but it's not. Note that I don't even bother considering the webpage's stated encoding; this is too often wrong.
You are trying to solve a problem that does not exist.
The browser is responsible for detecting and handling the web page encoding. It'll determine the correct encoding based on the server headers, meta tags in the HTML page and plain guessing if needed. The DOM gives you Unicode data.
JSON handles Unicode data; sending JSON data to your Python process sends appropriately encoded byte data that any decent JSON library will turn back into Unicode values for you. The Python json module is such a library.
Just load the data from your JavaScript script with the json.load() or json.loads() functions as is. Your browser will already have used the correct encoding (most likely UTF-8), and the Python json module will decode any of the standard encodings used without additional configuration or handling.
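As a minimal sketch of that point: json.loads() accepts the raw bytes of the request body (Python 3.6+ detects the UTF-8/UTF-16/UTF-32 variant automatically) and hands back Unicode str values:

import json

# What arrives over the wire from the browser is UTF-8 encoded bytes.
body = '{"text": "Информатика"}'.encode("utf-8")
data = json.loads(body)  # bytes in, Unicode str values out
print(data["text"])      # Информатика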

Python detect broken encoding

After crawling many websites, from some of them I receive broken-encoding data. I can't do anything with it; I just need to detect it. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? It could be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.
There's NLTK, which has a guess_encoding function that takes a byte string and tries all of the available encodings; would this serve your purpose?
Take a look at https://github.com/LuminosoInsight/python-ftfy
If I understand correctly, it will attempt to 'repair' incorrectly encoded/decoded text.
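A minimal sketch of using ftfy as a detector (pip install ftfy; this is a heuristic and won't catch every kind of corruption): if fix_text() changes the input, the text was probably mangled somewhere upstream.

import ftfy

sample = "·ç¼wÃdª«¦Ê³f"  # one of the garbled strings from the question
repaired = ftfy.fix_text(sample)
if repaired != sample:
    print("looks like mis-decoded text; ftfy suggests:", repaired)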

Determining charset from html meta tags w/python

I have a script that needs to determine the charset before the file is read by lxml.HTML() for parsing. I will assume ISO-8859-1 (that's the normal assumed charset for this, right?) if it can't be found, and search the HTML for the meta tag with the charset attribute. However, I'm not sure of the best way to do that. I could try to create an etree with lxml, but I don't want to read the whole file since I may run into encoding problems. However, if I don't read the whole file I can't build an etree, since some tags will not have been closed.
Should I just find the meta tag with some fancy string subscripting and break out of the loop once it's found or a certain number of lines have been read? Maybe use a low-level HTML parser, e.g. html.parser? Using Python 3, btw. Thanks.
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse the document with lxml. This might be tricky, since lxml throws parse errors if the charset does not match. A workaround would be decoding and re-encoding the data, ignoring the unknown characters:
html_data = html_data.decode("UTF-8", "ignore")  # drop bytes that aren't valid UTF-8
html_data = html_data.encode("UTF-8", "ignore")  # re-encode to clean UTF-8
After this, you can parse by invoking lxml.HTML() with UTF-8 encoding.
This way, you'll be able to find the correct encoding declared in the HTML headers.
After finding the encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you might not find a character encoding even in the HTML headers. I'd suggest using the chardet module to find the proper encoding only after these steps fail.
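A rough sketch of that order of attack (HTTP header first, then a simplified scan of the first few kilobytes for a meta tag, then chardet as a last resort; the meta-tag regex here is deliberately naive):

import re
import urllib.request

import chardet  # pip install chardet

def detect_charset(url):
    resp = urllib.request.urlopen(url)
    raw = resp.read()
    # 1. Charset from the HTTP Content-Type header, if the server sent one.
    charset = resp.headers.get_content_charset()
    if charset:
        return charset
    # 2. Simplified scan of the first 4 KB for a meta charset declaration.
    head = raw[:4096].decode("ascii", "ignore")
    match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1)
    # 3. Last resort: statistical guess with chardet.
    return chardet.detect(raw)["encoding"]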
Determining the character encoding of an HTML file correctly is actually quite a complex matter, but the HTML5 spec defines exactly how a processor should do it. You can find the algorithm here: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
