Special characters in output appear as � - Python

I have some .txt files which include Turkish characters. I generated an HTML page with Python and want to include the text from those .txt files in it. The process succeeds, but the HTML files produced by Python have character problems (special characters show up like this: �).
I have tried adding u before the strings in my Python code, but it did not work.
The txt files were themselves made by Python; they are actually my blog entries, which I fetched using urllib. They do not have character problems.
Thank you for your answers.

When you serve the content to a web browser, you need to tell it what encoding the file is in. Ideally, you should send a Content-Type HTTP header in the response with something like text/plain; charset=utf-8, where "utf-8" is replaced by whatever encoding you're actually using, if it's not utf-8.
Your browser may also need to be set to use a unicode-aware font for displaying text files; if it uses a font that doesn't have the necessary glyphs, obviously it can't display them.
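If you end up serving the files with Python's own web server, a minimal sketch (the port, the handler name, and the extension mapping are my assumptions) that sends an explicit charset looks like this:

```python
# Python 3 sketch: serve files with an explicit UTF-8 charset in the
# Content-Type header, using only the standard library.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class Utf8Handler(SimpleHTTPRequestHandler):
    # Override the extension-to-content-type mapping so .txt and .html
    # responses declare their encoding explicitly.
    extensions_map = dict(SimpleHTTPRequestHandler.extensions_map,
                          **{".txt": "text/plain; charset=utf-8",
                             ".html": "text/html; charset=utf-8"})

# Uncomment to serve the current directory on port 8000:
# HTTPServer(("", 8000), Utf8Handler).serve_forever()
```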

I think it isn't a Python question. Have you specified the document's character encoding?

You have to supply your HTML with something like the following:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Replace UTF-8 with the encoding actually used in the .txt file.

Related

How to remove non-english from the textual data in csv file

I am cleaning textual data that I scraped from multiple URLs. How can I remove non-English words/symbols from the data in a CSV file?
I saved the data and read the data using the following codes:
To save the data as csv file:
df.to_csv("blogdata.csv", encoding = "utf-8")
After saving the data, the csv file shows as follows, including non-English words and symbols (e.g., '\n\t\t\t', m’, etc.):
The symbols did not show in the original data, and some of them even appear in data that is in English. Take 'Ross Parker' in the 7th row as an example.
The data saved in the csv file says: ['\n\t\t\t', 'It’s about time I wrote an update on what we’ve been up to over the past few months. We’re about to...
Whereas the original data scraped from the URL shows as follows:
Can anybody explain why this happens and help me solve this issue and clean the non-English data from the file?
Thank you so much in advance!
It looks like pilot error: the data is correct but you are looking at it in a tool which is configured or hard-coded to display the text as Latin-1 (or Windows code page 1252?) even though you saved it as UTF-8.
Some tools - especially on Windows - will do whimsical things with UTF-8 which doesn't have a BOM. Maybe try adding one (and maybe file a bug report if this really helps; the tool should at the very least let you override its default encoding, without modifying the input data).
In other words, if the screenshot with the broken data is from Excel, you probably selected a DOS code page (or the horribly mislabelled "ANSI") instead of UTF-8 when it asked how to import this CSV file. Perhaps the best fix would be to devise a workflow which does not involve a spreadsheet.
Or perhaps you used a tool which didn't ask you anything, and tried to "sniff" the data to determine its encoding, and it guessed wrong. Adding an invisible byte sequence called a BOM which is unique to UTF-8 should hopefully allow it to guess right; but this is buggy behavior - you should not be a hostage to its clearly imperfect heuristics. (See also "Bush hid the facts" for a related story.)
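If the mangled view is coming from Excel, one low-effort version of the BOM fix (a sketch; the DataFrame here is a stand-in for the real scraped blogdata) is to save with the utf-8-sig codec:

```python
import pandas as pd

# Stand-in for the scraped blog data; the real df comes from your scraping code.
df = pd.DataFrame({"text": [u"It\u2019s about time I wrote an update"]})

# "utf-8-sig" writes a UTF-8 byte order mark before the data, which nudges
# Excel (and other encoding-sniffing tools) into reading the file as UTF-8.
df.to_csv("blogdata.csv", encoding="utf-8-sig", index=False)
```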

HTML Decoding in Python

I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of images in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters.
I also found a couple other StackOverflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this Python manual entry and this article:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

Scraping a website whose encoding is iso-8859-1 instead of utf-8: how do I store the correct unicode in my database?

I'd like to scrape a website using Python that is full of horrible problems, one being the wrong encoding at the top:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This is wrong because the page is full of occurrences like the following:
Nell’ambito
instead of
Nell'ambito (please notice ’ replaces ')
If I understand correctly, this is happening because utf-8 bytes (probably the database encoding) are interpreted as iso-8859-1 bytes (forced by the charset in the meta tag).
I found some initial explanation at this link http://www.i18nqa.com/debug/utf8-debug.html
I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests, however all I need is to understand what is the correct way to store in my database a string that fixes ’ by encoding the string to '.
I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests
Are you feeding the encoding from the Content-Type HTTP header into BeautifulSoup?
If an HTML page has both a Content-Type header and a meta tag, the header should 'win', so if you're only taking the meta tag you may get the wrong encoding.
Otherwise, you could either feed the fixed encoding 'utf-8' into BeautifulSoup, or fix up each string individually.
Annoying note: it's not actually ISO-8859-1. When web pages say ISO-8859-1, browsers actually take it to mean Windows code page 1252, which is similar to 8859-1 but not the same. The € would seem to indicate cp1252 because it's not present in 8859-1.
u'Nell’ambito'.encode('cp1252').decode('utf-8')
If the content is encoded inconsistently, with some UTF-8 and some cp1252 on the same page (typically due to poor database content handling), this would be the only way to recover it: catch UnicodeError and return the original string when it wouldn't transcode.
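A sketch of that per-string recovery step (the function name is mine):

```python
def fix_mojibake(s):
    """Undo UTF-8 bytes that were mis-decoded as cp1252
    (e.g. u'Nell\u00e2\u20ac\u2122ambito' -> u'Nell\u2019ambito').

    Strings that don't round-trip (already-correct text) are
    returned unchanged instead of raising.
    """
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s
```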

Peter Piper piped a Python program - and lost all his unicode characters

I have a Python script that loads a web page using urllib2.urlopen, does some various magic, and spits out the results using print. We then run the program on Windows like so:
python program.py > output.htm
Here's the problem:
The urlopen reads data from an IIS web server which outputs UTF-8. It spits out this same data to the output; however, certain characters (such as the long dash that Word always inserts for you against your will because it's smarter than you) get garbled and end up like – instead.
Upon further investigation, I noticed even though the web server spits out UTF8 data, the output.htm file is encoded with the ISO-8859-1 character set.
My questions:
When you redirect a Python program to an output file on Windows, does it always use this character set?
If so, is there any way to change that behavior?
If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.
Thanks for any help!
UPDATE:
At the top of output.htm I added:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.
From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
You really shouldn't use an XML declaration if the document is not valid XML.
The best and most reliable way would be to serve the file via HTTP and set the Content-Type: header appropriately.
When you pipe a Python program to an output file on Windows, does it always use this character set?
The default encoding is used when outputting to a pipe. On my machine:
In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'
If not, is there a workaround?
import sys
try:
    sys.setdefaultencoding('utf-8')
except AttributeError:
    # site.py deletes setdefaultencoding at startup; reload(sys) restores it
    sys = reload(sys)
    sys.setdefaultencoding('utf-8')
Now all output is encoded to 'utf-8'.
I think the correct way to handle this situation without having to
redo a whole bunch of logic
is to decode all data from your internet source (using the server or page encoding) to unicode, and then use the workaround shown above to set the default encoding to utf-8.
Most programs under Windows will assume that you're using the default Windows encoding, which will be Windows-1252 (often loosely called ISO-8859-1) for an English installation. This goes for the command window output as well. There's no way to set the default encoding to UTF-8, unfortunately - there's a code page defined for it, but it's not well supported.
Some editors will recognize any BOM characters at the start of the file and switch to UTF-8, but that's not guaranteed.
If you're generating HTML you should include the proper charset tag; then the browser will interpret it properly.

Determining charset from html meta tags w/python

I have a script that needs to determine the charset before the document is read by lxml.HTML() for parsing. I will assume ISO-8859-1 (that's the normal assumed charset for this, right?) if it can't be found, and search the HTML for the meta tag with the charset attribute. However, I'm not sure of the best way to do that. I could try to create an etree with lxml, but I don't want to read the whole file, since I may run into encoding problems. However, if I don't read the whole file, I can't build an etree, since some tags will not have been closed.
Should I just find the meta tag with some fancy string subscripting and break out of the loop once it's found or a certain number of lines have been read? Maybe use a low-level HTML parser, e.g. html.parser? Using Python 3, btw. Thanks.
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse it out of the HTML with lxml. This might be tricky, since lxml throws parse errors if the charset does not match. A workaround is decoding and re-encoding the data, ignoring the unknown characters:
html_data=html_data.decode("UTF-8","ignore")
html_data=html_data.encode("UTF-8","ignore")
After this, you can parse by invoking lxml.HTML() with utf-8 encoding.
This way, you'll be able to find the encoding declared in the document's meta tags.
After finding the declared encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you might not find a character encoding even in the meta tags. I'd suggest using the chardet module to find the proper encoding only after these steps fail.
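A rough sketch of the header-then-meta part of that order (the function name is mine, and the regex is a simplification of what a real HTML parser does):

```python
import re

def sniff_charset(head_bytes, http_content_type=None):
    """Guess a document's charset: HTTP Content-Type header first,
    then a <meta> declaration in the first chunk of the HTML."""
    # 1. HTTP header, e.g. "text/html; charset=utf-8"
    if http_content_type:
        m = re.search(r"charset=([-\w]+)", http_content_type)
        if m:
            return m.group(1)
    # 2. Meta tag; decode as latin-1 (which never fails) just to run the
    #    regex over the first 2 KB. Handles both <meta charset="..."> and
    #    the older http-equiv form.
    text = head_bytes[:2048].decode("latin-1")
    m = re.search(r'<meta[^>]+charset=["\']?\s*([-\w]+)', text, re.IGNORECASE)
    return m.group(1) if m else None  # caller falls back to chardet / a default
```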
Determining the character encoding of an HTML file correctly is actually quite a complex matter, but the HTML5 spec defines exactly how a processor should do it. You can find the algorithm here: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
