SPSS-Python write to CSV - wrong encoding when opening in Excel

In SPSS, using python, I am writing a list of lists into a csv file:
begin program.
import spss, spssaux, sys, csv, codecs

def WriteDim():
    MyArray = [some list of lists]
    for MyVar in MyFile.varlist:
        MyArray.append([MyVar.name, MyVar.label])
    DimFile = "DimCSV.csv"
    with codecs.open(DimFile, "w", encoding='utf8') as output:
        writer = csv.writer(output, lineterminator='\n')
        writer.writerows(MyArray)
end program.
I have some Spanish text in my practice array, for example "reparación". If I open the output file in a text editor, all looks fine. However, if I open it in Excel 2016, it looks like this: "reparaciÃ³n". I would have to go to Data / "Import From Text" and manually choose UTF-8 encoding, but this is not an option for the future users of my SPSS program.
Is there any way of writing the file so that Excel will open it using UTF-8 encoding?
It has to be a CSV file; opening it in Excel is only one use of it.

You explicitly ask for a utf8 encoding at codecs.open(DimFile, "w", encoding='utf8'), and later say you would prefer not to use utf8. Just use the expected encoding directly:
with codecs.open(DimFile, "w",encoding='cp1252') as output:
(cp1252 is the common encoding for Spanish on Windows)

While Serge Ballesta's answer worked perfectly for Spanish, I found that encoding='utf-8-sig' works best for all characters I tested. I felt UTF-8 should be used, as it is more common than the other suggested encodings.
Credit to this topic:
Write to UTF-8 file in Python
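Applied to the question's script, only the encoding argument changes; a minimal sketch, reusing the names from the code above:
with codecs.open(DimFile, "w", encoding='utf-8-sig') as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(MyArray)
The utf-8-sig codec writes a UTF-8 byte order mark at the start of the file, which is the signature Excel uses to detect UTF-8.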

Related

Cannot read imported csv file with excel in UTF-8 format after python scraping

I have a csv file encoded in utf-8 (filled with information scraped from a website with a Python script, using str(data_scraped.encode('utf-8')) on the content at the end).
When I import it to excel (even if I pick 65001: Unicode UTF8 in the options), it doesn't display the special characters.
For example, it would show \xc3\xa4 instead of ä
Any ideas of what is going on?
I solved the problem.
The reason is that in the original code, I removed items such as \t and \n that were "polluting" the output with the replace function. I guess I removed too much and the result was no longer readable for Excel.
In the final version, I didn't use
str(data_scrapped.encode('utf-8')) but
data_scrapped.encode('utf-8','ignore').decode('utf-8')
then I used split and join to remove the "polluting" terms:
string_split=data_scrapped.split()
data_scrapped=" ".join(string_split)

UTF-8 issue - Is there a way to convert strange looking characters Ã¤ to its proper German character ä in Python?

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, what appears is Ã¤ instead of ä, Ãœ instead of Ü, and so on. It happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (data step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and the Python DataFrame as such, instead of the proper characters like ä, ö, ü, ß.
One work around to solve this issue is -
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the files, in SAS or Python, then everything is imported correctly.
But, sometimes the .txt files that I have are very big (in GBs), so I cannot open them and do this hack to solve this issue.
I could use .replace() function, to replace these strange characters with the real ones, but there could be some combinations of strange characters that I am not aware of, that's why I wish to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper characters, like Ã¤ gets translated to ä, and so on?
Did you try to use the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'w', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is assuming text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and probably make sure you save it in the correct format as soon as you have read it.
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
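For example, a minimal sketch (ftfy is a third-party package and must be installed separately; the sample strings are illustrative):
import ftfy

print(ftfy.fix_text(u'Ã¤'))            # prints: ä
print(ftfy.fix_text(u'reparaciÃ³n'))   # prints: reparación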

Converting character codes to unicode [Python]

So I have a large CSV of French verbs that I am using to make a program. In the CSV, verbs with accented characters contain codes instead of the actual accents:
Ãªtre is être, for example (at least when I open the file in Excel).
Here is the csv:
https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv
In Chrome and Firefox, at least, the codes are converted to the correct accents. I was wondering, once the string is imported into Python and assigned to a variable, i.e.
...
for row in reader:
    inf_lst.append(row[0])

verb = inf_lst[2338]
#(verb = être)
whether there is a straightforward/built-in method for printing it out with the correct unicode to give "être"?
I am aware that you could do this by replacing the Ãª's with ê's in each string, but since this would have to be done for each different possible accent, I was wondering if there was an easier way.
Thanks,
You can use unicode encoding by prefixing a string with 'u'.
>>> foo = u'être'
>>> print foo
être
It all comes down to the character encoding of the data. It's possible that it is utf-8 encoded and you are viewing it in a Windows tool that is using your local code page, which gives a different display for the stream. How to read/write files is covered in the csv doc examples.
You've given us a zipped, utf-8 encoded web page, and the requests module is good at handling that sort of thing. So, you could read the csv with:
>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
... stream=True)
>>> try:
...     inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
...     del resp
...
>>> len(inf_lst)
5362
You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.
To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:
import io
with io.open('out.csv', 'w', encoding='utf-8-sig') as f:
    f.write(u'être')

Writing unicode data in python

I have an xlsx file that I need to convert to csv. I used the openpyxl module along with unicodecsv for this. My problem is that while writing some files I am getting junk characters in the output. Details below.
One of my files has the unicode code point u'\xa0' in it, which corresponds to a NO-BREAK SPACE, but when converted to csv the file shows Â instead of the space. While printing the same data on the console using the Python GUI it prints perfectly, without any Â. What am I doing wrong here? Any help is appreciated.
Sample Code:
import unicodecsv
from openpyxl import load_workbook

xlsx_file = load_workbook('testfile.xlsx', use_iterators=True)
with open('test_utf.csv', 'wb') as open_file:
    csv_file = unicodecsv.writer(open_file)
    sheet = xlsx_file.get_active_sheet()
    for row in sheet.iter_rows():
        csv_file.writerow(cell.internal_value for cell in row)
P.S: The type of data written is Unicode.
Okay, so what is going on is that Excel likes to assume that you are using the currently configured codepage. You have a couple of options:
Write your data in that codepage. This requires, however, that you know which one your users will be using.
Load the csv file using the "import data" menu option. If you are relying on your users to do this, don't. Most people will not be willing to do this.
Use a different program that will accept unicode in csv by default, such as Libre Office.
Add a BOM to the beginning of the file to get Excel to recognise utf-8. This may break in other programs.
Since this is for your personal use, if you are only ever going to use Excel, then appending a byte order marker to the beginning is probably the easiest solution.
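A minimal sketch of that last option, reusing the unicodecsv setup from the question (the explicit BOM write and the sample row are the only additions, and are illustrative):
import codecs
import unicodecsv

with open('test_utf.csv', 'wb') as open_file:
    # write the UTF-8 byte order mark first so Excel detects the encoding
    open_file.write(codecs.BOM_UTF8)
    csv_file = unicodecsv.writer(open_file, encoding='utf-8')
    csv_file.writerow([u'caf\xe9', u'no\xa0break\xa0space'])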
Microsoft likes byte-order marks in its text files. Even though a BOM doesn't make sense with UTF-8, it is used as a signature to let Excel know the file is encoded in UTF-8.
Make sure to generate your .csv as UTF-8 with BOM. I created the following using Notepad++:
English,Chinese
American,美国人
Chinese,中国人
The result saved with the BOM opens correctly in Excel; the result without the BOM shows garbled characters.

Exporting data containing umlauts into a .csv which is readable by Excel

I am using Python 2.7.2 on Mac OS X 10.8.2. I need to write a .csv file which often contains several umlauts like ä, ö and ü. When I write the .csv file, Numbers and OpenOffice are both able to read the csv correctly and display the umlauts without any problems.
But if I open it with Microsoft Excel 2004, the words are displayed like this:
TuÃàrlersee
I know Excel has problems dealing with UTF-8. I read that Excel versions below 2007 are not able to read UTF-8 files properly, even if you have set the UTF-8 BOM (Byte Order Mark). I'm setting the UTF-8 BOM with the following line:
e.write(codecs.BOM_UTF8)
So my next step was, instead of exporting it as a UTF-8 file, to set the character encoding to mac-roman. With the following line I decoded the value from utf-8 and re-encoded it with mac-roman:
projectName = projectDict['ProjectName'].decode('utf-8').encode('mac-roman')
But then I receive the following error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0308' in position 6: character maps to <undefined>
How can I export this data into a .csv so that Excel is also able to read the umlauts correctly? Python internally handles everything in UTF-8, or maybe I'm not understanding the decoding/encoding correctly. In Python 3.0 the whole encoding/decoding model was reworked, but I need to stay on version 2.7.2.
I am using the DictWriter like this:
w = csv.DictWriter(e, fieldnames=fieldnames, extrasaction='ignore', delimiter=';', quotechar='\"', quoting=csv.QUOTE_NONNUMERIC)
w.writeheader()
The \u0308 is a combining diaeresis; you'll need to normalize your unicode string before encoding to mac-roman:
import unicodedata
unicodedata.normalize('NFC', projectDict['ProjectName'].decode('utf-8')).encode('mac-roman')
Demo, encoding an ä character in decomposed form (a plus combining diaeresis) to mac-roman after normalization to composed characters:
>>> unicodedata.normalize('NFC', u'a\u0308').encode('mac-roman')
'\x8a'
I've used this technique in the past to produce CSV for Excel for specific clients where their platform encoding was known upfront (Excel will interpret the file in the current Windows encoding, IIRC). In that case I encoded to windows-1252.
CSV files are really only meant to be in ASCII. If what you're doing is just writing out data for import into Excel later, then I'd write it as an Excel workbook to start with, which would avoid having to muck about with this kind of stuff.
Check http://www.python-excel.org/ for the xlwt module.
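A minimal sketch of that route, assuming the xlwt package is installed (the file, sheet and cell contents are illustrative):
import xlwt

wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet('Export')
ws.write(0, 0, u'ProjectName')
ws.write(1, 0, u'T\xfcrlersee')   # umlauts go in as real unicode, no codepage guessing
wb.save('export.xls')
Excel opens .xls workbooks without any encoding dialog, so the umlauts survive regardless of the user's locale.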
