Save pandas dataframe containing Chinese character to file - python

I have a pandas dataframe where some fields contain Chinese characters. I use the code below:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv')
Then I open saved.csv in Excel or a text editor, and all the Chinese characters become junk characters. However, I am able to load the saved file and display the Chinese characters properly, as follows:
df = pd.read_csv('saved.csv')
df.head() # Chinese characters are properly displayed.
Does anyone know how to solve this problem?

Try the following:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv', encoding='utf_8_sig')
It worked for me when plain utf-8 failed.
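A minimal round trip showing why utf_8_sig helps Excel (the data values here are made up for illustration):

```python
import pandas as pd

# A small frame standing in for the original data (hypothetical values).
df = pd.DataFrame({'name': ['美国人', '中国人'], 'id': [1, 2]})

# 'utf_8_sig' prepends the UTF-8 BOM (b'\xef\xbb\xbf'); Excel uses that
# BOM as a signature to detect UTF-8 instead of assuming the local codepage.
df.to_csv('saved.csv', encoding='utf_8_sig', index=False)

with open('saved.csv', 'rb') as f:
    raw = f.read()
assert raw.startswith(b'\xef\xbb\xbf')

# pandas strips the BOM transparently when reading it back.
df2 = pd.read_csv('saved.csv', encoding='utf-8-sig')
assert df2['name'].tolist() == ['美国人', '中国人']
```

A plain encoding='utf-8' write produces identical text content but without the three BOM bytes, which is exactly what makes Excel fall back to the local codepage.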

The problem is with the encoding Excel assumes.
To resolve the issue, I first open the csv in Sublime Text and then choose File -> Save with Encoding -> UTF-8 with BOM (Byte Order Mark).
Now Excel is able to open the csv without any problems!

Related

How to convert the Windows-1251 encoding in a csv file to actual UTF-8

I'm trying to turn "hieroglyphs" back into Russian letters. I took a dataset (the RUvideos.csv file) and loaded it via pandas:
import pandas as pd
data = pd.read_csv('RUvideos.csv', encoding='utf-8')
I took the "title" column (a pd.Series) and saved it to another CSV file:
x = data.title.copy()
x.to_csv(join(fld, 'title.csv'))
Then I opened the resulting file in Notepad++, because there are tips saying you can use it to change the encoding of a CSV.
It reports UTF-8 encoding, although the characters are actually encoded in Win-1251. I tried all the encodings; it didn't help.
Does anyone know how to convert Win-1251 to UTF-8 in this case?
I changed the encoding in Excel, but then the file's splitting into columns breaks when running read_csv.
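One hedged approach, assuming the bytes really are Windows-1251: tell pandas the file's true encoding when reading, then re-save as UTF-8. The sample file below is constructed here for illustration; the real title.csv comes from the question.

```python
import pandas as pd

# A sample file standing in for title.csv, saved in Windows-1251.
with open('title.csv', 'wb') as f:
    f.write('title\nПривет мир\n'.encode('cp1251'))

# Read with the true encoding, then re-save as UTF-8 with BOM so that
# both Excel and Notepad++ detect it correctly.
data = pd.read_csv('title.csv', encoding='cp1251')
data.to_csv('title_utf8.csv', encoding='utf-8-sig', index=False)

assert pd.read_csv('title_utf8.csv', encoding='utf-8-sig')['title'][0] == 'Привет мир'
```

This avoids Excel entirely, so the column splitting is untouched.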

Python/Pandas : how to read a csv in cp1252 with a first row to delete?

Solution :
See the answer: the file was not encoded in CP1252 but in UTF-16. The solution code is:
import pandas as pd
df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')
It also works with encoding='utf-16-le'.
Update: output of the first 3 lines in bytes:
In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))
Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
I'm working with csv files whose raw form is shown in the bytes above.
The problem is that they have two features which together cause trouble:
the first row is not the header
there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252
I'm using Python 3.X and pandas to deal with these files.
But when I try to read it with this code :
import pandas as pd
df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)
I get the following output (same with header=0):
In order to read the csv correctly, I need to:
get rid of the accent
and ignore/delete the first row (which I don't need anyway).
How can I achieve that?
PS: I know I could write a VBA program or something for this, but I'd rather not. I'm interested in handling it in my Python program, or in knowing for sure that it is not possible.
CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.
The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.
Neither UTF8 nor CP1252 would produce NANs either. Any single-byte codepage would read the numeric digits at least, which means the file is saved in a multi-byte encoding.
The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.
Try using utf-16 or utf-16-le instead of cp1252.
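To check for a BOM yourself before guessing encodings, you can peek at the file's first bytes, as the answer above did by hand. A small sketch (the helper name is mine, not from the answer):

```python
import codecs

# Common BOM signatures mapped to Python encoding names. The UTF-32
# BOMs must be checked first: BOM_UTF32_LE starts with BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(path):
    """Return the encoding suggested by the file's BOM, or None."""
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

Applied to the file above, the leading b'\xff\xfe' (the ÿþ in the dump) would map to utf-16-le.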

Pandas csv not reading Arabic characters

I have a csv file with mixed columns of numbers and Arabic text. I am trying to open it using the pandas csv reader. I tried the following:
Text = pd.read_csv('text.csv', encoding='utf-8-sig')
However, after reading the file, when I look at the columns I get a bunch of exclamation marks for each row instead of the Arabic text.
What could be the problem?
Thanks

How to display Chinese characters inside a pandas dataframe?

I can read a csv file in which one column contains Chinese characters (the other columns are English and numbers). However, the Chinese characters don't display correctly; see the photo below.
I loaded the csv file with pd.read_csv().
Neither display(data06_16) nor data06_16.head() displays the Chinese characters correctly.
I tried to add the following lines into my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
I have also tried adding the encoding argument to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
None of these worked.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
Try this
df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
I see three possible approaches here:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another theoretical possibility:
import pandas as pd
df = pd.DataFrame(pd.read_csv('testdata.csv', encoding='utf-8'))
3) Maybe you should convert your csv file to utf-8 before importing it with Python (for example in Notepad++)? That can work for a one-time import, but not for an automated process, of course.
A non-Python-related answer. I just ran into this problem this afternoon and found that using Excel's "import data from CSV" feature shows lots of encoding names. We can play with the encodings there and see which one fits our need. For instance, I found that in Excel both gb2312 and gb18030 convert the data nicely from csv to xlsx, but only gb18030 works in Python.
pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')
Anyway, this is not about how to import a csv in Python, but rather about finding the available encodings to try.
You load a dataset and get some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know the strange characters are Chinese, so I can figure out that whoever sent me the data encoded it in UTF-8, but it was then read as 'ISO-8859-1'.
So first I encode the string back, then decode it with UTF-8.
My lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is:
"'森Dyson Airwrap HS01礼'"
This works for me, but I guess I don't fully understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the str is in the strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is unnecessary. I just use a lambda on my column to encode and decode without caring about the format, so I change the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x: x.encode(_encoding, 'ignore').decode(_decoding, 'ignore'))
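What happens under the hood of that round trip can be shown with a self-contained illustration (the sample string is mine): the file's UTF-8 bytes were decoded as ISO-8859-1, so re-encoding as ISO-8859-1 recovers the original bytes, which then decode cleanly as UTF-8.

```python
original = '戴森 Dyson Airwrap HS01'

# Simulate the mojibake: UTF-8 bytes mistakenly decoded as ISO-8859-1
# (ISO-8859-1 maps every byte to a character, so this never fails).
garbled = original.encode('utf-8').decode('ISO-8859-1')

# Reverse the mistake: re-encode to the raw bytes, then decode as UTF-8.
repaired = garbled.encode('ISO-8859-1', 'ignore').decode('utf-8', 'ignore')
assert repaired == original
```

Note that in this clean simulation nothing is lost; the answer above loses some characters because its garbled text had already passed through an encoding (likely cp1252-flavoured) that could not represent every byte, which is why 'ignore' was needed.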

Writing unicode data in python

I have an xlsx file that I need to convert to csv; I used the openpyxl module along with unicodecsv for this. My problem is that while writing some files I get junk characters in the output. Details below.
One of my files has the Unicode code point u'\xa0' in it, which corresponds to a NO-BREAK SPACE, but when converted to csv my file shows Â instead of the space. When I print the same data on the console using the Python GUI it prints perfectly, without any Â. What am I doing wrong here? Any help is appreciated.
Sample code:
import unicodecsv
from openpyxl import load_workbook

xlsx_file = load_workbook('testfile.xlsx', use_iterators=True)
with open('test_utf.csv', 'wb') as open_file:
    csv_file = unicodecsv.writer(open_file)
    sheet = xlsx_file.get_active_sheet()
    for row in sheet.iter_rows():
        csv_file.writerow(cell.internal_value for cell in row)
P.S.: The data being written is Unicode.
Okay, so what is going on is that Excel assumes you are using the currently configured codepage. You have a couple of options:
Write your data in that codepage. However, this requires that you know which one your users will be using.
Load the csv file using the "import data" menu option. If you are relying on your users to do this, don't; most people will not be willing to.
Use a different program that accepts Unicode in csv by default, such as LibreOffice.
Add a BOM to the beginning of the file so Excel recognises UTF-8. This may break in other programs.
Since this is for your personal use, if you are only ever going to use Excel, then appending a byte order marker to the beginning is probably the easiest solution.
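A minimal sketch of that last option in modern Python 3, where the built-in csv module plus encoding='utf-8-sig' replaces unicodecsv entirely (filenames and rows are hypothetical):

```python
import codecs
import csv

rows = [['English', 'Chinese'], ['American', '美国人'], ['Chinese', '中国人']]

# encoding='utf-8-sig' writes the BOM (b'\xef\xbb\xbf') first, which is
# what tells Excel the file is UTF-8; newline='' is the csv module's
# recommended setting to avoid extra blank lines on Windows.
with open('test_utf.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)

with open('test_utf.csv', 'rb') as f:
    assert f.read(3) == codecs.BOM_UTF8
```

The same idea applies to the openpyxl conversion above: open the output file in text mode with utf-8-sig and write rows with the standard csv writer.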
Microsoft likes byte-order marks in its text files. Even though a BOM doesn't make sense with UTF-8 (there is no byte order to mark), Excel uses it as a signature to recognise that the file is encoded in UTF-8.
Make sure to generate your .csv as UTF-8 with BOM. I created the following using Notepad++:
English,Chinese
American,美国人
Chinese,中国人
The result saved with BOM:
The result without BOM:
