Writing DataFrame to encoded JSON Newline Delimited - python

In Python 2.7, I have a Pandas DataFrame with several unicode columns, integer columns, etc. I need to write it, encoded as UTF-8, to a newline-delimited JSON file.
I tried this, but it only works in Python 3, not Python 2.7.
with io.open('myjson.json', 'w', encoding='utf-8') as f:
    f.write(df.to_json(orient="records", lines=True, force_ascii=False))
This is my attempt's result, but as you can see the non-ASCII characters are escaped rather than written as UTF-8.
{"account_id":"support","case_id":7697,"message":"\u0633\u0628 \u0627\u0644\u0644\u0647\u0627\u0644\u0644\u0647 \u0627\u0644\u0639","created_at":1536606086392,"agent":"108915"}
{"account_id":"support","case_id":7697924,"message":"\u0647\u0627\u064a","created_at":1536601516354,"agent":"108915"}
I think it has something to do with this. But I'm not sure.
Other research I've done shows that if I put this in my code it works. But I also read that this isn't recommended.
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Edit - I missed the 2.7 part - I usually use 3.5 or higher. In any case, with Python 2.7, I was able to convert the unicode string to UTF-8 using codecs:
import codecs
codecs.unicode_escape_decode(a['message'])[0].encode("utf-8")
'\xd8\xb3\xd8\xa8 \xd8\xa7\xd9\x84\xd9\x84\xd9\x87\xd8\xa7\xd9\x84\xd9\x84\xd9\x87 \xd8\xa7\xd9\x84\xd8\xb9'
Old answer -
It looks like pandas .to_json() has a default setting of force_ascii=True, which escapes non-ASCII characters as \uXXXX sequences.
From docs:
to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression=None, index=True)
Try setting it to False:
df.to_json(force_ascii=False)
'{"agent":{"0":"108915"},"created_at":{"0":1536606086392},"message":{"0":"سب اللهالله الع"}}'
Edit - Forgot you were looking for newline delimited, so add lines=True as well:
>>> df.to_json(force_ascii=False, orient="records", lines=True)
'{"agent":"108915","created_at":1536606086392,"message":"سب اللهالله الع"}'

Related

FPDF encoding error when reading a UTF8 txt file in Python [duplicate]

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
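For example, chaining the two decodes in Python 2 gives:
>>> 'Capit\\xc3\\xa1n\n'.decode('string_escape').decode('utf-8')
u'Capit\xe1n\n'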
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default source-code encoding in Python 3, though open() itself still defaults to the platform's preferred encoding, as the docs quoted above note.)
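For example, a minimal Python 3 round trip with the built-in open (the file name here is just an example):
# Python 3: write and read text as UTF-8 with the built-in open()
with open('capitan.txt', 'w', encoding='utf-8') as f:   # hypothetical file name
    f.write('Capitán\n')
with open('capitan.txt', 'r', encoding='utf-8') as f:
    print(f.read())   # -> Capitán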
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# converting a file of unknown encoding to utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
To read in a Unicode string and then send it to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
CapitÃ¡n
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here. The file f2 contains, in hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
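A small sketch of the two options in Python 2 (the file names here are just examples):
import codecs
ss = u'Capit\xe1n'

# Option 1: write the ASCII escape form (like the contents of f2)
with open('escaped.txt', 'w') as f:                      # ordinary 8-bit file
    f.write(ss.encode('utf-8').encode('string_escape'))  # writes Capit\xc3\xa1n

# Option 2: write real UTF-8 bytes through an encoding-aware stream
with codecs.open('utf8.txt', 'w', 'utf-8') as f:
    f.write(ss)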
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor. Here's how you do it in Windows. On OS X, to enter an a with an acute accent you can just hit Option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
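After that, existing calls keep working unchanged but return decoded unicode; a quick sketch (assuming 'f1' is the UTF-8 file from the question):
# The patched open() now forwards to codecs.open with encoding='utf-8'
f = open('f1')      # same call as before the patch
print f.read()      # returns unicode, e.g. u'Capit\xe1n\n'
f.close()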
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
  File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Any open, print, or other statement will then just use utf8.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

Python not picking up UTF-8 encoding

I am having some trouble encoding ascii characters to UTF-8, or a string is not picking up the encoding.
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
#!/usr/bin/python
# -*- coding: UTF-8 -*-
def remove_non_ascii_1(text):
    text.encode('utf-8')
    for i in text:
        return ''.join(i for i in text if i=='£')
In Python 2.7 I get the error
SyntaxError: Non-ASCII character '\xc2' in file on line 16, but no encoding declared
With the Unicode replacement
return ''.join(i for i in text if i=='\xc2')
the error is
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Sample text :
row from a csv file reading in
[u'06/11/2020', u'ABC', u'32154E', u'3214', u'DEF', u'Cash Purchase', u'Final', u'', u'20.00%', u'ABC', u'Sold From Pickup', u'New ', u'10.00%', u'0', u'15%', u'\xa3469.84', u'Jonathan Jerome', u'3', u'\xa3140.95', u'2%', u'\xa393.97', u'\xa39,396.83', u'', u'\xa35,638.00', u'30/06/2020', u'4', u'Boiler-Amended']
I want to remove the \xa3 or £ in the currency fields.
First, two things up front:
Don't use Python 2 any more, for the reasons mentioned here!
Don't use different encodings in Python 2.
TL;DR Python 3 just improved so many things regarding encodings that it simply isn't worth it.
Whole story: read here
OK, with this out of the way, let's start fixing your code.
As Klaus D. already mentioned, you do not save the result of text.encode('utf-8'). This leads to an encoding warning when comparing seemingly equal characters (£ and £): one is encoded in the encoding coming from the file you read, while the other one is encoded in ascii (despite your -*- coding: UTF-8 -*- declaration; that only states what your code file is encoded in, it does not change the interpreter's behaviour regarding str parsing).
Edit: Also, when comparing to the character you will need to compare to a unicode char, so you can either convert it or simply tell the interpreter to treat it as unicode in the first place (that's why I added a leading 'u' in front of your '£').
To fix this, simply save the result into text again after you have called text.encode('utf-8').
Additionally, the "shebang" and the encoding info should always be at the very top of a file, so that as soon as you open the file you know what you are dealing with.
Something else I would correct is the first for-loop: it is unnecessary, because you return out of the function anyway after handling the first element.
This means the completely "corrected" code is this.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
def remove_non_ascii_1(text):
    text = text.encode('utf-8')
    return ''.join(i for i in text if i==u'£')
PS: You should really think again about whether the def remove_non_ascii_1(text) is really necessary. By the looks of it you already input a list of unicode encoded strings which you probably directly read from the file. This means you don't need to correct encoding though the comparison for '£' could stay. You might just want to rename the method ;)
Hope this helped and fixed possible unclarities about Python 2 encodings :D
If you print your list as a whole now, you will see it still contains \xa3 and not the actual '£', but if you print the elements separately it works fine. This is because the __str__() method of list shows the repr() of its elements rather than encoding the unicode strings directly, so non-ASCII characters show up as escape sequences.
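A quick illustration of that difference (a sketch with made-up values, in Python 2):
# Printing the list shows the repr() of each element (escape sequences),
# printing an element directly shows the actual character.
row = [u'\xa3469.84', u'\xa3140.95']   # hypothetical sample fields
print row      # -> [u'\xa3469.84', u'\xa3140.95']
print row[0]   # -> £469.84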
Python 3 greatly improved unicode text handling. If you have to use Python 2.7, I would recommend using the codecs library when reading text files since it helps you with Python 2.7 unicode issues:
import codecs
fp = codecs.open("file", "r", encoding="utf-8")
In your case I noticed that you are using unicodecsv as a drop-in csv replacement. In this case, you can pass the parameter encoding="utf-8" when reading the csv file into a list:
r = csv.reader(f, encoding='utf-8')
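From there, dropping the pound signs can be done on the decoded unicode fields directly; a minimal sketch (the file name is hypothetical):
import unicodecsv as csv

with open('sales.csv', 'rb') as f:          # hypothetical file name
    r = csv.reader(f, encoding='utf-8')
    # u'\xa3' is the pound sign; remove it from every field
    rows = [[field.replace(u'\xa3', u'') for field in row] for row in r]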
For just removing non-Ascii characters I would recommend checking this good answer on StackOverflow

Python/Pandas : how to read a csv in cp1252 with a first row to delete?

Solution :
See the answer: it was not encoded in CP1252 but in UTF-16. The solution code is:
import pandas as pd
df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')
Also works with encoding='utf-16-le'
Update : output of the first 3 lines in bytes :
In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))
Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
I'm working with csv files whose raw form is :
The problem is that it has two features that together raise a problem:
the first row is not the header
there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252
I'm using Python 3.X and pandas to deal with these files.
But when I try to read it with this code :
import pandas as pd
df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)
I get the following output (same with header=0):
In order to read the csv correctly, I need to:
get rid of the accent
and ignore / delete the first row (which I don't need anyway).
How can I achieve that?
PS : I know I could make a VBA program or something for this, but I'd
rather not. I'm interested in including it in my Python program, or in
knowing for sure that it is not possible.
CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.
The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.
Neither UTF8 nor CP1252 would produce NaNs either. Any single-byte codepage would read the numeric digits at least, which means the file is saved in a multi-byte encoding.
The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.
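A quick way to confirm this (a sketch, using the file name from the question):
# Read the first two raw bytes: b'\xff\xfe' is the UTF-16LE byte order mark
with open('file_T.csv', 'rb') as f:
    print(f.read(2))   # -> b'\xff\xfe'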
Try using utf-16 or utf-16-le instead of cp1252

Converting character codes to unicode [Python]

So I have a large csv of French verbs that I am using to make a program. In the csv, verbs with accented characters contain codes instead of the actual accents:
être appears as être, for example (at least when I open the file in Excel)
Here is the csv:
https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv
In Chrome and Firefox at least, the codes are converted to the correct accents. I was wondering, once the string is imported in Python into a given variable, i.e.
...
for row in reader:
    inf_lst.append(row[0])
verb = inf_lst[2338]
#(verb = être)
if there was a straightforward/built-in method for printing it out with the correct unicode to give "être"?
I am aware that you could do this by replacing the ê's with ê's in each string, but since this would have to be done for each different possible accent, I was wondering if there was an easier way.
Thanks,
You can use unicode encoding by prefixing a string with 'u'.
>>> foo = u'être'
>>> print foo
être
It all comes down to the character encoding of the data. It's possible that it is utf-8 encoded and you are viewing it in a Windows tool that is using your local code page, which gives a different display for the stream. How to read/write files is covered in the csv doc examples.
You've given us a zipped, utf-8 encoded web page, and the requests module is good at handling that sort of thing. So, you could read the csv with:
>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
... stream=True)
>>> try:
...     inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
...     del resp
...
>>> len(inf_lst)
5362
You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.
To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:
import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
    f.write(u'être')

How to display Chinese characters inside a pandas dataframe?

I can read a csv file in which there is a column containing Chinese characters (other columns are English and numbers). However, Chinese characters don't display correctly. see photo below
I loaded the csv file with pd.read_csv().
Neither display(data06_16) nor data06_16.head() will display the Chinese characters correctly.
I tried to add the following lines into my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
Also I have tried to add encoding arg to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
These won't work at all.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
Try this
df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
I see here three possible solutions:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another possibility can be theoretically this:
import pandas as pd
df = pd.DataFrame(pd.read_csv('testdata.csv',encoding='utf-8'))
3) Maybe you should convert your csv file into utf-8 before importing it with Python (for example in Notepad++)? It can be a solution for a one-time import, not for an automatic process, of course.
A non-Python-related answer. I just ran into this problem this afternoon and found that using Excel to import data from CSV can show us lots of encoding names. We can play with the encodings there and see which one fits our need. For instance, I found that in Excel both gb2312 and gb18030 convert the data nicely from csv to xlsx. But only gb18030 works in Python.
pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')
Anyway, this is not about how to import csv in Python, but rather to find the available encodings to try.
You load a dataset and you have some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I can figure that whoever sent me the data encoded it in UTF-8, but it ended up being read as 'ISO-8859-1'.
So as a first step, I encode the string as ISO-8859-1, then I decode it with UTF-8.
So my lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I guess I do not really understand what happens under the hood. So feel free to tell me if you have further information.
Bonus: I'll try to detect when the str is in the first strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is useless. I just use a lambda on my column to encode and decode without caring about the format. So I changed the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
