Pandas csv not reading Arabic characters - python

I have a CSV file with mixed columns of numbers and Arabic text. I am trying to open it with the pandas CSV reader, using the following:
Text = pd.read_csv('text.csv', encoding='utf-8-sig')
However, after reading the file, the columns show a row of exclamation marks in place of the Arabic text.
What could be the problem?
Thanks

Related

How to change the Windows-1251 encoding in a csv file with the actual UTF-8

I'm trying to turn garbled characters ("hieroglyphs") back into Russian letters. I took a dataset (the RUvideos.csv file) and loaded it with pandas:
import pandas as pd
from os.path import join
data = pd.read_csv('RUvideos.csv', encoding="utf-8")
I took the "title" column (a pd.Series) and saved it to another CSV file:
x = data.title.copy()
x.to_csv(join(fld, 'title.csv'))  # fld is the output folder, defined elsewhere
Then I opened the resulting file in Notepad++, because I had read that it can be used to change the encoding of a CSV file.
Notepad++ reports the encoding as UTF-8, although the characters are actually encoded in Win-1251. I tried all the encodings it offers, and none of them helped.
Does anyone know how to change Win-1251 to UTF-8 in this case?
I also changed the encoding in Excel, but then the column splitting breaks when I call read_csv.
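If the bytes on disk really are Windows-1251, one way to convert the file is to decode it with that codec and write the text back as UTF-8. The following is only a minimal sketch using the standard library; the function name reencode_file, the file names, and the sample content are hypothetical, and it assumes the source encoding is known to be cp1251.

```python
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Decode src_path with src_encoding and write the same text as dst_encoding."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Demonstration on a throwaway file with made-up content.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "title_cp1251.csv")
dst = os.path.join(tmp_dir, "title_utf8.csv")
with open(src, "wb") as f:
    f.write("id,title\n1,Привет\n".encode("cp1251"))

reencode_file(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8"))
```

With pandas, the equivalent would be reading with read_csv(..., encoding='cp1251') and writing with to_csv(..., encoding='utf-8'). If read_csv with encoding="utf-8" already succeeds and the text still looks garbled, the damage is probably inside the strings themselves and this conversion alone will not repair it.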

Cannot read imported csv file with excel in UTF-8 format after python scraping

I have a CSV file encoded in UTF-8. It is filled with information scraped from a website by Python code that ends with str(data_scraped.encode('utf-8')) for the content.
When I import it into Excel (even if I pick "65001: Unicode (UTF-8)" in the options), the special characters are not displayed.
For example, it shows \xc3\xa4 instead of ä.
Any idea what is going on?
I solved the problem.
In the original code, I used the replace function to remove items such as \t and \n that were "polluting" the output. I guess I removed too much, and Excel could no longer read the result.
In the final version, I didn't use
str(data_scraped.encode('utf-8'))
but
data_scraped.encode('utf-8', 'ignore').decode('utf-8')
Then I used split and join to remove the "polluting" terms:
string_split = data_scraped.split()
data_scraped = " ".join(string_split)
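The literal \xc3\xa4 in the file most likely comes from calling str() on a bytes object: assuming Python 3, that returns the printable representation of the bytes rather than the decoded text. A small sketch with a made-up sample string:

```python
text = "Käse"

encoded = text.encode("utf-8")        # bytes: b'K\xc3\xa4se'
as_repr = str(encoded)                # the repr of the bytes, escapes included
round_trip = encoded.decode("utf-8")  # back to the original text

print(as_repr)      # b'K\xc3\xa4se'
print(round_trip)   # Käse
```

Writing as_repr to a CSV puts the backslash escapes into the file as ordinary characters, so no encoding choice in Excel can turn them back into ä.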

Save pandas dataframe containing Chinese character to file

I have a pandas dataframe, where some fields contain Chinese character. I use the below code:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv')
Then I open saved.csv in Excel or a text editor, and all the Chinese characters become junk characters. However, I am able to load the saved file and display the Chinese properly as follows.
df = pd.read_csv('saved.csv')
df.head() # Chinese characters are properly displayed.
Does anyone know how to solve the problem?
Try the following:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv', encoding='utf_8_sig')
It works for me when utf-8 failed.
The problem is with the encoding that Excel assumes.
To resolve the issue, I first opened the CSV in Sublime Text and then chose File -> Save with Encoding -> UTF-8 with BOM (byte order mark).
Now Excel is able to open the CSV without any problems!
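What the utf_8_sig codec changes can be checked by looking at the first bytes of the written file. This sketch uses the standard csv module rather than pandas so that it runs without extra dependencies; the temporary path is an assumption for illustration.

```python
import codecs
import csv
import os
import tempfile

rows = [["English", "Chinese"], ["American", "美国人"], ["Chinese", "中国人"]]
path = os.path.join(tempfile.mkdtemp(), "saved.csv")

# "utf-8-sig" writes the three-byte UTF-8 signature before the data.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "rb") as f:
    raw = f.read()

print(raw[:3] == codecs.BOM_UTF8)   # True
```

The rest of the file is ordinary UTF-8, which is why pandas can read it back either way while Excel relies on the signature to pick the encoding.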

How to display Chinese characters inside a pandas dataframe?

I can read a CSV file in which one column contains Chinese characters (the other columns are English and numbers). However, the Chinese characters don't display correctly.
I loaded the CSV file with pd.read_csv().
Neither display(data06_16) nor data06_16.head() displays the Chinese characters correctly.
I tried to add the following lines into my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
I have also tried adding the encoding argument to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
None of these work.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
Try this
df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
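For context, utf-8-sig differs from utf-8 only in how it treats a leading byte order mark, so it would not help with data that was written as GBK. A short sketch of that difference on made-up bytes:

```python
import codecs

raw = codecs.BOM_UTF8 + "名称,price\n".encode("utf-8")

with_plain = raw.decode("utf-8")      # the BOM survives as U+FEFF
with_sig = raw.decode("utf-8-sig")    # the BOM is stripped

print(repr(with_plain[0]))   # '\ufeff'
print(with_sig[0])           # 名
```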
I see three possible approaches:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another possibility, in theory, is this:
import pandas as pd
df = pd.read_csv('testdata.csv', encoding='utf-8')  # read_csv already returns a DataFrame
3) Maybe you should convert your CSV file to UTF-8 before importing it with Python (for example in Notepad++). That can work for a one-time import, though of course not for an automated process.
A non-Python-related answer. I ran into this problem this afternoon and found that Excel's import-from-CSV dialog lists many encoding names. We can try the encodings there and see which one fits. For instance, I found that in Excel both gb2312 and gb18030 convert the data correctly from CSV to XLSX, but only gb18030 works in Python.
pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')
Anyway, this is not about how to import a CSV in Python, but rather about how to find candidate encodings to try.
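The same trial-and-error can be done in Python by decoding the raw bytes with a list of candidate codecs. This is only a sketch: the helper name first_working_encoding and the candidate list are assumptions, not part of pandas.

```python
def first_working_encoding(raw, candidates=("utf-8", "gb18030")):
    """Return the first codec in candidates that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "中国人,1542".encode("gb18030")
print(first_working_encoding(sample))
```

Decoding without an error does not prove that the codec is the right one. Single-byte codecs such as ISO-8859-1 accept any byte sequence, so they belong at the end of the candidate list, if at all, and the decoded text should still be checked by eye.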
You load a dataset and some of the text comes out as strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I figure that the data was encoded in UTF-8 but then decoded as 'ISO-8859-1' somewhere along the way.
So, as a first step, I encode the string with ISO-8859-1, then I decode it with UTF-8.
My lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is:
"'森Dyson Airwrap HS01礼'"
This works for me, but I don't really understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the string is in the first strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is useless. I just use a lambda on my column to encode and decode without caring about the format, so I change the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x: x.encode(_encoding, 'ignore').decode(_decoding, 'ignore'))
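What happens under the hood can be shown on a clean example: text that was encoded as UTF-8 and then wrongly decoded as ISO-8859-1 is repaired by reversing the wrong step and decoding again correctly. The sample string below is made up for the demonstration.

```python
original = "戴森美发造型器"

# Simulate the damage: UTF-8 bytes wrongly decoded as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Undo it by reversing the wrong step, then decoding correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repaired == original)   # True
```

ISO-8859-1 maps every byte to a character, so this round trip is lossless. If the wrong decoding was done with another codec (for example cp1252) or some characters were already replaced by �, the 'ignore' handlers silently drop whatever cannot be converted, which is probably why most of the Chinese characters are missing from the output shown above.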

Writing unicode data in python

I have an xlsx file that I need to convert to CSV; I used the openpyxl module along with unicodecsv for this. My problem is that when writing some files I get junk characters in the output. Details below.
One of my files contains the Unicode code point u'\xa0', which is NO-BREAK SPACE, but after conversion to CSV the file shows Â instead of a space. Printing the same data on the console using the Python GUI works perfectly, without any Â. What am I doing wrong here? Any help is appreciated.
Sample code:
import unicodecsv
from openpyxl import load_workbook
xlsx_file = load_workbook('testfile.xlsx', use_iterators=True)
with open('test_utf.csv', 'wb') as open_file:
    csv_file = unicodecsv.writer(open_file)
    sheet = xlsx_file.get_active_sheet()
    # Write every worksheet row to the CSV file
    for row in sheet.iter_rows():
        csv_file.writerow(cell.internal_value for cell in row)
P.S.: The type of the data written is Unicode.
Okay, so what is going on is that Excel likes to assume that you are using the currently configured code page. You have a few options:
1) Write your data in that code page. This requires, however, that you know which one your users will be using.
2) Load the CSV file using the "import data" menu option. If you are relying on your users to do this, don't. Most people will not be willing to do this.
3) Use a different program that accepts Unicode in CSV by default, such as LibreOffice.
4) Add a BOM to the beginning of the file to get Excel to recognise UTF-8. This may break in other programs.
Since this is for your personal use, if you are only ever going to use Excel, then adding a byte order mark at the beginning is probably the easiest solution.
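The stray Â can be reproduced directly. NO-BREAK SPACE takes two bytes in UTF-8, and a reader that assumes a single-byte code page shows each byte as its own character. The sketch below assumes the configured code page is Windows-1252, which is not stated in the question.

```python
nbsp = "\u00a0"                          # NO-BREAK SPACE

utf8_bytes = nbsp.encode("utf-8")        # two bytes: 0xC2 0xA0
misread = utf8_bytes.decode("cp1252")    # what a cp1252 reader sees (assumed code page)

print(utf8_bytes)        # b'\xc2\xa0'
print(repr(misread))     # 'Â\xa0'
```

So the CSV itself is valid UTF-8; the Â appears only because the file is read with the wrong encoding.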
Microsoft likes byte-order marks in its text files. Even though a BOM doesn't make sense with UTF-8, it is used as a signature to let Excel know the file is encoded in UTF-8.
Make sure to generate your .csv as UTF-8 with BOM. I created the following using Notepad++:
English,Chinese
American,美国人
Chinese,中国人
The result saved with BOM:
The result without BOM:
