Replacing special characters not working in Pandas - python

I am trying to remove special characters from a string, but when I export the Pandas dataframe as a CSV, I can still see the special characters.
Does anyone know why that is?
Current Code:
document = json.dumps(jfile,default=str)
document2 = re.sub("[“â£$€™]", '', document)
document2 = json.loads(document2)
document2.to_csv("test.csv", index = False)
Output (special character still found in CSV file):

This looks like a pandas encoding problem. Try reading/loading your file with the appropriate encoding.
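One way to act on this advice is to check which encoding the raw bytes decode under before passing a name to pandas. The sketch below is only an illustration: `first_decodable_encoding` and its candidate list are hypothetical, not part of pandas, and a successful decode does not prove the guess is right (Latin-1 accepts any byte sequence, so it goes last).

```python
def first_decodable_encoding(raw, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes as they might appear in a file saved by a Windows tool (assumed sample).
sample = "café €".encode("cp1252")
print(first_decodable_encoding(sample))
```

The returned name could then be passed as the `encoding` argument of `pd.read_csv`.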

Related

How to remove special character from a csv file in python?

Hi, I am trying to remove special characters from a CSV file but am not getting a satisfactory result. Could you please help me with how to do this?
example:
ÃœþÑÂúòð
Óþрþô áðýúт-ßõтõрñурó
These are the kind of special characters I am getting.
I am saving the file using the Python code below:
df = pd.read_csv(r"D:\Users\SPate233\Documents\cleanData-JnJv2.csv", low_memory=False)
df.to_csv(r"D:\Users\SPate233\Documents\cleanData-JnJv2_new.csv", encoding='utf-8-sig', index=False)
I am not sure, but you can try the code snippet given below.
Basically, I have built a DataFrame from your data. When loading a CSV with special characters, it is important to specify the encoding. I have used ISO-8859-1 here because it is a single-byte encoding, and Python's ISO-8859-1 codec maps every byte value from 0 to 255 to a character, so the file can be read without decoding errors.
# Import all the important Libraries
import pandas as pd
# Read 'Data'
df = pd.read_csv('temp_data.csv', encoding = "ISO-8859-1")
# Print a few records of data with special characters
df
# Output of Above Cell:-
Data
0 ÃœþÑÂúòð
1 Óþрþô áðýúт-ßõтõрñурó
After reading the DataFrame, we can move on to removing the special characters. The code for that is below:
# Removal of Special Characters
df['Data'] = df['Data'].map(str).apply(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
# Print Cleaned data
df
# Output of Above Cell:-
Data
0
1 -
As you can see, all special characters have been removed. We can now store this result as CSV:
# Store clean data into 'CSV' Format
df.to_csv(r'cleaned_temp_data.csv', encoding = 'utf-8-sig', index = False)
Hope this solution helps you.
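For reference, the encode/decode chain used above can be tried on a single string. This is a minimal sketch with a made-up sample, and `strip_non_ascii` is a hypothetical name. It shows that the approach deletes every non-ASCII character rather than repairing it, which is why row 0 comes out empty.

```python
def strip_non_ascii(text):
    """Drop every character outside ASCII (hypothetical helper name)."""
    # Non-ASCII characters become multi-byte UTF-8 sequences, and the
    # 'ignore' error handler silently discards those bytes when decoding as ASCII.
    return text.encode("utf-8").decode("ascii", "ignore")


print(strip_non_ascii("Ünïcode-text"))  # non-ASCII letters are removed, not replaced
```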

Python/Pandas : how to read a csv in cp1252 with a first row to delete?

Solution:
See the answer: the file was not encoded in CP1252 but in UTF-16. The solution code is:
import pandas as pd
df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')
It also works with encoding='utf-16-le'.
Update: output of the first 3 lines in bytes:
In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))
Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
I'm working with CSV files whose raw form is:
The problem is that the file has two features that cause trouble together:
the first row is not the header
there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252
I'm using Python 3.X and pandas to deal with these files.
But when I try to read it with this code :
import pandas as pd
df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)
I get the following output (same with header=0):
In order to read the csv correctly, I need to :
get rid of the accent
and ignore / delete the first row (which I don't need anyway).
How can I achieve that ?
PS : I know I could make a VBA program or something for this, but I'd
rather not. I'm interested in including it in my Python program, or in
knowing for sure that it is not possible.
CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.
The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.
Neither UTF8 nor CP1252 would produce NaNs either. Any single-byte codepage would at least read the numeric digits correctly, which means the file is saved in a multi-byte encoding.
The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.
Try using utf-16 or utf-16-le instead of cp1252
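The BOM check described above can also be done in code instead of by eye. The following is a minimal sketch using the BOM constants from the standard `codecs` module; `sniff_bom` and the `_BOMS` table are hypothetical names, and a file without a BOM simply yields `None`.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(head):
    """Return a codec name suggested by a leading BOM in `head`, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


# First bytes of the file shown in the question's update.
print(sniff_bom(b'\xff\xfe"\x00D\x00u\x00'))
```

In practice `head` would be the first few bytes of the file, read in binary mode.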

Save pandas dataframe containing Chinese character to file

I have a pandas dataframe, where some fields contain Chinese character. I use the below code:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv')
Then I use Excel or a text editor to open saved.csv. All Chinese characters become junk characters. However, I am able to load the saved file and display the Chinese properly as follows.
df = pd.read_csv('saved.csv')
df.head() # Chinese characters are properly displayed.
Does anyone know how to solve the problem?
Try the following:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv', encoding='utf_8_sig')
This worked for me when plain utf-8 failed.
The problem is with the encoding Excel assumes when opening the file.
To resolve the issue, I first open the CSV in Sublime Text and then choose File -> Save with Encoding -> UTF-8 with BOM (Byte Order Mark).
Now Excel is able to open the CSV without any problems!
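Both answers above end up doing the same thing: they put a UTF-8 byte order mark at the start of the file, which appears to be what Excel relies on to recognise UTF-8. A minimal sketch with an assumed sample string shows what the 'utf-8-sig' codec adds:

```python
import codecs

text = "中文,数据\n"

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# 'utf-8-sig' prepends the three-byte BOM and otherwise matches plain UTF-8.
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom[len(codecs.BOM_UTF8):] == plain)

# Decoding with 'utf-8-sig' strips the BOM again, so the text round-trips.
print(with_bom.decode("utf-8-sig") == text)
```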

Writing non-ASCII characters to file in pandas python

I have a list like this:
a = ['olivsortiment utan kärna', 'perunajauho', 'chili extrakt','Keitetty herkullisista äyriäisistä', 'SOIJAKASTIKEJAUHE', 'Rypsisiemenöljy']
So there are non-ASCII characters in the elements of the list, including ö, ä, Ö, and Ä. I am using a pandas DataFrame to write this to a CSV using the following code.
frame = pd.DataFrame(a)
frame.to_csv('path', sep=',', encoding = 'utf-8')
It writes the dataframe, but the non-ASCII characters are not written correctly; they show up as weird characters. I am already encoding it as 'utf-8', but it still does not come out correctly.
The first element a[0], for example, is written in the CSV file as olivsortiment utan kÃ¤rna. So it replaces ä with Ã¤. Thanks in advance for the help.
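A likely explanation, offered as a sketch rather than a confirmed diagnosis: the file is valid UTF-8, but the program used to view it decodes the bytes with a single-byte code page such as cp1252 (an assumption about the viewer). The snippet reproduces that kind of substitution for the first list element:

```python
original = "olivsortiment utan kärna"

# What to_csv(encoding='utf-8') puts on disk for this text.
utf8_bytes = original.encode("utf-8")

# What a viewer shows if it assumes cp1252 instead of UTF-8.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Reading the same bytes as UTF-8 gives the text back unchanged.
print(utf8_bytes.decode("utf-8") == original)
```

If this is the cause, the data on disk is fine and only the viewer's encoding setting needs to change.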

How to display Chinese characters inside a pandas dataframe?

I can read a CSV file in which there is a column containing Chinese characters (the other columns are English and numbers). However, the Chinese characters don't display correctly.
I loaded the csv file with pd.read_csv().
Neither display(data06_16) nor data06_16.head() displays the Chinese characters correctly.
I tried to add the following lines into my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
Also I have tried to add encoding arg to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
None of these work.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
Try this
df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
I see here three possible issues:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another possibility can be theoretically this:
import pandas as pd
df = pd.DataFrame(pd.read_csv('testdata.csv',encoding='utf-8'))
3) Maybe you should convert your CSV file to UTF-8 before importing it with Python (for example in Notepad++)? That can be a solution for a one-time import, though of course not for an automated process.
A non-Python-related answer. I ran into this problem this afternoon and found that Excel's import-from-CSV dialog lists many encoding names. We can play with the encodings there and see which one fits our needs. For instance, I found that in Excel both gb2312 and gb18030 convert the data nicely from CSV to XLSX, but only gb18030 works in Python.
pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')
Anyway, this is not about how to import csv in Python, but rather to find the available encodings to try.
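One possible reason for the difference, as a sketch: GB18030 is a superset of GB2312, so text containing characters outside the older standard fails under Python's strict `gb2312` codec but works under `gb18030`. The sample strings and the `can_encode` helper are made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


common = "中文"      # within GB2312
extended = "中文€"   # the euro sign is not part of GB2312

for sample in (common, extended):
    print(sample, can_encode(sample, "gb2312"), can_encode(sample, "gb18030"))

# Bytes written as GB2312 still decode correctly with the gb18030 codec.
print(common.encode("gb2312").decode("gb18030") == common)
```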
Suppose you load a dataset and it contains some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I figure that whoever sent me the data encoded it in UTF-8, but it was then read as 'ISO-8859-1'.
So as a first step I encode the string with ISO-8859-1, then I decode it with UTF-8.
My lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I don't think I fully understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the string is in the garbled format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is unnecessary. I just use a lambda on my column to encode and decode regardless of format. So I changed the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
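The output above lost most of the Chinese characters because 'ignore' drops whatever fails to convert. A stricter variant is sketched here, under the assumption that the damage is UTF-8 bytes decoded as Latin-1. `fix_mojibake` is a hypothetical helper: it uses the default strict error handling and returns the input unchanged when the round trip fails, which also covers entries that were never garbled.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a `right`-encoded string that was decoded as `wrong`.

    Returns the input unchanged when the strict round trip fails, which is
    the case for text that was never garbled. Hypothetical helper.
    """
    try:
        return text.encode(wrong).decode(right)
    except UnicodeError:
        return text


garbled = "戴森 Dyson".encode("utf-8").decode("latin-1")  # simulate the damage
print(fix_mojibake(garbled))
print(fix_mojibake("戴森 Dyson"))     # already correct: left alone
print(fix_mojibake("Dyson Airwrap"))  # plain ASCII: left alone
```

Assuming the column holds only strings, it could be applied with `df[col].apply(fix_mojibake)`. Strings containing characters that Latin-1 cannot encode (such as € or the replacement character �) are returned unchanged by this route rather than partially stripped.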
