Hi, I am trying to remove special characters from a csv file, but I am not getting a satisfactory result. Could you please help me with how to do this?
example:
ÃœþÑÂúòð
Óþрþô áðýúт-ßõтõрñурó
These are the kinds of special characters I am getting.
I am saving the file using the Python code below:
df = pd.read_csv(r"D:\Users\SPate233\Documents\cleanData-JnJv2.csv", low_memory=False)
df.to_csv(r"D:\Users\SPate233\Documents\cleanData-JnJv2_new.csv", encoding='utf-8-sig', index=False)
I am not sure, but you can try the code snippet given below.
Basically, I built a DataFrame from your data. When loading a CSV that contains special characters, it is important to specify the encoding, so I have used ISO-8859-1. ISO-8859-1 is a single-byte encoding (one member of the ISO-8859 family) whose upper range (code points 160 to 255) covers the accented letters of Western European alphabets.
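As a side illustration of why ISO-8859-1 never raises a decode error (a minimal sketch, not part of the original answer): every one of the 256 possible byte values maps to exactly one character, so any file can be decoded with it, even if the result is mojibake.
# Every byte 0x00-0xFF decodes to exactly one character under ISO-8859-1
raw = bytes(range(256))
text = raw.decode('ISO-8859-1')  # always succeeds, even on binary junk
assert len(text) == 256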
# Import all the important Libraries
import pandas as pd
# Read 'Data'
df = pd.read_csv('temp_data.csv', encoding = "ISO-8859-1")
# Print a few records of data with special characters
df
# Output of Above Cell:-
Data
0 ÃœþÑÂúòð
1 Óþрþô áðýúт-ßõтõрñурó
After reading the DataFrame, we can move on to removing the special characters. The code for that is stated below:
# Removal of special characters: keep only the ASCII part of each value
df['Data'] = df['Data'].astype(str).apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
# Print Cleaned data
df
# Output of Above Cell:-
Data
0
1 -
As you can see, all the special characters have been removed (in this sample nearly every character was non-ASCII, so very little text remains). We can now store the result as a CSV:
# Store clean data into 'CSV' Format
df.to_csv(r'cleaned_temp_data.csv', encoding = 'utf-8-sig', index = False)
Hope this solution helps you.
Related
I am trying to remove special characters from a string, but when I export the Pandas dataframe as a CSV, I can still see the special characters.
Does anyone know why that is?
Current Code:
import json
import re
import pandas as pd

document = json.dumps(jfile, default=str)  # jfile holds the loaded JSON data
document2 = re.sub("[“â£$€™]", '', document)
document2 = json.loads(document2)
pd.DataFrame(document2).to_csv("test.csv", index=False)  # a plain dict/list has no to_csv, so wrap it in a DataFrame
Output: the special characters can still be found in the CSV file.
This seems to be a pandas encoding problem. Try to read/load your file with the appropriate encoding.
Solution:
See the answer; the file was not encoded in CP1252 but in UTF-16. The solution code is:
import pandas as pd
df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')
Also works with encoding='utf-16-le'
Update: output of the first 3 lines in bytes:
In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))
Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
I'm working with csv files whose raw form is shown in the byte dump above.
The problem is that it has two features that together raise a problem:
the first row is not the header
there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252
I'm using Python 3.X and pandas to deal with these files.
But when I try to read it with this code :
import pandas as pd
df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)
I get output full of NaNs (same with header=0).
In order to read the csv correctly, I need to:
get rid of the accent
and ignore/delete the first row (which I don't need anyway).
How can I achieve that?
PS: I know I could write a VBA program or something for this, but I'd rather not. I'm interested in including it in my Python program, or in knowing for sure that it is not possible.
CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.
The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.
Neither UTF8 nor CP1252 would produce NANs either. Any single-byte codepage would read the numeric digits at least, which means the file is saved in a multi-byte encoding.
The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.
Try using utf-16 or utf-16-le instead of cp1252
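A minimal sketch of checking for the BOM programmatically (the file name comes from the question; the cp1252 fallback is just an assumption):
import pandas as pd

# Peek at the first two bytes to check for a UTF-16 BOM
with open('file_T.csv', 'rb') as f:
    head = f.read(2)

if head in (b'\xff\xfe', b'\xfe\xff'):
    enc = 'utf-16'  # the utf-16 codec detects and strips the BOM itself
else:
    enc = 'cp1252'  # fallback guess for a single-byte Latin file

df = pd.read_csv('file_T.csv', sep='\t', header=1, encoding=enc)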
I have a pandas dataframe, where some fields contain Chinese character. I use the below code:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv')
Then I use Excel or a text editor to open saved.csv. All the Chinese characters become junk characters. However, I am able to load the saved file back and display the Chinese properly, as follows.
df = pd.read_csv('saved.csv')
df.head() # Chinese characters are properly displayed.
Does anyone know how to solve the problem?
Try the following:
df = pd.read_csv('original.csv', encoding='utf-8')
df.to_csv('saved.csv', encoding='utf_8_sig')
It works for me when plain utf-8 failed.
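The only difference is a three-byte BOM at the front of the file, which Excel uses to recognise UTF-8 (a quick illustration, not from the original answer):
# utf_8_sig prepends the UTF-8 BOM (EF BB BF); plain utf-8 does not
print("中文".encode('utf_8_sig'))  # b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'
print("中文".encode('utf-8'))      # b'\xe4\xb8\xad\xe6\x96\x87'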
The problem is with how Excel detects the encoding.
To resolve the issue, I first open the csv in Sublime Text and then use File -> Save with Encoding -> UTF-8 with BOM (Byte Order Mark).
Now Excel is able to open the csv without any problems!
I can read a csv file in which one column contains Chinese characters (the other columns are English and numbers). However, the Chinese characters don't display correctly; they appear garbled.
I loaded the csv file with pd.read_csv().
Neither display(data06_16) nor data06_16.head() will display the Chinese characters correctly.
I tried to add the following lines into my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
Also I have tried to add encoding arg to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
None of these worked at all.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
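If you don't know the source encoding up front, one option is to try a few likely candidates in turn (a hedged sketch; the file name and candidate list are just for illustration, and gb18030 is a superset of GBK):
import pandas as pd

# Try candidate encodings until one decodes without errors
for enc in ('utf-8', 'gbk', 'gb18030', 'big5'):
    try:
        df = pd.read_csv('stocks.csv', encoding=enc)
        print('decoded with', enc)
        break
    except (UnicodeDecodeError, UnicodeError):
        continue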
Try this
df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
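If guessing fails, the third-party chardet package can estimate an unknown encoding from a sample of raw bytes (a hedged sketch; the file name is an assumption):
import chardet

# Let chardet guess the encoding from the first ~100 KB of raw bytes
with open('data.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))
print(guess)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}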
I see three possible solutions here:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another possibility is this:
import pandas as pd
df = pd.read_csv('testdata.csv', encoding='utf-8')  # read_csv already returns a DataFrame, no extra wrapping needed
3) Maybe you should convert your csv file to utf-8 before importing it with Python (for example in Notepad++)? That can work for a one-time import, though not for an automated process, of course.
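The same one-time conversion can also be done from Python (a minimal sketch; the source encoding cp1252 and the file names are assumptions):
# Re-encode a file to UTF-8, assuming the source is cp1252
with open('testdata.csv', 'r', encoding='cp1252') as src, \
     open('testdata_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(src.read())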
A non-Python-related answer. I just ran into this problem this afternoon and found that using Excel to import data from CSV shows lots of encoding names. We can play with the encodings there and see which one fits our need. For instance, I found that in Excel both gb2312 and gb18030 convert the data nicely from csv to xlsx, but only gb18030 works in Python.
pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')
Anyway, this is not about how to import csv in Python, but rather about finding available encodings to try.
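Python's own table of registered codec aliases can be inspected too, which helps when matching names seen in Excel's import dialog (a minimal sketch; the exact list depends on the Python build):
from encodings.aliases import aliases

# Collect the codec names Python registers for the GB* family
print(sorted({codec for codec in aliases.values() if codec.startswith('gb')}))
# e.g. ['gb18030', 'gb2312', 'gbk']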
You load a dataset and you get some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese, so I can figure that whoever sent me the data encoded it in utf-8, but it was wrongly read as 'ISO-8859-1' at some point.
So as a first step, I encode the string back to ISO-8859-1, then decode it with utf-8.
So my lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I guess I don't really understand what happens under the hood, so feel free to tell me if you have further information.
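For what it's worth, the mechanism seems to be the classic mojibake round trip (a minimal sketch simulating it; the sample string is just for illustration): UTF-8 bytes get wrongly decoded as ISO-8859-1, producing the strange characters; re-encoding as ISO-8859-1 restores the original bytes, which then decode correctly as UTF-8.
# Simulate the bug, then undo it
original = '戴森'                                         # genuine Chinese text
garbled = original.encode('utf-8').decode('ISO-8859-1')   # how the mojibake arises
repaired = garbled.encode('ISO-8859-1').decode('utf-8')   # the fix used above
assert repaired == original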
Bonus: I'll try to detect when the string is in the strange format, because some of my entries are Chinese while others are English.
EDIT: The bonus step turned out to be unnecessary. I just apply a lambda to my column to encode and decode, without caring about the format. So I change the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
I am currently inserting data into my Django models using a csv file. Below is a simple save function that I am using:
def save(self):
    myfile = open('file.csv')  # the csv file to import
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    i = 0
    for row in data:
        if i == 0:
            i = i + 1
            continue  # skipping the header row
        b = MyModel()
        b.create_from_csv_row(row)  # calls a method to save in models
The function works perfectly with ASCII characters. However, if the csv file has some non-ASCII characters, an error is raised: UnicodeDecodeError
'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)
My question is: how can I remove non-ASCII characters before saving my csv file, to avoid this error?
Thanks in advance.
If you really want to strip it, try:
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
* WARNING THIS WILL MODIFY YOUR DATA *
It attempts to find a close match, e.g. ć -> c
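A quick demonstration of that lossy normalisation (a minimal sketch, not from the original answer):
import unicodedata

# NFKD decomposes accented letters; the combining marks are then dropped
print(unicodedata.normalize('NFKD', 'ćafé').encode('ascii', 'ignore'))
# b'cafe'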
Perhaps a better answer is to use unicodecsv instead.
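A hedged sketch of the unicodecsv route (a third-party, Python 2-era package; the file name and dialect options mirror the question):
import unicodecsv

with open('file.csv', 'rb') as f:
    reader = unicodecsv.reader(f, encoding='utf-8', delimiter=',', quotechar='"')
    for row in reader:
        pass  # each cell arrives already decoded to unicode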
----- EDIT -----
Okay, if you don't care whether those characters are represented at all, try the following:
# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))
If row is a collection, not a unicode string, you will need to iterate over the collection to the string level to re-serialize it.
If you want to remove non-ASCII characters from your data, iterate through it character by character and keep only the ASCII ones:
for ch in data:
    if ord(ch) < 128:  # 0 - 127 is ASCII
        ...  # append/write/print the character, whatever you need
Note that ord() only works on single characters, so you need to iterate at the string level.
If you want to convert unicode characters to ascii, then the response above by DivinusVox is accurate.
Pandas csv parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:
import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')
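As a final aside, newer pandas versions (1.3+, if memory serves; treat this as an assumption and check your version) also accept an encoding_errors parameter, so undecodable bytes are replaced instead of raising an error:
import pandas

# Replace undecodable bytes with U+FFFD instead of raising UnicodeDecodeError
data = pandas.read_csv(myfile, encoding='utf-8', encoding_errors='replace')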