Removing non-ascii characters in a csv file - python

I am currently inserting data into my Django models from a csv file. Below is a simple save function that I am using:
import csv

def save(self):
    myfile = open('file.csv')  # the source csv file
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    i = 0
    for row in data:
        if i == 0:
            i = i + 1
            continue  # skipping the header row
        b = MyModel()
        b.create_from_csv_row(row)  # calls a method to save in models
The function works perfectly with ASCII characters. However, if the csv file contains some non-ASCII characters, an error is raised:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)
My question is: how can I remove non-ASCII characters before saving my csv file, to avoid this error?
Thanks in advance.

If you really want to strip it, try:
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
* WARNING: THIS WILL MODIFY YOUR DATA *
It attempts to find a close ASCII match for each character, e.g. ć -> c
Perhaps a better answer is to use unicodecsv instead.
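A minimal sketch of the unicodecsv approach (assuming the file is UTF-8 encoded; the filename and delimiter are illustrative):
import unicodecsv

with open('file.csv', 'rb') as f:
    reader = unicodecsv.reader(f, encoding='utf-8')
    for row in reader:
        print row  # each cell is a unicode string, non-ASCII intact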
----- EDIT -----
Okay, if you don't care whether the non-ASCII data is preserved at all, try the following:
# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))
If row is a collection, not a unicode string, you will need to iterate over the collection to the string level to re-serialize it.
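For example, since csv.reader yields each row as a list of cells, a minimal sketch of that re-serialization (assuming each cell has already been decoded to a unicode string; clean_row is a hypothetical name):
# assumes each cell is a unicode string; non-ASCII characters are silently dropped
clean_row = [cell.encode('ascii', 'ignore') for cell in row]
b.create_from_csv_row(clean_row)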

If you want to remove non-ASCII characters from your data, then iterate through your data and keep only the ASCII characters.
for item in data:
    if ord(item) < 128:  # 0-127 is the ASCII range
        pass  # append, write, print, whatever you need here
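As a one-line variant of the same idea for a whole string (a sketch; text is a placeholder for your data):
ascii_only = ''.join(c for c in text if ord(c) < 128)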
If you want to convert unicode characters to ascii, then the response above by DivinusVox is accurate.

The pandas CSV parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:
import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')

Related

Python adding extra text and braces while reading from CSV file

I wanted to read data from a csv file using python, but after using the following code there are some extra characters and braces in the text which are not in the original data.
Please help me remove them.
import csv
with open("data.csv",encoding="utf8") as csvDataFile:
csvReader = csv.reader(csvDataFile)
for row in csvReader:
print(row)
What is displayed after reading is: ['\ufeffwww.aslteramo.it']
This is utf-8 encoding with a Byte Order Mark (BOM), which is used as a signature on Windows.
Open the file using the utf-8-sig encoding instead of utf8:
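A minimal sketch of the change (only the encoding argument differs from the question's code):
import csv

with open("data.csv", encoding="utf-8-sig") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print(row)  # the BOM is consumed by the codec, so no \ufeff appears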
\ufeff is a UTF-8 BOM (also known as 'ZERO WIDTH NO-BREAK SPACE' character).
It's sometimes used to indicate that the file is in UTF-8 format.
You could use str.replace('\ufeff', '') in your code to get rid of it. Like this:
import csv
with open("data.csv",encoding="utf8") as csvDataFile:
csvReader = csv.reader(csvDataFile)
for row in csvReader:
print([col.replace('\ufeff', '') for col in row])
Another solution is to open the file with 'utf-8-sig' encoding instead of 'utf-8' encoding.
By the way, the braces are added because row is a list. If your CSV file only has one column, you could select the first item from each row like this:
print(row[0].replace('\ufeff', ''))

How to handle unusual characters in a csv read operation

I have a very large Excel spreadsheet where the data needs to be pivoted into columns etc., and there are no problems except that it is full of some very odd characters from French and Portuguese. The pandas data frame copes without any problems until I need to output the data as a csv file:
dframe.to_csv(csvf, header=False)
sourceFile = open(csvf, 'r')
csvdata = sourceFile.read()
with open('output.csv', 'a') as destFile:
    destFile.write(csvdata)
sourceFile.close()
This produces an error at line 3:
File "C:\Program Files (x86)\python\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3184: character maps to <undefined>
This I tracked to a field in the data containing TTD ChÔteauneuf-du-Pape 75cl - specifically it is this character ” which is, according to a Unicode identifier, "U+201D : RIGHT DOUBLE QUOTATION MARK {double comma quotation mark}"
I have tried to remove this in the data frame itself with:
df[self.data_columns[i]] = df[self.data_columns[i]].str.replace('”',' ')
But it didn't seem to do anything. NB: a similar line successfully strips out unwanted commas.
Is there any way of handling the read() problem so it can cope with any character? What I find odd is that the rest of the system, e.g. loading the data into a data frame and manipulating the data, is all fine.
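One likely fix, sketched under the assumption that the intermediate csv should simply round-trip as UTF-8 (the traceback shows Windows' default cp1252 codec being used instead), is to pass the encoding explicitly at every step:
dframe.to_csv(csvf, header=False, encoding='utf-8')  # write the csv as UTF-8
sourceFile = open(csvf, 'r', encoding='utf-8')       # read it back with the same codec
csvdata = sourceFile.read()
with open('output.csv', 'a', encoding='utf-8') as destFile:
    destFile.write(csvdata)
sourceFile.close()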

Recover UTF-8 encoding in string

I'm working on my python script to extract multiple strings from a .csv file but I can't recover the Spanish characters (like á, é, í) after I open the file and read the lines.
This is my code so far:
import csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        print row[0]
        list_text.extend(row[0])
print list_text
And I get something like this:
'Vivió el sueño, ESPAÑOL...' ['Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...']
I don't know why it prints in the correct form, but when I append it to the list it is not correct.
Edited:
The problem is that I need to recover the characters because, after I read the file, the list has thousands of words in it. I don't need to print it; I need to use regex to get rid of the punctuation, but the regex also deletes the backslashes and leaves the words incomplete.
The python 2.x csv module doesn't support unicode and you did the right thing by opening the file in binary mode and parsing the utf-8 encoded strings instead of decoded unicode strings. Python 2 is kinda strange in that the str type (as opposed to the unicode type) holds either string or binary data. You got 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...' which is the binary utf-8 encoding of the unicode.
We can decode it to get the unicode version...
>>> encoded_text = 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
>>> text = encoded_text.decode('utf-8')
>>> print repr(text)
u'Vivi\xf3 el sue\xf1o, ESPA\xd1OL...'
>>> print text
Vivió el sueño, ESPAÑOL...
...but wait a second, the encoded text prints the same
>>> print encoded_text
Vivió el sueño, ESPAÑOL...
what's up with that? That has everything to do with your display surface, which is a utf-8 encoded terminal. In the first case (print text), text is a unicode string, so python has to encode it before sending it to the terminal, which sees the utf-8 encoded version. In the second case it's just a regular string and python sends it without conversion... but it just so happens it was holding encoded text which the terminal decoded.
Finally, when a string is in a list, python prints its repr representation, not its str value, as in
>>> print repr(encoded_text)
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
To make things right, convert the cells in your rows to unicode after the csv module is done with them.
import csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        row = [cell.decode('utf-8') for cell in row]
        print row[0]
        list_text.extend(row[0])
print list_text
When you print a list it shows the escaped representation of each string, so that \n and other characters don't throw off the list display; if you print the string itself it will display properly:
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'.decode('utf-8')
Use unicodecsv instead of csv; csv doesn't support unicode well.
Open the file with codecs and 'utf-8'.
See the code below:
import unicodecsv as csv
import codecs
list_text = []
with codecs.open(file, 'rb', 'utf-8') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        print row[0]
        list_text.extend(row[0])
print list_text

How to correct the encoding while creating excel file from 'utf-8' data using python

I am trying to create an excel file using python from a list of dictionaries. Initially I was getting an error of improper encoding, so I decoded my data to 'utf-8' format. Now after the creation of the excel file, when I checked the values in each field, their format had been changed to text only. Below are the steps I used while performing this activity, with snippets of code.
1. I got an error of improper encoding while creating the excel file, as my data had some non-'ascii' values in it. Error snippet:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
2. To remove the error of improper encoding, I inserted a decode() call while reading my input csv file. Snippet of code while decoding to 'utf-8':
data = []
with open(datafile, "r") as f:
    header = f.readline().split(',')
    counter = 0
    for line in f:
        line = line.decode('utf-8')
        fields = line.split(',')
        entry = {}
        for i, value in enumerate(fields):
            entry[header[i].strip()] = value.strip()
        data.append(entry)
        counter += 1
return data
3. After inserting the decode() function, I created my excel file using the code below:
from xlsxwriter import Workbook  # assumed import; the snippet matches the XlsxWriter API

ordered_list = dicts[0].keys()
wb = Workbook("New File.xlsx")
ws = wb.add_worksheet("Unique")
first_row = 0
for header in ordered_list:
    col = ordered_list.index(header)
    ws.write(first_row, col, header)
row = 1
for trans in dicts:
    for _key, _value in trans.items():
        col = ordered_list.index(_key)
        ws.write(row, col, _value)
    row += 1  # move to the next row
wb.close()
But after the creation of the excel file, all the values in each field are coming out as text and not in their original format (some datetime values, decimal values etc.). How do I make sure the data format does not change from the format of the data I read from the input csv file?
When reading text files you should pass the encoding to open() so that it's automatically decoded for you.
Python 2.x:
import io

with io.open(datafile, "r", encoding="utf-8") as f:
Python 3.x:
with open(datafile, "r", encoding="utf-8") as f:
Now each line read will be a Unicode string.
As you're reading a CSV file, you may want to consider the csv module, which understands CSV dialects. It will automatically return dictionaries per row, keyed by the header. In Python 3, the built-in csv module handles this fine. In Python 2, the csv module is broken with non-ASCII; use https://pypi.python.org/pypi/unicodecsv/0.13.0 instead.
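A minimal sketch of the dictionary-per-row reading (Python 3 shown; datafile is the csv path from above):
import csv

with open(datafile, "r", encoding="utf-8") as f:
    for entry in csv.DictReader(f):  # keys come from the header row
        print(entry)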
Once you have clean Unicode strings, you can proceed to store the data.
The Excel format requires that you tell it what kind of data you're storing. If you put a timestamp string into a cell, it will think it's just a string. The same applies if you insert a string of an integer.
Therefore, you need to convert the value type in Python before adding to the workbook.
Convert a decimal string to a float:
my_decimal = float(row["height"])
ws.write(row,col,my_decimal)
Create a datetime field from a string. Assuming the string is "Jun 1 2005 1:33PM":
from datetime import datetime

date_object = datetime.strptime(my_date, '%b %d %Y %I:%M%p')
ws.write(row, col, date_object)
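One caveat, assuming the Workbook here is XlsxWriter (suggested by the add_worksheet() call): a datetime cell also needs a number format, or Excel displays its raw serial number. A minimal sketch:
from datetime import datetime
import xlsxwriter

wb = xlsxwriter.Workbook("New File.xlsx")
ws = wb.add_worksheet("Unique")

# without a num_format, the datetime is shown as Excel's raw serial number
date_format = wb.add_format({'num_format': 'yyyy-mm-dd hh:mm'})

date_object = datetime.strptime("Jun 1 2005 1:33PM", '%b %d %Y %I:%M%p')
ws.write_datetime(0, 0, date_object, date_format)
wb.close()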

UnicodeDecodeError: save to file in python

I want to read a file, find something in it and save the result, but when I try to save it, it gives me an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io

with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv
with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
