Python adding extra text and braces while reading from CSV file

I wanted to read data from a CSV file using Python, but after running the following code there are some extra characters and braces in the output that are not in the original data.
Please help me remove them.
import csv
with open("data.csv", encoding="utf8") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print(row)
What is displayed after reading is: ['\ufeffwww.aslteramo.it']

This is UTF-8 encoding with a byte order mark (BOM), which is used as a signature in Windows.
Open the file using the utf-8-sig encoding instead of utf8.
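A minimal sketch of that fix, assuming the same data.csv as in the question:

import csv

# utf-8-sig strips the BOM if present and is harmless if it isn't.
with open("data.csv", encoding="utf-8-sig") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print(row)  # ['www.aslteramo.it'] - no \ufeff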

\ufeff is a UTF-8 BOM (also known as 'ZERO WIDTH NO-BREAK SPACE' character).
It's sometimes used to indicate that the file is in UTF-8 format.
You could use str.replace('\ufeff', '') in your code to get rid of it. Like this:
import csv
with open("data.csv", encoding="utf8") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print([col.replace('\ufeff', '') for col in row])
Another solution is to open the file with 'utf-8-sig' encoding instead of 'utf-8' encoding.
By the way, the braces appear because row is a list. If your CSV file only has one column, you could select the first item from each row like this:
print(row[0].replace('\ufeff', ''))

Related

Should I use utf8 or utf-8-sig when opening a file to read in Python?

I have always used 'utf8' to read in a file:
with open(filename, 'r', encoding='utf8') as f, open(filename2, 'r', encoding='utf8') as f2:
    for line in f:
        line = line.strip()
        columns = line.split(' ')
    for line in f2:
        line = line.strip()
        columns = line.split(' ')
However, the code above introduced an additional '\ufeff' character at this line for 'f2':
columns = line.split(' ')
Now columns[0] contains this character, while 'line' doesn't. Why is that? When I switched to 'utf-8-sig', the problem was gone.
However, the first file 'f' and its 'columns' don't have this issue at all, even with encoding='utf8'. Both are plain text files.
So I have two questions:
I am using Python 3 and when reading a file, should I always use 'utf-8-sig' to be safe?
Why doesn't 'line' contain this additional character, but 'columns' contains it?
UTF-8-encoded files can be written with a signature indicating UTF-8. This signature is called the "byte order mark" (or BOM) and has the Unicode code point value U+FEFF. If a file containing a BOM is viewed in a hex editor, it will start with the hexadecimal bytes EF BB BF. When viewed in a text editor with a non-UTF-8 encoding, those bytes often appear as mojibake such as "" (for example under Windows-1252), but that depends on the encoding.
The 'utf-8-sig' codec can read UTF-8-encoded files written with and without the starting BOM signature and will remove it if present.
Use 'utf-8-sig' for writing a file only if you want a UTF-8 BOM written at the start of the file. Some (usually Windows) programs, such as Excel when reading text files, expect a BOM if the file contains UTF-8, and assume a localized encoding otherwise. Other programs may not expect a BOM and could read it as an extra character, so the choice is yours.
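For instance, a minimal sketch of opting in to the BOM so Excel detects UTF-8 (the filename and contents are illustrative):

# 'utf-8-sig' writes the bytes EF BB BF before the text,
# which Excel reads as a UTF-8 signature.
with open('report.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('name,city\nJosé,Teramo\n')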
So for your two questions:
I am using Python 3 and when reading a file, should I always use 'utf-8-sig' to be safe?
Yes, it will remove the BOM if present.
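A quick round-trip demo of that (the filename is illustrative):

with open('demo.txt', 'w', encoding='utf-8-sig') as f:
    f.write('abc')  # the codec prepends the BOM on write

with open('demo.txt', encoding='utf8') as f:
    print(repr(f.read()))  # '\ufeffabc' - the BOM leaks through
with open('demo.txt', encoding='utf-8-sig') as f:
    print(repr(f.read()))  # 'abc' - the BOM is stripped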
Why doesn't 'line' contain this additional character, but 'columns' contains it?
line.strip() doesn't remove \ufeff, so I can't reproduce your claim. If a file encoded as UTF-8 with a BOM is opened with utf8, the first character should be \ufeff. Are you using print to display the line? \ufeff is a zero-width character, so it's invisible when printed:
>>> line = '\ufeffabc'
>>> line
'\ufeffabc'
>>> print(line)
abc
>>> print(line.strip())
abc
>>> line.strip()
'\ufeffabc'

Recover UTF-8 encoding in string

I'm working on my python script to extract multiple strings from a .csv file but I can't recover the Spanish characters (like á, é, í) after I open the file and read the lines.
This is my code so far:
import csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        print row[0]
        list_text.extend(row[0])
print list_text
And I get something like this:
'Vivió el sueño, ESPAÑOL...' ['Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...']
I don't know why it prints in the correct form, but when I append it to the list it's not correct.
Edited:
The problem is that I need to recover the characters because, after I read the file, the list has thousands of words in it. I don't need to print it; I need to use a regex to get rid of the punctuation, but that also deletes the backslashes and leaves the words incomplete.
The Python 2.x csv module doesn't support Unicode, and you did the right thing by opening the file in binary mode and parsing the UTF-8 encoded strings instead of decoded unicode strings. Python 2 is kinda strange in that the str type (as opposed to the unicode type) holds either text or binary data. You got 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...', which is the binary UTF-8 encoding of the unicode text.
We can decode it to get the unicode version...
>>> encoded_text = 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
>>> text = encoded_text.decode('utf-8')
>>> print repr(text)
u'Vivi\xf3 el sue\xf1o, ESPA\xd1OL...'
>>> print text
Vivió el sueño, ESPAÑOL...
...but wait a second, the encoded text prints the same
>>> print encoded_text
Vivió el sueño, ESPAÑOL...
what's up with that? That has everything to do with your display surface, which is a UTF-8 encoded terminal. In the first case (print text), text is a unicode string, so Python has to encode it before sending it to the terminal, which sees the UTF-8 encoded version. In the second case it's just a regular string and Python sent it without conversion... but it just so happens it was holding encoded text, which the terminal decoded.
Finally, when a string is in a list, python prints its repr representation, not its str value, as in
>>> print repr(encoded_text)
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
To make things right, convert the cells in your rows to unicode after the csv module is done with them.
import csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        row = [cell.decode('utf-8') for cell in row]
        print row[0]
        list_text.extend(row[0])
print list_text
When you print a list it shows the escaped representation of each string, so that \n and other characters don't throw off the list display. If you print the string itself it will display properly:
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'.decode('utf-8')
Use unicodecsv instead of csv; the Python 2 csv module doesn't handle Unicode well.
Open the file with codecs and 'utf-8'.
See the code below:
import unicodecsv as csv
import codecs
list_text = []
with codecs.open(file, 'rb', 'utf-8') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        print row[0]
        list_text.extend(row[0])
print list_text

How to correct the encoding while creating excel file from 'utf-8' data using python

I am trying to create an excel file using Python from a list of dictionaries. Initially I was getting an error of improper encoding, so I decoded my data to 'utf-8' format. Now after the creation of the excel file, when I checked the values in each field, their format had been changed to text only. Below are the steps I used while performing this activity, with snippets of code.
1. I got an error of improper encoding while creating the excel file, as my data had some non-ASCII values in it. Error snippet:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
2. To remove the encoding error, I inserted a decode() call while reading my input CSV file. Snippet of code decoding to 'utf-8':
data = []
with open(datafile, "r") as f:
    header = f.readline().split(',')
    counter = 0
    for line in f:
        line = line.decode('utf-8')
        fields = line.split(',')
        entry = {}
        for i, value in enumerate(fields):
            entry[header[i].strip()] = value.strip()
        data.append(entry)
        counter += 1
return data
3. After inserting the decode() function, I created my excel file using the code below:
ordered_list = dicts[0].keys()
wb = Workbook("New File.xlsx")
ws = wb.add_worksheet("Unique")
first_row = 0
for header in ordered_list:
    col = ordered_list.index(header)
    ws.write(first_row, col, header)
row = 1
for trans in dicts:
    for _key, _value in trans.items():
        col = ordered_list.index(_key)
        ws.write(row, col, _value)
    row += 1  # enter the next row
wb.close()
But after creation of the excel file, all the values in each field come through as text and not in their original format (some datetime values, decimal values etc.). How do I make sure the data format does not change from the format of the input data I read from the CSV file?
When reading text files you should pass the encoding to open() so that it's automatically decoded for you.
Python 2.x:
with io.open(datafile, "r", encoding="utf-8") as f:
Python 3.x:
with open(datafile, "r", encoding="utf-8") as f:
Now each line read will be a Unicode string.
As you're reading a CSV file, you may want to consider the csv module, which understands CSV dialects. It will automatically return dictionaries per row, keyed by the header. In Python 3, it's just the built-in csv module. In Python 2, the csv module is broken with non-ASCII; use https://pypi.python.org/pypi/unicodecsv/0.13.0 instead.
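For example, a minimal Python 3 sketch with csv.DictReader, which does the header-keying for you (datafile as in the question; the column names in the comment are made up):

import csv

with open(datafile, "r", encoding="utf-8", newline="") as f:
    for entry in csv.DictReader(f):
        # entry is a dict keyed by the header row,
        # e.g. {'name': '...', 'height': '1.75'}
        print(entry)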
Once you have clean Unicode strings, you can proceed to store the data.
The Excel format requires that you tell it what kind of data you're storing. If you put a timestamp string into a cell, it will think it's just a string. The same applies if you insert a string of an integer.
Therefore, you need to convert the value type in Python before adding to the workbook.
Convert a decimal string to a float:
my_decimal = float(row["height"])
ws.write(row,col,my_decimal)
Create a datetime object from a string. Assuming the string is "Jun 1 2005 1:33PM":
from datetime import datetime
date_object = datetime.strptime(my_date, '%b %d %Y %I:%M%p')
ws.write(row,col,date_object)
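The Workbook/add_worksheet calls above look like XlsxWriter; if so, note that a datetime also needs a cell format to display as a date rather than as a raw serial number. A minimal sketch under that assumption:

from datetime import datetime
from xlsxwriter import Workbook

wb = Workbook("New File.xlsx")
ws = wb.add_worksheet("Unique")

# Without a num_format, Excel shows the datetime as a serial number.
date_fmt = wb.add_format({'num_format': 'yyyy-mm-dd hh:mm'})
date_object = datetime.strptime('Jun 1 2005 1:33PM', '%b %d %Y %I:%M%p')
ws.write_datetime(0, 0, date_object, date_fmt)
wb.close()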

Removing non-ascii characters in a csv file

I am currently inserting data into my Django models using a CSV file. Below is the simple save function that I am using:
def save(self):
    myfile = open('file.csv')
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    i = 0
    for row in data:
        if i == 0:
            i = i + 1
            continue  # skipping the header row
        b = MyModel()
        b.create_from_csv_row(row)  # calls a method to save in models
The function works perfectly with ASCII characters. However, if the CSV file has some non-ASCII characters then an error is raised: UnicodeDecodeError
'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)
My question is: how can I remove non-ASCII characters before saving my CSV file, to avoid this error?
Thanks in advance.
If you really want to strip it, try:
import unicodedata
# 'title' stands for whatever unicode string you want to transliterate
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
* WARNING: THIS WILL MODIFY YOUR DATA *
It attempts to find a close match - i.e. ć -> c
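For example (the input string here is made up):

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ESPAÑOL').encode('ascii', 'ignore')
'ESPANOL'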
Perhaps a better answer is to use unicodecsv instead.
----- EDIT -----
Okay, if you don't care that the data is represented at all, try the following:
# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))
If row is a collection rather than a single unicode string, you will need to iterate down to the string level to re-serialize it.
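A minimal sketch of that, reusing row, b, and create_from_csv_row from the question:

# Encode each cell separately; characters outside ASCII are
# silently dropped, so this is lossy by design.
clean_row = [cell.encode('ascii', 'ignore') for cell in row]
b.create_from_csv_row(clean_row)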
If you want to remove non-ascii characters from your data then iterate through your data and keep only the ascii.
for item in data:
    if ord(item) < 128:  # 0 - 127 is ASCII
        [append, write, print, whatever]
If you want to convert unicode characters to ascii, then the response above by DivinusVox is accurate.
Pandas csv parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:
import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')

The Python CSV writer is adding letters to the beginning of each element and issues with encode

So I'm trying to parse out JSON files into a tab delimited file. The parsing seems to work fine and all the data is coming through. Although the oddest thing is happening on the output file. I told it to use a tab delimiter and on the output it does use tabs, but it still seems to keep the single quotes. And for some reason it also seems to be adding the letter B to the beginning. I manually typed in the header, and that works fine, but the data itself is acting weird. Here's an example of the output I'm getting.
id created text screen name name latitude longitude place name place type
b'1234567890' b'Thu Mar 14 19:39:07 +0000 2013' "b""I'm at Bank Of America (Wayne, MI) http://t.co/asdf""" b'userid' b'username' 42.28286837 -83.38487864 b'Bank Of America, Wayne' b'poi'
b'1234567891' b'Thu Mar 14 19:39:16 +0000 2013' b'here is a sample tweet \xf0\x9f\x8f\x80 #notingoodhands' b'userid2' b'username2'
Here is the code that I'm using to write the data out.
out = open(filename, 'w')
out.write('id\tcreated\ttext\tscreen name\tname\tlatitude\tlongitude\tplace name\tplace type')
out.write('\n')
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)
from csv import writer
csv = writer(out, dialect='excel', delimiter = '\t')
for row in rows:
    values = [(value.encode('utf-8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)
out.close()
So here's the thing: if I did this without the utf-8 bit and just output it straight, the formatting would be exactly how I want it. But when people type in special characters, the program crashes and can't handle them.
Traceback (most recent call last):
  File "tweets.py", line 34, in <module>
    csv.writerow(values)
  File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3c0' in position 153: character maps to <undefined>
Adding the utf-8 bit converts it to the type of output you see here, but then it adds all these characters to the output. Does anyone have any thoughts on this?
You are writing byte data instead of unicode to your files, because you are encoding the data yourself.
Remove the encode calls altogether and let Python handle this for you; open the file with the UTF8 encoding and the rest takes care of itself:
out = open(filename, 'w', encoding='utf8')
This is documented in the csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
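Putting the writing side together, a minimal sketch with the question's variable names (rows is assumed to hold plain strings and numbers):

from csv import writer

# Text mode plus an explicit encoding: the csv writer receives
# ordinary strings and Python encodes them on the way out.
with open(filename, 'w', encoding='utf8', newline='') as out:
    tsv = writer(out, dialect='excel', delimiter='\t')
    tsv.writerow(['id', 'created', 'text', 'screen name', 'name',
                  'latitude', 'longitude', 'place name', 'place type'])
    for row in rows:
        tsv.writerow(row)  # no manual .encode() calls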
You've got multiple things going on here, but first, let's clear up a bit of confusion.
Encoding non-ASCII characters to UTF-8 means you get multiple bytes. For example, the character 🏀 is \xf0\x9f\x8f\x80 in UTF-8. But that's still just one character, it's just a character that takes four bytes. If you write the string to a binary file, then look at that file in a UTF-8-compatible tool (Notepad or TextEdit, or just cat on a UTF-8-friendly terminal/shell), you'll see one 🏀, not four garbage characters.
Second, b'abc' is not a string with b added to the beginning, it's the repr representation of the byte-string abc. The b is no more a part of the string than the quotes are.
Finally, in Python 3, you can't open a file in text mode and then write byte strings to it. Either open it in text mode, with an encoding, and write normal unicode strings, or open it in binary mode and write encoded byte strings.
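To illustrate that last point, a short sketch of the two valid combinations (filenames are illustrative):

# Text mode with an encoding: write str, Python encodes for you.
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write('🏀\n')

# Binary mode: encode yourself and write bytes.
with open('out.bin', 'wb') as f:
    f.write('🏀\n'.encode('utf-8'))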
