I have a very large Excel spreadsheet where the data need to be pivoted into columns, etc. There are no problems except that it is full of some very odd characters from French and Portuguese. The pandas data frame copes without any problems until I need to output the data as a CSV file:
dframe.to_csv(csvf, header=False)
sourceFile = open(csvf, 'r')
csvdata = sourceFile.read()
with open('output.csv', 'a') as destFile:
    destFile.write(csvdata)
sourceFile.close()
This produces an error at line 3 (the read() call):
File "C:\Program Files (x86)\python\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3184: character maps to undefined
I tracked this to a field in the data containing TTD Châteauneuf-du-Pape 75cl; specifically, it is this character ” which is, according to a Unicode identifier, "U+201D : RIGHT DOUBLE QUOTATION MARK {double comma quotation mark}".
I have tried to remove this in the data frame itself with:
df[self.data_columns[i]] = df[self.data_columns[i]].str.replace('”',' ')
But it didn't seem to do anything. NB: a similar line successfully strips out unwanted commas.
Is there any way of handling the read() problem so it can cope with any character? What I find odd is that the rest of the system, e.g. loading the data into a data frame and manipulating it, works fine.
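A minimal sketch of one likely fix, assuming the intermediate CSV can be UTF-8 throughout: pass an explicit encoding to both to_csv and open, so nothing falls back to the platform-default cp1252 codec named in the traceback.

dframe.to_csv(csvf, header=False, encoding='utf-8')
# Read it back with the same encoding instead of the cp1252 default
with open(csvf, 'r', encoding='utf-8') as sourceFile:
    csvdata = sourceFile.read()
with open('output.csv', 'a', encoding='utf-8') as destFile:
    destFile.write(csvdata)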
I'm working on taking CSV files and putting them into a PostgreSQL database. For one of the files, though, every field is surrounded by quotes. (Looking at it in Excel it appears normal; in Notepad, though, one row looks like "Firstname","Lastname","CellNumber","HomeNumber", etc., when it should look like Firstname,Lastname,CellNumber,HomeNumber.) It broke when I tried to load it into SQL.
I tried loading the file into Python to do some data cleaning, but I'm getting an error. This is the code I'm running to load the file:
import pandas as pd
logics = pd.read_csv("test.csv")
and this is the error I'm getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 28682: invalid continuation byte
I tried reading it with utf-8 encoding explicitly, but that gave me a different error.
code:
import pandas as pd
logics = pd.read_csv("test.csv", encoding= 'utf-8')
error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 53, saw 14
For whatever reason, when I manually save the file in File Explorer as UTF-8 and then save it back again as a CSV file, it removes the quotation marks, but I need to automate this process. Is there any way I can use Python to remove these quotation marks? Is it just some different kind of encoding?
You can add more to this, and maybe pull some of the functionality out into a function called "clean_line". The code below should go through your CSV and remove all " characters from every line. There's no real need for the pandas overhead on this one; using the standard Python libraries should make it faster as well.
with open("test.csv",'r')as f:
lines = f.readlines()
with open("output.csv", 'w') as f:
output=[]
for line in lines:
output.append(line.replace('"',''))
f.writelines(output)
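One caveat: the original UnicodeDecodeError says the file is not valid UTF-8, so the plain open() above may fail the same way. A hedged variant of the read, guessing Latin-1 (where byte 0xe1 is 'á'):

with open("test.csv", 'r', encoding='latin-1') as f:  # assumption: the file is Latin-1/cp1252
    lines = f.readlines()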
I currently have data in a .csv file and am trying to move it to a temp.txt file. When I transfer the data, each line begins with b' and ends with \n', which I want to remove.
I previously had it working, but ran into issues with UTF-8 text, in that I'd get the error: UnicodeEncodeError: 'charmap' codec can't encode character '\u0336' in position 113: character maps to <undefined>
import sys

def data(file):
    for i in range(1000):
        print(file.readline().encode("utf-8"))

file = open(sys.argv[1], encoding="utf-8")
data(file)
Currently I'm getting this sort of result:
b'Datahere\n'
And I would prefer just getting:
Datahere
It is a bit of a hack, but you can just slice each line that you read with [1:-2]. This gets rid of the first character on each line ('b') and also the last two characters ('\n').
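Less hacky, arguably, is not to create the b'...' wrapper in the first place: if both files are opened in text mode with an explicit encoding, there is no need to call .encode() at all, and the original UnicodeEncodeError (which came from the console's charmap codec) is avoided too. A rough sketch, using the temp.txt target named in the question:

import sys

# Copy lines across as text; explicit utf-8 on both ends sidesteps the
# charmap codec that raised the original UnicodeEncodeError
with open(sys.argv[1], encoding="utf-8") as src, \
     open("temp.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)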
I have recently started my job as an ETL developer, and as part of an exercise I am extracting data from a text file containing raw data, as shown in the image below.
[Image: My Raw Data]
Now I want to add delimiters to my data file. Basically, after every line I want to add a comma (,). My code in Python looks like this:
with open('new_locations.txt', 'w') as output:
    with open('locations.txt', 'r') as input:
        for line in input:
            new_line = line + ','
            output.write(new_line)
where new_locations.txt is the output text file, locations.txt is the raw data.
However, it throws an error every time:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3724: character maps to <undefined>
Where exactly am I going wrong?
Note: the characters in the raw data are not all ASCII; some are Latin characters as well.
When you open a file in Python 3 in "text" mode, reading and writing convert the bytes in the file to Python (Unicode) strings. The default encoding is platform dependent; on Windows it is typically cp1252, which is exactly the codec named in your traceback.
If your file uses latin-1 encoding, you should open it with:
with open('locations.txt', 'r', encoding='latin_1') as input:
You should probably do the same with the output if you want it to be in latin-1 as well. In the longer term, you should consider converting all your data files to a single Unicode encoding such as UTF-8.
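Putting both halves together, a minimal sketch under the same latin-1 assumption (file names taken from the question; the comma is appended before the newline, on the guess that a trailing comma per line is what's wanted):

with open('locations.txt', 'r', encoding='latin_1') as input, \
     open('new_locations.txt', 'w', encoding='utf-8') as output:
    for line in input:
        output.write(line.rstrip('\n') + ',\n')  # comma at the end of each line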
When you write to a file, you need to encode the text before writing. If you google that, you will find a ton of results.
Here is how it can be done:
output.write(new_line.encode('utf-8'))  # or 'ascii'
You can also ask it to ignore characters that can't be converted, but that will lose characters and may not be the desired output. Here is how that is done:
output.write(new_line.encode('ascii', 'ignore'))  # or 'utf-8'
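One caution, assuming Python 3: .encode() returns bytes, so this only works if the output file was opened in binary mode; writing bytes to a text-mode file raises a TypeError. A sketch:

with open('new_locations.txt', 'wb') as output:  # binary mode so write() accepts bytes
    output.write(new_line.encode('utf-8'))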
I want to read a file, find something in it, and save the result, but when I try to save it, it gives me an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key] + ';' + src + alt + '\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io

with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv

with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you decode it to Unicode first as well; again, io.open() is probably the best option for reading those files, as it decodes the data to Unicode values for you as you read.
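A small sketch of that reading side, with a hypothetical input file name, so everything in the pipeline is Unicode from the start:

import io

with io.open('source.txt', 'r', encoding='utf8') as filein:  # 'source.txt' is hypothetical
    text = filein.read()  # already a unicode value, safe to mix with u'...' literals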
I am currently inserting data into my Django models using a CSV file. Below is the simple save function that I am using:
def save(self):
    myfile = open('file.csv')
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    i = 0
    for row in data:
        if i == 0:
            i = i + 1
            continue  # skipping the header row
        b = MyModel()
        b.create_from_csv_row(row)  # calls a method to save in models
The function works perfectly with ASCII characters. However, if the CSV file has some non-ASCII characters, an error is raised: UnicodeDecodeError
'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)
My question is: how can I remove non-ASCII characters before saving my CSV file, to avoid this error? Thanks in advance.
If you really want to strip it, try:
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
* WARNING THIS WILL MODIFY YOUR DATA *
It attempts to find a close match, e.g. ć -> c.
Perhaps a better answer is to use unicodecsv instead.
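For instance, a rough sketch with unicodecsv, assuming the file is UTF-8 (adjust the encoding to match your data):

import unicodecsv

with open('file.csv', 'rb') as f:  # 'file.csv' stands in for your actual file
    reader = unicodecsv.reader(f, encoding='utf-8', delimiter=',', quotechar='"')
    for row in reader:
        print(row)  # every cell already decoded to unicode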
----- EDIT -----
Okay, if you don't care whether the data is accurately represented at all, try the following:
# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))
If row is a collection, not a unicode string, you will need to iterate over the collection to the string level to re-serialize it.
If you want to remove non-ASCII characters from your data, then iterate through the data and keep only the ASCII.
for item in data:
    if ord(item) < 128:  # 0 - 127 is ASCII
        [append, write, print, whatever]
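Note that ord() works on single characters, so a concrete string-level version might look like this hedged one-liner, keeping only code points 0-127:

cleaned = ''.join(ch for ch in text if ord(ch) < 128)  # 'text' is an assumed unicode string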
If you want to convert unicode characters to ascii, then the response above by DivinusVox is accurate.
The pandas CSV parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:
import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')
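If UTF-8 itself fails, as with the 0x93 byte in the error above (which is a curly left quote in the Windows cp1252 code page), a plausible fallback is to name that code page explicitly:

import pandas

# cp1252 is a guess based on the 0x93 byte; verify against the file's real source
data = pandas.read_csv(myfile, encoding='cp1252', quotechar='"', delimiter=',')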