Encoding issue when reading CSV file with Python

I have hit a road block when trying to read a CSV file with python.
UPDATE:
If you just want to skip the offending characters, you can open the file like this:
with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
So far I have tried:
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print(row)
The error I am getting is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I have tried:
with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>
Now, if I just print data_file it says the files are cp1252 encoded, but if I try
with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I also tried the recommended package.
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
The line I am trying to parse is:
2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT #WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None
Any thoughts or help are appreciated.

I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly, e.g.
import csvkit
reader = csvkit.reader(data_file)
As discussed in the chat, the solution is:
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)
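For context, the UnicodeEncodeError here is raised by print(), which encodes output for a cp1252 Windows console, not by the CSV reading itself. A minimal sketch (with a made-up row containing the '\u2026' ellipsis from the traceback) of what the ASCII round trip in the solution actually does:

```python
# The cp1252 console codec cannot encode '\u2026' (the "…" ellipsis),
# so print() raises. Round-tripping through ASCII with errors='ignore'
# drops any character the console could choke on.
row = ["He was slightly over 8 feet 11 inches tall\u2026", "None"]
data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
print(data)
```

Note that this silently discards data; reconfiguring the console to UTF-8 would preserve it instead.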

Related

I keep getting a UnicodeDecodeError when trying to pull lines from multiple ascii files into a single file. How can I resolve this?

The files I am working with are .ASC files, each represents an analysis of a sample on a mass spectrometer. The same isotopes were measured in each sample and therefore the files have common headers. The goal of this code is to pull the lines of text which contain the common headers and the counts per second (cps) data from all of the .ASC files in a given folder and to compile it into a single file.
I have searched around and I believe my code is along the right lines, but I keep getting encoding errors. I have tried specifying the encoding wherever I call open(), using both ascii and utf-8 as the encoding type, but I still get errors.
Below are the error messages I received:
Without specifying encoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1010: character maps to <undefined>
ascii: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 148: ordinal not in range(128)
utf-8: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 148: invalid continuation byte
I am very inexperienced with coding so if you notice anything idiotic in the code, let me know.
filepath = open("insert_filepath_here")
output_lst = []

def process_file(filepath):
    interesting_keys = (
        'Li7(LR)',
        'Be9(LR)',
        'Na23(LR)',
        'Rb85(LR)',
        'Sr88(LR)',
        'Y89(LR)',
        'Zr90(LR)',
        'Nb93(LR)',
        'Mo95(LR)',
        'Cd111(LR)',
        'In115(LR)',
        'Sn118(LR)',
        'Cs13(LR)',
        'Ba137(LR)',
        'La139(LR)',
        'Ce140(LR)',
        'Pr141(LR)',
        'Nd146(LR)',
        'Sm147(LR)',
        'Eu153(LR)',
        'Gd157(LR)',
        'Tb159(LR)',
        'Dy163(LR)',
        'Ho165(LR)',
        'Er166(LR)',
        'Tm169(LR)',
        'Yb172(LR)',
        'Lu175(LR)',
        'Hf178(LR)',
        'Ta181(LR)',
        'W182(LR)',
        'Tl205(LR)',
        'Pb208(LR)',
        'Bi209(LR)',
        'Th232(LR)',
        'U238(LR)',
        'Mg24(MR)',
        'Al27(MR)',
        'Si28(MR)',
        'Ca44(MR)',
        'Sc45(MR)',
        'Ti47(MR)',
        'V51(MR)',
        'Cr52(MR)',
        'Mn55(MR)',
        'Fe56(MR)',
        'Co59(MR)',
        'Ni60(MR)',
        'Cu63(MR)',
        'Zn66(MR)',
        'Ga69(MR)',
        'K39(HR)'
    )
    with open(filepath) as fh:
        content = fh.readlines()
    for line in content:
        line = line.strip()
        if ":" in line:
            key, _ = line.split(":", 1)
            if key.strip() in interesting_keys:
                output_lst.append(line)

def write_output_data():
    if output_lst:
        with open(output_file, "w") as fh:
            fh.write("\n".join(output_lst))
        print("See", output_file)

def process_files():
    for filepath in os.listdir(input_dir):
        process_file(os.path.join(input_dir, filepath))
    write_output_data()

process_files()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to <undefined>
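Byte 0x81 is undefined in cp1252 (the Windows default "charmap" codec), whereas Latin-1 assigns every byte value a code point, so one possible fix is to pass encoding='latin-1' (or errors='replace') to open(). A self-contained sketch with a made-up .ASC line and file path:

```python
import os
import tempfile

# A made-up .ASC line containing the 0x81 byte from the traceback
raw = b"Li7(LR): 1234\x81\n"

path = os.path.join(tempfile.gettempdir(), "sample.asc")  # hypothetical file
with open(path, "wb") as fh:
    fh.write(raw)

# cp1252 would raise on 0x81; latin-1 maps every byte, so the read succeeds
with open(path, "r", encoding="latin-1") as fh:
    line = fh.readline().strip()
print(line)
```

If the files came from an instrument that writes a known encoding, using that exact codec is safer than latin-1, which never fails but may decode some bytes to the wrong characters.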

I am parsing a CSV file and I am getting the below error.
import os
import csv
from collections import defaultdict

demo_data = defaultdict(list)
if os.path.exists("infoed_daily _file.csv"):
    f = open("infoed_daily _file.csv", "rt")
    csv_reader = csv.DictReader(f)
    line_no = 0
    for line in csv_reader:
        line_no += 1
        print(line, line_no)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to <undefined>
Please advise.
Thanks..
-Prasanna.K
The error means your file is in an encoding different from the one open() uses by default ('charmap' suggests cp1252, the Windows default).
When I run
b'\x81'.decode('Latin1')
b'\x81'.decode('Latin2')
b'\x81'.decode('iso8859')
b'\x81'.decode('iso8859-2')
then it runs without error, so your file may be in one of these (or a similar) encodings, and you have to use it:
open(..., encoding='Latin1')
or similar.
List of other encodings: codecs: standard encodings
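Applied to the question's code, one possible fix (simulating the file bytes in memory so the sketch is self-contained) is to decode with Latin-1 before handing the text to csv.DictReader:

```python
import csv
import io

# Simulated file contents; 0x81 is the byte from the traceback
raw = b"name,value\nfoo\x81,1\n"

# For a real file you would instead use:
#   open("infoed_daily _file.csv", "rt", encoding="latin-1")
text = raw.decode("latin-1")

rows = list(csv.DictReader(io.StringIO(text)))
print(rows)
```

This is a guess at the file's encoding; if you know where the file came from, use that source's documented encoding instead.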
f=open("myfile1.txt",'r')
print(f.read())
Well, for the above code I got the error:
'charmap' codec can't decode byte 0x81 in position 637: character maps to <undefined>
so I tried changing the file's extension and it worked.
Happy Coding
Thanks!
Vani
You can use a with statement:
with open('filename.txt', 'w') as f:
    f.write(content)
The good thing is that it automatically closes the file after the work is done.

Python reading from file encoding problem

When I read some files like this:
list_of_files = glob.glob('./*.txt')  # create the list of files
for file_name in list_of_files:
    FI = open(file_name, 'r', encoding='cp1252')
Error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1260: character maps to <undefined>
When I switch to this
list_of_files = glob.glob('./*.txt')  # create the list of files
for file_name in list_of_files:
    FI = open(file_name, 'r', encoding="utf-8")
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1459: invalid start byte
And I have read that I should open this as a binary file. But I'm not sure how to do this. Here is my function:
def readingAndAddToList():
    list_of_files = glob.glob('./*.txt')  # create the list of files
    for file_name in list_of_files:
        FI = open(file_name, 'r', encoding="utf-8")
        stext = textProcessing(FI.read())  # split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc., like re's \s)
        secondaryWord_list = stext.split()
        word_list.extend(secondaryWord_list)  # add words to main list
        print("Lungimea fisierului ", FI.name, " este de", len(secondaryWord_list), "caractere")
        sortingAndNumberOfApparitions(secondaryWord_list)
        FI.close()
Only the beginning of my function matters, because I get the error at the reading part.
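For what it's worth, 0x9d is undefined in cp1252 and 0x92 is an invalid UTF-8 start byte, which is why both attempts fail. Latin-1 accepts every byte, and errors="replace" lets either codec keep going instead of raising. A small sketch on a made-up byte string:

```python
# 0x92 is a curly apostrophe in cp1252; 0x9d is undefined in cp1252,
# and neither byte is a valid UTF-8 sequence on its own
sample = b"it\x92s a \x9d test"

# latin-1 maps all 256 byte values, so this can never raise
as_latin1 = sample.decode("latin-1")

# errors="replace" substitutes U+FFFD for the undecodable 0x9d
as_cp1252 = sample.decode("cp1252", errors="replace")
print(as_latin1)
print(as_cp1252)
```

If the files mix encodings (common with text collected from several sources), errors="replace" on cp1252 is usually the least-bad option for word counting.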
If you are on Windows, open the file in Notepad and save it with the desired encoding.
On Linux, do the same in a text editor.
Hope your program runs.

UnicodeEncodeError: 'charmap' codec can't encode character inspite of encoding to utf-8

I am converting my XML documents to plain text. There is a directory containing XML files and one python file to compile.
I have opened my XML files as:
with open(file, 'r', encoding = 'utf-8') as f:
then wrote the contents of f to another file:
for items in xmllist:
    fx.write(items)
but it gives me the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2009' in position 25: character maps to <undefined>
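Since '\u2009' (a thin space) cannot be encoded by cp1252, the Windows default for files opened without an encoding, one likely fix is to open the output file fx with an explicit UTF-8 encoding as well, not just the input. A self-contained sketch (the output path and contents are made up):

```python
import os
import tempfile

xmllist = ["text with a thin\u2009space"]  # stand-in for the parsed XML text

out = os.path.join(tempfile.gettempdir(), "plain.txt")  # hypothetical output
with open(out, "w", encoding="utf-8") as fx:  # explicit encoding avoids 'charmap'
    for items in xmllist:
        fx.write(items)

with open(out, "r", encoding="utf-8") as fh:
    content = fh.read()
print(content)
```

Reading with encoding='utf-8' only fixes decoding; every open() call that writes needs its own encoding argument too.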

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write them to the file with data.encode("utf-8"), but the result, b'\xc3\xa4\xc3\xa4\xc3\x96', is not what I want (the UTF-8 bytes written out as literal characters). I want "ääÖ" stored in the file as written.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv

data = "ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    data = resultFile
    a.writerows(data)
Traceback:
File "<ipython-input-280-73b1f615929e>", line 5, in <module>
a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)
Add an encoding parameter to open() and set it to 'utf8'.
import csv

data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Edit: Removed the use of io library as open is same as io.open in Python 3.
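One caveat worth noting: csv.writer.writerows() expects an iterable of rows, so passing the bare string "ääÖ" writes each character as its own row; wrapping it as [["ääÖ"]] keeps it as one field on one line, and the csv module docs recommend newline='' when opening the output file. A sketch using a temporary file instead of test.csv:

```python
import csv
import os
import tempfile

data = "ääÖ"
path = os.path.join(tempfile.gettempdir(), "test.csv")  # hypothetical path

# newline='' is what the csv docs recommend for files passed to csv.writer;
# writerows() takes an iterable of rows, each row an iterable of fields
with open(path, "w", encoding="utf-8", newline="") as fp:
    writer = csv.writer(fp, delimiter=";")
    writer.writerows([[data]])

with open(path, "r", encoding="utf-8") as fp:
    content = fp.read()
print(content)
```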
This solution should work on both Python 2 and 3 (the coding declaration is not needed in Python 3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv

data = "ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source
