I am using python 2.7 to read a JSON file. My code is:
import json
from json import JSONDecoder
import os
path = os.path.dirname(os.path.abspath(__file__))+'/json'
print path
for root, dirs, files in os.walk(os.path.dirname(path+'/json')):
for f in files:
if f.lower().endswith((".json")):
fp=open(root + '/'+f)
data = fp.read()
print data.decode('utf-8')
But I got the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 72: invalid continuation byte
Your file is not encoded in UTF-8, and the error occurs at the fp.read() line. You must use:
import io
io.open(filename, encoding='latin-1')
And the correct, not platform-dependent usage for joining your paths is:
os.path.join(root, f)
Related
The files I am working with are .ASC files, each represents an analysis of a sample on a mass spectrometer. The same isotopes were measured in each sample and therefore the files have common headers. The goal of this code is to pull the lines of text which contain the common headers and the counts per second (cps) data from all of the .ASC files in a given folder and to compile it into a single file.
I have searched around and I believe my code is along the right lines, but I keep getting encoding errors. I have tried specifying the encoding wherever I call 'open('and I have tried using ascii as the encoding type and utf-8, but still errors.
Below are the error messages I received:
Without specifying encoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1010: character maps to undefined>
ascii: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 148: ordinal not in range(128)
utf-8: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 148: invalid continuation byte
I am very inexperienced with coding so if you notice anything idiotic in the code, let me know.
filepath = open("insert_filepath_here")
output_lst = []
def process_file(filepath):
interesting_keys = (
'Li7(LR)',
'Be9(LR)',
'Na23(LR)',
'Rb85(LR)',
'Sr88(LR)',
'Y89(LR)',
'Zr90(LR)',
'Nb93(LR)',
'Mo95(LR)',
'Cd111(LR)',
'In115(LR)',
'Sn118(LR)',
'Cs13(LR)',
'Ba137(LR)',
'La139(LR)',
'Ce140(LR)',
'Pr141(LR)',
'Nd146(LR)',
'Sm147(LR)',
'Eu153(LR)',
'Gd157(LR)',
'Tb159(LR)',
'Dy163(LR)',
'Ho165(LR)',
'Er166(LR)',
'Tm169(LR)',
'Yb172(LR)',
'Lu175(LR)',
'Hf178(LR)',
'Ta181(LR)',
'W182(LR)',
'Tl205(LR)',
'Pb208(LR)',
'Bi209(LR)',
'Th232(LR)',
'U238(LR)',
'Mg24(MR)',
'Al27(MR)',
'Si28(MR)',
'Ca44(MR)',
'Sc45(MR)',
'Ti47(MR)',
'V51(MR)',
'Cr52(MR)',
'Mn55(MR)',
'Fe56(MR)',
'Co59(MR)',
'Ni60(MR)',
'Cu63(MR)',
'Zn66(MR)',
'Ga69(MR)',
'K39(HR)'
)
with open(filepath) as fh:
content = fh.readlines()
for line in content:
line = line.strip()
if ":" in line:
key, _ = line.split(":",1)
if key.strip() in interesting_keys:
output_lst.append(line)
def write_output_data():
if output_lst:
with open(output_file, "w") as fh:
fh.write("\n".join(output_lst))
print("See", output_file)
def process_files():
for filepath in os.listdir(input_dir):
process_file(os.path.join(input_dir, filepath))
write_output_data()
process_files()
I am parsing a csv file and i am getting the below error
import os
import csv
from collections import defaultdict
demo_data = defaultdict(list)
if os.path.exists("infoed_daily _file.csv"):
f = open("infoed_daily _file.csv", "rt")
csv_reader = csv.DictReader(f)
line_no = 0
for line in csv_reader:
line_no +=1
print(line,line_no)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to
<undefined>
Please advise.
Thanks..
-Prasanna.K
Error may means you have file in encoding different then UTF-8 which (probably in most systems) is used as default in open()
When I run
b'\x81'.decode('Latin1')
b'\x81'.decode('Latin2')
b'\x81'.decode('iso8859')
b'\x81'.decode('iso8859-2')
then it runs without error - so your file can be in some of these encodings (or similar encoding) and you have to use it
open(..., encoding='Latin1')
or similar.
List of other encodings: codecs: standard encodings
f=open("myfile1.txt",'r')
print(f.read())
Well, for the above code I got an error as:
'charmap' codec can't decode byte 0x81 in position 637: character maps to
so i tried changing the name of the file extension and it worked.
Happy Coding
Thanks!
Vani
you can use '
with open ('filename.txt','r') as f:
f.write(content)
The good thing is that it automatically closes the file after work is done.
im trying to parse docx file. I unziped it first, then tried to read Document.xml file with with open(..) and its raise error that "'charmap' codec can't decode byte 0x98 in position 7618: character maps to ". XML is "UTF-8" encoding:
Error:
I wrote the following code:
with open(self.tempDir + self.CONFIG['main_xml']) as xml_file:
self.dom_xml = etree.parse(xml_file)
I treid to force encode to UTF-8, but then i can't read etree.fromstring(..) correctly
7618 symbol (from error) is :
Please help me. How to read xml file correctly?
Thnks
This works without errors on your file:
import zipfile
import xml.etree.ElementTree as ET
zipfile.ZipFile('file.docx').extractall()
root = ET.parse('word/document.xml').getroot()
I have hit a road block when trying to read a CSV file with python.
UPDATE:
if you want to just skip the character or error you can open the file like this:
with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
So far I have tried.
for directory, subdirectories, files in os.walk(root_dir):
for file in files:
with open(os.path.join(directory, file), 'r') as data_file:
reader = csv.reader(data_file)
for row in reader:
print (row)
the error I am getting is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I have Tried
with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>
Now if I just print the data_file it says they are cp1252 encoded but if I try
with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I also tried the recommended package.
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
The line I am trying to parse is:
2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT #WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None
any thoughts or help is appreciated.
I would use csvkit, that uses automatic detection of apposite encoding and decoding. e.g.
import csvkit
reader = csvkit.reader(data_file)
As disscussed in the chat- solution is-
for directory, subdirectories, files in os.walk(root_dir):
for file in files:
with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
reader = csv.reader(data_file)
for row in reader:
data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
print (data)
I have a set of data, but I need to work only with utf-8 data, so I need to delete all data with non-utf-8 symbols.
When I try to work with these files, I receive:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
My code
class Corpus:
def __init__(self,path_to_dir=None):
self.path_to_dir = path_to_dir if path_to_dir else []
def emails_as_string(self):
for file_name in os.listdir(self.path_to_dir):
if not file_name.startswith("!"):
with io.open(self.add_slash(self.path_to_dir)+file_name,'r', encoding ='utf-8') as body:
yield[file_name,body.read()]
def add_slash(self, path):
if path.endswith("/"): return path
return path + "/"
I recive error here yield[file_name,body.read()] and herelist_of_emails = mailsrch.findall(text), but when I work with utf-8 all great.
I suspect you want to use the errors='ignore' argument on bytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode .for more info.
Edit:
Here's an example showing a good way to do this:
for file_name in os.listdir(self.path_to_dir):
if not file_name.startswith("!"):
fullpath = os.path.join(self.path_to_dir, file_name)
with open(fullpath, 'r', encoding ='utf-8', errors='ignore') as body:
yield [file_name, body.read()]
Using os.path.join, you can eliminate your add_slash method, and ensure that it works cross-platform.