I have a set of data, but I need to work only with UTF-8 data, so I need to discard anything containing non-UTF-8 characters.
When I try to work with these files, I receive:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
My code:
import os
import io

class Corpus:
    def __init__(self, path_to_dir=None):
        self.path_to_dir = path_to_dir if path_to_dir else []

    def emails_as_string(self):
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                with io.open(self.add_slash(self.path_to_dir) + file_name, 'r', encoding='utf-8') as body:
                    yield [file_name, body.read()]

    def add_slash(self, path):
        if path.endswith("/"):
            return path
        return path + "/"
I receive the error at yield [file_name, body.read()] and at list_of_emails = mailsrch.findall(text), but when the data is valid UTF-8 everything works fine.
I suspect you want to use the errors='ignore' argument on bytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode for more info.
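As a quick illustration: errors='ignore' silently drops bytes that are not valid UTF-8, while errors='replace' substitutes the U+FFFD replacement character instead.

```python
raw = b'abc\xc1def'  # 0xc1 can never start a valid UTF-8 sequence

# errors='ignore' drops the offending byte entirely
print(raw.decode('utf-8', errors='ignore'))   # abcdef

# errors='replace' keeps a visible marker where the bad byte was
print(raw.decode('utf-8', errors='replace'))  # abc\ufffddef
```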
Edit:
Here's an example showing a good way to do this:
for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        with open(fullpath, 'r', encoding='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]
Using os.path.join, you can eliminate your add_slash method, and ensure that it works cross-platform.
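For example, os.path.join picks the right separator for the platform, so no manual slash handling is needed:

```python
import os

# os.path.join inserts the platform's path separator between components:
full = os.path.join('mail', 'inbox', 'message.txt')
print(full)  # mail/inbox/message.txt on POSIX, mail\inbox\message.txt on Windows
```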
Related
The files I am working with are .ASC files, each represents an analysis of a sample on a mass spectrometer. The same isotopes were measured in each sample and therefore the files have common headers. The goal of this code is to pull the lines of text which contain the common headers and the counts per second (cps) data from all of the .ASC files in a given folder and to compile it into a single file.
I have searched around and I believe my code is along the right lines, but I keep getting encoding errors. I have tried specifying the encoding wherever I call open(), using both ascii and utf-8 as the encoding type, but I still get errors.
Below are the error messages I received:
Without specifying encoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1010: character maps to <undefined>
ascii: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 148: ordinal not in range(128)
utf-8: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 148: invalid continuation byte
I am very inexperienced with coding so if you notice anything idiotic in the code, let me know.
import os

input_dir = "insert_input_dir_here"        # folder containing the .ASC files
output_file = "insert_output_filepath_here"
output_lst = []

def process_file(filepath):
    interesting_keys = (
        'Li7(LR)',
        'Be9(LR)',
        'Na23(LR)',
        'Rb85(LR)',
        'Sr88(LR)',
        'Y89(LR)',
        'Zr90(LR)',
        'Nb93(LR)',
        'Mo95(LR)',
        'Cd111(LR)',
        'In115(LR)',
        'Sn118(LR)',
        'Cs13(LR)',
        'Ba137(LR)',
        'La139(LR)',
        'Ce140(LR)',
        'Pr141(LR)',
        'Nd146(LR)',
        'Sm147(LR)',
        'Eu153(LR)',
        'Gd157(LR)',
        'Tb159(LR)',
        'Dy163(LR)',
        'Ho165(LR)',
        'Er166(LR)',
        'Tm169(LR)',
        'Yb172(LR)',
        'Lu175(LR)',
        'Hf178(LR)',
        'Ta181(LR)',
        'W182(LR)',
        'Tl205(LR)',
        'Pb208(LR)',
        'Bi209(LR)',
        'Th232(LR)',
        'U238(LR)',
        'Mg24(MR)',
        'Al27(MR)',
        'Si28(MR)',
        'Ca44(MR)',
        'Sc45(MR)',
        'Ti47(MR)',
        'V51(MR)',
        'Cr52(MR)',
        'Mn55(MR)',
        'Fe56(MR)',
        'Co59(MR)',
        'Ni60(MR)',
        'Cu63(MR)',
        'Zn66(MR)',
        'Ga69(MR)',
        'K39(HR)'
    )
    with open(filepath) as fh:
        content = fh.readlines()
    for line in content:
        line = line.strip()
        if ":" in line:
            key, _ = line.split(":", 1)
            if key.strip() in interesting_keys:
                output_lst.append(line)

def write_output_data():
    if output_lst:
        with open(output_file, "w") as fh:
            fh.write("\n".join(output_lst))
        print("See", output_file)

def process_files():
    for filepath in os.listdir(input_dir):
        process_file(os.path.join(input_dir, filepath))
    write_output_data()

process_files()
I am parsing a CSV file and I am getting the below error.
import os
import csv
from collections import defaultdict

demo_data = defaultdict(list)

if os.path.exists("infoed_daily _file.csv"):
    f = open("infoed_daily _file.csv", "rt")
    csv_reader = csv.DictReader(f)
    line_no = 0
    for line in csv_reader:
        line_no += 1
        print(line, line_no)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to <undefined>
Please advise.
Thanks..
-Prasanna.K
The error likely means your file is in an encoding other than UTF-8, which (on most systems) is the default used by open().
When I run
b'\x81'.decode('Latin1')
b'\x81'.decode('Latin2')
b'\x81'.decode('iso8859')
b'\x81'.decode('iso8859-2')
then it runs without error, so your file may be in one of these encodings (or a similar one), and you have to specify it explicitly:
open(..., encoding='Latin1')
or similar.
List of other encodings: codecs: standard encodings
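If you are unsure which of those encodings applies, one approach is to try candidates in order. Note that Latin-1 maps all 256 byte values, so with it in the list something always decodes; treat the result as a guess. guess_encoding here is just an illustrative helper, not a standard function:

```python
def guess_encoding(raw, candidates=('utf-8', 'iso8859-2', 'latin-1')):
    """Return the first candidate encoding that decodes raw without error.

    Latin-1 accepts every byte, so if it appears in the list the function
    always returns something -- it acts as a catch-all fallback.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```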
f=open("myfile1.txt",'r')
print(f.read())
Well, for the above code I got this error:
'charmap' codec can't decode byte 0x81 in position 637: character maps to <undefined>
So I tried changing the file's extension, and it worked.
Happy Coding
Thanks!
Vani
You can use a context manager:

with open('filename.txt', 'w') as f:
    f.write(content)

Note that the file has to be opened in 'w' mode to write to it. The good thing is that the context manager automatically closes the file after the work is done.
I'm trying to write an HTTP server, but that part doesn't really matter here.
When I try to decode image data (after calling data = file.read()), I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I opened the file in 'rb' mode.
Other people usually hit this error because they open the file in 'r' mode, but I used 'rb'. So what is the problem here?
def get_content_file(file_path):
    """
    Gets a full path to a file and returns the content of it.
    file_path must be a valid path.
    :param file_path: str (path)
    :return: str (data)
    """
    print(file_path)
    file = open(file_path, 'rb')
    data = file.read()
    file.close()
    return data.decode()
I suggest you confirm the encoding of the file at file_path. Download it and open it in Notepad++, then check the lower-right corner: there you can see whether the file is encoded in a compatible format, or whether it has a byte order mark (BOM). If either is the case, simply use 'Save As' to write it out in the correct/required format.
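The same BOM check can also be done programmatically. detect_bom below is a hypothetical helper (not part of any library) that compares the first bytes of a file against the BOM constants in the standard codecs module:

```python
import codecs

def detect_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check the UTF-8 signature first, then the UTF-16 variants.
    signatures = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]
    for bom, name in signatures:
        if head.startswith(bom):
            return name
    return None
```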
I am using python 2.7 to read a JSON file. My code is:
import json
from json import JSONDecoder
import os

path = os.path.dirname(os.path.abspath(__file__)) + '/json'
print path
for root, dirs, files in os.walk(os.path.dirname(path + '/json')):
    for f in files:
        if f.lower().endswith(".json"):
            fp = open(root + '/' + f)
            data = fp.read()
            print data.decode('utf-8')
But I got the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 72: invalid continuation byte
Your file is not encoded in UTF-8; fp.read() just returns raw bytes, and the error is raised by data.decode('utf-8'). To read the file with a more permissive encoding, use:
import io
io.open(filename, encoding='latin-1')
And the correct, not platform-dependent usage for joining your paths is:
os.path.join(root, f)
I have a sample of the text:
"PROTECTING-ħarsien",
I'm trying to parse with the following
import csv, json

with open('./dict.txt') as maltese:
    entries = maltese.readlines()
for entry in entries:
    tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
    if len(tokens) == 1:
        pass
    else:
        print tokens[0] + "," + unicode(tokens[1])
But I'm getting an error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What am I doing wrong?
It appears that dict.txt is UTF-8 encoded (ħ is 0xc4 0xa7 in UTF-8).
You should open the file as UTF-8, then:
import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
# etc.
You will then have Unicode strings rather than byte strings to work with, so you don't need to call unicode() on them; you may, however, have to re-encode them to match the encoding of the terminal you're outputting to.
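That re-encoding step might look like the sketch below. Note that sys.stdout.encoding can be None when output is piped, hence the fallback:

```python
# -*- coding: utf-8 -*-
import sys

text = u"\u0127arsien"  # u'ħarsien', already decoded to a Unicode string
# Pick the terminal's encoding, falling back to UTF-8 when it is unknown:
target = sys.stdout.encoding or 'utf-8'
encoded = text.encode(target, errors='replace')
```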
You have to change your last line to (this has been tested to work on your data):
print tokens[0] + "," + unicode(tokens[1], 'utf8')
If you omit that 'utf8' argument, Python assumes the source is ASCII-encoded, hence the error.
See http://docs.python.org/2/howto/unicode.html#the-unicode-type
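In Python 3, where unicode() no longer exists, the same conversion is spelled bytes.decode with an explicit encoding:

```python
raw = b'\xc4\xa7arsien'    # the UTF-8 bytes for 'ħarsien'
text = raw.decode('utf8')  # decode with an explicit encoding
print(text)                # ħarsien
```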