How can I get Python to parse the following text?

I have a sample of the text:
"PROTECTING-ħarsien",
I'm trying to parse it with the following code:
import csv, json
with open('./dict.txt') as maltese:
    entries = maltese.readlines()
for entry in entries:
    tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
    if len(tokens) == 1:
        pass
    else:
        print tokens[0] + "," + unicode(tokens[1])
But I'm getting an error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What am I doing wrong?

It appears that dict.txt is UTF-8 encoded (ħ is 0xc4 0xa7 in UTF-8).
You should open the file as UTF-8, then:
import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
    # etc.
You will then have Unicode strings instead of bytestrings to work with; you therefore don't need to call unicode() on them, but you may have to re-encode them to the encoding of the terminal you're outputting to.
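For instance, with the file opened that way the loop body needs no unicode() call at all. A minimal sketch using the sample line from the question (the line is hard-coded here instead of read from dict.txt):

```python
# -*- coding: utf-8 -*-
# Once the line has been decoded, the tokens are already unicode
# strings, so unicode() is unnecessary.
line = u'"PROTECTING-\u0127arsien",\r\n'   # sample entry; \u0127 is ħ
tokens = line.replace(u'"', u'').replace(u',', u'').replace(u'\r\n', u'').split(u'-')
print(tokens[0] + u',' + tokens[1])
```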

You have to change your last line to (this has been tested to work on your data):
print tokens[0] + "," + unicode(tokens[1], 'utf8')
If you don't have that utf8, Python assumes that the source is ascii encoding, hence the error.
See http://docs.python.org/2/howto/unicode.html#the-unicode-type
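In Python 3, the equivalent of unicode(b, 'utf8') is bytes.decode. A quick illustration with the bytes behind the traceback (a sketch, not the asker's exact data):

```python
raw = b'\xc4\xa7arsien'    # UTF-8 bytes for "ħarsien"; 0xc4 is the byte from the error
print(raw.decode('utf8'))  # an explicit encoding avoids the ascii default
```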

Related

I keep getting a UnicodeDecodeError when trying to pull lines from multiple ASCII files into a single file. How can I resolve this?

The files I am working with are .ASC files, each represents an analysis of a sample on a mass spectrometer. The same isotopes were measured in each sample and therefore the files have common headers. The goal of this code is to pull the lines of text which contain the common headers and the counts per second (cps) data from all of the .ASC files in a given folder and to compile it into a single file.
I have searched around and I believe my code is along the right lines, but I keep getting encoding errors. I have tried specifying the encoding wherever I call open(), using both ascii and utf-8, but I still get errors.
Below are the error messages I received:
Without specifying encoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1010: character maps to <undefined>
ascii: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 148: ordinal not in range(128)
utf-8: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 148: invalid continuation byte
I am very inexperienced with coding so if you notice anything idiotic in the code, let me know.
filepath = open("insert_filepath_here")
output_lst = []

def process_file(filepath):
    interesting_keys = (
        'Li7(LR)',
        'Be9(LR)',
        'Na23(LR)',
        'Rb85(LR)',
        'Sr88(LR)',
        'Y89(LR)',
        'Zr90(LR)',
        'Nb93(LR)',
        'Mo95(LR)',
        'Cd111(LR)',
        'In115(LR)',
        'Sn118(LR)',
        'Cs13(LR)',
        'Ba137(LR)',
        'La139(LR)',
        'Ce140(LR)',
        'Pr141(LR)',
        'Nd146(LR)',
        'Sm147(LR)',
        'Eu153(LR)',
        'Gd157(LR)',
        'Tb159(LR)',
        'Dy163(LR)',
        'Ho165(LR)',
        'Er166(LR)',
        'Tm169(LR)',
        'Yb172(LR)',
        'Lu175(LR)',
        'Hf178(LR)',
        'Ta181(LR)',
        'W182(LR)',
        'Tl205(LR)',
        'Pb208(LR)',
        'Bi209(LR)',
        'Th232(LR)',
        'U238(LR)',
        'Mg24(MR)',
        'Al27(MR)',
        'Si28(MR)',
        'Ca44(MR)',
        'Sc45(MR)',
        'Ti47(MR)',
        'V51(MR)',
        'Cr52(MR)',
        'Mn55(MR)',
        'Fe56(MR)',
        'Co59(MR)',
        'Ni60(MR)',
        'Cu63(MR)',
        'Zn66(MR)',
        'Ga69(MR)',
        'K39(HR)'
    )
    with open(filepath) as fh:
        content = fh.readlines()
    for line in content:
        line = line.strip()
        if ":" in line:
            key, _ = line.split(":", 1)
            if key.strip() in interesting_keys:
                output_lst.append(line)

def write_output_data():
    if output_lst:
        with open(output_file, "w") as fh:
            fh.write("\n".join(output_lst))
        print("See", output_file)

def process_files():
    for filepath in os.listdir(input_dir):
        process_file(os.path.join(input_dir, filepath))
    write_output_data()

process_files()

Encoding when read from a .txt file

I am trying to extract text under a section from a txt file. I get ’ instead of the apostrophe ' when the encoding is not specified. When I use encoding='utf-8', I get the error 'utf-8' codec can't decode bytes in position 669-670: invalid continuation byte. I assume it is due to issues in other parts of the txt file. I tried other options such as encoding='latin-1' and encoding='ISO-8859-1', but then I got â\x80\x99 instead of the apostrophe '.
Any ideas on what encoding I should use? Here is the Link to the txt file.
mylines = []
with open('icrr.txt', 'r', encoding='utf-8') as myfile:
    copy = False
    for line in myfile:
        if line.strip().lower().startswith("13. lessons"):
            copy = True
            continue
        elif line.strip().lower().startswith("14. assessment recommended"):
            copy = False
            continue
        elif copy:
            mylines.append(line)
print("".join(mylines))
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 669-670: invalid continuation byte
The error is correct. That file contains an invalid UTF-8 sequence at byte 17,053 (E2 80 3F). It has been damaged somehow.
You can read the file by telling Python to replace bad bytes with the Unicode replacement character (�):
with open('icrr.txt', 'r', encoding='utf-8', errors='replace') as myfile:
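What errors='replace' does can be seen directly at the bytes level. This sketch uses the invalid E2 80 3F sequence mentioned above, embedded in made-up surrounding text:

```python
raw = b'It\xe2\x80\x3fs damaged'   # E2 80 3F: the invalid sequence from the answer
# The truncated E2 80 pair becomes the replacement character; decoding
# then resumes at 3F (an ordinary '?').
print(raw.decode('utf-8', errors='replace'))
```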

How to extract String from a Unicoded JSONObject in Python?

I'm getting the below error when I try to parse a string containing Unicode characters such as the ' symbol, emojis, etc.:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f33b' in position 19: ordinal not in range(128)
Sample Object:
{"user":{"name":"\u0e2a\u0e31\u0e48\u0e07\u0e14\u0e48\u0e27\u0e19 \u0e2b\u0e21\u0e14\u0e44\u0e27 \u0e40\u0e14\u0e23\u0e2a\u0e41\u0e1f\u0e0a\u0e31\u0e48\u0e19\u0e21\u0e32\u0e43\u0e2b\u0e21\u0e48 \u0e23\u0e32\u0e04\u0e32\u0e40\u0e1a\u0e32\u0e46 \u0e2a\u0e48\u0e07\u0e17\u0e31\u0e48\u0e27\u0e44\u0e17\u0e22 \u0e44\u0e14\u0e49\u0e02\u0e2d\u0e07\u0e0a\u0e31\u0e27\u0e23\u0e4c\u0e08\u0e49\u0e32 \u0e2a\u0e19\u0e43\u0e08\u0e15\u0e34\u0e14\u0e15\u0e48\u0e2d\u0e2a\u0e2d\u0e1a\u0e16\u0e32\u0e21 Is it","tag":"XYZ"}}
I'm able to extract tag value, but I'm unable to extract name value.
Here is my code:
import json

dict = json.loads(json_data)
print('Tag - ', dict['user']['tag'])
print('Name - ', dict['user']['name'])
You can save the data in CSV file format which could also be opened using Excel. When you open a file in this way: open(filename, "w") then you can only store ASCII characters, but if you try to store Unicode data this way, you would get UnicodeEncodeError. In order for you to store Unicode data, you need to open the file with UTF-8 encoding.
mydict = json.loads(json_data) # or whatever dictionary it is...
# Open the file with UTF-8 encoding, most important step
f = open("userdata.csv", "w", encoding='utf-8')
f.write(mydict['user']['name'] + ", " + mydict['user']['tag'] + "\n")
f.close()
Feel free to change the code based on the data you have.
That's it...
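As a sketch of the whole round trip, here is a shortened, made-up stand-in for the object above (a Thai fragment plus the sunflower emoji from the error message), parsed and written with an explicit UTF-8 encoding:

```python
import json

# JSON escapes astral characters like U+1F33B as surrogate pairs;
# json.loads reassembles them into a single character.
json_data = '{"user": {"name": "\\u0e2a\\u0e31\\u0e48\\u0e07 \\ud83c\\udf3b", "tag": "XYZ"}}'
mydict = json.loads(json_data)

# Opening with encoding='utf-8' is what prevents the UnicodeEncodeError.
with open("userdata.csv", "w", encoding="utf-8") as f:
    f.write(mydict['user']['name'] + ", " + mydict['user']['tag'] + "\n")
```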

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I execute the same program in Python 3.4, this error appears:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename, colkey, colvalue):
    f = open(filename, 'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey], []) + [line.split(';')[colvalue]]
    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt', 1, 2)
In Python 2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python 3, the same code reads lines from the file as strings. Python 3 strings are what Python 2 calls unicode objects: bytes decoded according to some encoding. The default encoding in Python 3 is utf-8.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
shows Python 3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
Ok, I did what @unutbu suggested. The result was a long list of encodings; one of them was cp1250, so I changed:
f = open(filename,'r')
to
f = open(filename, 'r', encoding='cp1250')
as @triplee suggested, and now I can read my files.
In my case I can't change the encoding, because my file really is UTF-8 encoded, but some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')

Delete all files in directory with non utf-8 symbols

I have a set of data, but I need to work only with utf-8 data, so I need to delete all data with non-utf-8 symbols.
When I try to work with these files, I receive:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
My code
class Corpus:
    def __init__(self, path_to_dir=None):
        self.path_to_dir = path_to_dir if path_to_dir else []

    def emails_as_string(self):
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                with io.open(self.add_slash(self.path_to_dir) + file_name, 'r', encoding='utf-8') as body:
                    yield [file_name, body.read()]

    def add_slash(self, path):
        if path.endswith("/"): return path
        return path + "/"
I receive the error here: yield [file_name, body.read()] and here: list_of_emails = mailsrch.findall(text), but when I work with utf-8 data everything is fine.
I suspect you want to use the errors='ignore' argument on bytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode for more info.
Edit:
Here's an example showing a good way to do this:
for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        with open(fullpath, 'r', encoding='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]
Using os.path.join, you can eliminate your add_slash method, and ensure that it works cross-platform.
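The errors='ignore' behaviour can also be seen directly on bytes. An illustrative sketch using the 0xc1 byte from the error message (the surrounding text is made up):

```python
raw = b'valid \xc1 invalid'                  # 0xc1 is never a legal UTF-8 byte
print(raw.decode('utf-8', errors='ignore'))  # the bad byte is silently dropped
```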

Categories

Resources