Function throws a SyntaxError: (unicode error) - python

I am running the following code in Python and it gives me this error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
def filePro(filename):
    f = open(filename, 'r')
    wordcount = 0
    for lines in f:
        f1 = lines.split()
        wordcount = wordcount + len(f1)
    f.close()
    # Pass both values to print(); the original form only printed the label in Python 3
    print('word count:', str(wordcount))
Please help me.

Unicode literals (string literals in Python 3.x) containing a \U or \u escape sequence must take one of the following forms:
>>> u'\U00000061' # 8 hexadecimals
'a'
>>> u'\u0061' # 4 hexadecimals
'a'
If the escape sequence has too few hexadecimal digits, you get a SyntaxError.
>>> u'\u61'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
>>> u'\U000061'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-7: truncated \UXXXXXXXX escape
If you mean a literal \ followed by U, use a raw string:
>>> r'\u0061'
'\\u0061'
>>> print(r'\u0061')
\u0061
The code you posted contains no Unicode escape sequence, so check the other parts of your code.

I'm not sure, since not much information is provided here, but my guess is that Python is trying to open the file with the wrong encoding. You could open the file with the codecs library and pass the correct codec. If I don't know the encoding, or if the file comes from Windows, I usually use 'cp1252', as this can open most of them.
import codecs

def filePro(filename):
    # Decode the file as cp1252 while reading
    f = codecs.open(filename, 'r', 'cp1252')
    wordcount = 0
    for lines in f:
        f1 = lines.split()
        wordcount = wordcount + len(f1)
    f.close()
    print('word count:', str(wordcount))
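As a rough sketch of the same idea without codecs: the built-in open() in Python 3 accepts an encoding argument directly. The name count_words and the sample file are made up for illustration, and 'cp1252' is only the guess suggested above, not a detected encoding.

```python
import os
import tempfile


def count_words(filename, encoding="cp1252"):
    """Return the number of whitespace-separated words in a text file."""
    total = 0
    # The with-block closes the file even if reading fails part-way.
    with open(filename, "r", encoding=encoding) as handle:
        for line in handle:
            total += len(line.split())
    return total


# Hypothetical demo: write a small cp1252-encoded file, then count its words.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="cp1252") as out:
        out.write("café au lait\nthé vert\n")
    demo_count = count_words(sample)

print("word count:", demo_count)
```

Returning the count instead of printing it inside the function keeps the function reusable; the caller decides how to display it.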
Another possibility is that the filename you pass contains something Python interprets as an escape sequence, for example a file name like 'c:\Users\something': here the \U will be interpreted as the start of a Unicode escape.
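To make that concrete, here is a small sketch that feeds such a path literal to compile() so the SyntaxError can be caught and inspected instead of stopping the script. The variable names are made up for illustration; the path is the one from the sentence above.

```python
# The text of a one-line program, exactly as it would appear in a .py file.
# The backslashes are doubled here only so that THIS example itself compiles;
# the program text held in bad_source contains single backslashes.
bad_source = "path = 'C:\\Users\\something'"

try:
    compile(bad_source, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    # \U must be followed by eight hex digits, and "sers" is not that.
    error_message = str(exc)

print(error_message)

# Three spellings that avoid the problem:
raw = r'C:\Users\something'        # raw string: backslashes are kept literally
doubled = 'C:\\Users\\something'   # each backslash escaped
forward = 'C:/Users/something'     # forward slashes, which Windows generally accepts

print(raw == doubled)
```

The raw and doubled spellings produce the identical string, so which one to use is a matter of readability.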

Related

Os.chdir with non-Latin characters

While learning the os module in Python, I've come across a problem.
Let's pretend my current working directory is: C:\Users\Москва\Desktop\Coding\Project 1.
I'd like to change the cwd to Desktop, but since the path contains some Russian letters (Москва) it throws a SyntaxError:
print(os.getcwd()) # C:\Users\Москва\Desktop\Coding\Project 1
os.chdir('C:\Users\Москва\Desktop')
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes...
How should I handle non-standard characters in paths, and how do I change the directory in my case?
It isn't about the Russian letters; it's about the backslash followed by U: \U.
When you print the result of os.getcwd(), the escaped backslashes disappear:
os.getcwd()
# 'C:\\Users\\chris\\Documents\\Москва\\test'
print(os.getcwd())
C:\Users\chris\Documents\Москва\test
If you now copy and paste the printed path into a string literal, Python will treat the \U at the start of \Users as the beginning of a Unicode escape, which of course fails. You can reproduce this simply by executing
"\Uaaaa"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
You can either use a raw string or escape the backslashes:
os.chdir(r'C:\Users\chris\Documents\Москва')
#        ^ note the `r` prefix here
os.getcwd()
# 'C:\\Users\\chris\\Documents\\Москва'
os.chdir('C:\\Users\\chris\\Documents\\Москва')
os.getcwd()
# 'C:\\Users\\chris\\Documents\\Москва'
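The difference between what the interpreter echoes and what print() shows can be checked without touching the file system. This is only a sketch: it uses the Desktop path from the question, and PureWindowsPath from the standard pathlib module is brought in here as one more way to write the path, not something the answer above relies on.

```python
from pathlib import PureWindowsPath

raw = r'C:\Users\Москва\Desktop'          # raw string literal
escaped = 'C:\\Users\\Москва\\Desktop'    # escaped backslashes

# Both literals describe the same string; the Cyrillic letters are not an issue.
same = (raw == escaped)
print(same)

# repr() shows the escaped form, which is what the interactive prompt echoes.
print(repr(raw))
# print() shows the actual characters, with single backslashes.
print(raw)

# PureWindowsPath only manipulates the text, so this runs on any platform.
# Forward slashes contain no escapes and are normalised to backslashes.
from_forward = str(PureWindowsPath('C:/Users/Москва/Desktop'))
print(from_forward == raw)
```

Because the printed form has single backslashes, it cannot be pasted back into a normal string literal unchanged, which is exactly the trap described above.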

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe1 in position 9: invalid continuation byte

I have the following sh file:
#!/bin/bash
python <py_filename>.py
The file <py_filename>.py begins with
#!/usr/bin/env python
# coding: utf-8
and contains a list of strings with special characters like ã and ç.
When I do
sh <sh_filename>.sh
The following error is returned:
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe1 in position 9: invalid continuation byte
and I can see that, for instance, the word 'Ações' is represented as 'A��es'.
I understand that this is related to file encoding, but I don't know how to solve it.

degree symbol in Python triple-quoted string

It appears that the Python 3.7 interpreter won't accept a module that has triple-quoted strings containing special characters like the degree symbol. (I don't want to use an escape code for the degree symbol because the comments are for the benefit of someone looking at the code, which would then become less intelligible.) Is there any way to get around this?
This problem can be reproduced if the Python file is incorrectly encoded with an 8-bit encoding instead of UTF-8. The byte 0xb0 maps to the degree symbol in many 8-bit encodings.
The error is reproduced if the Python file is encoded as latin-1:
iconv --from-code=utf-8 --to-code=latin1 special_char.py > latin_1_char.py
python3.7 latin_1_char.py
File "latin_1_char.py", line 4
"""
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb0 in position 145: invalid start byte
or as cp1252
iconv --from-code=utf-8 --to-code=cp1252 char.py > cp1252_char.py
python3.7 cp1252_char.py
File "cp1252_char.py", line 4
"""
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb0 in position 145: invalid start byte
but not if the file is encoded as utf-8
iconv --from-code=latin1 --to-code=utf-8 latin_1_char.py > utf8_char.py
python3.7 utf8_char.py
Hello!
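The same check and conversion can be sketched in Python for cases where iconv is not available. The function names and the demo file are hypothetical, and "latin-1" is an assumed source encoding: the file's real 8-bit encoding has to be known or guessed, because latin-1 will decode any byte sequence without complaint.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def reencode_to_utf8(src, dst, source_encoding="latin-1"):
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


# Hypothetical demo: a tiny script saved as latin-1, with a degree sign (byte 0xb0).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "latin_1_char.py")
    dst = os.path.join(tmp, "utf8_char.py")
    with open(src, "wb") as out:
        out.write(b'print("25\xb0C")\n')
    before = is_valid_utf8(src)
    reencode_to_utf8(src, dst)
    after = is_valid_utf8(dst)

print(before, after)
```

An alternative is to leave the file as it is and declare its real encoding in the coding comment at the top, for example "# coding: latin-1", so that the interpreter decodes it correctly.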

change directory in python

from PIL import Image
import os
for f in os.listdir('C:\Users\diodi\Pictures'):
    if f.endswith('.jpg'):
        print(f)
I get the error
for f in os.listdir('C:\Users\diodi\Pictures'):
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
If someone can edit the error message, please do.
I want to print the names of the pictures (jpg) I have in
('C:\Users\diodi\Pictures')
I am using Python 3.7. I know I haven't used the Pillow library yet.
The backslashes are being parsed as escape characters. Use the r prefix to denote a raw string:
for f in os.listdir(r"C:\Users\diodi\Pictures"):
Or escape them with more backslashes:
for f in os.listdir('C:\\Users\\diodi\\Pictures'):
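For completeness, a small sketch of the listing itself using pathlib from the standard library. The function name list_jpgs and the temporary demo folder are made up for illustration, and matching the extension case-insensitively is a choice made here, whereas the question's f.endswith('.jpg') is case-sensitive.

```python
import tempfile
from pathlib import Path


def list_jpgs(folder):
    """Return the sorted names of the .jpg files directly inside folder."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".jpg"
    )


# Hypothetical demo folder, so the example runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.JPG", "notes.txt"):
        Path(tmp, name).touch()
    demo_names = list_jpgs(tmp)

print(demo_names)
```

On the asker's machine the call would be list_jpgs(r"C:\Users\diodi\Pictures"), with the raw-string prefix for the reason given above.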

unicode error when reading fastq files - python 3.4.2

I am trying to read fastq files but I keep getting the following error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 18-19: truncated \UXXXXXXXX escape
I used the following code:
file = open(r'C:\Users\jim\Documents\samples\3009_TGACCA_L005_R1_trimmed.fq\3009_TGACCA_L005_R1_trimmed.fq', 'r', newline='')
# enumerate() supplies the line index i that the loop unpacks
for i, line in enumerate(file):
    if i < 5:
        print(line)
file.close()
Could I please get some advice on how I might be able to resolve this issue?
Thanks
Try doubling every backslash (\ => \\) and dropping the r prefix.
