unicode error when reading fastq files - python 3.4.2

I am trying to read fastq files but I keep getting the following error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 18 -19: truncated \UXXXXXXXX escape
I used the following code:
file = open(r'C:\Users\jim\Documents\samples\3009_TGACCA_L005_R1_trimmed.fq\3009_TGACCA_L005_R1_trimmed.fq', 'r', newline='')
for i, line in enumerate(file):
    if i < 5:
        print(line)
file.close()
Could I please get some advice on how I might be able to resolve this issue?
Thanks

Try doubling every backslash (\ becomes \\) and dropping the r prefix.
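Once the path literal parses, reading the first few lines can be sketched as below. This is a minimal sketch, not the asker's code: the helper name head and the default UTF-8 encoding are assumptions made for illustration.

```python
from itertools import islice


def head(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file, trailing newlines removed.

    `head` is a hypothetical helper; the encoding default is an assumption.
    """
    # The with-block closes the file even if reading fails part-way.
    with open(path, "r", encoding=encoding) as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\r\n") for line in islice(handle, n)]
```

Called as head(path), with the path written as a raw string or with doubled backslashes, it returns a list of at most five lines.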

Related

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 37988-37989: malformed \N character escape

I am having trouble reading a csv file into a DataFrame. Below is the code snippet.
patron_df = pd.read_csv('Patron_Checkouts.csv', encoding = 'unicode_escape')
I keep getting the error:
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 37988-37989: malformed \N character escape
I have tried many solutions including adding an 'r' to the right of the parentheses to get the raw string. I have also tried renaming and moving the file.

Converting the CSV file to an .xlsx file and then changing:
patron_df = pd.read_csv('Patron_Checkouts.csv', encoding = 'unicode_escape')
to:
patron_df = pd.read_excel('Patron_Checkouts.xlsx')
worked.
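One plausible reading of this error, assuming the CSV data contains a backslash followed by N: the 'unicode_escape' codec interprets backslash sequences inside the file's contents, so such data cannot be decoded with it. The sketch below reproduces that with the standard library only; the sample bytes are invented for illustration and are not taken from Patron_Checkouts.csv.

```python
# Invented sample data: a CSV row whose field contains a backslash followed by N.
sample = b"path,count\nC:\\New folder\\file.txt,3\n"

try:
    # unicode_escape treats "\N" as the start of a named-character escape.
    sample.decode("unicode_escape")
except UnicodeDecodeError as exc:
    error_message = str(exc)
else:
    error_message = None

# A regular text encoding leaves the backslashes alone.
decoded = sample.decode("utf-8")

print(error_message)
print(decoded)
```

If this is the cause, passing the file's real text encoding to pd.read_csv (for example 'utf-8' or 'cp1252', depending on how the file was written) would be the thing to try instead of 'unicode_escape'.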

Python Reading File and Identifying Source of UnicodeDecodeError

I am trying to read a text file using the following statement:
with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())
The problem is that I get the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>
My question is: how can I identify exactly where in the file the error is encountered, since the position Python gives is tied to the location in the record being read at the time and not the absolute position in the file? Is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.
Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?
(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)
Thanks.
Update: the post at UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function has nothing to do with the question I am asking - which is how can I get Python to tell me what record of the input file it was reading when it encountered the unicode error.

I think the only way is to track the line number separately and output it yourself.
with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        print('Line ', num, e)
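A variation on the same idea, sketched under the assumption that the file uses an ASCII-compatible encoding (so splitting on newline bytes is safe): read in binary mode and decode each line yourself, which gives both the record number and the absolute byte offset. The function name find_undecodable_lines and the cp1252 default are illustrative choices, not part of the question.

```python
def find_undecodable_lines(path, encoding="cp1252"):
    """Yield (line_number, absolute_byte_offset, error) for lines that fail to decode.

    Hypothetical helper; assumes an ASCII-compatible encoding so that
    iterating a binary file splits records on b"\n" correctly.
    """
    offset = 0  # byte offset of the start of the current line within the file
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the position of the bad byte within this line.
                yield line_number, offset + exc.start, exc
            offset += len(raw)
```

Because decoding happens one line at a time, the scan continues past the first bad byte and reports every offending record.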

You can use the read method of the file object to obtain the first 6880 characters, encode it, and the length of the resulting bytes object will be the index of the starting byte of the offending character:
with open(inputFile) as fp:
print(len(fp.read(6880).encode()))

I have faced this issue before, and the easiest fix is to open the file with the utf8 encoding:
with open(inputFile, encoding="utf8") as fp:

Pygame sound not importing correctly [duplicate]

This question already has answers here:
How should I write a Windows path in a Python string literal?
(5 answers)
I am using Python 3.1 on a Windows 7 machine. Russian is the default system language, and utf-8 is the default encoding.
Looking at the answer to a previous question, I have attempted to use the "codecs" module, with little luck. Here are a few examples:
>>> g = codecs.open("C:\Users\Eric\Desktop\beeline.txt", "r", encoding="utf-8")
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-4: truncated \UXXXXXXXX escape (<pyshell#39>, line 1)
>>> g = codecs.open("C:\Users\Eric\Desktop\Site.txt", "r", encoding="utf-8")
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-4: truncated \UXXXXXXXX escape (<pyshell#40>, line 1)
>>> g = codecs.open("C:\Python31\Notes.txt", "r", encoding="utf-8")
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 11-12: malformed \N character escape (<pyshell#41>, line 1)
>>> g = codecs.open("C:\Users\Eric\Desktop\Site.txt", "r", encoding="utf-8")
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-4: truncated \UXXXXXXXX escape (<pyshell#44>, line 1)
My last idea was, I thought it might have been the fact that Windows "translates" a few folders, such as the "users" folder, into Russian (though typing "users" is still the correct path), so I tried it in the Python31 folder. Still, no luck. Any ideas?

The problem is with the string
"C:\Users\Eric\Desktop\beeline.txt"
Here, \U in "C:\Users... starts an eight-character Unicode escape, such as \U00014321. In your code, the escape is followed by the character 's', which is invalid.
You either need to duplicate all backslashes:
"C:\\Users\\Eric\\Desktop\\beeline.txt"
Or prefix the string with r (to produce a raw string):
r"C:\Users\Eric\Desktop\beeline.txt"
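As a quick check, the two spellings above produce the same string. The sketch also shows why a path can be wrong without raising anything: \b is a valid escape (backspace), so only sequences such as \U fail loudly. The variable names are illustrative.

```python
escaped = "C:\\Users\\Eric\\Desktop\\beeline.txt"  # every backslash doubled
raw = r"C:\Users\Eric\Desktop\beeline.txt"          # raw string literal

# Both literals denote the same text.
same = escaped == raw

# Without doubling or the r prefix, "\b" silently becomes a backspace character.
silently_wrong = "Desktop\beeline.txt"
contains_backspace = "\x08" in silently_wrong

print(same, contains_backspace)
```

One limitation worth knowing: a raw string literal cannot end in a single backslash, so a trailing path separator still has to be added separately or written with doubled backslashes.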

This is a typical error on Windows because the default user directory is C:\Users\<your_user>. When you write this path as an ordinary string literal in Python, you get a Unicode error because \U starts a Unicode escape. If the eight characters after the \U are not hexadecimal digits, this produces an error.
To solve it, just double the backslashes: C:\\Users\\<your_user>...
This ensures that Python treats each doubled backslash as a single literal backslash.

Prefixing with 'r' works very well, but it needs to be in the correct syntax. For example:
passwordFile = open(r'''C:\Users\Bob\SecretPasswordFile.txt''')
No need for \\ here - maintains readability and works well.

With Python 3 I had this problem:
self.path = 'T:\PythonScripts\Projects\Utilities'
produced this error:
self.path = 'T:\PythonScripts\Projects\Utilities'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 25-26: truncated \UXXXXXXXX escape
the fix that worked is:
self.path = r'T:\PythonScripts\Projects\Utilities'
It seems the '\U' was producing an error and the 'r' preceding the string turns off the eight-character Unicode escape (for a raw string) which was failing. (This is a bit of an over-simplification, but it works if you don't care about unicode)
Hope this helps someone

Or you could replace '\' with '/' in the path.
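The forward-slash suggestion can be checked with pathlib, which treats both separators as equivalent for Windows paths. This is a small sketch using PureWindowsPath so that it behaves the same on any operating system; the example path is the one from the question.

```python
from pathlib import PureWindowsPath

# Forward slashes need no escaping in a normal string literal.
forward = PureWindowsPath("C:/Users/Eric/Desktop/beeline.txt")
backward = PureWindowsPath(r"C:\Users\Eric\Desktop\beeline.txt")

print(forward == backward)  # both spellings name the same path
print(str(forward))         # rendered with backslashes
print(forward.name)         # final path component
```

On Windows, open() and most libraries accept the forward-slash form directly, so no conversion is needed before passing it on.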

path = pd.read_csv('C:\Users\mravi\Desktop\filename')
The error is caused by the backslashes in the path.
Add 'r' before the path:
path = pd.read_csv(r'C:\Users\mravi\Desktop\filename')
This works fine.

I had this same error in Python 3.2.
I have a script for sending email that contains:
csv.reader(open('work_dir\uslugi1.csv', newline='', encoding='utf-8'))
When I remove the first character of the file name uslugi1.csv, it works fine.

Referring to the openpyxl documentation, you can make the following changes.
from openpyxl import Workbook
from openpyxl.drawing.image import Image
wb = Workbook()
ws = wb.active
ws['A1'] = 'Insert a xxx.PNG'
# Load an image (raw string, so the backslashes are kept literally)
img = Image(r'x:\xxx\xxx\xxx.png')
# Insert to worksheet and anchor next to cells
ws.add_image(img, 'A2')
wb.save(r'x:\xxx\xxx.xlsx')

I had the same error; uninstalling and reinstalling the numpy package fixed it.

I had this error.
I have a main python script which calls in functions from another, 2nd, python script.
At the end of the first script I had a comment block designated with ''' '''.
I was getting this error because of this commenting code block.
I repeated the error multiple times once I found it to ensure this was the error, & it was.
I am still unsure why.
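A likely explanation, offered as an assumption since the original script is not shown: a block wrapped in ''' ''' is not a comment but an ordinary string literal, so escape sequences inside it are still parsed. If the block contains something like a Windows path with \U, compilation fails, whereas the same text behind # is ignored. The sketch below demonstrates this with compile(); the source snippets and the helper name compile_error are invented for illustration.

```python
def compile_error(source):
    """Return the SyntaxError message for `source`, or None if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


# "Commented out" with triple quotes: still a string literal, so \U is parsed.
triple_quoted = r"""
x = 1
'''
Disabled code: open('C:\Users\Bob\notes.txt')
'''
"""

# Real comments are skipped by the tokenizer, so the backslashes are harmless.
hash_commented = r"""
x = 1
# Disabled code: open('C:\Users\Bob\notes.txt')
"""

print(compile_error(triple_quoted))
print(compile_error(hash_commented))
```

If this is what happened, switching the block to # comments, or prefixing the triple-quoted block with r, would avoid the error.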

Function throws a SyntaxError: (unicode error)

I am running the following code in python and it's giving me this error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
def filePro(filename):
    f = open(filename, 'r')
    wordcount = 0
    for lines in f:
        f1 = lines.split()
        wordcount = wordcount + len(f1)
    f.close()
    print('word count:', wordcount)
Please help me.

Unicode literals (String literals in Python 3.x) with \U or \u escape sequence should be one of following forms:
>>> u'\U00000061' # 8 hexadecimals
'a'
>>> u'\u0061' # 4 hexadecimals
'a'
If the escape sequence has too few hexadecimal digits, you get a SyntaxError.
>>> u'\u61'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
>>> u'\U000061'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-7: truncated \UXXXXXXXX escape
If you mean a literal \ followed by U, use a raw string:
>>> r'\u0061'
'\\u0061'
>>> print(r'\u0061')
\u0061
In the code you posted, there's no Unicode escape sequence. You should check the other parts of your code.
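To illustrate where the error would come from, here is a sketch that assumes the function is called with a Windows path literal somewhere else in the script (the call sites below are invented, since the question does not show them). Escape processing applies only to string literals in source code; a path built or received at runtime is never reinterpreted.

```python
def syntax_error_message(source):
    """Return the SyntaxError message raised when compiling `source`, else None."""
    try:
        compile(source, "<example>", "exec")  # compiles only, nothing is executed
    except SyntaxError as exc:
        return exc.msg
    return None


# Hypothetical call sites, held in raw strings so they reach compile() unchanged.
plain_literal_call = r"filePro('C:\Users\me\words.txt')"
raw_literal_call = r"filePro(r'C:\Users\me\words.txt')"

print(syntax_error_message(plain_literal_call))  # unicode error message
print(syntax_error_message(raw_literal_call))    # None

# A backslash followed by U that is assembled at runtime causes no error.
runtime_path = "C:" + chr(92) + "Users"
print(runtime_path)
```

So the place to look is the line that passes the file name, not the body of filePro.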

I'm not sure, since not much information is provided, but my guess is that Python is trying to open the file with the wrong encoding. You could open the file with the codecs library and pass the correct codec. If I don't know the encoding, or if the file comes from Windows, I usually use 'cp1252', as this can open most files.
import codecs
def filePro(filename):
    f = codecs.open(filename, 'r', 'cp1252')
    wordcount = 0
    for lines in f:
        f1 = lines.split()
        wordcount = wordcount + len(f1)
    f.close()
    print('word count:', wordcount)
Another possibility is that you have a file name that Python interprets as an escape sequence, for example 'c:\Users\something'; here the \U will be interpreted as the start of a Unicode escape.
