Here is the code:
def readFasta(filename):
""" Reads a sequence in Fasta format """
fp = open(filename, 'rb')
header = ""
seq = ""
while True:
line = fp.readline()
if (line == ""):
break
if (line.startswith('>')):
header = line[1:].strip()
else:
seq = fp.read().replace('\n','')
seq = seq.replace('\r','') # for windows
break
fp.close()
return (header, seq)
FASTAsequence = readFasta("MusChr01.fa")
The error I'm getting is:
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
But the first argument to startswith is supposed to be a string according to the docs... so what is going on?
I'm assuming I'm using at least Python 3 since I'm using the latest version of LiClipse.
It's because you're opening the file in bytes mode, and so you're calling bytes.startswith() and not str.startswith().
You need to do line.startswith(b'>'), which will make '>' a bytes literal.
If remaining to open a file in binary, replacing 'STR' to bytes('STR'.encode('utf-8')) works for me.
Without having your file to test on try encoding to utf-8 on the 'open'
fp = open(filename, 'r', encoding='utf-8')
Related
I read an file into a string in Python, and it shows up as encoded (not sure the encoding).
query = ""
with open(file_path) as f:
for line in f.readlines():
print(line)
query += line
query
The lines all print out in English as expected
select * from table
but the query at the end shows up like
ÿþd\x00r\x00o\x00p\x00 \x00t\x00a\x00b\x00l\x00e\x00
What's going on?
Agreed with Carlos, the encoding seems to be UTF-16LE. There seems to be BOM present, thus encoding="utf-16" would be able to autodetect if it's little- or big-endian.
Idiomatic Python would be:
with open(file_path, encoding="...") as f:
for line in f:
# do something with this line
In your case, you append each line to query, thus entire code can be reduced to:
query = open(file_path, encoding="...").read()
It seems like UTF-16 data.
Can you try decoding it with utf-16?
with open(file_path) as f:
query=f.decode('utf-16')
print(query)
with open(filePath) as f:
fileContents = f.read()
if isinstance(fileContents, str):
fileContents = fileContents.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.
elif isinstance(fileContents, unicode):
fileContents = fileContents.encode('ascii', 'ignore')
I have this program
import csv
with open("C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt","r") as scoreFile:
# write = w, read = r, append = a
scoreFileReader = csv.reader(scoreFile)
scoreList = []
for row in scoreFileReader:
if len (row) != 0:
scoreList = scoreList + [row]
scoreFile.close()
print(scoreList)
Why do I get this Error ?
with
open("C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt","r")
as scoreFile:
^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape.
You need to use raw strings with Windows-style filenames:
with open(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt", 'r') as scoreFile:
^^
Otherwise, Python's string engine thinks that \U is the start of a Unicode escape sequence - which of course it isn't in this case.
Also, be careful also your scoreFile.close() is useless:
The with statement replace a try and catch. It also
automatically close the file after the block. That mean you can delete the scoreFile.close() line.
Also, you can change the line if len(row) != 0
According to PEP8:
For sequences, (strings, lists, tuples), use the fact that empty
sequences are false.
Yes: if not seq:
if seq:
No: if len(seq):
if not len(seq):
One last thing, your for loop isn't good either, to read csv you better start copying an example from the doc first!
with open('eggs.csv', 'rb') as csvfile:
spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
for row in spamreader:
print ', '.join(row)
you may try this:
import csv
import io
def guess_encoding(csv_file):
"""guess the encoding of the given file"""
with io.open(csv_file, "rb") as f:
data = f.read(5)
if data.startswith(b"\xEF\xBB\xBF"): # UTF-8 with a "BOM"
return ["utf-8-sig"]
elif data.startswith(b"\xFF\xFE") or data.startswith(b"\xFE\xFF"):
return ["utf-16"]
else: # in Windows, guessing utf-8 doesn't work, so we have to try
try:
with io.open(csv_file, encoding="utf-8") as f:
preview = f.read(222222)
return ["utf-8"]
except:
return [locale.getdefaultlocale()[1], "utf-8"]
encodings = guess_encoding(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt")[0])
# then your code with
with open(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt","r", encoding=encodings[0]) as scoreFile:
I have the following code to read the lines in a file and split them with a delimiter specified. After split I have to write some specific fields into another file.
Sample Data:
Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
Code:
import os
import codecs
sfilename = "WEEK_RPT_1108" + os.extsep + "dat"
sfilepath = "Club" + "/" + sfilename
sbackupname = "Club" + "/" + sfilename + os.extsep + "bak"
try:
os.unlink(sbackupname)
except OSError:
pass
os.rename(sfilepath, sbackupname)
try:
inputfile = codecs.open(sbackupname, "r", "utf-16-le")
outputfile = codecs.open(sfilepath, "w", "utf-16-le")
sdelimdatfile = ";"
for line in inputfile:
record = line.split(';')
outputfile.write(record[1])
except IOError, err:
pass
I can see that the 0th array position contains the whole line instead of the first record:
record[0] = Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
while on printing record[1], it says array index out of range.
Need help as new to python.
Thanks!
After you comment saying that print line outputs u'\u6557\u6b65\u3934\u415f\u365f\u3030\u3230\u3030\u3b30\u614d\u3b72\u5946\u3431\u413b\u7463\u6175\u3b6c\u6f57\u6b72\u6e69\u3b67\u5f45\u3031\u3030\u503b\u5f43\u3030\u3030\u3030\u343b\u3832\u2e37\u3336', I can explain what happens and how to fix it.
What happens:
you have a normal 8bits characters file, and the line you show is even in plain ASCII, but you try to decode it as if it were in UTF-16 little endian. So you wrongly combine every two bytes in a single 16 bits unicode character! If your system had been able to correctly display them and if you had directly print line instead of repr(line), you would have got 敗步㤴䅟㙟〰㈰〰㬰慍㭲奆㐱䄻瑣慵㭬潗歲湩㭧彅〱〰倻彃〰〰〰㐻㠲⸷㌶. Of course, none of those unicode characters is the semicolon (; or \x3b of \u003b) so the line cannot be splitted on it.
But as you encode it back before writing record[0] you find the whole line in the new file, what let you believe erroneously that the problem is in the split function.
How to fix:
Just open the file normally, or use the correct encoding if it contains non ascii characters. But as you are using a version 2 of python, I would just do:
try:
inputfile = open(sbackupname, "r")
outputfile = open(sfilepath, "w")
sdelimdatfile = ";"
for line in inputfile:
record = line.split(sdelimdatfile)
outputfile.write(record[1])
except IOError, err:
pass
If you really need to use the codecs module, for example if the file contains UTF8 or latin1 characters, you can replace the open part with:
encoding = "utf8" # or "latin1" or whatever the actual encoding is...
inputfile = codecs.open(sbackupname, "r", encoding)
outputfile = codecs.open(sfilepath, "w", encoding)
Then there is no index [1]:
Either skip the line with "continue" if len(record) < 1 or just not write to the file (like here)
for line in inputfile:
record = line.split(';')
if len(record) >= 1:
outputfile.write(record[1])
Hi have this two funtions in Py2 works fine but it doesn´t works on Py3
def encoding(text, codes):
binary = ''
f = open('bytes.bin', 'wb')
for c in text:
binary += codes[c]
f.write('%s' % binary)
print('Text in binary:', binary)
f.close()
return len(binary)
def decoding(codes, large):
f = file('bytes.bin', 'rb')
bits = f.read(large)
tmp = ''
decode_text = ''
for bit in bits:
tmp += bit
if tmp in fordecodes:
decode_text += fordecodes[tmp]
tmp = ''
f.close()
return decode_text
The console ouputs this:
Traceback (most recent call last):
File "Practica2.py", line 83, in <module>
large = encoding(text, codes)
File "Practica2.py", line 56, in encoding
f.write('%s' % binary)
TypeError: 'str' does not support the buffer interface
The fix was simple for me
Use
f = open('bytes.bin', 'w')
instead of
f = open('bytes.bin', 'wb')
In python 3 'w' is what you need, not 'wb'.
In Python 2, bare literal strings (e.g. 'string') are bytes, whereas in Python 3 they are unicode. This means if you want literal strings to be treated as bytes in Python 3, you always have to explicitly mark them as such.
So, for instance, the first few lines of the encoding function should look like this:
binary = b''
f = open('bytes.bin', 'wb')
for c in text:
binary += codes[c]
f.write(b'%s' % binary)
and there are a few lines in the other function which need similar treatment.
See Porting to Python 3, and the section Bytes, Strings and Unicode for more details.
I'm trying to write a series of files for testing that I am building from scratch. The output of the data payload builder is of type string, and I'm struggling to get the string written directly to the file.
The payload builder only uses hex values, and simply adds a byte for each iteration.
The 'write' functions I have tried all either fall over the writing of strings, or write the ASCII code for the string, rather than the string its self...
I want to end up with a series of files - with the same filename as the data payload (e.g. file ff.txt contains the byte 0xff
def doMakeData(counter):
dataPayload = "%X" %counter
if len(dataPayload)%2==1:
dataPayload = str('0') + str(dataPayload)
fileName = path+str(dataPayload)+".txt"
return dataPayload, fileName
def doFilenameMaker(counter):
counter += 1
return counter
def saveFile(dataPayload, fileName):
# with open(fileName, "w") as text_file:
# text_file.write("%s"%dataPayload) #this just writes the ASCII for the string
f = file(fileName, 'wb')
dataPayload.write(f) #this also writes the ASCII for the string
f.close()
return
if __name__ == "__main__":
path = "C:\Users\me\Desktop\output\\"
counter = 0
iterator = 100
while counter < iterator:
counter = doFilenameMaker(counter)
dataPayload, fileName = doMakeData(counter)
print type(dataPayload)
saveFile(dataPayload, fileName)
To write just a byte, use chr(n) to get a byte containing integer n.
Your code can be simplified to:
import os
path = r'C:\Users\me\Desktop\output'
for counter in xrange(100):
with open(os.path.join(path,'{:02x}.txt'.format(counter)),'wb') as f:
f.write(chr(counter))
Note use of raw string for the path. If you had a '\r' or '\n' in the string they would be treated as a carriage return or linefeed without using a raw string.
f.write is the method to write to a file. chr(counter) generates the byte. Make sure to write in binary mode 'wb' as well.
dataPayload.write(f) # this fails "AttributeError: 'str' object has no attribute 'write'
Of course it does. You don't write to strings; you write to files:
f.write(dataPayload)
That is to say, write() is a method of file objects, not a method of string objects.
You got this right in the commented-out code just above it; not sure why you switched it around here...