Create an ISO9660 compliant filename using pycdlib - python

I'm trying to implement the pycdlib example-creating-new-basic-iso example shown below. About halfway down there is a line that reads iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1'). This writes a new file to the ISO that will be named "FOO" in the root directory of the ISO. This example works for me.
Building on the example, I'm trying to change the filename inside the ISO from "/FOO" to "/FOO.txt", but I keep getting the error PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _. How do I write an ISO9660-compliant filename containing ".txt" with pycdlib?
Example code:
try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO
import pycdlib
iso = pycdlib.PyCdlib()
iso.new()
foostr = b'foo\n'
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1')
iso.add_directory('/DIR1')
iso.write('new.iso')
iso.close()

The key here is in the error: PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _, but there is a more complete [explanation](https://wiki.osdev.org/ISO_9660#Filenames):
d-characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
Filenames must use d-character encoding (strD), plus dot and semicolon which have to occur exactly once per filename. Filenames are composed of a File Name, a dot, a File Name Extension, a semicolon; and a version number in decimal digits. The latter two are usually not displayed to the user.
There are three Levels of Interchange defined. Level 1 allows filenames with a File Name length of 8 and an extension length of 3 (like MS-DOS). Levels 2 and 3 allow File Name and File Name Extension to have a combined length of up to 30 characters.
The ECMA-119 Directory Record format can hold composed names of up to 222 characters. This would violate the specs but must nevertheless be handled by a reader of the filesystem.
You can't name the file FOO.txt because lowercase letters aren't included in the d-characters. You need to capitalize the extension in order to be ISO9660-compliant.
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.TXT;1')
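If the names come from ordinary files on disk, the conversion can be automated. The following is only a minimal sketch: to_iso9660_path is a hypothetical helper (it is not part of pycdlib) that uppercases the name, replaces anything outside the d-characters with an underscore, and appends the version number. It does not enforce the length limits of the interchange levels.

```python
import re

# Anything that is not a d-character (A-Z, 0-9, underscore).
_NOT_D_CHAR = re.compile(r'[^A-Z0-9_]')


def to_iso9660_path(name, version=1):
    """Hypothetical helper: map e.g. 'foo.txt' to '/FOO.TXT;1'.

    Only the last dot is kept as the name/extension separator,
    because ISO9660 allows exactly one dot per filename.
    """
    base, dot, ext = name.upper().rpartition('.')
    if not dot:
        # No dot in the name: everything is the base, the extension is empty.
        base, ext = ext, ''
    base = _NOT_D_CHAR.sub('_', base)
    ext = _NOT_D_CHAR.sub('_', ext)
    return '/{}.{};{}'.format(base, ext, version)


iso_path = to_iso9660_path('foo.txt')
```

The resulting string could then be passed as the third argument of iso.add_fp in the example above.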

Related

BioPython AlignIO sequences must be the same length [multiple files]

I get an error when I try to align sequences from multiple files.
Here is my script:
from Bio import AlignIO
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalOmegaCommandline
# These three imports were missing; Bio.Alphabet only exists in Biopython < 1.78
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def divergence(fic1dna, fic2dna, fic1prot, fic2prot):
    seq1dna = list(SeqIO.parse(fic1dna, "fasta", alphabet=IUPAC.IUPACUnambiguousDNA()))
    seq2dna = list(SeqIO.parse(fic2dna, "fasta", alphabet=IUPAC.IUPACUnambiguousDNA()))
    seq1prot = list(SeqIO.parse(fic1prot, "fasta", alphabet=IUPAC.protein))
    seq2prot = list(SeqIO.parse(fic2prot, "fasta", alphabet=IUPAC.protein))
    u = 0
    while u < len(seq1dna):  # align each pair of elements from the two pairs of files
        nuc1 = str(seq1dna[u].seq)
        nuc2 = str(seq2dna[u].seq)
        prot1 = str(seq1prot[u].seq)
        prot2 = str(seq2prot[u].seq)
        prot1 = SeqRecord(Seq(prot1, alphabet=IUPAC.protein), id='pro1')
        prot2 = SeqRecord(Seq(prot2, alphabet=IUPAC.protein), id='pro2')
        aln = MultipleSeqAlignment([prot1, prot2])
        print(aln)
        u += 1

print(divergence("concatenate_0035_fna_renamed.fst", "concatenate_0042_fna_renamed.fst", "concatenate_0035_faa_renamed.fst", "concatenate_0042_faa_renamed.fst"))
So, as you can see, I have 4 files corresponding to 244 sequences from 2 species. I need to calculate dN/dS for each of them, so I need to align each pair of sequences as a codon alignment.
But when I try to align my 244 protein sequences, the error ValueError("Sequences must all be the same length") is raised.
I do not know why the script does not accept sequences of different lengths, since all other programs do.
A short input would be:
One file with the AA sequences from species 1:
>EOG090X005Q
CEHNTAGRDCEKCLDFYNDAPWGRASPTNVHECKACNCNGFSNKCYFDKDLYERTGHGGHCIDCEENRDGANCERCKENFYQGMEDICLPCNCNPTGSRSLQCNAEGKCQCKPGVTGDKCDVCAPNYFEFTMHGCKPCDCNVSGSYGNTPQCDPQTGVCLCKQNVEGRRCRECKPGFFNLDVENEFGCTPCFCFGHSSQCSSAPKYQAHEISAHYIRDAEKWGAEDDQRKPVQLQFNANTQNIAVASKGSEILYFLASGQFLGDQRPSYNHDLKFTLRLGESGGYPSSQDIILEGARSSVSMNIYAQNNPEPSDVAQEYSFRLHEDPRYGWTPTLSNFEFMSILQNLTAIKIRGTYNKGGVGYLINFKLETAKIGREKGSAPANWVEKCSCPKAYVGDYCEECAPGYKHEPANGGPYSTCIPCDCNGHAHICDTATGFCICKHNTTGSNCELCAKGFYGNAIAGTADDCKPCPCPKDSGCIQLMDQSIVCTDCPVGYAGPRCEVCADAHFGDPTGQFGAPQECEECQCNGNVDPNAVGNCNRTTGECLKCIYNTAGEHCDKCLSGYFGDALDQKKKGDCKPCQCLEAGTVESPEGARKAPLCDGLTGFCSCRPHVIGRNCDKCEVDLNCIAVLKT
>EOG090X00BV
MNAHFPQNEIARSEAYNIMSVRKQYLVPKDGTPLSGLIQDHVISGVKMSIRGAFFTKADYQQLVFQALSNHKGEIKLLPPTILKPIMLWSGKQILSTIIINSIPKGKPYLSLTGKAKISSKAWQKEPARTWNAGGTPFTNPNSMSEAEVIIRKGELLCGVLDKTHYGATPYGLVHCMYELYGGDSSSALLSSFSKVFTFYLQWIGFTLGVKDILVVEEADKQRDNFINLVRKVGKVAAAKATELPVDVDELKLKETISEMLIKDPKFRANLDRQYKSLLDSYTNNINTVCLSEGLLEKFPYNNLQLMVQSGAKGSTVNTMQISCLLGQIELEGKRPPLMISGRSLPSFPPYDISPRAGGFIDGRFMTGIQPQEFFFHCMAGREGLIDTAVKTSRSGYLQRCLIKHLEGLSVAYDHTVRDSDSSVIQFAYGEDGLDVIKCQYFNKDQFEFLDVNSNAVISKSAIKKLKEDDKSKALAKSQKSLKKWKKKNGNPFEKVRYSPFTEFSAIAKNDIVLDDKPTDQTRDPNYWELEKMWRNLDADEKKQYARKRCPDPIPSKYSPEYKFGVINEQLNELTQNYLKNRKEHMYSDYTDKDKFTEIINAKYLASMAAPGEPVGLLAAQSIGEPSTQMTLNTFHFAGRGDMNVTLGIPRLREILMTASAKLKTPSMDIPFRSDLPDLNKKAERLRQKMNRVTVSDVLEKIDVHCEIVTNPNRQLKTVMRFSFLPHSQYKVQYTVKPAQIIKHMQNKFFSEMFSIIRKQAKTTCGVMWSTEKEKKRRAASDEDDEDGEGASPDVAEKAVNMDEDSSDEEGPNDDDDNTDVS
and the other for species 2:
>EOG090X005Q
MGGKIAAILLFAFFTSGSRSEPDFVDGQFNKINKNRVEVKCYDDFGAPQRCIPPFENAAFGVLMEATNTCGQDGRPTEFCRQTGVQRKPCEFCHPGDHPASFLTDRDNNDNATWWQSETMHEGIEYPNKVVLTLNLGKTYDITYVRVLFESPRPESWGIFRRRTEDSPWEPYQFYSATCRDTYGLPDRKDTVRGEDTRVLCTSEYSDISPLRRGTVAFSTLEGRPSAFQFDTNPALQSWVQATDLRLSLDRPNTFGDELFGDGQVLKSYYYAIADVAVGARCACNGHAGECINSPHTNGTTRRVCRCEHNTAGPDCNECLPFYNDAPWGRATTTDAHECKPCNCNGYSDRCYFDKDLYERSGHGGHCTDCRANRAGPNCERCRENFYQRLEDSYCVACNCNEIGSRSLQCNSEGKCQCKPGITGDKCDRCAANFFNFDSLGCTSCECSPKGSLDNEPNCDPVSGACVCKENVEGKRCRECRPGFFNLDLDNEFGCTPCFCYGHSSVCNLANGYSKLTIESMFGRGNEKWTASVAGNPIPLHYDAVTQTISVNAPDRDNVYFVAPERFLGDQRASYNQDLTFTLRIAENEPAPTARDVILEGGNGEQLTQPIFGQTNQLPNASPQVYKFRLNEHADYGWEPRVTSRAFMSVLSNLTAIKIRGTYTHQGRGFLDDVSLETAQRGAAGEPADWIEHCQCPHGYVGQFCESCAPGFHHDPPNGGPFSLCVPCNCNGHADICEAETGQCICHHNTAGSNCDLCSRGFYGYPLKGTPHDCKPCPCPDNGPCILLGNNPDPICSECPSGRTGARCETCSDGYFGNPDQGQACRLCDCNNNIDLNAVRNCNHETGECLKCVNNTAGFHCEDCLSGYWGDALSERKEDSCKLCQCYPPGTIELDDGSVAPCNQLTGHCACKPHVIGRNCDKCEDGYYQILSGDGCTACNCDPEGSYNRTCDATTGQCECRPGITGKRCDTCLPYQFGFGRDGCKHCDCDTIGSQELQCDASGQCPCLTNVEGRRCDRCKENKYNRQYGCIDCPPCYNLIQDSVNQHRRRLNELESTLRKINNSPTVMKDSDFEKELKNVENRVKSLLQVAKQGSGNENKTLVEQLDELRDQLNQIEKISQSVDATAEDARRTTNEGLTSIEEAERVLDQIYEQLTEAEDYLATDGARALAAAKKRADQVGQQNQQMTIIAQEARVLADLNTNEAKKIHVLAEQARNTSLEAYNLAKKAIAKYSNISDEIRGLENKLELLEDRFNEVKNLTAAAVAKSAAVDKEALQLLILDLRVPAVDTNELRILLETVSVDGSEIKEQAQLLLGQNEAWLNELANKARKSEELLERAQDQQAATADLLSEVDGANEKAKDALKRGNQTLVEAQETLKKLGEFDAEVQKERIKAQEALTVLEEIKDMVNEAIAKANETESVLKDAESNAIAAKDIAIQAQVSNNADEASANANLIRQEANKTKLDAVRLGNEADKLHLRVEITNSIAKKHEARVDKDVNATNEVNHQVGQARNSLNLAGQQVDKALAEVDEIIKELDVLPEIDDADLDRLEERLLAAEKEIEEANLEKRIRELTEAKNLQTQWVKNYEDEVSRLRLEVENIDDIRKALPSICYKRLRLEP
>EOG090X00BV
MFSIFTASDVRNLSVLKISTPLSFNILGHPLKGGLYDPALGPLNDRSDPCGTCGEGTIQCMGHFGHIELPVPVVNPLFHKVLTSLLKLSCLKCYTLQIPSYLKLLLNGKLRLMEEGFSNDIPGLEQEVGSAVAGMNRIAEGELEFISDIIEAYIEMTCNQRHHVQSGKSKESTSTRTLNMEWHHYIESVVKTCKASKLCINCRNPIPKMTILKNKILTNHVVNNEDTMMEDRVIHKLETSFMTPDQSKKHLRGLWQKEADILRIIIPCLGSVDLEFPTDVFFFEIIPVLPPITRPVNMLDNQLVEHPQSQVYKSIIQDCLVLRNIIQTIQDGDTTQLPEEGRAVFDEIRGDNAAEKLHHAWTTLQSNVDHLMDREMSKTTESANCHGLKQVIEKKEGIIRMHMMGKRVNYAARSVITPDPNLNIDEIGVPEAFALKLTYPVPVTPWNVTELRKLIINGPEIHPGAVMIEGEDGFVKLLRGDDKTQLEAIAKRLLTSSRKPFSGIKIVHRHLQNGDMLLLNRQPTLHKPSIMAHKARILKGEKTLRLHYANCKAYNADFDGDEMNAHFPQNELARSEGYFIANVSNQYLVPKDGTPLGGLIQDHVISGVRLTLRGNFFNRQDYMQLVYSAIADTTGDLILLPPTILKPVRLWSGKQIISTVIINLTPRGRAPINLKASAKISVKDWQVKKARKWKCGQEFTDQRTMSEAEVVIRGGELLSGVLDKTHYGATPYGLIHCLFELYGGTCSSKVLSAFGKLFQTYLQISGFTLGVEDILVVRKSDQKRREIIEACRQIGDQIQTATVELPPGTSEEQVKSKMEESYAKDPKFRAIVDRKYKSALDVFTNNINKTCLPAGLLKKFPHNNLQLMVQSGAKGSTVNTMQISCLLGQIELEGKRPPLMINGKSLPSFPAYDSSPRSGGFIDGRFMTGIQPQEFFFHCMAGREGLIDTAVKTSRSGYLQRCLIKHLEGLTVNYDSTVRDSDGSLIQMSYGEDGLDIPNSRFLRKEELDFLVENRKAIVDPALVEHLKDETTEKIRKINKKIRKWRTKHGNGSTKWRNSEFAKFSEINRNSGSSKNRQINSNCGRTKAALSLMKKWIRADEEVKKKLKDECVRCPDPVTSIFRQDLQFGVLTEKMEALMEEYLDEKSRRFTTSIGKEEVRDLLCTKIMKSLCPPGEPVGLLAAQSIGEPSTQMTLNTFHFAGRGEMNVTLGIPRLREILMMASKNIKTPSMEIPFRTDLPNVENQATKLQLKLTKCYLSNILKNIKLDRKLEENPNRQLTFTLTVNCLPHKFYKNEYCVKPHNVLNEIERNFFKLFFRAIKKIGKATGTLLHIEEEKSSSREDDAMLDTGEPDETEAKPNRSDLGELHESSDEDEAAEDADATASRSIARHRENQEYEDPEEEEIEDAAPREPEDEENPQNPTNLPPEDEDDLDQPMCVADELITEQRKKDVVNMHPYALDYDYDSEKFLWCKLTFWLPLRMCRLDLPTILRTVAEKVVLWETPAIKRAFTFQNSEGETILKTDGLNIVEMFKYAQILDLHKLYTNDIYGVSRTYGIEAANRVILKEVKDVFKMYGITVDSRHLSLIADYMTFDGTFQPLSRKGMEDSASPLQQMSFEASLNFLKNATLQGKHDDLMSPSSRLMVGQPCKTGTGAFNVLFKMNNTAVSM
Could someone help me?
Thank you.

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file = open("outputfile.txt", "w")
    for line in a:
        pass  # loop body not written yet
Without regex:
for line in file:  # here file is the opened input file
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print(result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
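Putting that pattern to work, a minimal sketch could look like the following. The function name extract_joints and the inline sample list are assumptions made for illustration; in the real script the lines would come from RawDataFile_445.txt and the rows would be written to the output file.

```python
import re

# Frame number, then "J=" followed by two comma-separated joint numbers.
# Everything after the second number (SEC, NSEG, ANG) is simply ignored.
FRAME_LINE = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")


def extract_joints(lines):
    """Return one 'frame joint1 joint2' string per matching line."""
    rows = []
    for line in lines:
        match = FRAME_LINE.match(line)
        if match:  # the "FRAME" header and blank lines do not match
            rows.append(" ".join(match.groups()))
    return rows


sample = [
    "FRAME",
    "1 J=1,8 SEC=CL1 NSEG=2 ANG=0",
    "2 J=8,15 SEC=CL2 NSEG=2 ANG=0",
    "3 J=15,22 SEC=CL3 NSEG=2 ANG=0",
]
rows = extract_joints(sample)
```

Because the pattern stops after the second joint number, it does not need to describe the rest of the line at all, which keeps the expression short and easy to verify.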

Writing to UTF-16-LE text file with BOM

I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (U+00FE, surrounding the text values) and the pilcrow character (U+00B6, after each column) to a UTF-16LE text file with a BOM (FF FE).
The issue: The written-to text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit. Only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys
import codecs
first = u''
textdel = u'\u00FE'.encode('utf_16_le') #thorn
fielddel = u'\u00B6'.encode('utf_16_le') #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1) #pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
    mytext2 = u''
    i = 0
    i = i + 1
    mytext2 = mytext2 + item + textdel
    if i < (num - 1):
        mytext2 = mytext2 + fielddel
    f.write(mytext2 + u'\n')
f.close()
You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str
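For comparison, here is a minimal Python 3 sketch of the same idea, where the delimiters stay as ordinary strings and only the file object does the encoding. The function name write_delimited, the temporary file location, and the exact row layout (thorn on both sides of each value, pilcrow between columns) are assumptions based on the requirement as described, not the asker's actual format specification.

```python
import os
import tempfile

TEXT_DELIM = '\u00FE'   # thorn, kept as a plain (unencoded) string
FIELD_DELIM = '\u00B6'  # pilcrow, kept as a plain (unencoded) string


def write_delimited(path, rows):
    """Write rows to a UTF-16-LE file that starts with a BOM (FF FE)."""
    # newline='' stops Python from translating '\n' on Windows.
    with open(path, 'w', encoding='utf-16-le', newline='') as f:
        f.write('\ufeff')  # the 'utf-16-le' codec does not add a BOM itself
        for row in rows:
            fields = [TEXT_DELIM + value + TEXT_DELIM for value in row]
            f.write(FIELD_DELIM.join(fields) + '\n')


# Demo: write one row to a temporary file and read the raw bytes back.
demo_path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
write_delimited(demo_path, [['mom', 'dad', 'son']])
with open(demo_path, 'rb') as f:
    raw = f.read()
```

Each character is encoded exactly once, at write time, so no stray bytes appear between the columns.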

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!
You could do it like this:
Read the whole file.
Split the input on newline characters that are not preceded by a comma.
Iterate over the split elements and split each one again on commas (plus an optional following newline) that are preceded and followed by double quotes.
Code:
import re
with open(file) as f:
    fil = f.read()
    m = re.split(r'(?<!,)\n', fil.strip())
    for i in m:
        print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
Your problem is terribly similar to one I have to deal with thrice a month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print, if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""
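To see how a column pattern of this kind copes with escaped quotes, it can be tried on a single complete entity. This is only a small sketch: the pattern below is a slight variation of the one above (the inner star is dropped so each step consumes exactly one character or one escape pair), and the names COLUMN and sample are made up for the demonstration.

```python
import re

# A column: an opening quote, then any number of either
#   - a character that is neither a quote nor a backslash, or
#   - a backslash followed by any character (e.g. an escaped quote),
# and finally the closing quote.
COLUMN = re.compile(r'"(?:[^"\\]|\\.)*"')

# Entity 3 from the question, already joined into one line.
sample = r'"column1c,","column2c","column3c","column\"this is, a test\"4c"'

found = COLUMN.findall(sample)
for column in found:
    print(column)
```

The commas inside the quotes and the escaped quotes in the fourth column stay part of their columns, because the pattern only ends a match at an unescaped quote.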

Reading path names from a file in Python under Windows

I have a Python script that reads a list of path names from a file and opens them using the gzip module. It works well under Linux. But when I ran it under Windows, I got an error when calling the gzip.open function. The error message is as follows:
File "C:\dev_tools\Python27\lib\gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "C:\dev_tools\Python27\lib\gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
The filename should be something like
'G:\ext_pt1\cfx33_50instr4_testset\cfx33_50instr4_0-99\cfx33_50instr4_cov\cfx33_50instr4_id0_cov\cfx33_50instr4_id0.detail.rpt.gz'
But when I printed the filename, it printed out something like
' ■G : \ e x t _ p t 1 \ c f x 3 3 _ 5 0 i n s t r 4 _ t e s t s e t \
c f x 3 3 _ 5 0 i n s t r 4 _ 0 - 9 9 \ c f x 3 3 _ 5 0 i n s t r 4 _
c o v \ c f x 3 3 _ 5 0 i n s t r 4 _ i d 0 _ c o v \ c f x 3 3 _ 5 0
i n s t r 4 _ i d 0 . d e t a i l . r p t . g z'
And when I printed repr(filename), it printed out something like
'\xff\xfeG\x00:\x00\\x00e\x00x\x00t\x00_\x00p\x00t\x001\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00t\x00e\x00s\x00t\x00s\x00e\x00t\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x
00r\x004\x00_\x000\x00-\x009\x009\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x00r\x004\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00.\x00d\x00e\x00t\x00a\x00i\x00l\x00.\x00r\x00p\x00t\x00.\x00g\x00z\x00'
I don't know why Python added those spaces (possibly the NULL bytes?) when it read the file. Does anyone have any clue?
Python has not added anything; it has merely read what is in the file. You have a little-endian UTF-16 string there, as you can plainly tell by the byte-order mark in the first two bytes. If you are not expecting this, you could convert it to ASCII (assuming it doesn't have any non-ASCII characters).
# convert mystring from little-endian UTF-16 with optional BOM to ASCII
mystring = unicode(mystring, encoding="utf-16le").encode("ascii", "ignore")
Or just convert it to proper Unicode and use it that way, if Windows will tolerate it:
mystring = unicode(mystring, encoding="utf-16le").lstrip(u"\ufeff")
Above, I have manually specified the byte order and then stripped off the BOM, rather than specifying "utf-16" as the encoding and letting Python figure out the byte order. This is because the BOM is going to be found once at the beginning of the file, not at the beginning of each line, so if you are converting the lines to Unicode one at a time, you won't have a BOM most of the time.
However, it might make more sense to go back to the source of that file and figure out why it's being saved in little-endian UTF-16 if you expected ASCII. Is the file generated the same way on Linux and Windows, for instance? Has it been touched by a text editor that defaults to saving as Unicode? Etc.
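In Python 3 the decoding can be left to open() by passing an encoding. The sketch below assumes the list file really is UTF-16 with a BOM, as the repr output suggests; read_path_list, the temporary file, and the two shortened example paths are hypothetical and only there to make the snippet self-contained.

```python
import os
import tempfile


def read_path_list(list_file):
    """Read one path per line from a UTF-16 file that starts with a BOM."""
    # The 'utf-16' codec reads the BOM once at the start of the file,
    # picks the byte order from it, and does not return it as text.
    with open(list_file, encoding='utf-16') as f:
        return [line.strip() for line in f if line.strip()]


# Demo: create a small UTF-16 list file, then read it back.
demo_file = os.path.join(tempfile.mkdtemp(), 'paths.txt')
expected = [r'G:\ext_pt1\a.rpt.gz', r'G:\ext_pt1\b.rpt.gz']  # made-up paths
with open(demo_file, 'w', encoding='utf-16', newline='') as f:
    f.write('\r\n'.join(expected) + '\r\n')

paths = read_path_list(demo_file)
```

Each entry in paths is then an ordinary string without NULL bytes and can be handed to gzip.open.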
It seems there is a problem with the encoding of your file. The printed file name pasted in your question does not consist of normal characters. Did you save your path-list file in Unicode format?
I had the same problem. I replaced \ with / and it was OK. I just wanted to remind you of this possibility before you go into more advanced remedies.
