Iterate text file containing characters with different byte lengths - python

I'm opening a text dump, and trying to parse the contents out. Right now I'm just trying to ID different parts of the file (headers, labels, etc) to work from later. I'm IDing lines based on the first character. Some lines begin with ¯ (macron), some with =.
macron = '\xc2\xaf'
equalSign = '='
nullLines = 0
f = open(sys.argv[1])
for line in f:
    tempList = line.rsplit()
    if len(tempList) > 0:
        switchStr = tempList[0]
    else:
        print("tempList !> 0")
        nullLines = nullLines + 1
    if switchStr[0:2] == macron:
        print("macron")
    elif switchStr[0] == equalSign:
        print('equals')
    else:
        print switchStr
print(nullLines)
f.close()
This code works, but I'm confused. rsplit() splits whitespace. If I have a line such as =================== in the file, tempList is length = 1 and switchStr = '==================='. The same is true with the macron.
OK, so I tried to find the first character in each string with switchStr[0], but for the macron it didn't work: I needed the first "two" characters (though it is obviously just one), e.g. switchStr[0:2]. It does work for the equals sign. This interpreter output illustrates the thing I don't understand:
>>> line = '¯¯¯¯¯¯¯¯¯¯'
>>> line
'\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf'
>>> print line
¯¯¯¯¯¯¯¯¯¯
>>> line = '=========='
>>> line
'=========='
>>> print line
==========
>>>
So, some "characters" need 2 bytes, and some just one, but how can I programmatically figure out the difference?

Easy.
Stop dealing with bytes.
import io
import sys

with io.open(sys.argv[1], encoding='utf-8') as f:
    line = f.readline()
    print(line[0])
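As a rough illustration of why decoding helps, here is a small sketch in Python 3. The sample text and the variable names are made up for the example. The macron occupies two bytes in UTF-8, but once the bytes are decoded, every character is a single element of the string.

```python
MACRON = "\u00af"  # the macron character

# Hypothetical sample data standing in for the text dump, as raw UTF-8 bytes.
raw = (MACRON * 4 + " header\n" + "==== label\n").encode("utf-8")

# As bytes, the first macron is a two-byte sequence.
macron_bytes = raw[:2]

# Decoded to text, each character is one element, whatever its byte length.
text = raw.decode("utf-8")
first_chars = [line[0] for line in text.splitlines() if line]

# If the byte length is still of interest, it can be computed per character.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in first_chars}

print(macron_bytes)
print(first_chars)
print(byte_lengths)
```

With text strings, the comparison in the question becomes a plain first-character test for both kinds of line.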

Related

Unable to remove line breaks in a text file in python

At the risk of losing reputation I did not know what else to do. My file is not showing any hidden characters and I have tried every .replace and .strip I can think of. My file is UTF-8 encoded and I am using python/3.6.1
I have a file with the format:
>header1
AAAAAAAA
TTTTTTTT
CCCCCCCC
GGGGGGGG
>header2
CCCCCC
TTTTTT
GGGGGG
AAAAAA
I am trying to remove line breaks from the end of the file to make each line a continuous string. (This file is actually thousands of lines long).
My code is redundant in the sense that I typed in everything I could think of to remove line breaks:
fref = open(ref)
for line in fref:
    sequence = 0
    header = 0
    if line.startswith('>'):
        header = ''.join(line.splitlines())
        print(header)
    else:
        sequence = line.strip("\n").strip("\r")
        sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
        print(len(sequence))
output is:
>header1
8
8
8
8
>header2
6
6
6
6
But if I manually go in and delete the line endings to join the lines, it shows up as one continuous string.
Expected output:
>header1
32
>header2
24
Thanks in advance for any help,
Dennis
There are several approaches to parsing this kind of input. In all cases, I would recommend isolating the open and print side-effects outside of a function that you can unit test to convince yourself of the proper behavior.
You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:
def parse(infile):
    # Assumes each record is followed by a blank line, except possibly the last.
    for line in infile:
        if line.startswith(">"):
            total = 0
            yield line.strip()
        elif not line.strip():
            yield total
        else:
            total += len(line.strip())
    if line.strip():
        yield total

def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]
Or, you could handle both empty lines and end-of-file at the same time. Here, I use an output array to which I append headers and totals:
def parse(infile):
    output = []
    while True:
        line = infile.readline()
        if line.startswith(">"):
            total = 0
            header = line.strip()
        elif line and line.strip():
            total += len(line.strip())
        else:
            # Blank line or end of file: the current record is complete.
            output.append(header)
            output.append(total)
            if not line:
                break
    return output

def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]
Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:
import re
from io import StringIO

def parse(infile, outfile):
    content = infile.read()
    # Blocks are separated by one empty line.
    for block in re.split(r"\r?\n\r?\n", content):
        header, *lines = re.split(r"\s+", block)
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))

def test_parse():
    with open("input.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]
Note that my tests use open("input.txt") for brevity but I would actually recommend passing a StringIO(...) instance instead to see the input being tested more easily, to avoid hitting the filesystem and to make the tests faster.
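A test along those lines might look like the following sketch. The parse here is a variant written for this example rather than one of the versions above: it assumes a record may also end where the next header starts, since the sample file in the question has no blank lines between records.

```python
from io import StringIO


def parse(infile):
    """Yield each header line followed by the summed length of its sequence lines."""
    total = None  # stays None until the first header has been seen
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            if total is not None:
                yield total  # the previous record ends where a new header starts
            total = 0
            yield line
        elif line and total is not None:
            total += len(line)
    if total is not None:
        yield total  # the last record ends at end-of-file


def test_parse():
    # The input is visible in the test itself and no file is touched.
    infile = StringIO(
        ">header1\nAAAAAAAA\nTTTTTTTT\nCCCCCCCC\nGGGGGGGG\n"
        ">header2\nCCCCCC\nTTTTTT\nGGGGGG\nAAAAAA\n"
    )
    assert list(parse(infile)) == [">header1", 32, ">header2", 24]


test_parse()
```

Because parse accepts any iterable of lines, the same function works on a real file object in production.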
From my understanding of your question, you would like something like this.
Note how the sequence is built up over multiple iterations of the loop, as you wish to combine multiple lines. The last record is printed after the loop, because no further header follows it.
with open(ref) as f:
    sequence = ""  # reset sequence
    header = None
    for line in f:
        if line.startswith('>'):
            if header:
                print(header)  # print last header
                print(len(sequence))  # print last sequence
            sequence = ""  # reset sequence
            header = line.rstrip()  # store header without the line break
        else:
            sequence += line.rstrip()  # append line to sequence
    if header:
        print(header)  # print the final record
        print(len(sequence))

Locate a specific line in a file based on user input then delete a specific number of lines

I'm trying to delete specific lines in a text file. The way I need to go about it is by prompting the user to input a string (a phrase that should exist in the file). The file is then searched, and if the string is there, the data on that line and the line number are both stored.
After the phrase has been found, it and the five following lines are printed out. Now I have to figure out how to delete those six lines without changing any other text in the file, which is my issue lol.
Any ideas as to how I can delete those six lines?
This was my latest attempt to delete the lines:
file = open('C:\\test\\example.txt', 'a')
locate = "example string"
for i, line in enumerate(file):
    if locate in line:
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i + 1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        break
Usually I would not think it's desirable to overwrite the source file - what if the user does something by mistake? If your project allows, I would write the changes out to a new file.
with open('source.txt', 'r') as ifile:
    with open('output.txt', 'w') as ofile:
        locate = "example string"
        skip_next = 0
        for line in ifile:
            if locate in line:
                skip_next = 5  # also drop the five lines after the match
                print(line.rstrip('\n'))
            elif skip_next > 0:
                print(line.rstrip('\n'))
                skip_next -= 1
            else:
                ofile.write(line)
This is also robust to finding the phrase multiple times - it will just start counting lines to remove again.
You can find the occurrences, copy the list items between the occurrences to a new list and then save the new list into the file.
_newData = []
_linesToSkip = 6  # the matching line plus the five lines after it
with open('data.txt', 'r') as _file:
    data = _file.read().splitlines()
occurrences = [i for i, x in enumerate(data) if "example string" in x]
_lastOcurrence = 0
for ocurrence in occurrences:
    _newData.extend(data[_lastOcurrence : ocurrence])
    _lastOcurrence = ocurrence + _linesToSkip
_newData.extend(data[_lastOcurrence:])
# Save new data into the file
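The saving step is left as a comment above. A minimal sketch of the whole round trip follows, assuming the file is small enough to hold in memory. The function name remove_blocks, its parameters, and the demo file are made up for illustration, and it tracks a running index instead of the slice bookkeeping used above.

```python
import os
import tempfile


def remove_blocks(path, phrase, lines_to_skip=6):
    """Rewrite `path` without each line containing `phrase` and the lines right after it."""
    with open(path, "r") as f:
        data = f.read().splitlines()
    kept = []
    skip_until = 0  # index of the first line that may be kept again
    for i, line in enumerate(data):
        if phrase in line:
            skip_until = i + lines_to_skip  # a repeated match restarts the count
        if i >= skip_until:
            kept.append(line)
    # The file is reopened for writing only after it has been read completely.
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in kept)
    return kept


# Demo on a throwaway file in a temporary directory.
demo_lines = ["line0", "line1", "example string", "a", "b", "c", "d", "e", "line8", "line9"]
demo_path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(demo_path, "w") as f:
    f.write("\n".join(demo_lines) + "\n")

kept = remove_blocks(demo_path, "example string")
print(kept)
```

Note that this overwrites the original file, so keeping a backup copy first is a reasonable precaution.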
There are a couple of points that you clearly misunderstand here:
.strip() removes whitespace or given characters:
>>> print(str.strip.__doc__)
S.strip([chars]) -> str
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
Incrementing i doesn't actually do anything:
>>> for i, _ in enumerate('ignore me'):
...     print(i)
...     i += 10
...
0
1
2
3
4
5
6
7
8
You're assigning to the ith element of the line, which should raise an exception (that you neglected to tell us about):
>>> line = 'some text'
>>> line[i] = line.strip()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Ultimately...
You have to write to a file if you want to change its contents. Writing to a file that you're reading from is tricky business. Writing to an alternative file, or just storing the file in memory if it's small enough is a much healthier approach.
search_string = 'example'
lines = []
with open('/tmp/fnord.txt', 'r+') as f:  # `r+` so we can read *and* write to the file
    for line in f:
        line = line.strip()
        if search_string in line:
            print(line)
            for _ in range(5):
                print(next(f).strip())
        else:
            lines.append(line)
    f.seek(0)  # back to the beginning!
    f.truncate()  # goodbye, original lines
    for line in lines:
        print(line, file=f)  # python2 requires `from __future__ import print_function`
There is a fatal flaw in this approach, though - if the sought after line is any closer than the 6th line from the end, it's going to have problems. I'll leave that as an exercise for the reader.
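One way around that flaw is sketched here on a plain list of lines rather than a file. The function name split_block and its parameters are invented for the example. itertools.islice takes up to five further lines from the iterator and simply stops if fewer remain, so a match close to the end of the input does not raise StopIteration.

```python
from itertools import islice


def split_block(lines, search_string, extra=5):
    """Return (kept, dropped): dropped holds each matching line plus up to `extra` lines after it."""
    it = iter(lines)
    kept, dropped = [], []
    for line in it:
        if search_string in line:
            dropped.append(line)
            # islice ends quietly when fewer than `extra` lines are left.
            dropped.extend(islice(it, extra))
        else:
            kept.append(line)
    return kept, dropped


kept, dropped = split_block(["a", "x example", "b", "c"], "example")
print(kept)
print(dropped)
```

A second match that falls inside an already dropped block is not treated as a new match in this sketch.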
You are appending to your file by using open with 'a'. Also, you are not closing your file (bad habit). str.strip() does not delete the line, it removes whitespace by default. Also, this would usually be done in a loop.
Here is something to get you started:
locate = "example string"
n = 0
with open('example.txt', 'r+') as f:
    for i, line in enumerate(f):
        if locate in line:
            n = 6
        if n:
            print(line, end='')
            n -= 1
print("done")
Edit:
Read-modify-write solution:
locate = "example string"
filename = 'example.txt'
removelines = 5
with open(filename) as f:
    lines = f.readlines()
with open(filename, 'w') as f:
    n = 0
    for line in lines:
        if locate in line:
            n = removelines + 1
        if n:
            n -= 1
        else:
            f.write(line)

Finding the last element (digit) on every line in a file (Python)

I'm reading from a text file and I need to find the last element (digit) on each line. I don't understand why this code isn't working as I have tried it on a regular string but it doesn't seem to apply in this case.
f = open("file.txt", "r")
result = 0
for line in f:
    string = str(f.read())
    if string[-1:].isdigit() == True:
        result = int(string[-1:])
    else:
        result = 40
print(result)
f.close()
The file file.txt only contains the line
81 First line32
so the code should print out 2 as a result, but I only get 40, as the first condition never becomes true. What am I doing wrong?
This line is extraneous:
string = str(f.read())
You don't need to read from your file, and will actually move the file pointer by doing so, causing all sorts of issues. You're already reading with this:
for line in f:
Thus, what you want is:
for line in f:
    if line[-1:].isdigit() == True:
        result = int(line[-1:])
    else:
        result = 40
This is explained in the documentation.
You have an f.read() too many. This is all you need:
f = open("file.txt", "r")
result = 0
for line in f:
    if line[-1:].isdigit():
        result = int(line[-1:])
    else:
        result = 40
print(result)
f.close()
Also the if string[-1:].isdigit() == True: can be replaced with if line[-1:].isdigit():
You may also want to use line.strip() to get rid of new lines, or else the comparison will fail.
f = open("file.txt", "r")
result = 0
for line in f:
    l = line.strip()
    if l[-1:].isdigit():
        result = int(l[-1:])
    else:
        result = 40
print(result)
f.close()
The problem is that the last character in the line is the end-of-line character. Use .strip() to remove it (it will also remove extra spaces).
with open("file.txt", "r") as f:
    for line in f:
        lastchar = line.strip()[-1:]  # the slice is "" for a blank line instead of raising IndexError
        if lastchar.isdigit():
            result = int(lastchar)
        else:
            result = 40
        print(result)
This prints 2 as you requested in your question with the one-line file.
81 First line32
It will also work for multiple lines, printing the result for each line.
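The per-line logic can also be pulled into a small function, which makes it easy to check without a file. This is only a sketch: the name last_digit and the default parameter are invented here, with 40 taken from the question as the fallback value.

```python
def last_digit(line, default=40):
    """Return the final character of `line` as an int, or `default` if it is not a digit."""
    last = line.strip()[-1:]  # slicing gives "" for a blank line instead of raising IndexError
    # isdecimal() rather than isdigit(): characters such as a superscript two
    # pass isdigit() but cannot be converted by int().
    return int(last) if last.isdecimal() else default


print(last_digit("81 First line32\n"))
print(last_digit("no digits at the end\n"))
```

Applied to the file, this becomes a loop such as `for line in f: print(last_digit(line))`.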

Printing specific lines txt file python

I have a text file I wish to analyze. I'm trying to find every line that contains certain characters (ex: "#") and then print the line located 3 lines before it (ex: if line 5 contains "#", I would like to print line 2)
This is what I got so far:
file = open('new_file.txt', 'r')
a = list()
x = 0
for line in file:
    x = x + 1
    if '#' in line:
        a.append(x)
        continue
x = 0
for index, item in enumerate(a):
    for line in file:
        x = x + 1
        d = a[index]
        if x == d - 3:
            print line
            continue
It won't work (it prints nothing when I feed it a file that has lines containing "#"), any ideas?
First, you are going through the file multiple times without re-opening or rewinding it in between. That means every attempt to iterate over the file after the first one terminates immediately without reading anything.
Second, your indexing logic is a little convoluted. Assuming your files are not huge relative to your memory size, it is much easier to simply read the whole file into memory (as a list) and manipulate it there.
myfile = open('new_file.txt', 'r')
a = myfile.readlines()
for index, item in enumerate(a):
    if '#' in item and index - 3 >= 0:
        print(a[index - 3].strip())
This has been tested on the following input:
PrintMe
PrintMe As Well
Foo
#Foo
Bar#
hello world will print
null
null
##
Ok, the issue is that you have already iterated completely through the file descriptor file in line 4 when you try again in line 11. So line 11 will make an empty loop. Maybe it would be a better idea to iterate the file only once and remember the last few lines...
file = open('new_file.txt', 'r')
a = ["", "", ""]
for line in file:
    if "#" in line:
        print(a[0], end="")
    a.append(line)
    a = a[1:]
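The same "remember the last few lines" idea can be sketched with collections.deque, which discards the oldest entry automatically once maxlen is reached. The function name lines_before_marker and its parameters are made up for this example. Unlike the list version above, it yields nothing for a match that occurs before three lines have been read.

```python
from collections import deque


def lines_before_marker(lines, marker="#", offset=3):
    """Yield the line `offset` lines above each line that contains `marker`."""
    recent = deque(maxlen=offset)  # holds only the previous `offset` lines
    for line in lines:
        if marker in line and len(recent) == offset:
            yield recent[0]  # the oldest buffered line is `offset` lines back
        recent.append(line)


sample = [
    "PrintMe", "PrintMe As Well", "Foo", "#Foo", "Bar#",
    "hello world will print", "null", "null", "##",
]
result = list(lines_before_marker(sample))
print(result)
```

Since the function takes any iterable of lines, an open file object can be passed in directly and only three lines are held in memory at a time.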
For file I/O, regular expressions are often an efficient way to match patterns, in terms of both programmer time and runtime. Combined with iterating through the lines of the file, your problem becomes straightforward.
import re
file = open('new_file.txt', 'r')
document = file.read()
lines = document.split("\n")
LinesOfInterest = []
for lineNumber, line in enumerate(lines):
    WhereItsAt = re.search(r'#', line)
    if lineNumber > 2 and WhereItsAt:
        LinesOfInterest.append(lineNumber - 3)
print(LinesOfInterest)
for lineNumber in LinesOfInterest:
    print(lines[lineNumber])
LinesOfInterest is now a list of the line numbers matching your criteria.
I used
line1,0
line2,0
line3,0
#
line1,1
line2,1
line3,1
#
line1,2
line2,2
line3,2
#
line1,3
line2,3
line3,3
#
as input yielding
[0, 4, 8, 12]
line1,0
line1,1
line1,2
line1,3

How can I reverse complement a multiple sequence fasta file with python?

I am new to python and I am trying to figure out how to read a fasta file with multiple sequences and then create a new fasta file containing the reverse complement of the sequences. The file will look something like:
>homo_sapiens
ACGTCAGTACGTACGTCATGACGTACGTACTGACTGACTGACTGACGTACTGACTGACTGACGTACGTACGTACGTACGTACGTACTG
>Canis_lupus
CAGTCATGCATGCATGCAGTCATGACGTCAGTCAGTACTGCATGCATGCATGCATGCATGACTGCAGTACTGACGTACTGACGTCATGCATGCAGTCATG
>Pan_troglodytus
CATGCATACTGCATGCATGCATCATGCATGCATGCATGCATGCATGCATCATGACTGCAGTCATGCAGTCAGTCATGCATGCATCAT
I am trying to learn how to use for and while loops so if the solution can incorporate one of them it would be preferred.
So far I managed to do it in a very inelegant manner as follows:
file1 = open('/path/to/file', 'r')
for line in file1:
    if line[0] == '>':
        print line.strip() #to capture the title line
    else:
        import re
        seq = line.strip()
        line = re.sub(r'T', r'P', seq)
        seq = line
        line = re.sub(r'A',r'T', seq)
        seq = line
        line = re.sub(r'G', r'R', seq)
        seq = line
        line = re.sub(r'C', r'G', seq)
        seq = line
        line = re.sub(r'P', r'A', seq)
        seq = line
        line = re.sub(r'R', r'C', seq)
        print line[::-1]
file1.close()
This worked but I know there is a better way to iterate through that end part. Any better solutions?
I know you consider this an exercise for yourself, but in case you are interested in using existing facilities, have a look at the Biopython package. Especially if you are going to do more sequence work.
That would allow you to instantiate a sequence with e.g. seq = Seq('GATTACA'). Then, seq.reverse_complement() will give you the reverse complement.
Note that the reverse complement is more than just string reversal, the nucleotide bases need to be replaced with their complementary letter as well.
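Without Biopython, the two steps (complement every base, then reverse) can be sketched with the standard library in Python 3. The function names are invented for this example, and it assumes plain A/C/G/T sequences with each sequence on a single line, as in the sample file.

```python
# Complement table for the four unambiguous DNA bases, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_fasta(lines):
    """Yield header lines unchanged and sequence lines reverse-complemented."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield line if line.startswith(">") else reverse_complement(line)


print(reverse_complement("GATTACA"))
for out_line in reverse_complement_fasta([">demo\n", "AACG\n"]):
    print(out_line)
```

Because translate swaps all bases in one pass, no placeholder letters are needed to keep an A that became a T from being turned back into an A.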
Assuming I got you right, would the code below work for you? You could just add the exchanges you want to the dictionary.
d = {'A':'T','C':'G','T':'A','G':'C'}
with open("seqs.fasta", 'r') as in_file:
    for line in in_file:
        if line != '\n':  # skip empty lines
            line = line.strip()  # Remove new line character (I'm working on windows)
            if line.startswith('>'):
                head = line
            else:
                print(head)
                print(''.join(d[nuc] for nuc in line[::-1]))
Output:
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGT
ACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATG
ACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGT
ATGCATG
Here is a simple example of a string reversal.
Python Code
string = raw_input("Enter a string:")
reverse_string = ""
print "our string is %s" % string
print "our range will be %s\n" % range(0,len(string))
for num in range(0,len(string)):
    offset = len(string) - 1
    reverse_string += string[offset - num]
    print "the num is currently: %d" % num
    print "the offset is currently: %d" % offset
    print "the index is currently: %d" % int(offset - num)
    print "the new string is currently: %s" % reverse_string
    print "-------------------------------"
print "\nOur reverse string is: %s" % reverse_string
Added print commands to show you what is happening in the script.
Run it in python and see what happens.
Usually, to iterate over lines in a text file you use a for loop, because "open" returns a file object which is iterable
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
There is more about this here
You can also use the "with" context manager to open a file. This statement will close the file object for you, so you can never forget to do it.
I decided not to include a "for line in f:" statement because you have to read several lines to process one sequence (title, sequence and blank line). If you try to use a for loop together with "readline()" you will end up with a ValueError (try it).
So I would use string.translate. This script opens a file named "test" with your example in it:
import string

if __name__ == "__main__":
    file_name = "test"
    # Map each base straight to its complement (on Python 3, use str.maketrans instead).
    translator = string.maketrans("ACGT", "TGCA")
    with open(file_name, "r") as f:
        while True:
            title = f.readline().strip()
            if not title:  # end of file
                break
            rev_seq = f.readline().strip().translate(translator)[::-1]
            f.readline()  # blank line
            print(title)
            print(rev_seq)
Output (with your example):
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGTACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATGACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGTATGCATG
