Python: List not correct after appending lines - python

I'm trying to append lines to an empty list reading from a file, and I've already stripped the lines of returns and newlines, but what should be one line is being entered as two separate items into the list.
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.rstrip()
    line = line.lstrip()
    if line.startswith('>')==True:
        DNAID.append(line)
    if line.startswith('>')==False:
        DNASEQ.append(line)
print DNAID
print DNASEQ
And here's the output
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGA', 'TCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCA', 'ATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAG', 'TGGGAACCTGCGGGCAGTAGGTGGAAT']
I want it to look like this:
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGATCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCAATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAGTGGGAACCTGCGGGCAGTAGGTGGAAT']
Here is the source material, just remove the ''s:
['>Rosalind_6404'
CCTGCGGAAGATCGGCACTAGA
TCCCACTAATAATTCTGAGG
'>Rosalind_5959'
CCATCGGTAGCGCATCCTTAGTCCA
ATATCCATTTGTCAGCAGACACGC
'>Rosalind_0808'
CCACCCTCGTGGTATGGCTAGGCATTCAG
TGGGAACCTGCGGGCAGTAGGTGGAAT]

You can combine the .lstrip() and .rstrip() into a single .strip() call.
You were also expecting .append() to both add lines to the list and join consecutive lines into a single line, but it only ever adds a new item. Here, we append an empty string to DNASEQ each time a new ID appears and use += on the last item to join the lines that follow into one long string:
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.strip()
    if line.startswith('>'):
        DNAID.append(line)
        DNASEQ.append('')
    else:
        DNASEQ[-1] += line
print DNAID
print DNASEQ
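For anyone on Python 3, the same idea can be wrapped in a small function. This is only a sketch: parse_fasta and the sample list are names made up for illustration, and it assumes every sequence line belongs to the most recent '>' line.

```python
def parse_fasta(lines):
    """Return (ids, sequences), joining each multi-line sequence into one string."""
    ids = []
    sequences = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            ids.append(line)
            sequences.append('')   # start a new, empty sequence for this ID
        elif sequences:
            sequences[-1] += line  # extend the sequence of the most recent ID
    return ids, sequences


# Hypothetical sample standing in for the lines of DNAGCex.txt
sample = [
    '>Rosalind_6404\n',
    'CCTGCGGAAGATCGGCACTAGA\n',
    'TCCCACTAATAATTCTGAGG\n',
    '>Rosalind_5959\n',
    'CCATCGGTAGCGCATCCTTAGTCCA\n',
    'ATATCCATTTGTCAGCAGACACGC\n',
]
ids, sequences = parse_fasta(sample)
print(ids)
print(sequences)
```

Because the function only iterates over its argument, you can pass it an open file object instead of a list.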

Within each iteration of the loop, you're only looking at a certain line from the file. This means that, although you certainly are appending lines that don't contain a linefeed at the end, you're still appending one of the file's lines at a time. You'll have to let the interpreter know that you want to combine certain lines, by doing something like setting a flag when you first start to read in a DNASEQ and clearing it when the next DNAID starts.
for line in DNA:
    line = line.strip() # gets both sides
    if line.startswith('>'):
        starting = True
        DNAID.append(line)
    elif starting:
        starting = False
        DNASEQ.append(line)
    else:
        DNASEQ[-1] += line

Related

How get only 5mers from a sequence

I have a file that contains millions of sequences. What I want to do is to get 5mers from each sequence in every line of my file.
My file looks like this:
CGATGCATAGGAA
GCAGGAGTGATCC
my code is:
with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
With this code I should not get any 4mers, but I still do, even though I have an if statement that checks for a length of 5. Could anyone help me with this? Thanks
My output is:
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC
but the ideal output should be only the one with length equal to 5 (for each line separately):
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
When iterating through a file, every character is represented somewhere. In particular, the last character for each of those lines is a newline \n, which you're printing.
with open('test.txt') as f: data = list(f)
# data[0] == 'CGATGCATAGGAA\n'
# data[1] == 'GCAGGAGTGATCC\n'
So the very last substring you're trying to print from the first line is 'GGAA\n', which has a length of 5, but it's giving you the extra whitespace and the appearance of 4mers. One of the comments proposed a satisfactory solution, but when you know the root of the problem you have lots of options:
with open('test.txt', 'r') as file:
    for line_no, line in enumerate(file):
        if line_no: print() # for the space between chunks which you seem to want in your final output -- omit if not desired
        line = line.strip() # remove surrounding whitespace, including the pesky newlines
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
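Another option, once the newline is stripped, is to stop the loop early so that a short tail is never produced and no length check is needed. A minimal sketch, with kmers as a made-up helper name and the two sample lines taken from the question:

```python
def kmers(sequence, k=5):
    """Return every length-k substring of sequence, in order."""
    sequence = sequence.strip()  # drop the trailing newline first
    # The last valid start index is len(sequence) - k, so no short tail appears.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


for line in ['CGATGCATAGGAA\n', 'GCAGGAGTGATCC\n']:
    print('\n'.join(kmers(line)))
```

If a line is shorter than k, the range is empty and the function returns an empty list.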

Trying to delete lines in a text file that contain a specific character

I'm testing the code below, but it doesn't do what I would like it to do.
delete_if = ['#', ' ']
with open('C:\\my_path\\AllDataFinal.txt') as oldfile, open('C:\\my_path\\AllDataFinalFinal.txt', 'w') as newfile:
    for line in oldfile:
        if not any(del_it in line for del_it in delete_if):
            newfile.write(line)
print('DONE!!')
Basically, I want to delete any line that contains a '#' character (the lines I want to delete start with a '#' character). Also, I want to delete any/all lines that are completely blank. Can I do this in one go, by reading through items in a list, or will it require several passes through the text file to clean up everything? TIA.
It's easy. Check my code below :
filePath = "your old file path"
newFilePath = "your new file path"
# we are going to list down which lines start with "#" or just blank
marker = []
with open(filePath, "r") as file:
    content = file.readlines() # read all lines and store them into list
for i in range(len(content)): # loop into the list
    if content[i][0] == "#" or content[i] == "\n": # check if the line starts with "#" or just blank
        marker.append(i) # store the index into marker list
with open(newFilePath, "a") as file:
    for i in range(len(content)): # loop into the list
        if not i in marker: # if the index is not in marker list, then continue writing into file
            file.writelines(content[i]) # writing lines into file
The point is, we need to read all the lines first. And check line by line whether it starts with # or it's just blank. If yes, then store it into a list variable. After that, we can continue writing into new file by checking if the index of the line is in marker or not.
Let me know if you have problem.
How about using the ternary operator?
#First option: within your for loop
line = "" if "#" in line or not line else line
#Second option: with list comprehension
newFile = ["" if not line or "#" in line else line for line in oldfile]
The ternary does work on an empty string: "#" in "" simply evaluates to False, and no exception is raised. If you would rather spell the conditions out one at a time, how about
#Third option: "Staging your conditions" within your for loop
#First, make sure the string is not empty
if line:
    #If it has the "#" char in it, delete it
    if "#" in line:
        line = ""
#If it is, delete it
else:
    line = ""
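One thing to watch with any of these options: a blank line read from a file is "\n", not "", so it has to be stripped before it tests as empty. Here is a single-pass sketch of the whole filter. The names keep_line and filter_lines are invented for the example, and io.StringIO stands in for the real files. Note that it also drops indented comment lines, which may or may not be what you want.

```python
import io


def keep_line(line):
    """True for lines that are neither blank nor comments starting with '#'."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith('#')


def filter_lines(source, destination):
    """Copy the wanted lines from source to destination in one pass."""
    for line in source:
        if keep_line(line):
            destination.write(line)


# In-memory stand-ins for the old and new files
old = io.StringIO('# header\nfirst row\n\n   \nsecond row\n#trailing\n')
new = io.StringIO()
filter_lines(old, new)
print(new.getvalue())
```

The ' ' entry in the question's delete_if list is left out on purpose, since it would delete every line that contains a space anywhere.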

Extracting gene sequences from FASTA File?

I have the following code that reads a FASTA file with 10 gene sequences and returns each sequence as a matrix.
However, the code seems to be missing the very last sequence, and I wonder why?
file=open('/Users/vivianspro/Downloads/rosalind_cons (5).txt', 'r')
line=file.readline()
strings = []
sequence=''
while line:
    #line=line.rstrip('\n')
    line = line.strip() #empty () automatically strips the \n
    if '>' in line:
        if sequence != "":
            strings.append(sequence)
            sequence = ""
        #sequence=line
    else:
        sequence+=line
    line=file.readline()
for s in strings:
    print(s)
Motifs = []
for seq in strings:
    Motifs.append(list(seq))
#make every symbol into an element in the list separated by ,
for s in Motifs:
    print(s)
You only append to strings when you see a new > but there isn't one after the last sequence.
Here is a refactoring which will hopefully also be somewhat more idiomatic.
strings = []
sequence=''
with open('/Users/vivianspro/Downloads/rosalind_cons (5).txt', 'r') as file:
    for line in file:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if sequence != "":
                strings.append(sequence)
            sequence = ""
        else:
            sequence+=line
    # After the last iteration, append once more if we have something to append
    if sequence:
        strings.append(sequence)
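If you prefer, the "flush once more after the loop" step can live inside a generator, so the caller never has to remember it. This is a sketch only: read_sequences and the tiny sample are invented for illustration, and it assumes header lines start with '>'.

```python
def read_sequences(lines):
    """Yield one joined sequence string per FASTA record."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            if parts:
                yield ''.join(parts)  # the previous record is complete
            parts = []
        elif line:
            parts.append(line)
    if parts:
        yield ''.join(parts)  # the last record has no '>' after it


# Hypothetical two-record sample instead of the real file
sample = ['>ID1\n', 'ACGT\n', 'TT\n', '>ID2\n', 'GGCA\n']
strings = list(read_sequences(sample))
motifs = [list(seq) for seq in strings]  # one list of symbols per sequence
print(strings)
print(motifs)
```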
Since FASTA files contain the data in such a format:
>ID1
seq_1
>ID2
seq_2
...
According to your code, if your line contains a > only then you try to append the sequence. That means, you are adding the sequence for ID_1 when you are iterating for ID_2.
To resolve this, you can do something like this:
for line in file:
    line = line.strip()
    if '>' in line: # Line 1
        line = file.readline().strip()
        # print(line)
        strings.append(line)
The example above relies on the fact that in a FASTA file the sequence comes directly after the ID line, which contains the > character (you can change Line 1 so that it checks only the first character, line[0] == ">"). Note that it reads just one line per ID, so it assumes each sequence fits on a single line.

IndexError: string index out of range, line is empty?

I'm trying to make a program that takes a letter the user inputs, reads a text file and then prints the words that start with that letter.
item = "file_name"
letter = raw_input("Words starting with: ")
letter = letter.lower()
found = 0
with open(item) as f:
    filelength = sum(1 for line in f)
    for i in range (filelength):
        word = f.readline()
        print word
        if word[0] == letter:
            print word
            found += 1
print "words found:", found
I keep receiving the error
"if word[0] == letter: IndexError: string index out of range"
with no lines being printed. I think this is what happens if there's nothing there, but there are 50 lines of random words in the file, so I'm not sure why it is being read this way.
You have two problems:
1. You are trying to read the whole file twice (once to determine the filelength, then again to get the lines themselves), which won't work; and
2. You aren't dealing with empty lines, so if any are introduced (e.g. if the last line is blank) your code will break anyway.
The easiest way to do this is:
found = 0
with open(item) as f:
    for line in f: # iterate over lines directly
        if line and line[0] == letter: # skip blank lines and any that don't match
            found += 1
print "words found:", found
if line skips blanks because empty sequences are false-y, and the "lazy evaluation" of and means that line[0] will only be tried where the line isn't empty. You could alternatively use line.startswith(letter).
When you use sum(1 for line in f) you are already consuming all the lines in your file, so now your handle points to the end of the file. Try using f.seek(0) to return the read cursor to the start of the file.
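Here is a small demonstration of that behaviour, written for Python 3 and using io.StringIO as a stand-in for the real word file (the four sample lines are made up). After the count, readline() returns an empty string, and indexing '' with [0] is exactly what raises the IndexError.

```python
import io

# In-memory stand-in for the word file
f = io.StringIO('apple\nbanana\n\navocado\n')

line_count = sum(1 for _ in f)  # consumes every line; the cursor is now at the end
after_count = f.readline()      # nothing left to read, so this is ''

f.seek(0)                       # rewind to the start of the file
first_line = f.readline()       # reading works again

f.seek(0)
found = sum(1 for line in f if line.startswith('a'))

print(line_count, repr(after_count), repr(first_line), found)
```

Using startswith() here also avoids the IndexError on its own, because it is safe to call on an empty string.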

Python: Copying lines that meet requirements

So, basically, I need a program that opens a .dat file, checks each line to see if it meets certain prerequisites, and if they do, copy them into a new csv file.
The prerequisites are that it must 1) contain "$W" or "$S" and 2) have the last value at the end of the line of the DAT say one of a long list of acceptable terms. (I can simply make-up a list of terms and hardcode them into a list)
For example, if the CSV was a list of purchase information and the last item was what was purchased, I only want to include fruit. In this case, the last item is an ID tag, and I only want to accept a handful of them: there is a list of about 5 acceptable tags. The tags vary widely in length, but they are always the last item on the line (and always the 4th item).
Let me give a better example, again with the fruit.
My original .DAT might be:
DGH$G$H $2.53 London_Port Gyro
DGH.$WFFT$Q5632 $33.54 55n39 Barkdust
UYKJ$S.52UE $23.57 22#3 Apple
WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana
Only the line "UYKJ$S.52UE $23.57 22#3 Apple" would be copied, because only it has both 1) $W or $S (in this case a $S) and 2) a fruit as the last item. Once the .csv file is made, I am going to need to go back through it and replace all the spaces with commas, but that's not nearly as problematic for me as figuring out how to scan each line for requirements and only copy the ones that are wanted.
I am making a few programs all very similar to this one, which open .dat files, check each line to see if it meets the requirements, and then decide whether to copy it to the new file. But sadly, I have no idea what I am doing. They are all similar enough that once I figure out how to make one, the rest will be easy, though.
EDIT: The .DAT files are a few thousand lines long, if that matters at all.
EDIT2: Some of my current code snippets
Right now, my current version is this:
def main():
    #NewFile_Loc = C:\Users\J18509\Documents
    OldFile_Loc=raw_input("Input File for MCLG:")
    OldFile = open(OldFile_Loc,"r")
    OldText = OldFile.read()
    # for i in range(0, len(OldText)):
    #     if (OldText[i] != " "):
    #         print OldText[i]
    i = split_line(OldText)
    if u'$S' in i:
        # $S is in the line
        print i
main()
But it's very choppy still. I'm just learning python.
Brief update: the server I am working on is down, and might be for the next few hours, but I have my new code, which has syntax errors in it, but here it is anyways. I'll update again once I get it working. Thanks a bunch everyone!
import os
NewFilePath = "A:\test.txt"
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,','Name Header,','Header 3,','Header 4)
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if (LineParts[0].find($W)) or (LineParts[0].find($S)):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(LineParts[1],',',LineParts[0],',',LineParts[2],',',LineParts[3])
    OldFile.close()
    NewFile.close()
main()
There are two parts you need to implement. First, read a file line by line and write out the lines that meet specific criteria. This is done by
with open('file.dat') as f:
    for line in f:
        stripped = line.strip() # remove '\n' from the end of the line
        if test_line(stripped):
            print stripped # Write to stdout
The criteria you want to check for are implemented in the function test_line. To check for the occurrence of "$W" or "$S", you can simply use the in operator, like
if not '$W' in line and not '$S' in line:
    return False
else:
    return True
To check, if the last item in the line is contained in a fixed list, first split the line using split(), then take the last item using the index notation [-1] (negative indices count from the end of a sequence) and then use the in operator again against your fixed list. This looks like
items = line.split() # items is an array of strings
last_item = items[-1] # take the last element of the array
if last_item in ['Apple', 'Banana']:
    return True
else:
    return False
Now, you combine these two parts into the test_line function like
def test_line(line):
    if not '$W' in line and not '$S' in line:
        return False
    items = line.split() # items is an array of strings
    last_item = items[-1] # take the last element of the array
    if last_item in ['Apple', 'Banana']:
        return True
    else:
        return False
Note that the program writes the result to stdout, which you can easily redirect. If you want to write the output to a file, have a look at Correct way to write line to file in Python
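If you would rather write the CSV directly, here is one possible sketch for Python 3 using the standard csv module. The names line_matches (playing the role of test_line), copy_matching, ACCEPTED_TAGS and REQUIRED_MARKERS are all made up for the example, and io.StringIO stands in for the real .dat and .csv files. It assumes fields are separated by whitespace.

```python
import csv
import io

ACCEPTED_TAGS = {'Apple', 'Banana'}  # hypothetical list of accepted last items
REQUIRED_MARKERS = ('$W', '$S')


def line_matches(line):
    """True if the line contains $W or $S and ends with an accepted tag."""
    if not any(marker in line for marker in REQUIRED_MARKERS):
        return False
    items = line.split()
    return bool(items) and items[-1] in ACCEPTED_TAGS


def copy_matching(source, destination):
    """Write each matching line to destination as one comma-separated row."""
    writer = csv.writer(destination, lineterminator='\n')
    for line in source:
        if line_matches(line):
            writer.writerow(line.split())


# Sample data from the question
dat = io.StringIO(
    'DGH$G$H $2.53 London_Port Gyro\n'
    'DGH.$WFFT$Q5632 $33.54 55n39 Barkdust\n'
    'UYKJ$S.52UE $23.57 22#3 Apple\n'
    'WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana\n'
)
out = io.StringIO()
copy_matching(dat, out)
print(out.getvalue())
```

With real files you would pass two open file objects instead, opening the output with newline='' as the csv documentation recommends.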
inlineRequirements = ['$W','$S']
endlineRequirements = ['Apple','Banana']
inputFile = open(input_filename,'r')
outputFile = open(output_filename,'w')
for line in inputFile.readlines():
    line = line.strip()
    #trailing and leading whitespace has been removed
    if any(req in line for req in inlineRequirements):
        #passed inline requirement
        lastWord = line.split(' ')[-1]
        if lastWord in endlineRequirements:
            #passed endline requirement
            outputFile.write(line.replace(' ',',') + '\n')
            #replaced spaces with commas and wrote to file, restoring the newline that strip() removed
inputFile.close()
outputFile.close()
tags = ['apple', 'banana']
match = ['$W', '$S']
OldFile_Loc=raw_input("Input File for MCLG:")
OldFile = open(OldFile_Loc,"r")
for line in OldFile.readlines(): # Loop through the file
    line = line.strip() # Remove the newline and whitespace
    if line and not line.isspace(): # If the line isn't empty
        lparts = line.split() # Split the line
        if any(tag.lower() == lparts[-1].lower() for tag in tags) and any(c in line for c in match):
            # $S or $W is in the line AND the last section is in tags(case insensitive)
            print line
import re
list_of_fruits = ["Apple","Banana",...]
with open('some.dat') as f:
    for line in f:
        if re.findall(r"\$[SW]",line) and line.split()[-1] in list_of_fruits:
            print "Found:%s" % line
import os
NewFilePath = r"A:\test.txt" # raw string, so "\t" is not read as a tab
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,Name Header,Header 3,Header 4\n')
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        # find() returns -1 (which is truthy) when nothing matches, so test with "in" instead
        if len(LineParts) >= 4 and ('$W' in LineParts[0] or '$S' in LineParts[0]):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1] + ' is accepted')
                #This Line is acceptable!
                NewFile.write(','.join([LineParts[1], LineParts[0], LineParts[2], LineParts[3]]) + '\n')
    OldFile.close()
    NewFile.close()
main()
This worked great, and has all the capabilities I needed. The other answers are good, but none of them do 100% of what I needed like this one does.
