How to get only 5-mers from a sequence in Python

I have a file that contains millions of sequences. What I want to do is to get 5mers from each sequence in every line of my file.
My file looks like this:
CGATGCATAGGAA
GCAGGAGTGATCC
My code is:
with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
With this code I should not get 4-mers, but I do, even though I have an if statement checking for a length of 5. Could anyone help me with this? Thanks.
My output is:
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC
but the ideal output should be only the one with length equal to 5 (for each line separately):
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC

When iterating through a file, every character is represented somewhere. In particular, the last character for each of those lines is a newline \n, which you're printing.
with open('test.txt') as f: data = list(f)
# data[0] == 'CGATGCATAGGAA\n'
# data[1] == 'GCAGGAGTGATCC\n'
So the very last substring you're trying to print from the first line is 'GGAA\n', which has a length of 5, but it's giving you the extra whitespace and the appearance of 4mers. One of the comments proposed a satisfactory solution, but when you know the root of the problem you have lots of options:
with open('test.txt', 'r') as file:
    for line_no, line in enumerate(file):
        if line_no: print() # for the space between chunks which you seem to want in your final output -- omit if not desired
        line = line.strip() # remove surrounding whitespace, including the pesky newlines
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
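Another option, once the newline is stripped, is to stop the loop early so that short substrings are never produced and no length check is needed. This is only a sketch: the helper names kmers and kmers_per_line are made up for illustration, and the sample list stands in for the lines of test.txt.

```python
def kmers(sequence, k=5):
    """Yield every length-k substring of sequence, left to right."""
    sequence = sequence.strip()  # drop the trailing newline first
    # The last valid start index is len(sequence) - k, so no short tail appears.
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def kmers_per_line(lines, k=5):
    """Return one list of k-mers per non-blank input line."""
    return [list(kmers(line, k)) for line in lines if line.strip()]


# Lines as they would come from iterating over a file, newline included.
sample = ["CGATGCATAGGAA\n", "GCAGGAGTGATCC\n"]
for chunk in kmers_per_line(sample):
    print(*chunk, sep="\n")
```

For a real file, pass the open file object instead of sample, since iterating over a file yields its lines one at a time.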

Related

How to get the last character in a file from Python?

I'm trying to set a variable to the last character of a file. I am using Python, and I'm fairly new to it. If it is of any importance, my code appends a random number between 2 and 9 to the end of an HTML file. In a separate function, I want to set the last character of the HTML file (the last character being the random number between 2 and 9) to a variable, then delete the last character (so as not to affect the function of the HTML). Does anyone know how I could do this? I can attach my code below if needed, but I chose not to as it is 50 lines long and all 50 lines are needed for full context.

Try this.
The file "a.txt" contains the numbers 1, 3, 4, 5.
The code below reads the file and pulls out its last character.
with open('a.txt','r') as file:
    lines = file.read()
print(lines[-1])
=> 5

Using @Jab's answer from the comment above as well as some assumptions, we can produce a more efficient solution for finding the last character and replacing it.
The assumptions that are made are common and most likely will be valid:
You will know whether there is a newline character at the very end of the file, or whether the random number is truly the last character in the file (meaning accounting for whitespace).
You know the encoding of the file. This is valid since almost all HTML is utf-8, (can be utf-16), and since you are the one editing it, you will know. Most times the encoding won't even matter.
So, this is what we can do:
with open("test.txt", "rb+") as f: # binary mode does not accept an encoding argument
    f.seek(-2, 2)
    # -1 or -2, may change depending on whitespace characters at end of the file
    var = f.read(1) # read one byte for a number
    f.seek(-1,1)
    print("last character:", str(var, 'utf-8'))
    f.write(bytes('variable', 'utf-8')) # set whatever info here
    f.write(bytes('\n', 'utf-8')) # you may want a newline character at the end of the file
    f.truncate()
This is efficient because we actually don't have to iterate through the entire file. We iterate through just the last character, once to read and once to write.
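If the goal is only to read the trailing digit and delete it, the same seek-from-the-end idea can be wrapped in a small function. Treat this as a sketch under assumptions: pop_last_byte is a made-up name, the character is assumed to be a single ASCII byte (such as a digit), and at most one trailing "\n" is assumed to follow it.

```python
import os


def pop_last_byte(path):
    """Remove and return the last character of the file at path.

    Assumes a single-byte (ASCII) character, optionally followed by
    exactly one trailing newline, which is removed as well.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size == 0:
            raise ValueError("file is empty")
        f.seek(-1, os.SEEK_END)
        last = f.read(1)
        if last == b"\n" and size > 1:
            # Step back over the newline to reach the real last character.
            f.seek(-2, os.SEEK_END)
            last = f.read(1)
            f.seek(-2, os.SEEK_END)
        else:
            f.seek(-1, os.SEEK_END)
        f.truncate()  # cut the file at the current position
        return last.decode("ascii")
```

Only the final one or two bytes are touched, so the cost does not depend on the size of the HTML file.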

You can do something like this:
# Open the file to read and the file to write
with open('file.txt') as f_in, open('new_file.txt', 'w+') as f_out:
    # Read all the lines to memory (you can't find the last line lazily)
    lines = f_in.readlines()
    # Iterate over every line
    for i, line in enumerate(lines):
        # If the current index is the last index (i.e. the last line)
        if i == len(lines) - 1:
            # Get the last character
            last_char = line[-1]
            # Write to the output file the line without the last character
            print(line[:-1], file=f_out, end='')
        else:
            # Write to the output file the line as it is
            print(line, file=f_out, end='')
# Print the removed char
print(last_char)
If you don't want to create a new file, you can load all the file to memory as we're currently doing:
# Read all the lines into memory
with open('file.txt') as f:
    lines = f.readlines()
# Replace the lines inside the list using the previous logic
for i, line in enumerate(lines):
    if i == len(lines) - 1:
        last_char = line[-1]
        lines[i] = line[:-1]
    else:
        lines[i] = line
# Write the changed lines to the same file
with open('file.txt', 'w+') as f:
    print(''.join(lines), file=f, end='')
# Print the removed char
print(last_char)

Remove first character in line from text only if it matches defined character

I am receiving TCP data into a file. The data is meant for a POS printer, so I need to strip control characters and other unwanted info. I have successfully stripped everything except the letter 'a'. However, I only need to strip that character when it is needed: not every line will begin with the letter 'a'. Essentially, I need to strip the letter 'a' from each line only if it is present as the first character. I don't need to strip every 'a' from the whole file.
Below is what I am doing but it is stripping every 'a' in the file.
unwanted_chars="[a]"
def Rema():
    with open('Output.txt') as f:
        lines=list(f)
    for k, line in enumerate(lines):
        for c in unwanted_chars:
            line=line.replace(c,'')
        lines[k]=line
    with open('Output.txt','w') as f:
        f.write('\n'.join(lines))
while True:
    Rema()

.replace() iterates through an entire string and replaces all instances of the input with the new value given, so in this case, as you stated, all 'a's are being removed.
Strings can be indexed just like lists in Python, so you could check if line[0] == 'a' and, if so, set the new line to be: line = line[1:]
Here is an example:
def Rema():
    with open('Output.txt') as f:
        lines=list(f)
    for k, line in enumerate(lines):
        for c in unwanted_chars:
            if line[0] == c:
                line = line[1:]
        lines[k]=line
This is very specific to removing the first letter if it is 'a'. If you want to check for other letters AS the first letter only this will work for a longer list in unwanted_chars. But if you wanted to go back and remove all instances of say "\n" as an example in a string you would again use .replace()
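To make the difference concrete, here is a small sketch that removes at most one leading character and leaves every other occurrence alone. The helper name strip_leading and the sample lines are invented for illustration; the unwanted argument plays the role of unwanted_chars above.

```python
def strip_leading(line, unwanted="a"):
    """Drop the first character of line if it is one of the unwanted characters."""
    if line and line[0] in unwanted:
        return line[1:]
    return line


# Sample lines as they would come from a file, newlines included.
lines = ["apple\n", "banana\n", "aardvark\n", "\n"]
cleaned = [strip_leading(line) for line in lines]
print(cleaned)  # ['pple\n', 'banana\n', 'ardvark\n', '\n']
```

Note that only the first 'a' of "aardvark" is removed, and the 'a' characters inside "banana" are untouched, which is what distinguishes this from .replace().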

If your printer doesn't like lines starting with an 'a' (for example), I'm guessing it's not going to like a line that started with 'aa' where you only remove the first 'a'.
How about using lstrip() for that:
def Rema():
    with open('Output.txt') as f:
        # Build a list so the file is fully read before it is reopened for writing
        lines=[line.lstrip('a') for line in f]
    with open('Output.txt','w') as f:
        f.write(''.join(lines)) # each line already ends with its own newline

Below is the answer. Many thanks to Darren
def Rema():
    with open('Output.txt') as f:
        lines=list(f)
    for k, line in enumerate(lines):
        for c in unwanted_chars:
            if line[0] == c:
                line = line[1:]
        lines[k]=line
    with open('Output.txt','w') as f:
        f.write('\r'.join(lines))

Need to count how many times "AGAT" "AATG" and "TATC" repeats in .txt file that has a DNA sequence

This is my first coding class and I'm having trouble getting the counter to increase every time one of the given sequences appears in the DNA sequence.
My code so far:
agat_Counter = 0
aatg_Counter= 0
tatc_Counter= 0
DNAsample = open('DNA SEQUENCE FILE.txt', 'r');
for lines in DNAsample:
    if lines in DNAsample=='AGAT':
        agat_Counter+=1
    else:
        agat_Counter+=0
print(agat_Counter)
for lines in DNAsample:
    if lines in DNAsample=='AATG':
        aatg_Counter+=1
    else:
        aatg_Counter+=0
print(aatg_Counter)
for lines in DNAsample:
    if lines in DNAsample=='TATC':
        tatc_Counter+=0
    else:
        tatc_Counter+=0
print(tatc_Counter)

You can do this in many ways. One of the simplest is the following:
with open('DNA SEQUENCE FILE.txt', 'r') as f:
    DNAsample = f.read()
agat_Counter = DNAsample.count('AGAT')
aatg_Counter= DNAsample.count('AATG')
tatc_Counter= DNAsample.count('TATC')
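Two details of str.count are worth keeping in mind here: it counts non-overlapping occurrences, and it will not find a motif that is split across a line break. The sketch below assumes the file holds one continuous sequence wrapped over several lines, so whitespace is removed before counting; count_motifs and the sample string are made up for illustration.

```python
def count_motifs(dna, motifs=("AGAT", "AATG", "TATC")):
    """Return a dict mapping each motif to its non-overlapping count in dna."""
    # Assumption: line breaks are only wrapping, so join everything together.
    dna = "".join(dna.split())
    return {motif: dna.count(motif) for motif in motifs}


# Stand-in for the text read from the DNA file.
sample = "AGATAGAT\nAATGTATC\nTATC"
print(count_motifs(sample))  # {'AGAT': 2, 'AATG': 1, 'TATC': 2}
```

If overlapping matches matter for your assignment, str.count is not enough on its own; for example, "AAAA".count("AA") is 2, not 3.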

This should work. The issue is with your if statements. In addition, once you iterate through the file once, the file pointer is at the end (I think), so the later loops won't go back through it. The code below iterates through the file one line at a time and compares each line to the 4-character sequences. Note that .strip() removes the trailing \n and/or \r characters from the line variable as the file is iterated through.
In general, when opening files it is best to use with open(filename, mode) as var: as shown below. This handles closing the file once it is done and eliminates the risk of unclosed file handles.
Assumption based on original code is that the DNA SEQUENCE FILE.txt file is organized as such:
AGAT
AATG
...
agat_Counter = 0
aatg_Counter= 0
tatc_Counter= 0
with open('DNA SEQUENCE FILE.txt', 'r') as DNAsample:
    for line in DNAsample:
        strippedLine = line.strip()
        if strippedLine == 'AGAT':
            agat_Counter += 1
        elif strippedLine == 'AATG':
            aatg_Counter += 1
        elif strippedLine == 'TATC':
            tatc_Counter += 1
print(agat_Counter)
print(aatg_Counter)
print(tatc_Counter)

Extracting gene sequences from FASTA File?

I have the following code that reads a FASTA file with 10 gene sequences and returns each sequence as a matrix.
However, the code seems to miss the very last sequence, and I wonder why.
file=open('/Users/vivianspro/Downloads/rosalind_cons (5).txt', 'r')
line=file.readline()
strings = []
sequence=''
while line:
    #line=line.rstrip('\n')
    line = line.strip() #empty () automatically strips the \n
    if '>' in line:
        if sequence != "":
            strings.append(sequence)
            sequence = ""
        #sequence=line
    else:
        sequence+=line
    line=file.readline()
for s in strings:
    print(s)
Motifs = []
for seq in strings:
    Motifs.append(list(seq))
#make every symbol into an element in the list separated by ,
for s in Motifs:
    print(s)

You only append to strings when you see a new > but there isn't one after the last sequence.
Here is a refactoring which will hopefully also be somewhat more idiomatic.
strings = []
sequence=''
with open('/Users/vivianspro/Downloads/rosalind_cons (5).txt', 'r') as file:
    for line in file:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if sequence != "":
                strings.append(sequence)
                sequence = ""
        else:
            sequence+=line
# After the last iteration, append once more if we have something to append
if sequence:
    strings.append(sequence)
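If you also want to keep the header that goes with each sequence, the same "flush at the next > and once more at the end" logic can be written as a generator. This is only a sketch: read_fasta is a made-up name, and the sample list stands in for the lines of the FASTA file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # The last record has no following '>' line, so emit it here.
    if header is not None:
        yield header, "".join(chunks)


sample = [">ID1\n", "ACGT\n", "TTAA\n", ">ID2\n", "GGCC\n"]
for name, seq in read_fasta(sample):
    print(name, seq)
```

Passing an open file object instead of sample works the same way, and list(seq) still turns each sequence into a list of symbols for the matrix.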

Since FASTA files contain the data in such a format:
>ID1
seq_1
>ID2
seq_2
...
According to your code, you only append the sequence when the line contains a >. That means you are adding the sequence for ID1 when you are iterating over the line for ID2, and the last sequence is never added.
To resolve this, you can do something like this:
for line in file:
    line = line.strip()
    if '>' in line: # Line 1
        line = file.readline().strip()
        # print(line)
        strings.append(line)
The above example uses the fact that in a FASTA file, the sequence comes directly after the ID, which contains the > character (you can change Line 1 so that it just checks the first char, line[0] == ">"). Note that it reads only one line per ID, so it assumes each sequence fits on a single line.

IndexError: string index out of range, line is empty?

I'm trying to make a program that takes a letter the user inputs, reads a text file and then prints the words that start with that letter.
item = "file_name"
letter = raw_input("Words starting with: ")
letter = letter.lower()
found = 0
with open(item) as f:
    filelength = sum(1 for line in f)
    for i in range (filelength):
        word = f.readline()
        print word
        if word[0] == letter:
            print word
            found += 1
print "words found:", found
I keep receiving the error
"if word[0] == letter: IndexError: string index out of range"
with no lines being printed. I think this is what happens if there's nothing there, but there are 50 lines of random words in the file, so I'm not sure why it is being read this way.

You have two problems:
You are trying to read the whole file twice (once to determine the filelength, then again to get the lines themselves), which won't work; and
You aren't dealing with empty lines, so if any are introduced (e.g. if the last line is blank) your code will break anyway.
The easiest way to do this is:
found = 0
with open(item) as f:
    for line in f: # iterate over lines directly
        if line and line[0] == letter: # skip blank lines and any that don't match
            found += 1
print "words found:", found
if line skips blanks because empty sequences are false-y, and the "lazy evaluation" of and means that line[0] will only be tried where the line isn't empty. You could alternatively use line.startswith(letter).

When you use sum(1 for line in f) you are already consuming all the lines in your file, so now your handle points to the end of the file. Try using f.seek(0) to return the read cursor to the start of the file.
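Here is a small sketch of the seek(0) approach, written for Python 3 (the question uses Python 2 print statements). The function name is made up, and io.StringIO stands in for a real file opened with open(item).

```python
import io


def count_words_starting_with(f, letter):
    """Return (total_lines, matching_lines) for an open text file f."""
    total = sum(1 for _ in f)  # this consumes the file: the cursor is now at the end
    f.seek(0)                  # rewind to the start before reading again
    found = sum(1 for line in f if line.lower().startswith(letter))
    return total, found


# In-memory stand-in for the word file, including one blank line.
f = io.StringIO("apple\nBanana\n\navocado\n")
print(count_words_starting_with(f, "a"))  # (4, 2)
```

Because startswith is used instead of line[0], a blank line simply fails the test rather than raising IndexError.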
