Python - Count characters between two specific strings - python

I made a text file containing random sequences of bases (ATCG) and want to find the longest and shortest "reading frame" within those sequences.
I was able to identify the start and stop codons (the two "specific strings" mentioned) with "searchfile" and a for-loop, and I also know the basics of counting (example code at the end), but I can't find a way to set those two as "boundaries" between which I can count.
Can anybody give me a hint, or tell me what such a function/operation is called so I can at least look it up in the documentation, or show what it could look like? I found many ways to count various things, but none for counting between "x" and "y".
Example of how I looked up the strings between which I want to count:
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()
whole code:
import numpy as np
BASES = ('A', 'C', 'T', 'G')
P = (0.25, 0.25, 0.25, 0.25)
def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))
with open('dna.txt', 'w+') as txtout:
    for _ in range(10):
        dna = random_dna_sequence(50)
        txtout.write(dna)
        txtout.write("\n")
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
    elif "TAG" in line: print (line)
    elif "TAA" in line: print (line)
    elif "TGA" in line: print (line)
    else: print ("no stop-codon detected")
searchfile.close()
Sidenote: The print instruction is only a temporary placeholder for testing. In the end I would like to use the found strings as the "boundaries" mentioned above (I can't find a better name for them).
Some example lines from the dna.txt file:
GAAGACGCAATAGGTTCACGGCGCTCATAGGCTTGCCCTCATAGGGCTTG
TCTGAGGTAGAAGGAGCTACTGCCGTTGCAGGTGACGCCCACAGTCCTGA
GTTATTACTCCCTGACTGTCATCTGTTCGGATACCGTGCAGCGCATCGAG
AGGAGATAACGCGATCCTGAGACAGTTTACCTATATGTTCACTACGCATG
CCGAGCTGATCCGACTACTGAAGGTGAATTCTGAAGCTAATCTGCAGTTC
This is a small example (I use 10 and 50 for testing) but in the end the file shall contain 10000 sequences with 1000 characters each.

What I would do is something like this:
with open("dna.txt", 'r') as searchfile:
    all_dna = searchfile.read()
start = all_dna.index("ATG")
rem_dna = all_dna[start + 3:]
end = rem_dna.index("ATG")  # offset within rem_dna, not within all_dna
needed_dna = all_dna[start:start + 3 + end + 3]  # both "ATG" markers included
print(len(needed_dna))
index finds where in a string the substring passed as an argument occurs, and raises ValueError if the substring is not found. with is a keyword useful as a safety precaution for file I/O: it ensures that the file is properly closed even if the code inside that block raises an error. If you don't want to include the starting and ending "ATG" in needed_dna, you can set it to rem_dna[:end] instead. The brackets, by the way, mean "take the substring of the specified string beginning at the index before the colon (inclusive, zero-indexed) and ending at the index after the colon (non-inclusive, also zero-indexed)". This also works for lists, and brackets without the colon give the character at a specific index. Hope this helps!
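For the original goal (longest and shortest reading frame), the same find-and-slice idea can be applied line by line. The following is only a minimal sketch under stated assumptions: a "reading frame" is taken to mean the stretch from an ATG to the first stop codon (TAA, TAG or TGA) that lies a multiple of three bases later, and the names STOP_CODONS, reading_frame_lengths and longest_and_shortest are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed stop codons, as in the question


def reading_frame_lengths(seq):
    """Return the length (in bases) of every ATG...stop stretch in seq.

    Assumption: the stop codon must be in frame with the ATG, and both
    the start and the stop codon are included in the length.
    """
    lengths = []
    start = seq.find("ATG")
    while start != -1:
        # Step through the codons that follow this ATG, three bases at a time.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                lengths.append(i + 3 - start)
                break
        start = seq.find("ATG", start + 1)
    return lengths


def longest_and_shortest(sequences):
    """Return (longest, shortest) frame length over all sequences, or None."""
    all_lengths = []
    for seq in sequences:
        all_lengths.extend(reading_frame_lengths(seq.strip()))
    if not all_lengths:
        return None
    return max(all_lengths), min(all_lengths)


# Example with in-memory sequences instead of dna.txt.
print(longest_and_shortest(["CCATGAAATAGCC", "ATGTAA", "CCCC"]))
```

Because the function strips each sequence, the lines of dna.txt can be passed in directly (for example the open file object inside a with block), which also avoids counting newline characters as bases.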

Related

How get only 5mers from a sequence

I have a file that contains millions of sequences. What I want to do is to get 5mers from each sequence in every line of my file.
My file looks like this:
CGATGCATAGGAA
GCAGGAGTGATCC
my code is:
with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
With this code I should not get 4-mers, but I do, even though I have an if statement checking for a length of 5. Could anyone help me with this? Thanks
My output is:
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC
but the ideal output should be only the one with length equal to 5 (for each line separately):
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
When iterating through a file, every character is represented somewhere. In particular, the last character for each of those lines is a newline \n, which you're printing.
with open('test.txt') as f: data = list(f)
# data[0] == 'CGATGCATAGGAA\n'
# data[1] == 'GCAGGAGTGATCC\n'
So the very last substring you're trying to print from the first line is 'GGAA\n', which has a length of 5, but it's giving you the extra whitespace and the appearance of 4mers. One of the comments proposed a satisfactory solution, but when you know the root of the problem you have lots of options:
with open('test.txt', 'r') as file:
    for line_no, line in enumerate(file):
        if line_no: print() # for the space between chunks which you seem to want in your final output -- omit if not desired
        line = line.strip() # remove surrounding whitespace, including the pesky newlines
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
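Another option, once the newline is stripped, is to stop the loop early so that no short substring is ever produced and the length check becomes unnecessary. A minimal sketch, with the helper name kmers chosen here purely for illustration:

```python
def kmers(seq, k=5):
    """Return every substring of length k in seq, in order."""
    seq = seq.strip()  # drop the trailing newline and surrounding whitespace
    # The last valid start index is len(seq) - k, so no short slices appear.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# The two sample lines from the question, as they would be read from the file.
for line in ["CGATGCATAGGAA\n", "GCAGGAGTGATCC\n"]:
    print(kmers(line))
```

If a line is shorter than k, the range is empty and the function simply returns an empty list.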

find end of line after another in text file in Python

Issue
Hello all,
In a text file I need to replace an unknown string with another one.
To find it, I first need to find the line before it, 'name Blur2',
as there are many lines beginning with 'xpos':
name Blur2
xpos 12279 # 12279 is the end of line to find and put in a variable
Code to get the unknown string:
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("input_file.txt", 'r+') as f1:
    lines = f1.readlines()
    for i in range(0, len(lines)):
        line = lines[i]
        if keyString in line:
            nextLine = lines[i + 1]
            print ' nextLine: ',nextLine #result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            print ' number: ',number #result: number: 12279
            #convert string to float:
            newString = '{0}\n'.format(int(number)+ 10)
            print ' newString: ',newString #result: newString: 12289
            f2.write("".join([nextLine.replace(number, str(newString))])) #this line isn't working
f1.close()
f2.close()
So I completely changed my method, but the last line, f2.write..., isn't working as expected. Does anyone know why?
Thanks again for your help :)
Regex seems like it would help; see https://regex101.com/.
A regex searches a string using a small language that defines a pattern. I've listed the parts that matter most for understanding this pattern; regex is sometimes a better alternative than Python's native string manipulation.
You first describe the pattern, then compile it. I defined the pattern as a raw string using r''. This means I don't have to escape each \ within the string (example: the pattern \s can be written r'\s' instead of '\\s').
There are a couple of parts to this regex.
\s matches whitespace (characters like a space, ' ').
\n and \r match newline and carriage return, [^...] defines which characters not to match (so [^\n\r] matches any character that is not a newline or carriage return), and * indicates 0 or more of the preceding item. $ matches at the end of a line (when re.MULTILINE is set).
So the pattern searches for 'name Blur2' specifically, followed by any amount of whitespace and a newline. The parentheses make this group 1 (explained later). The second part, '([^\n\r]*$)', captures any number of characters that aren't a newline or carriage return, up to the end of that line.
Groups correspond to the parentheses, so '(name Blur2\s*\n)' is group 1, and the line you want replaced, '([^\n\r]*$)', is group 2. checkre.sub should replace the whole match with group 1 and the new string, so it replaces the first line with itself and the second line with your new string.
import re
check = r'(name Blur2\s*\n)([^\n\r]*$)'
checkre = re.compile(check, re.MULTILINE)
# filestring holds the file contents; newstring is the replacement line
newfilestring = checkre.sub(r'\g<1>' + newstring, filestring)
You need to set re.MULTILINE so that $ matches at the end of each line rather than only at the end of the whole string. If the '\n' isn't matched (for example because the file uses Windows line endings), you could use \r?\n instead.
rioV8's comment works, but you could also use '.{5}$', which accounts for any 5 characters before the end of the line. It could be helpful within a re
It should be possible to get the old string with
oldstring = checkre.search(filestring).group(2)
I have not played with span yet, but
stringmatch = checkre.search(filestring)
oldstring = stringmatch.group(2)
newfilestring = filestring[:stringmatch.span()[0]] + stringmatch.group(1) + newstring + filestring[stringmatch.span()[1]:]
should be pretty close to what you're looking for, although the slice may not be exactly correct.
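To make the regex approach concrete, here is a small self-contained sketch. It is not the asker's real data: the sample text in filestring is hypothetical, and the helper name bump_xpos is invented for this example. It passes a function to sub so the new number can be computed from the matched line.

```python
import re

# Hypothetical file contents, held in a string for illustration.
filestring = (
    "name Blur1\n"
    "xpos 100\n"
    "name Blur2\n"
    "xpos 12279\n"
    "name Blur3\n"
    "xpos 7\n"
)

checkre = re.compile(r'(name Blur2\s*\n)([^\n\r]*$)', re.MULTILINE)


def bump_xpos(match):
    """Rebuild the matched text with the trailing number increased by 10."""
    label, number = match.group(2).rsplit(' ', 1)
    return match.group(1) + '{0} {1}'.format(label, int(number) + 10)


newfilestring = checkre.sub(bump_xpos, filestring)
print(newfilestring)
```

Only the xpos line directly after 'name Blur2' changes; the other xpos lines are left alone because the pattern anchors on the preceding line.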
The initial program was pretty close. I edited it a little to fix a few things that were wrong.
You weren't writing the unchanged lines to the output file, and I'm not sure why you needed to join things; just replacing the number directly seemed to work. Reassigning i inside a for loop over range() has no effect on the next iteration, and you need to skip one line so it isn't written to the file twice, so I changed it to a while loop. Anyway, ask any questions you have, but the code below seems to work.
# string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("test.txt", 'r+') as f1:
    lines = f1.readlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if keyString in line:
            f2.write(line)
            nextLine = lines[i + 1]
            # end of necessary 'i' calls; increment i so the replaced line is not written twice
            i += 1
            print(' nextLine: ', nextLine)  # result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            # as was said in a comment, this could also be number = nextLine[-5:]
            print(' number: ', number)  # result: number: 12279
            # convert string to int and add 10:
            newString = '{0}\n'.format(int(number) + 10)
            print(' newString: ', newString)  # result: newString: 12289
            f2.write(nextLine.replace(number, str(newString)))  # write the updated xpos line
        else:
            f2.write(line)
        i += 1
# f1 is closed automatically by the with block
f2.close()

Print full sequence not just first line | Python 3.3 | Print from specific line to end (")

I am attempting to pull out multiple (50-100) sequences from a large .txt file separated by new lines ('\n'). Each sequence is a few lines long, but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am using Python 3.3.
This is what I have so far:
searchfile = open('filename.txt' , 'r')
cache = []
for line in searchfile:
    cache.append(line)
for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is always 5 lines below the keyword line), but only that one line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data but if you're searching for each 'keyword' then the next " char then the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)
for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re
with open('yourfile.txt') as f:
    lines = f.readlines()
for i,line in enumerate(lines):
    #first search for keyword
    key_match = re.search(r'\((keyword)',line)
    if key_match:
        #if successful search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"',lines[i+5])
        if seq_match:
            print(key_match.group(1) +' '+ seq_match.group(1))
This can be done rather simply with regex
import re
lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    # capture the text between the first " and the next "
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
To use this sample code on the real dataset, we would eventually set lines = f.readlines(). It's important to note that we only catch text between a " and another ". If there is no " mark at the end, we will miss that data, but accounting for that isn't too difficult.
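Since the question says a sequence may run over several lines, here is a rough sketch that keeps collecting lines until the closing " has been seen. It assumes the layout shown in the question (the quoted text starts 5 lines below the keyword line); the function name extract_sequences and the sample records are hypothetical.

```python
def extract_sequences(lines, keyword, offset=5):
    """Return the quoted text that starts `offset` lines below each keyword line."""
    sequences = []
    for i, line in enumerate(lines):
        if keyword not in line.lower():
            continue
        collected = []
        quotes = 0
        # Keep adding lines until both the opening and closing " have been seen.
        for seq_line in lines[i + offset:]:
            collected.append(seq_line.strip())
            quotes += seq_line.count('"')
            if quotes >= 2:
                break
        text = ''.join(collected)
        first, last = text.find('"'), text.rfind('"')
        if first != -1 and last > first:
            sequences.append(text[first + 1:last])
    return sequences


# Hypothetical records in the layout from the question.
sample = [
    'Name (keyword1):', 'Date', 'Address1', 'Address2', 'Sex',
    'Response"ABCB', 'DBEB', 'SOSO"', 'Y/N',
    'Name (keyword1):', 'Date', 'Address1', 'Address2', 'Sex',
    'Response"XYZ"', 'Y/N',
]
print(extract_sequences(sample, 'keyword1'))
```

Counting quote characters, rather than checking only the end of a line, is a design choice made here so that a sequence on a single line and a sequence spread over several lines are handled the same way.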

Python Programming for .json Loggly files

I want to search for particular strings in a long .json Loggly file, report the line number of each match, and print the 5 lines above and below the matching line. Can you please help me?
It is always returning "NOT FOUND".
After that, I am now only getting some output with the program shown below.
with open('logg.json', 'r') as f:
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            i = ln-25
            for i in (ln-25,ln+25):
                l = linecache.getline('logg.json', i)
                i+=1
                print(ln,l)
            print(" Next Error")
file.readlines() returns a list of lines. Each line includes its trailing newline (\n).
You need to include the newline to match the line:
ln = data.index("error CRASHLOG\n")
If you want to find a line that contains a target string, you need to iterate over the lines:
with open('logg.json', 'r') as f:
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            # You now know the line index (`ln`)
            # Print above/below 5 lines here.
            break
    else:
        print("Not Found")
BTW, this kind of work is easily done with grep(1):
grep -C 5 'error CRASHLOG' logg.json || echo 'Not Found'
UPDATE
Following is more complete code:
from collections import deque
from itertools import islice, chain
import sys
with open('logg.json', 'r') as f:
    last_lines = deque(maxlen=5) # contains the last (up to) 5 lines.
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            # previous 5 lines, the matching line, then the next 5 lines
            sys.stdout.writelines(chain(last_lines, [line], islice(f, 5)))
            break # stop at the first match so the else branch is skipped
        last_lines.append(line)
    else:
        print("Not Found")
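If every match should be reported together with its line number, one simple alternative is to read all lines into a list and slice around each hit. This is only a sketch for files that fit in memory; the function name find_with_context and the sample log lines are invented for illustration.

```python
def find_with_context(lines, target, context=5):
    """Return (line_number, surrounding_lines) for every line containing target."""
    results = []
    for ln, line in enumerate(lines):
        if target in line:
            lo = max(0, ln - context)               # do not run past the start
            hi = min(len(lines), ln + context + 1)  # do not run past the end
            results.append((ln + 1, lines[lo:hi]))  # 1-based line number
    return results


# Hypothetical log lines; for a real file, use f.readlines() inside a with block.
lines = ["line %d\n" % n for n in range(1, 21)]
lines[10] = "2014-01-01 error CRASHLOG in module\n"
for number, block in find_with_context(lines, "error CRASHLOG"):
    print("match on line", number)
    print("".join(block), end="")
```

Unlike the deque version, this keeps the whole file in memory, but overlapping matches each get their full context.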
I'm sure it actually returns "Not found", but I put that down to anxiety-induced shoutiness.
data is a list. The documentation about the list type (http://docs.python.org/2/tutorial/datastructures.html) states that list.index(x) returns "the index in the list of the first item whose value is x. It is an error if there is no such item."
Therefore the only lines that would be reported are those that contain ONLY the string you specify with no other characters. As falsetru points out in her/his answer, if there is no other information on the log lines then you must include for comparison the newline that readlines() ensures is at the end of every line in the list it returns (even on a Windows system as long as you open the file in text mode, which is the default). Without that there is no chance of a match.
If the lines contain other information then a better test might indeed be to use x in string as she/he suggests, but I suspect you might be interested to see how much more processing it takes than a simple equality test. Come to that so would I, but this isn't my problem ...
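The difference between the equality test done by list.index and a substring test can be seen in a short example. The log lines below are made up for illustration.

```python
# Hypothetical lines, as readlines() would return them (newline included).
lines = ["boot ok\n", "error CRASHLOG\n", "2014-01-01 error CRASHLOG in module\n"]

# list.index only finds an element equal to the whole string, newline included.
exact_index = lines.index("error CRASHLOG\n")

# Without the newline no element is equal, so ValueError is raised.
try:
    lines.index("error CRASHLOG")
    missing_newline_found = True
except ValueError:
    missing_newline_found = False

# A substring test finds every line that contains the text.
substring_hits = [i for i, line in enumerate(lines) if "error CRASHLOG" in line]

print(exact_index, missing_newline_found, substring_hits)
```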

Using python to search a text file for the occurence of specific characters

My question is similar to this one, except that I want to search for the occurrence of multiple chars, for example g, d and e, and then print the line in which ALL the specified characters exist.
I have tried the following but it didn't work:
searchfile = open("myFile.txt", "r")
for line in searchfile:
    if ('g' and 'd') in line: print line,
searchfile.close()
I was getting lines which had EITHER 'g' or 'd' or both in them; all I want is lines with both occurrences, not at least one of them, which is what the above code gives.
if set('gd').issubset(line):
This has the advantage of going through the line only once, whereas each separate check of c in line iterates through the entire line.
This line:
if ('g' and 'd') in line:
is the same as
if 'd' in line:
because
>>> 'g' and 'd'
'd'
You want
if 'g' in line and 'd' in line:
or, better:
if all(char in line for char in 'gde'):
(You could use set intersection too, but that's less generalizable.)
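Put together as a small helper, the check might look like the sketch below. The function name lines_with_all_chars and the sample words are only illustrative.

```python
def lines_with_all_chars(lines, chars):
    """Return the lines that contain every character in chars."""
    required = set(chars)
    # issubset accepts any iterable, so a string is checked character by character.
    return [line for line in lines if required.issubset(line)]


# Hypothetical input lines.
print(lines_with_all_chars(["glade", "dog", "edge", "bad"], "gde"))
```

Using all(char in line for char in chars) inside the list comprehension would select the same lines.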
Regular expressions will certainly help you when it comes to pattern matching, but it seems that your search is simpler than that. Try the following:
# in_data, a list of all lines to be queried (here read from the file)
with open("myFile.txt", "r") as searchfile:
    in_data = searchfile.readlines()
# search each line, and print the lines which contain all your search terms
for line in in_data:
    if ('g' in line) and ('d' in line) and ('e' in line):
        print(line)
Something this simple should work. I am making a few assumptions here:
1. the order of the search terms does not matter
2. upper / lower case is not dealt with
3. the frequency of the search terms is not considered.
Hope it helps.
