find end of line after another in text file in Python - python

Issue
Hello all,
in a text file I need to replace an unknown string with another one.
To find it, I first need to find the line before it, 'name Blur2',
because many lines begin with 'xpos':
name Blur2
xpos 12279 # 12279 is the end of the line, which I want to find and put in a variable
Code to get the unknown string:
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("input_file.txt", 'r+') as f1:
    lines = f1.readlines()
    for i in range(0, len(lines)):
        line = lines[i]
        if keyString in line:
            nextLine = lines[i + 1]
            print ' nextLine: ',nextLine #result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            print ' number: ',number #result: number: 12279
            #convert string to int and add 10:
            newString = '{0}\n'.format(int(number)+ 10)
            print ' newString: ',newString #result: newString: 12289
            f2.write("".join([nextLine.replace(number, str(newString))])) #this line isn't working
f1.close()
f2.close()
So I completely changed my method, but the last line (f2.write...) isn't working as expected. Does anyone know why?
Thanks again for your help :)

Regex seems like it would help; https://regex101.com/ is useful for trying patterns out.
A regex searches a string using a small language that describes a pattern. I list the pieces needed to read this pattern below; regex is sometimes a better alternative to Python's native string manipulation.
You first describe the pattern, then compile it. I defined the pattern as a raw string using r''. This means backslashes don't have to be escaped inside the string (example: the regex \s can be written r'\s' instead of '\\s').
There are a few parts to this regex.
\s matches whitespace (characters like space, ' ').
\n and \r match newline and carriage return. [^...] lists characters not to match (so [^\n\r] matches any character that is not a newline or carriage return), and * means 0 or more of the preceding item. $ matches at the end of a line (with re.MULTILINE).
So the pattern searches for 'name Blur2' followed by any amount of whitespace and a newline. The parentheses make this group 1 (explained below). The second part, '([^\n\r]*$)', captures any number of characters that aren't a newline or carriage return, up to the end of that line.
Each pair of parentheses is a group, so '(name Blur2\s*\n)' is group 1, and the line you want replaced, '([^\n\r]*$)', is group 2. checkre.sub replaces the whole match with group 1 and the new string, so the first line is replaced with itself and the second line is replaced with your new string.
import re
check = r'(name Blur2\s*\n)([^\n\r]*$)'
checkre = re.compile(check, re.MULTILINE)
# filestring holds the file contents, newstring the replacement line
newfilestring = checkre.sub(r'\g<1>' + newstring, filestring)
You need to set re.MULTILINE so that $ matches at the end of every line, not only at the end of the whole string. If the '\n' isn't matched, you could use [\n\r] to match either a newline or a carriage return, or \Z (outside a character class) for the absolute end of the string.
rioV8's comment works, but you could also use '.{5}$', which matches any 5 characters before the end of the line. It could be helpful within a regex.
It should be possible to get the old string with
oldstring = checkre.search(filestring).group(2)
I have not played with span much, but
stringmatch = checkre.search(filestring)
oldstring = stringmatch.group(2)
newfilestring = filestring[:stringmatch.span()[0]] + stringmatch.group(1) + newstring + filestring[stringmatch.span()[1]:]
should be pretty close to what you're looking for, although the slice may not be exactly right.
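For reference, here is a minimal sketch of the same idea that also does the "+ 10" arithmetic from the question inside the substitution. The function name bump_xpos and its parameters are made up for illustration, and it assumes the line after the key line has the form "xpos <integer>".

```python
import re

def bump_xpos(text, key="name Blur2", delta=10):
    # Group 1: the key line plus the "xpos " prefix of the following line.
    # Group 2: the number at the end of that following line.
    pattern = re.compile(
        r'(' + re.escape(key) + r'[ \t]*\r?\n[ \t]*xpos[ \t]+)(\d+)'
    )

    def repl(match):
        # Keep group 1 unchanged and write back the incremented number.
        return match.group(1) + str(int(match.group(2)) + delta)

    return pattern.sub(repl, text)

sample = " name Blur1\n xpos 100\n name Blur2\n xpos 12279\n"
result = bump_xpos(sample)
print(result)
```

Passing a function as the replacement avoids building the new string beforehand; only the xpos line directly after the key line is changed.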

The initial program was pretty close. I tweaked a few things that were wrong.
You weren't writing the unchanged lines to the output file, and I'm not sure why you needed to join things. Just replacing the number directly seemed to work. Reassigning i inside a for loop over range() has no effect on the next iteration, and you need to skip one line so it isn't written to the file twice, so I changed it to a while loop. Anyway, ask any questions you have, but the code below seems to work.
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("test.txt", 'r+') as f1:
    lines = f1.readlines()
    i=0
    while i <len(lines):
        line = lines[i]
        if keyString in line:
            f2.write(line)
            nextLine = lines[i + 1]
            #end of necessary 'i' calls, increment i to avoid writing the replaced line twice
            i+=1
            print (' nextLine: ',nextLine )#result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            #as was said in a comment, this could also be number = nextLine[-5:]
            print (' number: ',number )#result: number: 12279
            #convert string to int and add 10:
            newString = '{0}\n'.format(int(number)+ 10)
            print (' newString: ',newString) #result: newString: 12289
            f2.write(nextLine.replace(number, str(newString))) #write the replaced line
        else:
            f2.write(line)
        i+=1
f1.close()
f2.close()
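The same loop can be sketched without index bookkeeping by remembering whether the previous line contained the key. This is only an illustration: bump_xpos_after is a hypothetical helper, and it assumes the number is the last space-separated field on the xpos line.

```python
def bump_xpos_after(lines, key="name Blur2", delta=10):
    out = []
    bump_next = False  # True when the previous line contained the key
    for line in lines:
        if bump_next and line.strip().startswith("xpos"):
            # Split off the trailing number and write it back incremented.
            prefix, number = line.rstrip("\n").rsplit(" ", 1)
            line = "{0} {1}\n".format(prefix, int(number) + delta)
        bump_next = key in line
        out.append(line)
    return out

sample_lines = [" name Blur2\n", " xpos 12279\n", " xpos 5\n"]
new_lines = bump_xpos_after(sample_lines)
```

The returned list can be passed straight to f2.writelines(...).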

Related

Python: Search and Replace but ignore commented lines

I actually want to do a search and replace but ignore all my commented lines, and I also just want to replace only the first found...
input-file.txt
#replace me
#replace me
replace me
replace me
...like with:
text = text.replace("replace me", "replaced!", 1) # with max. 1 rep.
But I'm not sure how to approach(ignore) those comments. So that I get:
#replace me
#replace me
replaced!
replace me
As I see it, the existing solutions have one or more of several problems:
Incomplete (e.g. requiring match on start of line)
Incomplete (e.g. requiring match not containing \n)
Clunky (e.g. looong file-based solutions)
I'm pretty sure a pure-regex solution would require variable-width lookbehinds, which the re module doesn't support (though I think the regex module does). With a small tweak though, regex can still provide a fairly clean answer.
import re
i = re.search(r'^([^#\n]?)+replace me', string_to_replace, re.M).start()
replaced_string = ''.join([
    string_to_replace[:i],
    re.sub(r'replace me', 'replaced!', string_to_replace[i:], count=1, flags=re.M),
])
The idea is that you find the first uncommented line containing the start of your match, and then you replace the first instance of 'replace me' that you find starting on that line. The ^([^#\n]?)+ bit in the regex says
^ -- Find the start of a line.
([^#\n]?)+ -- Match ([^#\n]?) one or more times before matching the rest of the expression (in effect, any run of characters that are neither # nor \n).
([^#\n]?) -- Find 0 or 1 of [^#\n].
[^#\n] -- Find anything that's not # or \n.
Note that we're using raw strings r'' to avoid double-escaping backslashes when writing the regex, and we're using re.M so that ^ matches at the start of every line.
Note that the behavior is a bit weird if the string you're trying to replace contains the pattern \n#. In that case, you'll wind up replacing part or all of one or more commented lines, which may not be what you want. Considering the problems with the alternatives, I'd be inclined to say the alternatives are all wrong approaches.
If that's not what you want, excluding all commented lines gets doubly weird because of some uncertainty in how they'd get merged back together. For example, consider the following input file.
#comment 1
replace
#comment 2
me
replace
me
What happens if you want to replace the string replace\nme? Do you exclude the first match because \n#comment 2 is stuck in between? If you use the first match, where does \n#comment 2 go? Does it go before or after the replacement? Is the replacement multiple lines as well so that it can still get sandwiched in? Do you just delete it?
Have a flag that marks whether you have completed the replacement yet. Then only replace while that flag is true and the line is not a comment:
not_yet_replaced = True
with open('input-file.txt') as f:
    for l in f:
        if not_yet_replaced and not l.startswith('#') and 'replace me' in l:
            l = l.replace('replace me', 'replaced!', 1)
            not_yet_replaced = False
        print(l, end='')
You can use a break after the first occurrence like so:
with open('input.txt', 'r') as f:
    content = f.read().split('\n')
for i in range(len(content)):
    if content[i] == 'replace me':
        content[i] = 'replaced'
        break
with open('input.txt', 'w') as f:
    content = ('\n').join(content)
    f.write(content)
Output :
(xenial)vash@localhost:~/python/stack_overflow$ cat input.txt
#replace me
#replace me
replaced
replace me
If the input file is not very big, you can read it into memory as a list of lines. Then iterate over the lines and replace the first matching one. Then write the lines back to the file:
with open('input-file.txt', 'r+') as f:
    lines = f.readlines()
    substr = 'replace me'
    for i in range(len(lines)):
        if lines[i].startswith('#'):
            continue
        if substr in lines[i]:
            lines[i] = lines[i].replace(substr, 'replaced!', 1)
            break
    f.seek(0)
    f.truncate()
    f.writelines(lines)
I'm not sure whether or not you have managed to get the text out of the file, so you can do that by doing
f = open("input-file.txt", "r")
text = f.read()
f.close()
Then the way I would do this is first split the text into lines like so
lines = text.split("\n")
then do the replacement on each line, checking it does not start with a "#"
for index, line in enumerate(lines):
    if len(line) > 0 and line[0] != "#" and "replace me" in line:
        lines[index] = line.replace("replace me", "replaced!")
        break
then stitch the lines back together.
new_text = "\n".join(lines)
hope this helps :)
Easiest way is to use a multiline regex along with its sub() method and giving it a count of 1:
import re
r = re.compile("^replace me$", re.M)
s = """
#replace me
#replace me
replace me
replace me
"""
r.sub("replaced!", s, 1)
Gives
#replace me
#replace me
replaced!
replace me
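If the text to replace does not fill the whole line, a variation on this multiline regex can skip commented lines with a negative lookahead. This is a sketch rather than a drop-in answer; replace_first_uncommented is a hypothetical name, and it assumes a comment is any line whose first character is #.

```python
import re

def replace_first_uncommented(text, old, new):
    # ^      : start of a line (because of re.M)
    # (?!#)  : the line must not begin with '#'
    # (.*?)  : anything on that line before the target ('.' stops at newlines)
    pattern = re.compile(r'^(?!#)(.*?)' + re.escape(old), re.M)
    # A function replacement keeps backslashes in `new` literal.
    return pattern.sub(lambda m: m.group(1) + new, text, count=1)

sample = "#replace me\n#replace me\nreplace me\nreplace me\n"
replaced = replace_first_uncommented(sample, "replace me", "replaced!")
print(replaced)
```

Only the first match on an uncommented line is changed because count=1 stops the substitution after one replacement.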

Cannot strip character using .strip() in python

I am a biologist and need to make a quick script to process some files.
The file format is fasta:
>line1
ACCGAGCTACTAGXXXXX
>line2
ACGTAX
et cetera.
I want to remove all X characters and quickly put together this script:
print """Input file must be named FILE.fasta"""
fasta_file = raw_input('Input file name:') # Input fasta file
char = raw_input('Which sequence should be stripped?:')
OutFileName = fasta_file.strip('.fasta') + '_stripped.fasta'
OutFile = open(OutFileName, 'w')
WriteOutFile = True
data = open(fasta_file, "r")
for line in data:
    if line.startswith('>'):
        OutPut = line
    else:
        OutPut = line.strip(char)
    print OutPut
    OutFile.write(OutPut)
print(char)
OutFile.close()
quit()
It does not work and I can't figure out why. any help?
P.S. sorry for the terrible code.
The other answers specified better alternatives. But in your case, str.strip([chars]) didn't work because each line read from a file ends with the end-of-line terminator, so X is not actually at the end of the string.
The option that requires minimum of code changes, is to modify the 3rd line from:
char = raw_input('Which sequence should be stripped?:')
to:
char = raw_input('Which sequence should be stripped?:') + "\n"
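Beware: the line fasta_file.strip('.fasta') might not do what you think it does. strip removes any of the characters '.', 'f', 'a', 's' and 't' from both ends of the name, not the suffix '.fasta'. Here, it would be recommended to use: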
fasta_file.replace('.fasta', '_stripped.fasta')
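A short illustration of the difference, using a made-up file name (data.fasta is only an example). os.path.splitext is shown as one way to separate the extension; this is a sketch, not the only option.

```python
import os

name = "data.fasta"

# strip() treats its argument as a set of characters to trim from both ends,
# so the trailing "ata" of "data" is removed along with ".fasta".
stripped = name.strip(".fasta")

# splitext() splits off the extension as a unit.
root, ext = os.path.splitext(name)
out_name = root + "_stripped" + ext

print(stripped)
print(out_name)
```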
EDIT0:
I think that you need to add the EOLN back when writing to the output file, so you also need to replace this line:
OutPut = line.strip(char)
by:
OutPut = line.strip(char) + "\n"
Use line.replace(char, '') instead of line.strip(char).
The strip function removes characters only from the ends of the string: https://docs.python.org/2/library/string.html#string.strip
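As a rough sketch of the replace approach applied to FASTA-style lines (remove_char_from_fasta is a hypothetical helper, and the sample records are the ones from the question):

```python
def remove_char_from_fasta(lines, char):
    cleaned = []
    for line in lines:
        if line.startswith(">"):
            # Header lines are kept unchanged.
            cleaned.append(line)
        else:
            # replace() removes every occurrence and leaves the newline alone.
            cleaned.append(line.replace(char, ""))
    return cleaned

records = [">line1\n", "ACCGAGCTACTAGXXXXX\n", ">line2\n", "ACGTAX\n"]
cleaned = remove_char_from_fasta(records, "X")
```

Because the trailing newline is never touched, the cleaned lines can be written to the output file as they are.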
You could do this using regex:
import re
pattern = re.compile(r"(\w[^X]+)") # This groups everything up to the first X
stripped = pattern.match(line).group()
For your case you can do something similar in the 'else' section of your code and replace the 'X' in "(\w[^X]+)" with your 'char' variable:
pattern = re.compile(r"(\w[^" + re.escape(char) + r"]+)")

Print full sequence not just first line | Python 3.3 | Print from specific line to end (")

I am attempting to pull out multiple (50-100) sequences from a large .txt file separated by new lines ('\n'). Each sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing using python 3.3
This is what I have so far:
searchfile = open('filename.txt' , 'r')
cache = []
for line in searchfile:
    cache.append(line)
for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is always 5 lines below the keyword line), but it only pulls out that one line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data but if you're searching for each 'keyword' then the next " char then the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)
for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re
with open('yourfile.txt') as f:
    lines = f.readlines()
for i,line in enumerate(lines):
    #first search for keyword
    key_match = re.search(r'\((keyword)',line)
    if key_match:
        #if successful search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"',lines[i+5])
        if seq_match:
            print(key_match.group(1) +' '+ seq_match.group(1))
This can be done rather simply with regex
import re
lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
To use this sample code on the real data, we would read lines = f.readlines() from the file. It's important to note that we capture only what lies between one " and the next "; if there is no closing " on the same line, we will miss that data, but accounting for that isn't too difficult.
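Since the question says a sequence can span several lines, one way to account for that is to search the whole file contents with re.S so that "." also matches newlines. This is a sketch under the assumption that double quotes appear only around the sequences; quoted_sequences is a hypothetical name.

```python
import re

def quoted_sequences(text):
    # re.S lets '.' match newlines, so a quoted block may span several lines.
    blocks = re.findall(r'"(.*?)"', text, flags=re.S)
    # Join the pieces of each sequence by dropping the line breaks.
    return [block.replace("\n", "") for block in blocks]

sample = 'Name (keyword):\nDate\nAddress1\nAddress2\nSex\nResponse"ABCB\nDBEB"\nY/N\n'
sequences = quoted_sequences(sample)
print(sequences)
```

With the real file, text would come from f.read() rather than from a literal string.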

Python - Count characters between two specific strings

I made a text file containing random sequences of bases (ATCG) and want to find the longest and shortest "reading frame" within those sequences.
I was able to identify the start and stop codons (the two "specific strings" mentioned) with "searchfile" and a for-loop, and I also know the basics of counting (example code at the end), but I can't find a way to set those two as "boundaries" between which I can count.
Can anybody give me a hint, or tell me what such a function/operation is called so I can at least find it in the documentation, or what it could look like? I found many options for counting various different things, but none for counting between "x" and "y".
Example of how I looked up the strings between which I want to count:
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()
whole code:
import numpy as np
BASES = ('A', 'C', 'T', 'G')
P = (0.25, 0.25, 0.25, 0.25)
def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))
with open('dna.txt', 'w+') as txtout:
    for _ in range(10):
        dna = random_dna_sequence(50)
        txtout.write(dna)
        txtout.write("\n")
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()
searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
    elif "TAG" in line: print (line)
    elif "TAA" in line: print (line)
    elif "TGA" in line: print (line)
    else: print ("no stop-codon detected")
searchfile.close()
Sidenote: The print instruction is only a temporary placeholder for testing. In the end i would like to set the found strings as mentioned "boundaries" (i can't find a better name for it) at that point.
Some example lines from the dna.txt file:
GAAGACGCAATAGGTTCACGGCGCTCATAGGCTTGCCCTCATAGGGCTTG
TCTGAGGTAGAAGGAGCTACTGCCGTTGCAGGTGACGCCCACAGTCCTGA
GTTATTACTCCCTGACTGTCATCTGTTCGGATACCGTGCAGCGCATCGAG
AGGAGATAACGCGATCCTGAGACAGTTTACCTATATGTTCACTACGCATG
CCGAGCTGATCCGACTACTGAAGGTGAATTCTGAAGCTAATCTGCAGTTC
This is a small example (I use 10 and 50 for testing) but in the end the file shall contain 10000 sequences with 1000 characters each.
What I would do is something like this:
with open("dna.txt", 'r') as searchfile:
    all_dna = searchfile.read()
start = all_dna.index("ATG")
rem_dna = all_dna[start + 3:]
# index() on rem_dna is relative to rem_dna, so shift it back into all_dna
end = start + 3 + rem_dna.index("ATG")
needed_dna = all_dna[start:(end + 3)]
print(len(needed_dna))
index finds where in a string the substring passed as an argument occurs, and raises ValueError if the substring is not found. with is a keyword useful as a safety precaution for file I/O: it ensures that the file is properly closed even if the code inside that block causes an error. If you don't want to include the starting and ending "ATG" in needed_dna, you can set that to all_dna[(start + 3):end]. The brackets, by the way, mean "take the substring of the specified string beginning at the index before the colon (inclusive, zero-indexed) and ending at the index after the colon (non-inclusive, also zero-indexed)". This can also be used for lists, and can be used without the colon to get the character at a specific index. Hope this helps!
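To measure between a start codon and a stop codon rather than between two "ATG" markers, the same indexing idea can be sketched as below. This assumes a "reading frame" runs from an ATG to the first stop codon (TAG, TAA or TGA) found in steps of three after it; reading_frame_lengths is a hypothetical helper and the test sequence is made up.

```python
STOP_CODONS = ("TAG", "TAA", "TGA")

def reading_frame_lengths(seq, start_codon="ATG"):
    lengths = []
    start = seq.find(start_codon)
    while start != -1:
        # Walk codon by codon (steps of 3) from just after the start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # Length counts both the start codon and the stop codon.
                lengths.append(pos + 3 - start)
                break
        start = seq.find(start_codon, start + 1)
    return lengths

lengths = reading_frame_lengths("CCATGAAATTTTAGGG")
print(lengths)
```

Applied per line of dna.txt, max() and min() over the collected lengths would give the longest and shortest frames; start codons with no in-frame stop are skipped.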

Adding to the end of a line in Python

I want to add some letters to the beginning and end of each line using python.
I found various methods of doing this; however, whichever method I use, the letters I want to add to the end are always added to the beginning.
input = open("input_file",'r')
output = open("output_file",'w')
for line in input:
    newline = "A" + line + "B"
    output.write(newline)
input.close()
output.close()
I have used various methods I found here. With each one of them, both letters are added to the front.
inserting characters at the start and end of a string
''.join(('L','yourstring','LL'))
or
yourstring = "L%sLL" % yourstring
or
yourstring = "L{0}LL".format(yourstring)
I'm clearly missing something here. What can I do?
When reading lines from a file, python leaves the \n on the end. You could .rstrip it off however.
yourstring = 'L{0}LL\n'.format(yourstring.rstrip('\n'))
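Put back into the loop from the question, this could look like the sketch below. io.StringIO objects stand in for the real input and output files so that the example is self-contained, and wrap_lines is a hypothetical name.

```python
import io

def wrap_lines(src, dst, prefix="A", suffix="B"):
    for line in src:
        # Drop the line ending first, then add it back after the suffix.
        dst.write(prefix + line.rstrip("\r\n") + suffix + "\n")

source = io.StringIO("one\ntwo\n")
target = io.StringIO()
wrap_lines(source, target)
wrapped = target.getvalue()
print(wrapped)
```

Without the rstrip, the suffix lands after the newline and so shows up at the start of the following line, which is the effect described in the question.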
