Cannot strip character using .strip() in python


I am a biologist and need to make a quick script to process some files.
The file format is fasta:
>line1
ACCGAGCTACTAGXXXXX
>line2
ACGTAX
et cetera.
I want to remove all X characters and quickly put together this script:
print """Input file must be named FILE.fasta"""
fasta_file = raw_input('Input file name:') # Input fasta file
char = raw_input('Which sequence should be stripped?:')
OutFileName = fasta_file.strip('.fasta') + '_stripped.fasta'
OutFile = open(OutFileName, 'w')
WriteOutFile = True
data = open(fasta_file, "r")
for line in data:
    if line.startswith('>'):
        OutPut = line
    else:
        OutPut = line.strip(char)
    print OutPut
    OutFile.write(OutPut)
print(char)
OutFile.close()
quit()
It does not work and I can't figure out why. Any help?
P.S. sorry for the terrible code.

The other answers specified better alternatives. But in your case, str.strip([chars]) (see [Python 3.Docs]: Built-in Types) didn't work because each line read from a file ends with the end-of-line (EOLN) terminator, so X is not actually at the end of the string.
The option that requires the fewest code changes is to modify the 3rd line from:
char = raw_input('Which sequence should be stripped?:')
to:
char = raw_input('Which sequence should be stripped?:') + "\n"
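Beware: the line fasta_file.strip('.fasta') might not do what you think it does, because strip treats '.fasta' as a set of characters to remove from both ends rather than as a suffix. Here, it would be recommended to use: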
fasta_file.replace('.fasta', '_stripped.fasta')
EDIT0:
I think that you need to add the EOLN back when writing to the output file, so you also need to replace this line:
OutPut = line.strip(char)
by:
OutPut = line.strip(char) + "\n"
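The warning about fasta_file.strip('.fasta') can be made concrete with a small sketch. The file name data.fasta and the helper stripped_name are made up for illustration; os.path.splitext is used here as one possible alternative, not as the method the answer prescribes.

```python
import os


def stripped_name(fasta_file):
    """Hypothetical helper: build the output name by splitting off the extension."""
    root, ext = os.path.splitext(fasta_file)
    return root + "_stripped" + ext


# strip() removes any of the characters '.', 'f', 'a', 's', 't' from both ends,
# so it keeps eating into the base name after the extension is gone.
broken = "data.fasta".strip(".fasta") + "_stripped.fasta"
fixed = stripped_name("data.fasta")

print(broken)
print(fixed)
```

With this input the strip-based name loses most of the base name, while the splitext-based name keeps it intact.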

Use line.replace(char, '') instead of line.strip(char).
The strip function removes characters only from the two ends of the string: https://docs.python.org/2/library/string.html#string.strip
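A short sketch of the difference, using a made-up sequence line that has an X in the middle as well as at the end (the variable names are only for illustration):

```python
line = "ACXGTXX\n"

# strip("X") only looks at the two ends; the trailing newline blocks it entirely.
only_ends = line.strip("X")

# Removing the newline first lets strip() reach the trailing X's, but not the inner one.
after_newline = line.rstrip("\n").strip("X")

# replace() removes every occurrence, wherever it is, and leaves the newline alone.
everywhere = line.replace("X", "")

print(repr(only_ends))
print(repr(after_newline))
print(repr(everywhere))
```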

You could do this using regex:
import re
pattern = re.compile(r"(\w[^X]+)") # This groups everything up to the first X
stripped = pattern.match(line).group()
Note that this only helps when the X characters sit at the end of the sequence. For your case you can do something similar in the 'else' section of your code and replace the 'X' in "(\w[^X]+)" with your 'char' variable:
pattern = re.compile(r"(\w[^" + re.escape(char) + r"]+)")

Related

find end of line after another in text file in Python

Issue
Hello all,
In a text file I need to replace an unknown string with another.
To find it, I first need to find the line before it, 'name Blur2',
as there are many lines beginning with 'xpos':
name Blur2
xpos 12279 # 12279 is the end of line to find and put in a variable
Code to get the unknown string:
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("input_file.txt", 'r+') as f1:
    lines = f1.readlines()
    for i in range(0, len(lines)):
        line = lines[i]
        if keyString in line:
            nextLine = lines[i + 1]
            print ' nextLine: ',nextLine #result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            print ' number: ',number #result: number: 12279
            #convert string to float:
            newString = '{0}\n'.format(int(number)+ 10)
            print ' newString: ',newString #result: newString: 12289
            f2.write("".join([nextLine.replace(number, str(newString))])) #this line isn't working
f1.close()
f2.close()
So I completely changed my method, but the last line (f2.write...) isn't working as expected. Does anyone know why?
Thanks again for your help :)

Regex seems like it would help; https://regex101.com/ is useful for testing patterns.
A regex is a small language for describing a pattern to search for in a string. I list the most important pieces for understanding this pattern below; regex is sometimes a better alternative to Python's native string manipulation.
You first describe the pattern that you will be using, then actually compile the pattern. I defined the pattern as a raw string using r''. This means I don't have to escape each \ within the string (example: the whitespace class can be written r'\s' instead of '\\s').
There are a couple of parts to this regex.
\s for whitespace (characters like space, ' ')
\n or \r for newline and carriage return, [^] defines which characters not to look for (so [^\n\r] matches any character that is not a newline or carriage return), the * indicates it can have 0 or more of the characters indicated. $ in the regex string anchors the match at the end of the line.
So the pattern searches for 'name Blur2' specifically, with any amount of whitespace afterwards and a newline. The parentheses make this group 1 (explained later). The second part '([^\n\r]*$)' captures any number of characters that aren't a newline or carriage return up until the end of that line.
Groups correspond to the parentheses, so '(name Blur2\s*\n)' is group 1, and the line you want replaced '([^\n\r]*$)' is group 2. checkre.sub should replace the whole match with group 1
and the new string, so it replaces the first line with the first line, and replaces the second line with your new string
import re
check = r'(name Blur2\s*\n)([^\n\r]*$)'
checkre = re.compile(check, re.MULTILINE)
newfilestring = checkre.sub(r'\g<1>' + newstring, filestring)
You need to set re.MULTILINE so that $ matches at the end of each line and not only at the end of the whole string. If the '\n' isn't matched, you could use '(?:\r\n|\r|\n)' instead, which accepts any of the common line endings.
rioV8's comment works, but you could also use '.{5}$', which matches the last 5 characters before the end of the line. It could be helpful within a regex.
It should be possible to get the old string with
oldstring = checkre.search(filestring).group(2)
I have not played with span yet, but
stringmatch = checkre.search(filestring)
oldstring = stringmatch.group(2)
newfilestring = filestring[0:stringmatch.span()[0]] + stringmatch.group(1) + newstring + filestring[stringmatch.span()[1]:]
should be pretty close to what you're looking for, although the slice may not be exactly correct.
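As a minimal sketch of the same idea, the replacement can also be computed by a function passed to re.sub, which avoids building the replacement string by hand. The sample text, the three-group pattern, and the helper add_ten are assumptions made for illustration; they are not taken from the question's real file.

```python
import re

# Made-up sample: only the xpos line that follows 'name Blur2' should change.
text = "name Blur2\nxpos 12279\nname Other\nxpos 5\n"

# Group 1: the 'name Blur2' line; group 2: the 'xpos ' prefix of the next line;
# group 3: the number itself.
pattern = re.compile(r"(name Blur2[ \t]*\n)(\s*xpos )(\d+)")


def add_ten(match):
    """Rebuild the matched text with the number increased by 10."""
    return match.group(1) + match.group(2) + str(int(match.group(3)) + 10)


new_text = pattern.sub(add_ten, text)
print(new_text)
```

Because the number is captured in its own group, the old value is available as match.group(3) without any slicing.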

The initial program was pretty close. I edited a little bit of it to tweak a few things that were wrong.
You weren't initially writing the lines that needed to be replaced, and I'm not sure why you needed to join things. Just replacing the number directly seemed to work. Changing i inside a for loop over range() has no effect on the next iteration, and you need to skip one line so it isn't written to the file twice, so I changed it to a while loop. Anyway, ask any questions you have, but the code below seems to work.
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("test.txt", 'r+') as f1:
    lines = f1.readlines()
    i=0
    while i <len(lines):
        line = lines[i]
        if keyString in line:
            f2.write(line)
            nextLine = lines[i + 1]
            #end of necessary 'i' calls, increment i to avoid rewriting the replaced line
            i+=1
            print (' nextLine: ',nextLine )#result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            #as was said in a comment, this could also be number = nextLine[-5:]
            print (' number: ',number )#result: number: 12279
            #convert string to int and add 10:
            newString = '{0}\n'.format(int(number)+ 10)
            print (' newString: ',newString) #result: newString: 12289
            f2.write(nextLine.replace(number, str(newString)))
        else:
            f2.write(line)
        i+=1
f2.close()

Print full sequence not just first line | Python 3.3 | Print from specific line to end (")

I am attempting to pull out multiple (50-100) sequences from a large .txt file separated by new lines ('\n'). Each sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing using python 3.3
This is what I have so far:
searchfile = open('filename.txt' , 'r')
cache = []
for line in searchfile:
    cache.append(line)
for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is 5 lines below the keyword line always) however it only pulls out this line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword

Not sure if I understand your sequence data but if you're searching for each 'keyword' then the next " char then the following should work:
keyword_pos =[]
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)
for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.

I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re
with open('yourfile.txt') as f:
    lines = f.readlines()
for i,line in enumerate(lines):
    #first search for keyword
    key_match = re.search(r'\((keyword)',line)
    if key_match:
        #if successful search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"',lines[i+5])
        if seq_match:
            print(key_match.group(1) +' '+ seq_match.group(1))

This can be done rather simply with regex:
import re
lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
Eventually, to use this sample code on the dataset, we would set lines = f.readlines(). It's important to note that we catch only things between " and another "; if there is no " mark at the end, we will miss this data, but accounting for that isn't too difficult.
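Since the question says a sequence can run over several lines, one possible variation is to search the whole file content at once with re.DOTALL, so that the part between the quotes may contain newlines. The sample text below is invented to mimic the layout in the question, and the sketch assumes each sequence sits between exactly one pair of double quotes and that no other double quotes occur in the file.

```python
import re

# Made-up record in the layout described in the question.
text = (
    "Name (keyword1):\nDate\nAddress1\nAddress2\nSex\n"
    'Response"ABCBDBEBSOSO\nABCBDBDBDBDD"\nY/N\n'
)

# re.DOTALL lets '.' match newlines, so a sequence spanning several lines is captured.
sequences = re.findall(r'"(.*?)"', text, flags=re.DOTALL)

for seq in sequences:
    # Join the pieces of a multi-line sequence back into one string.
    print(seq.replace("\n", ""))
```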

In Python, how do I efficiently check if a string has been found in a file yet?

in a Python function I'm writing I'm going through a text file, line by line, to replace each occurence of a certain string by a (numerical) value. Once I'm at the end of the file I would like to know if this string appeared in the file at all.
The method str.replace() does not tell you if anything has been replaced or not, so I find myself having to go over each line twice: once to look for the string and again to replace it.
So far, I've come up with 2 ways to do this.
1. For each line:
    use line.find(...) to look for the string, if it hasn't been found before
    if the string is found, mark it as found
    newLine = line.replace(...)
    (do sth. with newLine ...)
2. For each line:
    do newLine = line.replace(...) first
    if newLine != line mark the string as found
    (do sth. with newLine ...)
Here's my question:
Is there a better, i.e., more efficient or more pythonic way to do this?
If not, which of the above ways is faster?

I'd do something roughly like
found = False
newlines = []
for line in f:
    if oldstring in line:
        found = True
        newlines.append(line.replace(oldstring, newstring))
    else:
        newlines.append(line)
Because that's the most easily understandable to me, I think.
There may be faster ways, but the best way depends on how often the string will occur in lines. Almost every line or almost no lines, that makes a big difference.

This example will work with multiple replacements:
replacements = {'string': [1,0], 'string2': [2,0]}
with open('somefile.txt') as f:
    for line in f:
        for key, value in replacements.items():
            if key in line:
                # str() is needed because the replacement values are numbers
                line = line.replace(key, str(value[0]))
                # counts the lines in which the key was replaced
                replacements[key][1] += 1
# At the end
for key, value in replacements.items():
    print('Replaced {} with {} {} times'.format(key, *value))

Since we have to go through the string twice anyway, I'd do it as follows:
import re
with open('yourfile.txt', 'r', encoding='utf-8') as f: # check encoding
    s = f.read()
oldstr, newstr = 'XXX', 'YYY'
# re.escape makes oldstr match literally even if it contains regex special characters
count = len(list(re.finditer(re.escape(oldstr), s)))
s_new = s.replace(oldstr, newstr)
print(oldstr, 'has been found and replaced by', newstr, count, 'times')
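If a regex-based replacement is acceptable, re.subn does the replacing and the counting in a single pass, which addresses the original question directly. This is only a sketch: the helper name replace_and_count and the sample text are made up for illustration.

```python
import re


def replace_and_count(text, old, new):
    """Return (new_text, count); count > 0 means `old` occurred at least once."""
    # re.escape makes `old` match literally; the lambda keeps `new` from being
    # interpreted as a replacement template with backslash escapes.
    return re.subn(re.escape(old), lambda match: new, text)


new_text, count = replace_and_count("a XXX b XXX\nc\n", "XXX", "42")
print(new_text, count)
```

The same function can be called once per line or once on the whole file content; summing the counts tells you whether the string appeared at all.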

Adding to the end of a line in Python

I want to add some letters to the beginning and end of each line using python.
I found various methods of doing this; however, whichever method I use, the letters I want to add to the end are always added to the beginning.
input = open("input_file",'r')
output = open("output_file",'w')
for line in input:
    newline = "A" + line + "B"
    output.write(newline)
input.close()
output.close()
I have used various methods I found here. With each one of them both letters are added to the front.
inserting characters at the start and end of a string
''.join(('L','yourstring','LL'))
or
yourstring = "L%sLL" % yourstring
or
yourstring = "L{0}LL".format(yourstring)
I'm clearly missing something here. What can I do?

When reading lines from a file, python leaves the \n on the end. You could .rstrip it off however.
yourstring = 'L{0}LL\n'.format(yourstring.rstrip('\n'))
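To see why the suffix seems to land at the front, here is a small sketch that uses io.StringIO as a stand-in for the input file (the two sample lines are made up). Without rstrip, the "B" is written after the newline, so it shows up at the start of the next line.

```python
import io

source = io.StringIO("first\nsecond\n")  # stand-in for the input file

result = []
for line in source:
    # Remove the newline, wrap the text, then put the newline back at the very end.
    result.append("A" + line.rstrip("\n") + "B\n")

print("".join(result), end="")
```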

How to read a text file into a string variable and strip newlines?

I have a text file that looks like:
ABC
DEF
How can I read the file into a single-line string without newlines, in this case creating a string 'ABCDEF'?
For reading the file into a list of lines, but removing the trailing newline character from each line, see How to read a file without newlines?.

You could use:
with open('data.txt', 'r') as file:
    data = file.read().replace('\n', '')
Or if the file content is guaranteed to be one line:
with open('data.txt', 'r') as file:
    data = file.read().rstrip()

In Python 3.5 or later, using pathlib you can copy text file contents into a variable and close the file in one line:
from pathlib import Path
txt = Path('data.txt').read_text()
and then you can use str.replace to remove the newlines:
txt = txt.replace('\n', '')

You can read from a file in one line:
str = open('very_Important.txt', 'r').read()
Please note that this does not close the file explicitly.
CPython will close the file once the file object is garbage-collected, which normally happens as soon as nothing references it any more.
But other Python implementations may not do so promptly. To write portable code, it is better to use with or close the file explicitly. Short is not always better. See https://stackoverflow.com/a/7396043/362951

To join all lines into a string and remove new lines, I normally use :
with open('t.txt') as f:
    s = " ".join([l.rstrip("\n") for l in f])

with open("data.txt") as myfile:
    data="".join(line.rstrip() for line in myfile)
join() will join a list of strings, and rstrip() with no arguments will trim whitespace, including newlines, from the end of strings.

This can be done using the read() method :
text_as_string = open('Your_Text_File.txt', 'r').read()
Or as the default mode itself is 'r' (read) so simply use,
text_as_string = open('Your_Text_File.txt').read()

I'm surprised nobody mentioned splitlines() yet.
with open ("data.txt", "r") as myfile:
    data = myfile.read().splitlines()
Variable data is now a list that looks like this when printed:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
Note there are no newlines (\n).
At that point, it sounds like you want to print back the lines to console, which you can achieve with a for loop:
for line in data:
    print(line)

It's hard to tell exactly what you're after, but something like this should get you started:
with open ("data.txt", "r") as myfile:
    data = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

I have fiddled around with this for a while and prefer to use read in combination with rstrip. Without rstrip("\n"), the string keeps the newline that ends the file, which in most cases is not very useful.
with open("myfile.txt") as f:
    file_content = f.read().rstrip("\n")
print(file_content)

Here are four options to choose from:
with open("my_text_file.txt", "r") as file:
    data = file.read().replace("\n", "")
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().split("\n"))
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().splitlines())
or
with open("my_text_file.txt", "r") as file:
    data = "".join([line.rstrip("\n") for line in file])

You can compress this into two lines of code:
content = open('filepath','r').read().replace('\n',' ')
print(content)
if your file reads:
hello how are you?
who are you?
blank blank
python output
hello how are you? who are you? blank blank

You can also strip each line and concatenate into a final string.
myfile = open("data.txt","r")
data = ""
lines = myfile.readlines()
for line in lines:
    data = data + line.strip()
This would also work out just fine.

This is a one line, copy-pasteable solution that also closes the file object:
_ = open('data.txt', 'r'); data = _.read(); _.close()

f = open('data.txt','r')
string = ""
while 1:
    line = f.readline()
    if not line:break
    string += line
f.close()
print(string)

python3: Google "list comprehension" if the square bracket syntax is new to you.
with open('data.txt') as f:
    lines = [ line.strip('\n') for line in list(f) ]

One-liner:
List: "".join([line.rstrip('\n') for line in open('file.txt')])
Generator: "".join((line.rstrip('\n') for line in open('file.txt')))
A list is usually faster than a generator but heavier on memory; a generator is slower but lighter on memory, because it yields one line at a time. In the case of "".join(), I think both should work well. Remove the "".join() call to get the list or the generator itself.
Note: these one-liners never call close() explicitly; CPython closes the file when the file object is garbage-collected, but a with block is the safer choice.

Have you tried this?
x = "yourfilename.txt"
y = open(x, 'r').read()
print(y)

To remove line breaks using Python you can use replace function of a string.
This example removes all 3 types of line breaks:
my_string = open('lala.json').read()
print(my_string)
my_string = my_string.replace("\r","").replace("\n","")
print(my_string)
Example file is:
{
"lala": "lulu",
"foo": "bar"
}
You can try it using this replay scenario:
https://repl.it/repls/AnnualJointHardware

I don't feel that anyone addressed the [ ] part of your question. When you read each line into your variable, because there were multiple lines before you replaced the \n with '' you ended up creating a list. If you have a variable of x and print it out just by
x
or print(x)
or str(x)
You will see the entire list with the brackets. If you call each element of the (array of sorts)
x[0]
then it omits the brackets. If you use the str() function you will see just the data and not the '' either.
str(x[0])

Maybe you could try this? I use this in my programs.
Data= open ('data.txt', 'r')
data = Data.readlines()
for i in range(len(data)):
    data[i] = data[i].strip()+ ' '
data = ''.join(data).strip()

Regular expression works too:
import re
with open("depression.txt") as f:
    l = re.split(' ', re.sub('\n',' ', f.read()))[:-1]
    print (l)
['I', 'feel', 'empty', 'and', 'dead', 'inside']

with open('data.txt', 'r') as file:
    data = [line.strip('\n') for line in file.readlines()]
    data = ''.join(data)

from pathlib import Path
line_lst = Path("to/the/file.txt").read_text().splitlines()
This is the best way to get all the lines of a file; the '\n' are already stripped by splitlines() (which recognizes Windows, Mac, and Unix line endings).
But if you nonetheless want to strip each line:
line_lst = [line.strip() for line in Path("to/the/file.txt").read_text().splitlines()]
strip() is just a useful example; you can process each line as you please.
In the end, do you just want the concatenated text?
txt = ''.join(Path("to/the/file.txt").read_text().splitlines())
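The point about line endings can be checked on an in-memory string with mixed endings (the sample string is made up). Note that text-mode open() with default settings already translates '\r\n' and '\r' to '\n' when reading, so the difference mostly matters for text that reached the program some other way.

```python
# A made-up string that mixes Windows (\r\n), old Mac (\r) and Unix (\n) endings.
text = "ABC\r\nDEF\rGHI\n"

# splitlines() splits on all three kinds of line ending.
joined = "".join(text.splitlines())

# Replacing only "\n" leaves the carriage returns behind.
naive = text.replace("\n", "")

print(repr(joined))
print(repr(naive))
```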

This works:
Change your file to:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
Then:
file = open("file.txt")
line = file.read()
words = line.split()
This creates a list named words that equals:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
That got rid of the "\n". To answer the part about the brackets getting in your way, just do this:
for word in words: # Assuming words is the list above
    print word # Prints each word in file on a different line
Or:
print words[0] + ",", words[1] # Note that the "+" symbol indicates no spaces
#The comma not in parentheses indicates a space
This returns:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN, GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE

with open(player_name, 'r') as myfile:
    data=myfile.readline()
    list=data.split(" ")
    word=list[0]
This code reads the first line, then uses split to turn its space-separated words into a list.
Then you can easily access any word, or even store it in a string.
You can also do the same thing using a for loop.

file = open("myfile.txt", "r")
lines = file.readlines()
str = '' #string declaration
for i in range(len(lines)):
    str += lines[i].rstrip('\n') + ' '
print str

Try the following:
with open('data.txt', 'r') as myfile:
    data = myfile.read()
    sentences = data.split('\\n')
    for sentence in sentences:
        print(sentence)
Caution: It does not remove the \n. It is just for viewing the text as if there were no \n
