How does one remove a header from a long string of text?
I have a program that displays a FASTA file as
...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...
The string is large and contains multiple headers like this.
The headers that need to be trimmed start with a > and end with a $.
There are multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25.
How can I cut on the > and the $, remove everything in between, and separate the code before and after into separate list elements?
The file is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.
Since this is part of a FASTA file, you can split it like this:
>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(r">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
Also, some answers slice on a literal '>ion1'. That is wrong, because the header names vary.
I believe that solves your problem. I am also adding the bioinformatics tag to this question.
I would use the re module for that:
>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(r">[^$]*\$", s)
['blablabla', 'foobar', 'etc', '...']
And if you have one string on each line:
>>> with open("foo.txt", "r") as f:
...     for line in f:
...         re.split(r">[^$]*\$", line[:-1])
...
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
If you are reading over every line, there are a few ways to do this. You could use partition (partition returns a tuple of 3 elements: the text before the specified string, the specified string, and the text after):
for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]
You could use split:
for line in file:
    stripped_header = line.split(">")[1].split("$")[0]
You could loop over all the characters in the string and only append after you pass ">" but before "$" (however, this will not be nearly as efficient):
for line in file:
    in_header = False  # True while we are between ">" and "$"
    stripped_header = ""
    for char in line:
        if char == ">":
            in_header = True
        elif in_header:
            if char != "$":
                stripped_header += char
            else:
                in_header = False
Or alternatively use a regular expression, but it seems my peers have already beaten me to it!
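On the input step: in a standard FASTA file each header sits on its own line starting with >, so the headers can be dropped while reading instead of being cut out of one joined string afterwards. A minimal sketch, assuming that layout (the name read_fasta_sequences is made up for illustration and is not from any library):

```python
def read_fasta_sequences(lines):
    """Return one sequence string per FASTA record, with the headers dropped."""
    sequences = []
    current = []          # pieces of the record currently being read
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous one, if any.
            if seen_header or current:
                sequences.append("".join(current))
            current = []
            seen_header = True
        elif line:
            current.append(line)
    # Store the last record once the input is exhausted.
    if seen_header or current:
        sequences.append("".join(current))
    return sequences
```

Any iterable of lines works, so an open file handle can be passed in directly and the result is the list of sequences the question asks for.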
I have the following code reading in a specific part of a text file. The problem is that these are numbers, not strings, so I want to convert them to ints and read them into a list of some sort.
A sample of the data from the text file is shown below. It is not wholly representative, however, so I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck with this issue. Any more suggestions? My latest code and the error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog
file_path = filedialog.askopenfilename()
print(file_path)
data = []
data2 = []
data3 = []
flag = False
with open(file_path, 'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag = True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag = False  # loop stops when condition is false i.e. if false do nothing
        elif flag:  # as long as flag is true append
            data.append([int(x) for x in line.strip().split(',')])
The result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings, I would like each to be a number in a list, i.e. [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re
with open(file_path) as f:
    txt = f.read()
# Capture everything after the header, up to the next *NSET header or the end of the file.
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*?)(?=\*NSET|\Z)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression with the last portion of your header as an anchor, and capture with capturing parentheses everything that follows, up to the next *NSET header or the end of the file (the re.S flag means that a dot also matches newlines; without the lookahead, the capture would run on into the following sections and int() would fail on their header text). I access all the numbers as one unit of text via g.group(1).
Next, I remove all the commas (actually replace them with spaces) because on the resulting text I use split(), which is an excellent function to use on text items that are separated by whitespace: it doesn't matter how much whitespace there is, it just splits as you would intend.
The rest is just converting the text to numbers using a list comprehension.
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
Based on your comment, and looking at the larger context of your code, I'd suggest a few things:
from itertools import groupby
number_groups = []
with open('data.txt', 'r') as f:
    # groupby yields runs of consecutive header lines (k is True) and data lines (k is False)
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if not k:
            number_groups += [line for line in g if line != '\n']  # remove blank lines
data = []  # numbers from every *NSET section end up here
for group in number_groups:
    for str_num in group.strip().split(','):
        if str_num.strip():  # skip the empty field left by a trailing comma
            data.append(int(str_num))
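If only one named section is wanted, the flag approach from the question can be kept and the empty fields left by trailing commas filtered out before calling int(). A minimal sketch, assuming every keyword line in the file starts with * and that the header is written exactly as "*NSET, NSET=<name>" (read_nset is a made-up helper name):

```python
def read_nset(lines, name):
    """Collect the integers listed under the '*NSET, NSET=<name>' header."""
    header = "*NSET, NSET=" + name
    numbers = []
    in_section = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("*"):
            # Any keyword line either opens the wanted section or closes it.
            in_section = stripped == header
            continue
        if in_section:
            # A trailing comma leaves an empty field; skip it before int().
            numbers.extend(int(field) for field in stripped.split(",") if field.strip())
    return numbers
```

Called as read_nset(f, "Nodes_Pushed_Back_IB") on an open file, this would return one flat list of ints rather than one list per line.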
I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into a shorter string based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So splitting the string on // with n = 1 would return:
ABCDEF
Splitting the string on // with n = 2 would return:
ABCDEF
//
GHIJKLMN
Splitting the string on // with n = 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on. However, the length of the original 2 million line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes: I was getting a memory error. Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way that doesn't require splitting the entire string into a hundred thousand pieces when I may only need 100. Instead I would like to start from the beginning, stop at a certain point, and return everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!
If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)
file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
A more efficient way is to read only the first N records separated by your delimiter. If you are sure that every record is a single line followed by a delimiter line, you can use itertools.islice to do the job:
from itertools import islice
with open('filename') as f:
    lines = list(islice(f, 0, 2 * N - 1))  # N records plus the N - 1 delimiter lines between them
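To see why the stop index is 2*N-1, here is a small runnable sketch on the sample from the question. It assumes every record is exactly one line followed by one delimiter line; first_records is a made-up name and io.StringIO stands in for the real file:

```python
import io
from itertools import islice

def first_records(handle, n):
    """Return the first n records and the n - 1 delimiter lines between them."""
    # Record i sits on line 2*i and its delimiter on the line after it,
    # so n records plus the delimiters between them span 2*n - 1 lines.
    return list(islice(handle, 0, 2 * n - 1))

sample = io.StringIO("ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n")
print("".join(first_records(sample, 2)), end="")
```

Because islice stops pulling lines once it has enough, the rest of the file is never read.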
The method that comes to mind when I read your question uses a loop
where you cut the string up into several pieces (for example the 100 you mentioned) and iterate through each substring.
thestring = ""  # your string
steps = 100  # length of the pieces you are going to use for iteration
def is_wanted(element):  # placeholder: replace with your own test
    return True
log = 0
while log < len(thestring):
    substring = thestring[log:log + steps]  # this is the piece you will split and iterate through
    thelist = substring.split("//")  # note: a piece boundary can fall inside a record
    for element in thelist:
        if is_wanted(element):
            pass  # do your thing with the line
    log = log + steps  # and go again, only with this offset
This way you can go through all the elements of the whole 2 million(!) line string.
You can also make a recursive function from this (if that is what you want), although Python's default recursion limit makes the loop the safer choice for very long strings:
thestring = ""  # your string
steps = 100  # length of the pieces you are going to use for iteration
def iterateThroughHugeString(beginning):
    if beginning >= len(thestring):
        return  # nothing left to process
    substring = thestring[beginning:beginning + steps]  # this is the piece you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if is_wanted(element):  # is_wanted is the same placeholder test as above
            pass  # do your thing with the line
    iterateThroughHugeString(beginning + steps)  # and go again, only with this offset
For instance:
delimiter = "//"  # example value
max_split = 3  # example value: stop at the 3rd delimiter
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter:  # skip last '\n'
        i += 1
        if i >= max_split:
            break
    s += l
fd.close()
Since you are learning Python it would be a challenge to model a complete dynamic solution. Here's a notion of how you can model one.
Note: The following code snippet only works for file(s) which is/are in the given format (see the 'For Instance' in the question). Hence, it is a static solution.
num = int(input("Enter delimiter: ")) * 2
with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num - 1)])
Now that you have the idea, you can use pattern matching and so on.
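For a string that is already in memory, one way to avoid splitting everything is to search for the delimiter with str.find and stop at the nth hit. A minimal sketch, assuming the delimiter never occurs inside a record (before_nth is a made-up name):

```python
def before_nth(text, delimiter, n):
    """Return everything in text before the nth occurrence of delimiter."""
    if n < 1:
        raise ValueError("n must be at least 1")
    start = 0
    for _ in range(n):
        pos = text.find(delimiter, start)
        if pos == -1:
            return text  # fewer than n delimiters: return everything
        start = pos + len(delimiter)  # continue searching after this hit
    return text[:pos]

sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(before_nth(sample, "//", 2), end="")
```

Only the part of the string up to the nth delimiter is scanned, and the single slice at the end is the only copy that gets made.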
I have a txt file in which I need to search for a specific line. That part is working, but in that line I need to strip the first 14 characters, and the part of the list element I am interested in is dynamically generated at run time. So the scenario is: I ran a script and saved its output in output.txt, and now I am parsing it. Here is what I have tried:
load_profile = open('output.txt', "r")
read_it = load_profile.read()
myLines = []
for line in read_it.splitlines():
    if line.find("./testSuites/") > -1:
        myLines.append(line)
print(myLines)
which gives output:
['*** Passed :) at ./testSuites/TS1/2013/06/17/15.58.12.744_14']
I need only the ./testSuites/TS1/2013/06/17/15.58.12.744_14 part, and 2013 and the rest of the string are dynamically generated.
Could you please tell me what the best way to achieve this would be?
Thanks in advance
Urmi
Use slicing:
>>> strs = 'Passed :) at ./testSuites/TS1/2013/06/17/15.58.12.744_14'
>>> strs[13:]
'./testSuites/TS1/2013/06/17/15.58.12.744_14'
Update : use lis[0] to access the string inside that list.
>>> lis = ['*** Passed :) at ./testSuites/TS1/2013/06/17/15.58.12.744_14']
>>> strs = lis[0]
>>> strs[17:] # I think you need 17 here
'./testSuites/TS1/2013/06/17/15.58.12.744_14'
You are asking how to strip the first 14 characters, but what if your strings don't always have that format in the future? Try splitting the string into substrings (removing whitespace) and then just get the substring with "./testSuites/" in it.
load_profile = open('output.txt', "r")
read_it = load_profile.read()
myLines = []
for line in read_it.splitlines():
    for splt in line.split():
        if "./testSuites/" in splt:
            myLines.append(splt)
print(myLines)
Here's how it works:
>>> pg = "Hello world, how you doing?\nFoo bar!"
>>> print(pg)
Hello world, how you doing?
Foo bar!
>>> lines = pg.splitlines()
>>> lines
['Hello world, how you doing?', 'Foo bar!']
>>> for line in lines:
...     for splt in line.split():
...         if "Foo" in splt:
...             print(splt)
...
Foo
>>>
Of course, if you do in fact have strict requirements on the formats of these lines, you could just use string slicing (strs[13:] as Ashwini says) or you could split the line and do splt[-1] (which means get the last element of the split line list).
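A middle ground between a fixed slice and splitting on whitespace is to locate the marker itself and slice from there, so the length of the prefix no longer matters. A minimal sketch (extract_from_marker is a made-up name; the marker default is the one from the question):

```python
def extract_from_marker(line, marker="./testSuites/"):
    """Return the part of line starting at marker, or None if marker is absent."""
    index = line.find(marker)
    if index == -1:
        return None
    return line[index:]

line = "*** Passed :) at ./testSuites/TS1/2013/06/17/15.58.12.744_14"
print(extract_from_marker(line))
```

Unlike the split() version, this keeps everything after the marker, including any spaces that might appear in the path.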
I have a particular block of stuff in a general file of many contents which is arbitrarily long, can contain any character, begins each line with a blank space and has the form in some text file:
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
I'm interested in the particular sequence which lies between \HF= and \RMSD=. I want to put these numbers into a Python list. This sequence is simply a series of numbers that are comma separated, however, these numbers can roll over onto a second line. ALSO, \HF= and \RMSD may be broken by rolling over onto a newline.
Current Efforts
I currently have the following:
with open(infile) as data:
    d1 = []
    start = '\\HF'
    end = 'RMSD'
    should_append = False
    for line in data:
        if start in line:
            data = line[len(start):]
            d1.append(data)
            should_append=True
        elif end in line:
            should_append = False
            break
        elif should_append:
            d1.append(line)
which spits out the following list
['.6184082129,7.5129238742\\\\Version=EM64L-G09RevC.01\\
State=1-A\\HF=-568\n', ' .8880019,-568.8879907,-568.8879686,
-568.887937,-\n']
The problem is not only do I have newlines throughout, I'm also keeping more data than I should. Furthermore, numbers that roll over onto other lines are given their own placement in the list. I need it to look like
['-568.8880019', '-568.8879907', ... ]
A multiline non-greedy regular expression can be used to extract the text that lies between \HF= and \RMSD=. Once the text is extracted, it should be trivially easy to tokenize it into its constituent numbers.
import re
pattern = r'\\HF=(.*?)\\RMSD='  # backslashes must be escaped inside the pattern
pat = re.compile(pattern, re.DOTALL)
with open('file.txt') as f:
    for number in pat.finditer(f.read()):
        print(number.group(1).replace('\n', '').replace(' ', '').strip('\\'))
...
-568 .8880019,-568.2343213, -568 .2343432, ... , -586.328492 1\
For a fast solution, you can implement a naive string concatenation based on regular expressions.
I implemented a short solution for your data format.
import re
def naiveDecimalExtractor(data):
    # sign and leading digits, the part around the decimal point, then trailing digits,
    # each optionally separated by whitespace or a line break
    # (a number broken right before its decimal point is not rejoined)
    p = re.compile(r"(-?\d+)[\n\s]*(\d+\.\d+)[\n\s]*(\d+)")
    brokenNumbers = p.findall(data)
    return ["".join(n) for n in brokenNumbers]
data = r"""
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
"""
print(naiveDecimalExtractor(data))
Regards,
And Past
Use something like this to join everything in one line:
with open(infile) as data:
    joined = ''.join(data.read().splitlines())
And then parse that without worrying about newlines.
If your file is really large you may want to consider another approach to avoid having it all in memory.
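The join-then-parse idea can be sketched as follows for the in-memory case. It assumes, as the question states, that each line begins with one blank that is not part of the data, so that blank is dropped before the lines are glued together; this also repairs \HF= or \RMSD= markers that were broken across lines. The name extract_hf_values is made up for illustration:

```python
import re

def extract_hf_values(text):
    """Return the comma-separated values between \\HF= and \\RMSD= as strings."""
    # Drop the single leading blank of each line, then glue the lines together.
    joined = "".join(
        line[1:] if line.startswith(" ") else line
        for line in text.splitlines()
    )
    match = re.search(r"\\HF=(.*?)\\RMSD=", joined)
    if match is None:
        return []
    return [value.strip() for value in match.group(1).split(",")]
```

The values are kept as strings, matching the list shown in the question; wrapping each one in float() would convert them if numbers are needed.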
How about something like this:
# open the file to read
f = open("test.txt")
# read the whole file, then concatenate the lines into one big string
text = " ".join(f.readlines())
f.close()
# get the substring between \HF= and \RMSD, then remove any newline or ' '
values = text[text.find("\\HF=") + 4:text.find("\\RMSD")].translate(str.maketrans("", "", "\n "))
# the string is now just numbers separated by commas, so split it into a list
# using the ',' delimiter
numbers = values.split(',')
Now numbers has:
['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']
I had something like this open and forgot to post - a "slightly" different answer that uses mmap'd files and re.finditer:
This has the advantage of dealing with larger files relatively efficiently as it allows the regex engine to see the file as one long string without it being in memory at once.
import mmap
import re
with open('/home/jon/blah.txt', 'rb') as fin:
    mfin = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap exposes bytes, so the pattern and the characters to delete are bytes too
    for match in re.finditer(rb'\\HF=(.*?)\\RMSD=', mfin, re.DOTALL):
        print(match.group(1).translate(None, b'\r\n ').decode().split(','))
        # ['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']
I am reading a file line by line and doing some text processing in order to get output in a certain format.
My string processing code goes as follows:
file1=open('/myfolder/testfile.txt')
scanlines=file1.readlines()
string = ''
for line in scanlines:
    if line.startswith('>from'):
        continue
    if line.startswith('*'):
        continue
    string.join(line.rstrip('\n'))
The output of this code is as follows:
abc
def
ghi
Is there a way to join these physical lines into one logical line, e.g:
abcdefghi
Basically, how can I concatenate multiple strings into one large string?
If I was reading from a file with very long strings is there the risk of an overflow by concatenating multiple physical lines into one logical line?
There are several ways to do this. For example, just using + should do the trick.
"abc" + "def" # produces "abcdef"
If you try to concatenate multiple strings you can do this with the join method:
', '.join(('abc', 'def', 'ghi')) # produces 'abc, def, ghi'
If you want no delimiter, use the empty string ''.join() method.
Cleaning things up a bit, it would be easiest to append to a list and then return the joined result:
def joinfile(filename):
    sarray = []
    with open(filename) as fd:
        for line in fd:
            if line.startswith('>from') or line.startswith('*'):
                continue
            sarray.append(line.rstrip('\n'))
    return ''.join(sarray)
If you wanted to get really cute you could also do the following:
with open(filename) as fd:
    joined = ''.join(line.rstrip('\n') for line in fd if not (line.startswith('>from') or line.startswith('*')))
Yes of course you could read a file big enough to overflow memory.
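If memory is a concern, one option is not to build the big string at all and instead write each kept line straight to an output file. A minimal sketch, using the same filters as above (join_lines_to is a made-up name, and io.StringIO stands in for real files here):

```python
import io

def join_lines_to(lines, out):
    """Write the kept lines to out without their newlines, one piece at a time."""
    for line in lines:
        if line.startswith(">from") or line.startswith("*"):
            continue
        out.write(line.rstrip("\n"))

source = io.StringIO(">from header\nabc\n*comment\ndef\nghi\n")
target = io.StringIO()
join_lines_to(source, target)
print(target.getvalue())  # abcdefghi
```

With real files, only one line is held in memory at a time, however long the resulting logical line becomes.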
Use string addition
>>> s = 'a'
>>> s += 'b'
>>> s
'ab'
I would prefer:
from functools import reduce  # reduce is no longer a builtin in Python 3
oneLine = reduce(lambda x, y: x + y,
                 [line.rstrip('\n') for line in open('/myfolder/testfile.txt')
                  if not line.startswith('>from') and
                  not line.startswith('*')])
line.rstrip('\n') removes the trailing \n from each line.
The second argument of reduce is a list comprehension that extracts all the lines you are interested in and removes the \n from them.
The reduce call (only if you actually need it) then makes one string from the list of strings.