Counting no. of characters between headers in python - python

I have the following dataset. My code below finds each line that starts with 'Query_', searches it for '*' characters, and prints the letters directly under each '*' on the following lines, up to the next 'Query_' line.
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
Query_13 31 MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI 210
002947167 7 IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI 66
004993505 1 MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL 60
006961234 1 MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV 62
008089018 1 MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV 60
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
I want to print only if there are at least 50 letters under the '*' between the Query_ lines. Any help would be great!
lines = [line.rstrip() for line in open('infile.txt')]
for line in lines:
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query_"):
        star_indicies = [i for i,c in enumerate(sequence) if c == '*']
    else:
        print(list(sequence[star_index] for star_index in star_indicies))

Break it down into steps
First find all the lines with headers, and mark whether they contain asterisks:
headers = [[i, "*" in l.split()[2]] for i, l in enumerate(lines)
           if l.startswith("Query_")]
So now you have a list of lists, each containing two values:
the index into lines of the header, and
whether that header contains an asterisk.
Now you can iterate over it
for i, header in enumerate(headers[:-1]):  # All but last
    if not header[1]:
        continue  # No asterisk
    this_header = header[0]
    next_header = headers[i+1][0]
    if (next_header - this_header - 1) < 50:
        continue  # Not enough rows
    ...
The ... above is where you put the code to figure out which columns of lines[this_header] contain asterisks and then extract those columns from lines[this_header+1] through lines[next_header-1].
I've left that bit for you, as your question is underspecified:
Does the file end with a "Query_" header line?
If not, how do you deal with the case where the final header line has asterisks and is followed by 100 more lines?
What do you mean by "print the letters under it"?
But this should get you started.
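One possible way to fill in the missing piece, as a minimal sketch rather than a definitive answer. It assumes that "letters under it" means the characters in the same column positions of the third whitespace-separated field on the rows that follow, and that the threshold of 50 applies to the total number of such letters in a block. The function name letters_under_stars and the parameter min_letters are made up for this illustration.

```python
def letters_under_stars(lines, min_letters=50):
    """Collect the letters found under each '*' of a Query_ header.

    Returns a list of (header_name, letters) pairs, keeping only the
    blocks that gathered at least min_letters letters.
    """
    blocks = []  # each entry: [header_name, star_columns, letters]
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "name start sequence end" row
        name, sequence = fields[0], fields[2]
        if name.startswith("Query_"):
            # A new header starts a new block.
            stars = [i for i, c in enumerate(sequence) if c == "*"]
            blocks.append([name, stars, []])
        elif blocks:
            # A row under the most recent header: pick the star columns.
            stars, letters = blocks[-1][1], blocks[-1][2]
            letters.extend(sequence[i] for i in stars if i < len(sequence))
    return [(name, letters) for name, stars, letters in blocks
            if len(letters) >= min_letters]


sample = [
    "Query_1 1 AB*D*F 6",
    "x1 1 ABCDEF 6",
    "x2 1 GHIJKL 6",
    "Query_2 1 ABCDEF 6",
    "x3 1 MNOPQR 6",
]
found = letters_under_stars(sample, min_letters=4)
```

Because the last block is closed by the end of the input rather than by another header, this version does not need the file to end with a "Query_" line.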

Related

Check if file line starts with a character

I want to read some files with Python that contain certain data I need.
The structure of the files is like this:
NAME : a280
COMMENT : drilling problem (Ludwig)
TYPE : TSP
DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
2 288 129
3 270 133
4 256 141
5 256 157
6 246 157
7 236 169
8 228 169
9 228 161
So, the file starts with a few lines that contain data I need, then there are some random lines I do NOT need, and then there are lines with numerical data that I do need. I read everything that I need to read just fine.
However, my problem is that I cannot find a way to bypass the random number of lines that are sandwiched between the data I need. The lines from file to file can be 1, 2 or more. It would be silly to hardcode some f.readline() commands in there to bypass this.
I have thought of some regular expression to check if the line is starting with a string, in order to bypass it, but I'm failing.
In other words, there can be more lines like "NODE_COORD_SECTION" that I don't need in my data.
Any help is highly appreciated.
Well you can simply check if every line is valid (stuff you need) and if it is not, you just skip it. For example:
for line in f:  # f is the open file
    line_list = line.split()
    if not line_list:
        continue  # blank line
    if line_list[0] in ['NAME', 'COMMENT', 'TYPE', ...]:
        pass  # header line you need
    elif len(line_list) == 3 and all(s.isdigit() for s in line_list):
        pass  # numeric data line you need
    else:
        continue  # anything else is skipped
It would be nice if you added some formatting to the lines of your file and showed some code, but here's what I would try.
I would first define a list of strings that indicate a valid line, then split the current line into a list of strings and check whether the first element matches any of the elements in that list of valid strings.
If the first string doesn't correspond to any of the strings in the list of valid strings, I would check whether the first element is an integer, and so on...
current_line = 'LINE OF TEXT FROM FILE'
VALID_WORDS = ['VALID_STR1','VALID_STR2','VALID_STR3']
elems = current_line.split()
valid_line = False
if elems and elems[0] in VALID_WORDS:
    # If the first str is in the list of valid words,
    # continue...
    valid_line = True
elif len(elems) == 3:
    # If it's not in the list of valid words BUT has 3
    # elements, check if the first one is an int
    try:
        int(elems[0])
        valid_line = True
    except ValueError:
        valid_line = False
if valid_line:
    # Your thing
    pass
else:
    # Not a valid line: skip it (use `continue` when this runs inside a loop over the file)
    pass
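The same idea can be sketched as one function, assuming that every header line you need contains a colon, that every data line has exactly three numeric fields, and that no data line contains a colon. The name parse_tsp_text is invented for this example, and the sample text is taken from the question.

```python
def parse_tsp_text(text):
    """Split a file's text into header fields and coordinate rows.

    Lines that are neither "KEY : value" nor three numbers
    (for example NODE_COORD_SECTION) are skipped.
    """
    header = {}
    coords = []
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
            continue
        parts = line.split()
        if len(parts) == 3:
            try:
                coords.append((int(parts[0]), float(parts[1]), float(parts[2])))
            except ValueError:
                pass  # three words, but not numbers: skip the line
    return header, coords


sample = """NAME : a280
COMMENT : drilling problem (Ludwig)
TYPE : TSP
DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
2 288 129
3 270 133
"""
header, coords = parse_tsp_text(sample)
```

No line has to be skipped by count here, so it does not matter how many unwanted lines sit between the header and the coordinates.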

Deleting specific columns of each row

I'm new to Python and right now I'm out of ideas.
What I'm trying to do: I have a file,
for example:
254 578 name1 *--21->28--* secname1
854 548 name2 *--21->28--* secname2
944 785 name3 *--21->28--* secname3
1025 654 name4 *--21->28--* secname4
In those lines there are a lot of spaces, and I want to remove specific spaces between "name*" and "secname*" in each row. I don't know how to remove the characters/spaces at positions 21 to 28, as marked in the example.
What I have so far:
fobj_in = open("85488_66325_R85V54.txt")
fobj_out = open("85488_66325_R85V54.txt","w")
for line in fobj_in:
fobj_in.close()
fobj_out.close()
At the end it should look like:
254 578 name1 secname1
854 548 name2 secname2
944 785 name3 secname3
1025 654 name4 secname4
To remove characters by specific index positions you have to use slicing:
for line in open('85488_66325_R85V54.txt'):
    newline = line[:21] + line[29:]
    print(newline)
This removes the characters in columns 21 to 28 (which are all whitespace in your example).
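A small sketch of the slicing step on its own may make the index arithmetic easier to check. The helper name drop_columns is hypothetical, and it treats both bounds as inclusive, zero-based positions.

```python
def drop_columns(line, start, stop):
    """Return line without the characters at positions start..stop (inclusive)."""
    # line[:start] keeps everything before position start;
    # line[stop + 1:] keeps everything after position stop.
    return line[:start] + line[stop + 1:]


shortened = drop_columns("0123456789", 3, 5)
```

With start=21 and stop=28 this is the same expression as line[:21] + line[29:].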
Just split the line and pop the element you don't need.
fobj_in = open('85488_66325_R85V54','r')
fobj_out = open('85488_66325_R85V54.txt', 'a')
for line in fobj_in:
    items = line.split()
    items.pop(3)
    fobj_out.write(' '.join(items)+'\n')
fobj_in.close()
fobj_out.close()
You can just use the string object's split method, like so:
f = open('my_file.txt', 'r')
data = f.readlines()
final_data = []
for line in data:
    bits = line.split()
    final_data.append([bits[0], bits[1], bits[2], bits[4]])
Basically I'm just illustrating how to use that split method to break each line into individual chunks, at which point you can do whatever you wish, like print all of those bits and selectively discard one of the columns.
I can suggest a robust method to correct the input line.
#!/usr/bin/env ipython
# -----------------------------------
line = '254 578 name1 *--21->28--* secname1'
# -----------------------------------
def correctline(line, marker='*'):
    status = 0  # 1 while inside a marked region, 0 otherwise
    lineout = ''
    for val in line:
        if val == marker:
            status = abs(status - 1)  # toggle at each marker
            continue
        if status == 0:
            lineout = lineout + val
    # collapse runs of spaces into a single space
    while '  ' in lineout:
        lineout = lineout.replace('  ', ' ')
    return lineout
# ------------------------------------
print(correctline(line))
Basically, it loops through the characters of the input line. When it finds the marker it skips the text up to the next marker, and finally it collapses runs of spaces into one space.
If the names are of varying lengths and you don't want to just remove a set number of spaces between them, you can search for blank characters to find where secname begins and name ends:
# open file in "read" mode
fobj_in = open("85488_66325_R85V54.txt", "r")
# use readlines to create a list, each member containing a line of 85488_66325_R85V54.txt
lines = fobj_in.readlines()
# For each line search from the end backwards for the first " " char
# when this char is found create first_name which is a list containing the
# elements of line from here onwards and a second list which is the elements up to
# this point. Then search for a non " " char and remove the blank spaces.
# remaining_line and first_name can then be concatenated back together using
# + with the desired number of spaces between them (in this case 12).
for line_number, line in enumerate(lines):
    first_name_found = False
    new_line_created = False
    for i in range(len(line)):
        # compare strings with == / !=, not with "is"
        if line[-i] == " " and not first_name_found:
            first_name = line[-i+1:]
            remaining_line = line[:-i+1]
            first_name_found = True
            for j in range(len(remaining_line)):
                if remaining_line[-j-1] != " " and not new_line_created:
                    new_line = remaining_line[0:-j] + " "*12 + first_name
                    new_line_created = True
    lines[line_number] = new_line
Then just write lines back to 85488_66325_R85V54.txt.
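The backwards search can also be written with str.rsplit, which splits off the last whitespace-separated field directly. This is only a sketch of that alternative, not the answer's own method. The function name set_gap_before_last_field and the gap parameter are invented for illustration.

```python
def set_gap_before_last_field(line, gap=12):
    """Rewrite line so exactly `gap` spaces precede its last field."""
    body = line.rstrip("\n")
    parts = body.rsplit(None, 1)  # split once, from the right, on whitespace
    if len(parts) < 2:
        return line  # zero or one field: nothing to adjust
    head, last = parts
    newline = "\n" if line.endswith("\n") else ""
    return head.rstrip() + " " * gap + last + newline


fixed = set_gap_before_last_field("254 578 name1          secname1\n", gap=2)
```

Only the gap before the last field changes, so names of varying lengths are handled without counting columns.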
You could try to do it as follows:
for line in fobj_in:
    setstring = line
    print(setstring.replace(" ", ""))

How do you split a list by space in python?

How do you split a list by space? The code below reads a file with 4 lines of 7 numbers separated by spaces. When it takes a line and splits it, it seems to split by character, so if I print item[0], 5 is printed instead of 50. Here is the code:
def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        print(data)
        item = data.split()
        print(data[0])
main()
the file looks like this
50 60 15 100 60 15 40 /n
100 145 20 150 145 20 45 /n
50 245 25 120 245 25 50 /n
100 360 30 180 360 30 55 /n
Split takes as its argument the character you want to split your string on (with no argument, it splits on any run of whitespace).
I invite you to read the documentation of the methods you are using. :)
EDIT: By the way, readline returns a string, not a list.
However, split does return a list.
import nltk
tokens = nltk.word_tokenize(TextInTheFile)
Try this once you have opened that file.
TextInTheFile is a variable
There's not a lot wrong with what you are doing, except that you are printing the wrong thing.
Instead of
print(data[0])
use
print(item[0])
data[0] is the first character of the string you read from file. You split this string into a variable called item so that's what you should print.
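The difference between indexing the string and indexing the split list can be seen in a short sketch. It uses the first line of the question's file as a literal string instead of reading from disk.

```python
data = "50 60 15 100 60 15 40\n"  # what readline() returns: one string

item = data.split()       # a list of the whitespace-separated pieces
first_char = data[0]      # indexing the string gives a single character
first_number = item[0]    # indexing the list gives a whole field

# The fields are still strings; convert them if numbers are needed.
numbers = [int(x) for x in item]
```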

Python while loop error - condition ignored by python?

Here is the code I am running:
#Import torsions.dat
f = open("torsions.txt")
next = f.readline().strip()
length = next #Store the first line as length
#Store the values for phi and psi in lists
phi = []
psi = []
while next != "":
    next = f.readline().strip().split(" ")
    phi.append(float(next[0]))
    psi.append(float(next[1]))
But I get this error:
[screenshot of the error message]
The file torsions.txt contains this:
20
60 61
62 63
64 65
There is no space after 65. The four lines follow one another directly (i.e. there's no blank line in between). The underscores are just for clarity; they're not in the txt.
The loop stops the script because of the error, and the part after the loop doesn't run.
phi and psi get populated as required, and at that point the loop should stop, but it looks like it doesn't.
Could you help?
next = f.readline().strip().split(" ")
readline will return an empty string '' when done reading the file.
split will give [''], and you can't convert '' to a float.
Also [''] != ''.
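One way to avoid the problem is to iterate over the file object, so that the loop ends by itself at end of file, and to skip lines that do not have two fields. This is a minimal sketch. The function name read_torsions is made up, and io.StringIO stands in for the real torsions.txt.

```python
import io


def read_torsions(f):
    """Read the length from the first line, then phi/psi pairs."""
    length = int(f.readline().strip())
    phi, psi = [], []
    for line in f:              # stops cleanly at end of file
        parts = line.split()    # split() with no argument ignores extra spaces
        if len(parts) < 2:
            continue            # blank or malformed line
        phi.append(float(parts[0]))
        psi.append(float(parts[1]))
    return length, phi, psi


length, phi, psi = read_torsions(io.StringIO("20\n60 61\n62 63\n64 65"))
```

This also avoids reusing the name next, which shadows the built-in function of the same name.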

Correct use of split()

I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}
with open(fasta,'r') as f:
    for ln in f:
        query = ln.split('\t')[0]
        query.strip()
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value
for x in dataHash:
    print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?
Split on the first pipe, then on the space:
with open(fasta,'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
I used str.partition() here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
...     query, value = ln.partition('|')[0].split()
...     print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value
This really means: First time I see query, store an empty string, otherwise store value. I am not sure why you do this; if there are no other lines starting with Lasal_00030, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if statement.
Note that dict.has_key() has been deprecated (and it no longer exists in Python 3); it is better to use in to test for a key:
if query not in dataHash:
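Putting the pieces together in Python 3 syntax, as a sketch only. It assumes you want to keep every value seen for a query rather than just the last one, which goes slightly beyond what the question states. The function name collect_hits is hypothetical.

```python
def collect_hits(lines):
    """Map each query name to the list of first pipe-separated IDs seen for it."""
    data = {}
    for ln in lines:
        if "|" not in ln:
            continue  # not a hit line in the expected format
        # Everything before the first pipe is "query value".
        query, value = ln.partition("|")[0].split()
        data.setdefault(query, []).append(value)
    return data


sample = [
    "Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236",
    "Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240",
    "Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3",
]
hits = collect_hits(sample)
```

Using setdefault removes the need for any explicit key test, so a query that appears only once still gets its value stored.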
