I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}

with open(fasta,'r') as f:
    for ln in f:
        query = ln.split('\t')[0]
        query.strip()
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value

for x in dataHash:
    print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?
Split on the first pipe, then on the space:
with open(fasta,'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
I used str.partition() here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
... query, value = ln.partition('|')[0].split()
... print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value
This really means: the first time I see query, store an empty string; otherwise store value. I am not sure why you do this; if there are no other lines starting with Lasal_00030, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if statement.
Note that dict.has_key() has been deprecated; it is better to use in to test for a key:
if query not in dataHash:
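Putting it all together, the corrected loop could look something like this (a minimal sketch, assuming every line follows the two-column-then-pipe layout shown above):

dataHash = {}

with open(fasta,'r') as f:
    for ln in f:
        # Everything before the first pipe is "query value";
        # the rest of the line can be ignored entirely.
        query, value = ln.partition('|')[0].split()
        dataHash[query] = value

for x in dataHash:
    print x + " " + str(dataHash[x])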
I want to read some files with Python that contain certain data I need.
The structure of the files is like this:
NAME : a280
COMMENT : drilling problem (Ludwig)
TYPE : TSP
DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
2 288 129
3 270 133
4 256 141
5 256 157
6 246 157
7 236 169
8 228 169
9 228 161
So, the file starts with a few lines that contain data I need, then there are some random lines I do NOT need, and then there are lines with numerical data that I do need. I read everything that I need to read just fine.
However, my problem is that I cannot find a way to bypass the random number of lines that are sandwiched between the data I need. The number of such lines can vary from file to file: 1, 2 or more. It would be silly to hardcode some f.readline() calls in there to bypass them.
I have thought of using a regular expression to check whether a line starts with a certain string, in order to bypass it, but I'm failing.
In other words, there can be more lines like "NODE_COORD_SECTION" that I don't need in my data.
Any help is highly appreciated.
Well, you can simply check whether each line is valid (stuff you need) and skip it if it is not. For example:

line_list = line.split()
# Keep header lines that start with a known keyword
if line_list[0] in ['NAME', 'COMMENT', 'TYPE', ...]:
    pass  # a header line you need
# Keep data rows: exactly three whitespace-separated integers
elif len(line_list) == 3 and all(tok.isdigit() for tok in line_list):
    pass  # a coordinate line you need
else:
    continue  # skip anything else
It would be nice if you added some formatting to the "lines of your file" and showed some code, but here's what I would try.
I would first define a list of strings containing an indication of a valid line, then I would split the current line into a list of strings and check if the first element corresponds to any of the elements in a list of valid strings.
In case the first string doesn't corresponds to any of the strings in the list of valid strings, I would check if the first element is an integer, and so on...
current_line = 'LINE OF TEXT FROM FILE'
VALID_WORDS = ['VALID_STR1','VALID_STR2','VALID_STR3']

elems = current_line.split(' ')
valid_line = False
if elems[0] in VALID_WORDS:
    # If the first str is in the list of valid words,
    # continue...
    valid_line = True
elif len(elems)==3:
    # If it's not in the list of valid words BUT has 3
    # elements, check if it's an int
    try:
        valid_line = isinstance(int(elems[0]),int)
    except Exception as e:
        valid_line = False

if valid_line:
    # Your thing
    pass
else:
    # Not a valid line
    continue
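For completeness, here is a sketch of how the whole file might be read with this kind of filtering; the filename and the keyword list are assumptions for illustration, not from the question:

HEADER_KEYWORDS = ['NAME', 'COMMENT', 'TYPE', 'DIMENSION', 'EDGE_WEIGHT_TYPE']

headers = {}
coords = []
with open('a280.tsp') as f:  # hypothetical filename
    for line in f:
        elems = line.split()
        if not elems:
            continue
        if elems[0].rstrip(':') in HEADER_KEYWORDS:
            # Header line, e.g. "NAME : a280" or "DIMENSION: 280"
            key, _, value = line.partition(':')
            headers[key.strip()] = value.strip()
        elif len(elems) == 3 and all(tok.isdigit() for tok in elems):
            # Coordinate row, e.g. "1 288 149"
            coords.append(tuple(int(tok) for tok in elems))
        # Anything else (e.g. "NODE_COORD_SECTION") is skipped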
I am using spaCy to tag named entities in a list of lines. My current code is:
from spacy.en import English
parser = English()

for item in datalist:
    # parse the item (sentence)
    parsed = parser(unicode(item))
    # tag the lines
    ents = list(parsed.ents)
    # write to outfile
    for entity in ents:
        outfile.write(str(itemnumber) + '\t' + ' '.join(t.orth_ for t in entity) + '\n')
The spaCy stuff works fine, but somehow, an extra blank line is added to the outfile in some cases, as below:
...
165 it
165 it
165 it greater andre
166 6 12 14
166 solidarity paristoferguson
167 77
167 shooting deaths
167 cops
167 circumstances
...
Such blank lines are added sometimes, but not always, when the thing before the line break is a number.
I tried writing the tagged line to a string, then doing re.sub('\n\n', '\n', string) but the problem persists.
Edit: Adding .strip() to the item before parsing solved it:

parsed = parser(unicode(item).strip())
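If stripping the input were not an option, normalising whitespace when joining the tokens would also prevent embedded newlines from reaching the output file (a sketch, not from the original post):

for entity in ents:
    text = ' '.join(t.orth_ for t in entity)
    # str.split() with no argument collapses all whitespace, newlines included
    outfile.write(str(itemnumber) + '\t' + ' '.join(text.split()) + '\n')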
I have the following dataset. My code below will identify each line with the word 'Query_', search for an '*', and print the letters under it until the next 'Query_' line.
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
Query_13 31 MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI 210
002947167 7 IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI 66
004993505 1 MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL 60
006961234 1 MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV 62
008089018 1 MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV 60
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
I am looking to print only if there are at least 50 letters under the '*' between the Query_ lines. Any help would be great!
lines = [line.rstrip() for line in open('infile.txt')]

for line in lines:
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query_"):
        star_indicies = [i for i,c in enumerate(sequence) if c == '*']
    else:
        print(list(sequence[star_index] for star_index in star_indicies))
Break it down into steps
First find all the lines with headers, and mark whether they contain asterisks:
headers = [[i, "*" in l.split()[2]] for i, l in enumerate(lines)
           if l.startswith("Query_")]
So now you have a list of lists, each containing two values:
- Index into lines of the header
- Whether that header contains an asterisk
Now you can iterate over it
for i, header in enumerate(headers[:-1]):  # All but last
    if not header[1]:
        continue  # No asterisk
    this_header = header[0]
    next_header = headers[i+1][0]
    if (next_header - this_header - 1) < 50:
        continue  # Not enough rows
    ...
The ... above is where you put the code to figure out which columns of lines[this_header] contain asterisks and then extract those columns from lines[this_header+1] through lines[next_header-1].
I've left that bit for you, as your question is underspecified:
- Does the file end with a "Query_" header line?
- If not, how do you deal with the case where the final header line has asterisks and is followed by 100 more lines?
- What do you mean by "print the letters under it"?
But this should get you started.
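If it helps, the elided step could look roughly like the following (a sketch only, not part of the original answer; it assumes the aligned sequence is always the third whitespace-separated field and that the rows line up column-for-column with their header):

# Hypothetical body for the "..." above:
header_seq = lines[this_header].split()[2]
star_cols = [col for col, ch in enumerate(header_seq) if ch == '*']
for row in lines[this_header + 1:next_header]:
    row_seq = row.split()[2]
    print(''.join(row_seq[col] for col in star_cols))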
I have a .txt file looking like:
rawdata/d-0197.bmp 1 329 210 50 51
rawdata/c-0044.bmp 1 215 287 59 48
rawdata/e-0114.bmp 1 298 244 46 45
rawdata/102.bmp 1 243 126 163 143
I need to transform it in the following way:
-Before "rawdata", add the whole path, which is "/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/".
-Add a comma after ".bmp"
-Remove the first number (so the 1).
-Put the other four numbers into square brackets [].
It would look like:
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/d-0197.bmp, [329 210 50 51]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/c-0044.bmp, [215 287 59 48]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/e-0114.bmp, [298 244 46 45]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/102.bmp, [243 126 163 143]
I have done it, first by replacing "rawdata/" with nothing in a simple text editor, and then with python:
file=open('data.txt')
fout=open('data2.txt','w')

for line in file:
    line=line.rstrip()
    pieces=line.split('.bmp')
    pieces2=pieces[1].split()
    fout.write('/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/'+pieces[0]+'.bmp, '+'['+pieces2[1]+' '+pieces2[2]+' '+pieces2[3]+' '+pieces2[4]+']'+'\n')

fout.close()
But this file is going to be used in Matlab, so it would be much better to have an automatic process. How can I do the same in Matlab?
Thank you
Here you go:
infid = fopen('data.txt', 'r');
outfid = fopen('data2.txt', 'w');
dirStr = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/';

while ~feof(infid)
    inline = fgetl(infid);
    outline = [dirStr, regexprep(inline, ' 1 (\d* \d* \d* \d*)', ', [$1]')];
    fprintf(outfid, '%s\n', outline);
end

fclose(infid);
fclose(outfid);
What we've done there is to read each line from the input file, use a regular expression to make the changes to it, then write it out to the output file. There are probably better ways of applying the regular expression, but this was pretty quick.
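For comparison, here is the same regular-expression idea in Python (a sketch using the file names from the question):

import re

prefix = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/'
with open('data.txt') as fin, open('data2.txt', 'w') as fout:
    for line in fin:
        # " 1 329 210 50 51" -> ", [329 210 50 51]"
        fout.write(prefix + re.sub(r' 1 (\d+ \d+ \d+ \d+)', r', [\1]', line.rstrip()) + '\n')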
I have a huge input file that looks like this,
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
c2135 TRIUR3_3-P1 124 210 89.66
c2135 EMT17965 25 117 70.97
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
I would like to iterate inside each c, see whether all its lines have the same element in line[1], and print to 2 separate files:
1. the c's whose lines all contain the same element, and
2. the c's that do not.
In the case of c1914, since 2 of its elements are the same and 1 is not, it goes to file 2. So the desired 2 output files will look like this, file1.txt:
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
file2.txt
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c2135 TRIUR3_3-P1 124 210 89.66
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
This is what I tried,
oh1=open('result.txt','w')
oh2=open('result2.txt','w')

f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=line.split()
    protein=new_list[1]
    for i in range(1,len(protein)):
        (p, c) = protein[i-1], protein[i]
        if c == p:
            new_list.append(protein)
            oh1.write(line)
        else:
            oh2.write(line)
If I understand you correctly, you want to send all lines of your input file that share a first element txt1 to your first output file if the second element txt2 is the same across all of those lines; otherwise all of those lines go to the second output file. Here is a program that does that.
from collections import defaultdict

# Read in file line-by-line for the first time.
# Build up a dictionary of txt1 to the set of txt2s seen with it.
txt1totxt2 = defaultdict(set)
f=open('file.txt','r')
for line in f:
    lst = line.split()
    txt1=lst[0]
    txt2=lst[1]
    txt1totxt2[txt1].add(txt2)

# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.

# Read in file for second time, sending each line
# to the appropriate output file.
f.seek(0)
oh1=open('result1.txt','w')
oh2=open('result2.txt','w')
for line in f:
    lst = line.split()
    txt1=lst[0]
    if len(txt1totxt2[txt1]) == 1:
        oh1.write(line)
    else:
        oh2.write(line)
The program logic is very simple. For each txt1 it builds up the set of txt2s that it sees. When you're done reading the file, if a set has just one element, then you know there was only one distinct txt2 for that txt1; if the set has more than one element, then there were at least two different txt2s. Note that this means that if you only have one line in the input file with a particular txt1, it will always be sent to the first output file. There are ways round this if that is not the behaviour you want.
Note also that because the file is large, I've read it in line by line: lines=f.readlines() in your original program reads the whole file into memory at once. I've stepped through the file twice: the second pass does the output. If this increases the run time, you can restore the lines=f.readlines() approach instead of reading the file a second time. However, the program as it stands should be much more robust with very large files. Conversely, if your files are very large indeed, it would be worth looking at reducing the memory usage further (the dictionary txt1totxt2 could be replaced with something more optimal, albeit more complicated, if necessary).
Edit: there was a good point in the comments (now deleted) about the memory cost of this algorithm. To elaborate: the memory usage could be high, but it isn't as severe as storing the whole file. Rather, txt1totxt2 maps the first text of each line to the set of second texts seen with it, which is of the order of (number of unique first texts) * (average number of unique second texts per first text). This is likely to be a lot smaller than the file size, but the approach may require further optimization. The approach here is to get something simple going first; this can then be iterated on to optimize further if necessary.
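To make that concrete, after the first pass over the sample input above, txt1totxt2 would contain the following, which matches the desired file1.txt/file2.txt split in the question:

txt1totxt2 == {
    'c651':  set(['OS05T0-00', 'OS01T0-00']),   # 2 elements -> result2.txt
    'c1638': set(['MLOC_8.3']),                 # 1 element  -> result1.txt
    'c2135': set(['TRIUR3_3-P1', 'EMT17965']),  # 2 elements -> result2.txt
    'c1914': set(['OS02T0-00', 'OS08T0-00']),   # 2 elements -> result2.txt
}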
Try this...
import collections

parsed_data = collections.OrderedDict()

with open("input.txt", "r") as fd:
    for line in fd.readlines():
        line_data = line.split()
        key = line_data[0]
        key2 = line_data[1]
        if not parsed_data.has_key(key):
            parsed_data[key] = collections.OrderedDict()
        if not parsed_data[key].has_key(key2):
            parsed_data[key][key2] = [line]
        else:
            parsed_data[key][key2].append(line)

# now process the parsed data and write result files
fsimilar = open("similar.txt", "w")
fdifferent = open("different.txt", "w")

for key in parsed_data:
    if len(parsed_data[key]) == 1:
        f = fsimilar
    else:
        f = fdifferent
    for key2 in parsed_data[key]:
        for line in parsed_data[key][key2]:
            f.write(line)

fsimilar.close()
fdifferent.close()
Hope this helps