I'm having trouble getting the output in the right order when using nested for loops.
I have a list of integers:
[7, 9, 12]
And a .txt with lines of DNA sequence data.
>Ind1 AACTCAGCTCACG
>Ind2 GTCATCGCTACGA
>Ind3 CTTCAAACTGACT
I am trying to write a nested for loop that takes the first integer (7), goes through the lines of text, and prints the character at position 7 of each line. It then takes the next integer and prints the character at position 9 of each line, and so on.
with (Input) as getletter:
    for line in getletter:
        if line [0] == ">":
            for pos in position:
                snp = line[pos]
                print line[pos], str(pos)
When I run the above code, I get the data I want, but in the wrong order, like so:
A 7
T 9
G 12
T 7
A 9
G 12
T 7
C 9
A 12
What I want is this:
A 7
T 7
T 7
T 9
A 9
C 9
G 12
G 12
A 12
I suspect the problem can be solved by changing the indentation of the code, but I cannot get it right.
------EDIT--------
I've tried swapping the two loops around, but I am obviously missing the bigger picture, as this gives me the same (wrong) result as above.
with (Input) as getsnps:
    for line in getsnps:
        if line[0] == ">":
            hit = line
            for pos in position:
                print hit[pos], pos
Trying an answer:
with (Input) as getletter:
    # keep only the lines that start with '>'
    lines=[x.strip() for x in getletter.readlines() if x.startswith('>') ]
for pos in position:
    for line in lines:
        snp = line[pos]
        print ("%s\t%s" % (pos,snp))
The file is read and cached in a list (lines), discarding any line that does not start with >.
We then iterate over the positions, then over the lines, and print the expected result.
Please note that you should check that each position is smaller than the length of the line.
Alternative without list comprehension (this will use more memory, especially if you have a lot of useless lines, i.e. lines not starting with '>'):
with (Input) as getletter:
    lines=getletter.readlines()
for pos in position:
    for line in lines:
        if line.startswith('>'):
            snp = line[pos]
            print ("%s\t%s" % (pos,snp))
Alternative with another storage (assuming position is small and Input is big):
with (Input) as getletter:
    storage=dict()
    # one empty list per position
    for p in position:
        storage[p]=[]
    # single pass over the file, collecting the character at each position
    for line in getletter:
        for p in position:
            storage[p]+=[line[p]]
for (k,v) in storage.items():
    print ("%s -> %s" % (k, ",".join(v)))
If position contains a value that is not smaller than the length of the line, line[pos] raises an IndexError. You can either catch the exception:
try:
    a=line[pos]
except IndexError:
    a='X'
or test for it first:
if pos>=len(line):
    a='X'
else:
    a=line[pos]
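The loop order and the bounds check discussed above can be combined in one small function. This is only a sketch in Python 3: the function name chars_by_position, the missing placeholder, and the in-memory sample list are assumptions made for illustration, not part of the original code.

```python
def chars_by_position(lines, positions, missing="X"):
    """Return (character, position) pairs grouped by position, then by line."""
    # Keep only the lines that start with '>', without their trailing newline.
    records = [line.rstrip("\n") for line in lines if line.startswith(">")]
    result = []
    for pos in positions:          # outer loop: one position at a time
        for record in records:     # inner loop: every line for that position
            # Fall back to the placeholder when the line is too short.
            char = record[pos] if pos < len(record) else missing
            result.append((char, pos))
    return result


# Hypothetical stand-in for the lines of the .txt file from the question.
sample = [
    ">Ind1 AACTCAGCTCACG\n",
    ">Ind2 GTCATCGCTACGA\n",
    ">Ind3 CTTCAAACTGACT\n",
]

for char, pos in chars_by_position(sample, [7, 9, 12]):
    print(char, pos)
```

With the sample data from the question this prints the nine pairs in the wanted order (all of position 7 first, then 9, then 12). The key point is that the positions drive the outer loop, which requires the lines to be available more than once, here as a list.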
I am trying to extract some data from a file. For that purpose I have written a script that reads the file and, when a keyword is detected, starts copying; when it finds a blank line, it stops copying. I think it is not too bad, but it is not working.
The Python script I wrote is:
def out_to_mop (namefilein, namefileout):
    print namefilein
    filein=open(namefilein, "r")
    fileout=open(namefileout, "w")
    lines = filein.readlines()
    filein.close()
    #look for keyword "CURRENT.." to start copying
    try:
        indexmaxcycle = lines.index(" CURRENT BEST VALUE OF HEAT OF FORMATION")
        indexmaxcycle += 5
    except:
        indexmaxcycle = 0
    if indexmaxcycle != 0:
        while lines[indexmaxcycle]!=" \n":
            linediv = lines[indexmaxcycle].split()
            symbol = linediv[0]
            x = float(linediv[1])
            indexmaxcycle += 1
            fileout.write("%s \t %3.8f 1 \n" %(symbol, x))
    else:
        print "structure not found"
        exit()
    fileout.close()
This function is supposed to extract info from this file called file1.out:
CURRENT BEST VALUE OF HEAT OF FORMATION = -1161.249249
cycles=200 pm6 opt singlet eps=80 charge=-1
C -3.87724655 +1 1.30585983 +1 4.53273224 +1
H -7.60628859 +1 0.53968618 +1 3.72680573 +1
O -4.76978297 +1 4.45409715 +1 1.42608903 +1
H -4.66890488 +1 4.47267425 +1 2.41952335 +1
H -5.59468165 +1 3.93399792 +1 1.27757138 +1
**********************
* *
* JOB ENDED NORMALLY *
* *
**********************
but it prints "structure not found"
Would you help me a bit?
You try to find the beginning of the structure with the code line
indexmaxcycle = lines.index(" CURRENT BEST VALUE OF HEAT OF FORMATION")
The documentation for the index method says, "Return zero-based index in the list of the first item whose value is x. Raises a ValueError if there is no such item." However, that line you are searching for is not one of the file lines. The actual file line is
CURRENT BEST VALUE OF HEAT OF FORMATION = -1161.249249
Note the number at the end, which is not in your search string. Therefore, the index method raises an exception and you get an indexmaxcycle value of zero.
Since you apparently do not know the full contents of the file line in advance, you should loop through the input lines yourself and use the in operator to find a line that contains your search string. You could also use the startswith string method in this way:
for j, line in enumerate(lines):
    if line.startswith(" CURRENT BEST VALUE OF HEAT OF FORMATION"):
        indexmaxcycle = j + 5
        break
else:
    indexmaxcycle = 0
I dropped the try..except structure here, since I see no way an exception could be raised for this code. I could be wrong, of course.
You are looking for an exact match, but the line in the textfile is longer than the pattern you are looking for. Try to search for the beginning of the line instead:
pattern = " CURRENT BEST VALUE OF HEAT OF FORMATION"
try:
    indexmaxcycle = [i for (i,s) in enumerate(lines) if s.startswith(pattern)][0]
    indexmaxcycle += 5
etc.
[i for (i,s) in enumerate(lines) if s.startswith(pattern)] gives you all indices of elements that start with your pattern. If you add the [0] you get the first one.
I just noticed you can speed this up if you use generator expressions instead of list comprehensions:
pattern = " CURRENT BEST VALUE OF HEAT OF FORMATION"
try:
    indexmaxcycle = next((i for (i,s) in enumerate(lines) if s.startswith(pattern))) + 5
except StopIteration:
    etc.
This will only search the list until it finds the first match.
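As a variation on the generator approach, next also accepts a default value, which avoids the try/except altogether. The sketch below is in Python 3; the helper name find_start_index and its offset parameter are assumptions chosen for illustration.

```python
def find_start_index(lines, pattern, offset=5):
    """Return the index of the first line starting with pattern, plus offset.

    Returns None when no line matches.
    """
    # next() stops at the first match; the second argument is the value
    # returned when the generator is exhausted without a match.
    first = next((i for i, s in enumerate(lines) if s.startswith(pattern)), None)
    return None if first is None else first + offset


# Hypothetical miniature input; the real file has more lines.
example_lines = [
    "some header\n",
    " CURRENT BEST VALUE OF HEAT OF FORMATION = -1161.249249\n",
    "cycles=200 pm6 opt singlet eps=80 charge=-1\n",
]
start = find_start_index(example_lines, " CURRENT BEST VALUE OF HEAT OF FORMATION")
```

Returning None instead of 0 for "not found" keeps a genuine match on the first line distinguishable from a failed search.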
I need to create a script that reads four lines and, if a condition is met, reads the next four lines in the file, and so on. If the condition isn't met, the script must restart the test from the second line of the previously read block, so that the first line of what would have been the next block becomes the new fourth line. For instance, I want to retrieve all the blocks that sum to 4 from the following file.
printf "1\n1\n1\n1\n2\n1\n1\n1\n1" > file1.txt #In BASH
Lines 1 to 4 sum to 4, so they produce a positive result. Lines 5 to 8 sum to 5, so they produce a negative result, and the sum must be redone starting at the 6th line and ending at the 9th; these sum to 4 and therefore give a positive result. I'm aware that I could do something like this,
with open("file1.txt") as infile:
    while not EOF:
        lines = []
        for i in range(next N lines):
            lines.append(infile.readline())
        make_the_sum(lines)
but this moves the reader forward four lines and makes it impossible to go back if the sum is larger than 4. How can I achieve this effect? Note that my files are large and I can't load them whole into memory.
I am simplifying by ignoring the end-of-file issue. You could use tell and seek to recover an earlier position (you could save as many positions as you need in a list), say:
>>> with open('testmedium.txt') as infile:
...     times = 0
...     EOF = 0
...     while not EOF:
...         pos = infile.tell()
...         print(f"\nPosition is {pos}")
...         lines = []
...         for i in range(4):
...             lines.append(infile.readline())
...         [print(l[:20]) for l in lines]
...         if times==0 and '902' in lines[0]:
...             times = 1
...             infile.seek(pos)
...         elif '902' in lines[0]:
...             break
Position is 0
271,848,690,44,511,5
132,427,793,452,85,6
62,617,183,843,456,3
668,694,659,691,242,
Position is 125
902,550,177,290,828,
326,603,623,79,803,5
803,949,551,947,71,8
661,881,124,382,126,
Position is 125
902,550,177,290,828,
326,603,623,79,803,5
803,949,551,947,71,8
661,881,124,382,126,
>>>
The following code will read lines into a "cache" (just a list) and do some work on the cached lines when the cache has four lines. If the test passes, the cache gets cleared. If the test fails, the cache is updated to contain only the last three lines of the cache. You can do additional work in the if-else blocks as necessary.
def passes_test(lines, target_value=4):
    return sum([int(line) for line in lines]) == target_value

with open('file1.txt') as f:
    cached = []
    for line in f:
        cached.append(line)
        if len(cached) == 4:
            if passes_test(cached):
                cached = []
            else:
                cached = cached[1:]
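The same caching idea can be written with collections.deque, whose maxlen argument drops the oldest line automatically when a fifth one is appended. This is a sketch only: the generator name find_blocks and its parameters are assumptions for illustration, and it takes any iterable of lines (such as an open file), so it never holds more than four lines in memory.

```python
from collections import deque


def find_blocks(lines, size=4, target=4):
    """Yield (first_line_number, values) for each block whose sum is target."""
    window = deque(maxlen=size)  # the oldest value falls out once it is full
    for lineno, line in enumerate(lines, start=1):
        window.append(int(line))
        if len(window) == size and sum(window) == target:
            yield lineno - size + 1, list(window)
            window.clear()       # block accepted: start a fresh block
        # On failure nothing is needed: the next append slides the window
        # forward by one line.


# The contents of file1.txt from the question, as an in-memory list.
data = ["1\n", "1\n", "1\n", "1\n", "2\n", "1\n", "1\n", "1\n", "1\n"]
blocks = list(find_blocks(data))
```

For the sample data this finds one block starting at line 1 and one starting at line 6, matching the behaviour described in the question. With a real file, pass the open file object instead of the list.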
As Martijn has suggested,
with open("file1.txt") as f:
    rd = lambda: int(next(f))
    try:
        a, b, c, d = rd(), rd(), rd(), rd()
        while True:
            if a + b + c + d == 4:
                # found a block
                a, b, c, d = rd(), rd(), rd(), rd()
            else:
                # nope
                a, b, c, d = b, c, d, rd()
    except StopIteration:
        # found end of file
        pass
I am fairly new to Python so please be patient, this is probably simple. I am trying to build an adjacency list representation of a graph. In this particular representation I decided to use list of lists where the first value of each sublist represents the tail node and all other values represent head nodes. For example, the graph with edges 1->2, 2->3, 3->1, 1->3 will be represented as [[1,2,3],[2,3],[3,1]].
Running the following code on this edge list gives a problem I do not understand.
The edge list (Example.txt):
1 2
2 3
3 1
3 4
5 4
6 4
8 6
6 7
7 8
The Code:
def adjacency_list(graph):
    graph_copy = graph[:]
    g_tmp = []
    nodes = []
    for arc in graph_copy:
        choice_flag_1 = arc[0] not in nodes
        choice_flag_2 = arc[1] not in nodes
        if choice_flag_1:
            g_tmp.append(arc)
            nodes.append(arc[0])
        else:
            idx = [item[0] for item in g_tmp].index(arc[0])
            g_tmp[idx].append(arc[1])
        if choice_flag_2:
            g_tmp.append([arc[1]])
            nodes.append(arc[1])
    return g_tmp
# Read input from file
g = []
with open('Example.txt') as f:
    for line in f:
        line_split = line.split()
        new_line = []
        for element in line_split:
            new_line.append(int(element))
        g.append(new_line)
print('File Read. There are: %i items.' % len(g))
graph = adjacency_list(g)
During runtime, when the code processes arc 6 7 (second to last line in file), the following lines (found in the else statement) append 7 not only to g_tmp but also to graph_copy and graph.
idx = [item[0] for item in g_tmp].index(arc[0])
g_tmp[idx].append(arc[1])
What is happening?
Thank you!
J
P.S. I'm running Python 3.5
P.P.S. I also tried replacing graph_copy = graph[:] with graph_copy = list(graph). Same behavior.
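The problem is in the lines
if choice_flag_1:
    g_tmp.append(arc)
When you append arc, you are appending a reference to the very same inner list that graph and graph_copy hold. graph[:] and list(graph) only make a shallow copy: the outer list is new, but the inner lists are shared. Appending to that inner list later therefore changes all three. Append a new list instead, like so
if choice_flag_1:
    g_tmp.append([arc[0],arc[1]])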
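The difference between a shallow and a deep copy can be seen in a few lines. This is a minimal illustration with made-up variable names (shallow, deep) and a two-edge list; it is not taken from the original code.

```python
import copy

graph = [[6, 4], [8, 6]]

shallow = graph[:]            # new outer list, same inner list objects
deep = copy.deepcopy(graph)   # new outer list and new inner lists

# Mutating an inner list through the shallow copy also changes the original,
# because both outer lists refer to the same inner list.
shallow[0].append(7)

same_object = shallow[0] is graph[0]        # True: shared inner list
deep_is_independent = deep[0] == [6, 4]     # True: the deep copy is untouched
```

Building a fresh list for each arc, as in the fix above, has the same effect as a deep copy for this two-level structure.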
I am slightly confused about the start parameter of the enumerate function. I recently started working with Python and don't have much idea how it is supposed to work.
Suppose I have the example file below:
Test 1
Test 2
Test 3
This is the first line [SB WOM]|[INTERNAL REQUEST]|[START] which is the start of message
Name : Vaibhav
Designation : Technical Lead
ID : 123456
Company : Nokia
This is the sixth line [SB WOM]|[INTERNAL REQUEST]|[END] which is the end of message
Now when I run the code below:
import os

path =("C:/Users/vgupt021/Desktop")
in_file = os.path.join(path,"KSClogs_Test.txt")
fd = open(in_file,'r')
for linenum,line in enumerate(fd) :
    if "[SB WOM]|[INTERNAL REQUEST]|[START]" in line:
        x1 = linenum
        print x1
        break

for linenum,line in enumerate(fd,x1):
    if "[SB WOM]|[INTERNAL REQUEST]|[END]" in line:
        print linenum
        break
I get the line numbers 3 and 7, and I am not clear why they are not 3 and 8, since the index of the line containing "[SB WOM]|[INTERNAL REQUEST]|[END]" is 8 and not 7. How does the start parameter make a difference in the second loop?
Since the file iterator has already read the first four lines, the second for loop starts from where the first one stopped. The first loop stopped at line 3 (counting from 0), so the second loop starts at line 4.
Therefore, the enumerate of the second loop should start from x1 + 1, not x1, because the line with index x1 (the last line read by the first loop) was already consumed:
for linenum, line in enumerate(fd, x1+1):
    ...
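The effect can be reproduced without a file by using an explicit iterator, which behaves like an open file in that it remembers how far it has been read. This is a Python 3 sketch; the function name find_markers and the shortened [START]/[END] markers are assumptions made for illustration.

```python
def find_markers(lines, start_tag="[START]", end_tag="[END]"):
    """Return the zero-based line numbers of the start and end markers."""
    it = iter(lines)   # like a file object: consumed lines are not replayed
    start = end = None
    for linenum, line in enumerate(it):
        if start_tag in line:
            start = linenum
            break
    if start is not None:
        # The iterator resumes at line start + 1, so numbering resumes there.
        for linenum, line in enumerate(it, start + 1):
            if end_tag in line:
                end = linenum
                break
    return start, end


# Hypothetical stand-in for the log file from the question.
log = [
    "Test 1", "Test 2", "Test 3",
    "first line [START] of message",
    "Name : Vaibhav", "Designation : Technical Lead",
    "ID : 123456", "Company : Nokia",
    "sixth line [END] of message",
]
markers = find_markers(log)
```

Here markers is (3, 8); passing start instead of start + 1 to the second enumerate would give 7 for the end marker, as in the question.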
Try this code
x = range(10)
for i, e in enumerate(x):
    if i == 4:
        print i
        st = i
        break
for i, e in enumerate(x, st):
    print i
And you will see this output:
4
4 5 6 7 8 9 10 11 12 13
So, what does the second parameter of enumerate do? It is the starting value of the index. The iterable x is enumerated again from the beginning, but the value of i at each iteration is shifted by st.
Instead of having the values of i as 0, 1, 2, etc., we have 4, 5, 6, etc.
I think that explains why you have the incorrect line number in your code.
I am new to Python and am trying to write my dictionary values to a file using Python 2.7. Each value in my dictionary D is a list with at least 2 items.
The dictionary has TERM_ID as key and
values of the format [[DOC42, POS10, POS22], [DOC32, POS45]].
This means the TERM_ID (key) occurs in DOC42 at positions POS10 and POS22, and also in DOC32 at POS45.
So I have to write to a new file in the following format, with a new line for each TERM_ID:
TERM_ID (tab) DOC42:POS10 (tab) 0:POS22 (tab) DOC32:POS45
The following code should help you understand what exactly I am trying to do.
for key,valuelist in D.items():
    #first value in each list is an ID
    docID = valuelist[0][0]
    for lst in valuelist:
        file.write('\t' + lst[0] + ':' + lst[1])
        lst.pop(0)
        lst.pop(0)
        for n in range(len(lst)):
            file.write('\t0:' + lst[0])
            lst.pop(0)
The output I get is :
TERM_ID (tab) DOC42:POS10 (tab) 0:POS22
DOC32:POS45
I tried using the newline character, as well as commas to continue writing on the same line, in a number of places, but it did not work. I fail to understand how file writing really works.
Any kind of inputs will be helpful. Thanks!
@Falko I could not find a way to attach the text file, so here is my sample data:
879\t3\t1
162\t3\t1
405\t4\t1455
409\t5\t1
13\t6\t15
417\t6\t13
422\t57\t1
436\t4\t1
141\t8\t1
142\t4\t145
170\t8\t1
11\t4\t1
184\t4\t1
186\t8\t14
My sample running code is -
with open('sampledata.txt','r') as sample,open('result.txt','w') as file:
    d = {}
    #term= ''
    #docIndexLines = docIndex.readlines()
    #form a d with format [[doc a, pos 1, pos 2], [doc b, pos 3, pos 8]]
    for l in sample:
        tID = -1
        someLst = l.split('\\t')
        #if len(someLst) >= 2:
        tID = someLst[1]
        someLst.pop(1)
        #if term not in d:
        if not d.has_key(tID):
            d[tID] = [someLst]
        else:
            d[tID].append(someLst)

    #read the dictionary to generate result file
    docID = 0
    for key,valuelist in d.items():
        file.write(str(key))
        for lst in valuelist:
            file.write('\t' + lst[0] + ':' + lst[1])
            lst.pop(0)
            lst.pop(0)
            for n in range(len(lst)):
                file.write('\t0:' + lst[0])
                lst.pop(0)
My Output:
57 422:1
3 879:1
162:1
5 409:1
4 405:1455
436:1
142:145
11:1
184:1
6 13:15
417:13
8 141:1
170:1
186:14
Expected output:
57 422:1
3 879:1 162:1
5 409:1
4 405:1455 436:1 142:145 11:1 184:1
6 13:15 417:13
8 141:1 170:1 186:14
You probably don't get the result you're expecting because you didn't strip the newline characters \n while reading the input data. Try replacing
someLst = l.split('\\t')
with
someLst = l.strip().split('\\t')
To enforce the mentioned line breaks in your output file, add a
file.write('\n')
at the very end of your second outer for loop:
for key,valuelist in d.items():
    # ...
    file.write('\n')
Bottom line: write never adds a line break. If you do see one in your output file, it's in your data.
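Both fixes (stripping the input lines and adding the line break explicitly) can be seen together in the sketch below. It is written for Python 3 and makes several assumptions that are not in the original: the function name build_index_lines, fields separated by real tab characters rather than a literal backslash-t, and at least three fields per input line.

```python
from collections import OrderedDict


def build_index_lines(raw_lines):
    """Group rows by term ID and return one output line per term."""
    grouped = OrderedDict()  # keeps terms in order of first appearance
    for raw in raw_lines:
        fields = raw.strip().split("\t")   # strip() removes the trailing "\n"
        doc_id, term_id, positions = fields[0], fields[1], fields[2:]
        grouped.setdefault(term_id, []).append([doc_id] + positions)

    out = []
    for term_id, postings in grouped.items():
        parts = [term_id]
        for posting in postings:
            doc_id, positions = posting[0], posting[1:]
            parts.append(doc_id + ":" + positions[0])
            # Further positions in the same document use 0 as the document ID.
            parts.extend("0:" + p for p in positions[1:])
        out.append("\t".join(parts) + "\n")  # the only line break per term
    return out


rows = ["879\t3\t1\n", "162\t3\t1\n", "405\t4\t1455\n"]
result = build_index_lines(rows)
```

The returned strings can be written with file.writelines(result). Joining the parts and appending a single "\n" per term makes it explicit where each output line ends.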