Add new line after finding last string in a region - python

I have this input file, test.txt, with the desired output interleaved as #Expected lines (each should be added after the last line containing 1 1 1 1 within a *Title region),
and this code in Python 3.6:
index = 0
insert = False
currentTitle = ""
testfile = open("test.txt","r")
content = testfile.readlines()
finalContent = content
testfile.close()
# Should change the below line of code I guess to adapt
#titles = ["TitleX","TitleY","TitleZ"]
for line in content:
    index = index + 1
    for title in titles:
        if line in title+"\n":
            currentTitle = line
            print (line)
    if line == "1 1 1 1\n":
        insert = True
    if (insert == True) and (line != "1 1 1 1\n"):
        finalContent.insert(index-1, currentTitle[:6] + "2" + currentTitle[6:])
        insert = False
f = open("test.txt", "w")
finalContent = "".join(finalContent)
f.write(finalContent)
f.close()
Update:
Actual output with the answer provided
*Title Test
12125
124125
asdas 1 1 1 1
rthtr 1 1 1 1
asdasf 1 1 1 1
asfasf 1 1 1 1
blabla 1 1 1 1
#Expected "*Title Test2" here <-- it didn't add it
124124124
*Title Dunno
12125
124125
12763125 1 1 1 1
whatever 1 1 1 1
*Title Dunno2
#Expected "*Title Dunno2" here <-- This worked great
214142122
#and so on for thousands of them..
Also is there a way to overwrite this in the test.txt file?

Because you are already reading the entire file into memory anyway, it's easy to scan through the lines twice; once to find the last transition out of a region after each title, and once to write the modified data back to the same filename, overwriting the previous contents.
I'm introducing a dictionary variable transitions where the keys are the indices of the lines which have a transition, and the value for each is the text to add at that point.
transitions = dict()
in_region = False
reg_end = -1
current_title = None
with open("test.txt","r") as testfile:
    content = testfile.readlines()
for idx, line in enumerate(content):
    if line.startswith('*Title '):
        # Commit last transition before this to dict, if any
        if current_title:
            transitions[reg_end] = current_title
        # add suffix for printing
        current_title = line.rstrip('\n') + '2\n'
    elif line.strip().endswith(' 1 1 1 1'):
        in_region = True
        # This will be overwritten while we remain in the region
        reg_end = idx
    elif in_region:
        in_region = False
if current_title:
    transitions[reg_end] = current_title
with open("test.txt", "w") as output:
    for idx, line in enumerate(content):
        output.write(line)
        if idx in transitions:
            output.write(transitions[idx])
This kind of "remember the last time we saw something" loop is very common, but takes some time getting used to. Inside the loop, keep in mind that we are looping over all the lines, and remembering some things we saw during a previous iteration of this loop. (Forgetting the last thing you were supposed to remember when you are finally out of the loop is also a very common bug!)
The strip() before we look for 1 1 1 1 normalizes the input by removing any surrounding whitespace. You could do other kinds of normalizations, too; normalizing your data is another very common technique for simplifying your logic.
Demo: https://ideone.com/GzNUA5
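The same "remember, then commit" pattern can be sketched as a function over a list of lines, which makes it easy to try without touching a file. This is only an illustration under assumptions: the name add_region_markers and its marker and suffix parameters are made up here, and unlike the code above it forgets the remembered position at each new title, so a title with no matching lines adds nothing.

```python
def add_region_markers(lines, marker="1 1 1 1", suffix="2"):
    """Return a copy of lines with '<title><suffix>' added after the last
    line ending in marker under each '*Title ' line.
    Hypothetical helper: the name and defaults are assumptions."""
    transitions = {}       # line index -> text to add after that line
    current_title = None   # text to insert for the title we are under
    last_match = None      # index of the last marker line under that title

    for idx, line in enumerate(lines):
        if line.startswith("*Title "):
            # Commit what we remembered for the previous title, if anything
            if current_title is not None and last_match is not None:
                transitions[last_match] = current_title
            current_title = line.rstrip("\n") + suffix + "\n"
            last_match = None
        elif line.strip().endswith(marker):
            # Overwritten each time, so the last matching line wins
            last_match = idx

    # Do not forget the title that was still open when the loop ended
    if current_title is not None and last_match is not None:
        transitions[last_match] = current_title

    result = []
    for idx, line in enumerate(lines):
        result.append(line)
        if idx in transitions:
            result.append(transitions[idx])
    return result


sample = ["*Title Test\n", "12125\n", "asdas 1 1 1 1\n", "blabla 1 1 1 1\n",
          "124124124\n", "*Title Dunno\n", "whatever 1 1 1 1\n", "214142122\n"]
print("".join(add_region_markers(sample)), end="")
```

Writing the returned list back with open("test.txt", "w") would overwrite the file in the same way as the second loop above.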

Try this, using itertools.zip_longest:
from itertools import zip_longest
with open("test.txt","r") as f:
    content = f.readlines()
results, title = [], ""
# pair each line with the one after it; the last line is paired with ""
# (the default fill value None would break the "not in j" test below)
for i, j in zip_longest(content, content[1:], fillvalue=""):
    # extract title.
    if i.startswith("*"):
        title = i
    results.append(i)
    # compare value in i'th index with i+1'th (if mismatch add title)
    if "1 1 1 1" in i and "1 1 1 1" not in j:
        results.append(f'{title.strip()}2\n')
print("".join(results))

Related

Fill down/increment by 1 if line is the same as line above it

I am having a hard time trying to fill down a series of numbers based on the content of a line and the line immediately above it. I have a text file containing several lines of text, and I want to check if a line is equal to the line above it. If it is equal, then add 1, and if not then use 1. The input text is:
DOLORES
DOLORES
GENERAL LUNA
GENERAL LUNA
GENERAL LUNA
GENERAL NAKAR
GENERAL NAKAR
and the output I want is:
1
2
1
2
3
1
2
I tried this but it gives a different result:
fhand = open("input.txt")
fout = open("output.txt","w")
t = list()
ind = 0
for line in fhand:
    t.append(line)
for elem in t:
    if elem[0] or (not(elem[0]) and (elem[ind] == elem[ind-1])):
        ind += 1
    elif not(elem[0]) and (elem[ind] != elem[ind-1]):
        ind = 1
    fout.write(str(ind)+'\n')
fout.close()
How can I have the result that I want?
The primary problem is that you say you want to check for identical lines, but your code compares individual characters. Tracing your program by hand shows this.
See this lovely reference for debugging help.
You need to deal with lines, not characters. I've removed the file handling, as it is immaterial to your question.
t = [
    "DOLORES",
    "DOLORES",
    "GENERAL LUNA",
    "GENERAL LUNA",
    "GENERAL LUNA",
    "GENERAL NAKAR",
    "GENERAL NAKAR",
]
prev = None # There is no previous line on the first iteration
count = 1
for line in t:
    if line == prev:
        count += 1
    else:
        count = 1
        prev = line
    print(count)
Output:
1
2
1
2
3
1
2
Edit your conditions like this:
ind = 1 # the first line always gets 1
fout.write(str(ind)+'\n')
for elem in range(1,len(t)):
    if t[elem]==t[elem-1]:
        ind += 1
    else:
        ind = 1
    fout.write(str(ind)+'\n')
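Another way to express "count within a run of identical lines" is itertools.groupby, which groups consecutive equal items. The sketch below is an alternative to the loops above, not a correction of them; the function name run_counts is made up for illustration.

```python
from itertools import groupby


def run_counts(lines):
    """Number each line by its position within a run of identical lines."""
    counts = []
    for _key, run in groupby(lines):
        # enumerate from 1 so the first line of every run gets 1
        counts.extend(n for n, _line in enumerate(run, start=1))
    return counts


names = [
    "DOLORES", "DOLORES",
    "GENERAL LUNA", "GENERAL LUNA", "GENERAL LUNA",
    "GENERAL NAKAR", "GENERAL NAKAR",
]
print(run_counts(names))  # [1, 2, 1, 2, 3, 1, 2]
```

When reading from a file, strip the trailing newline from each line first, so that a last line without a newline still compares equal to the line above it.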

Python programming changing a specific value in file

I have a problem concerning my code:
with open('Premier_League.txt', 'r+') as f:
    data = [int(line.strip()) for line in f.readlines()] # [1, 2, 3]
    f.seek(0)
    i = int(input("Add your result! \n"))
    data[i] += 1 # e.g. if i = 1, data now [1, 3, 3]
    for line in data:
        f.write(str(line) + "\n")
    f.truncate()
print(data)
The code works as follows: the file "Premier_League.txt", containing for example:
1
2
3
with i=1,
is rewritten in place (the previous contents are replaced) as:
1
3
3
My problem is that I want to choose a specific value in a matrix (not only in a single column), for example:
0 0 0 0
0 0 0 0
0 0 0 0
where I would like to change it to, for example:
0 1 0 0
0 0 0 0
0 0 0 0
When I run this through my program, this appears:
ValueError: invalid literal for int() with base 10: '1 1 1 1'
So my question is: how do I change a specific value in a file that contains more than a vertical line of values?
The problem is that you are not handling the extra dimension. Try something like this:
with open('Premier_League.txt', 'r+') as f:
    # Note this is now a 2D matrix (nested list)
    data = [[int(value) for value in line.strip().split()] for line in f]
    f.seek(0)
    # We must specify both a row and a column
    i = int(input("Add your result to row! \n"))
    j = int(input("Add your result to column! \n"))
    data[i][j] += 1 # The first index is the row, the second is the column
    # Parse out the data and write back to file
    for line in data:
        f.write(' '.join(map(str, line)) + "\n")
    f.truncate()
print(data)
You could also use a generator expression to write to the file, for example:
    # Parse out the data and write back to file
    f.write('\n'.join(' '.join(map(str, line)) for line in data) + "\n")
instead of:
    # Parse out the data and write back to file
    for line in data:
        f.write(' '.join(map(str, line)) + "\n")
First up, you are trying to parse the string '0 0 0 0' as an int, that's the error you are getting. To fix this, do:
data = [[int(ch) for ch in line.strip().split()] for line in f.readlines()]
This will create a 2D array, where the first index corresponds to the row, and the second index corresponds to the column. Then, you would probably want the user to give you two values, instead of a singular i since you are trying to edit in a 2D array.
Edit:
So your following code will look like this:
i = int(input("Add your result row: \n"))
j = int(input("Add your result column: \n"))
data[i][j] += 1
# For data = [[1,2,1], [2,3,2]], and user enters i = 1
# and j = 0, the new data will be [[1,2,1], [3,3,2]]
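To try the parse, update, and write-back steps without a file or interactive input, they can be sketched as one function working on text. The name increment_cell and its parameters are made up for this illustration, and the bounds check is an addition that neither answer includes.

```python
def increment_cell(text, row, col, amount=1):
    """Parse whitespace-separated integers, bump one cell, return the new text.
    Hypothetical helper: name and parameters are assumptions."""
    data = [[int(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]
    # Refuse indices outside the matrix instead of failing deep in the update
    if not (0 <= row < len(data)) or not (0 <= col < len(data[row])):
        raise IndexError("row/col outside the matrix")
    data[row][col] += amount
    return "\n".join(" ".join(map(str, r)) for r in data) + "\n"


before = "0 0 0 0\n0 0 0 0\n0 0 0 0\n"
print(increment_cell(before, 0, 1), end="")
# 0 1 0 0
# 0 0 0 0
# 0 0 0 0
```

Note that negative indices are rejected here, whereas a plain data[i][j] would silently count from the end.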

How to find all instances of list values(ex: [1,2,3]) in a file at a specific index

I want to find out a list of elements in a file at a specific index.
For ex, below are the contents of the file "temp.txt"
line_0 1
line_1 2
line_2 3
line_3 4
line_4 1
line_5 1
line_6 2
line_7 1
line_8 2
line_9 3
line_10 4
Now, I need to find out the list of values [1,2,3] occurring in sequence at column 2 of each line in above file.
Output should look like below:
line_2 3
line_9 3
I have tried the logic below, but it is somehow not working:
inf = open("temp.txt", "rt")
count = 0
pos = 0
ListSeq = ["1","2","3"]
for line_no, line in enumerate(inf):
    arr = line.split()
    if len(arr) > 1:
        if count == 1 :
            pos = line_no
        if ListSeq[count] == arr[1] :
            count += 1
        elif count > 0 :
            inf.seek(pos)
            line_no = pos
            count = 0
        else :
            count = 0
        if count >= 3 :
            print(line)
            count = 0
Can somebody help me in finding the issue with above code? or even a different logic which will give a correct output is also fine.
Your code is flawed. Most prominent bug: trying to seek in a text file using line number is never going to work: you have to use byte offset for that. Even if you did that, it would be wrong because you're iterating on the lines, so you shouldn't attempt to change file pointer while doing that.
My approach:
The idea is to "transpose" your file to work with vertical vectors, find the sequence in the 2nd vertical vector, and use the found index to extract data on the first vertical vector.
split lines to get text & number, zip the results to get 2 vectors: 1 of numbers 1 of text.
At this point, one list contains ["line_0","line_1",...] and the other one contains ["1","2","3","4",...]
Find the indexes of the sequence in the number list, and print the couple txt/number when found.
Code:
with open("temp.txt") as f:
    sequence = ('1','2','3')
    txt,nums = list(zip(*(l.split()[:2] for l in f))) # [:2] in case there are more columns
for i in range(len(nums)-len(sequence)+1):
    if nums[i:i+len(sequence)]==sequence:
        # i+2 is the last element of the 3-item sequence
        print("{} {}".format(txt[i+2],nums[i+2]))
result:
line_2 3
line_9 3
The last for loop can be replaced by a list comprehension to generate the tuples:
result = [(txt[i+2],nums[i+2]) for i in range(len(nums)-len(sequence)+1) if nums[i:i+len(sequence)]==sequence]
result:
[('line_2', '3'), ('line_9', '3')]
Generalizing for any sequence and any column.
sequence = ['1','2','3']
col = 1
with open(filename, 'r') as infile:
    idx = 0
    for _i, line in enumerate(infile):
        if line.strip().split()[col] == sequence[idx]:
            if idx == len(sequence)-1:
                print(line)
                idx = 0
            else:
                idx += 1
        else:
            idx = 0
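A single counter like idx can miss a match when the mismatching line is itself the start of a new match (for example the column values 1 1 2 3). One way around that is to keep the last few values in a fixed-size window and compare the whole window each time. The sketch below assumes that approach; find_sequence_ends is a made-up name, and treating short lines as breaking the run is a choice made for this example.

```python
from collections import deque


def find_sequence_ends(lines, sequence, col=1):
    """Return the lines on which `sequence` finishes in column `col`."""
    window = deque(maxlen=len(sequence))  # keeps only the most recent values
    target = list(sequence)
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) <= col:
            window.clear()  # a blank or short line breaks the run
            continue
        window.append(fields[col])
        if list(window) == target:
            hits.append(line.rstrip("\n"))
    return hits


# Same column values as temp.txt in the question
lines = ["line_%d %s\n" % (n, v) for n, v in enumerate("12341121234")]
print(find_sequence_ends(lines, ["1", "2", "3"]))
# ['line_2 3', 'line_9 3']
```

Because the function only iterates once over its input, an open file object can be passed in place of the list.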

Splicing through a line of a textfile using python

I am trying to create genetic signatures. I have a textfile full of DNA sequences. I want to read in each line from the text file. Then add 4mers which are 4 bases into a dictionary.
For example: Sample sequence
ATGATATATCTATCAT
What I want to add is ATGA, TGAT, GATA, etc.. into a dictionary with ID's that just increment by 1 while adding the 4mers.
So the dictionary will hold...
Genetic signatures, ID
ATGA,1
TGAT, 2
GATA,3
Here is what I have so far...
import sys
def main ():
    readingFile = open("signatures.txt", "r")
    my_DNA=""
    DNAseq = {} #creates dictionary
    for char in readingFile:
        my_DNA = my_DNA+char
    for char in my_DNA:
        index = 0
        DnaID=1
        seq = my_DNA[index:index+4]
        if (DNAseq.has_key(seq)): #checks if the key is in the dictionary
            index= index +1
        else :
            DNAseq[seq] = DnaID
            index = index+1
            DnaID= DnaID+1
    readingFile.close()
if __name__ == '__main__':
    main()
Here is my output:
ACTC
ACTC
ACTC
ACTC
ACTC
ACTC
This output suggests that it is not iterating through each character in string... please help!
You need to move your index and DnaID declarations before the loop, otherwise they will be reset every loop iteration:
index = 0
DnaID=1
for char in my_DNA:
    #... rest of loop here
Once you make that change you will have this output:
ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11
CAT 12
AT 13
T 14
In order to avoid the last 3 items which are not the correct length you can modify your loop:
for i in range(len(my_DNA)-3):
    #... rest of loop here
This doesn't loop through the last 3 characters, making the output:
ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11
This should give you the desired effect.
from collections import defaultdict
# read the whole file and drop surrounding whitespace such as the final newline
readingFile = open("signatures.txt", "r").read().strip()
DNAseq = defaultdict(int)
window = 4
for i in range(len(readingFile)):
    current_4mer = readingFile[i:i+window]
    if len(current_4mer) == window:
        DNAseq[current_4mer] += 1
print(DNAseq)
index is being reset to 0 each time through the loop that starts with for char in my_DNA:.
Also, I think the loop condition should be something like while index <= len(my_DNA)-4: to be consistent with the loop body.
Your index counters reset themselves since they are in the for loop.
May I make some further suggestions? My solution would look like this:
readingFile = open("signatures.txt", "r")
my_DNA=""
DNAseq = {} #creates dictionary
for line in readingFile:
    line = line.strip()
    my_DNA = my_DNA + line
readingFile.close()
ID = 1
index = 0
# slicing never raises IndexError, so stop explicitly at the last full 4mer
while index <= len(my_DNA) - 4:
    seq = my_DNA[index:index+4]
    if seq not in DNAseq.values(): # skip 4mers that are already stored
        DNAseq[ID] = seq
        ID += 1
    index += 1 # slide the window by one base
But what do you want to do with duplicates? E.g., if a sequence like ATGC appears twice? Should both be added under a different ID, for example {...1:'ATGC', ... 200:'ATGC',...} or shall those be omitted?
If I'm understanding correctly, you are counting how often each sequential string of 4 bases occurs? Try this:
def split_to_4mers(filename):
    dna_dict = {}
    with open(filename, 'r') as f:
        # assuming the first line of the file, only, contains the dna string
        # strip() drops the trailing newline so it never ends up in a 4mer
        dna_string = f.readline().strip()
    for idx in range(len(dna_string)-3):
        seq = dna_string[idx:idx+4]
        count = dna_dict.get(seq, 0)
        dna_dict[seq] = count+1
    return dna_dict
output on a file that contains only "ATGATATATCTATCAT":
{'TGAT': 1, 'ATCT': 1, 'ATGA': 1, 'TCAT': 1, 'TATA': 1, 'TATC': 2, 'CTAT': 1, 'ATCA': 1, 'ATAT': 2, 'GATA': 1, 'TCTA': 1}
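The answers above mostly count how often each 4mer occurs. If the goal is instead what the question describes, one ID per distinct 4mer, numbered in the order they first appear, a dictionary keyed by the 4mer is enough. This is a sketch under that reading of the question; kmer_ids and its k parameter are made-up names, and repeated 4mers simply keep their first ID.

```python
def kmer_ids(dna, k=4):
    """Map each distinct k-mer to an ID, numbered from 1 in first-seen order."""
    ids = {}
    # len(dna) - k + 1 start positions give only full-length windows
    for start in range(len(dna) - k + 1):
        kmer = dna[start:start + k]
        if kmer not in ids:
            ids[kmer] = len(ids) + 1
    return ids


signature = kmer_ids("ATGATATATCTATCAT")
for kmer, kmer_id in signature.items():
    print(kmer, kmer_id)
```

On Python 3.7 and later the dictionary keeps insertion order, so the loop prints the 4mers in ID order, starting with ATGA 1, TGAT 2, GATA 3.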

Read in file contents to Arrays

I am a bit familiar with Python. I have a file with information that I need to read in a very specific way. Below is an example...
1
6
0.714285714286
0 0 1.00000000000
0 1 0.61356352337
...
-1 -1 0.00000000000
0 0 5.13787636499
0 1 0.97147643932
...
-1 -1 0.00000000000
0 0 5.13787636499
0 1 0.97147643932
...
-1 -1 0.00000000000
0 0 0 0 5.13787636499
0 0 0 1 0.97147643932
....
So every file will have this structure (tab delimited).
The first line must be read in as a variable as well as the second and third lines.
Next we have four blocks of data separated by a -1 -1 0.0000000000 line. Each block is 'n' lines long. The first two numbers give the position at which the 3rd number in the line is to be inserted in an array. Only the unique positions are listed (so position 0 1 would be the same as 1 0, but the latter is not shown).
Note: The 4th block of data has a 4-index number.
What I need
The first 3 lines read in as unique variables
Each block of data read into an array using the first 2 (or 4 ) column of numbers as the array index and the 3rd column as the value being inserted into an array.
Only unique array elements shown. I need the mirrored position to be filled with the proper value as well (a 0 1 value should also appear in 1 0).
The last block would need to be inserted into a 4-dimensional array.
I rewrote the code. Now it's almost what you need; it only needs fine tuning.
I decided to leave the old answer below, since it may be helpful too,
because the new version is feature-rich and may not always be easy to follow.
def the_function(filename):
    """
    returns tuple of list of independent values and list of sparsed arrays as dicts
    e.g. ( [1,2,0.5], [{(0,0):1,(0,1):2},...] )
    on fail prints the reason and returns None:
    e.g. 'failed on text.txt: invalid literal for int() with base 10: '0.0', line: 5'
    """
    # open file and read content
    try:
        with open(filename, "r") as f:
            data_txt = [line.split() for line in f]
    # no such file
    except IOError as e:
        print('fail on open ' + str(e))
        return
    # try to get the first 3 variables
    try:
        vars = [int(data_txt[0][0]), int(data_txt[1][0]), float(data_txt[2][0])]
    except ValueError as e:
        print('failed on '+filename+': '+str(e)+', somewhere on lines 1-3')
        return
    # now get arrays
    arrays = [dict()]
    for lineidx, item in enumerate(data_txt[3:]):
        try:
            # for 2d array data
            if len(item) == 3:
                i, j = map(int, item[:2])
                val = float(item[-1])
                # check for 'block separator'
                if (i,j,val) == (-1,-1,0.0):
                    # make new array
                    arrays.append(dict())
                else:
                    # update last, existing
                    arrays[-1][(i,j)] = val
            # almost the same for 4d array data
            if len(item) == 5:
                i, j, k, m = map(int, item[:4])
                val = float(item[-1])
                arrays[-1][(i,j,k,m)] = val
        # if value is unparsable like '0.00' for int or 'text'
        except ValueError as e:
            print('failed on '+filename+': '+str(e)+', line: '+str(lineidx+3))
            return
    return vars, arrays
As I understand what you asked for:
# read data from file into list
parsed=[]
with open(filename, "r") as f:
    for line in f:
        # # you can exclude separator here with such code (uncomment) (1)
        # # be careful: one zero more, one zero less and it wouldn't work
        # if line.split() == ['-1', '-1', '0.00000000000']:
        #     continue
        parsed.append(line.split())
# a simpler version
with open(filename, "r") as f:
    # # you can exclude separator here with such code (uncomment, replace) (2)
    # parsed = [line.split() for line in f if line.split() != ['-1', '-1', '0.00000000000']]
    parsed = [line.split() for line in f]
# at this point 'parsed' is a list of lists of strings.
# [['1'],['6'],['0.714285714286'],['0', '0', '1.00000000000'],['0', '1', '0.61356352337'] .. ]
# ALT 1 -------------------------------
# we do know the len of each data block
# get the first 3 lines:
head = parsed[:3]
# get the body:
body = parsed[3:-2]
# get the last 2 lines:
tail = parsed[-2:]
# now you can do anything you want with your data
# but remember to convert str to int or float
# first3 as unique:
unique0 = int(head[0][0])
unique1 = int(head[1][0])
unique2 = float(head[2][0])
# cast body:
# check each item of body has 3 inner items
is_correct = all(map(lambda item: len(item)==3, body))
# parse str and cast
if is_correct:
    for i, j, v in body:
        # # you can exclude separator here (uncomment) (3)
        # # * 0. is the same as float(0)
        # if (int(i), int(j), float(v)) == (-1,-1,0.):
        #     # here we skip iteration for line w/ '-1 -1 0.0...'
        #     # but you can place another code that will be executed
        #     # at the point where block-termination lines appear
        #     continue
        some_body_cast_function(int(i), int(j), float(v))
else:
    raise Exception('incorrect body')
# cast tail
# check each item of tail has 5 inner items
is_correct = all(map(lambda item: len(item)==5, tail))
# parse str and cast
if is_correct:
    for i, j, k, m, v in tail: # 'l' is bad index, because similar to 1.
        some_tail_cast_function(int(i), int(j), int(k), int(m), float(v))
else:
    raise Exception('incorrect tail')
# ALT 2 -----------------------------------
# we do NOT know the len of each data block
# maybe we have some array?
array = dict() # your array may be other type
v1,v2,v3 = parsed[:3]
unique0 = int(v1[0])
unique1 = int(v2[0])
unique2 = float(v3[0])
for item in parsed[3:]:
    if len(item) == 3:
        i,j,v = item
        i = int(i)
        j = int(j)
        v = float(v)
        # # you can exclude separator here (uncomment) (4)
        # # * 0. is the same as float(0)
        # # logic is the same as in 3rd variant
        # if (i,j,v) == (-1,-1,0.):
        #     continue
        # do your stuff
        # for example,
        array[(i,j)]=v
        array[(j,i)]=v
    elif len(item) ==5:
        i, j, k, m, v = item
        i = int(i)
        j = int(j)
        k = int(k)
        m = int(m)
        v = float(v)
        # do your stuff
    else:
        raise Exception('unsupported') # or, maybe just 'pass'
To read lines from a file iteratively, you can use something like:
with open(filename, "r") as f:
    var1 = int(next(f))
    var2 = int(next(f))
    var3 = float(next(f))
    for line in f:
        # do some stuff particular to the line we are on...
        pass
Just create some data structures outside the loop, and fill them in the loop above. To split strings into elements, you can use:
>>> "spam ham".split()
['spam', 'ham']
I also think you want to take a look at the numpy library for array data structures, and possibly the SciPy library for analysis.
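For the mirrored positions the question asks about, a plain nested list is enough to show the idea before moving to numpy. The sketch below is an illustration under assumptions: read_symmetric_block is a made-up name, the array size is passed in because the question does not say where it comes from, and the sample rows are invented rather than taken from the asker's file.

```python
def read_symmetric_block(rows, size):
    """Build a size x size list of lists from 'i j value' rows,
    writing every value to both (i, j) and (j, i)."""
    matrix = [[0.0] * size for _ in range(size)]
    for row in rows:
        i, j, value = row.split()
        i, j, value = int(i), int(j), float(value)
        if (i, j) == (-1, -1):
            break  # block separator: stop reading this block
        matrix[i][j] = value
        matrix[j][i] = value  # fill the mirrored position too
    return matrix


# Invented sample rows; the last row belongs to the next block and is ignored
block = ["0 0 1.0", "0 1 0.5", "1 1 2.0", "-1 -1 0.0", "0 0 9.9"]
print(read_symmetric_block(block, 2))
# [[1.0, 0.5], [0.5, 2.0]]
```

Because split() with no argument splits on any whitespace, the same code works for tab-delimited rows.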
