Splicing through a line of a textfile using python


I am trying to create genetic signatures. I have a text file full of DNA sequences. I want to read in each line from the text file, then add its 4mers (substrings of 4 bases) to a dictionary.
For example, given the sample sequence
ATGATATATCTATCAT
I want to add ATGA, TGAT, GATA, etc. to a dictionary, with IDs that increment by 1 as each 4mer is added.
So the dictionary will hold...
Genetic signatures, ID
ATGA, 1
TGAT, 2
GATA, 3
Here is what I have so far...
import sys
def main():
    readingFile = open("signatures.txt", "r")
    my_DNA = ""
    DNAseq = {}  # creates dictionary
    for char in readingFile:
        my_DNA = my_DNA + char
    for char in my_DNA:
        index = 0
        DnaID = 1
        seq = my_DNA[index:index+4]
        if (DNAseq.has_key(seq)):  # checks if the key is in the dictionary
            index = index + 1
        else:
            DNAseq[seq] = DnaID
            index = index + 1
            DnaID = DnaID + 1
    readingFile.close()
if __name__ == '__main__':
    main()
Here is my output:
ACTC
ACTC
ACTC
ACTC
ACTC
ACTC
This output suggests that it is not iterating through each character in the string... please help!

You need to move your index and DnaID declarations before the loop; otherwise they will be reset on every loop iteration:
index = 0
DnaID = 1
for char in my_DNA:
    # ... rest of loop here
Once you make that change you will have this output:
ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11
CAT 12
AT 13
T 14
To avoid the last 3 items, which are not the correct length, you can modify your loop:
for i in range(len(my_DNA)-3):
    # ... rest of loop here
This doesn't loop through the last 3 characters, making the output:
ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11

This should give you the desired effect.
from collections import defaultdict
readingFile = open("signatures.txt", "r").read()
DNAseq = defaultdict(int)
window = 4
for i in xrange(len(readingFile)):
    current_4mer = readingFile[i:i+window]
    if len(current_4mer) == window:
        DNAseq[current_4mer] += 1
print DNAseq
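The snippet above uses Python 2 syntax (xrange and the print statement). As a rough Python 3 sketch of the same counting idea, the following uses collections.Counter; the function name count_kmers and the choice to drop all whitespace (so newlines never end up inside a 4mer) are assumptions for illustration, not part of the original answer.

```python
from collections import Counter

def count_kmers(text, window=4):
    """Count every overlapping substring of length `window` in `text`."""
    # Assumption: whitespace (including newlines) is not part of the sequence.
    text = "".join(text.split())
    return Counter(text[i:i + window] for i in range(len(text) - window + 1))

counts = count_kmers("ATGATATATCTATCAT\n")
for kmer, n in counts.items():
    print(kmer, n)
```

Note that this counts occurrences, like the answer above; it does not assign incrementing IDs.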

index is being reset to 0 each time through the loop that starts with for char in my_DNA:.
Also, I think the loop condition should be something like while index <= len(my_DNA)-4: so that the loop stops at the last full 4mer and stays consistent with the loop body.

Your index counters reset themselves because they are initialized inside the for loop.
May I make some further suggestions? My solution would look like this:
readingFile = open("signatures.txt", "r")
my_DNA = ""
DNAseq = {}  # maps ID -> 4mer
for line in readingFile:
    line = line.strip()  # drop the trailing newline
    my_DNA = my_DNA + line
readingFile.close()
ID = 1
index = 0
# Slicing never raises IndexError, so stop explicitly at the last full 4mer
while index <= len(my_DNA) - 4:
    seq = my_DNA[index:index+4]
    if seq not in DNAseq.values():  # skip 4mers that are already stored
        DNAseq[ID] = seq
        ID += 1
    index += 1  # slide the window by one base
This version omits duplicates. But what do you want to do with them? E.g., if a sequence like ATGC appears twice, should both be added under different IDs, for example {...1:'ATGC', ... 200:'ATGC',...}, or should the repeats be omitted?

If I'm understanding correctly, you are counting how often each sequential string of 4 bases occurs? Try this:
def split_to_4mers(filename):
    dna_dict = {}
    with open(filename, 'r') as f:
        # assuming that only the first line of the file contains the DNA string
        dna_string = f.readline().strip()  # strip the trailing newline, if any
    for idx in range(len(dna_string)-3):
        seq = dna_string[idx:idx+4]
        count = dna_dict.get(seq, 0)
        dna_dict[seq] = count+1
    return dna_dict
Output on a file that contains only "ATGATATATCTATCAT":
{'TGAT': 1, 'ATCT': 1, 'ATGA': 1, 'TCAT': 1, 'TATA': 1, 'TATC': 2, 'CTAT': 1, 'ATCA': 1, 'ATAT': 2, 'GATA': 1, 'TCTA': 1}
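The question asks for incrementing IDs rather than counts, so here is a minimal Python 3 sketch of that variant. It assumes each distinct 4mer should get one ID, assigned in order of first appearance; the function name kmer_ids is hypothetical.

```python
def kmer_ids(sequence, k=4):
    """Map each distinct k-mer to an ID, numbered from 1 in order of first appearance."""
    ids = {}
    for i in range(len(sequence) - k + 1):  # stop at the last full k-mer
        kmer = sequence[i:i + k]
        if kmer not in ids:
            ids[kmer] = len(ids) + 1  # next unused ID
    return ids

signatures = kmer_ids("ATGATATATCTATCAT")
for kmer, kmer_id in signatures.items():
    print(kmer, kmer_id)
```

Using len(ids) + 1 as the next ID avoids keeping a separate counter that could be reset by accident, which was the bug in the question.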

Related

I am struggling a little on getting this code to work in python

Now honestly, I think this could be entirely wrong, as I don't really know what I am doing and just kind of threw some stuff together, so help would be appreciated.
This is the code I have, including the starting code that cannot be changed.
# DO NOT CHANGE ANY CODE IN THE MAIN FUNCTION
def main():
    input_file = open('strings.txt', 'r') # Open a file for reading
    for line in input_file: # Use a for loop to read each line in the file
        manipulate_text(line)
        print()
def manipulate_text(line):
    # Delete the following line, then implement the function as indicated
    line = line.upper()
    line = line.strip()
    letters = []
    for char in line:
        if char.isalpha():
            if char not in letters.count(line):
                letters[char] = 1
            else:
                letters[char] += 1
    for everyLetter in letters:
        print("{0} {1}".format(everyLetter, letters[everyLetter]))
The .txt file it uses just contains:
Csc.565
Magee, Mississippi
A stitch in time saves nine.
And these are the instructions I have been given. Also, .count is what needs to be used here, as shown in my code.
The manipulate_text() function accepts one string as input. The function should do the following with the string parameter:
⦁ Convert all the letters of the string to uppercase, strip the leading and trailing whitespace, and output the string.
⦁ Count and display the frequency of each letter in the string. Ignore all non-alpha characters.
For example, if this is the contents of strings.txt:
Csc.565
Magee, Mississippi
A stitch in time saves nine.
This would be the output of your program:
CSC.565
C 2
S 1
MAGEE, MISSISSIPPI
M 2
A 1
G 1
E 2
I 4
S 4
P 2
A STITCH IN TIME SAVES NINE.
A 2
S 3
T 3
I 4
C 1
H 1
N 3
M 1
E 3
V 1

Here's the code you wanted:
# DO NOT CHANGE ANY CODE IN THE MAIN FUNCTION
def main():
    input_file = open('strings.txt', 'r') # Open a file for reading
    for line in input_file: # Use a for loop to read each line in the file
        manipulate_text(line)
        print()
def manipulate_text(line):
    line = line.upper()
    line = line.strip()
    letters = {} # Dict[Char: No. of occurrences]
    print(line)
    for char in line:
        if char.isalpha():
            if char not in list(letters.keys()): # If char not in our dict
                letters[char] = 1 # One occurrence
            else:
                letters[char] += 1 # Add one occurrence
    for i in letters:
        print("{0} {1}".format(i, letters[i]))
main() # Call main
Output:
CSC.565
C 2
S 1
MAGEE, MISSISSIPPI
M 2
A 1
G 1
E 2
I 4
S 4
P 2
A STITCH IN TIME SAVES NINE.
A 2
S 3
T 3
I 4
C 1
H 1
N 3
M 1
E 3
V 1
In response to your comment:
# DO NOT CHANGE ANY CODE IN THE MAIN FUNCTION
def main():
    input_file = open('strings.txt', 'r') # Open a file for reading
    for line in input_file: # Use a for loop to read each line in the file
        manipulate_text(line)
        print()
def manipulate_text(line):
    line = line.upper()
    line = line.strip()
    letters = {} # Dict[Char: No. of occurrences]
    print(line)
    for char in line:
        if char.isalpha():
            if list(letters.keys()).count(char) == 0: # If char not in our dict
                letters[char] = 1 # One occurrence
            else:
                letters[char] += 1 # Add one occurrence
    for i in letters:
        print("{0} {1}".format(i, letters[i]))
main() # Call main
In response to your second comment:
Use these instead of manipulate_text():
If you don't care about ordering:
def manipulate_text(line):
    line = [i for i in line.upper() if i.isalpha()] # List comprehension!
    for i in set(line): # set() changes it to all unique keys, loses order
        print(i, line.count(i)) # .count()
If you care about ordering:
def manipulate_text(line):
    line = [i for i in line.upper() if i.isalpha()] # List comprehension!
    uniques = []
    for i in line:
        if i not in uniques:
            print(i, line.count(i)) # .count()
            uniques.append(i)
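The two short variants above skip the first requirement (printing the uppercased, stripped line). One way to combine both requirements while still using .count is sketched below; the helper name letter_frequencies is hypothetical, and dict.fromkeys is used only as an order-preserving way to de-duplicate the letters (dictionaries keep insertion order in Python 3.7+).

```python
def letter_frequencies(line):
    """Return (letter, count) pairs in order of first appearance, using list.count."""
    letters = [ch for ch in line.upper().strip() if ch.isalpha()]
    # dict.fromkeys keeps one entry per letter, in first-seen order
    return [(ch, letters.count(ch)) for ch in dict.fromkeys(letters)]

def manipulate_text(line):
    print(line.upper().strip())  # requirement 1: output the cleaned-up line
    for letter, count in letter_frequencies(line):
        print(letter, count)     # requirement 2: frequency of each letter

manipulate_text("Csc.565\n")
```

Splitting the counting into its own function makes it easy to check the result without capturing printed output.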

How to use Enumerate with Variable data properly?

I am trying to use enumerate on data stored in a variable, but the data is enumerated character by character instead of line by line. How can I get output in the format below?
The expected output comes when I use a with statement:
with open("sample.txt") as file:
    for num, line in enumerate(file):
        print(num, line)
Output:
0 sdasd
1 adad
2 adadf
But when I use a string:
data = "adklkahdjsa saljdahsd \nsjdksd"
for num, line in enumerate(data):
    print(num, line)
Output:
0 a
1 d
2 k
3 l
4 k
5 a
6 h
7 d
8 j
9 s
10 a
11
12 s ... so on

enumerate expects an iterable. In your example it takes the string as the iterable and iterates over each character.
It seems what you want is to iterate over each word in the text. To do that, you first need to split the string into words.
Example:
data.split(' ') # split on single spaces
Full example:
data = "adklkahdjsa saljdahsd \nsjdksd"
for num, line in enumerate(data.split(' ')):
    print(num, line)
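If the goal is instead to mirror the file example, where each item is a whole line, str.splitlines() may be a closer fit than splitting on spaces. A minimal sketch, assuming the same data string as in the question:

```python
data = "adklkahdjsa saljdahsd \nsjdksd"

# splitlines() breaks the string at line boundaries and drops the newline characters
numbered = list(enumerate(data.splitlines()))
for num, line in numbered:
    print(num, line)
```

Unlike iterating over a file, the lines produced this way do not carry a trailing "\n", so print does not emit an extra blank line.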

Add new line after finding last string in a region

I have this input test.txt file with the expected output interleaved in it as #Expected lines (a line should be added after the last line containing 1 1 1 1 within each *Title region),
and this code in Python 3.6:
index = 0
insert = False
currentTitle = ""
testfile = open("test.txt","r")
content = testfile.readlines()
finalContent = content
testfile.close()
# Should change the below line of code I guess to adapt
#titles = ["TitleX","TitleY","TitleZ"]
for line in content:
    index = index + 1
    for title in titles:
        if line in title+"\n":
            currentTitle = line
            print (line)
    if line == "1 1 1 1\n":
        insert = True
    if (insert == True) and (line != "1 1 1 1\n"):
        finalContent.insert(index-1, currentTitle[:6] + "2" + currentTitle[6:])
        insert = False
f = open("test.txt", "w")
finalContent = "".join(finalContent)
f.write(finalContent)
f.close()
Update:
Actual output with the answer provided
*Title Test
12125
124125
asdas 1 1 1 1
rthtr 1 1 1 1
asdasf 1 1 1 1
asfasf 1 1 1 1
blabla 1 1 1 1
#Expected "*Title Test2" here <-- it didn't add it
124124124
*Title Dunno
12125
124125
12763125 1 1 1 1
whatever 1 1 1 1
*Title Dunno2
#Expected "*Title Dunno2" here <-- This worked great
214142122
#and so on for thousands of them..
Also is there a way to overwrite this in the test.txt file?

Because you are already reading the entire file into memory anyway, it's easy to scan through the lines twice; once to find the last transition out of a region after each title, and once to write the modified data back to the same filename, overwriting the previous contents.
I'm introducing a dictionary variable transitions where the keys are the indices of the lines which have a transition, and the value for each is the text to add at that point.
transitions = dict()
in_region = False
reg_end = -1
current_title = None
with open("test.txt","r") as testfile:
    content = testfile.readlines()
for idx, line in enumerate(content):
    if line.startswith('*Title '):
        # Commit last transition before this to dict, if any
        if current_title:
            transitions[reg_end] = current_title
        # add suffix for printing
        current_title = line.rstrip('\n') + '2\n'
    elif line.strip().endswith(' 1 1 1 1'):
        in_region = True
        # This will be overwritten while we remain in the region
        reg_end = idx
    elif in_region:
        in_region = False
if current_title:
    transitions[reg_end] = current_title
with open("test.txt", "w") as output:
    for idx, line in enumerate(content):
        output.write(line)
        if idx in transitions:
            output.write(transitions[idx])
This kind of "remember the last time we saw something" loop is very common, but takes some time getting used to. Inside the loop, keep in mind that we are looping over all the lines, and remembering some things we saw during a previous iteration of this loop. (Forgetting the last thing you were supposed to remember when you are finally out of the loop is also a very common bug!)
The strip() before we look for 1 1 1 1 normalizes the input by removing any surrounding whitespace. You could do other kinds of normalizations, too; normalizing your data is another very common technique for simplifying your logic.
Demo: https://ideone.com/GzNUA5
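To experiment with the "remember the last match, commit it when the region ends" idea without touching a file, the same logic can be sketched as a function over a list of lines. This is a variant under stated assumptions, not the exact code above: the name add_title_markers and its parameters are hypothetical, and it resets the remembered position at every new title so that a region with no matching line gets no insert.

```python
def add_title_markers(lines, marker=" 1 1 1 1", suffix="2"):
    """Insert '<title><suffix>' after the last marker line of each *Title region."""
    insert_at = {}     # index of last marker line -> text to insert after it
    title = None       # text to insert for the region currently being scanned
    last_match = None  # index of the most recent marker line in this region
    for idx, line in enumerate(lines):
        if line.startswith("*Title "):
            # A new region starts: commit what was remembered for the previous one
            if title is not None and last_match is not None:
                insert_at[last_match] = title
            title = line.rstrip("\n") + suffix + "\n"
            last_match = None
        elif line.rstrip().endswith(marker):
            last_match = idx  # overwritten while more marker lines follow
    # Do not forget the final region once the loop is over
    if title is not None and last_match is not None:
        insert_at[last_match] = title
    out = []
    for idx, line in enumerate(lines):
        out.append(line)
        if idx in insert_at:
            out.append(insert_at[idx])
    return out

sample = [
    "*Title Test\n", "12125\n", "asdas 1 1 1 1\n", "blabla 1 1 1 1\n",
    "124124124\n", "*Title Dunno\n", "whatever 1 1 1 1\n", "214142122\n",
]
result = add_title_markers(sample)
print("".join(result), end="")
```

The returned list can then be written back with open("test.txt", "w") and writelines, as in the answer above.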

Try this, using itertools.zip_longest:
from itertools import zip_longest
with open("test.txt","r") as f:
    content = f.readlines()
results, title = [], ""
# fillvalue="" keeps the final pair from comparing against None
for i, j in zip_longest(content, content[1:], fillvalue=""):
    # extract title.
    if i.startswith("*"):
        title = i
    results.append(i)
    # compare value in i'th index with i+1'th (if mismatch add title)
    if "1 1 1 1" in i and "1 1 1 1" not in j:
        results.append(f'{title.strip()}2\n')
print("".join(results))

Finding the substring with the most repeats in a dictionary with dna sequences

The substring has to be 6 characters long. The number I'm getting is smaller than it should be.
First I wrote code to get the sequences from a file and put them in a dictionary, then wrote 3 nested for loops. The first iterates over the dictionary and gets a sequence in each iteration. The second takes each sequence and gets a 6-character substring from it; in each iteration, it increases the start index of the substring within the long sequence by 1. The third loop takes each substring from the second loop and counts how many times it appears in each long sequence.
I tried rewriting the code many times. I think I got very close. I checked whether the loops actually do their iterations, and they do. I even checked manually whether the counts for a substring in random sequences are the same as the program gives, and they are. Any ideas? Maybe a different approach? What debugger do you use for Python?
I added a file with 3 shortened sequences for testing. Maybe try a smaller substring, say 3 characters instead of 6: rep_len = 3
The code
matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# inserting each sequences from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line
for key, seq in seqs.items(): # getting one sequence in each iteration
    for pos in range(len(seq)): # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len: # checking no substring from the end of the sequence are selected
            repeat = seq[pos:pos + rep_len] # setting a substring
            if repeat not in matches: # checking if the substring was already scanned
                matches.append(repeat) # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items(): # iterating over each sequence
                    count += seq2.count(repeat) # counting the substring's repetitions
                if count > final_count: # if the count is greater than the previously saved greatest number
                    final_count = count # the new value is saved
                count = 0
print('repetitions: ', final_count) # printing
sequences.fasta

The code is not very clear, so it is a bit difficult to debug. I suggest rewriting.
Anyway, I (currently) just noted one small mistake:
if pos < len(seq) - rep_len:
Should be
if pos <= len(seq) - rep_len:
Currently, the last character in each sequence is ignored.
EDIT:
Here is a rewrite of your code that is clearer and might help you investigate the errors:
rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"
# Extract the data into a dictionary
with open(filename, "r") as f:
    for line in f:
        line = line.rstrip()
        if line[0] == '>':
            seq_count += 1
            name = seq_count
            seqs[name] = ''
        else:
            seqs[name] += line
# Store all the information, so that you can reuse it later
counter = {}
for key, seq in seqs.items():
    for pos in range(len(seq) - rep_len + 1):  # + 1 so the last substring is included
        repeat = seq[pos:pos + rep_len]
        if repeat in counter:
            counter[repeat] += 1
        else:
            counter[repeat] = 1
# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key=lambda item: item[1], reverse=True)
# Display the 5 max occurrences
for key, rep in sorted_counter[:5]:
    print("{} -> {}".format(key, rep))
# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9
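One more thing that might explain a count that is lower than expected (a guess, since the input file is not shown): str.count only counts non-overlapping occurrences, so repeats that overlap each other, which are common in sequences like GCGCGC..., are undercounted by the code in the question. A small sketch of the difference, with count_overlapping as a hypothetical helper:

```python
def count_overlapping(seq, sub):
    """Count occurrences of `sub` in `seq`, allowing matches to overlap."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the previous match start, not after its end
        start = seq.find(sub, start + 1)
    return count

seq = "GCGCGCGC"
print(seq.count("GCGCGC"))               # non-overlapping matches only
print(count_overlapping(seq, "GCGCGC"))  # matches starting at index 0 and index 2
```

Counting every start position in a dictionary, as in the rewrite above, also includes overlapping matches.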

It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.
An example:
from collections import Counter
from nltk.util import ngrams
sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"
def reps(substr):
    "Counts repeats in a substring"
    return sum([i for i in Counter(substr).values() if i>1])
def make_grams(sent, n=6):
    "splits a sentence into n-grams"
    return ["".join(seq) for seq in (ngrams(sent,n))]
grams = make_grams(sequence) # splits string into substrings
max_length = max(list(map(reps, grams))) # gets maximum repeat count
result = [dna for dna in grams if reps(dna) == max_length]
print(result)
Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']
And if the question is to look for the string with the most repeated character:
repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}
another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)
Output: ['cccggc', 'cccaac', 'ccaacc']

How to find all instances of list values(ex: [1,2,3]) in a file at a specific index

I want to find a list of elements in a file at a specific column.
For example, below are the contents of the file "temp.txt":
line_0 1
line_1 2
line_2 3
line_3 4
line_4 1
line_5 1
line_6 2
line_7 1
line_8 2
line_9 3
line_10 4
Now I need to find every place where the values [1,2,3] occur in sequence in column 2 of the file above.
The output should look like this:
line_2 3
line_9 3
I have tried the logic below, but it is somehow not working:
inf = open("temp.txt", "rt")
count = 0
pos = 0
ListSeq = ["1","2","3"]
for line_no, line in enumerate(inf):
    arr = line.split()
    if len(arr) > 1:
        if count == 1 :
            pos = line_no
        if ListSeq[count] == arr[1] :
            count += 1
        elif count > 0 :
            inf.seek(pos)
            line_no = pos
            count = 0
        else :
            count = 0
        if count >= 3 :
            print(line)
            count = 0
Can somebody help me find the issue with the above code? A different approach that gives the correct output is also fine.

Your code is flawed. The most prominent bug: seeking in a text file using a line number is never going to work; you have to use a byte offset for that. Even if you did that, it would be wrong because you're iterating over the lines, so you shouldn't attempt to change the file pointer while doing that.
My approach:
The idea is to "transpose" your file to work with vertical vectors, find the sequence in the 2nd vertical vector, and use the found index to extract data from the first vertical vector.
Split the lines to get text and number, then zip the results to get 2 vectors: one of text, one of numbers.
At this point, one list contains ["line_0","line_1",...] and the other one contains ["1","2","3","4",...]
Find the indexes of the sequence in the number list, and print the text/number pair when found.
Code:
with open("temp.txt") as f:
    sequence = ('1','2','3')
    txt,nums = list(zip(*(l.split()[:2] for l in f))) # [:2] in case there are more columns
for i in range(len(nums)-len(sequence)+1):
    if nums[i:i+len(sequence)]==sequence:
        print("{} {}".format(txt[i+2],nums[i+2])) # i+2 is the last line of the 3-item match
Result:
line_2 3
line_9 3
The last for loop can be replaced by a list comprehension to generate the tuples:
result = [(txt[i+2],nums[i+2]) for i in range(len(nums)-len(sequence)+1) if nums[i:i+len(sequence)]==sequence ]
Result:
[('line_2', '3'), ('line_9', '3')]

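Generalizing for any sequence and any column:
sequence = ['1','2','3']
col = 1
filename = "temp.txt"
with open(filename, 'r') as infile:
    idx = 0  # number of sequence items matched so far
    for line in infile:
        value = line.strip().split()[col]
        if value != sequence[idx]:
            idx = 0  # mismatch: restart, then re-test this line against the first item
        if value == sequence[idx]:
            if idx == len(sequence)-1:
                print(line, end='')
                idx = 0
            else:
                idx += 1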
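A counter-based scan like the one above can still miss matches when the pattern overlaps itself (for example ['1','1','2']). One alternative, sketched here under the assumption that the file fits the format shown in the question, is to keep the last len(sequence) column values in a fixed-size collections.deque and compare the whole window on every line. The function name find_sequence is hypothetical.

```python
from collections import deque

def find_sequence(lines, sequence, col=1):
    """Return the lines on which `sequence` finishes appearing in column `col`."""
    target = list(sequence)
    window = deque(maxlen=len(target))  # holds the most recent column values
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) <= col:
            window.clear()  # a line without that column breaks the run
            continue
        window.append(fields[col])  # the oldest value drops out automatically
        if list(window) == target:
            hits.append(line.rstrip("\n"))
    return hits

values = ["1", "2", "3", "4", "1", "1", "2", "1", "2", "3", "4"]
lines = ["line_{} {}\n".format(i, v) for i, v in enumerate(values)]
hits = find_sequence(lines, ["1", "2", "3"])
print("\n".join(hits))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.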
