I am trying to write a program to compare a gene file with gene panels.
The gene panel file is in CSV format and has chromosome, gene, start location, and end location.
The patient file has chromosome, mutation, and location.
So I made a loop that passes the gene panel information to a function, where the comparison is done and a list of matching items is returned.
The function works great when I call it with manual data, but it does not do the comparison inside the loop.
import vcf
import os, sys

records = open('exampleGenePanel.csv')
read = vcf.Reader(open('examplePatientFile.vcf', 'r'))

# Function to find mutations in the patient's sequence
def findMutations(gn, chromo, start, end):
    start = int(start)
    end = int(end)
    for each in read:
        CHROM = each.CHROM
        if CHROM != chromo:
            continue
        POS = each.POS
        if POS < start:
            continue
        if POS > end:
            continue
        REF = each.REF
        ALT = each.ALT
        print(gn, CHROM, POS, REF, ALT)
        list.append([gn, CHROM, POS, REF, ALT])
    return list

gene = records.readlines()
list = []
y = len(gene)
x = 1
while x < 3:
    field = gene[x].split(',')
    gname = field[0]
    chromo = field[1]
    gstart = field[2]
    gend = field[3]
    findMutations(gname, chromo, gstart, gend)
    x = x + 1

if not list:
    print('Mutation not found')
else:
    print(len(list), ' Mutations found')
    print(list)
I want to get the details of the matching mutations in the list.
This works as expected when I pass the data manually to the function, e.g.

findMutations('TESTGene', 'chr8', '146171437', '146229161')

But no comparison happens when the function is called from the loop.
The problem is that findMutations attempts to read from read each time it is called, but after the first call the reader has been exhausted and there is nothing left. I suggest reading the contents of read once, before calling the function, and saving the records in a list; findMutations can then scan that list each time it is called.
It would also be a good idea to use a name other than list for your result list, since that name shadows the Python built-in. It would also be better to have findMutations return its result list rather than append it to a global.
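To illustrate the suggested restructuring, here is a minimal sketch: the patient records are read into a list once (with PyVCF that would be something like records = list(read) right after creating the reader), and findMutations filters that list on every call. Plain (CHROM, POS, REF, ALT) tuples stand in for PyVCF record objects here, and the sample positions are made up for the example.

```python
def find_mutations(records, gname, chromo, start, end):
    """Return the records on chromosome `chromo` whose position falls in [start, end]."""
    matches = []
    for chrom, pos, ref, alt in records:
        if chrom == chromo and int(start) <= pos <= int(end):
            matches.append((gname, chrom, pos, ref, alt))
    return matches

# Stand-ins for PyVCF records; with PyVCF you would build this once, e.g.
# patient_records = list(vcf.Reader(open('examplePatientFile.vcf')))
patient_records = [
    ('chr8', 146171500, 'A', 'G'),   # inside the gene range below
    ('chr8', 146229200, 'C', 'T'),   # just past the end of the range
    ('chr1', 1000, 'G', 'A'),        # wrong chromosome
]

hits = find_mutations(patient_records, 'TESTGene', 'chr8', '146171437', '146229161')
print(hits)  # one match
```

Because the list is rebuilt from memory rather than from the consumed reader, the function now returns results on every call in the loop, not just the first.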
I am trying to build a script that copies a specified number of lines from one document to multiple other documents. The copied lines are supposed to be appended to the end of the docs. In case I want to delete lines from the end of the docs, the script also has to be able to delete a specified number of lines.
I want to be able to run the script from the command line and want to pass two args:
"add" or "del"
number of lines (counting from the end of the document)
A command could look like this:
py doccopy.py add 2 which would copy the last 2 lines to the other docs, or:
py doccopy.py del 4 which would delete the last 4 lines from all docs.
So far, I have written a function that copies the number of lines I want from the original document,
def copy_last_lines(number_of_lines):
    line_offset = [0]
    offset = 0
    for line in file_to_copy_from:
        line_offset.append(offset)
        offset += len(line)
    file_to_copy_from.seek(line_offset[number_of_lines])
    changedlines = file_to_copy_from.read()
a function that pastes said lines to a document
def add_to_file():
    doc = open(files_to_write[file_number], "a")
    doc.write("\n")
    doc.write(changedlines.strip())
    doc.close()
and a main function:
def main(action, number_of_lines):
    if action == "add":
        for files in files_to_write:
            add_to_file()
    elif action == "del":
        for files in files_to_write:
            del_from_file()
    else:
        print("Not a valid action.")
The main function isn't done yet, of course, and I have yet to figure out how to realize the del_from_file function.
I also have problems with looping through all the documents.
My idea was to make a list of all the paths to the documents I want to write in, loop through that list, and use a single variable for the "original" document, but I don't know if that's even possible the way I want to do it.
If possible, maybe someone has an idea for how to realize all this with a single list: have the "original" document be the first entry and loop through the list starting at index 1 when writing to the other docs.
I realize that the code I've done so far is a total clusterfuck and I'm asking a lot of questions, so I'd be grateful for every bit of help. I'm totally new to programming; I just did a Python crash course in the last 3 days, and my first own project is shaping up to be way more complicated than I thought it would be.
This should do what you ask, I think.
# ./doccopy.py add src N dst...
#     Appends the last N lines of src to all of the dst files.
# ./doccopy.py del N dst...
#     Removes the last N lines from all of the dst files.

import sys

def process_add(args):
    # Fetch the last N lines of src.
    src = args[0]
    count = int(args[1])
    lines = open(src).readlines()[-count:]
    # Append them to each file in the dst list.
    for dst in args[2:]:
        open(dst, 'a').write(''.join(lines))

def process_del(args):
    # Delete the last N lines of each dst file.
    count = int(args[0])
    for dst in args[1:]:
        lines = open(dst).readlines()[:-count]
        open(dst, 'w').write(''.join(lines))

def main():
    if sys.argv[1] == 'add':
        process_add(sys.argv[2:])
    elif sys.argv[1] == 'del':
        process_del(sys.argv[2:])
    else:
        print("What?")

if __name__ == "__main__":
    main()
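The answer turns on two list-slicing idioms, readlines()[-count:] to grab the tail of a file and readlines()[:-count] to drop it. They can be sanity-checked on an in-memory list standing in for the lines of a file:

```python
# Stand-in for the list that readlines() would return for a 4-line file.
lines = ['a\n', 'b\n', 'c\n', 'd\n']

count = 2
last_n = lines[-count:]   # what process_add copies from src
kept = lines[:-count]     # what process_del writes back to dst

print(last_n)  # ['c\n', 'd\n']
print(kept)    # ['a\n', 'b\n']
```

Because each line already ends in '\n', joining the slices with ''.join() reproduces the exact bytes to append or rewrite.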
So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow when run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately this causes the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + STR_len)]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))

    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people
                      if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
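One caveat worth noting about the approach above: re.search stops at the first run of a repeat in the sequence, which is not necessarily the longest one. If the longest consecutive run is what's needed, a small variation using re.finditer covers that; this is a sketch of the idea, not part of the original answer.

```python
import re

def longest_run(str_unit, sequence):
    # Longest consecutive run of str_unit anywhere in sequence,
    # not just the first run that re.search would land on.
    pattern = f"(?:{str_unit})+"
    runs = [len(m.group()) // len(str_unit) for m in re.finditer(pattern, sequence)]
    return max(runs, default=0)

# The first run of AGATC below has length 2, but a later run has length 3.
print(longest_run("AGATC", "AGATCAGATCTTTAGATCAGATCAGATC"))  # 3
```

The (?:...) non-capturing group avoids re.finditer reporting only the last repetition of the group, and default=0 handles repeats that never appear in the sequence.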
The code I am running so far is as follows
import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    index = 0
    while index < len(values):
        values(index) = int(values(index))
        index += 1
    print(values)

main()
The text file contains 41 rows of numbers each entered on a single line like so:
151868
153982
156393
158956
161884
165069
168088
etc.
My task is to create a program which shows the average change in population during the time period, the year with the greatest increase in population, and the year with the smallest increase in population (from the previous year) during the time period.
The code will print each of the text files entries on a single line, but upon trying to convert to int for use with the statistics package I am getting the following error:
values(index) = int(values(index))
SyntaxError: can't assign to function call
The values(index) = int(values(index)) line came from my reading as well as resources on Stack Overflow.
You can change values = infile.read() to values = list(infile.read()) and it will give you a list instead of a string, but that list would contain individual characters.
One thing that tends to happen when reading a file like this is that at the end of every line there is an invisible '\n' marking a new line in the text file. So an easy way to split the contents by lines is to use values = values.split('\n') (once values has been declared) instead of values = list(infile.read()).
The while loop you have can easily be replaced with a for loop, using len(values) as the end.
The values(index) = int(values(index)) line fails because parentheses mean a function call; lists are indexed with square brackets. Inside a for loop you can write values[i] = int(values[i]) to turn each entry into an integer, and then values becomes a list of integers.
How I would personally set it up would be:

import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()

    values = values.split('\n')  # Split based on lines

    # Loop over values and turn each entry into an integer
    for i in range(0, len(values)):
        values[i] = int(values[i])

    changes = []
    # Use a for loop to get the change between each pair of numbers.
    # The -1 avoids an indexing error from reading values[i+1] at the last element.
    for i in range(0, len(values) - 1):
        changes.append(values[i + 1] - values[i])  # Difference between the current number and the next

    print('The max change :', max(changes), 'The minimal change :', min(changes))
    # Each entry in changes lines up with the element of values it starts from,
    # so the index of a change also points at the population it happened after.
    print('A change of :', max(changes), 'Happened at', values[changes.index(max(changes))])  # changes.index(max(changes)) finds the position of the highest change
    print('A change of :', min(changes), 'Happened at', values[changes.index(min(changes))])  # Same as above, just with the minimum
    # If you wanted to print the following entry instead, use values[changes.index(min(changes)) + 1]

main()
If you need any clarification on anything I did in the code, just ask.
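The assignment also asks for the average change, which the statistics module (already imported in the question) handles directly. A quick sketch using the first few population values from the question's file:

```python
import statistics

values = [151868, 153982, 156393, 158956]  # first rows of USPopulation.txt
changes = [values[i + 1] - values[i] for i in range(len(values) - 1)]

print(changes)                   # [2114, 2411, 2563]
print(statistics.mean(changes))  # average year-over-year change
```

statistics.mean saves you from summing and dividing by hand, and works on any list of numbers, so it drops straight into the changes list built above.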
I personally would use numpy for reading a text file. In your case I would do it like this:

import numpy as np

def main():
    infile = np.loadtxt('USPopulation.txt')
    maxpop = np.max(infile)
    minpop = np.min(infile)
    print(f'maximum population = {maxpop} and minimum population = {minpop}')

main()
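Since the task is really about year-over-year change rather than the raw values, np.diff pairs naturally with this approach. A sketch with sample values standing in for the loaded file:

```python
import numpy as np

pop = np.array([151868, 153982, 156393, 158956])  # sample rows of USPopulation.txt
changes = np.diff(pop)   # year-over-year differences

print(changes)           # [2114 2411 2563]
print(changes.mean())    # average change
print(np.argmax(changes))  # index of the biggest jump (argmax gives the index, max the value)
```

From the index that np.argmax returns you can recover the year or the population it refers to, which is what the original task asks for.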
This might be a tough question and I will do my best to explain it!
I'm trying to create a script where I run different JSON files through a for loop (each of these JSON files has its own data). I want to add those values to a list, then match the first object of each JSON against what the script has already seen. If an entry with the same name is found, we check whether its last object (the number) is higher or lower than it was previously for that name. If it is higher, we print that it has increased and replace the stored value with the new one (inside the list).
Another thing I want is that a value should only be appended to the list once, not n times depending on how many JSON files I use.
I'll start by showing the JSON I use; you can see the entries share the same name but have different numbers.
{
    "name": "Albert",
    "image": "https://pbs.twimg.com/profile_images/....jpg",
    "number": "5"
}
-----------------------------------
{
    "name": "Albert",
    "image": "https://pbs.twimg.com/profile_images/....jpg",
    "number": "6"
}
Ok before I continue to explain. Here is the code I wrote so I can explain it better...
webhook_list = [
    'https://discordapp.com/api/webhooks/5133124/slack',
    'https://discordapp.com/api/webhooks/5124124/slack',
    'https://discordapp.com/api/webhooks/5112412/slack']  # Different Discord API keys/webhooks

def get_identifier(thread):
    thread_id = thread['name']  # Grab name from the json
    try:
        thread_image = thread['image']  # Grab image from the json
    except KeyError:
        thread_image = None
    try:
        thread_number = thread['numbers']  # Grab number from the json
    except KeyError:
        thread_number = None
    identifier = ('{}%{}%{}').format(thread_id, thread_image, thread_number)  # Join them all into one string
    return identifier

def script():
    old_list = []  # old_list that we append to when new items arrive
    while True:
        for thread in [line.rstrip('\n') for line in open('names.txt')]:  # Check all names in the txt file (could just as well be a list)
            get_value_identifier = get_identifier(thread)  # Pass thread to get_identifier, which returns the identifier
            if get_identifier(thread) not in old_list:  # If this value is not in the old list, we go in here
                # Slack/Discord function
                directory = os.fsencode('./slack')
                for counters, file in enumerate(os.listdir(directory)):
                    filename = os.fsdecode(file)
                    if filename.endswith(".json"):
                        with open('./slack/' + filename) as slackAttachment:
                            data = json.loads(slackAttachment.read())
                            data_list = []
                            # *****************---Picture---*****************
                            try:
                                data["attachments"][0]["thumb_url"] = information['image_url']  # Add everything to data so we can later print it out to discord/slack
                            except Exception:
                                data["attachments"][0]["thumb_url"] = 'https://cdn.browshot.com/static/images/not-found.png'
                            # *****************---Footer---*****************
                            data["attachments"][0]["footer"] = str(
                                data["attachments"][0]["footer"] + ' | ' + datetime.now().strftime(
                                    '%Y-%m-%d [%H:%M:%S.%f')[:-3] + "]")
                            # -------------------------------------------------------------------------
                            a = get_value_identifier.split("%")  # Split the identifier into name, image, number
                            for i, items in zip(range(len(old_list)), old_list):  # Walk old_list together with its indexes
                                old_list_value = old_list[i].split("%")  # Split the old_list values the same way as a
                                if a[0] in old_list_value[0]:  # If the first value of a is found inside an old_list first value...
                                    if old_list_value[2] < a[2]:  # Check if the number is higher than the one in old_list
                                        data["attachments"][0]["title"] = information['name'].upper()
                                        data_list.append((webhook_list[counters], data))
                                        for hook, data in data_list:
                                            threading.Thread(target=sendData, args=(hook, data)).start()
                                        old_list[i] = get_value_identifier
                                        break
                                    elif len(old_list_value[2]) >= len(a[2]):  # Check if the number is lower than the one in old_list
                                        old_list[i] = get_value_identifier
                                        break
                            else:  # If nothing matches, we just add the value to old_list
                                data["attachments"][0]["title"] = information['name'].upper()
                                data_list.append((webhook_list[counters], data))
                                for hook, data in data_list:
                                    threading.Thread(target=sendData, args=(hook, data)).start()
                                old_list.append(get_value_identifier)
            else:
                randomtime = random.randint(3, 7)
                logger.warn('No new item found! - retrying in %d secs' % (randomtime))
                time.sleep(randomtime)
As you can see, this is the part of my code that opens each JSON file; data = json.loads(slackAttachment.read()) loads each one into data as a dict.
directory = os.fsencode('./slack')
for counters, file in enumerate(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        with open('./slack/' + filename) as slackAttachment:
            data = json.loads(slackAttachment.read())
Whenever one loop iteration is done, everything is added to data, and at the end we can print the data or send it through a request to Discord/Slack.
But before printing to Discord/Slack, I first check whether the name is already in the list. We do that by splitting on each %, which gives us "name image number".
In the if statement we check whether a[0] (the current thread found) is somewhere in the old list.
IF it is in the old list, then we check the last number on the object,
whether it is higher or lower.
IF it is higher, then we print it out, since it's an increase in value,
and then we change the previous old_list value to this new one.
IF it is lower, then we just change the previous old_list value to the
new one.
IF there is nothing in the old_list that matches, then we just append it
to the list.
a = get_value_identifier.split("%")
for i, items in zip(range(len(old_list)), old_list):
    old_list_value = old_list[i].split("%")
    if a[0] in old_list_value[0]:
        if old_list_value[2] < a[2]:
            data["attachments"][0]["title"] = information['name'].upper()
            data_list.append((webhook_list[counters], data))
            for hook, data in data_list:
                threading.Thread(target=sendData, args=(hook, data)).start()
            old_list[i] = get_value_identifier
            break
        elif len(old_list_value[2]) > len(a[2]):
            old_list[i] = get_value_identifier
            break
else:
    data["attachments"][0]["title"] = information['name'].upper()
    data_list.append((webhook_list[counters], data))
    for hook, data in data_list:
        threading.Thread(target=sendData, args=(hook, data)).start()
    old_list.append(get_value_identifier)
And here is the mechanical issue.
The problem is that at the beginning, when we run
directory = os.fsencode('./slack')
for counters, file in enumerate(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        with open('./slack/' + filename) as slackAttachment:
            data = json.loads(slackAttachment.read())
it ends up looping through the section shown above,
a = get_value_identifier.split("%")
for i, items in zip(range(len(old_list)), old_list):
    old_list_value = old_list[i].split("%")......
x times, depending on how many files there are in the slack folder. This becomes an issue after the second iteration: if the first iteration finds a new item, it appends it to old_list, which means that when the second slack/discord file is processed, old_list already contains the value. From there it keeps compounding, so the first iteration is always correct but everything after it gives wrong answers.
That leaves me with two questions:
How can I make it so that when the first iteration hits the if, elif, or else branch, the same thing is sent to all discord/slack hooks at once?
Inside the if and elif branches I replace old_list[i] with the newer value, get_value_identifier. How can I make that happen only once? As it stands, if I run through 3 slacks, I end up with 3 copies of the same value in old_list because of the for loop.
I think that's it from me, and I hope I explained it well enough! If any more detail is needed, just ask; I'll be quite active and happy to edit the question accordingly.
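Not an answer from the thread, but the bookkeeping the question describes (one entry per name, updated at most once per cycle regardless of how many webhook files are processed) can be sketched with a dict keyed by name. The function name check and the returned labels are made up for the example; sendData and the per-webhook JSON handling are left out.

```python
# Hypothetical sketch: track the latest number per name in a dict,
# so the state updates exactly once no matter how many webhooks exist.
seen = {}  # name -> latest number

def check(name, number):
    """Classify an incoming (name, number) pair and update the state once."""
    if name not in seen:
        seen[name] = number
        return 'new'           # first time this name appears: notify all hooks
    if number > seen[name]:
        seen[name] = number
        return 'higher'        # increase: notify all hooks
    seen[name] = number
    return 'no increase'       # lower or equal: just update silently

print(check('Albert', 5))  # new
print(check('Albert', 6))  # higher
print(check('Albert', 6))  # no increase
```

The key point is to decide new/higher/no-increase once per item, then loop over the webhook list only to send, so the state update and the fan-out to hooks are separated.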
I'm new to BioPython and I'm trying to import a fasta/fastq file and iterate through each sequence, while performing some operation on each sequence. I know this seems basic, but my code below for some reason is not printing correctly.
from Bio import SeqIO

newfile = open("new.txt", "w")
records = list(SeqIO.parse("rosalind_gc.txt", "fasta"))
i = 0
dna = records[i]
while i <= len(records):
    print(dna.name)
    i = i + 1
I'm basically trying to iterate through records and print each name, but my code ends up printing only records[0], where I want it to print records[1-10] as well. Can someone explain why it only prints records[0]?
The reason for your problem is here:
i = 0
dna = records[i]
Your object 'dna' is fixed to index 0 of records, i.e., records[0]. Since you never reassign it, dna stays bound to that first record. In the print statement within your while loop, use something like this:
while i < len(records):
    print(records[i].name)
    i = i + 1
If you would like to have an object dna as a copy of records entries, you would need to reassign dna to every single index, making this within your while loop, like this:
while i < len(records):
    dna = records[i]
    print(dna.name)
    i = i + 1
However, that's not the most efficient way. Finally, for you to learn, a much nicer way than with your while loop with i = i + 1 is to use a for loop, like this:
for i in range(0, len(records)):
    print(records[i].name)
For loops do the iteration automatically, one by one. range() will give a set of integers from 0 to the length of records. There are also other ways, but I'm keeping it simple.
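One of those other ways, and the most Pythonic one, is to iterate over the records directly with no index at all. A small self-contained sketch parsing an in-memory FASTA string (assumes Biopython is installed; the sequences are made up for the example):

```python
from io import StringIO
from Bio import SeqIO

# A tiny FASTA file held in memory; in the question this would be
# SeqIO.parse("rosalind_gc.txt", "fasta") instead.
fasta = StringIO(">seq1\nACGT\n>seq2\nGGCC\n")
names = [record.name for record in SeqIO.parse(fasta, "fasta")]
print(names)  # ['seq1', 'seq2']
```

Iterating the parser directly also avoids loading every record into a list first, which matters for large sequence files.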