Extracting gene location from FASTA file

Extracting gene location from FASTA file - python

I am trying to extract the gene location from a fasta file using BioPython but the function .location is not working. I would like to avoid regex because this function will have to work with different files and the all have slightly different headers.
The header looks like:
>X dna:chromosome chromosome:GRCh38:X:111410060:111411807:-1
I would like the output to be:
start = 111410060
end = 111411807

If the headers of your different fasta files always end with '...chr:start:end:strand' and the different parts are separated by ':' you could try to split .description by .split(":") and select the penultimate and antepenultimate position of the resulting list.
The following works for me with your example header:
from Bio import SeqIO
path = 'fasta_test.fasta'
records = SeqIO.parse(open(path), 'fasta')
record = next(records)
parts = record.description.split(":")
print('start =', parts[-3], 'end =', parts[-2])

Related

I'm trying to find words from a text file in another text file

I built a simple graphical user interface (GUI) with basketball info to make finding information about players easier. The GUI utilizes data that has been scraped from various sources using the 'requests' library. It works well but there is a problem; within my code lies a list of players which must be compared against this scraped data in order for everything to work properly. This means that if I want to add or remove any names from this list, I have to go into my IDE or directly into my code - I need to change this. Having an external text file where all these player names can be stored would provide much needed flexibility when managing them.
#This is how the players list looks in the code.
basketball = ['Adebayo, Bam', 'Allen, Jarrett', 'Antetokounmpo, Giannis' ... #and many others]
#This is how the info in the scrapped file looks like:
Charlotte Hornets,"Ball, LaMelo",Out,"Injury/Illness - Bilateral Ankle, Wrist; Soreness (L Ankle, R Wrist)"
"Hayward, Gordon",Available,Injury/Illness - Left Hamstring; Soreness
"Martin, Cody",Out,Injury/Illness - Left Knee; Soreness
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
"Okogie, Josh",Questionable,Injury/Illness - Nasal; Fracture,
#The rest of the code is working well, this is the final part where it uses the list to write the players that were found it both files.
with open("freeze.csv",'r') as freeze:
for word in basketball:
if word in freeze:
freeze.write(word)
# Up to this point I get the correct output, but now I need the list 'basketball' in a text file so can can iterate the same way
# I tried differents solutions but none of them work for me
with open('final_G_league.csv') as text, open('freeze1.csv') as filter_words:
st = set(map(str.rstrip,filter_words))
txt = next(text).split()
out = [word for word in txt if word not in st]
# This one gives me the first line of the scrapped text
import csv
file1 = open("final_G_league.csv",'r')
file2 = open("freeze1.csv",'r')
data_read1= csv.reader(file1)
data_read2 = csv.reader(file2)
# convert the data to a list
data1 = [data for data in data_read1]
data2 = [data for data in data_read2]
for i in range(len(data1)):
if data1[i] != data2[i]:
print("Line " + str(i) + " is a mismatch.")
print(f"{data1[i]} doesn't match {data2[i]}")
file1.close()
file2.close()
#This one returns a list with a bunch of names and a list index error.
file1 = open('final_G_league.csv','r')
file2 = open('freeze_list.txt','r')
list1 = file1.readlines()
list2 = file2.readlines()
for i in list1:
for j in list2:
if j in i:
# I also tried the answers in this post:
#https://stackoverflow.com/questions/31343457/filter-words-from-one-text-file-in-another-text-file

Let's assume we have following input files:
freeze_list.txt - comma separated list of filter words (players) enclosed in quotes:
'Adebayo, Bam', 'Allen, Jarrett', 'Antetokounmpo, Giannis', 'Anthony, Cole', 'Anunoby, O.G.', 'Ayton, Deandre',
'Banchero, Paolo', 'Bane, Desmond', 'Barnes, Scottie', 'Barrett, RJ', 'Beal, Bradley', 'Booker, Devin', 'Bridges, Mikal',
'Brown, Jaylen', 'Brunson, Jalen', 'Butler, Jimmy', 'Forbes, Bryn'
final_G_league.csv - scrapped lines that we want to filter, using words from the freeze_list.txt file:
Charlotte Hornets,"Ball, LaMelo",Out,"Injury/Illness - Bilateral Ankle, Wrist; Soreness (L Ankle, R Wrist)"
"Hayward, Gordon",Available,Injury/Illness - Left Hamstring; Soreness
"Martin, Cody",Out,Injury/Illness - Left Knee; Soreness
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
"Okogie, Josh",Questionable,Injury/Illness - Nasal; Fracture,
I would split the responsibilities of the script in code segments to make it more readable and manageable:
Define constants (later you could make them parameters)
Read filter words from a file
Filter scrapped lines
Dump output to a file
The constants:
FILTER_WORDS_FILE_NAME = "freeze_list.txt"
SCRAPPED_FILE_NAME = "final_G_league.csv"
FILTERED_FILE_NAME = "freeze.csv"
Read filter words from a file:
with open(FILTER_WORDS_FILE_NAME) as filter_words_file:
filter_words = eval('(' + filter_words_file.read() + ')')
Filter lines from the scrapped file:
matched_lines = []
with open(SCRAPPED_FILE_NAME) as scrapped_file:
for line in scrapped_file:
# Check if any of the keywords is found in the line
for filter_word in filter_words:
if filter_word in line:
matched_lines.append(line)
# stop checking other words for performance and
# to avoid sending same line multipe times to the output
break
Dump filtered lines into a file:
with open(FILTERED_FILE_NAME, "w") as filtered_file:
for line in matched_lines:
filtered_file.write(line)
The output freeze.csv after running above segments in a sequence is:
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
Suggestion
Not sure why you have chosen to store the filter words in a comma separated list. I would prefer using a plain list of words - one word per line.
freeze_list.txt:
Adebayo, Bam
Allen, Jarrett
Antetokounmpo, Giannis
Butler, Jimmy
Forbes, Bryn
The reading becomes straightforward:
with open(FILTER_WORDS_FILE_NAME) as filter_words_file:
filter_words = [word.strip() for word in filter_words_file]
The output freeze.csv is the same:
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,

If file2 is just a list of names and want to extract those rows in first file where the name column matches a name in the list.
Suggest you make the "freeze" file a text file with one-name per line and remove the single quotes from the names then can more easily parse it.
Can then do something like this to match the names from one file against the other.
import csv
# convert the names data to a list
with open("freeze1.txt",'r') as file2:
names = [s.strip() for s in file2]
print("names:", names)
# next open league data and extract rows with matching names
with open("final_G_league.csv",'r') as file1:
reader = csv.reader(file1)
next(reader) # skip header
for row in reader:
if row[0] in names:
# print matching name that matches
print(row[0])
If names don't match exactly as appears in the final_G_league file then may need to adjust accordingly such as doing a case-insensitive match or normalizing names (last, first vs first last), etc.

How can I remove files that have an unknown number in them?

I have some code that writes out files with names like this:
body00123.txt
body00124.txt
body00125.txt
body-1-2126.txt
body-1-2127.txt
body-1-2128.txt
body-3-3129.txt
body-3-3130.txt
body-3-3131.txt
Such that the first two numbers in the file can be 'negative', but the last 3 numbers are not.
I have a list such as this:
123
127
129
And I want to remove all the files that don't end with one of these numbers. An example of the desired leftover files would be like this:
body00123.txt
body-1-2127.txt
body-3-3129.txt
My code is running in python, so I have tried:
if i not in myList:
os.system('rm body*' + str(i) + '.txt')
And this resulted in every file being deleted.
The solution should be such that any .txt file that ends with a number contained by myList should be kept. All other .txt files should be deleted. This is why I'm trying to use a wildcard in my attempt.

Because all the files end in .txt, you can cut that part out and use the str.endswith() function. str.endswith() accepts a tuple of strings, and sees if your string ends in any of them. As a result, you can do something like this:
all_file_list = [...]
keep_list = [...]
files_to_remove = []
file_to_remove_tup = tuple(files_to_remove)
for name in all_file_list:
if name[:-4].endswith(file_to_remove_tup)
files_to_remove.append(name)
# or os.remove(name)

How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1071552 characters long ndjson file, of a single line ("for line in file:" is pointless since there's only one).
The best I found was that:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole remaining of the file, same as if using partition()[2].
Just know that Month is only an example, customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
if filedata.find(x)>0:
print("Found "+str(i+1))
So right now it properly tells me how many occurences of customLabel there are, I'd like to get the substring that comes after customLabel":" instead (Month in the example) to put them all in a list, to locate them way more easily and enable the use of replace() for traductions later on.
I'd guess regex are the solution but I'm pretty new to that, so I'll post that question by the time I learn about them...

If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module
import re
label_values = []
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([1-9a-zA-z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
for line in ndjson:
values = re.findall(regex_pattern, line)
label_values.extend(values)
print(label_values) # ['"Month"', '23525235']
# If you don't want the items to have quotations
label_values = [i.replace('"', "") for i in label_values]
print(label_values) # ['Month', '23525235']
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json
label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
for line in ndjson:
line_json = json.loads(line)
if line_json.get(label) is not None:
label_values.append(line_json.get(label))
print(label_values) # ['Month']

Remove embedded line-feeds from fields of a CSV file

I have a CSV file in which a single row is getting split into multiple rows.
The source file contents are:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAA
XYZ"
"4",
"ABCD"
As we can see, IDs 3 and 4 are getting split into multiple rows. So, is there any way in Python to join those rows with the previous line?
Desired output:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
This the code I have:
data = open(r"C:\Users\suksengupta\Downloads\BotReport_V01.csv","r")

It looks like your CSV file has control characters embedded in the fields contents. If that is the case, you need to strip them out in order to have each field contents printed joined together.
With that in mind, something like this will fix the problem:
import re
src = r'C:\Users\suksengupta\Downloads\BotReport_V01.csv'
with open(src) as f:
data = re.sub(r'([\w|,])\s+', r'\1', f.read())
print(data)
The above code will result in the output below printed to console:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"

About dict.fromkeys, key from filename, values inside file, using Regex

Well, I'm learning Python, so I'm working on a project that consists in passing some numbers of PDF files to xlsx and placing them in their corresponding columns, rows determined according to row heading.
The idea that came to me to carry it out is to convert the PDF files to txt and make a dictionary with the txt files, whose key is a part of the file name (because it contains a part of the row header) and the values be the numbers I need.
I have already managed to convert the txt files, now i'm dealing with the script to carry the dictionary. at the moment look like this:
import os
import re
p = re.compile(r'\w+\f+')
'''
I'm not entirely sure at the moment how the .compile of regular expressions works, but I know I'm missing something to indicate that what I want is immediately to the right, I'm also not sure if the keywords will be ignored, I just want take out the numbers
'''
m = p.match('Theese are the keywords' or 'That are immediately to the left' or 'The numbers I want')
def IsinDict(txtDir):
ToData = ()
if txtDir == "": txtDir = os.getcwd() + "\\"
for txt in os.listdir(txtDir):
ToKey = txt[9:21]
if ToKey == (r"\w+"):
Data = open(txt, "r")
for string in Data:
ToData += m.group()
Diccionary = dict.fromkeys(ToKey, ToData)
return Diccionary
txtDir = "Absolute/Path/OfTheText/Files"
IsinDict(txtDir)
Any contribution is welcome, thanks for your attention.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting gene location from FASTA file - python

Related

I'm trying to find words from a text file in another text file

How can I remove files that have an unknown number in them?

How to get an unknown substring between two known substrings, within a giant string/file

Remove embedded line-feeds from fields of a CSV file

About dict.fromkeys, key from filename, values inside file, using Regex

Categories

Resources