I have created a function that parses a FASTA file because I needed to remove some odd characters. Now I have a dictionary and want to turn it back into FASTA format. I am new to FASTA files, so I don't know how to proceed.
The dictionary has this format:
{'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI', 'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL', .....
The function:
def parse_file(input_file):
    parsed_seqs = {}
    curr_seq_id = None
    curr_seq = []
    for line in input_file:
        line = line.strip()
        line = line.replace('-', '')
        if line.startswith(">"):
            if curr_seq_id is not None:
                parsed_seqs[curr_seq_id] = ''.join(curr_seq)
            curr_seq_id = line[1:]
            curr_seq = []
            continue
        curr_seq.append(line)
    parsed_seqs[curr_seq_id] = ''.join(curr_seq)
    return parsed_seqs
newfile = open("file")
parsed_seqs = parse_file(newfile)
print(parsed_seqs)
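If you just want the plain-text round trip, a minimal sketch with no third-party library (the write_fasta name and the 80-character line width are my own choices):

def write_fasta(parsed_seqs, path, width=80):
    # Write each entry as a ">header" line followed by wrapped sequence lines.
    with open(path, "w") as out:
        for seq_id, seq in parsed_seqs.items():
            out.write(">" + seq_id + "\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")

write_fasta(parsed_seqs, "cleaned.fasta")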
If you can use an existing library for this task, you may use Biotite:
import biotite.sequence.io.fasta as fasta

seq_dict = {
    'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI',
    'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL'
}

fasta_file = fasta.FastaFile()
for header, seq_str in seq_dict.items():
    fasta_file[header] = seq_str
fasta_file.write("path/to/file.fasta")
path/to/file.fasta:
>NavAb:/1126
TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVA
ISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI
>Shaker:/1656
SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIP
YFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL
Note that I am one of the developers of this package. There are also solutions in a multitude of other packages, such as Biopython.
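For example, a rough Biopython sketch (this assumes a recent Biopython where Seq no longer needs an alphabet argument):

from Bio.Seq import Seq
from Bio.SeqIO import write
from Bio.SeqRecord import SeqRecord

# One SeqRecord per dictionary entry, written out in FASTA format.
records = [SeqRecord(Seq(seq_str), id=header, description="")
           for header, seq_str in seq_dict.items()]
write(records, "path/to/file.fasta", "fasta")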
I have a dict with multiple values in a tuple:

newhost = {'newhost.com': ('1.oldhost.com',
                           '2.oldhost.com',
                           '3.oldhost.com',
                           '4.oldhost.com')
           }
I want to open an existing file and search for lines in it that contain one of the oldhost values. A file can have multiple Account lines, for example:
Account: 1.oldhost.com username
Account: someotherhost username
When a line with 1.oldhost.com or 2.oldhost.com or 3.oldhost.com and so on is found, I want to replace it with the key from the dict, newhost.com.
Can anyone help me? I searched a lot but didn't find the right thing.
Regards
Maybe something like this could get you started:

infile_name = 'some_file.txt'

# Open and read the incoming file
with open(infile_name, 'r') as infile:
    text = infile.read()

# Cycle through the dictionary
for new_name, oldhost_list in newhost.items():
    # Cycle through each possible old host
    for oldhost in oldhost_list:
        text = text.replace(oldhost, new_name)  # str.replace returns a new string

outfile_name = 'some_other_file.txt'

# Write to file
with open(outfile_name, 'w') as outfile:
    outfile.write(text)
Not claiming this to be the best solution, but it should be a good start for you.
To easily find the new host for a given old host, you should convert your data structure:
# your current structure
new_hosts = {
    'newhost-E.com': (
        '1.oldhost-E.com',
        '2.oldhost-E.com',
    ),
    'newhost-A.com': (
        '1.oldhost-A.com',
        '2.oldhost-A.com',
        '3.oldhost-A.com',
    ),
}

# my proposal
new_hosts_2 = {
    v: k
    for k, v_list in new_hosts.items()
    for v in v_list}

print(new_hosts_2)
# {
#     '1.oldhost-E.com': 'newhost-E.com',
#     '2.oldhost-E.com': 'newhost-E.com',
#     '1.oldhost-A.com': 'newhost-A.com',
#     '2.oldhost-A.com': 'newhost-A.com',
#     '3.oldhost-A.com': 'newhost-A.com',
# }
This does repeat the new host names (the values in new_hosts_2), but it allows you to quickly look up the new host for a given old host name:
some_old_host = 'x.oldhost.com'
corresponding_new_host = new_hosts_2[some_old_host]
Now you just need to:
- read the lines of the file
- find the old hostname in each line
- look up the corresponding new host in new_hosts_2
- replace that value in the line
- write the line to a new file
Maybe like this:
with open(file_name_1, 'r') as fr:
    with open(file_name_2, 'w') as fw:
        for line in fr:
            line = line.strip()
            if len(line) > 0:
                # logic to find the start and end position of the old host;
                # for lines like "Account: 1.oldhost.com username" the host
                # is the second whitespace-separated field
                start_i = line.find(' ') + 1
                end_i = line.find(' ', start_i)
                if end_i == -1:
                    end_i = len(line)
                # get and replace, but only if it's found in 'new_hosts_2'
                old_host = line[start_i:end_i]
                if old_host in new_hosts_2:
                    line = line[:start_i] + new_hosts_2[old_host] + line[end_i:]
            fw.write(line + '\n')
Thank you for your tips. I came up with this now and it is working fine.
import fileinput

textfile = 'somefile.txt'
curhost = 'newhost.com'
hostlist = {curhost: ('1.oldhost.com',
                      '2.oldhost.com',
                      '3.oldhost.com')
            }

new_hosts_2 = {
    v: k
    for k, v_list in hostlist.items()
    for v in v_list}

for line in fileinput.input(textfile, inplace=True):
    line = line.rstrip()
    if not line:
        continue
    for f_key, f_value in new_hosts_2.items():
        if f_key in line:
            line = line.replace(f_key, f_value)
    print line
This is not a duplicate, despite what most people think. What I'm trying to achieve: say there is a master string like the one below with a couple of files mentioned in it. We need to open those files and check whether any other files are included in them; if so, we copy that content into the line where we found that particular file name.
Master String:
Welcome
How are you
file.txt
everything alright
signature.txt
Thanks
file.txt:
ABCD
EFGH
tele.txt

tele.txt:
IJKL

signature.txt:
SAK
Output:
Welcome
How are you
ABCD
EFGH
IJKL
everything alright
SAK
Thanks
for msplit in [stext.split('\n')]:
    for num, items in enumerate(msplit, 1):
        if items.strip().startswith("here is") and items.strip().endswith(".txt"):
            gmsf = open(os.path.join(os.getcwd(), "txt", items[8:]), "r")
            gmsfstr = gmsf.read()
            newline = items.replace(items, gmsfstr)
How do I join these replaced items back into the same string format?
Also, any idea how to re-run the same function until no ".txt" names remain? Once the join is done, there might be other ".txt" names inside a ".txt" file.
Thanks for your help in advance.
A recursive approach that works with any level of file name nesting:
from os import linesep

def get_text_from_file(file_path):
    with open(file_path) as f:
        text = f.read()
    return SAK_replace(text)

def SAK_replace(s):
    lines = s.splitlines()
    for index, l in enumerate(lines):
        if l.endswith('.txt'):
            lines[index] = get_text_from_file(l)
    return linesep.join(lines)
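Kicking it off is then a single call (master.txt is a hypothetical file containing the master string):

print(get_text_from_file('master.txt'))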
You can try:
s = """Welcome
How are you
here is file.txt
everything alright
here is signature.txt
Thanks"""
data = s.split("\n")
match = ['.txt']
all_matches = [s for s in data if any(xs in s for xs in match)]
for index, item in enumerate(data):
if item in all_matches:
data[index] ="XYZ"
data = "\n".join(data)
print data
Output:
Welcome
How are you
XYZ
everything alright
XYZ
Thanks
Added new requirement:
def file_obj(filename):
    fo = open(filename, "r")
    s = fo.read()
    data = s.split("\n")
    match = ['.txt']
    all_matches = [s for s in data if any(xs in s for xs in match)]
    for index, item in enumerate(data):
        if item in all_matches:
            file_obj(item)
            data[index] = "XYZ"
    data = "\n".join(data)
    print data
file_obj("first_filename")
We can create a temporary file object and keep the replaced lines in it; once every line is processed, we replace the original file's content with the new content. The temporary file is deleted automatically when the 'with' statement exits.
import tempfile
import re

file_pattern = re.compile(ur'(\w+\.txt)')
original_content_file_name = 'sample.txt'
"""
sample.txt should have this content:

Welcome
How are you
here is file.txt
everything alright
here is signature.txt
Thanks
"""
replaced_file_str = None

def replace_file_content():
    """
    Replace the file content using a temporary file object.
    """
    def read_content(file_name):
        # The matched file name is read and returned for replacing.
        content = ""
        with open(file_name) as fileObj:
            content = fileObj.read()
        return content

    # Read the file and keep the replaced text in a temporary file object
    # (the tempfile object will be deleted automatically).
    with open(original_content_file_name, 'r') as file_obj, tempfile.NamedTemporaryFile() as tmp_file:
        for line in file_obj.readlines():
            if line.strip().startswith("here is") and line.strip().endswith(".txt"):
                file_path = re.search(file_pattern, line).group()
                line = read_content(file_path) + '\n'
            tmp_file.write(line)
        tmp_file.seek(0)
        # Assign the replaced value to this variable.
        replaced_file_str = tmp_file.read()

    # Replace the original file with the new content.
    with open(original_content_file_name, 'w+') as file_obj:
        file_obj.write(replaced_file_str)

replace_file_content()
I'm writing a script where one of its functions is to read a CSV file that contains URLs in one of its columns. Unfortunately, the system that creates those CSVs doesn't put double quotes around values in the URL column, so when a URL contains commas it breaks all my CSV parsing.
This is the code I'm using:
with open(accesslog, 'r') as csvfile, open('results.csv', 'w') as enhancedcsv:
    reader = csv.DictReader(csvfile)
    for row in reader:
        self.uri = (row['URL'])
        self.OriCat = (row['Category'])
        self.query(self.uri)
        print self.URL + "," + self.ServerIP + "," + self.OriCat + "," + self.NewCat
This is a sample URL that is breaking the parsing; it comes in the column named "URL" (note the commas at the end):
ams1-ib.adnxs.com/ww=1238&wh=705&ft=2&sv=43&tv=view5-1&ua=chrome&pl=mac&x=1468251839064740641,439999,v,mac,webkit_chrome,view5-1,0,,2,
The field following the URL always comes with a numeric value between parentheses, e.g. (9999), so this could be used to determine where the URL with commas ends.
How can I deal with a situation like this using the csv module?
You will have to do it a little more manually. Try this:

def process(lines, delimiter=','):
    header = None
    url_index_from_start = None
    url_index_from_end = None
    for line in lines:
        if not header:
            header = [l.strip() for l in line.split(delimiter)]
            url_index_from_start = header.index('URL')
            url_index_from_end = len(header) - url_index_from_start
        else:
            data = [l.strip() for l in line.split(delimiter)]
            url_from_start = url_index_from_start
            url_from_end = len(data) - url_index_from_end
            values = data[:url_from_start] + data[url_from_end + 1:] + [delimiter.join(data[url_from_start:url_from_end + 1])]
            keys = header[:url_index_from_start] + header[url_index_from_start + 1:] + [header[url_index_from_start]]
            yield dict(zip(keys, values))
Usage:
lines = ['Header1, Header2, URL, Header3',
         'Content1, "Content2", abc,abc,,abc, Content3']
result = list(process(lines))
assert result[0]['Header1'] == 'Content1'
assert result[0]['Header2'] == '"Content2"'
assert result[0]['Header3'] == 'Content3'
assert result[0]['URL'] == 'abc,abc,,abc'
print(result)
Result:
>>> [{'URL': 'abc,abc,,abc', 'Header2': '"Content2"', 'Header3': 'Content3', 'Header1': 'Content1'}]
Have you considered using Pandas to read your data in?
Another possible solution would be to use regular expressions to pre-process the data...
import re

# make a list of everything you want to change
# ('regex' here is a placeholder pattern that matches your URLs)
with open(filein, 'r') as f:
    old = re.findall(regex, f.read())

# append quotes and create a new list
new = []
for url in old:
    url2 = "\"" + url + "\""
    new.append(url2)

# combine the lists
old_new = list(zip(old, new))

# Then use the list to update the file:
f = open(filein, 'r')
filedata = f.read()
f.close()

for old, new in old_new:
    filedata = filedata.replace(old, new)  # accumulate the replacements

f = open(filein, 'w')
f.write(filedata)
f.close()
I want to edit a CSV file using values read from another JSON file, in Python 2.7.
My CSV file is a.csv:
a,b,c,d
,10,12,14
,11,14,15
My JSON file is a.json:
{"a":20}
I want the column 'a' to be matched against the JSON file. If there is a match, the value should be copied from the JSON and pasted into my CSV file, so that the final output of my CSV file looks like this:
a,b,c,d
20,10,12,14
20,11,14,15
What I have tried so far:
fileCSV = open('a.csv', 'a')
fileJSON = open('a.json', 'r')
jsonData = fileJSON.json()

for k in range(jsonData):
    for i in csvRow:
        for j in jsonData.keys():
            if i == j:
                if self.count == 0:
                    self.data = jsonData[j]
                    self.count = 1
                else:
                    self.data = self.data + "," + jsonData[j]
                    self.count = 0
    fileCSV.write(self.data)
    fileCSV.write("\n")
    k += 1
fileCSV.close()
print("File created successfully")
I will be really thankful if anyone can help me with this.
Please ignore any syntax and indentation errors.
Thank you.
Some basic string parsing will get you there. I wrote a script that works for the simple scenario you refer to.
Check if this solves your problem:
import json
from collections import OrderedDict

def list_to_csv(listdat):
    csv = ""
    for val in listdat:
        csv = csv + "," + str(val)
    return csv[1:]

lines = []
csvfile = "csvfile.csv"
outcsvfile = "outcsvfile.csv"
jsonfile = "jsonfile.json"

with open(csvfile, encoding='UTF-8') as a_file:
    for line in a_file:
        lines.append(line.strip())

columns = lines[0].split(",")
data = lines[1:]

whole_data = []
for row in data:
    fields = row.split(",")
    i = 0
    rowData = OrderedDict()
    for column in columns:
        rowData[columns[i]] = fields[i]
        i += 1
    whole_data.append(rowData)

with open(jsonfile) as json_file:
    jsondata = json.load(json_file)

keys = list(jsondata.keys())
for key in keys:
    value = jsondata[key]
    for each_row in whole_data:
        each_row[key] = value

with open(outcsvfile, mode='w', encoding='UTF-8') as b_file:
    b_file.write(list_to_csv(columns) + '\n')
    for row_data in whole_data:
        row_list = []
        for ecolumn in columns:
            row_list.append(row_data.get(ecolumn))
        b_file.write(list_to_csv(row_list) + '\n')
CSV output is not written to the source file but to a different file.
The output file is also always truncated and written, hence the 'w' mode.
I would recommend using csv.DictReader and csv.DictWriter classes which will read into and out of python dicts. This would make it easier to modify the dict values that you read in from the JSON file.
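A minimal sketch of that approach (the file names follow the question; on Python 2.7 you may want to open the files in binary mode for the csv module):

import csv
import json

# Load the JSON mapping, e.g. {"a": 20}
with open('a.json') as json_file:
    json_data = json.load(json_file)

with open('a.csv') as src, open('out.csv', 'w') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Overwrite every column that has a matching key in the JSON file.
        for key, value in json_data.items():
            if key in row:
                row[key] = value
        writer.writerow(row)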
I have a file that looks like this:
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
I want to read the sections [DISK] and [CAPACITY]; there will be more sections like these, and I want to read the parameters defined under each of them.
I wrote the following code:
file_open = open(myFile, "r")
all_lines = file_open.readlines()
count = len(all_lines)
file_open.close()

my_data = {}
section = None
data = ""
for line in all_lines:
    line = line.strip()  # remove whitespace
    line = line.replace(" ", "")
    if len(line) != 0:  # remove white spaces between data
        if line[0] == "[":
            section = line.strip()[1:]
            data = ""
        if line[0] != "[":
            data += line + ","
            my_data[section] = [bit for bit in data.split(",") if bit != ""]

print my_data
key = my_data.keys()
print key
Unfortunately I am unable to get those sections and the data under them. Any ideas on this would be helpful.
As others already pointed out, you should be able to use the ConfigParser module.
Nonetheless, if you want to implement the reading/parsing yourself, you should split it up into two parts.
Part 1 would be the parsing at file level: splitting the file up into blocks (in your example you have two blocks: DISK and CAPACITY).
Part 2 would be parsing the blocks itself to get the values.
You know you can ignore the lines starting with !, so let's skip those:
with open('myfile.txt', 'r') as f:
    content = [l for l in f.readlines() if not l.startswith('!')]
Next, read the lines into blocks:
def partition_by(l, f):
    t = []
    for e in l:
        if f(e):
            if t:
                yield t
            t = []
        t.append(e)
    yield t

blocks = partition_by(content, lambda l: l.startswith('['))
and finally read in the values for each block:
def parse_block(block):
    gen = iter(block)
    block_name = next(gen).strip()[1:-1]
    splitted = [e.split('=') for e in gen]
    values = {t[0].strip(): t[1].strip() for t in splitted if len(t) == 2}
    return block_name, values

result = [parse_block(b) for b in blocks]
That's it. Let's have a look at the result:
for section, values in result:
    print section, ':'
    for k, v in values.items():
        print '\t', k, '=', v
output:
DISK :
DIRECTION = 'OK'
TYPE = 'normal'
CAPACITY :
code = 0
ID = 110
Are you able to make a small change to the text file? If you can make it look like this (only changed the comment character):
#--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
#------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
Then parsing it is trivial:
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read('filename')
And getting data looks like this:
(Pdb) parser
<ConfigParser.SafeConfigParser instance at 0x100468dd0>
(Pdb) parser.get('DISK', 'DIRECTION')
"'OK'"
Edit based on comments:
If you're using <= 2.7, then you're a little SOL. The only way really would be to subclass ConfigParser and implement a custom _read method. Really, you'd just have to copy/paste everything in Lib/ConfigParser.py and edit the values in line 477 (2.7.3):
if line.strip() == '' or line[0] in '#;': # add new comment characters in the string
However, if you're running 3'ish (not sure what version it was introduced in offhand, I'm running 3.4(dev)), you may be in luck: ConfigParser added the comment_prefixes __init__ param to allow you to customize your prefix:
parser = ConfigParser(comment_prefixes=('#', ';', '!'))
If the file is not big, you can load it whole and use regexes to find the parts that are of interest to you.
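For instance, a rough sketch (assuming section names are single words and that no value contains a '[' character):

import re

# Capture each "[SECTION]" header plus its body (everything up to the
# next '['), then pull the "key = value" pairs out of the body. The
# '!---' comment lines contain no '=', so they are skipped naturally.
with open('myfile.txt') as f:
    text = f.read()

sections = {}
for name, body in re.findall(r'\[(\w+)\]([^\[]*)', text):
    sections[name] = dict(re.findall(r'(\w+)\s*=\s*(.+)', body))

print(sections)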