Python programming - python

My assignment asks me to write a function called readFasta that accepts one argument: the name of a FASTA-format file (fn) containing one or more sequences. The function should read the file and return a dictionary where the keys are the FASTA headers and the values are the corresponding sequences from file fn, converted to strings. Make sure that you don't include any newlines or other whitespace characters in the sequences in the dictionary.
For example, if afile.fa looks like:
>one
atctac
>two
gggaccttgg
>three
gacattac
then a.readFasta(f) returns:
[‘one’ : ‘atctac’,
‘two’ : ‘gggaccttgg’,
‘three’: ‘gacattac’]
I have tried to write some code, but as I am a total newbie in programming it didn't work out very well. Can anyone please help me? Thank you so much. Here is my code:
import gzip
def readFasta(fn):
    if fn.endswith('.gz'):
        fh = gzip.gzipfile(fn)
    else:
        fh = open(fn,'r')
    d = {}
    while 1:
        line = fh.readline()
        if not line:
            fh.close()
            break
        vals = line.rstrip().split('\t')
        number = vals[0]
        sequence = vals[1]
        if d.has_key(number):
            lst = d[number]
            if gene not in lst:
                # this test may not be necessary
                lst.append(sequence)
        else:
            d[number] = [sequence]
    return d
Here is what I got in my afile.txt
one atctac
two gggaccttgg
three gacattac

Your post is slightly confusing. I assume that you want it to return a dict; in that case, you would write it as {'one': 'actg', 'two': 'aaccttgg'}. If you presented the file format correctly (one whitespace-separated name and sequence per line), then this function should do the trick.
import gzip
def read_fasta(filename):
    # 'rt' opens the gzip file in text mode so the dict holds str, not bytes
    with gzip.open(filename, 'rt') as f:
        return dict(line.split() for line in f)
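If the file is in the FASTA layout shown in the assignment (a `>` header line followed by one or more sequence lines) rather than the two-column layout, a parser along the following lines may be closer to what is asked. This is only a sketch: it assumes plain, uncompressed text, the name `read_fasta_records` is made up for this example, and it takes any iterable of lines so that it can be tried without a file.

```python
def read_fasta_records(lines):
    """Return {header: sequence} from an iterable of FASTA-formatted lines."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            # remove any whitespace inside the sequence line before storing it
            records[header].append(''.join(line.split()))
    # join multi-line sequences into a single string per header
    return {name: ''.join(parts) for name, parts in records.items()}


example = [">one\n", "atc tac\n", ">two\n", "ggga\n", "ccttgg\n", "\n"]
records = read_fasta_records(example)
```

With a real file this could be called as `with open(fn) as fh: d = read_fasta_records(fh)`.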

Related

Saving file format after editing it with ConfigParser

I am using ConfigParser to write some modifications to a configuration file. Basically, what I am doing is:
retrieve my urls from an API
write them to my config file
But after the edit, I noticed that the file format has changed:
Before the edit :
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 10000
[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.prometheus]]
urls= []
interval = "140s"
[inputs.prometheus.tags]
exp = "exp"
After the edit :
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 10000
[[inputs.cpu]
percpu = true
totalcpu = true
[[inputs.prometheus]
interval = "140s"
response_timeout = "120s"
[inputs.prometheus.tags]
exp = "snmp"
The indentation (offset) changed and all the comments that were in the file have been deleted. My code:
edit = configparser.ConfigParser(strict=False, allow_no_value=True, empty_lines_in_values=False)
edit.read("file.conf")
edit.set("[section]", "urls", str(urls))
print(edit)
# Write changes back to file
with open('file.conf', 'w') as configfile:
    edit.write(configfile)
I have already tried SafeConfigParser and RawConfigParser, but they don't work either.
When I do a print(edit.sections()), here is what I get: ['global_tags', 'agent', '[inputs.cpu', '[inputs.prometheus', 'inputs.prometheus.tags']
Can anyone help, please?
Here's an example of a "filter" parser that retains all other formatting but changes the urls line in the agent section if it comes across it:
import io
def filter_config(stream, item_filter):
    """
    Filter a "config" file stream.
    :param stream: Text stream to read from.
    :param item_filter: Filter function; takes a section and a line and returns a filtered line.
    :return: Yields (possibly) filtered lines.
    """
    current_section = None
    for line in stream:
        stripped_line = line.strip()
        if stripped_line.startswith('['):
            current_section = stripped_line.strip('[]')
        elif not stripped_line.startswith("#") and " = " in stripped_line:
            line = item_filter(current_section, line)
        yield line
def urls_filter(section, line):
    if section == "agent" and line.strip().startswith("urls = "):
        start, sep, end = line.partition(" = ")
        # keep the trailing newline so the next line is not glued onto this one
        return start + sep + "hi there...\n"
    return line
# Could be a disk file, just using `io.StringIO()` for self-containedness here
config_file = io.StringIO("""
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 10000
# HELLO! THIS IS A COMMENT!
metric_buffer_limit = 100000
urls = ""
[other]
urls = can't touch this!!!
""")
for line in filter_config(config_file, urls_filter):
    print(line, end="")
The output is
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 10000
# HELLO! THIS IS A COMMENT!
metric_buffer_limit = 100000
urls = hi there...
[other]
urls = can't touch this!!!
So you can see that all comments and (mis-)indentation were preserved.
The problem is that you're passing brackets with the section name, which is unnecessary:
edit.set("[section]", "urls", str(urls))
See this example from the documentation:
import configparser
config = configparser.RawConfigParser()
# Please note that using RawConfigParser's set functions, you can assign
# non-string values to keys internally, but will receive an error when
# attempting to write to a file or when you get it in non-raw mode. Setting
# values using the mapping protocol or ConfigParser's set() does not allow
# such assignments to take place.
config.add_section('Section1')
config.set('Section1', 'an_int', '15')
config.set('Section1', 'a_bool', 'true')
config.set('Section1', 'a_float', '3.1415')
config.set('Section1', 'baz', 'fun')
config.set('Section1', 'bar', 'Python')
config.set('Section1', 'foo', '%(bar)s is %(baz)s!')
# Writing our configuration file to 'example.cfg'
with open('example.cfg', 'w') as configfile:
    config.write(configfile)
In any case, configparser won't preserve the indentation, nor does it support nested sections. You could try the YAML format, which uses indentation to separate nested sections, although it won't keep the exact same indentation when saving either. Do you really need it to be exactly the same? There are various configuration formats out there; you should study them to see which fits your case best.
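To make the point about brackets concrete, here is a small sketch using only the standard library. The sample text and the URL are made up for illustration, and the config is deliberately a plain INI fragment rather than the file from the question.

```python
import configparser
import io

# A tiny INI-style sample (hypothetical content, not the file from the question)
SAMPLE = """\
[agent]
interval = 10s
urls =
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# The section name is passed without brackets
parser.set("agent", "urls", "http://localhost:9100/metrics")

# Passing "[agent]" refers to a section that does not exist
try:
    parser.set("[agent]", "urls", "ignored")
    bracket_error = None
except configparser.NoSectionError as exc:
    bracket_error = exc

# Write the result to an in-memory buffer instead of a disk file
buffer = io.StringIO()
parser.write(buffer)
written = buffer.getvalue()
```

Note that `written` is regenerated from the parsed data, which is why comments and the original layout do not survive a read/write round trip.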

Parsing Json with multiple "levels" with Python

I'm trying to parse a json file from an api call.
I have found this code, which fits my needs, and I am trying to adapt it to what I want:
import math, urllib2, json, re
def download():
    graph = {}
    page = urllib2.urlopen("http://fx.priceonomics.com/v1/rates/?q=1")
    jsrates = json.loads(page.read())
    pattern = re.compile("([A-Z]{3})_([A-Z]{3})")
    for key in jsrates:
        matches = pattern.match(key)
        conversion_rate = -math.log(float(jsrates[key]))
        from_rate = matches.group(1).encode('ascii','ignore')
        to_rate = matches.group(2).encode('ascii','ignore')
        if from_rate != to_rate:
            if from_rate not in graph:
                graph[from_rate] = {}
            graph[from_rate][to_rate] = float(conversion_rate)
    return graph
And I've turned it into:
import math, urllib2, json, re
def download():
    graph = {}
    page = urllib2.urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
    jsrates = json.loads(page.read())
    for pattern in jsrates['result'][0]['MarketName']:
        for key in jsrates['result'][0]['Ask']:
            matches = pattern.match(key)
            conversion_rate = -math.log(float(jsrates[key]))
            from_rate = matches.group(1).encode('ascii','ignore')
            to_rate = matches.group(2).encode('ascii','ignore')
            if from_rate != to_rate:
                if from_rate not in graph:
                    graph[from_rate] = {}
                graph[from_rate][to_rate] = float(conversion_rate)
    return graph
Now the problem is that there are multiple levels in the JSON ("result" > 0, 1, 2, etc.):
for key in jsrates['result'][0]['Ask']:
I want the zero to be any index, if that's clear, so that I can get all the ask prices matched to their market names.
I have shortened the code so the post doesn't get too long.
Thanks
PS: sorry for my English; it's not my native language.
You could loop through all of the result values that are returned, ignoring the meaningless numeric index:
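# jsrates['result'] is indexed with [0] in the question, so it is treated
# as a list here; use .values() instead if it turns out to be a dict
for result in jsrates['result']:
    ask = result.get('Ask')
    if ask is not None:
        # Do things with your ask...
        pass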
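Putting the loop together with the graph-building code from the question might look like the sketch below. It is written for Python 3 and runs on a hand-made sample instead of calling the API; the assumption that market names look like "BTC-LTC" and the function name `build_graph` are mine, not something the question confirms.

```python
import math


def build_graph(payload):
    """Build {base: {quote: -log(ask)}} from a market-summaries style payload."""
    graph = {}
    for market in payload["result"]:
        name = market.get("MarketName")
        ask = market.get("Ask")
        # skip entries without a usable name or a positive price
        if not name or not ask or "-" not in name:
            continue
        base, quote = name.split("-", 1)
        if base != quote:
            graph.setdefault(base, {})[quote] = -math.log(float(ask))
    return graph


# Hand-made sample data, only shaped like the response in the question
sample = {
    "result": [
        {"MarketName": "BTC-LTC", "Ask": 0.5},
        {"MarketName": "BTC-ETH", "Ask": 2.0},
        {"MarketName": "USDT-BTC", "Ask": None},
    ]
}
graph = build_graph(sample)
```

Each entry of `result` carries its own market name and ask price, so no numeric index is needed.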

Design a module to parse text file

I really don't believe in generic text file parsers anymore, especially for files that are meant for human readers. Files like HTML and web logs can be handled well by Beautiful Soup or regular expressions, but human-readable text files are still a tough nut to crack.
So I am willing to hand-code a text file parser, tailored to every different format I encounter. I still want to see whether it is possible to have a better program structure, so that I will still be able to understand the program logic three months down the road, and to keep it readable.
Today I was given a problem to extract the time-stamps from a file:
"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."
The parsing is straightforward: I have the time-stamps in different locations on each line. But I am thinking about how I should design the module/function so that (1) each line format is handled separately and (2) it is easy to branch to the relevant function. For example, I can code each line parser like this:
def parse_as(s):
    return s.split(' ')[2], s.split(' ')[2] # returning the second same as the first for the case that only one time stamp is found
def parse_between(s):
    # "Between 1:12:00 and 3:10:45, ..." has its time-stamps at positions 1 and 3
    return s.split(' ')[1], s.split(' ')[3]
def parse_during(s):
    return s.split(' ')[4], s.split(' ')[6]
This can help me to have a quick idea about the formats already handled by the program. I can always add a new function in case I encounter another new format.
However, I still don't have an elegant way to branch to the relevant function.
# open file
for l in f:
    s = l.split(' ')[0]  # first word of the line
    if s == 'As':
        ts1, ts2 = parse_as(l)
    else:
        if s == 'Between':
            ts1, ts2 = parse_between(l)
        else:
            if s == 'During':
                ts1, ts2 = parse_during(l)
            else:
                print 'error!'
    # process ts1 and ts2
That's not something I want to maintain.
Any suggestions? I once thought a decorator might help, but I couldn't sort it out myself. I would appreciate it if anyone could point me in the right direction.
Consider using a dictionary mapping:
dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}
Then you only need to use it like this:
dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}
for l in f:
    # look up the parser by the first word of the line
    s = l.split(' ')[0]
    p = dmap.get(s, None)
    if p is None:
        print('error')
    else:
        ts1, ts2 = p(l)
        #continue to process
This is a lot easier to maintain. If you have a new function, you just need to add it to dmap together with its keyword:
dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during,
    'After': parse_after,
    'Before': parse_before
    #and so on
}
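The question mentions that a decorator might help. One way to sketch that, assuming the first word of each line identifies its format, is to let a small decorator fill in the mapping, so that each parser is registered where it is defined. The names `handles`, `PARSERS`, and `parse_line` are made up for this illustration.

```python
PARSERS = {}


def handles(keyword):
    """Register the decorated function as the parser for lines starting with keyword."""
    def register(func):
        PARSERS[keyword] = func
        return func
    return register


@handles('As')
def parse_as(line):
    ts = line.split()[2].rstrip(',')
    return ts, ts  # only one time-stamp, so it is returned twice


@handles('Between')
def parse_between(line):
    words = line.split()
    return words[1], words[3].rstrip(',')


@handles('During')
def parse_during(line):
    words = line.split()
    return words[4], words[6]


def parse_line(line):
    """Dispatch on the first word; return None for an unknown format."""
    words = line.split()
    parser = PARSERS.get(words[0]) if words else None
    return parser(line) if parser else None


lines = [
    "As of 12:30:45, ...",
    "Between 1:12:00 and 3:10:45, ...",
    "During this time from 3:44:50 to 4:20:55 we have ...",
    "Something else entirely",
]
results = [parse_line(text) for text in lines]
```

Adding a new format then means writing one decorated function; the dispatch code itself does not change.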
What about
start_with = ["As", "Between", "During"]
parsers = [parse_as, parse_between, parse_during]
for l in f.readlines():
    match_found = False
    # 'parser' rather than 'f', so the file handle is not shadowed
    for start, parser in zip(start_with, parsers):
        if l.startswith(start):
            # the parse_* functions split the line themselves
            ts1, ts2 = parser(l)
            match_found = True
            break
    if not match_found:
        raise NotImplementedError('Not found!')
or with a dict as Ian mentioned:
rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during
}
for l in f.readlines():
    match_found = False
    # 'parser' rather than 'f', so the file handle is not shadowed
    for start, parser in rules.items():
        if l.startswith(start):
            # the parse_* functions split the line themselves
            ts1, ts2 = parser(l)
            match_found = True
            break
    if not match_found:
        raise NotImplementedError('Not found!')
Why not use a regular expression?
import re
# open file
with open('datafile.txt') as f:
    for line in f:
        ts_vals = re.findall(r'(\d+:\d\d:\d\d)', line)
        # process ts1 and ts2
Thus ts_vals will be a list with either one or two elements for the examples provided.
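If the rest of the program expects a (ts1, ts2) pair for every line, the regular-expression result can be normalized in one place. A minimal sketch, with `extract_timestamps` as a made-up helper name, and with the convention from the question that a single time-stamp is returned twice:

```python
import re

TIMESTAMP = re.compile(r'\d+:\d\d:\d\d')


def extract_timestamps(line):
    """Return (first, last) time-stamp found in line, or None if there is none."""
    found = TIMESTAMP.findall(line)
    if not found:
        return None
    # with a single match, first and last are the same value
    return found[0], found[-1]


pairs = [
    extract_timestamps("As of 12:30:45, ..."),
    extract_timestamps("Between 1:12:00 and 3:10:45, ..."),
    extract_timestamps("No time here"),
]
```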

Creating loop for __main__

I am new to Python, and I want your advice on something.
I have a script that runs one input value at a time, and I want it to be able to run a whole list of such values without me typing them in one at a time. I have a hunch that a "for loop" is needed in the main block listed below. The value is "gene_name", so effectively I want to feed in a list of gene names that the script can run through nicely.
Hope I phrased the question correctly, thanks! The chunk in question seems to be
def get_probes_from_genes(gene_names)
import json
import urllib2
import os
import pandas as pd
api_url = "http://api.brain-map.org/api/v2/data/query.json"
def get_probes_from_genes(gene_names):
    if not isinstance(gene_names,list):
        gene_names = [gene_names]
    #in case there are white spaces in gene names
    gene_names = ["'%s'"%gene_name for gene_name in gene_names]
    api_query = "?criteria=model::Probe"
    api_query += ",rma::criteria,[probe_type$eq'DNA']"
    api_query += ",products[abbreviation$eq'HumanMA']"
    api_query += ",gene[acronym$eq%s]"%(','.join(gene_names))
    api_query += ",rma::options[only$eq'probes.id','name']"
    data = json.load(urllib2.urlopen(api_url + api_query))
    d = {probe['id']: probe['name'] for probe in data['msg']}
    if not d:
        raise Exception("Could not find any probes for %s gene. Check " \
            "http://help.brain-map.org/download/attachments/2818165/HBA_ISH_GeneList.pdf?version=1&modificationDate=1348783035873 " \
            "for list of available genes."%gene_name)
    return d
def get_expression_values_from_probe_ids(probe_ids):
    if not isinstance(probe_ids,list):
        probe_ids = [probe_ids]
    #in case there are white spaces in gene names
    probe_ids = ["'%s'"%probe_id for probe_id in probe_ids]
    api_query = "?criteria=service::human_microarray_expression[probes$in%s]"%(','.join(probe_ids))
    data = json.load(urllib2.urlopen(api_url + api_query))
    expression_values = [[float(expression_value) for expression_value in data["msg"]["probes"][i]["expression_level"]] for i in range(len(probe_ids))]
    well_ids = [sample["sample"]["well"] for sample in data["msg"]["samples"]]
    donor_names = [sample["donor"]["name"] for sample in data["msg"]["samples"]]
    well_coordinates = [sample["sample"]["mri"] for sample in data["msg"]["samples"]]
    return expression_values, well_ids, well_coordinates, donor_names
def get_mni_coordinates_from_wells(well_ids):
    package_directory = os.path.dirname(os.path.abspath(__file__))
    frame = pd.read_csv(os.path.join(package_directory, "data", "corrected_mni_coordinates.csv"), header=0, index_col=0)
    return list(frame.ix[well_ids].itertuples(index=False))
if __name__ == '__main__':
    probes_dict = get_probes_from_genes("SLC6A2")
    expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
    print get_mni_coordinates_from_wells(well_ids)
Whoa, first things first. Python ain't Java, so do yourself a favor and use a nice """xxx\nyyy""" string, with triple quotes, for multiline text.
api_query = """?criteria=model::Probe
,rma::criteria,[probe_type$eq'DNA']
...
"""
Or something like that. You will get whitespace (including the newlines) exactly as typed, so you may need to adjust.
If, as suggested, you opt to loop over a file and call your function for each line, you will need to either try/except your data-not-found exception or handle missing data without throwing an exception. I would opt for returning an empty result myself and letting the caller worry about what to do with it.
If you do opt for raising an exception, create your own rather than using the generic Exception. That way your code can catch the exception you expect without also swallowing unrelated errors.
class MyNoDataFoundException(Exception):
    pass
#replace your current raise code with...
if not d:
    raise MyNoDataFoundException("your message here")
A clarification about catching exceptions, using the accepted answer as a starting point:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            #keep track of your input data
            search_data = line.strip()
            try:
                probes_dict = get_probes_from_genes(search_data)
            except MyNoDataFoundException, e:
                #and do whatever you feel you need to do here...
                print "bummer about search_data:%s:\nexception:%s" % (search_data, e)
                #skip this gene; probes_dict was not set for it
                continue
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
You may want to create a file with gene names, then read the contents of the file and call your function in a loop. Here is an example:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            probes_dict = get_probes_from_genes(line.strip())
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
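The loop and the custom exception can be tried out without touching the network. The sketch below is written for Python 3 and replaces the API call with a stand-in lookup, so `lookup_probes`, `NoProbesFound`, `process_genes`, and the fake data are all assumptions made for illustration.

```python
class NoProbesFound(Exception):
    """Raised when a gene lookup returns nothing (hypothetical name)."""


def lookup_probes(gene_name):
    # Stand-in for get_probes_from_genes(); a real version would query the API.
    fake_db = {"SLC6A2": {1001: "probe_a"}, "BDNF": {2002: "probe_b"}}
    probes = fake_db.get(gene_name)
    if not probes:
        raise NoProbesFound("no probes for %s" % gene_name)
    return probes


def process_genes(gene_lines):
    """Look up every gene name; collect results and the names that failed."""
    results, missing = {}, []
    for raw in gene_lines:
        name = raw.strip()
        if not name:
            continue  # ignore blank lines in the input
        try:
            results[name] = lookup_probes(name)
        except NoProbesFound:
            missing.append(name)  # keep going with the next gene
    return results, missing


# The list stands in for the lines of a gene-name file
results, missing = process_genes(["SLC6A2\n", "NOPE\n", "\n", "BDNF\n"])
```

Collecting the failures in a list lets the caller decide afterwards whether a missing gene is an error or just something to report.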

Parsing chat messages as config

I'm trying to write a function that can parse a file of predefined messages for a set of replies, but I am at a loss as to how to do so.
For example, the config file would look like:
[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
3: What do you have planned?
Anything special?
4: I am busy working, so nothing in particular.
My calendar is full.
Each new line without a number preceding it is considered part of the reply, just another message in the conversation without waiting for a response.
Thanks
Edit: The config file will contain multiple messages and I would like to have the ability to randomly select from them all. Maybe store each reply from a conversation as a list, then the replies with extra messages can carry the newline then just split them by the newline. I'm not really sure what would be the best operation.
Update:
I've got for the most part this coded up so far:
def parseMessages(filename):
    messages = {}
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                begin = begin_message(line).group(2)
            else:
                cont = line.strip()
                # ?? (how do I store begin and cont in messages?)
    return messages
But now I am stuck on how to store them in the dict the way I'd like.
How would I get this to store a dict like:
{'Message 1':
    {'1': 'Hey\nHow are you?',
     '2': 'Good, today is a good day.',
     '3': 'What do you have planned?\nAnything special?',
     '4': 'I am busy working, so nothing in particular.\nMy calendar is full.'
    }
}
Or if anyone has a better idea, I'm open for suggestions.
Once again, thanks.
Update Two
Here is my final code:
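import re
def parseMessages(filename):
    all_messages = {}
    index = None    # current section name
    num = None      # current reply number
    message = []    # lines of the current reply
    # \d+ so that reply numbers above 9 are recognized too
    begin_message = lambda x: re.match(r'^(\d+)\: (.+)', x)
    def store():
        # save the reply collected so far under its section and number
        if index is not None and num is not None:
            all_messages.setdefault(index, {})[num] = '\n'.join(message)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                store()
                index, num, message = m.group(1), None, []
            elif begin_message(line):
                store()
                num = int(begin_message(line).group(1))
                message = [begin_message(line).group(2)]
            else:
                cont = line.strip()
                if cont:
                    message.append(cont)
    store()  # the last reply is not followed by another header
    return all_messages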
Doesn't sound too difficult. Almost-Python pseudocode:
for line in configFile:
    strip comments from line
    if line looks like a section separator:
        section = matched section
    elif line looks like the beginning of a reply:
        append line to replies[section]
    else:
        append line to last reply in replies[section][-1]
You may want to use the re module for the "looks like" operation. :)
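A runnable take on that pseudocode might look like the following. It is a sketch under a few assumptions: sections look like `[Name]`, a reply starts with a number and a colon, and comment stripping is left out. The names `parse_replies`, `SECTION`, and `REPLY` are made up, and the function takes any iterable of lines so it can be tested without a file.

```python
import re

SECTION = re.compile(r'^\[(.+)\]$')
REPLY = re.compile(r'^\d+:\s*(.*)$')


def parse_replies(lines):
    """Return {section: [reply, ...]}, where each reply is a list of message lines."""
    replies = {}
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = SECTION.match(line)
        if header:
            section = header.group(1)
            replies[section] = []
            continue
        if section is None:
            continue  # ignore text before the first section
        start = REPLY.match(line)
        if start:
            replies[section].append([start.group(1)])   # beginning of a reply
        elif replies[section]:
            replies[section][-1].append(line)           # continuation line


    return replies


sample = """[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
""".splitlines()
parsed = parse_replies(sample)
```

Keeping each reply as a list of lines avoids splitting on newlines later, and `random.choice(list(parsed))` would pick one conversation at random.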
If you have a relatively small number of strings, why not just supply them as string literals in a dict?
{'How are you?' : 'Good, today is a good day.'}
