Parsing chat messages as config - python

I'm trying to write a function that can parse a file of predefined messages for a set of replies, but I'm at a loss on how to do so.
For example, the config file would look like this:
[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
3: What do you have planned?
Anything special?
4: I am busy working, so nothing in particular.
My calendar is full.
Each new line without a number preceding it is considered part of the reply, just another message in the conversation without waiting for a response.
Thanks
Edit: The config file will contain multiple messages and I would like to have the ability to randomly select from them all. Maybe store each reply from a conversation as a list, then the replies with extra messages can carry the newline then just split them by the newline. I'm not really sure what would be the best operation.
Update:
I've got for the most part this coded up so far:
import re

def parseMessages(filename):
    messages = {}
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                begin = begin_message(line).group(2)
            else:
                cont = line.strip()
                # ??
    return messages
But now I am stuck on storing them in the dict the way I'd like.
How would I get this to store a dict like:
{'Message 1':
    {'1': 'Hey\nHow are you?',
     '2': 'Good, today is a good day.',
     '3': 'What do you have planned?\nAnything special?',
     '4': 'I am busy working, so nothing in particular.\nMy calendar is full.'
    }
}
Or if anyone has a better idea, I'm open for suggestions.
Once again, thanks.
Update Two
Here is my final code:
import re

def parseMessages(filename):
    all_messages = {}
    index = None   # name of the current [section]
    num = None     # number of the reply being collected
    message = []   # lines of the reply being collected
    begin_message = lambda x: re.match(r'^(\d+)\: (.+)', x)

    def store():
        # save the reply collected so far under its section and number
        if index is not None and num is not None:
            all_messages.setdefault(index, {})[num] = '\n'.join(message)

    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            b = begin_message(line)
            if m:
                store()
                index, num, message = m.group(1), None, []
            elif b:
                store()
                num = int(b.group(1))
                message = [b.group(2).strip()]
            else:
                cont = line.strip()
                if cont:
                    message.append(cont)
    store()  # the last reply has no following line to trigger a save
    return all_messages

Doesn't sound too difficult. Almost-Python pseudocode:
for line in configFile:
    strip comments from line
    if line looks like a section separator:
        section = matched section
    elif line looks like the beginning of a reply:
        append line to replies[section]
    else:
        append line to the last reply, replies[section][-1]
You may want to use the re module for the "looks like" operation. :)
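The pseudocode above could be fleshed out roughly as follows. This is only a sketch: the names parse_replies, SECTION_RE and REPLY_RE are made up for illustration, comment stripping is left out, and replies are kept as a list per section instead of the numbered dict shown in the question.

```python
import re
from collections import defaultdict

SECTION_RE = re.compile(r'^\[(.+)\]$')      # e.g. "[Message 1]"
REPLY_RE = re.compile(r'^(\d+):\s*(.*)$')   # e.g. "1: Hey"

def parse_replies(lines):
    """Return {section: [reply, ...]}; continuation lines are joined with newlines."""
    replies = defaultdict(list)
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        section_match = SECTION_RE.match(line)
        reply_match = REPLY_RE.match(line)
        if section_match:
            section = section_match.group(1)
        elif reply_match:
            replies[section].append(reply_match.group(2))
        elif replies.get(section):
            # no number in front: extend the most recent reply
            replies[section][-1] += '\n' + line
    return dict(replies)

sample = """[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
""".splitlines()

parsed = parse_replies(sample)
print(parsed)
```

Because each section maps to a plain list, picking a random conversation afterwards could be as simple as random.choice(list(parsed)).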

If you have a relatively small number of strings, why not just supply them as string literals in a dict?
{'How are you?' : 'Good, today is a good day.'}


Design a module to parse text file

I no longer believe in generic text file parsers, especially for files meant for human readers. Files like HTML and web logs can be handled well by Beautiful Soup or regular expressions, but human-readable text files are still a tough nut to crack.
I am willing to hand-code a text file parser tailored to each format I encounter. Still, I want to see whether a better program structure is possible, one whose logic I will still understand three months down the road and that stays readable.
Today I was given a problem to extract the time-stamps from a file:
"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."
The parsing is straightforward. The time-stamps are in different locations on each line. But I am thinking about how to design the module/functions so that (1) each line format is handled separately, and (2) the program branches to the relevant function. For example, I can code each line parser like this:
def parse_as(s):
    # only one time-stamp is found, so return it as both values
    return s.split(' ')[2], s.split(' ')[2]

def parse_between(s):
    return s.split(' ')[1], s.split(' ')[3]

def parse_during(s):
    return s.split(' ')[4], s.split(' ')[6]
This can help me to have a quick idea about the formats already handled by the program. I can always add a new function in case I encounter another new format.
However, I still don't have an elegant way to branch to the relevant function.
# open file
for l in f:
    s = l.split(' ')[0]  # first word decides which parser to use
    if s == 'As':
        ts1, ts2 = parse_as(l)
    else:
        if s == 'Between':
            ts1, ts2 = parse_between(l)
        else:
            if s == 'During':
                ts1, ts2 = parse_during(l)
            else:
                print('error!')
    # process ts1 and ts2
That's not something I want to maintain.
Any suggestions? I once thought a decorator might help, but I couldn't sort it out myself. I would appreciate it if anyone could point me in the right direction.
Consider using a dictionary mapping:
dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}
Then you only need to use it like this:
for l in f:
    s = l.split(' ')[0]  # the first word is the lookup key
    p = dmap.get(s, None)
    if p is None:
        print('error')
    else:
        ts1, ts2 = p(l)
        # continue to process
This is a lot easier to maintain. If you have a new function, you just need to add it to dmap together with its keyword:
dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during,
    'After': parse_after,
    'Before': parse_before
    # and so on
}
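The decorator idea mentioned in the question can be combined with the dictionary mapping, so that each parser registers itself next to its definition. The following is a sketch under assumptions: the names PARSERS, handles and parse_line are invented for illustration, and the parsers here strip the trailing comma, which the original functions do not.

```python
PARSERS = {}  # keyword -> parser function, filled in by the decorator

def handles(keyword):
    """Register the decorated function as the parser for lines starting with keyword."""
    def register(func):
        PARSERS[keyword] = func
        return func
    return register

@handles('As')
def parse_as(line):
    words = line.split()
    stamp = words[2].rstrip(',')
    return stamp, stamp  # single time-stamp, returned twice

@handles('Between')
def parse_between(line):
    words = line.split()
    return words[1].rstrip(','), words[3].rstrip(',')

def parse_line(line):
    """Look up the parser by the first word of the line and call it."""
    keyword = line.split(' ', 1)[0]
    parser = PARSERS.get(keyword)
    if parser is None:
        raise ValueError('no parser registered for %r' % keyword)
    return parser(line)

print(parse_line('As of 12:30:45, ...'))
print(parse_line('Between 1:12:00 and 3:10:45, ...'))
```

With this layout, supporting a new format means writing one decorated function; the dispatch code does not change.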
What about
start_with = ["As", "Between", "During"]
parsers = [parse_as, parse_between, parse_during]
for l in f.readlines():
    match_found = False
    for start, parser in zip(start_with, parsers):
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers split the line themselves
            match_found = True
            break
    if not match_found:
        raise NotImplementedError('Not found!')
or with a dict as Ian mentioned:
rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during
}
for l in f.readlines():
    match_found = False
    for start, parser in rules.items():
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers split the line themselves
            match_found = True
            break
    if not match_found:
        raise NotImplementedError('Not found!')
Why not use a regular expression?
import re

# open file
with open('datafile.txt') as f:
    for line in f:
        ts_vals = re.findall(r'(\d+:\d\d:\d\d)', line)
        # process ts1 and ts2
Thus ts_vals will be a list with either one or two elements for the examples provided.
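To see what the regular expression returns for the three sample lines from the question, here is a small self-contained check. The helper name timestamps is made up for illustration; it repeats the single time-stamp when only one is found, mirroring what parse_as does in the question.

```python
import re

TIMESTAMP_RE = re.compile(r'\d+:\d\d:\d\d')

def timestamps(line):
    """Return (first, last) time-stamp found in the line."""
    vals = TIMESTAMP_RE.findall(line)
    if not vals:
        raise ValueError('no time-stamp in %r' % line)
    return vals[0], vals[-1]

lines = [
    'As of 12:30:45, ...',
    'Between 1:12:00 and 3:10:45, ...',
    'During this time from 3:44:50 to 4:20:55 we have ...',
]
found = [TIMESTAMP_RE.findall(line) for line in lines]
pairs = [timestamps(line) for line in lines]
for pair in pairs:
    print(pair)
```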

Creating loop for __main__

I am new to Python, and I want your advice on something.
I have a script that runs one input value at a time, and I want it to run a whole list of such values without me typing them in one at a time. I have a hunch that a for loop is needed in the main block listed below. The value is the gene name, so effectively I want to feed in a list of gene names that the script can run through.
Hope I phrased the question correctly, thanks! The chunk in question seems to be get_probes_from_genes(gene_names):
import json
import urllib2
import os
import pandas as pd

api_url = "http://api.brain-map.org/api/v2/data/query.json"

def get_probes_from_genes(gene_names):
    if not isinstance(gene_names, list):
        gene_names = [gene_names]
    # in case there are white spaces in gene names
    gene_names = ["'%s'" % gene_name for gene_name in gene_names]
    api_query = "?criteria=model::Probe"
    api_query += ",rma::criteria,[probe_type$eq'DNA']"
    api_query += ",products[abbreviation$eq'HumanMA']"
    api_query += ",gene[acronym$eq%s]" % (','.join(gene_names))
    api_query += ",rma::options[only$eq'probes.id','name']"
    data = json.load(urllib2.urlopen(api_url + api_query))
    d = {probe['id']: probe['name'] for probe in data['msg']}
    if not d:
        raise Exception("Could not find any probes for %s gene. Check "
                        "http://help.brain-map.org/download/attachments/2818165/HBA_ISH_GeneList.pdf?version=1&modificationDate=1348783035873 "
                        "for list of available genes." % ','.join(gene_names))
    return d

def get_expression_values_from_probe_ids(probe_ids):
    if not isinstance(probe_ids, list):
        probe_ids = [probe_ids]
    # in case there are white spaces in gene names
    probe_ids = ["'%s'" % probe_id for probe_id in probe_ids]
    api_query = "?criteria=service::human_microarray_expression[probes$in%s]" % (','.join(probe_ids))
    data = json.load(urllib2.urlopen(api_url + api_query))
    expression_values = [[float(expression_value) for expression_value in data["msg"]["probes"][i]["expression_level"]] for i in range(len(probe_ids))]
    well_ids = [sample["sample"]["well"] for sample in data["msg"]["samples"]]
    donor_names = [sample["donor"]["name"] for sample in data["msg"]["samples"]]
    well_coordinates = [sample["sample"]["mri"] for sample in data["msg"]["samples"]]
    return expression_values, well_ids, well_coordinates, donor_names

def get_mni_coordinates_from_wells(well_ids):
    package_directory = os.path.dirname(os.path.abspath(__file__))
    frame = pd.read_csv(os.path.join(package_directory, "data", "corrected_mni_coordinates.csv"), header=0, index_col=0)
    return list(frame.ix[well_ids].itertuples(index=False))

if __name__ == '__main__':
    probes_dict = get_probes_from_genes("SLC6A2")
    expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
    print get_mni_coordinates_from_wells(well_ids)
Whoa, first things first. Python ain't Java, so do yourself a favor and use a triple-quoted string, """xxx\nyyy""", for multi-line text.
api_query = """?criteria=model::Probe
,rma::criteria,[probe_type$eq'DNA']
...
"""
Or something like that. You will get whitespace exactly as typed, newlines included, so you may need to adjust.
If, as suggested, you opt to loop over the calls to your function while reading from a file, you will need to either try/except your data-not-found exception or handle missing data without throwing an exception. I would opt for returning an empty result myself and letting the caller worry about what to do with it.
If you do opt for raising an exception, create your own rather than using the generic Exception. That way your code can catch the exception you expect first.
class MyNoDataFoundException(Exception):
    pass

# replace your current raise code with...
if not d:
    raise MyNoDataFoundException("your message here")
Clarification about catching exceptions, using the accepted answer as a starting point:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt", "r") as f:
        for line in f.readlines():
            # keep track of your input data
            search_data = line.strip()
            try:
                probes_dict = get_probes_from_genes(search_data)
            except MyNoDataFoundException as e:
                # and do whatever you feel you need to do here...
                print("bummer about search_data:%s:\nexception:%s" % (search_data, e))
                continue  # no probes for this gene, so skip to the next line
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print(get_mni_coordinates_from_wells(well_ids))
You may want to create a file with gene names, then read the content of the file and call your function in a loop. Here is an example:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt", "r") as f:
        for line in f.readlines():
            probes_dict = get_probes_from_genes(line.strip())
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print(get_mni_coordinates_from_wells(well_ids))
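The loop-plus-exception pattern discussed in these answers can be tried without network access. The sketch below is an illustration only: NoDataFound, run_all and fake_lookup are hypothetical names, and fake_lookup merely stands in for get_probes_from_genes.

```python
class NoDataFound(Exception):
    """Hypothetical exception raised when a lookup returns nothing."""

def run_all(gene_names, lookup):
    """Call lookup for every gene name; collect results and failures separately."""
    results, failures = {}, {}
    for name in gene_names:
        name = name.strip()
        if not name:
            continue  # ignore blank lines from the input file
        try:
            results[name] = lookup(name)
        except NoDataFound as exc:
            failures[name] = str(exc)
    return results, failures

def fake_lookup(name):
    # stand-in for get_probes_from_genes; the data here is invented
    known = {'SLC6A2': {1: 'probe_a'}}
    if name not in known:
        raise NoDataFound('no probes for %s' % name)
    return known[name]

results, failures = run_all(['SLC6A2\n', 'NOPE\n', '\n'], fake_lookup)
print(results)
print(failures)
```

Collecting failures instead of stopping at the first one lets the whole gene list run through, with the problem names reported at the end.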

Python refresh file from disk

I have a python script that calls a system program and reads the output from a file out.txt, acts on that output, and loops. However, it doesn't work, and a close investigation showed that the python script just opens out.txt once and then keeps on reading from that old copy. How can I make the python script reread the file on each iteration? I saw a similar question here on SO but it was about a python script running alongside a program, not calling it, and the solution doesn't work. I tried closing the file before looping back but it didn't do anything.
EDIT:
I already tried closing and opening, it didn't work. Here's the code:
import subprocess, os, sys

filename = sys.argv[1]
file = open(filename,'r')
foo = open('foo','w')
foo.write(file.read().rstrip())
foo = open('foo','a')
crap = open(os.devnull,'wb')
numSolutions = 0
while True:
    subprocess.call(["minisat", "foo", "out"], stdout=crap,stderr=crap)
    out = open('out','r')
    if out.readline().rstrip() == "SAT":
        numSolutions += 1
        clause = out.readline().rstrip()
        clause = clause.split(" ")
        print clause
        clause = map(int,clause)
        clause = map(lambda x: -x,clause)
        output = ' '.join(map(lambda x: str(x),clause))
        print output
        foo.write('\n'+output)
        out.close()
    else:
        break
print "There are ", numSolutions, " solutions."
You need to flush foo so that the external program can see its latest changes. When you write to a file, the data is buffered in the local process and sent to the system in larger blocks. This is done because updating the system file is relatively expensive. In your case, you need to force a flush of the data so that minisat can see it.
foo.write('\n'+output)
foo.flush()
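The buffering effect can be observed without minisat. In this sketch (the temporary path and the clause text are made up for illustration), a second file handle plays the role of the external program: before flush() it typically sees nothing, because a small write usually stays in the writer's buffer, and after flush() it sees the new text.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'foo')

writer = open(path, 'a')
writer.write('-1 2 -3')

# Before flush, the text usually still sits in this process's buffer.
with open(path) as reader:
    before = reader.read()

writer.flush()  # hand the buffered text to the operating system

# After flush, any other reader of the file can see it.
with open(path) as reader:
    after = reader.read()

writer.close()
print(repr(before), repr(after))
```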
I rewrote it to hopefully be a bit easier to understand:
import os
from shutil import copyfile
import subprocess
import sys

TEMP_CNF = "tmp.in"
TEMP_SOL = "tmp.out"
NULL = open(os.devnull, "wb")

def all_solutions(cnf_fname):
    """
    Given a file containing a set of constraints,
    generate all possible solutions.
    """
    # make a copy of original input file
    copyfile(cnf_fname, TEMP_CNF)
    while True:
        # run minisat to solve the constraint problem
        subprocess.call(["minisat", TEMP_CNF, TEMP_SOL], stdout=NULL, stderr=NULL)
        # look at the result
        with open(TEMP_SOL) as result:
            line = next(result)
            if line.startswith("SAT"):
                # Success - return solution
                line = next(result)
                solution = [int(i) for i in line.split()]
                yield solution
            else:
                # Failure - no more solutions possible
                break
        # disqualify found solution; the with block closes (and so flushes)
        # the file before minisat runs again
        with open(TEMP_CNF, "a") as constraints:
            new_constraint = " ".join(str(-i) for i in solution)
            constraints.write("\n")
            constraints.write(new_constraint)

def main(cnf_fname):
    """
    Given a file containing a set of constraints,
    count the possible solutions.
    """
    count = sum(1 for i in all_solutions(cnf_fname))
    print("There are {} solutions.".format(count))

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {} cnf.in".format(sys.argv[0]))
Open the file inside the loop and end each iteration with file_var.close(), so that every pass reads the file afresh:
for ... :
    ga_file = open('out.txt', 'r')
    ... do stuff
    ga_file.close()
Demo of an implementation below (as simple as possible, this is all of the Jython code needed)...
__author__ = ''
import time

var = 'false'
while var == 'false':
    out = open('out.txt', 'r')
    content = out.read()
    time.sleep(3)
    print content
    out.close()
generates this output:
2015-01-09, 'stuff added'
2015-01-09, 'stuff added' # <-- this is when i just saved my update
2015-01-10, 'stuff added again :)' # <-- my new output from file reads
I strongly recommend reading the error messages. They hold quite a lot of information.
I think the full file name should be written for debug purposes.

Is there a good regular expression for multiline matching of received SIP invites?

I need a Python regexp that can extract information from data like this:
Data:
Received from 1.1.1.1 18:41:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com To:
sdfasdfasdfas From: "test"
Via:
sdafsdfasdfasd
Sent from 1.1.1.1 18:42:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
From: "test"
To:
sdfasdfasdfas Via:
sdafsdfasdfasd
Received from 1.1.1.1 18:50:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
Via: sdafsdfasdfasd
From: "test"
To:
sdfasdfasdfas
What I need to achieve is to find the newest INVITE that was "Received" in order to get its From: header value. So I would be searching the data backwards.
Is it possible with a single regexp? :)
Thanks.
One-line answer, assuming you suck the entire header into a string with embedded newlines (or cr/nl's):
sorted(re.findall("Received [^\r\n]+ (\d{2}:\d{2}:\d{2}:\d{3})[^\"]+From: \"([^\r\n]+)\"", data))[-1][1]
The trick to doing it with one RE is using [^\r\n] instead of . when you want to scan over stuff. This works assuming the From value always has the double quotes. The double quotes are used to keep the scanner from swallowing the entire string at the first Received... ;)
I do not think a single regular expression is the answer. I think a stateful line-by-line matcher is what you're looking for here.
import re
import collections

_msg_start_re = re.compile(r'^(Received|Sent)\s+from\s+(\S.*):\s*$')
_msg_field_re = re.compile(r'^([A-Za-z](?:(?:\w|-)+)):\s+(\S(?:.*\S)?)\s*$')

def message_parser():
    hdr = None
    fields = collections.defaultdict(list)
    msg = None
    while True:
        if msg is not None:
            # hand the finished message to the caller, then start over
            line = (yield msg)
            msg = None
            hdr = None
            fields = collections.defaultdict(list)
        else:
            line = (yield None)
        if hdr is None:
            hdr_match = _msg_start_re.match(line)
            hdr = None if hdr_match is None else hdr_match.groups()
        elif len(fields) <= 0:
            field_match = _msg_field_re.match(line)
            if field_match is not None:
                fields[field_match.group(1)].append(field_match.group(2))
        else:  # Waiting for the end of the message
            if line.strip() == '':
                msg = (hdr, dict(fields))
            else:
                field_match = _msg_field_re.match(line)
                if field_match is not None:  # ignore lines that are not fields
                    fields[field_match.group(1)].append(field_match.group(2))
Example of use:
parser = message_parser()
next(parser)  # advance the generator to its first yield
recvd_invites = [msg for msg in (parser.send(line) for line in linelst)
                 if (msg is not None) and
                    (msg[0][0] == 'Received') and
                    ('INVITE' in msg[1])]
You might be able to do this with a multiple line regex, but if you do it this way you get the message nicely parsed into its various fields. Presumably you want to do something interesting with the messages, and this will let you do a whole bunch more with them without having to use more regexps.
This also allows you to parse something other than an already existing file or a giant string with all the messages in it. For example, if you want to parse the output of a pipe that prints these requests as they happen, you can simply do msg = parser.send(line) every time you receive a line and get a new message out as soon as it has all been printed (if the line isn't the end of a message, msg will be None).
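For the narrower goal in the question (the From: value of the newest received INVITE), a middle ground between the two answers is to split the text at the "Received from"/"Sent from" lines and search each block. The sketch below rests on assumptions: the sample data is a tidied-up version of the question's log with invented addresses, the time-stamps are fixed-width so they compare correctly as strings, and latest_received_from is a made-up name.

```python
import re

HEADER_RE = re.compile(r'^(Received|Sent) from \S+ (\d{2}:\d{2}:\d{2}:\d{3})',
                       re.MULTILINE)
FROM_RE = re.compile(r'From:\s*"([^"\r\n]*)"')

def latest_received_from(data):
    """Return the From: value of the newest received INVITE, or None."""
    headers = list(HEADER_RE.finditer(data))
    best = None
    for i, header in enumerate(headers):
        # a block runs from this header to the next one (or to the end)
        end = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        block = data[header.end():end]
        if header.group(1) != 'Received' or 'INVITE' not in block:
            continue
        from_match = FROM_RE.search(block)
        if from_match is None:
            continue
        candidate = (header.group(2), from_match.group(1))
        if best is None or candidate[0] >= best[0]:
            best = candidate
    return None if best is None else best[1]

sample = '''Received from 1.1.1.1 18:41:51:330
(123 bytes):
INVITE: sip:a@example.com
From: "old"

Sent from 1.1.1.1 18:42:51:330
(123 bytes):
INVITE: sip:a@example.com
From: "sent"

Received from 1.1.1.1 18:50:51:330
(123 bytes):
INVITE: sip:a@example.com
From: "new"
'''

newest = latest_received_from(sample)
print(newest)
```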

Python programming

My assignment ask to make a function call readFasta that accepts one argument: the name of a fasta format file (fn) containing one or more sequences. The function should read the file and return a dictionary where the keys are the fasta headers and the values are the corresponding sequences from file fn converted to strings. Make sure that you don't include any new lines or other white space characters in the sequences in the dictionary.
For ex, if afile.fa looks like:
>one
atctac
>two
gggaccttgg
>three
gacattac
then a.readFasta(f) returns:
{'one': 'atctac',
 'two': 'gggaccttgg',
 'three': 'gacattac'}
I have tried to write some code, but as I am a total newbie in programming, it didn't work out very well for me. Can anyone please help me? Thank you so much. Here is my code:
import gzip

def readFasta(fn):
    if fn.endswith('.gz'):
        fh = gzip.gzipfile(fn)
    else:
        fh = open(fn,'r')
    d = {}
    while 1:
        line = fh.readline()
        if not line:
            fh.close()
            break
        vals = line.rstrip().split('\t')
        number = vals[0]
        sequence = vals[1]
        if d.has_key(number):
            lst = d[number]
            if gene not in lst:
                # this test may not be necessary
                lst.append(sequence)
        else:
            d[number] = [sequence]
    return d
Here is what I got in my afile.txt
one atctac
two gggaccttgg
three gacattac
Your post is slightly confusing. I assume that you want it to return a dict; in that case, you would write it as {'one': 'actg', 'two': 'aaccttgg'}. If you presented the file format correctly (a name and a sequence separated by whitespace on each line, in a gzip-compressed file), then this function should do the trick.
import gzip

def read_fasta(filename):
    with gzip.open(filename) as f:
        return dict(line.split() for line in f)
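If the file really is in the ">header" layout shown in the assignment text, a reader along the following lines may be closer to what is asked. It is a sketch written for Python 3, assuming that a sequence may span several lines and that a ".gz" suffix means the file is gzip-compressed; the temporary file at the end exists only to demonstrate the call.

```python
import gzip
import os
import tempfile

def read_fasta(filename):
    """Return {header: sequence} with all whitespace removed from the sequences."""
    opener = gzip.open if filename.endswith('.gz') else open
    sequences = {}
    header = None
    with opener(filename, 'rt') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith('>'):
                header = line[1:]
                sequences[header] = []
            elif header is not None:
                # drop any whitespace inside the sequence line
                sequences[header].append(''.join(line.split()))
    return {name: ''.join(parts) for name, parts in sequences.items()}

# demonstration with a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), 'afile.fa')
with open(demo_path, 'w') as out:
    out.write('>one\natctac\n>two\nggga\nccttgg\n>three\ngacattac\n')

result = read_fasta(demo_path)
print(result)
```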
