Determining a pattern of lines in Python

I'm new to Python and having trouble thinking about this problem Pythonically. I have a text file of SMS messages. There are multi-line statements I'd like to capture.
import fileinput

parsed = {}
for linenum, line in enumerate(fileinput.input()):
    ### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None

###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################

for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
But here's where I hit a wall: I'm not sure of the best way to collect the strings I get here. I'd love some expert input. I'm using Python 2.7.3, but I'm glad to move to 3.
Goal: have a human-readable file full of three-line quotes from these SMS.
Example text:
12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
--
(Yes, before you ask, that's a haiku about poo. I'm trying to capture them from the last 5 years of texting my best friend.)
Ideally resulting in something like:
Haipu 3
2011-03-19
More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump

import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

def get_haikus(lines):
    haiku = None
    for line in lines:
        try:
            ID, timestamp, txt = line.split('|')
            t = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            ID = int(ID)
            if haiku and len(haiku[1]) == 3:
                yield haiku
            haiku = (timestamp, [txt])
        except ValueError:  # happens on error with split(), time or int conversion
            haiku[1].append(line)
    else:
        yield haiku

# now get_haikus() returns tuple (timestamp, [lines])
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print """{d}\n{txt}""".format(d=date, txt=text)

A good start might be something like the following. I'm reading data from a file named data2 but the read_messages generator will consume lines from any iterable.
#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
            message = []
        else:
            message.append(line)
    if message:
        yield message

with open('data2') as file_input:
    for msg in read_messages(file_input):
        print msg
This expects input to look something like the following:
sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
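Since read_messages only needs an iterable of lines, you can also sanity-check it without a file (a quick example using the sample lines above):

sample = ["sent", "message sent away", "it has multiple lines", "--", "rcvd", "message received"]
print(list(read_messages(sample)))
# [['message sent away', 'it has multiple lines'], ['message received']]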


How to resolve ValueError in following code?

The votes are in… and it's up to you to make sure the correct winner is announced!
You've been given a CSV file called nominees.csv, which contains the names of various movies nominated for a prize, and the people who should be announced as the recipient. The file will look like this:
title,director(s)
Schindler's List,Steven Spielberg
"O Brother, Where Art Thou?","Joel Coen, Ethan Coen"
2001: A Space Odyssey,Stanley Kubrick
"Sherlock, Jr.","Buster Keaton, Roscoe Arbuckle"
You should write a program that reads in nominees.csv, asks for the name of the winning title, and prints out specific congratulations. For example, with the above file, your program should work like this:
Winning title: O Brother, Where Art Thou?
Congratulations: Joel Coen, Ethan Coen
Here is another example, using the same file:
Winning title: Schindler's List
Congratulations: Steven Spielberg
I've already tried submitting and altering values, but line number 10 always gives a ValueError, and so does line number 15. When a new list of nominees is applied, it raises the error and my code fails.
def main():
    film_director = []
    with open('nominees.csv', 'r') as read_file:
        lines = read_file.readlines()
    lines = lines[1:]
    for line in lines:
        if '"' in line:
            if line[0] == '"':
                index_second_quotes = line.index('"', 1)
                index_third_quotes = line.index('"', index_second_quotes + 1)
                title = line[:index_second_quotes].strip('\"')
                directors = line[index_third_quotes:-1].strip('\"').strip()
            else:
                index_first_quotes = line.index('"')
                index_second_quotes = line.index('"', index_first_quotes + 1)
                title = line[:index_first_quotes - 1].strip('\"')
                directors = line[index_first_quotes + 1:-1].strip('\"').strip()
            film_director.append([title, directors])
        else:
            tokens = line.split(',')
            film_director.append([tokens[0].strip(), tokens[1].strip()])
    title = input('Winning title: ')
    for row in film_director:
        if title.strip() == row[0]:
            print('Congratulations:', row[1])
            break

main()
The error message given is:
Testing a new nominees file. Your submission raised an exception of type ValueError. This occurred on line 10 of program.py.
All of those condition checks, splits, and concatenations can be avoided with a regular expression. You can use the code below, which needs only a single regular expression plus a fallback split:
import re

with open("nominees.csv") as cf:
    lines = cf.readlines()

for line in lines[1:]:
    reg_match = re.match(r'"([^""]*)","([^""]*)"$', line)
    if reg_match:
        win_title, director = reg_match.group(1), reg_match.group(2)
    else:
        win_title, director = line.split(",")
    print("Winning title: %s" % win_title)
    print("Congratulations: %s" % director.strip())

Checking a text segment within brackets with python

I have a text file, which is structured as follows:
segmentA {
content Aa
content Ab
content Ac
....
}
segmentB {
content Ba
content Bb
content Bc
......
}
segmentC {
content Ca
content Cb
content Cc
......
}
I know how to search for certain strings through the whole text file, but how can I restrict the search for a certain string to one segment, for example "segmentC"? I need something like a regular expression to tell the script:
If the text begins with "segmentC {", search for a certain string until the first "}" appears.
Does anyone have an idea?
Thanks in advance!
Not a regex solution... but it will do the job!
def SearchStuff(lines, sstr):
    i = 0
    while lines[i].strip() != '}':  # strip() so the trailing newline doesn't hide the brace
        # Do stuff ..... for e.g.
        if 'Ca' in lines[i]:
            return lines[i]
        i += 1

def main(search_str):
    f = open('file.txt', 'r')
    lines = f.readlines()
    f.close()
    for line in lines:
        if search_str in line:
            index = lines.index(line)
            break
    lines = lines[index + 1:]
    print SearchStuff(lines, search_str)

search_str = 'segmentC'  # set this string accordingly
main(search_str)
Depending on the complexity you are looking for, you can range from a simple state machine with line-based pattern searching to a full lexer.
Line-based search
The example below assumes that you are only looking for one segment and that segmentC { and the closing } are each on a single line.
def parsesegment(fh):
    # Yields all lines inside "segmentC"
    state = "out"
    for line in fh:
        line = line.strip()  # in case there are whitespaces around
        if state == "out":
            if line.startswith("segmentC {"):
                state = "in"
                continue  # switch state and move on to the segment's contents
        elif state == "in":
            if line.startswith("}"):
                state = "out"
                break
            # Work on the specific lines here
            yield line

with open(...) as fh:
    for line in parsesegment(fh):
        # do something
        pass
Simple Lexer
If you need more flexibility, you can design a simple lexer/parser pair. For example, the following code makes no assumption about how the syntax is organised across lines. It also ignores unknown patterns, which a typical lexer does not (normally it should raise a syntax error):
import re

class ParseSegment:
    # Dictionary of patterns per state
    # Tuples are (token name, pattern, state change command)
    _regexes = {
        "out": [
            ("open", re.compile(r"segment(?P<segment>\w+)\s+\{"), "in")
        ],
        "in": [
            ("close", re.compile(r"\}"), "out"),
            # Here an example of what you could want to match
            ("content", re.compile(r"content\s+(?P<content>\w+)"), None)
        ]
    }

    def lex(self, source, initpos=0):
        pos = initpos
        end = len(source)
        state = "out"
        while pos < end:
            for token_name, reg, state_chng in self._regexes[state]:
                # Try to get a match
                match = reg.match(source, pos)
                if match:
                    # Advance according to how much was matched
                    pos = match.end()
                    # yield a token if it has a name
                    if token_name is not None:
                        # Yield token name, the full matched part of source
                        # and the match grouped according to (?P<tag>) tags
                        yield (token_name, match.group(), match.groupdict())
                    # Switch state if requested
                    if state_chng is not None:
                        state = state_chng
                    break
            else:
                # No match, advance by one character
                # This is particular to that lexer, usually no match means
                # the input file has an error in the syntax and lexer should
                # yield an exception
                pos += 1

    def parse(self, source, initpos=0):
        # This is an example of use of the lexer with a parser
        # This converts the input file into a dictionary. Keys are segment
        # names, and values are list of contents.
        segments = {}
        cur_segment = None
        # Use lexer to get tokens from source
        for token, fullmatch, groups in self.lex(source, initpos):
            # On open, create the list of content in segments
            if token == "open":
                cur_segment = groups["segment"]
                segments[cur_segment] = []
            # On content, ensure we know the segment and add content to the
            # list
            elif token == "content":
                if cur_segment is None:
                    raise RuntimeError("Content found outside a segment")
                segments[cur_segment].append(groups["content"])
            # On close, set the current segment to unknown
            elif token == "close":
                cur_segment = None
            # ignore unknown tokens, we could raise an error instead
        return segments

def main():
    with open("...", "r") as fh:
        data = fh.read()
    lexer = ParseSegment()
    segments = lexer.parse(data)
    print(segments)
    return 0

if __name__ == '__main__':
    main()
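Run against the sample input from the question, the parser should produce something along these lines (note that the segment regex captures only the part after the literal "segment", so the keys come out as "A", "C", and so on):

sample = """segmentA {
    content Aa
    content Ab
}
segmentC {
    content Ca
    content Cb
}"""

print(ParseSegment().parse(sample))
# {'A': ['Aa', 'Ab'], 'C': ['Ca', 'Cb']}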
Full Lexer
Then, if you need even more flexibility and reusability, you will have to create a full parser. No need to reinvent the wheel; have a look at this list of language parsing modules, and you will probably find one that suits you.

How can I print organized ngrams from my email?

I need to do two things at this point but I need your help:
A best practice for cleaning up the data: programmatically deleting superfluous tags & the '>>>>>>>' markers, plus other non-meaningful communication flotsam and jetsam
Once it's cleaned: how do I pack it up to work nicely in Django & SQLite?
Do I make it into a CSV based on date, person, subject, and words, then load those into my data classes within my database?
Well, before I get into the database, I'd like to be able to sort and display the data cleanly. I have very little experience putting things into databases; the closest I come is working from XML, CSV and JSON.
I need to have the ngrams ranked, for example how many times a certain word shows up in a series of emails from a person. I'm trying to get closer to knowing the streams of how people are talking to me about subjects, etc. a very elementary version of Jon Kleinberg's work analyzing his own emails.
be gentle, be rough but please be helpful :)!
> My output currently looks like this: : 1, 'each': 1, 'Me': 1, 'IN!\r\n\r\n2012/1/31': 1, 'calculator.\r\n>>>>>>\r\n>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>': 1, '2012/1/31': 2, 'are': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'be\r\n>>>>>': 1, 'Sent':
import getpass, imaplib, email

# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):

    # parameter n is the 'order' (length) of the desired n-gram
    def __init__(self, text):
        self.text = text
        self.ngrams = dict()

    # feed method calls tokenize to break the given string up into units
    def tokenize(self):
        return self.text.split(" ")

    # feed method takes text, tokenizes it, and visits every group of n tokens
    # in turn, adding the group to self.ngrams or incrementing count in same
    def parse(self):
        tokens = self.tokenize()
        # Moves through every individual word in the text, increments counter
        # if already found, else sets count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1

    def get_ngrams(self):
        return self.ngrams

# loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()

new = open('liamartinez.txt', 'w')

typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE')  # Gets ALL messages

def get_first_text_part(msg):  # where should this be nested?
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()

for num in data[0].split():  # Loops through all messages
    typ, data = M.fetch(num, '(RFC822)')  # Pulls Message
    msg = email.message_from_string(data[0][1])  # data[0][1] holds the raw RFC822 text
    _from = msg['from']  # pull from
    _to = msg['to']  # pull to
    _subject = msg['subject']  # pull subject
    _body = get_first_text_part(msg)  # pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:', msg.get_content_type()
    # print _from
    # print _to
    # print _subject
    # print _body
    #
    new.write(_from)
    print '---------------------------------'

M.close()
M.logout()
There is nothing wrong with your main loop. The process, though, is somewhat slow, as you need to retrieve all your emails from an external server. What I'd suggest is to download all the messages to the client once. Then save them into a database (sqlite, ZODB, MongoDB... whichever you prefer) and perform all the analysis you want on the db objects afterwards. The two processes (downloading and analyzing) are better kept apart from each other; otherwise tuning them would get complicated and code complexity would increase.
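A minimal sketch of that split, using sqlite3 from the standard library (the table layout and column names here are just placeholders):

import sqlite3

conn = sqlite3.connect('mail.db')
conn.execute("""CREATE TABLE IF NOT EXISTS messages
                (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, body TEXT)""")

def store_message(sender, subject, body):
    # call this once per fetched message inside the download loop
    conn.execute("INSERT INTO messages (sender, subject, body) VALUES (?, ?, ?)",
                 (sender, subject, body))
    conn.commit()

def analyze():
    # a separate pass over the stored messages; no IMAP connection needed
    for sender, body in conn.execute("SELECT sender, body FROM messages"):
        counter = NGramCounter(body)
        counter.parse()
        print(counter.get_ngrams())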
replace
if _body:
    ngrams = NGramCounter(_body)
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
    print _feed
with
if _body:
    ngrams = NGramCounter(" ".join(_body.strip(">").split()))
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print _feed
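Note that strip(">") only removes ">" characters from the very ends of the whole body. If the goal is to drop the quoting markers at the start of every line (the ">>>>>>" runs visible in the sample output), a small regex pass is one option (a sketch):

import re

def clean_body(body):
    # drop leading '>' runs on each line, then collapse all remaining whitespace
    no_quotes = re.sub(r'(?m)^[>\s]+', ' ', body)
    return " ".join(no_quotes.split())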

Parsing chat messages as config

I'm trying to write a function that would be able to parse a file of defined messages into a set of replies, but I am at a loss on how to do so.
For example the config file would look:
[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
3: What do you have planned?
Anything special?
4: I am busy working, so nothing in particular.
My calendar is full.
Each new line without a number preceding it is considered part of the reply, just another message in the conversation without waiting for a response.
Thanks
Edit: The config file will contain multiple messages, and I would like the ability to randomly select from them all. Maybe store each reply from a conversation as a list; the replies with extra messages can carry the newline, and then I can just split them by the newline. I'm not really sure what the best approach would be.
Update:
I've got for the most part this coded up so far:
import re

def parseMessages(filename):
    messages = {}
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                begin = begin_message(line).group(2)
            else:
                cont = line.strip()
        else:
            # ??
            pass
    return messages
But now I am stuck on how to store them in the dict the way I'd like.
How would I get this to store a dict like:
{'Message 1':
    {'1': 'Hey\nHow are you?',
     '2': 'Good, today is a good day.',
     '3': 'What do you have planned?\nAnything special?',
     '4': 'I am busy working, so nothing in particular.\nMy calendar is full.'
    }
}
Or if anyone has a better idea, I'm open to suggestions.
Once again, thanks.
Update Two
Here is my final code:
import re

def parseMessages(filename):
    all_messages = {}
    num = None
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        messages = {}
        message = []
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                if num:
                    messages.update({num: '\n'.join(message)})
                    all_messages.update({index: messages})
                del message[:]
                num = int(begin_message(line).group(1))
                begin = begin_message(line).group(2)
                message.append(begin)
            else:
                cont = line.strip()
                if cont:
                    message.append(cont)
        if num:
            # flush the final reply once the file has been read
            messages.update({num: '\n'.join(message)})
            all_messages.update({index: messages})
    return all_messages
Doesn't sound too difficult. Almost-Python pseudocode:
for line in configFile:
    strip comments from line
    if line looks like a section separator:
        section = matched section
    elsif line looks like the beginning of a reply:
        append line to replies[section]
    else:
        append line to last reply in replies[section][-1]
You may want to use the re module for the "looks like" operation. :)
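A concrete version of that pseudocode might look like the following (the two regexes are assumptions based on the sample config above, and the result is a plain section-to-replies dict rather than the nested one shown earlier):

import re
from collections import defaultdict

def parse_config(lines):
    replies = defaultdict(list)
    section = None
    for line in lines:
        line = line.rstrip('\n')
        header = re.match(r'^\[(.+)\]$', line)     # section separator, e.g. [Message 1]
        reply = re.match(r'^(\d+):\s*(.+)', line)  # beginning of a reply, e.g. "1: Hey"
        if header:
            section = header.group(1)
        elif reply:
            replies[section].append(reply.group(2))
        elif line.strip():
            # continuation line: attach it to the last reply in this section
            replies[section][-1] += '\n' + line.strip()
    return dict(replies)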
If you have a relatively small number of strings, why not just supply them as string literals in a dict?
{'How are you?' : 'Good, today is a good day.'}

How can I change in real-time filtered words in tweetstream (Python)?

I need to save all tweets from the Twitter Streaming API to a database in real time, filtering them by a certain list of words, of course. I've achieved this using tweetstream, defining the words list like this before calling FilterStream():
words = ["word1","two words","anotherWord"]
What I'd like to do is to be able to add/change/remove any of those values without stopping the script. To do so, I created a plain text file containing the words I want to filter on, separated by line breaks. Using this code I get the list words just fine:
file = open('words.txt','r')
words = file.read().split("\n")
I made those lines work at startup, but I need the list to be re-read every time the stream is checked. Any ideas?
You could read an updated word list in one thread and process tweets in another one using Queue for communication.
Example:
Thread that reads tweets:
def read_tweets(q):
    words = q.get()
    while True:
        with tweetstream.FilterStream(..track=words,..) as stream:
            for tweet in stream:  # NOTE: it requires special handling if it blocks
                process(tweet)
                try: words = q.get_nowait()  # try to read a new word list
                except Empty: pass
                else: break  # start new connection
Thread that reads words:
def read_words(q):
    words = None
    while True:
        with open('words.txt') as file:
            newwords = file.read().splitlines()
        if words != newwords:
            q.put(newwords)
            words = newwords
        time.sleep(1)
The main script could look like:
import time
from Queue import Queue, Empty  # "queue" on Python 3
from threading import Thread

q = Queue(1)
t = Thread(target=read_tweets, args=(q,))
t.daemon = True
t.start()
read_words(q)
Instead of polling you could use inotify or similar to monitor changes to the 'words.txt' file.
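For example, with the third-party watchdog package (an assumption; any file-watching library would do), the polling loop can be replaced by a handler that pushes a fresh list onto the queue q from the snippet above whenever words.txt changes:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class WordsChanged(FileSystemEventHandler):
    def __init__(self, q, path='words.txt'):
        self.q = q
        self.path = path

    def on_modified(self, event):
        # push a new word list whenever the watched file is written
        if event.src_path.endswith(self.path):
            with open(self.path) as f:
                self.q.put(f.read().splitlines())

observer = Observer()
observer.schedule(WordsChanged(q), path='.')
observer.start()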
Perhaps something like this will work:
import time
from itertools import ifilter  # on Python 3, use the built-in filter instead
import tweetstream

def rebuild_wordlist():
    with open('words.txt', 'r') as f:
        return set(f.read().split('\n'))

def match(tweet):
    return any(w in tweet for w in words)

words, timestamp = rebuild_wordlist(), time.time()
stream = tweetstream.SampleStream("username", "password")
fstream = ifilter(match, stream)

for tweet in fstream:
    do_some_with_tweet(tweet)
    if time.time() > timestamp + 5.0:
        # refresh the wordlist every 5 seconds
        words, timestamp = rebuild_wordlist(), time.time()
The words set is a global that gets refreshed every few seconds while the filter is running.
