I need to do two things at this point, and I need your help:
A best practice for cleaning up the data: programmatically deleting superfluous tags and the '>>>>>>>' markers, plus other non-meaningful communication flotsam and jetsam.
Once it's cleaned, how do I pack it up to work nicely in Django and sqlite?
Do I make it into a CSV based on date, person, subject, and words, then input those into my data classes within my database?
Before I get into the database, though, I'd like to be able to sort and display the data cleanly. I have very little experience putting things into databases; the closest I come is working from XML, CSV, and JSON.
I need to have the ngrams ranked, for example how many times a certain word shows up in a series of emails by a person. I'm trying to get closer to knowing the streams of how people are talking to me about subjects, etc.: a very elementary version of Jon Kleinberg's work analyzing his own emails.
Be gentle, be rough, but please be helpful :)!
My output currently looks like this:
: 1, 'each': 1, 'Me': 1, 'IN!\r\n\r\n2012/1/31': 1, 'calculator.\r\n>>>>>>\r\n>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>': 1, '2012/1/31': 2, 'are': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'be\r\n>>>>>': 1, 'Sent':
import getpass, imaplib, email

# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):

    # parameter n is the 'order' (length) of the desired n-gram
    def __init__(self, text):
        self.text = text
        self.ngrams = dict()

    # tokenize breaks the given string up into units
    def tokenize(self):
        return self.text.split(" ")

    # parse takes text, tokenizes it, and visits every token in turn,
    # adding the token to self.ngrams or incrementing its count
    def parse(self):
        tokens = self.tokenize()
        # moves through every individual word in the text; increments the
        # counter if already found, else sets the count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1

    def get_ngrams(self):
        return self.ngrams

# loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()

new = open('liamartinez.txt', 'w')

typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') # gets ALL messages

def get_first_text_part(msg): # where should this be nested?
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()

for num in data[0].split(): # loops through all messages
    typ, data = M.fetch(num, '(RFC822)') # pulls message
    msg = email.message_from_string(data[0][1]) # puts message into easy-to-use python objects
    _from = msg['from'] # pull from
    _to = msg['to'] # pull to
    _subject = msg['subject'] # pull subject
    _body = get_first_text_part(msg) # pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:', msg.get_content_type()
    # print _from
    # print _to
    # print _subject
    # print _body
    new.write(_from)
    print '---------------------------------'

M.close()
M.logout()
There is nothing wrong in your main loop. The process, though, is somewhat slow, since you need to retrieve all your emails from an external server. What I'd suggest is to download all the messages to the client once. Then save them into a database (sqlite, ZODB, MongoDB, whichever you prefer) and perform all the analysis you want on the db objects afterwards. The two processes (downloading and analyzing) are better kept apart from each other; otherwise tuning them up would get complicated and code complexity would increase.
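For example, here is a minimal sketch of that split using sqlite3 (the database file and table layout are my own invention; it reuses M, data, get_first_text_part and NGramCounter from your code):

import sqlite3

conn = sqlite3.connect('mail.db')  # hypothetical database file
conn.execute("""CREATE TABLE IF NOT EXISTS messages
                (num TEXT, sender TEXT, recipient TEXT, subject TEXT, body TEXT)""")

# download pass: fetch each message once and store it locally, no analysis here
for num in data[0].split():
    typ, msg_data = M.fetch(num, '(RFC822)')
    msg = email.message_from_string(msg_data[0][1])
    conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
                 (num, msg['from'], msg['to'], msg['subject'],
                  get_first_text_part(msg) or ''))
conn.commit()

# analysis pass: runs entirely against the local db, no network involved
for (body,) in conn.execute("SELECT body FROM messages"):
    counter = NGramCounter(body)
    counter.parse()

Once the messages live in sqlite, re-running the analysis (or moving it into Django models later) no longer touches the IMAP server at all.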
replace

if _body:
    ngrams = NGramCounter(_body)
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
    print _feed

with

if _body:
    ngrams = NGramCounter(" ".join(_body.strip(">").split()))
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print _feed
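Note that strip(">") only trims ">" characters from the very ends of the whole body. If you want a more thorough cleanup of the quoted-reply markers, plus the ranking you asked about, something along these lines is one option; a sketch, not a full MIME-aware cleaner, and it assumes the _body variable from the loop above:

import re
from collections import Counter

def clean_body(body):
    # drop quote markers ('>', '>>>>>>') at the start of each line
    body = re.sub(r'(?m)^[> ]+', '', body)
    # collapse \r\n line breaks and runs of whitespace into single spaces
    return ' '.join(body.split())

# Counter.most_common returns the words ranked by how often they occur
words = clean_body(_body).split(' ')
for word, count in Counter(words).most_common(20):
    print("%s %d" % (word, count))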
I made a Python program using tkinter and pandas to select rows and send them by email.
The program lets the user decide which Excel file to operate on;
then asks which sheet of that file to work with;
then asks how many rows to select (using the .tail function);
then the program is supposed to iterate through the selected rows, reading the email address from a cell in each row;
then it sends the correct row to the correct address.
I'm stuck at the iteration process.
Here's the code:
import pandas as pd
import smtplib

def invio_mail(my_tailed_df): # the function receives the sliced (.tail) dataframe
    gmail_user = '###'
    gmail_password = '###'
    sent_from = gmail_user

    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.ehlo()
    server.login(gmail_user, gmail_password)

    list = my_tailed_df

    customers = list['CUSTOMER']
    wrs = list['WR']
    phones = list['PHONE']
    streets = list['STREET']
    cities = list["CITY"]
    mobiles = list['MOBILE']
    mobiles2 = list['MOBILE2']
    mails = list['INST']

    for i in range(len(mails)):
        customer = customers[i]
        wr = wrs[i]
        phone = phones[i]
        street = streets[i]
        city = cities[i]
        mobile = mobiles[i]
        mobile2 = mobiles2[i]

        """
        for i in range(len(mails)):
            if mails[i] == "T138":
                mail = "email_1#gmail.com"
            elif mails[i] == "T139":
                mail = "email_2#gmail.com"
        """

        subject = 'My subject'
        body = f"There you go \n Name: {customer} \n WR: {wr} \n Phone: {phone} \n Street: {street} \n City: {city} \n Mobile: {mobile} \n Mobile2: {mobile2}"

        email_text = """\
From: %s
To: %s
Subject: %s
%s
""" % (sent_from, ", ".join(mail), subject, body)

        try:
            server.sendmail(sent_from, [mail], email_text)
            server.close()
            print('Your email was sent!')
        except:
            print("Some error")
The program raises a KeyError: 0 after it enters the for loop, on the first line inside the loop: customer = customers[i]
I know that the commented part (the nested for loop) will raise the same error.
I'm banging my head against the wall; I think I've read and tried everything.
Where's my error?
Things start to go wrong here: list = my_tailed_df. In Python, list() is a built-in type.
With list = my_tailed_df, you are shadowing that built-in. You can check this:
# before the line:
print(type(list))
# <class 'type'>

list = my_tailed_df

# after the line:
print(type(list))
# <class 'pandas.core.frame.DataFrame'>  (assuming that your df is an actual df!)
This is bad practice: it adds no functional gain at the expense of confusion. E.g. customers = list['CUSTOMER'] does exactly the same thing as customers = my_tailed_df['CUSTOMER'], namely creating a pd.Series with the index from my_tailed_df. So the first thing to do is get rid of list = my_tailed_df and change all those list[...] snippets into my_tailed_df[...].
Next, let's look at your error. for i in range(len(mails)): generates i = 0, 1, ..., len(mails)-1, so you are trying to access the pd.Series at the index 0, 1, etc. The error KeyError: 0 simply means that the index of your original df does not contain the key 0 (e.g. it's a list of IDs or something).
If you don't need the original index (as seems to be the case), you could remedy the situation by resetting the index:
my_tailed_df.reset_index(drop=True, inplace=True)
print(my_tailed_df.index)
# will get you: RangeIndex(start=0, stop=x, step=1)
# where x = len(my_tailed_df) (== len(mails))
Implement the reset before the line customers = my_tailed_df['CUSTOMER'] (so, instead of list = my_tailed_df), and you should be good to go.
Alternatively, you could keep the original index and change for i in range(len(mails)): into for i in mails.index:.
Finally, you could also do for idx, element in enumerate(mails.index): if you want to keep track both of the position of the index element (idx) and its value (element).
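Putting that together, the start of the function might look like this (a sketch based on the code in the question, using the reset-index approach):

def invio_mail(my_tailed_df):   # the function receives the sliced (.tail) dataframe
    # a fresh RangeIndex makes the positional loop below safe
    my_tailed_df = my_tailed_df.reset_index(drop=True)

    customers = my_tailed_df['CUSTOMER']
    mails = my_tailed_df['INST']

    for i in range(len(mails)):
        customer = customers[i]   # keys 0, 1, ... now exist, so no KeyError
        # ... build and send the email exactly as in the question ...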
I have a program which logs on to a specified gmail account and gets all the emails in a selected inbox that were sent from an address you input at runtime.
I would like to grab all the links from each email and append them to a list, so that I can then filter out the ones I don't need before outputting them to another file. I was using a regex to do this, which requires me to convert the payload to a string. The problem is that the regex I am using doesn't work with findall(); it only works when I use search() (I am not too familiar with regexes). I was wondering if there is a better way to extract all links from an email that doesn't involve me messing around with regexes?
My code currently looks like this:
print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET, search_criteria)
if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
    print(f'[{Mail.timestamp}] No mails from that email address could be found...')
    Mail.enter_to_continue()
    import main
    main.main_wrapper()
else:
    pattern = r'(?P<url>https?://[^\s]+)'
    prog = re.compile(pattern)
    self.amount_matching_criteria = self.amount_matching_criteria[0]
    self.amount_matching_criteria_str = str(self.amount_matching_criteria)
    num_mails = re.search(r"\d.+", self.amount_matching_criteria_str)
    num_mails = ((num_mails.group())[:-1]).split(' ')
    sys.stdout.write(Style.GREEN)
    print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
    sys.stdout.write(Style.RESET)
    sys.stdout.write(Style.YELLOW)
    print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
    sys.stdout.write(Style.RESET)
    num_mails = self.amount_matching_criteria.split()
    for message_num in num_mails:
        individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        message = email.message_from_bytes(individual_response_data[0][1])
        if message.is_multipart():
            print('multipart')
            multipart_payload = message.get_payload()
            for sub_message in multipart_payload:
                string_payload = str(sub_message.get_payload())
                print(prog.search(string_payload).group("url"))
I ended up using this for loop with a recursive function and a regex to get the links. I then removed all links that didn't contain the substring you can input earlier in the program, before appending to a set:
for message_num in self.amount_matching_criteria.split():
    counter += 1
    _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
    self.raw = email.message_from_bytes(self.individual_response_data[0][1])
    raw = self.raw
    self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
    self.scraped_email_value = str(self.scraped_email_value)
    self.returned_links = prog.findall(self.scraped_email_value)
    for i in self.returned_links:
        if self.substring_filter in i:
            self.link_set.add(i)
    self.timestamp = time.strftime('%H:%M:%S')
    print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:

def scrape_email(raw):
    if raw.is_multipart():
        return Mail.scrape_email(raw.get_payload(0))
    else:
        return raw.get_payload(None, True)
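For what it's worth, the standard library's Message.walk() already flattens a nested multipart message, so the hand-rolled recursion isn't strictly needed. A sketch of the same idea (the URL regex is the one from the question; the function name and parameters are mine):

import re
import email

prog = re.compile(r'(?P<url>https?://[^\s]+)')

def extract_links(raw_bytes, substring_filter=''):
    message = email.message_from_bytes(raw_bytes)
    links = set()
    # walk() visits every part of the MIME tree, however deeply nested
    for part in message.walk():
        if part.get_content_maintype() == 'text':
            payload = part.get_payload(decode=True)  # decodes base64 / quoted-printable
            if payload:
                text = payload.decode(errors='replace')
                links.update(link for link in prog.findall(text)
                             if substring_filter in link)
    return links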
I've got a script here which (ideally) iterates through multiple pages X of JSON data for each entity Y (in this case, multiple loans X for each team Y). Given the way the API is constructed, I believe I must physically change a subdirectory within the URL in order to iterate through multiple entities. Here is the relevant documentation and an example URL:
GET /teams/:id/loans
Returns loans belonging to a particular team.
Example http://api.kivaws.org/v1/teams/2/loans.json
Parameters:
    id (number) - Required. The team ID for which to return loans.
    page (number) - The page position of results to return. Default: 1
    sort_by (string) - The order by which to sort results. One of: oldest, newest. Default: newest
    app_id (string) - The application id in reverse DNS notation.
    ids_only (string) - Return IDs only to make the return object smaller. One of: true, false. Default: false

Response: loan_listing - HTML, JSON, XML, RSS
Status: Production
And here is my script, which does run and appears to extract the correct data, but doesn't seem to write any data to the outfile:
# -*- coding: utf-8 -*-
import urllib.request as urllib
import json
import time

# storing team loans dict. The key is the team id, the value is the list of loans
team_loans = {}

url = "http://api.kivaws.org/v1/teams/"

# team ids range from 1 to 11885
for i in range(1, 100):
    params = dict(
        id = i
    )
    try:
        handle = urllib.urlopen(str(url + str(i) + "/loans.json"))
        print(handle)
    except:
        print("Could not handle url")
        continue
    # reading response
    item_html = handle.read().decode('utf-8')
    # converting bytes to str
    data = str(item_html)
    # converting to json
    data = json.loads(data)
    # getting number of pages to crawl
    numPages = data['paging']['pages']
    # deleting paging data
    data.pop('paging')
    # calling additional pages
    if numPages > 1:
        for pa in range(2, numPages + 1, 1):
            handle = urllib.urlopen(str(url + str(i) + "/loans.json?page=" + str(pa)))
            print("Pulling loan data from team " + str(i) + "...")
            # reading response
            item_html = handle.read().decode('utf-8')
            # converting bytes to str
            datatemp = str(item_html)
            # converting to json
            datatemp = json.loads(datatemp)
            # paging blocks are redundant headers
            datatemp.pop('paging')
            # adding data to initial list
            for loan in datatemp['loans']:
                data['loans'].append(loan)
            time.sleep(2)
    # recording loans by team in dict
    team_loans[i] = data['loans']
    if (data['loans']):
        print("===Data added to the team_loan dictionary===")
    else:
        print("!!!FAILURE to add data to team_loan dictionary!!!")
    # recording data to file when 10 teams are read
    print("===Finished pulling from page " + str(i) + "===")
    if (int(i) % 10 == 0):
        outfile = open("team_loan.json", "w")
        print("===Now writing data to outfile===")
        json.dump(team_loans, outfile, sort_keys=True, indent=2, ensure_ascii=True)
        outfile.close()
    else:
        print("!!!FAILURE to write data to outfile!!!")
    # compliance with API request limits
    time.sleep(2)

print('Done! Check your outfile (team_loan.json)')
I know that may be a heady amount of code to throw in your faces, but it's a pretty sequential process.
Again, this program is pulling the correct data, but it is not writing this data to the outfile. Can anyone understand why?
For others who may read this post: the script does in fact write data to an outfile. It was simply my test logic that was wrong. Ignore the print statements I have put into place.
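For reference, the misleading part is the else branch on the % 10 check: it prints the FAILURE message for every team index that simply isn't a multiple of ten, even though nothing failed. A less confusing version of that tail end might be (a sketch):

if i % 10 == 0:
    # checkpoint every 10 teams; mode 'w' rewrites the whole dict each
    # time, so earlier teams are not lost between checkpoints
    print("===Now writing data to outfile===")
    with open("team_loan.json", "w") as outfile:
        json.dump(team_loans, outfile, sort_keys=True, indent=2, ensure_ascii=True)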
I'm new to Python and having trouble thinking about this problem Pythonically. I have a text file of SMS messages. There are multi-line statements I'd like to capture.
import fileinput

parsed = {}

for linenum, line in enumerate(fileinput.input()):
    ### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None

###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################

for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
But here's where I hit a wall: I'm not sure of the best way to collect the strings I find here. I'd love some expert input. Using Python 2.7.3, but glad to move to 3.
Goal: have a human-readable file full of three-line quotes from these SMS.
Example text:
12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
--
(Yes, before you ask, that's a haiku about poo. I'm trying to capture them from the last 5 years of texting my best friend.)
Ideally resulting in something like:
Haipu 3
2011-03-19
More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

def get_haikus(lines):
    haiku = None
    for line in lines:
        try:
            ID, timestamp, txt = line.split('|')
            t = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            ID = int(ID)
            if haiku and len(haiku[1]) == 3:
                yield haiku
            haiku = (timestamp, [txt])
        except ValueError: # happens on error with split(), time or int conversion
            haiku[1].append(line)
    else:
        yield haiku

# now get_haikus() returns tuple (timestamp, [lines])
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print """{d}\n{txt}""".format(d=date, txt=text)
A good start might be something like the following. I'm reading data from a file named data2 but the read_messages generator will consume lines from any iterable.
#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
            message = []
        else:
            message.append(line)
    if message:
        yield message

with open('data2') as file_input:
    for msg in read_messages(file_input):
        print msg
This expects input to look something like the following:
sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
I'm trying to write a function that can parse a file of defined messages into a set of replies, but I'm at a loss on how to do so.
For example the config file would look:
[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
3: What do you have planned?
Anything special?
4: I am busy working, so nothing in particular.
My calendar is full.
Each new line without a number preceding it is considered part of the reply: just another message in the conversation, sent without waiting for a response.
Thanks
Edit: The config file will contain multiple messages, and I would like the ability to randomly select from them all. Maybe store each reply from a conversation as a list; then replies that carry extra messages can keep the newline and just be split on it. I'm not really sure what the best approach would be.
Update:
I've got this coded up for the most part so far:
import re

def parseMessages(filename):
    messages = {}
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                begin = begin_message(line).group(2)
            else:
                cont = line.strip()
                # ??
    return messages
But now I am stuck on how to store them into the dict the way I'd like.
How would I get this to store a dict like:
{'Message 1':
    {'1': 'Hey\nHow are you?',
     '2': 'Good, today is a good day.',
     '3': 'What do you have planned?\nAnything special?',
     '4': 'I am busy working, so nothing in particular.\nMy calendar is full.'
    }
}
Or if anyone has a better idea, I'm open for suggestions.
Once again, thanks.
Update Two
Here is my final code:
import re

def parseMessages(filename):
    all_messages = {}
    num = None
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        messages = {}
        message = []
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                if num:
                    messages.update({num: '\n'.join(message)})
                    all_messages.update({index: messages})
                del message[:]
                num = int(begin_message(line).group(1))
                begin = begin_message(line).group(2)
                message.append(begin)
            else:
                cont = line.strip()
                if cont:
                    message.append(cont)
        # flush the last reply, which the loop above never reaches
        if num:
            messages.update({num: '\n'.join(message)})
            all_messages.update({index: messages})
    return all_messages
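A quick check against the example config (the filename here is made up; note the reply keys come out as ints because of the int() conversion):

msgs = parseMessages('messages.cfg')
print(msgs['Message 1'][3])
# What do you have planned?
# Anything special?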
Doesn't sound too difficult. Almost-Python pseudocode:
for line in configFile:
    strip comments from line
    if line looks like a section separator:
        section = matched section
    elif line looks like the beginning of a reply:
        append line to replies[section]
    else:
        append line to last reply in replies[section][-1]
You may want to use the re module for the "looks like" operation. :)
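Fleshed out into real Python, that pseudocode might look like the following sketch (the regexes are my assumptions about the config format shown in the question):

import re
from collections import defaultdict

def parse_replies(path):
    replies = defaultdict(list)   # section name -> list of replies
    section = None
    with open(path) as config_file:
        for line in config_file:
            line = line.rstrip('\n')
            if not line:
                continue
            header = re.match(r'^\[(.+)\]$', line)     # e.g. [Message 1]
            numbered = re.match(r'^\d+: (.*)$', line)  # e.g. 1: Hey
            if header:
                section = header.group(1)
            elif numbered:
                replies[section].append(numbered.group(1))
            else:
                # continuation line: glue it onto the previous reply
                replies[section][-1] += '\n' + line
    return dict(replies)

With the example config, parse_replies(path)['Message 1'] comes back as the four replies, and random.choice can then pick one at random, which covers the edit about random selection.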
If you have a relatively small number of strings, why not just supply them as string literals in a dict?
{'How are you?' : 'Good, today is a good day.'}