imap_tools Taking Long Time to Scrape Links from Emails - python

I am using imap_tools to get links from emails. The emails are very small, with very little text or graphics, and there are not many of them, around 20-40 spread through the day.
When a new email arrives it takes between 10 and 25 seconds to scrape the link. This seems very long; I would have expected under 2 seconds, and speed is important.
N.B. it is a shared mailbox, so I cannot simply fetch unseen emails, because other users will often have opened emails before the scraper gets to them.
Can anyone see what the issue is?
import pandas as pd
from imap_tools import MailBox, AND
import re, time, datetime, os
from config import email, password

uids = []
yahooSmtpServer = "imap.mail.yahoo.com"

data = {
    'today': str(datetime.datetime.today()).split(' ')[0],
    'uids': []
}

while True:
    while True:
        try:
            client = MailBox(yahooSmtpServer).login(email, password, 'INBOX')
            try:
                if not data['today'] == str(datetime.datetime.today()).split(' ')[0]:
                    data['today'] = str(datetime.datetime.today()).split(' ')[0]
                    data['uids'] = []
                ds = str(datetime.datetime.today()).split(' ')[0].split('-')
                msgs = client.fetch(AND(date_gte=datetime.date.today()))
                for msg in msgs:
                    links = []
                    if str(datetime.datetime.today()).split(' ')[0] == str(msg.date).split(' ')[0] and not msg.uid in data['uids']:
                        mail = msg.html
                        if 'order' in mail and not 'cancel' in mail:
                            for i in re.findall(r'(https?://[^\s]+)', mail):
                                if 'pick' in i:
                                    link = i.replace('"', "")
                                    link = link.replace('<', '>').split('>')[0]
                                    print(link)
                                    links.append(link)
                                    break
                        data['uids'].append(msg.uid)
                        scr_links = pd.DataFrame({'Links': links})
                        scr_links.to_csv('Links.csv', mode='a', header=False, index=False)
                        time.sleep(0.5)
            except Exception as e:
                print(e)
                pass
            client.logout()
            time.sleep(5)
        except Exception as e:
            print(e)
            print('sleeping for 5 sec')
            time.sleep(1)

I think this is email server throttling.
Look into IMAP IDLE.
Since 0.51.0, imap_tools has IDLE support:
https://github.com/ikvk/imap_tools/releases/tag/v0.51.0
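With IDLE the server pushes a notification when something happens in the mailbox, so you react to new mail immediately instead of reconnecting and re-fetching in a polling loop. A minimal sketch, assuming imap_tools >= 0.51.0; the server, credentials, and fetch criteria mirror the question:

import datetime
from imap_tools import MailBox, A
from config import email, password

with MailBox('imap.mail.yahoo.com').login(email, password, 'INBOX') as mailbox:
    while True:
        # Block until the server reports activity (or 60 s passes).
        responses = mailbox.idle.wait(timeout=60)
        if responses:  # empty list means the wait simply timed out
            for msg in mailbox.fetch(A(date_gte=datetime.date.today())):
                print(msg.uid, msg.subject)

This keeps one connection open instead of logging in and out every few seconds, which is also gentler on any server-side throttling.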

Related

Azure communication services send email

I am using Azure Communication Services in my React app to send email, but I want to send bulk messages with recipients taken from a text file or an Excel file.
import time
from azure.communication.email import EmailClient, EmailContent, EmailAddress, EmailMessage, EmailRecipients

def main():
    try:
        connection_string = "<ACS_CONNECTION_STRING>"
        client = EmailClient.from_connection_string(connection_string)
        sender = "<SENDER_EMAIL>"
        content = EmailContent(
            subject="Test email from Python",
            plain_text="This is plaintext body of test email.",
            html="<html><h1>This is the html body of test email.</h1></html>",
        )
        recipient = EmailAddress(email="<RECIPIENT_EMAIL>", display_name="<RECIPIENT_DISPLAY_NAME>")
        message = EmailMessage(
            sender=sender,
            content=content,
            recipients=EmailRecipients(to=[recipient])
        )
        response = client.send(message)
        if (not response or response.message_id == 'undefined' or response.message_id == ''):
            print("Message Id not found.")
        else:
            print("Send email succeeded for message_id :" + response.message_id)
            message_id = response.message_id
            counter = 0
            while True:
                send_status = client.get_send_status(message_id)
                if (send_status):
                    print(f"Email status for message_id {message_id} is {send_status.status}.")
                if (send_status.status.lower() == "queued" and counter < 12):
                    time.sleep(10)  # wait for 10 seconds before checking next time.
                    counter += 1
                else:
                    if (send_status.status.lower() == "outfordelivery"):
                        print(f"Email delivered for message_id {message_id}.")
                        break
                    else:
                        print("Looks like we timed out for checking email send status.")
                        break
    except Exception as ex:
        print(ex)

main()
How can I solve this? I tried to read the recipient emails from a text file, but it failed for me.
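One way to approach the bulk part is to read one address per line from the text file and build the recipient list from it. A minimal sketch, assuming the same (preview) SDK object model as the code above and a hypothetical recipients.txt:

# Hypothetical recipients.txt: one email address per line.
with open('recipients.txt') as f:
    addresses = [line.strip() for line in f if line.strip()]

recipients = EmailRecipients(to=[EmailAddress(email=addr) for addr in addresses])
message = EmailMessage(sender=sender, content=content, recipients=recipients)
response = client.send(message)

For an Excel file, the same idea applies with pandas.read_excel supplying the address column instead of the plain-text read.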

Create a Generator to read email and process messages

I am trying to write some code to read my inbox and process some attachments if present. I decided this would be a good time to learn how generators work, as I want to process all messages that have a particular subject. I have gotten to the point where I can get all the attachments and relevant subjects, but I sort of had to fake it: the iterator in the for i in range ... loop was not advancing, so I am advancing latest_email_id inside the loop.
def read_email_from_gmail():
    try:
        print 'got here'
        mail = imaplib.IMAP4_SSL(SMTP_SERVER)
        mail.login(FROM_EMAIL, FROM_PWD)
        mail.select('inbox')
        type, data = mail.search(None, 'ALL')
        mail_ids = data[0]
        id_list = mail_ids.split()
        first_email_id = int(id_list[0])
        latest_email_id = int(id_list[-1])
        print latest_email_id
        while True:
            for i in range(latest_email_id, first_email_id - 1, -1):
                latest_email_id -= 1
                # do stuff to get attachment and subject
                yield attachment_data, subject
    except Exception, e:
        print str(e)

for attachment, subject in read_email_from_gmail():
    x = process_attachment(attachment)
    y = process_subject(subject)
Is there a more Pythonic way to advance through my inbox using a generator to hold the inbox state?
I have since learned a bit more about generators and played around with my starting code, so I now have a function that uses a generator to send each relevant email subject to the main function. This is what I have so far, and it works great for my needs:
import imaplib
import email

FROM_EMAIL = 'myemail@gmail.com'
FROM_PWD = "mygmail_password"
SMTP_SERVER = "imap.gmail.com"
SMTP_PORT = 993

STOP_MESSAGES = set(['Could not connect to mailbox',
                     'No Messages or None Retrieved Successfully',
                     'Could not retrieve some message',
                     'Finished processing'])

def read_emails():
    mail = imaplib.IMAP4_SSL(SMTP_SERVER)
    mail.login(FROM_EMAIL, FROM_PWD)
    mail.select('inbox')
    con_status, data = mail.uid('search', None, "ALL")
    if con_status != 'OK':
        yield 'Could not connect to mailbox'
    try:
        mail_ids = data[0].split()
    except Exception:
        yield 'No Messages or None Retrieved Successfully'
    print mail_ids
    processed = []
    while True:
        for mail_id in mail_ids:
            status, mdata = mail.uid('fetch', mail_id, '(RFC822)')
            if status != 'OK':
                yield 'Could not retrieve some message'
            if mail_id in processed:
                yield 'Finished processing'
            raw_msg = mdata[0][1]
            structured_msg = email.message_from_string(raw_msg)
            msg_subject = structured_msg['subject']
            processed.append(mail_id)
            yield msg_subject
To access my messages one by one, I then use the following block:
for msg_subj in read_emails():
    if msg_subj not in STOP_MESSAGES:
        # do some stuff here with msg_subj
        pass
    else:
        print msg_subj
        break
I am accessing these messages by their uid, as I will be deleting them later and would like to use the uid as the key to manage deletion. For me the trick was to collect the uids in the list named processed and then check whether I was about to circle through them again because I was working with a uid that had already been processed.
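One small refinement (my suggestion, not part of the original answer): membership tests on a list are O(n) per lookup, so for a large mailbox a set keeps the already-processed check cheap:

def read_emails():
    # ... same setup as above ...
    processed = set()  # set membership is O(1); list membership scans the list
    while True:
        for mail_id in mail_ids:
            if mail_id in processed:
                yield 'Finished processing'
            # ... fetch and parse as above ...
            processed.add(mail_id)
            yield msg_subject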

Python Multithreaded HTTP crawler - Closing connection and hanging of the program

I wrote this crawler in Python; it dumps several parameters to a JSON output file based on an input list of domains.
I have this question:
Do I need to close the HTTP connection in each thread? The input data is ca. 5 million items. At the beginning it processes at a rate of ca. 50 iterations per second, but after some time it drops to 1-2 per second and/or hangs (no kernel messages and no errors on stdout). Is this the code, or is it network-related? I suspect the software, since when I restart it, it again starts at a high rate (ca. 50 iterations per second).
Any tips on how to improve the code below are also welcome, especially on speed and crawling throughput.
Code in question:
from threading import Thread
from Queue import Queue
import sys
import re
import urllib2
import pprint
from tqdm import tqdm
import lxml.html
from geoip import geolite2
import pycountry
from tld import get_tld

resfile = open("out.txt", 'a')
concurrent = 200

def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()

def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://" + ourl)
        peer = response.fp._sock.fp._sock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country = match.country
                try:
                    c = pycountry.countries.lookup(country)
                    country = c.name
                except:
                    country = ""
        except:
            country = ""
        try:
            res = get_tld("http://www" + ourl, as_object=True)
            tld = res.suffix
        except:
            tld = ""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email = match.group(0)
        except:
            email = ""
        try:
            item = generator[0]
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP\":\""+ip+"\",\"Server\":\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":\""+str(header.getheader("X-Powered-By")).replace("None","")+"\",\"MetaGenerator\":\""+item+"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP\":\""+ip+"\",\"Server\":\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":\""+str(header.getheader("X-Powered-By")).replace("None","")+"\",\"MetaGenerator\":\"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        # print "error" + str(e)
        pass

def doSomethingWithResult(status):
    if status:
        resfile.write(str(status) + "\n")

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt", 'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
Update 1:
Closing the socket and file descriptor makes it work better; it no longer seems to hang after some time. Performance is ca. 50 reqs/sec on a home laptop and ca. 100 reqs/sec on a VPS.
from threading import Thread
from Queue import Queue
import httplib, sys
import re
import urllib2
import pprint
from tqdm import tqdm
import lxml.html
from geoip import geolite2
import pycountry
from tld import get_tld
import json

resfile = open("out.txt", 'a')
concurrent = 200

def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()

def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://" + ourl)
        realsock = response.fp._sock.fp._sock
        peer = realsock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        realsock.close()
        response.close()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country = match.country
                try:
                    c = pycountry.countries.lookup(country)
                    country = c.name
                except:
                    country = ""
        except:
            country = ""
        try:
            res = get_tld("http://www" + ourl, as_object=True)
            tld = res.suffix
        except:
            tld = ""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email = match.group(0)
        except:
            email = ""
        try:
            item = generator[0]
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\",\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":"+json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":"+json.dumps(item)+",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\",\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":"+json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":\"\",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        print "error" + str(e)
        pass

def doSomethingWithResult(status):
    if status:
        resfile.write(str(status) + "\n")

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt", 'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
The handles will be garbage collected eventually, but you are better off closing them yourself, especially since you are doing this in a tight loop.
You also asked for suggestions for improvement. A big one would be to stop using urllib2 and start using requests instead.
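For illustration, a minimal sketch of the fetch rewritten with requests (my code, not the answerer's; get_status and the per-thread session are assumptions):

import requests

def get_status(ourl, session):
    # A timeout avoids the indefinite hangs seen with urllib2, and the
    # session pools and reuses connections, so nothing is left dangling.
    try:
        response = session.get('http://' + ourl.strip(), timeout=10)
        return response.headers.get('Server', ''), response.text
    except requests.RequestException:
        return None

# Inside each worker thread, e.g.:
# session = requests.Session()
# result = get_status(url, session)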
There are many possible reasons why your crawling rate drops.
1.) Take care not to crawl too much data from the same domain. Some web servers are configured to allow only one connection per IP address in parallel.
2.) Try to send randomized, browser-like HTTP headers (user-agent, referrer, ...) to get past web server scraping protection, if it is set.
3.) Use a mature parallel HTTP library, like pycurl (has MultiCurl) or requests (grequests). They will certainly perform faster; see the sketch below.
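A minimal grequests sketch, assuming the same list.txt input as the question (size bounds the number of concurrent requests):

import grequests

# Fetch a batch of domains concurrently; failed requests come back as None.
urls = ['http://' + line.strip() for line in open('list.txt')]
reqs = (grequests.get(u, timeout=10) for u in urls)
for response in grequests.map(reqs, size=200):
    if response is not None:
        print response.url, response.status_code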

Why does python.exe crash when I run this Selenium/WebDriver code using IE as the driver?

I've got IE 9 on Windows 7 and Python 2.7. This code runs perfectly when I use Firefox as the driver, but with IE it crashes after a few seconds. I wonder if it's hitting a high-CPU state during the while loop? I am a noob when it comes to writing good while/wait loops.
def text_from_gmail(username, password, to_user, pattern):
    gmail = PyGmail('imap.gmail.com')
    gmail.login(username, password)
    # find the id of the confirmation message
    sec_waiting = 0
    msg_uid = ['']
    sleep_period = 10  # seconds to wait between Gmail searches
    while msg_uid == ['']:  # wait until message list from server isn't empty
        gmail.m.select('[Gmail]/All Mail')
        resp, msg_uid = gmail.m.search(None, 'To', to_user)
        if (msg_uid != ['']):
            break  # don't sleep if the message was found quickly
        time.sleep(sleep_period)  # pause between checks
        sec_waiting += sleep_period
        if sec_waiting >= 300:  # 5 minute timeout
            assert False  # fail the test if timeout elapses
    # extract the confirmation link from the message
    # it's a multi-part MIME, so we have to grab the attachment and decode it
    typ, msg_data = gmail.m.fetch(msg_uid[0], '(RFC822)')
    gmail.logout()
    email_body = msg_data[0][1]
    mail = email.message_from_string(email_body)
    body = ""
    for part in mail.walk():
        if (part.get_content_type() == 'text/plain'):
            body = part.get_payload(decode=True)
    link = re.findall(pattern, body)
    return link
It was trying to go to a malformed URL, I think. I put a str() around it, driver.web.get(str(confirm_link)), and it's not crashing now.
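Since text_from_gmail returns whatever re.findall matched, the list can be empty or the first element malformed; a guard like this (a hypothetical illustration with a standard WebDriver object) avoids handing the driver a bad URL:

links = text_from_gmail(username, password, to_user, pattern)
if links:  # re.findall may return an empty list
    driver.get(str(links[0]))  # navigate to the first confirmation link
else:
    raise AssertionError('no confirmation link found in the email body')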

Checking email with Python

I am interested in triggering a certain action upon receiving an email from a specific address with a specific subject. In order to do so, I need to implement monitoring of my mailbox, checking every incoming mail (in particular, I use Gmail).
What is the easiest way to do that?
Gmail provides the ability to connect over POP, which you can turn on in the gmail settings panel. Python can make connections over POP pretty easily:
import poplib
from email import parser

pop_conn = poplib.POP3_SSL('pop.gmail.com')
pop_conn.user('username')
pop_conn.pass_('password')

# Get messages from server:
messages = [pop_conn.retr(i) for i in range(1, len(pop_conn.list()[1]) + 1)]
# Concat message pieces:
messages = ["\n".join(mssg[1]) for mssg in messages]
# Parse message into an email object:
messages = [parser.Parser().parsestr(mssg) for mssg in messages]
for message in messages:
    print message['subject']
pop_conn.quit()
You would just need to run this script as a cron job. Not sure what platform you're on so YMMV as to how that's done.
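For example, on a Unix-like system a crontab entry along these lines would run the check every five minutes (the script and log paths are placeholders):

# m h dom mon dow  command
*/5 * * * * /usr/bin/python /path/to/check_mail.py >> /var/log/check_mail.log 2>&1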
Gmail provides an atom feed for new email messages. You should be able to monitor this by authenticating with py cURL (or some other net library) and pulling down the feed. Making a GET request for each new message should mark it as read, so you won't have to keep track of which emails you've read.
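A minimal sketch of pulling that feed with basic auth (my illustration, in the Python 2 style of this thread; with 2-step verification you would need an app password):

import urllib2

# Basic-auth fetch of Gmail's atom feed for new mail. Credentials are placeholders.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'https://mail.google.com/', 'username', 'password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
feed_xml = opener.open('https://mail.google.com/mail/feed/atom').read()
print feed_xml[:200]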
While not Python-specific, I've always loved procmail wherever I could install it...!
Just use, as the action line for any conditions of your choice, | pathtoyourscript (a vertical bar, AKA pipe, followed by the script you want to execute in those cases), and your mail gets piped, under the conditions of your choice, to the script of your choice, for it to do whatever it wants. It is hard to think of a more general approach to "trigger actions of your choice upon receipt of mails that meet your specific conditions". There are no limits to how many conditions you can check, or how many action lines a single condition can trigger (just enclose all the action lines you want in { } braces), etc.
People seem to be pumped up about Lamson:
https://github.com/zedshaw/lamson
It's an SMTP server written entirely in Python. I'm sure you could leverage that to do everything you need - just forward the gmail messages to that SMTP server and then do what you will.
However, I think it's probably easiest to do the ATOM feed recommendation above.
EDIT: Lamson has been abandoned
I found a pretty good snippet when I wanted to do this same thing (and the example uses gmail). Also check out the google search results on this.
I recently solved this problem by using procmail and Python.
Read the documentation for procmail. You can tell it to send all incoming email to a Python script with a recipe like this in a special procmail config file:
:0:
| ./scripts/ppm_processor.py
Python has an "email" package available that can do anything you could possibly want to do with email. Read up on the following ones....
from email.generator import Generator
from email import Message
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.mime.multipart import MIMEMultipart
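For completeness, a minimal sketch of what a script like ppm_processor.py could look like (my illustration): procmail hands the raw message on stdin, and the email package parses it.

#!/usr/bin/env python
import sys
import email

# procmail pipes the raw RFC822 message to stdin; parse it into a Message.
msg = email.message_from_file(sys.stdin)
print msg['from'], msg['subject']
# ... trigger whatever action the message calls for here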
https://developers.google.com/gmail/gmail_inbox_feed
It says you have to have a corporate Gmail account, but I have found that you can read free Gmail accounts without issues. I use this code to get the blood pressure results that I email or text to a Gmail address.
import imaplib
import email
import traceback
from email.header import decode_header
from datetime import datetime
import os
import pandas as pd
import plotly.graph_objs as go
import plotly

now = datetime.now()
dt_string = now.strftime("%Y.%m.%d %H:%M:%S")
print("date_time:", dt_string)

email_account = '13123@gmail.com'
email_password = '131231231231231231312313F'
email_server = 'imap.gmail.com'
email_port = 993
accept_emails_from = {'j1231312@gmail.com', '1312312@chase.com', '13131231313131@msg.fi.google.com'}
verbose = True
def get_emails():
    email_number = 0
    local_csv_data = ''
    t_date = None
    t_systolic = None
    t_diastolic = None
    t_pulse = None
    t_weight = None
    try:
        mail = imaplib.IMAP4_SSL(email_server)
        email_code, email_auth_status = mail.login(email_account, email_password)
        if verbose:
            print('[DEBUG] email_code: ', email_code)
            print('[DEBUG] email_auth_status: ', email_auth_status)
        mail.list()
        mail.select('inbox')
        # (email_code, messages) = mail.search(None, 'ALL')
        (email_code, messages) = mail.search(None, '(UNSEEN)')  # only get unread emails to process.
        subject = None
        email_from = None
        for email_id in messages[0].split():
            email_number += 1
            email_code, email_data = mail.fetch(email_id, '(RFC822)')
            for response in email_data:
                if isinstance(response, tuple):  # we only want the tuple; the bytes part is just b''.
                    msg = email.message_from_bytes(response[1])
                    content_type = msg.get_content_type()
                    subject, encoding = decode_header(msg["Subject"])[0]
                    subject = str(subject.replace("\r\n", ""))
                    if isinstance(subject, bytes):
                        subject = subject.decode(encoding)
                    email_from, encoding = decode_header(msg.get("From"))[0]
                    if isinstance(email_from, bytes):
                        email_from = email_from.decode(encoding)
                    if content_type == "text/plain":
                        body = msg.get_payload(decode=True).decode()
                        parse_data = body
                    else:
                        parse_data = subject
                    if '>' in email_from:
                        email_from = email_from.lower().split('<')[1].split('>')[0]
                    if email_from in accept_emails_from:
                        parse_data = parse_data.replace(',', ' ')
                        key = 0
                        for value in parse_data.split(' '):
                            if key == 0:
                                t_date = value
                                t_date = t_date.replace('-', '.')
                            if key == 1:
                                t_time = value
                                if ':' not in t_time:
                                    numbers = list(t_time)
                                    t_time = numbers[0] + numbers[1] + ':' + numbers[2] + numbers[3]
                            if key == 2:
                                t_systolic = value
                            if key == 3:
                                t_diastolic = value
                            if key == 4:
                                t_pulse = value
                            if key == 5:
                                t_weight = value
                            key += 1
                        t_eval = t_date + ' ' + t_time
                        if verbose:
                            print()
                            print('--------------------------------------------------------------------------------')
                            print('[DEBUG] t_eval:'.ljust(30), t_eval)
                        date_stamp = datetime.strptime(t_eval, '%Y.%m.%d %H:%M')
                        if verbose:
                            print('[DEBUG] date_stamp:'.ljust(30), date_stamp)
                            print('[DEBUG] t_systolic:'.ljust(30), t_systolic)
                            print('[DEBUG] t_diastolic:'.ljust(30), t_diastolic)
                            print('[DEBUG] t_pulse:'.ljust(30), t_pulse)
                            print('[DEBUG] t_weight:'.ljust(30), t_weight)
                        new_data = str(date_stamp) + ',' + \
                                   t_systolic + ',' + \
                                   t_diastolic + ',' + \
                                   t_pulse + ',' + \
                                   t_weight + '\n'
                        local_csv_data += new_data
    except Exception as e:
        traceback.print_exc()
        print(str(e))
        return False, email_number, local_csv_data
    return True, email_number, local_csv_data
def update_csv(local_data):
    """ Updates the csv and sorts it if changes were made. """
    uniq_rows = 0
    if os.name == 'posix':
        file_path = '/home/blood_pressure_results.txt'
    elif os.name == 'nt':
        file_path = '\\\\uncpath\\blood_pressure_results.txt'
    else:
        print('[ERROR] os not supported:'.ljust(30), os.name)
        exit(911)
    if verbose:
        print('[DEBUG] file_path:'.ljust(30), file_path)
    column_names = ['00DateTime', 'Systolic', 'Diastolic', 'Pulse', 'Weight']
    if not os.path.exists(file_path):
        with open(file_path, 'w') as file:
            for col in column_names:
                file.write(col + ',')
            file.write('\n')
    # Append the new data to the file.
    with open(file_path, 'a+') as file:
        file.write(local_data)
    # Sort the file.
    df = pd.read_csv(file_path, usecols=column_names)
    df_sorted = df.sort_values(by=["00DateTime"], ascending=True)
    df_sorted.to_csv(file_path, index=False)
    # Remove duplicates.
    file_contents = ''
    with open(file_path, 'r') as file:
        for row in file:
            if row not in file_contents:
                uniq_rows += 1
                print('Adding: '.ljust(30), row, end='')
                file_contents += row
            else:
                print('Duplicate:'.ljust(30), row, end='')
    with open(file_path, 'w') as file:
        file.write(file_contents)
    return uniq_rows
# run the main code to get emails.
status, emails, my_data = get_emails()
print('status:'.ljust(30), status)
print('emails:'.ljust(30), emails)
# if new emails were received, update and sort the file.
csv_rows = update_csv(my_data)
print('csv_rows:'.ljust(30), csv_rows)
exit(0)
