Email harvester in Python: How to improve performance? - python

I found a program called "Best Email Extractor" (http://www.emailextractor.net/). The website says it is written in Python, so I tried to write a similar program. That program extracts about 300-1000 emails per minute; mine extracts about 30-100 emails per hour. Could someone give me tips on how to improve the performance of my program? I wrote the following:
import sqlite3 as sql
import urllib2
import re
import lxml.html as lxml
import time
import threading

def getUrls(start):
    urls = []
    try:
        dom = lxml.parse(start).getroot()
        dom.make_links_absolute()
        for url in dom.iterlinks():
            if not '.jpg' in url[2]:
                if not '.JPG' in url[2]:
                    if not '.ico' in url[2]:
                        if not '.png' in url[2]:
                            if not '.jpeg' in url[2]:
                                if not '.gif' in url[2]:
                                    if not 'youtube.com' in url[2]:
                                        urls.append(url[2])
    except:
        pass
    return urls

def getURLContent(urlAdresse):
    try:
        url = urllib2.urlopen(urlAdresse)
        text = url.read()
        url.close()
        return text
    except:
        return '<html></html>'

def harvestEmail(url):
    text = getURLContent(url)
    emails = re.findall('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', text)
    if emails:
        if saveEmail(emails[0]) == 1:
            print emails[0]

def saveUrl(url):
    connection = sql.connect('url.db')
    url = (url, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM urladressen WHERE adresse = ?', url)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO urladressen VALUES(NULL, ?)', url)
            return 1
    return 0

def saveEmail(email):
    connection = sql.connect('emails.db')
    email = (email, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', email)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', email)
            return 1
    return 0

def searchrun(urls):
    for url in urls:
        if saveUrl(url) == 1:
            #time.sleep(0.6)
            harvestEmail(url)
            print url
            urls.remove(url)
            urls = urls + getUrls(url)

urls1 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=DVD')
urls2 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Jolie')
urls3 = getUrls('http://www.finanzen.net')
urls4 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Party')
urls5 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Games')
urls6 = getUrls('http://www.spiegel.de')
urls7 = getUrls('http://www.kicker.de/')
urls8 = getUrls('http://www.chessbase.com')
urls9 = getUrls('http://www.nba.com')
urls10 = getUrls('http://www.nfl.com')

try:
    threads = []
    urls = (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8, urls9, urls10)
    for urlList in urls:
        thread = threading.Thread(target=searchrun, args=(urlList, )).start()
        threads.append(thread)
        print threading.activeCount()
    for thread in threads:
        thread.join()
except RuntimeError:
    print RuntimeError

I don't think many people are going to help you harvest emails. It's a generally detested activity.
Regarding the performance bottlenecks in your code, you need to find out where the time is going by profiling. At the lowest level, replace each of your functions with a dummy that does no processing but returns valid output; so the email collector could return a list of the same address 100 times (or however many are in these URL results). That will show you which function is costing you time.
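A rough sketch of that dummy-substitution idea, assuming it is pasted at the bottom of the script above and that some_url_list is a placeholder for one of your seed lists: swap in a stub for one stage, time a fixed workload, then compare against a run with the real stage.

import time

def getURLContent_stub(urlAdresse):
    # canned page instead of a network request
    return '<html><body>test@example.com</body></html>'

getURLContent = getURLContent_stub  # temporarily rebind the fetch stage

start = time.time()
searchrun(list(some_url_list))      # some_url_list: any one of the seed lists above
print 'elapsed: %.1f s' % (time.time() - start)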
Things that stick out:
Get the files behind the URLs from the server beforehand; if you spam Google every time you run the script, they could well block you. Reading from disk is faster than requesting the files from the internet and can be done separately and concurrently.
The database code is creating a new connection for each call to saveEmail etc, which will spend most of its time doing handshaking and authentication. Better to have an object that keeps the connection alive between calls, or better yet to insert multiple records at once.
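A minimal sketch of the keep-the-connection-open idea, assuming the same sqlite database and table as above; the batch insert relies on a UNIQUE index on the email column, and with sqlite each thread still needs its own connection object.

import sqlite3 as sql

class EmailStore(object):
    def __init__(self, path='emails.db'):
        self.connection = sql.connect(path)   # opened once, reused for every call

    def save_many(self, emails):
        # one transaction per batch; OR IGNORE assumes a UNIQUE index on the column,
        # otherwise keep the SELECT-then-INSERT check from the original code
        with self.connection:
            cursor = self.connection.cursor()
            cursor.executemany('INSERT OR IGNORE INTO addresse VALUES(NULL, ?)',
                               [(e,) for e in emails])

    def close(self):
        self.connection.close()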
Once the network and database issues are done, the regex could probably do with \b around it so that the matching does less backtracking.
A chain of if not 'foo' in str: followed by if not 'blah' in str: ... is poor coding. Extract the final segment (the extension) once and check it against multiple values using a set or even frozenset of the non-permitted values, e.g. ignoredExtensions = frozenset(['jpg', 'png', 'gif']), then compare with if extension not in ignoredExtensions. Converting the extension to lower case first also means a single entry covers both jpg and JPG.
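A short sketch of that filter as it could slot into getUrls; the split via os.path.splitext is my choice here, not part of the original code.

import os

IGNORED_EXTENSIONS = frozenset(['.jpg', '.jpeg', '.png', '.gif', '.ico'])

def wanted(link):
    # one lowercase extension check replaces the chain of nested ifs
    extension = os.path.splitext(link)[1].lower()
    return extension not in IGNORED_EXTENSIONS and 'youtube.com' not in link

Inside getUrls the append then becomes: if wanted(url[2]): urls.append(url[2]).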
Finally, consider running the same script without threading from multiple command lines. There is no real need to have the threading inside the script except for coordinating the different URL lists. Frankly, it would be far simpler to keep a set of URL lists in files and start a separate script to work on each; let the OS do the multitasking, it is better at it.
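A rough sketch of that split, with hypothetical file names and a hypothetical harvest_worker.py that runs searchrun over the URLs listed in one file:

import subprocess

seed_files = ['urls_dvd.txt', 'urls_games.txt', 'urls_sport.txt']  # hypothetical seed files
workers = [subprocess.Popen(['python', 'harvest_worker.py', name]) for name in seed_files]
for worker in workers:
    worker.wait()   # each worker is a separate process; the OS schedules them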

Related

Downloading Multiple torrent files with Libtorrent in Python

I'm trying to write a torrent application that can take in a list of magnet links and then download them all together. I've been trying to read and understand the documentation at Libtorrent, but I haven't been able to tell whether what I try works or not. I've managed to apply a SOCKS5 proxy to a Libtorrent session and download a single magnet link using this code:
import libtorrent as lt
import time
import os
ses = lt.session()
r = lt.proxy_settings()
r.hostname = "proxy_info"
r.username = "proxy_info"
r.password = "proxy_info"
r.port = 1080
r.type = lt.proxy_type_t.socks5_pw
ses.set_peer_proxy(r)
ses.set_web_seed_proxy(r)
ses.set_proxy(r)
t = ses.settings()
t.force_proxy = True
t.proxy_peer_connections = True
t.anonymous_mode = True
ses.set_settings(t)
print(ses.get_settings())
ses.peer_proxy()
ses.web_seed_proxy()
ses.set_settings(t)
magnet_link = "magnet"
params = {
    "save_path": os.getcwd() + r"\torrents",
    "storage_mode": lt.storage_mode_t.storage_mode_sparse,
    "url": magnet_link
}
handle = lt.add_magnet_uri(ses, magnet_link, params)
ses.start_dht()
print('downloading metadata...')
while not handle.has_metadata():
    time.sleep(1)
print('got metadata, starting torrent download...')
while handle.status().state != lt.torrent_status.seeding:
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata', 'downloading', 'finished', 'seeding', 'allocating']
    print('%.2f%% complete (down: %.1f kb/s up: %.1f kB/s peers: %d) %s' % (s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000, s.num_peers, state_str[s.state]))
    time.sleep(5)
This is great and all for running on its own with a single link. What I want to do is something like this:
def torrent_download(magnetic_link_list):
    for mag in range(len(magnetic_link_list)):
        handle = lt.add_magnet_uri(ses, magnetic_link_list[mag], params)
        # Then download all the files
        # Once all files complete, stop the torrents so they dont seed.
    return torrent_name_list
I'm not sure if this is even on the right track or not, but some pointers would be helpful.
UPDATE: This is what I now have and it works fine in my case
def magnet2torrent(magnet_link):
    global LIBTORRENT_SESSION, TORRENT_HANDLES
    if LIBTORRENT_SESSION is None and TORRENT_HANDLES is None:
        TORRENT_HANDLES = []
        settings = lt.default_settings()
        settings['proxy_hostname'] = CONFIG_DATA["PROXY"]["HOST"]
        settings['proxy_username'] = CONFIG_DATA["PROXY"]["USERNAME"]
        settings['proxy_password'] = CONFIG_DATA["PROXY"]["PASSWORD"]
        settings['proxy_port'] = CONFIG_DATA["PROXY"]["PORT"]
        settings['proxy_type'] = CONFIG_DATA["PROXY"]["TYPE"]
        settings['force_proxy'] = True
        settings['anonymous_mode'] = True
        LIBTORRENT_SESSION = lt.session(settings)
    params = {
        "save_path": os.getcwd() + r"/torrents",
        "storage_mode": lt.storage_mode_t.storage_mode_sparse,
        "url": magnet_link
    }
    TORRENT_HANDLES.append(LIBTORRENT_SESSION.add_torrent(params))

def check_torrents():
    global TORRENT_HANDLES
    for torrent in range(len(TORRENT_HANDLES)):
        print(TORRENT_HANDLES[torrent].status().is_seeding)
It's called "magnet links" (not magnetic).
In new versions of libtorrent, the way you add a magnet link is:
params = lt.parse_magnet_uri(uri)
handle = ses.add_torrent(params)
That also gives you an opportunity to tweak the add_torrent_params object, to set the save directory for instance.
If you're adding a lot of magnet links (or regular torrent files for that matter) and want to do it quickly, a faster way is to use:
ses.add_torrent_async(params)
That function will return immediately and the torrent_handle object can be picked up later in the add_torrent_alert.
As for downloading multiple magnet links in parallel, your pseudo code for adding them is correct. You just want to make sure you either save off all the torrent_handle objects you get back or query all torrent handles once you're done adding them (using ses.get_torrents()). In your pseudo code you seem to overwrite the last torrent handle every time you add a new one.
The condition you expressed for exiting was that all torrents were complete. The simplest way of doing that is simply to poll them all with handle.status().is_seeding. i.e. loop over your list of torrent handles and ask that. Keep in mind that the call to status() requires a round-trip to the libtorrent network thread, which isn't super fast.
The faster way of doing this is to keep track of all torrents that aren't seeding yet, and "strike them off your list" as you get torrent_finished_alerts for torrents. (you get alerts by calling ses.pop_alerts()).
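A rough sketch of that alert-driven bookkeeping, assuming the handles were collected in a list (torrent_handles here) while adding them; note the session's alert mask has to include status notifications for torrent_finished_alert to be delivered, and the exact settings API for that varies between libtorrent versions.

import time
import libtorrent as lt

# handles you saved while adding torrents; name assumed for this sketch
remaining = set(str(h.info_hash()) for h in torrent_handles)

while remaining:
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.torrent_finished_alert):
            remaining.discard(str(alert.handle.info_hash()))
    time.sleep(1)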
Another suggestion I would make is to set up your settings_pack object first, then create the session. It's more efficient and a bit cleaner. Especially with regards to opening listen sockets and then immediately closing and re-opening them when you change settings.
i.e.
p = lt.settings_pack()
p['proxy_hostname'] = '...'
p['proxy_username'] = '...'
p['proxy_password'] = '...'
p['proxy_port'] = 1080
p['proxy_type'] = lt.proxy_type_t.socks5_pw
p['proxy_peer_connections'] = True
ses = lt.session(p)

pyzmail: use a loop variable to get a single raw message

First, I should say, I don't really know much about computer programming, but I find Python fairly easy to use for automating simple tasks, thanks to Al Sweigart's book, "Automate the boring stuff."
I want to collect email text bodies; I'm trying to move homework to email to save paper. I thought I could do that by getting the numbers of the unseen mails and just looping through them. If I try that, the IDLE3 shell becomes unresponsive; Ctrl-C does nothing and I have to restart the shell.
Question: Why can't I just use a loop variable in server.fetch()??
for msgNum in unseenMessages:
    rawMessage = server.fetch([msgNum], ['BODY[]', 'FLAGS'])
It seems you need an actual number like 57, not msgNum, in there, or it won't work.
After looking at various questions and answers here on SO, the following works for me. I suppose it collects all the email bodies in one swoop.
import pyzmail
import pprint
from imapclient import IMAPClient
server = IMAPClient('imap.qq.com', use_uid=True, ssl=True)
server.login('myEmail@foxmail.com', 'myIMAPpassword')
select_info = server.select_folder('Inbox')
unseenMessages = server.search(['UNSEEN'])
rawMessage = server.fetch(unseenMessages, ['BODY[]', 'FLAGS'])
for msgNum in unseenMessages:
    message = pyzmail.PyzMessage.factory(rawMessage[msgNum][b'BODY[]'])
    text = message.text_part.get_payload().decode(message.text_part.charset)
    print('Text' + str(msgNum) + ' = ')
    print(text)
I've found this gist with nice, clean code, and a page with many helpful examples.
The main difference between the APIs of imaplib and pyzmail is that pyzmail is an all-in-one package covering both parsing and the client-server communication, whereas in the standard library those pieces are split across separate modules. Basically, they both provide almost the same functionality through much the same methods.
As an additional important note here, pyzmail looks quite abandoned.
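A minimal sketch of the same fetch-and-parse step in both styles, assuming an already-logged-in IMAPClient server, an imaplib connection M (imports as in the script below), and a placeholder message number msgNum; note the IMAPClient above uses UIDs, while imaplib's plain fetch uses sequence numbers.

# pyzmail + imapclient (as in the question)
raw = server.fetch([msgNum], ['BODY[]'])
msg = pyzmail.PyzMessage.factory(raw[msgNum][b'BODY[]'])
print(msg.get_subject())

# standard library only: imaplib + email
rv, data = M.fetch(str(msgNum), '(RFC822)')
msg = email.message_from_bytes(data[0][1])
print(msg['Subject'])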
To preserve the useful code from that gist, I copy it here as-is with only very small modifications, such as extracting a main function (note: it's for Python 3):
#!/usr/bin/env python
#
# Very basic example of using Python 3 and IMAP to iterate over emails in a
# gmail folder/label. This code is released into the public domain.
#
# This script is example code from this blog post:
# http://www.voidynullness.net/blog/2013/07/25/gmail-email-with-python-via-imap/
#
# This is an updated version of the original -- modified to work with Python 3.4.
#
import sys
import imaplib
import getpass
import email
import email.header
import datetime

EMAIL_ACCOUNT = "notatallawhistleblowerIswear@gmail.com"
# Use 'INBOX' to read inbox. Note that whatever folder is specified,
# after successfully running this script all emails in that folder
# will be marked as read.
EMAIL_FOLDER = "Top Secret/PRISM Documents"

def process_mailbox(M):
    """
    Do something with emails messages in the folder.
    For the sake of this example, print some headers.
    """
    rv, data = M.search(None, "ALL")
    if rv != 'OK':
        print("No messages found!")
        return

    for num in data[0].split():
        rv, data = M.fetch(num, '(RFC822)')
        if rv != 'OK':
            print("ERROR getting message", num)
            return

        msg = email.message_from_bytes(data[0][1])
        hdr = email.header.make_header(email.header.decode_header(msg['Subject']))
        subject = str(hdr)
        print('Message %s: %s' % (num, subject))
        print('Raw Date:', msg['Date'])
        # Now convert to local date-time
        date_tuple = email.utils.parsedate_tz(msg['Date'])
        if date_tuple:
            local_date = datetime.datetime.fromtimestamp(
                email.utils.mktime_tz(date_tuple))
            print("Local Date:",
                  local_date.strftime("%a, %d %b %Y %H:%M:%S"))

def main(host, login, folder):
    with imaplib.IMAP4_SSL(host) as M:
        rv, data = M.login(login, getpass.getpass())
        print(rv, data)

        rv, mailboxes = M.list()
        if rv == 'OK':
            print("Mailboxes:")
            print(mailboxes)

        rv, data = M.select(folder)
        if rv == 'OK':
            print("Processing mailbox...\n")
            process_mailbox(M)
        else:
            print("ERROR: Unable to open mailbox ", rv)

if __name__ == '__main__':
    try:
        main('imap.gmail.com', EMAIL_ACCOUNT, EMAIL_FOLDER)
    except imaplib.IMAP4.error as e:
        print('Error while processing mailbox:', e)
        sys.exit(1)

Looping SQL query in python

I am writing a python script which queries the database for a URL string. Below is my snippet.
db.execute('select sitevideobaseurl,videositestring '
'from site, video '
'where siteID =1 and site.SiteID=video.VideoSiteID limit 1')
result = db.fetchall()
filename = '/home/Site_info'
output = open(filename, "w")
for row in result:
    videosite = row[0:2]
    link = videosite[0].format(videosite[1])
    full_link = link.replace("http://", "https://")
    print full_link
    output.write("%s\n" % str(full_link))
output.close()
The query basically gives a URL link: it takes the base URL from one table and the video site string from another table.
output: https://www.youtube.com/watch?v=uqcSJR_7fOc
SiteID is the primary key; it is an int and the values are not sequential.
I wish to loop this SQL query so that it picks a new SiteID on every execution, giving me a unique site URL each time, and write all the results to a file.
desired output: https://www.youtube.com/watch?v=uqcSJR_7fOc
https://www.dailymotion.com/video/hdfchsldf0f
There are about 1178 records.
Thanks for your time and help in advance.
I'm not sure if I completely understand what you're trying to do. I think your goal is to get a list of all links to videos. You get a link to a video by joining the sitevideobaseurl from site and videositestring from video.
From my experience it's much easier to let the database do the heavy lifting; it's built for that. It should be more efficient to join the tables, return all the results, and then loop through them, rather than making a separate query to the database for each row.
The code should look something like this: (Be careful, I didn't test this)
query = """
select s.sitevideobaseurl,
v.videositestring
from video as v
join site as s
on s.siteID = v.VideoSiteID
"""
db.execute(query)
result = db.fetchall()
filename = '/home/Site_info'
output = open(filename, "w")
for row in result:
    link = "%s%s" % (row[0], row[1])
    full_link = link.replace("http://", "https://")
    print full_link
    output.write("%s\n" % str(full_link))
output.close()
If you have other reasons for wanting to fetch these one by one, an idea might be to fetch the list of all SiteIDs first and store them in a list. Afterwards you loop over that list and insert each id into the query via a parameterized query, as sketched below.
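A rough, untested sketch of that idea, reusing the db cursor and schema from the question; the %s placeholder style is MySQLdb's, other drivers use ? instead.

db.execute('select siteID from site')
site_ids = [row[0] for row in db.fetchall()]

with open('/home/Site_info', 'w') as output:
    for site_id in site_ids:
        # one parameterized query per SiteID
        db.execute('select sitevideobaseurl, videositestring '
                   'from site, video '
                   'where siteID = %s and site.SiteID = video.VideoSiteID limit 1',
                   (site_id,))
        row = db.fetchone()
        if row:
            full_link = row[0].format(row[1]).replace("http://", "https://")
            output.write("%s\n" % full_link)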

Dealing with a huge amount of threads and db connections (Python) What can I do to conserve resources?

I'm playing around with a radio streaming project. Currently I'm creating a python backend. There are over 150,000 online radio station streams in the database. One feature I'm trying to add is to search the radio stations by their currently playing song. I'm using Dirble's streamscrobbler to grab the currently playing song from each radio station using a request and looking through the metadata.
Obviously this script will need to be multi-threaded in order to grab the currently playing songs in a feasible amount of time. It can take no more than 2 minutes. Preferably 1 minute to 1 minute 30 seconds if this is possible.
I've never messed around with a project of this scale before. Creating too many threads takes up resources, so it seems best to create a ThreadPoolExecutor. I'm also using SQLAlchemy to insert these songs into a database. Apparently SQLAlchemy uses a connection pool, which is enabled by default?
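For reference, a minimal sketch of how that pool can be sized explicitly when creating the engine; the URL and the numbers are placeholders, not my real settings.

from sqlalchemy import create_engine

db = create_engine(
    'mysql://user:password@localhost:3306/radio',
    pool_size=10,       # connections kept open in the pool
    max_overflow=20,    # extra connections allowed under load
    pool_recycle=3600,  # drop connections older than an hour
)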
I'm scheduling this task using the lightweight scheduler python module by Daniel Bader. It seems to be working well.
Now, the problem I seem to be having is I get this error:
error: can't start new thread
I'm guessing this is because I'm using up too many resources. What can I do? I could reduce the number of threads, but then the task doesn't seem to complete in the amount of time I need, so that would increase the time it takes to go through every stream URL.
from streamscrobbler import streamscrobbler
from concurrent.futures import ThreadPoolExecutor
import re
from sqlalchemy import *

#get song name from station
def manageStation(station_id, station_link):
    current_song = getCurrentSong(station_link)
    current_song = current_song.replace("'", "")
    current_song = current_song.replace("\"", "")
    current_song = current_song.replace("/", "")
    current_song = current_song.replace("\\", "")
    current_song = current_song.replace("%", "")
    if current_song:
        with db.connect() as con:
            rs = con.execute("INSERT INTO station_songs VALUES('" + str(station_id) + "', '" + current_song + "', '') ON DUPLICATE KEY UPDATE song_name = '" + current_song + "';")
    return ""

def getCurrentSong(stream_url):
    streamscrobblerobj = streamscrobbler()
    stationinfo = streamscrobblerobj.getServerInfo(stream_url)
    metadata = stationinfo.get("metadata")
    regex = re.search('\'song\': \'(.*?)\'', str(metadata))
    if regex:
        return regex.group(1)
    return ""

def update():
    print 'update starting'
    global db
    db = create_engine('mysql://root:pass@localhost:3306/radio')
    global threadExecutor
    threadExecutor = ThreadPoolExecutor(max_workers=20000)
    with db.connect() as con:
        rs = con.execute("SELECT id, link FROM station_table")
        for row in rs.fetchall():
            threadExecutor.submit(manageStation, row[0], row[1])
You don't need a real thread for each task since most of the time, the thread will be waiting on IO from a socket (the web-request).
What you could try is green threads, using something like gevent with the following architecture:
from gevent import monkey; monkey.patch_socket()
import gevent
from gevent.pool import Pool

NUM_GLETS = 20

STATION_URLS = (
    'http://station1.com',
    ...
)

pool = Pool(NUM_GLETS)
tasks = [pool.spawn(analyze_station, url) for url in STATION_URLS]
gevent.joinall(tasks)
where analyze_station is your code for fetching and analyzing the particular station.
The result should be a single threaded program, but instead of blocking on every single web-request, another green thread is run while the socket is waiting on data. This is much more efficient than spawning real threads for mostly idle work.
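A hypothetical stand-in for analyze_station, reusing getCurrentSong and the db engine from the question; the column names in the INSERT are placeholders, and the values are passed to the driver as parameters instead of being concatenated into the SQL string.

def analyze_station(url):
    song = getCurrentSong(url)   # blocks on socket IO, so gevent switches to another greenlet
    if song:
        with db.connect() as con:
            con.execute("INSERT INTO station_songs (station_link, song_name) VALUES (%s, %s)",
                        (url, song))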

Why is uTorrent's magnet-to-torrent file fetching faster than my Python script?

I am trying to convert torrent magnet URIs into .torrent files using a Python script.
The script connects to the DHT, waits for the metadata, and then creates the .torrent file from it.
e.g.
#!/usr/bin/env python
'''
Created on Apr 19, 2012
@author: dan, Faless

GNU GENERAL PUBLIC LICENSE - Version 3

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
http://www.gnu.org/licenses/gpl-3.0.txt
'''
import shutil
import tempfile
import os.path as pt
import sys
import libtorrent as lt
from time import sleep

def magnet2torrent(magnet, output_name=None):
    if output_name and \
            not pt.isdir(output_name) and \
            not pt.isdir(pt.dirname(pt.abspath(output_name))):
        print("Invalid output folder: " + pt.dirname(pt.abspath(output_name)))
        print("")
        sys.exit(0)

    tempdir = tempfile.mkdtemp()
    ses = lt.session()
    params = {
        'save_path': tempdir,
        'duplicate_is_error': True,
        'storage_mode': lt.storage_mode_t(2),
        'paused': False,
        'auto_managed': True,
        'duplicate_is_error': True
    }
    handle = lt.add_magnet_uri(ses, magnet, params)

    print("Downloading Metadata (this may take a while)")
    while (not handle.has_metadata()):
        try:
            sleep(1)
        except KeyboardInterrupt:
            print("Aborting...")
            ses.pause()
            print("Cleanup dir " + tempdir)
            shutil.rmtree(tempdir)
            sys.exit(0)
    ses.pause()
    print("Done")

    torinfo = handle.get_torrent_info()
    torfile = lt.create_torrent(torinfo)

    output = pt.abspath(torinfo.name() + ".torrent")
    if output_name:
        if pt.isdir(output_name):
            output = pt.abspath(pt.join(
                output_name, torinfo.name() + ".torrent"))
        elif pt.isdir(pt.dirname(pt.abspath(output_name))):
            output = pt.abspath(output_name)

    print("Saving torrent file here : " + output + " ...")
    torcontent = lt.bencode(torfile.generate())
    f = open(output, "wb")
    f.write(lt.bencode(torfile.generate()))
    f.close()
    print("Saved! Cleaning up dir: " + tempdir)
    ses.remove_torrent(handle)
    shutil.rmtree(tempdir)

    return output

def showHelp():
    print("")
    print("USAGE: " + pt.basename(sys.argv[0]) + " MAGNET [OUTPUT]")
    print(" MAGNET\t- the magnet url")
    print(" OUTPUT\t- the output torrent file name")
    print("")

def main():
    if len(sys.argv) < 2:
        showHelp()
        sys.exit(0)

    magnet = sys.argv[1]
    output_name = None
    if len(sys.argv) >= 3:
        output_name = sys.argv[2]

    magnet2torrent(magnet, output_name)

if __name__ == "__main__":
    main()
The above script takes around a minute or more to fetch the metadata and create the .torrent file, while the uTorrent client only takes a few seconds. Why is that?
How can I make my script faster?
I would like to fetch metadata for around 1k+ torrents.
e.g. Magnet Link
magnet:?xt=urn:btih:BFEFB51F4670D682E98382ADF81014638A25105A&dn=openSUSE+13.2+DVD+x86_64.iso&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80
Update:
I have specified known DHT router addresses like this in my script:
session = lt.session()
session.listen_on(6881, 6891)
session.add_dht_router("router.utorrent.com", 6881)
session.add_dht_router("router.bittorrent.com", 6881)
session.add_dht_router("dht.transmissionbt.com", 6881)
session.add_dht_router("router.bitcomet.com", 6881)
session.add_dht_router("dht.aelitis.com", 6881)
session.start_dht()
but it is still slow, and sometimes I get errors like
DHT error [hostname lookup] (1) Host not found (authoritative)
could not map port using UPnP: no router found
Update:
I have written this small script which fetches hex info hashes from the DB, tries to fetch the metadata from the DHT, and then inserts the resulting torrent files back into the DB.
I have made it run indefinitely because I did not know how to save the session state; keeping it running accumulates more peers, so fetching metadata should get quicker.
#!/usr/bin/env python
# this file will run as client or daemon and fetch torrent meta data i.e. torrent files from magnet uri

import libtorrent as lt     # libtorrent library
import tempfile             # for settings parameters while fetching metadata as temp dir
import sys                  # getting arguments from shell or exit script
from time import sleep      # sleep
import shutil               # removing directory tree from temp directory
import os.path              # for getting pwd and other things
from pprint import pprint   # for debugging, showing object data
import MySQLdb              # DB connectivity
import os
from datetime import date, timedelta

# create lock file to make sure only single instance is running
lock_file_name = "/daemon.lock"
if(os.path.isfile(lock_file_name)):
    sys.exit('another instance running')
#else:
#    f = open(lock_file_name, "w")
#    f.close()

session = lt.session()
session.listen_on(6881, 6891)
session.add_dht_router("router.utorrent.com", 6881)
session.add_dht_router("router.bittorrent.com", 6881)
session.add_dht_router("dht.transmissionbt.com", 6881)
session.add_dht_router("router.bitcomet.com", 6881)
session.add_dht_router("dht.aelitis.com", 6881)
session.start_dht()

alive = True
while alive:
    db_conn = MySQLdb.connect(host='localhost', user='', passwd='', db='basesite', unix_socket='')  # Open database connection
    #print('reconnecting')

    # get all records where enabled = 0 and uploaded within yesterday
    subset_count = 5
    yesterday = date.today() - timedelta(1)
    yesterday = yesterday.strftime('%Y-%m-%d %H:%M:%S')
    #print(yesterday)

    total_count_query = ("SELECT COUNT(*) as total_count FROM content WHERE upload_date > '" + yesterday + "' AND enabled = '0' ")
    #print(total_count_query)
    try:
        total_count_cursor = db_conn.cursor()           # prepare a cursor object using cursor() method
        total_count_cursor.execute(total_count_query)   # Execute the SQL command
        total_count_results = total_count_cursor.fetchone()  # Fetch all the rows in a list of lists.
        total_count = total_count_results[0]
        print(total_count)
    except:
        print "Error: unable to select data"

    total_pages = total_count / subset_count
    #print(total_pages)
    current_page = 1
    while(current_page <= total_pages):
        from_count = (current_page * subset_count) - subset_count
        #print(current_page)
        #print(from_count)

        hashes = []
        get_mysql_data_query = ("SELECT hash FROM content WHERE upload_date > '" + yesterday + "' AND enabled = '0' ORDER BY record_num ASC LIMIT " + str(from_count) + " , " + str(subset_count) + " ")
        #print(get_mysql_data_query)
        try:
            get_mysql_data_cursor = db_conn.cursor()            # prepare a cursor object using cursor() method
            get_mysql_data_cursor.execute(get_mysql_data_query) # Execute the SQL command
            get_mysql_data_results = get_mysql_data_cursor.fetchall()  # Fetch all the rows in a list of lists.
            for row in get_mysql_data_results:
                hashes.append(row[0].upper())
        except:
            print "Error: unable to select data"

        print(hashes)

        handles = []
        for hash in hashes:
            tempdir = tempfile.mkdtemp()
            add_magnet_uri_params = {
                'save_path': tempdir,
                'duplicate_is_error': True,
                'storage_mode': lt.storage_mode_t(2),
                'paused': False,
                'auto_managed': True,
                'duplicate_is_error': True
            }
            magnet_uri = "magnet:?xt=urn:btih:" + hash.upper() + "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80"
            #print(magnet_uri)
            handle = lt.add_magnet_uri(session, magnet_uri, add_magnet_uri_params)
            handles.append(handle)  # push handle in handles list
        #print("handles length is :")
        #print(len(handles))

        while(len(handles) != 0):
            for h in handles:
                #print("inside handles for each loop")
                if h.has_metadata():
                    torinfo = h.get_torrent_info()
                    final_info_hash = str(torinfo.info_hash())
                    final_info_hash = final_info_hash.upper()
                    torfile = lt.create_torrent(torinfo)
                    torcontent = lt.bencode(torfile.generate())
                    tfile_size = len(torcontent)
                    try:
                        insert_cursor = db_conn.cursor()  # prepare a cursor object using cursor() method
                        insert_cursor.execute("""INSERT INTO dht_tfiles (hash, tdata) VALUES (%s, %s)""", [final_info_hash, torcontent])
                        db_conn.commit()
                        #print "data inserted in DB"
                    except MySQLdb.Error, e:
                        try:
                            print "MySQL Error [%d]: %s" % (e.args[0], e.args[1])
                        except IndexError:
                            print "MySQL Error: %s" % str(e)
                    shutil.rmtree(h.save_path())   # remove temp data directory
                    session.remove_torrent(h)      # remove torrent handle from session
                    handles.remove(h)              # remove handle from list
                else:
                    if(h.status().active_time > 600):  # check if handle is more than 10 minutes old i.e. 600 seconds
                        #print('remove_torrent')
                        shutil.rmtree(h.save_path())   # remove temp data directory
                        session.remove_torrent(h)      # remove torrent handle from session
                        handles.remove(h)              # remove handle from list
            sleep(1)
            #print('sleep1')

        print('sleep10')
        sleep(10)
        current_page = current_page + 1

    #print('sleep20')
    sleep(20)

os.remove(lock_file_name)
Now I need to implement the new things Arvid suggested.
UPDATE
I have managed to implement what Arvid suggested, plus some more extensions I found on the Deluge support forums: http://forum.deluge-torrent.org/viewtopic.php?f=7&t=42299&start=10
#!/usr/bin/env python
import libtorrent as lt     # libtorrent library
import tempfile             # for settings parameters while fetching metadata as temp dir
import sys                  # getting arguments from shell or exit script
from time import sleep      # sleep
import shutil               # removing directory tree from temp directory
import os.path              # for getting pwd and other things
from pprint import pprint   # for debugging, showing object data
import MySQLdb              # DB connectivity
import os
from datetime import date, timedelta

def var_dump(obj):
    for attr in dir(obj):
        print "obj.%s = %s" % (attr, getattr(obj, attr))

session = lt.session()
session.add_extension('ut_pex')
session.add_extension('ut_metadata')
session.add_extension('smart_ban')
session.add_extension('metadata_transfer')
#session = lt.session(lt.fingerprint("DE", 0, 1, 0, 0), flags=1)

session_save_filename = "/tmp/new.client.save_state"

if(os.path.isfile(session_save_filename)):
    fileread = open(session_save_filename, 'rb')
    session.load_state(lt.bdecode(fileread.read()))
    fileread.close()
    print('session loaded from file')
else:
    print('new session started')

session.add_dht_router("router.utorrent.com", 6881)
session.add_dht_router("router.bittorrent.com", 6881)
session.add_dht_router("dht.transmissionbt.com", 6881)
session.add_dht_router("router.bitcomet.com", 6881)
session.add_dht_router("dht.aelitis.com", 6881)
session.start_dht()

alerts = []
alive = True
while alive:
    a = session.pop_alert()
    alerts.append(a)
    print('----------')
    for a in alerts:
        var_dump(a)
        alerts.remove(a)
    print('sleep10')
    sleep(10)
    filewrite = open(session_save_filename, "wb")
    filewrite.write(lt.bencode(session.save_state()))
    filewrite.close()
I kept it running for a minute and received the alert
obj.msg = no router found
Update:
After some testing it looks like
session.add_dht_router("router.bitcomet.com", 6881)
is causing
('%s: %s', 'alert', 'DHT error [hostname lookup] (1) Host not found (authoritative)')
Update:
I added
session.start_dht()
session.start_lsd()
session.start_upnp()
session.start_natpmp()
and got the alert
('%s: %s', 'portmap_error_alert', 'could not map port using UPnP: no router found')
As MatteoItalia points out, bootstrapping the DHT is not instantaneous and can sometimes take a while. There's no well-defined point in time when the bootstrap process is complete; it's a continuum of being increasingly connected to the network.
The more connected and the more good, stable nodes you know about, the quicker lookups will be. One way to factor out most of the bootstrapping process (to get more of an apples-to-apples comparison) would be to start timing after you get the dht_bootstrap_alert (and also hold off on adding the magnet link until then).
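A rough sketch of that timing setup, assuming libtorrent 1.x Python bindings (the alert-mask call and category names are from that API generation, and magnet_uri is a placeholder):

import time
import libtorrent as lt

ses = lt.session()
# DHT alerts are not delivered with the default alert mask
ses.set_alert_mask(lt.alert.category_t.dht_notification | lt.alert.category_t.error_notification)
ses.add_dht_router("router.bittorrent.com", 6881)
ses.start_dht()

# wait until the DHT reports it has bootstrapped
bootstrapped = False
while not bootstrapped:
    a = ses.pop_alert()
    if isinstance(a, lt.dht_bootstrap_alert):
        bootstrapped = True
    time.sleep(0.1)

# only now add the magnet link and time the metadata lookup itself
start = time.time()
handle = lt.add_magnet_uri(ses, magnet_uri, {'save_path': '/tmp'})
while not handle.has_metadata():
    time.sleep(0.5)
print('metadata fetched in %.1f s' % (time.time() - start))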
Adding the DHT bootstrap nodes will primarily make it possible to bootstrap; it's still not necessarily going to be particularly quick. You typically want about 270 nodes or so (including replacement nodes) to be considered bootstrapped.
One thing you can do to speed up the bootstrap process is to make sure you save and load the session state, which includes the dht routing table. This will reload all the nodes from the previous session into the routing table and (assuming you haven't changed IP and that everything works correctly) bootstrapping should be quicker.
Make sure you don't start the DHT in the session constructor (as the flags argument, just pass in add_default_plugins), load the state, add the router nodes and then start the dht.
Unfortunately there are a lot of moving parts involved in getting this to work internally, the ordering is important and there may be subtle issues.
Also, note that keeping the DHT running continuously would be quicker: reloading the state still goes through a bootstrap, it just has more nodes up front to ping and try to "connect" to.
Disabling the start_default_features flag also means UPnP and NAT-PMP won't be started; if you use those, you'll have to start them manually as well.
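A rough, untested sketch of the ordering described above, where session_save_filename is the same file used in the question and flags=1 corresponds to add_default_plugins (so the default features, including the DHT, are not started by the constructor):

import os
import libtorrent as lt

# create the session without starting the default features (DHT, UPnP, ...)
ses = lt.session(lt.fingerprint("LT", 1, 0, 0, 0), flags=1)  # 1 == add_default_plugins

# load the previous routing table before the DHT starts
if os.path.isfile(session_save_filename):
    with open(session_save_filename, 'rb') as f:
        ses.load_state(lt.bdecode(f.read()))

ses.add_dht_router("router.bittorrent.com", 6881)
ses.start_dht()

# default features were not started, so start these by hand if needed
ses.start_upnp()
ses.start_natpmp()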
