I'm trying to write a torrent application that can take in a list of magnet links and then download them all together. I've been trying to read and understand the libtorrent documentation, but I haven't been able to tell whether what I try works or not. I've managed to apply a SOCKS5 proxy to a libtorrent session and download a single magnet link using this code:
import libtorrent as lt
import time
import os
ses = lt.session()
r = lt.proxy_settings()
r.hostname = "proxy_info"
r.username = "proxy_info"
r.password = "proxy_info"
r.port = 1080
r.type = lt.proxy_type_t.socks5_pw
ses.set_peer_proxy(r)
ses.set_web_seed_proxy(r)
ses.set_proxy(r)
t = ses.settings()
t.force_proxy = True
t.proxy_peer_connections = True
t.anonymous_mode = True
ses.set_settings(t)
print(ses.get_settings())
ses.peer_proxy()
ses.web_seed_proxy()
ses.set_settings(t)
magnet_link = "magnet"
params = {
    "save_path": os.getcwd() + r"\torrents",
    "storage_mode": lt.storage_mode_t.storage_mode_sparse,
    "url": magnet_link
}
handle = lt.add_magnet_uri(ses, magnet_link, params)
ses.start_dht()
print('downloading metadata...')
while not handle.has_metadata():
    time.sleep(1)
print('got metadata, starting torrent download...')
while handle.status().state != lt.torrent_status.seeding:
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata', 'downloading', 'finished', 'seeding', 'allocating']
    print('%.2f%% complete (down: %.1f kb/s up: %.1f kB/s peers: %d) %s' % (s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000, s.num_peers, state_str[s.state]))
    time.sleep(5)
This is great and all for running on its own with a single link. What I want to do is something like this:
def torrent_download(magnetic_link_list):
    for mag in range(len(magnetic_link_list)):
        handle = lt.add_magnet_uri(ses, magnetic_link_list[mag], params)
    #Then download all the files
    #Once all files complete, stop the torrents so they dont seed.
    return torrent_name_list
I'm not sure if this is even on the right track or not, but some pointers would be helpful.
UPDATE: This is what I now have and it works fine in my case
def magnet2torrent(magnet_link):
    global LIBTORRENT_SESSION, TORRENT_HANDLES
    if LIBTORRENT_SESSION is None and TORRENT_HANDLES is None:
        TORRENT_HANDLES = []
        settings = lt.default_settings()
        settings['proxy_hostname'] = CONFIG_DATA["PROXY"]["HOST"]
        settings['proxy_username'] = CONFIG_DATA["PROXY"]["USERNAME"]
        settings['proxy_password'] = CONFIG_DATA["PROXY"]["PASSWORD"]
        settings['proxy_port'] = CONFIG_DATA["PROXY"]["PORT"]
        settings['proxy_type'] = CONFIG_DATA["PROXY"]["TYPE"]
        settings['force_proxy'] = True
        settings['anonymous_mode'] = True
        LIBTORRENT_SESSION = lt.session(settings)
    params = {
        "save_path": os.getcwd() + r"/torrents",
        "storage_mode": lt.storage_mode_t.storage_mode_sparse,
        "url": magnet_link
    }
    TORRENT_HANDLES.append(LIBTORRENT_SESSION.add_torrent(params))
def check_torrents():
    global TORRENT_HANDLES
    for torrent in range(len(TORRENT_HANDLES)):
        print(TORRENT_HANDLES[torrent].status().is_seeding)
It's called "magnet links" (not magnetic).
In new versions of libtorrent, the way you add a magnet link is:
params = lt.parse_magnet_uri(uri)
handle = ses.add_torrent(params)
That also gives you an opportunity to tweak the add_torrent_params object, to set the save directory for instance.
If you're adding a lot of magnet links (or regular torrent files for that matter) and want to do it quickly, a faster way is to use:
ses.add_torrent_async(params)
That function will return immediately and the torrent_handle object can be picked up later in the add_torrent_alert.
As for downloading multiple magnet links in parallel, your pseudo code for adding them is correct. You just want to make sure you either save off all the torrent_handle objects you get back or query all torrent handles once you're done adding them (using ses.get_torrents()). In your pseudo code you seem to overwrite the last torrent handle every time you add a new one.
The condition you expressed for exiting was that all torrents were complete. The simplest way of doing that is simply to poll them all with handle.status().is_seeding. i.e. loop over your list of torrent handles and ask that. Keep in mind that the call to status() requires a round-trip to the libtorrent network thread, which isn't super fast.
The faster way of doing this is to keep track of all torrents that aren't seeding yet, and "strike them off your list" as you get torrent_finished_alerts for torrents. (you get alerts by calling ses.pop_alerts()).
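A rough sketch of that whole flow (hedged: function and enum names follow the 1.2.x Python bindings, and the save_path handling is an assumption; adjust for your version):

import time
import libtorrent as lt

def download_all(ses, magnet_links, save_path):
    # make sure status alerts (including torrent_finished_alert) are enabled
    ses.apply_settings({'alert_mask': lt.alert.category_t.status_notification
                        | lt.alert.category_t.error_notification})

    handles = []
    for uri in magnet_links:
        params = lt.parse_magnet_uri(uri)
        params.save_path = save_path           # per-torrent options go here
        handles.append(ses.add_torrent(params))

    remaining = set(str(h.info_hash()) for h in handles)
    while remaining:
        for alert in ses.pop_alerts():
            if isinstance(alert, lt.torrent_finished_alert):
                remaining.discard(str(alert.handle.info_hash()))
        time.sleep(1)

    names = [h.status().name for h in handles]
    for h in handles:                           # remove them so they don't seed
        ses.remove_torrent(h)
    return names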
Another suggestion I would make is to set up your settings_pack object first, then create the session. It's more efficient and a bit cleaner. Especially with regards to opening listen sockets and then immediately closing and re-opening them when you change settings.
i.e.
p = lt.settings_pack()
p['proxy_hostname'] = '...'
p['proxy_username'] = '...'
p['proxy_password'] = '...'
p['proxy_port'] = 1080
p['proxy_type'] = lt.proxy_type_t.socks5_pw
p['proxy_peer_connections'] = True
ses = lt.session(p)
I'm using the Python ibm-cloud-sdk in an attempt to iterate all resources in a particular IBM Cloud account. My trouble has been that pagination doesn't appear to "work for me". When I pass in the "next_url" I still get the same list coming back from the call.
Here is my test code. I successfully print many of my COS instances, but I only seem to be able to print the first page. Maybe I've been looking at this too long and just missed something obvious... anyone have any clue why I can't retrieve the next page?
try:
    ####### authenticate and set the service url
    auth = IAMAuthenticator(RESOURCE_CONTROLLER_APIKEY)
    service = ResourceControllerV2(authenticator=auth)
    service.set_service_url(RESOURCE_CONTROLLER_URL)
    ####### Retrieve the resource instance listing
    r = service.list_resource_instances().get_result()
    ####### get the row count and resources list
    rows_count = r['rows_count']
    resources = r['resources']
    while rows_count > 0:
        print('Number of rows_count {}'.format(rows_count))
        next_url = r['next_url']
        for i, resource in enumerate(resources):
            type = resource['id'].split(':')[4]
            if type == 'cloud-object-storage':
                instance_name = resource['name']
                instance_id = resource['guid']
                crn = resource['crn']
                print('Found instance id : name - {} : {}'.format(instance_id, instance_name))
        ############### this is SUPPOSED to get the next page
        r = service.list_resource_instances(start=next_url).get_result()
        rows_count = r['rows_count']
        resources = r['resources']
except Exception as e:
    Error = 'Error : {}'.format(e)
    print(Error)
    exit(1)
From looking at the API documentation for listing resource instances, the value of next_url includes the URL path and the start parameter including its token for start.
To retrieve the next page, you would only need to pass in the parameter start with the token as value. IMHO this is not ideal.
I typically do not use the SDK, but a simple Python request. Then, I can use the endpoint (base) URI + next_url as the full URI.
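For reference, a minimal sketch of that requests-based approach (hedged: the /v2/resource_instances path and the token handling are assumptions; RESOURCE_CONTROLLER_URL is the same base URL as in the question, and iam_token is assumed to hold a valid IAM access token):

import requests

url = RESOURCE_CONTROLLER_URL + '/v2/resource_instances'
headers = {'Authorization': 'Bearer ' + iam_token}

while url:
    page = requests.get(url, headers=headers).json()
    for resource in page.get('resources', []):
        print(resource['guid'], resource['name'])
    # next_url already contains the path and the start token, so just
    # append it to the base URL; it is absent/None on the last page.
    next_url = page.get('next_url')
    url = RESOURCE_CONTROLLER_URL + next_url if next_url else None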
If you stick with the SDK, use urllib.parse to extract the query parameter. Not tested, but something like:
from urllib.parse import urlparse,parse_qs
o=urlparse(next_url)
q=parse_qs(o.query)
r = service.list_resource_instances(start=q['start'][0]).get_result()
Could you use the Search API for listing the resources in your account rather than the resource controller? The search index is set up for exactly that operation, whereas paginating results from the resource controller seems much more brute force.
https://cloud.ibm.com/apidocs/search#search
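If you want to try that route, here is a very rough sketch against the Global Search REST API (hedged: the endpoint URL, the query syntax, and the response field names are assumptions taken from the docs linked above; pagination via search_cursor is omitted, and iam_token is assumed to hold a valid IAM access token):

import requests

search_url = 'https://api.global-search-tagging.cloud.ibm.com/v3/resources/search'
resp = requests.post(search_url,
                     headers={'Authorization': 'Bearer ' + iam_token},
                     params={'limit': 100},
                     json={'query': 'service_name:cloud-object-storage'})
for item in resp.json().get('items', []):
    print(item.get('name'), item.get('crn'))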
I have a small group of Raspberry Pis, all on the same local network (192.168.1.2xx) All are running Python 3.7.3, one (R Pi CM3) on Raspbian Buster, the other (R Pi 4B 8gig) on Raspberry Pi OS 64.
I have a file on one device (the Pi 4B), located at /tmp/speech.wav, that is generated on the fly, real-time:
192.168.1.201 - /tmp/speech.wav
I have a script that works well on that device, that tells me the play duration time of the .wav file in seconds:
import wave
import contextlib
def getPlayTime():
    fname = '/tmp/speech.wav'
    with contextlib.closing(wave.open(fname,'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = round(frames / float(rate), 2)
        return duration
However - the node that needs to operate on that duration information is running on another node at 192.168.1.210. I cannot simply move the various files all to the same node as there is a LOT going on, things are where they are for a reason.
So what I need to know is how to alter my approach such that I can change the script reference to something like this pseudocode:
fname = '/tmp/speech.wav @ 192.168.1.201'
Is such a thing possible? Searching the web it seems that I am up against millions of people looking for how to obtain IP addresses, fix multiple IP address issues, fix duplicate ip address issues... but I can't seem yet to find how to simply examine a file on a different ip address as I have described here. I have no network security restrictions, so any setting is up for consideration. Help would be much appreciated.
There are lots of possibilities, and it probably comes down to how often you need to check the duration, from how many clients, and how often the file changes and whether you have other information that you want to share between the nodes.
Here are some options:
set up an SMB (Samba) server on the Pi that has the WAV file and let the other nodes mount the filesystem and access the file as if it was local
set up an NFS server on the Pi that has the WAV file and let the other nodes mount the filesystem and access the file as if it was local
let other nodes use ssh to login and extract the duration, or scp to retrieve the file - see paramiko in Python (a short sketch follows this list)
set up Redis on one node and throw the WAV file in there so anyone can get it - this is potentially attractive if you have lots of lists, arrays, strings, integers, hashes, queues or sets that you want to share between Raspberry Pis very fast. Example here.
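If the ssh route is enough for you, here is a minimal paramiko sketch (hostname, username and password are placeholders) that computes the duration on the remote Pi and returns just the number:

#!/usr/bin/env python3
# Hypothetical sketch: ask the Pi that owns the file for the WAV duration over ssh.
import paramiko

cmd = ("python3 -c \"import wave;"
       "f = wave.open('/tmp/speech.wav');"
       "print(round(f.getnframes() / float(f.getframerate()), 2))\"")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('192.168.1.201', username='pi', password='raspberry')  # placeholders
stdin, stdout, stderr = client.exec_command(cmd)
duration = float(stdout.read())
client.close()
print('speech.wav plays for', duration, 'seconds')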
Here is a very simple example of writing a sound track into Redis from one node (say Redis is on 192.168.1.200) and reading it back from any other. Of course, you may just want the writing node to write the duration in there rather than the whole track - which would be more efficient. Or you may want to store loads of other shared data or settings.
This is the writer:
#!/usr/bin/env python3
import redis
from pathlib import Path
host='192.168.1.200'
# Connect to Redis
r = redis.Redis(host)
# Load some music, or otherwise create it
music = Path('song.wav').read_bytes()
# Put music into Redis where others can see it
r.set("music",music)
And this is the reader:
#!/usr/bin/env python3
import redis
from pathlib import Path
host='192.168.1.200'
# Connect to Redis
r = redis.Redis(host)
# Retrieve music track from Redis
music = r.get("music")
print(f'{len(music)} bytes read from Redis')
Then, during testing, you may want to manually push a track into Redis from the Terminal:
redis-cli -x -h 192.168.1.200 set music < OtherTrack.wav
Or manually retrieve the track from Redis to a file:
redis-cli -h 192.168.1.200 get music > RetrievedFromRedis.wav
OK, this is what I finally settled on - and it works great. Using ZeroMQ for message passing, I have one function that gets the playtime of the wav and another that gathers data about the speech about to be spoken; all of that is sent to the motor core prior to sending the speech. The motor core handles the timing issues to sync the jaw to the speech. So I'm not actually putting the code that generates the wav (and returns its playback length) onto the node that ultimately makes use of it, but it turns out that message passing is fast enough that there is plenty of time to receive, process and implement the motion control to match the speech perfectly. Posting this here in case it's helpful for folks in the future working on similar issues.
import time
import zmq
import os
import re
import wave
import contextlib
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555") #Listens for speech to output
print("Connecting to Motor Control")
jawCmd = context.socket(zmq.PUB)
jawCmd.connect("tcp://192.168.1.210:5554") #Sends to MotorFunctions for Jaw Movement
def getPlayTime(): # Checks to see if current file duration has changed
    fname = '/tmp/speech.wav'                      # and if yes, sends new duration
    with contextlib.closing(wave.open(fname,'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = round(frames / float(rate), 3)
        speakTime = str(duration)
        return speakTime
def set_voice(V,T):
T2 = '"' + T + '"'
audioFile = "/tmp/speech.wav" # /tmp set as tmpfs, or RAMDISK to reduce SD Card write ops
if V == "A":
voice = "Allison"
elif V == "B":
voice = "Belle"
elif V == "C":
voice = "Callie"
elif V == "D":
voice = "Dallas"
elif V == "V":
voice = "David"
else:
voice = "Belle"
os.system("swift -n " + voice + " -o " + audioFile + " " +T2) # Record audio
tailTrim = .5 # Calculate Jaw Timing
speakTime = eval(getPlayTime()) # Start by getting playlength
speakTime = round((speakTime - tailTrim), 2) # Chop .5 s for trailing silence
wordList = T.split()
jawString = []
for index in range(len(wordList)):
wordLen = len(wordList[index])
jawString.append(wordLen)
jawString = str(jawString)
speakTime = str(speakTime)
jawString = speakTime + "|" + jawString # 3.456|[4, 2, 7, 4, 2, 9, 3, 4, 3, 6] - will split on "|"
jawCmd.send_string(jawString) # Send Jaw Operating Sequence
os.system("aplay " + audioFile) # Play audio
pronunciationDict = {'teh':'the','process':'prawcess','Maeve':'Mayve','Mariposa':'May-reeposah','Lila':'Lala','Trump':'Ass hole'}
def adjustResponse(response): # Adjusts spellings in output string to create better speech output.
    for key, value in pronunciationDict.items():
        if key in response or key.lower() in response:
            response = re.sub(key, value, response, flags=re.I)
    return response
V = "B"                                            # default voice; V was not defined in the posted snippet
SpeakText = "Speech center connected and online."
set_voice(V,SpeakText)                             # Cepstral Voices: A = Allison; B = Belle; C = Callie; D = Dallas; V = David;
while True:
    SpeakText = socket.recv().decode('utf-8')      # .decode gets rid of the b' in front of the string
    SpeakTextX = adjustResponse(SpeakText)         # Run the string through the pronunciation dictionary
    print("SpeakText = ",SpeakTextX)
    set_voice(V,SpeakTextX)
    print("Received request: %s" % SpeakTextX)
    socket.send_string(str(SpeakTextX))            # Send data back to source for confirmation
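For completeness, a minimal sketch of what the receiving (motor control) end might look like; the parsing mirrors the duration|word-lengths string built above, but the bind address and the actual servo code are assumptions:

#!/usr/bin/env python3
# Hypothetical motor-core receiver for the jaw sequence published above.
import ast
import zmq

context = zmq.Context()
jaw = context.socket(zmq.SUB)
jaw.bind("tcp://*:5554")                   # the speech node PUB-connects to port 5554
jaw.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to everything

while True:
    msg = jaw.recv_string()                # e.g. "3.456|[4, 2, 7, 4]"
    speakTime, wordLengths = msg.split("|", 1)
    speakTime = float(speakTime)
    wordLengths = ast.literal_eval(wordLengths)
    # ...drive the jaw servo here, spreading the word movements over speakTime...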
I'm trying to make a video chat inside my application, using videojs-record + videojs to do so: videojs-record to record the webcam (which is already working), and videojs to play the video on the other side. By using the timestamp event with the timeSlice property, I've managed to split the recorded video by each second.
this.player = videojs('#myVideo', this.options, () => {});
var player = this.player
var that = this
this.player.on('startRecord', () => {
  that.segment = 0
});
player.on('timestamp', function() {
  if (player.recordedData && player.recordedData.length > 0) {
    var binaryData = player.recordedData[player.recordedData.length - 1]
    that.$emit('dataVideoEvent', {video: binaryData, segment_index: that.segment})
    that.segment += 1
  }
});
So I've managed to upload those segments to Amazon S3, and have used a Python endpoint to return an HLS file containing the uploaded files:
duration = 1
totalDuration = len(list)
string = ('#EXTM3U\n'
'#EXT-X-PLAYLIST-TYPE:EVENT\n'
'#EXT-X-TARGETDURATION:3600\n'
'#EXT-X-VERSION:3\n'
'#EXT-X-MEDIA-SEQUENCE:0\n')
ended = False
now = datetime.utcnow().replace(tzinfo=pytz.utc)
for index, obj in enumerate(list):
    url = {retrieving url based in obj}
    string += '#EXTINF:{duration}, no desc\n{url}\n'.format(duration=duration, url=url)
    ended = (obj.created_at.replace(tzinfo=pytz.utc) + timedelta(seconds=10)) < now
if ended:
    string += '#EXT-X-ENDLIST'
However, that HLS file isn't working. The videojs player shows the proper full time of the video, but never starts to play it, and logs no errors. If I try to play the video using Bitmovin's player demo, it says SOURCE_MANIFEST_INVALID.
I've also tried to create an XML+DASH file instead of an HLS one, but it didn't work either. And I've also tried to change the videojs-record videoMimeType property to other values, like video/webm;codecs=vp8,opus or video/mp2t;codecs=vp8,opus, but that didn't change a thing either.
Also, the @action that returns the HLS file has a renderer_classes property using this renderer:
class HLSRenderer(BaseRenderer):
    media_type = 'application/x-mpegurl'
    format = 'm3u8'
    level_sep = '.'
    header = None
    labels = None
    writer_opts = None

    def render(self, data, media_type=None, renderer_context={}, writer_opts=None):
        return data
And at last, I've configured CORS in that way in S3, in case that's relevant:
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>HEAD</AllowedMethod>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
    <AllowedHeader>Range</AllowedHeader>
  </CORSRule>
</CORSConfiguration>
And by the way, the @action is usually blocked so only the expected user can see it, but at this moment I've changed its permission_classes value to AllowAny, so I can test on websites like Bitmovin.
What exactly am I doing wrong?
I gave up and ended up using Jitsi instead of videojs-record. I had to run it with an iframe within Vue, but it succeeded where all the rest failed.
I had spent the last 18 days trying to make a live stream by myself, when I could make Jitsi work with my project in almost a single night. I recommend their Docker setup; it's pretty easy to get running.
https://github.com/jitsi/docker-jitsi-meet
I have found a program called "Best Email Extractor" http://www.emailextractor.net/. The website says it is written in Python. I tried to write a similar program. The above program extracts about 300 - 1000 emails per minute. My program extracts about 30-100 emails per hour. Could someone give me tips on how to improve the performance of my program? I wrote the following:
import sqlite3 as sql
import urllib2
import re
import lxml.html as lxml
import time
import threading
def getUrls(start):
    urls = []
    try:
        dom = lxml.parse(start).getroot()
        dom.make_links_absolute()
        for url in dom.iterlinks():
            if not '.jpg' in url[2]:
                if not '.JPG' in url[2]:
                    if not '.ico' in url[2]:
                        if not '.png' in url[2]:
                            if not '.jpeg' in url[2]:
                                if not '.gif' in url[2]:
                                    if not 'youtube.com' in url[2]:
                                        urls.append(url[2])
    except:
        pass
    return urls
def getURLContent(urlAdresse):
    try:
        url = urllib2.urlopen(urlAdresse)
        text = url.read()
        url.close()
        return text
    except:
        return '<html></html>'
def harvestEmail(url):
    text = getURLContent(url)
    emails = re.findall('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', text)
    if emails:
        if saveEmail(emails[0]) == 1:
            print emails[0]
def saveUrl(url):
    connection = sql.connect('url.db')
    url = (url, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM urladressen WHERE adresse = ?', url)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO urladressen VALUES(NULL, ?)', url)
            return 1
        return 0
def saveEmail(email):
    connection = sql.connect('emails.db')
    email = (email, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', email)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', email)
            return 1
        return 0
def searchrun(urls):
    for url in urls:
        if saveUrl(url) == 1:
            #time.sleep(0.6)
            harvestEmail(url)
            print url
        urls.remove(url)
        urls = urls + getUrls(url)
urls1 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=DVD')
urls2 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Jolie')
urls3 = getUrls('http://www.finanzen.net')
urls4 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Party')
urls5 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Games')
urls6 = getUrls('http://www.spiegel.de')
urls7 = getUrls('http://www.kicker.de/')
urls8 = getUrls('http://www.chessbase.com')
urls9 = getUrls('http://www.nba.com')
urls10 = getUrls('http://www.nfl.com')
try:
    threads = []
    urls = (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8, urls9, urls10)
    for urlList in urls:
        thread = threading.Thread(target=searchrun, args=(urlList, ))
        thread.start()
        threads.append(thread)
        print threading.activeCount()
    for thread in threads:
        thread.join()
except RuntimeError:
    print RuntimeError
I don't think many people are going to help you harvest emails. It's a generally detested activity.
Regarding the performance bottlenecks in your code, you need to find out where the time is going by profiling. At the lowest level, replace each of your functions with a dummy that does no processing but returns valid output; so the email collector could return a list of the same address 100 times (or however many are in these URL results). That will show you which function is costing you time.
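Python's built-in profiler is another quick way to get that breakdown (a hedged alternative to the dummy-function approach; searchrun and urls1 are the names from the question's code):

import cProfile
import pstats

# Profile one crawl pass and print the 20 slowest calls by cumulative time.
cProfile.run('searchrun(urls1)', 'crawl.prof')
pstats.Stats('crawl.prof').sort_stats('cumulative').print_stats(20)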
Things that stick out:
Get the files behind the URLs from the server beforehand; if you spam Google every time you run the script, they could well block you. Reading from disk is faster than requesting the files from the internet and can be done separately and concurrently.
The database code is creating a new connection for each call to saveEmail etc, which will spend most of its time doing handshaking and authentication. Better to have an object that keeps the connection alive between calls, or better yet to insert multiple records at once (see the sketch after this list).
Once the network and database issues are done, the regex could probably do with \b around it so that the matching does less backtracking.
A series of if not 'foo' in str: then if not 'blah' in str ... is poor coding. Extract the final segment once and check it against multiple values by creating a set or even frozenset of all the non-permitted values, like ignoredExtensions = set(['jpg', 'png', 'gif']), and comparing with that, like if extension not in ignoredExtensions (also shown in the sketch below). Note also that converting the extension to lower case first will mean less checking, and it will work whether it is jpg or JPG.
Finally, consider running the same script without threading on multiple commandlines. There is no real need to have the threading inside the script except for coordinating the different url lists. Frankly it would be far simpler to just have a set of url lists in files, and start a separate script to work on each. Let the OS do the multithreading, it is better at it.
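A hedged sketch of the connection-reuse and extension-filtering points above (the table name follows the question's code; the INSERT OR IGNORE shortcut assumes a UNIQUE index on the email column, which is an assumption about the schema):

import sqlite3 as sql
from urlparse import urlparse

IGNORED_EXTENSIONS = frozenset(['jpg', 'jpeg', 'png', 'gif', 'ico'])

def wanted(url):
    # Extract the path once, lower-case it, and do a single set lookup
    # instead of a chain of "if not '...' in url" checks.
    path = urlparse(url).path.lower()
    return '.' not in path or path.rsplit('.', 1)[1] not in IGNORED_EXTENSIONS

class EmailStore(object):
    """Keeps one sqlite connection open instead of reconnecting per email."""
    def __init__(self, path='emails.db'):
        self.connection = sql.connect(path)

    def save_many(self, emails):
        # One transaction, many rows; OR IGNORE assumes a UNIQUE index on the column.
        with self.connection:
            self.connection.executemany(
                'INSERT OR IGNORE INTO addresse VALUES(NULL, ?)',
                [(e,) for e in emails])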
I would like to wipe out all data for a specific kind in Google App Engine. What is the best way to do this?
I wrote a delete script (hack), but since there is so much data, it times out after a few hundred records.
I am currently deleting the entities by their key, and it seems to be faster.
from google.appengine.ext import db
class bulkdelete(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        try:
            while True:
                q = db.GqlQuery("SELECT __key__ FROM MyModel")
                assert q.count()
                db.delete(q.fetch(200))
                time.sleep(0.5)
        except Exception, e:
            self.response.out.write(repr(e)+'\n')
            pass
from the terminal, I run curl -N http://...
You can now use the Datastore Admin for that: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
If I were a paranoid person, I would say Google App Engine (GAE) has not made it easy for us to remove data if we want to. I am going to skip the discussion of index sizes and how they translate 6 GB of data into 35 GB of storage (being billed for). That's another story, but they do have ways to work around that - limit the number of properties to create indexes on (automatically generated indexes) et cetera.
The reason I decided to write this post is that I need to "nuke" all my Kinds in a sandbox. I read about it and finally came up with this code:
package com.intillium.formshnuker;
import java.io.IOException;
import java.util.ArrayList;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.labs.taskqueue.QueueFactory;
import com.google.appengine.api.labs.taskqueue.TaskOptions.Method;
import static com.google.appengine.api.labs.taskqueue.TaskOptions.Builder.url;
@SuppressWarnings("serial")
public class FormsnukerServlet extends HttpServlet {
    public void doGet(final HttpServletRequest request, final HttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        final String kind = request.getParameter("kind");
        final String passcode = request.getParameter("passcode");
        if (kind == null) {
            throw new NullPointerException();
        }
        if (passcode == null) {
            throw new NullPointerException();
        }
        if (!passcode.equals("LONGSECRETCODE")) {
            response.getWriter().println("BAD PASSCODE!");
            return;
        }
        System.err.println("*** deleting entities form " + kind);
        final long start = System.currentTimeMillis();
        int deleted_count = 0;
        boolean is_finished = false;
        final DatastoreService dss = DatastoreServiceFactory.getDatastoreService();
        while (System.currentTimeMillis() - start < 16384) {
            final Query query = new Query(kind);
            query.setKeysOnly();
            final ArrayList<Key> keys = new ArrayList<Key>();
            for (final Entity entity: dss.prepare(query).asIterable(FetchOptions.Builder.withLimit(128))) {
                keys.add(entity.getKey());
            }
            keys.trimToSize();
            if (keys.size() == 0) {
                is_finished = true;
                break;
            }
            while (System.currentTimeMillis() - start < 16384) {
                try {
                    dss.delete(keys);
                    deleted_count += keys.size();
                    break;
                } catch (Throwable ignore) {
                    continue;
                }
            }
        }
        System.err.println("*** deleted " + deleted_count + " entities form " + kind);
        if (is_finished) {
            System.err.println("*** deletion job for " + kind + " is completed.");
        } else {
            final int taskcount;
            final String tcs = request.getParameter("taskcount");
            if (tcs == null) {
                taskcount = 0;
            } else {
                taskcount = Integer.parseInt(tcs) + 1;
            }
            QueueFactory.getDefaultQueue().add(
                url("/formsnuker?kind=" + kind + "&passcode=LONGSECRETCODE&taskcount=" + taskcount).method(Method.GET));
            System.err.println("*** deletion task # " + taskcount + " for " + kind + " is queued.");
        }
        response.getWriter().println("OK");
    }
}
I have over 6 million records. That's a lot. I have no idea what the cost will be to delete the records (maybe more economical not to delete them). Another alternative would be to request a deletion for the entire application (sandbox). But that's not realistic in most cases.
I decided to go with smaller groups of records (in easy query). I know I could go for 500 entities, but then I started receiving very high rates of failure (re delete function).
My request from GAE team: please add a feature to delete all entities of a kind in a single transaction.
Presumably your hack was something like this:
# Deleting all messages older than "earliest_date"
q = db.GqlQuery("SELECT * FROM Message WHERE create_date < :1", earliest_date)
results = q.fetch(1000)
while results:
    db.delete(results)
    results = q.fetch(1000, len(results))
As you say, if there's sufficient data, you're going to hit the request timeout before it gets through all the records. You'd have to re-invoke this request multiple times from outside to ensure all the data was erased; easy enough to do, but hardly ideal.
The admin console doesn't seem to offer any help, as (from my own experience with it), it seems to only allow entities of a given type to be listed and then deleted on a page-by-page basis.
When testing, I've had to purge my database on startup to get rid of existing data.
I would infer from this that Google operates on the principle that disk is cheap, and so data is typically orphaned (indexes to redundant data replaced), rather than deleted. Given there's a fixed amount of data available to each app at the moment (0.5 GB), that's not much help for non-Google App Engine users.
Try using App Engine Console - then you don't even have to deploy any special code.
I've tried db.delete(results) and App Engine Console, and neither of them seems to work for me. Manually removing entries from Data Viewer (with the limit increased to 200) didn't work either, since I have uploaded more than 10000 entries. I ended up writing this script:
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import wsgiref.handlers
from mainPage import YourData #replace this with your data
class CleanTable(webapp.RequestHandler):
    def get(self, param):
        txt = self.request.get('table')
        q = db.GqlQuery("SELECT * FROM " + txt)
        results = q.fetch(10)
        self.response.headers['Content-Type'] = 'text/plain'
        # replace yourapp and YourData with your app info below.
        self.response.out.write("""
            <html>
            <meta HTTP-EQUIV="REFRESH" content="5; url=http://yourapp.appspot.com/cleanTable?table=YourData">
            <body>""")
        try:
            for i in range(10):
                db.delete(results)
                results = q.fetch(10, len(results))
                self.response.out.write("<p>10 removed</p>")
            self.response.out.write("""
            </body>
            </html>""")
        except Exception, inst:
            self.response.out.write(str(inst))

def main():
    application = webapp.WSGIApplication([
        ('/cleanTable(.*)', CleanTable),
    ])
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()
The trick was to include redirect in html instead of using self.redirect. I'm ready to wait overnight to get rid of all the data in my table. Hopefully, GAE team will make it easier to drop tables in the future.
The official answer from Google is that you have to delete in chunks spread over multiple requests. You can use AJAX, meta refresh, or request your URL from a script until there are no entities left.
The fastest and most efficient way to handle bulk delete on Datastore is by using the new mapper API announced at the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op
def process(entity):
    yield op.db.Delete(entity)
On Java you should have a look at this article, which suggests a function like this:
@Override
public void map(Key key, Entity value, Context context) {
    log.info("Adding key to deletion pool: " + key);
    DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
        .getMutationPool();
    mutationPool.delete(value.getKey());
}
One tip. I suggest you get to know the remote_api for these types of uses (bulk deleting, modifying, etc.). But, even with the remote api, batch size can be limited to a few hundred at a time.
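For example (not tested; the ConfigureRemoteApi arguments are from memory and may need adjusting for your SDK version, and the hostname and kind are placeholders), the remote_api lets you run batched deletes from your own machine:

import getpass
from google.appengine.ext import db
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    return raw_input('Email: '), getpass.getpass('Password: ')

remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app.appspot.com')

while True:
    keys = db.GqlQuery("SELECT __key__ FROM MyModel").fetch(200)
    if not keys:
        break
    db.delete(keys)   # a few hundred per batch, as the tip above suggests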
Unfortunately, there's no way to easily do a bulk delete. Your best bet is to write a script that deletes a reasonable number of entries per invocation, and then call it repeatedly - for example, by having your delete script return a 302 redirect whenever there's more data to delete, then fetching it with "wget --max-redirect=10000" (or some other large number).
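A rough sketch of that redirect-driven pattern on the Python runtime (the kind name, batch size and handler URL are placeholders):

from google.appengine.ext import db
from google.appengine.ext import webapp

class PurgeHandler(webapp.RequestHandler):
    """Deletes one batch per request and 302-redirects while data remains."""
    def get(self):
        keys = db.GqlQuery("SELECT __key__ FROM MyModel").fetch(200)
        if keys:
            db.delete(keys)
            self.redirect(self.request.url)   # wget --max-redirect keeps following
        else:
            self.response.out.write('done')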
With django, setup url:
url(r'^Model/bdelete/$', v.bulk_delete_models, {'model':'ModelKind'}),
Setup view
def bulk_delete_models(request, model):
    import time
    limit = request.GET['limit'] or 200
    start = time.clock()
    set = db.GqlQuery("SELECT __key__ FROM %s" % model).fetch(int(limit))
    count = len(set)
    db.delete(set)
    return HttpResponse("Deleted %s %s in %s" % (count, model, (time.clock() - start)))
Then run in powershell:
$client = new-object System.Net.WebClient
$client.DownloadString("http://your-app.com/Model/bdelete/?limit=400")
If you are using Java/JPA you can do something like this:
em = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory)
Query q = em.createQuery("delete from Table t");
int number = q.executeUpdate();
Java/JDO info can be found here: http://code.google.com/appengine/docs/java/datastore/queriesandindexes.html#Delete_By_Query
Yes you can:
Go to Datastore Admin, then select the Entity type you want to delete and click Delete.
Mapreduce will take care of deleting!
On a dev server, you can cd to your app's directory and then run it like this:
dev_appserver.py --clear_datastore=yes .
Doing so will start the app and clear the datastore. If you already have another instance running, the app won't be able to bind to the needed IP and therefore fail to start...and to clear your datastore.
You can use the task queues to delete chunks of say 100 objects.
Deleting objects in GAE shows how limited the Admin capabilities are in GAE. You have to work with batches of 1000 entities or less. You can use the bulkloader tool that works with CSVs, but the documentation does not cover Java.
I am using GAE Java and my strategy for deletions involves having 2 servlets, one for doing the actual delete and another to load the task queues. When I want to do a delete, I run the queue-loading servlet; it loads the queues, and then GAE goes to work executing all the tasks in the queue.
How to do it:
Create a servlet that deletes a small number of objects.
Add the servlet to your task queues.
Go home or work on something else ;)
Check the datastore every so often ...
I have a datastore with about 5000 objects that I purge every week and it takes about 6 hours to clean out, so I run the task on Friday night.
I use the same technique to bulk load my data which happens to be about 5000 objects, with about a dozen properties.
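A rough Python sketch of the same queue-then-delete pattern (the answer above uses Java servlets; the /tasks/purge URL, handler names and batch size here are hypothetical):

from google.appengine.api import taskqueue
from google.appengine.ext import db
from google.appengine.ext import webapp

class PurgeTask(webapp.RequestHandler):
    """Deletes one small batch, then re-queues itself until the kind is empty."""
    def post(self):
        kind = self.request.get('kind')
        keys = db.GqlQuery("SELECT __key__ FROM %s" % kind).fetch(100)
        if keys:
            db.delete(keys)
            taskqueue.add(url='/tasks/purge', params={'kind': kind})

class PurgeKickoff(webapp.RequestHandler):
    """Hit this once to start the queue-driven deletion."""
    def get(self):
        taskqueue.add(url='/tasks/purge', params={'kind': self.request.get('kind')})
        self.response.out.write('queued')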
This worked for me:
class ClearHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        q = db.GqlQuery("SELECT * FROM SomeModel")
        self.response.out.write("deleting...")
        db.delete(q)
Thank you all guys, I got what I need. :D
This may be useful if you have lots of db models to delete; you can dispatch it from your terminal. You can also manage the delete list in DB_MODEL_LIST yourself.
Delete DB_1:
python bulkdel.py 10 DB_1
Delete All DB:
python bulkdel.py 11
Here is the bulkdel.py file:
import sys, os
URL = 'http://localhost:8080'
DB_MODEL_LIST = ['DB_1', 'DB_2', 'DB_3']
# Delete Model
if sys.argv[1] == '10' :
    command = 'curl %s/clear_db?model=%s' % ( URL, sys.argv[2] )
    os.system( command )

# Delete All DB Models
if sys.argv[1] == '11' :
    for model in DB_MODEL_LIST :
        command = 'curl %s/clear_db?model=%s' % ( URL, model )
        os.system( command )
And here is the modified version of alexandre fiori's code.
from google.appengine.ext import db
class DBDelete( webapp.RequestHandler ):
    def get( self ):
        self.response.headers['Content-Type'] = 'text/plain'
        db_model = self.request.get('model')
        sql = 'SELECT __key__ FROM %s' % db_model
        try:
            while True:
                q = db.GqlQuery( sql )
                assert q.count()
                db.delete( q.fetch(200) )
                time.sleep(0.5)
        except Exception, e:
            self.response.out.write( repr(e)+'\n' )
            pass
And of course, you should map the URL to the handler in a file (like main.py in GAE) ;)
In case some guys like me need it in detail, here is part of main.py:
from google.appengine.ext import webapp
import utility # DBDelete was defined in utility.py
application = webapp.WSGIApplication([('/clear_db',utility.DBDelete ),('/',views.MainPage )],debug = True)
To delete all entities in a given kind in Google App Engine you only need to do as follows:
from google.cloud import datastore
query = datastore.Client().query(kind = <KIND>)
results = query.fetch()
for result in results:
    datastore.Client().delete(result.key)
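A slightly more efficient variant of the same idea, reusing one client, fetching keys only and deleting in batches with delete_multi (the kind name and the batch size are placeholders):

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind='YourKind')      # replace with your kind name
query.keys_only()                          # no need to pull entity properties
keys = [entity.key for entity in query.fetch()]
for i in range(0, len(keys), 500):         # commits are limited to ~500 mutations
    client.delete_multi(keys[i:i + 500])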
In JavaScript, the following will delete all the entries on one page:
document.getElementById("allkeys").checked=true;
checkAllEntities();
document.getElementById("delete_button").setAttribute("onclick","");
document.getElementById("delete_button").click();
given that you are on the admin-page (.../_ah/admin) with the entities you want to delete.