youtube-dl multiple downloads at the same time - python

I have a Python app with a variable that contains multiple URLs.
At the moment I use something like this:
for v in arr:
    cmd = 'youtube-dl -u ' + email + ' -p ' + password + ' -o "' + v['path'] + '" ' + v['url']
    os.system(cmd)
But this way I download just one video after another. How can I download, say, 3 videos at the same time? (The videos are not from YouTube, so there are no playlists or channels.)
I don't necessarily need multithreading in Python; it would be enough to call youtube-dl multiple times, splitting the array. So from a Python perspective it can be a single thread.

Use a Pool:
import multiprocessing.dummy
import subprocess

arr = [
    {'vpath': 'example/%(title)s.%(ext)s', 'url': 'https://www.youtube.com/watch?v=BaW_jenozKc'},
    {'vpath': 'example/%(title)s.%(ext)s', 'url': 'http://vimeo.com/56015672'},
    {'vpath': '%(playlist_title)s/%(title)s-%(id)s.%(ext)s',
     'url': 'https://www.youtube.com/playlist?list=PLLe-WjSmNEm-UnVV8e4qI9xQyI0906hNp'},
]
email = 'my-email@example.com'
password = '123456'
concurrent = 3  # how many downloads run at the same time

def download(v):
    subprocess.check_call([
        'echo', 'youtube-dl',  # remove 'echo' to actually start the downloads
        '-u', email, '-p', password,
        '-o', v['vpath'], '--', v['url']])

p = multiprocessing.dummy.Pool(concurrent)
p.map(download, arr)
multiprocessing.dummy.Pool is a lightweight thread-based version of a Pool, which is more suitable here because the work tasks are just starting subprocesses.
Note that subprocess.check_call is used instead of os.system, which prevents the command injection vulnerability present in your previous code.
Also note that youtube-dl output templates are really powerful. In most cases, you don't actually need to define and manage file names yourself.
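If you prefer the newer standard-library API, a roughly equivalent sketch uses concurrent.futures (reusing the download function and arr from above; max_workers plays the role of concurrent):
from concurrent.futures import ThreadPoolExecutor

# Run at most 3 youtube-dl processes at the same time
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(download, arr))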

I achieved the same thing using the threading library, which is a lighter-weight way to run tasks concurrently than spawning new processes.
Assumption:
Each task will download videos to a different directory.
import os
import threading
import youtube_dl

COOKIE_JAR = "path_to_my_cookie_jar"

def download_task(videos, output_dir):
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    if not os.path.isfile(COOKIE_JAR):
        raise FileNotFoundError("Cookie Jar not found\n")
    ydl_opts = {
        'cookiefile': COOKIE_JAR,
        'outtmpl': f'{output_dir}/%(title)s.%(ext)s'
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download(videos)

if __name__ == "__main__":
    root_dir = "./root_dir"
    threads = []
    # many_playlists (and playlist.name) come from the surrounding application
    for playlist in many_playlists:
        output_dir = f"{root_dir}/{playlist.name}"
        thread = threading.Thread(target=download_task, args=(playlist, output_dir))
        threads.append(thread)
    # Actually start downloading
    for thread in threads:
        thread.start()
    # Wait for all the downloads to complete
    for thread in threads:
        thread.join()
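Note that the loop above starts one thread per playlist with no upper bound on concurrency. If you want at most a few downloads running at once, one possible sketch (a hypothetical helper reusing download_task from above) gates the task with a semaphore:
import threading

download_slots = threading.Semaphore(3)  # allow at most 3 concurrent downloads

def bounded_download_task(videos, output_dir):
    with download_slots:  # blocks while 3 downloads are already active
        download_task(videos, output_dir)
You would then pass bounded_download_task as the Thread target instead of download_task.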

Related

python multiple threads redirecting stdout

I'm building an icecast2 radio station which will restream existing stations in lower quality. This program will generate multiple FFmpeg processes restreaming 24/7. For troubleshooting purposes, I would like to have the output of every FFmpeg process redirected to a separate file.
import ffmpeg, csv
from threading import Thread

def run(name, mount, source):
    icecast = "icecast://"+ICECAST2_USER+":"+ICECAST2_PASS+"@localhost:"+ICECAST2_PORT+"/"+mount
    stream = (
        ffmpeg
        .input(source)
        .output(
            icecast,
            audio_bitrate=BITRATE, sample_rate=SAMPLE_RATE, format=FORMAT, acodec=CODEC,
            reconnect="1", reconnect_streamed="1", reconnect_at_eof="1", reconnect_delay_max="120",
            ice_name=name, ice_genre=source
        )
    )
    return stream

with open('stations.csv', mode='r') as data:
    for station in csv.DictReader(data):
        stream = run(station['name'], station['mount'], station['url'])
        thread = Thread(target=stream.run)
        thread.start()
As I understand it, I can't redirect the stdout of each thread separately, and I also can't use ffmpeg's reporting, which is only configured via an environment variable. Do I have any other options?
You need to create a thread function of your own:
import subprocess as sp

def stream_runner(stream, id):
    # open a stream-specific log file to write to
    with open(f'stream_{id}.log', 'wt') as f:
        # block until ffmpeg is done
        sp.run(stream.compile(), stderr=f)

for i, station in enumerate(csv.DictReader(data)):
    stream = run(station['name'], station['mount'], station['url'])
    thread = Thread(target=stream_runner, args=(stream, i))
    thread.start()
Something like this should work.
ffmpeg-python doesn't quite give you the tools to do this - you want to control one of the arguments to subprocess, stderr, but ffmpeg-python's run() doesn't expose an argument for that.
However, what ffmpeg-python does have, is the ability to show the command line arguments that it would have used. You can make your own call to subprocess after that.
You also don't need to use threads to do this - you can set up each ffmpeg subprocess, without waiting for it to complete, and check in on it each second. This example starts up two ffmpeg instances in parallel, and monitors each one by printing out the most recent line of output from each one every second, as well as tracking if they've exited.
I made two changes for testing:
It gets the stations from a dictionary rather than a CSV file.
It transcodes an MP4 file rather than an audio stream, since I don't have an icecast server. If you want to test it, it expects to have a file named 'sample.mp4' in the same directory.
Both should be pretty easy to change back.
import ffmpeg
import subprocess
import os
import time

stations = [
    {'name': 'foo1', 'input': 'sample.mp4', 'output': 'output.mp4'},
    {'name': 'foo2', 'input': 'sample.mp4', 'output': 'output2.mp4'},
]

class Transcoder():
    def __init__(self, arguments):
        self.arguments = arguments

    def run(self):
        stream = (
            ffmpeg
            .input(self.arguments['input'])
            .output(self.arguments['output'])
        )
        args = stream.compile(overwrite_output=True)
        with open(self.log_name(), 'ab') as logfile:
            self.subproc = subprocess.Popen(
                args,
                stdin=None,
                stdout=None,
                stderr=logfile,
            )

    def log_name(self):
        return self.arguments['name'] + "-ffmpeg.log"

    def still_running(self):
        return self.subproc.poll() is None

    def last_log_line(self):
        with open(self.log_name(), 'rb') as f:
            try:  # catch OSError in case of a one line file
                f.seek(-2, os.SEEK_END)
                while f.read(1) not in [b'\n', b'\r']:
                    f.seek(-2, os.SEEK_CUR)
            except OSError:
                f.seek(0)
            last_line = f.readline().decode()
        last_line = last_line.split('\n')[-1]
        return last_line

    def name(self):
        return self.arguments['name']

transcoders = []
for station in stations:
    t = Transcoder(station)
    t.run()
    transcoders.append(t)

while True:
    for t in list(transcoders):
        if not t.still_running():
            print(f"{t.name()} has exited")
            transcoders.remove(t)
        print(t.name(), repr(t.last_log_line()))
    if len(transcoders) == 0:
        break
    time.sleep(1)

Get local DNS settings in Python

Is there any elegant and cross platform (Python) way to get the local DNS settings?
It could probably be done with a complex combination of modules such as platform and subprocess, but maybe there is already a good module, such as netifaces, which can retrieve it at a low level and save some "reinventing the wheel" effort.
Less ideally, one could probably query something like dig, but I find that "noisy", because it would run an extra request instead of just retrieving something that already exists locally.
Any ideas?
Using subprocess you could do something like this on macOS or a Linux system:
import subprocess

process = subprocess.Popen(['cat', '/etc/resolv.conf'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
print(stdout, stderr)
or do something like this
import subprocess

with open('dns.txt', 'w') as f:
    process = subprocess.Popen(['cat', '/etc/resolv.conf'], stdout=f)
The first output will go to stdout and the second to a file
Maybe this one will solve your problem
import subprocess

def get_local_dns(cmd_):
    with open('dns1.txt', 'w+') as f:
        with open('dns_log1.txt', 'w+') as flog:
            try:
                process = subprocess.Popen(cmd_, stdout=f, stderr=flog)
            except FileNotFoundError as e:
                flog.write(f"Error while executing this command {str(e)}")

linux_cmd = ['cat', '/etc/resolv.conf']
windows_cmd = ['windows_command', 'parameters']
commands = [linux_cmd, windows_cmd]

if __name__ == "__main__":
    for cmd in commands:
        get_local_dns(cmd)
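For the Unix case specifically, spawning cat via subprocess is arguably unnecessary, since /etc/resolv.conf is a plain text file; a minimal sketch that reads it directly:
# Read the resolver configuration directly (Unix-like systems only)
with open('/etc/resolv.conf') as f:
    print(f.read())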
Thanks @MasterOfTheHouse.
I ended up writing my own function. It's not so elegant, but it does the job for now. There's plenty of room for improvement, but well...
import os
import subprocess

def get_dns_settings() -> dict:
    # Initialize the output variables
    dns_ns, dns_search = [], ''
    # For Unix based OSs
    if os.path.isfile('/etc/resolv.conf'):
        for line in open('/etc/resolv.conf', 'r'):
            if line.strip().startswith('nameserver'):
                nameserver = line.split()[1].strip()
                dns_ns.append(nameserver)
            elif line.strip().startswith('search'):
                search = line.split()[1].strip()
                dns_search = search
    # If it is not a Unix based OS, try "the Windows way"
    elif os.name == 'nt':
        cmd = 'ipconfig /all'
        raw_ipconfig = subprocess.check_output(cmd)
        # Convert the bytes into a string
        ipconfig_str = raw_ipconfig.decode('cp850')
        # Convert the string into a list of lines
        ipconfig_lines = ipconfig_str.split('\n')
        for n in range(len(ipconfig_lines)):
            line = ipconfig_lines[n]
            # Parse nameserver in current line and next ones
            if line.strip().startswith('DNS-Server'):
                nameserver = ':'.join(line.split(':')[1:]).strip()
                dns_ns.append(nameserver)
                next_line = ipconfig_lines[n+1]
                # If there's too much blank at the beginning, assume we have
                # another nameserver on the next line
                if len(next_line) - len(next_line.strip()) > 10:
                    dns_ns.append(next_line.strip())
                    next_next_line = ipconfig_lines[n+2]
                    if len(next_next_line) - len(next_next_line.strip()) > 10:
                        dns_ns.append(next_next_line.strip())
            elif line.strip().startswith('DNS-Suffix'):
                dns_search = line.split(':')[1].strip()
    return {'nameservers': dns_ns, 'search': dns_search}

print(get_dns_settings())
By the way... how did you manage to write two answers with the same account?

How do I count the number of lines in an FTP file without downloading it locally using Python

So I need to be able to read and count the number of lines of a file on an FTP server WITHOUT downloading it to my local machine, using Python.
I know the code to connect to the server:
ftp = ftplib.FTP('example.com')  # Object ftp set as server address
ftp.login('username', 'password')  # Login info
ftp.retrlines('LIST')  # List file directories
ftp.cwd('/parent folder/another folder/file/')  # Change file directory
I also know the basic code to count the number of lines if the file is already downloaded/stored locally:
with open('file') as f:
    count = sum(1 for line in f)
    print(count)
I just need to know how to connect these 2 pieces of code without having to download the file to my local system.
Any help is appreciated.
Thank You
As far as I know, FTP doesn't provide any kind of functionality to read a file's content without actually downloading it. However, you could try something like the approach in "Is it possible to read FTP files without writing them using Python?".
(You haven't specified which Python version you are using.)
#!/usr/bin/env python
from ftplib import FTP

def countLines(s):
    # note: retrbinary calls this once per downloaded block, not once for the whole file
    print len(s.split('\n'))

ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', countLines)
Please take this code as a reference only
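If you are on Python 3, a sketch along the same lines that accumulates a total across callbacks (retrlines invokes the callback once per line) might look like this; the host and path are the ones from the example above:
from ftplib import FTP

def count_ftp_lines(host, path):
    count = 0

    def add_line(_line):
        nonlocal count
        count += 1

    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.retrlines('RETR ' + path, add_line)  # one callback per line, nothing written to disk
    return count

print(count_ftp_lines('ftp.kernel.org', '/pub/README_ABOUT_BZ2_FILES'))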
There is a way: I adapted a piece of code that I created for processing CSV files "on the fly". It is implemented with a producer-consumer approach. Applying this pattern lets us assign each task to a thread (or process) and show partial results for huge remote files. You can adapt it for FTP requests.
The download stream is saved in a queue and consumed "on the fly". No extra HDD space is needed and it is memory efficient. Tested on Python 3.5.2 (vanilla) on Fedora Core 25 x86_64.
This is the source, adapted to retrieve the file over HTTP (the same approach works for FTP):
from threading import Thread, Event
from queue import Queue, Empty
import urllib.request, sys, csv, io, os, time
import argparse

FILE_URL = 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'

def download_task(url, chunk_queue, event):
    CHUNK = 1*1024
    response = urllib.request.urlopen(url)
    event.clear()
    print('%% - Starting Download - %%')
    print('%% - ------------------ - %%')
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        chunk = response.read(CHUNK)
        if not chunk:
            print('%% - Download completed - %%')
            event.set()
            break
        chunk_queue.put(chunk)

def count_task(chunk_queue, event):
    part = False
    time.sleep(5)  # give some time to the producer
    M = 0
    contador = 0
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        try:
            # By default queue.get() blocks if the queue is empty.
            # Here block=False is used: when the queue is empty, queue.get() does not block and
            # throws a queue.Empty exception, which is used to show a partial result of the process.
            chunk = chunk_queue.get(block=False)
            for line in chunk.splitlines(True):
                if line.endswith(b'\n'):
                    if part:  # handle the first line of the chunk (normally the tail of the previous one)
                        line = linepart + line
                        part = False
                    M += 1
                else:
                    # if the line does not contain '\n' it is the last line of the chunk:
                    # a partial line that is completed in the next iteration over the next chunk
                    part = True
                    linepart = line
        except Empty:
            # QUEUE EMPTY
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print('Downloading records ...')
            if M > 0:
                print('Partial result: Lines: %d ' % M)  # M-1 because M contains the header
            if event.is_set():  # THE END: no elements in the queue and the download finished (event is set)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print('The consumer has waited %s times' % str(contador))
                print('RECORDS = ', M)
                break
            contador += 1
            time.sleep(1)  # give some time for loading more records

def main():
    chunk_queue = Queue()
    event = Event()
    args = parse_args()
    url = args.url
    p1 = Thread(target=download_task, args=(url, chunk_queue, event,))
    p1.start()
    p2 = Thread(target=count_task, args=(chunk_queue, event,))
    p2.start()
    p1.join()
    p2.join()

# The user of this module can customize one parameter:
# + URL where the remote file can be found.
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', default=FILE_URL,
                        help='remote-csv-file URL')
    return parser.parse_args()

if __name__ == '__main__':
    main()
Usage
$ python ftp-data.py -u <ftp-file>
Example:
python ftp-data-ol.py -u 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
The consumer has waited 0 times
RECORDS = 16327
CSV version on GitHub: https://github.com/AALVAREZG/csv-data-onthefly

Python Flask, Handling Popen poll / wait / communicate without halting multi-threaded Python

The code below is executed on a certain URL (/new...) and assigns variables to the session cookie, which is used to build the display. This example calls a command using subprocess.Popen.
The problem is that the Popen command called below typically takes 3 minutes - and subprocess.communicate waits for the output - during which time all other Flask calls (e.g. another user connecting) are halted. I have some commented lines related to other things I've tried without success - one using the threading module and another using subprocess.poll.
from app import app
from flask import render_template, redirect, session
from subprocess import Popen, PIPE
import threading

@app.route('/new/<number>')
def new_session(number):
    get_list(number)
    #t = threading.Thread(target=get_list, args=(number))
    #t.start()
    #t.join()
    return redirect('/')

def get_list(number):
    # 1 Call JAR, get string
    command = 'java -jar fetch.jar ' + str(number)
    print "Executing " + command
    stream = Popen(command, shell=False, stdout=PIPE)
    #while stream.poll() is None:
    #    print "stream.poll = " + str(stream.poll())
    #    time.sleep(1)
    stdout, stderr = stream.communicate()
    # do some item splits and some processing, left out for brevity
    session['data'] = stdout.split("\r\n")
    return
What's the "better practice" for handling this situation correctly?
For reference, this code is run in Python 2.7.8 on win32, including Flask 0.10.1
First, you should use a work queue like Celery, RabbitMQ or Redis (here is a helpful hint).
Then the get_list function becomes:
@celery.task
def get_list(number):
    command = 'java -jar fetch.jar {}'.format(number)
    print "Executing " + command
    stream = Popen(command, shell=False, stdout=PIPE)
    stdout, stderr = stream.communicate()
    return stdout.split('\r\n')
And in your view, you wait for the result:
@app.route('/new/<number>')
def new_session(number):
    result = get_list.delay(number)
    session['data'] = result.wait()
    return redirect('/')
Now, it doesn't block your view! :)
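For completeness, the celery object used by the @celery.task decorator isn't shown above; a minimal sketch of one way to wire it up (the Redis broker/backend URL and the module name 'app' are assumptions about your setup):
from celery import Celery

celery = Celery('app',
                broker='redis://localhost:6379/0',   # assumed broker URL
                backend='redis://localhost:6379/0')  # a result backend is needed for result.wait()
A separate worker process then has to be running (something like celery -A app.celery worker) for the get_list.delay() calls to be picked up.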

measure data transfer rate with 200 simultaneous users

I have linux (Ubuntu) machine as a client. I want to measure data transfer rate when 200 users try to download files from my server at the same time.
Is there some python or linux tool for this? Or can you recommend an approach?
I saw this speedcheck code, and I can wrap it in threads, but I don't understand why the code there is so "complicated" and the block size changes all the time.
I used Multi-Mechanize recently to run some performance tests. It was fairly easy and worked pretty well.
Not sure if you're talking about an actual dedicated server. For traffic graphs and so on I prefer to use Munin. It is a pretty complete monitoring application which builds you nice graphs using rrdtool. Examples are linked on the munin site: full setup, eth0 traffic graph.
The new munin 2 is even more flashy, but I did not use it yet as it's not in my repos and I don't like to mess with perl applications.
Maybe 'ab' from Apache?
ab -n 1000 -c 200 [http[s]://]hostname[:port]/path
-n Number of requests to perform
-c Number of multiple requests to make at a time
It has many options, http://httpd.apache.org/docs/2.2/programs/ab.html
or man ab
import threading
import time
import urllib2

block_sz = 8192
num_threads = 1
url = "http://192.168.1.1/bigfile2"
secDownload = 30

class DownloadFileThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.u = urllib2.urlopen(url)
        self.file_size_dl = 0

    def run(self):
        while True:
            buffer = self.u.read(block_sz)
            if not buffer:
                raise Exception('There is nothing to read. You should have a bigger file or a smaller time')
            self.file_size_dl += len(buffer)

if __name__ == "__main__":
    print 'Download from url ' + url + ' using ' + str(num_threads) + ' threads. Test time: ' + str(secDownload)
    threads = []
    for i in range(num_threads):
        downloadThread = DownloadFileThread()
        downloadThread.daemon = True
        threads.append(downloadThread)
    for i in range(num_threads):
        threads[i].start()
    time.sleep(secDownload)
    sumBytes = 0
    for i in range(num_threads):
        sumBytes = sumBytes + threads[i].file_size_dl
    print sumBytes
    print str(sumBytes / (secDownload * 1000000.0)) + ' MBps'
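For a Python 3 environment, a rough sketch of the same measurement with 200 concurrent workers might look like this; TEST_URL is a hypothetical large file on your server, and the aggregate byte count over the test window gives an approximate transfer rate:
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://192.168.1.1/bigfile2"   # hypothetical test file on your server
WORKERS = 200
DURATION = 30   # seconds
CHUNK = 8192

def worker(deadline):
    downloaded = 0
    with urllib.request.urlopen(TEST_URL) as resp:
        while time.monotonic() < deadline:
            chunk = resp.read(CHUNK)
            if not chunk:   # file finished early; stop this worker
                break
            downloaded += len(chunk)
    return downloaded

if __name__ == "__main__":
    deadline = time.monotonic() + DURATION
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        totals = list(pool.map(worker, [deadline] * WORKERS))
    total_bytes = sum(totals)
    print(f"{total_bytes / DURATION / 1e6:.2f} MB/s aggregate")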
