I'm writing a program that has to download a bunch of files from the web before it can even run, so I created a function called init_program that downloads all the files and "initializes" the program. It works by running through a couple of dicts that hold URLs to gist files on GitHub; it pulls the URLs and uses urllib2 to download them. I won't be able to include all the files here, but you can try it out by cloning the repo here. Here's the function that creates the files from the gists:
def init_program():
    """ Initialize the program and allow all the files to be downloaded.
    This will take a while to process, but I'm working on the processing speed. """
    downloaded_wordlists = []  # Used to count the number of wordlists downloaded
    downloaded_rainbow_tables = []
    print("\n")
    banner("Initializing program and downloading files, this may take a while..")
    print("\n")
    # INIT_FILE contains "false" if the program is not initialized
    # and "true" if the program is initialized
    with open(INIT_FILE) as data:
        if data.read() == "false":
            for item in GIST_DICT_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(
                    len(downloaded_wordlists) + 1, len(GIST_DICT_LINKS.keys())))
                sys.stdout.flush()
                new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+")
                # Download the wordlist and save it into the file
                wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item])
                new_wordlist.write(wordlist_data.read())
                downloaded_wordlists.append(item + ".txt")
                new_wordlist.close()
            print("\n")
            banner("Done with wordlists, moving to rainbow tables..")
            print("\n")
            for table in GIST_RAINBOW_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(
                    len(downloaded_rainbow_tables) + 1, len(GIST_RAINBOW_LINKS.keys())))
                sys.stdout.flush()
                # "a+" mode is needed so the file can be written to
                new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table), "a+")
                # Download the rainbow table and save it into the file
                rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table])
                new_rainbowtable.write(rainbow_data.read())
                downloaded_rainbow_tables.append(table + ".rtc")
                new_rainbowtable.close()
            # Mark the program as initialized so this never runs again
            with open(INIT_FILE, "w") as init_flag:
                init_flag.write("true")
    return downloaded_wordlists, downloaded_rainbow_tables
This works, yes, but it's extremely slow due to the size of the files; each file has at least 100,000 lines in it. How can I speed this function up and make it more user friendly?
Some weeks ago I faced a similar situation where I needed to download many huge files, and none of the simple pure-Python solutions I found were good enough in terms of download speed. So I found Axel, a light command-line download accelerator for Linux and Unix.
What is Axel?
Axel tries to accelerate the downloading process by using multiple
connections for one file, similar to DownThemAll and other famous
programs. It can also use multiple mirrors for one download.
Using Axel, you will get files faster from Internet. So, Axel can
speed up a download up to 60% (approximately, according to some
tests).
Usage: axel [options] url1 [url2] [url...]
--max-speed=x -s x Specify maximum speed (bytes per second)
--num-connections=x -n x Specify maximum number of connections
--output=f -o f Specify local output file
--search[=x] -S [x] Search for mirrors and download from x servers
--header=x -H x Add header string
--user-agent=x -U x Set user agent
--no-proxy -N Just don't use any proxy server
--quiet -q Leave stdout alone
--verbose -v More status information
--alternate -a Alternate progress indicator
--help -h This information
--version -V Version information
Since Axel is written in C and there's no C extension for Python, I used the subprocess module to execute it externally, and it works perfectly for me.
You can do something like this:
import subprocess

cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o',
       "{0}".format(filename), url]
process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
You can also track the progress of each download by parsing axel's stdout output:
while True:
    line = process.stdout.readline()
    progress = YOUR_GREAT_REGEX.match(line).groups()
    ...
You're blocking whilst you wait for each download, so the total time is the sum of the times for the individual downloads, and your code will likely spend most of that time waiting on network traffic. One way to improve this is not to block whilst you wait for each response. You can do this in several ways, for example by handing off each request to a separate thread (or process), or by using an event loop and coroutines. Read up on the threading and asyncio modules.
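For example, here is a minimal sketch of the threaded approach using concurrent.futures, written for Python 3 (urllib.request replaces the question's urllib2; on Python 2 you would need the "futures" backport), and assuming the same GIST_DICT_LINKS dict and target directory as in the question:

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch_wordlist(name, url):
    # One blocking download; many of these run at once in the pool below.
    data = urlopen(url).read()
    with open("dicts/included_dicts/wordlists/{}.txt".format(name), "wb") as fh:
        fh.write(data)
    return name + ".txt"

def download_all(links, max_workers=8):
    downloaded = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_wordlist, name, url): name
                   for name, url in links.items()}
        for future in as_completed(futures):
            downloaded.append(future.result())  # results arrive as downloads finish
    return downloaded

# Usage, with GIST_DICT_LINKS being the dict of URLs from the question:
# downloaded_wordlists = download_all(GIST_DICT_LINKS)

The downloads now overlap instead of running one after another, so the total time is closer to that of the slowest single file rather than the sum of all of them.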
Related
I have a website (a WordPress site on Ubuntu with an Apache server) with special math calculators, many of which use python3 scripts to do the main calculations. The flow of data in these calculators is as follows:
1.) User inputs numbers into html form, then hits submit button.
2.) PHP function is called, it assigns html user inputs to variables and does exec() on applicable python3 file with those variables (the user inputs are filtered and escapeshellarg is used so all good here).
3.) PHP function returns result of python3 script which is displayed via shortcode on the calculator web page.
The issue I am having is that occasionally the symbolic and numeric computations within my python scripts will hang up indefinitely. As that python3 process keeps running, it starts to use massive CPU and memory resources (big problem during peak traffic hours).
My question is this: is there some way to make a script or program on my server's backend that will kill a process instance of python3 if it has exceeded an arbitrary runtime and CPU usage level? I would like to restrict it only to instances of python3 so that it can't kill something like mysqld. Also, I am OK if it only uses runtime as a kill condition. None of my python scripts should run longer than ~10 seconds under normal circumstances and CPU usage will not be an issue if they don't run longer than 10 seconds.
You can create another python script to serve as a health checker on your server based on the psutil and os modules.
The following code could serve as a base for your specific needs. It looks up the PIDs of the Python scripts named in the script_name_list variable, and if your server's CPU usage is above a threshold, or the available memory is below a threshold, it kills them.
#!/usr/bin/env python3
import psutil
import os
import signal

CPU_LIMIT = 80      # Change me
AV_MEM = 500.0      # Change me
script_name_list = ['script1']  # Put in the names of the scripts

def find_other_scripts_pid(script_list):
    pid_list = []
    for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
        # Skip the process running this script itself, then check whether the
        # process is a python/python3 interpreter running one of the scripts to kill
        if proc.info['pid'] != os.getpid() and proc.info['name'] in ['python', 'python3']:
            for element in proc.info['cmdline']:
                for script_name in script_list:
                    if script_name in element:
                        pid_list.append(proc.info['pid'])
    return pid_list

def kill_process(pid):
    if psutil.pid_exists(pid):
        os.kill(pid, signal.SIGKILL)
    return None

def check_cpu():
    return psutil.cpu_percent(interval=1)

def check_available_memory():
    mem = psutil.virtual_memory()
    return mem.available / (2 ** 20)  # available memory in MB

def main():
    cpu_usage = check_cpu()
    av_memory_mb = check_available_memory()
    if cpu_usage > CPU_LIMIT or av_memory_mb < AV_MEM:
        pid_list = find_other_scripts_pid(script_name_list)
        for pid in pid_list:
            kill_process(pid)

if __name__ == "__main__":
    main()
You can then run this script periodically on your server using a crontab, as explained in this post shared within the community.
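For example, a crontab entry along these lines (the script path is just a placeholder) would run the check every minute:

* * * * * /usr/bin/python3 /path/to/health_check.py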
I am using multiprocessing to calculate a large mass of data; i.e. I periodically spawn a process so that the total number of processes is equal to the number of CPUs on my machine.
I periodically print out the progress of the entire calculation... but this is inconveniently interspersed with Python's welcome messages from each child!
To be clear, this is a Windows specific problem due to how multiprocessing is handled.
E.g.
> python -q my_script.py
Python Version: 3.7.7 on Windows
Then many subsequent duplicates of the same version message print; one for each child process.
How can I suppress these?
I understand that if you run Python on the command line with a -q flag, it suppresses the welcome message; though I don't know how to translate that into my script.
EDIT:
I tried to include the interpreter flag -q like so:
multiprocessing.set_executable(sys.executable + ' -q')
Yet to no avail. I receive a FileNotFoundError which tells me I cannot pass options this way due to how they check arguments.
Anyways, here is the relevant section of code (It's an entire function):
def _parallelize(self, buffer, func, cpus):
    ## Number of Parallel Processes ##
    cpus_max = mp.cpu_count()
    cpus = min(cpus_max, cpus) if cpus else int(0.75 * cpus_max)
    ## Total Processes to-do ##
    N = ceil(self.SampleLength / DATA_MAX)  # Number of Child Processes
    print("N: ", N)
    q = mp.Queue()  # Child Process results Queue
    ## Initialize each CPU w/ a Process ##
    for p in range(min(cpus, N)):
        mp.Process(target=func, args=(p, q)).start()
    ## Collect Validation & Start Remaining Processes ##
    for p in tqdm(range(N)):
        n, data = q.get()               # Collects a Result
        i = n * DATA_MAX                # Shifts to Proper Interval
        buffer[i:i + len(data)] = data  # Writes to open HDF5 file
        if p < N - cpus:                # Starts a new Process
            mp.Process(target=func, args=(p + cpus, q)).start()
SECOND EDIT:
I should probably mention that I'm doing everything within an anaconda environment.
The message is printed on interactive startup.
A spawned process does inherit some flags from the parent process.
But looking at the code in multiprocessing it does not seem possible to change these parameters from within the program.
So the easiest way to get rid of the messages should be to add the -q option to the original python invocation that starts your program.
I have confirmed that the -q flag is inherited.
So that should suppress the message for the original process and the children that it spawns.
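One quick way to check this yourself is to have a spawned child report the interpreter flags it inherited; for example, a small test script (run it as python -q check_flags.py):

import multiprocessing as mp
import sys

def show_flags():
    # In the child, quiet should be 1 if the parent's -q flag was propagated.
    print("child quiet flag:", sys.flags.quiet)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    p = mp.Process(target=show_flags)
    p.start()
    p.join()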
Edit:
If you look at the implementation of set_executable, you will see that you cannot add or change arguments that way. :-(
Edit2:
You wrote:
I'm doing everything within an anaconda environment.
Do you mean a virtual environment, or some kind of fancy IDE like Spyder?
If you ever have a Python problem, first try reproducing it in plain CPython, running from the command line. IDE's and fancy environments like anaconda sometimes do weird things when running Python.
(I'm a newbie when it comes to ffmpeg).
I have an image source which saves files to a given folder at a rate of 30 fps. I want to wait for every chunk of (let's say) 30 frames, encode it to h264, and stream it over RTP to some other app.
I thought about writing a python app which just waits for the images, and then executes an ffmpeg command. For that I wrote the following code:
main.py:
import os
import Helpers
import argparse
import IniParser
import subprocess
from functools import partial
from Queue import Queue
from threading import Semaphore, Thread

def Run(config):
    os.chdir(config.Workdir)
    iteration = 1
    q = Queue()
    Thread(target=RunProcesses, args=(q, config.AllowedParallelRuns)).start()
    while True:
        Helpers.FileCount(config.FramesPathPattern, config.ChunkSize * iteration)
        command = config.FfmpegCommand.format(startNumber=(iteration - 1) * config.ChunkSize,
                                              vFrames=config.ChunkSize)
        runFunction = partial(subprocess.Popen, command)
        q.put(runFunction)
        iteration += 1

def RunProcesses(queue, semaphoreSize):
    semaphore = Semaphore(semaphoreSize)
    while True:
        runFunction = queue.get()
        Thread(target=HandleProcess, args=(runFunction, semaphore)).start()

def HandleProcess(runFunction, semaphore):
    semaphore.acquire()
    p = runFunction()
    p.wait()
    semaphore.release()

if __name__ == '__main__':
    argparser = argparse.ArgumentParser()
    argparser.add_argument("config", type=str, help="Path for the config file")
    args = argparser.parse_args()
    iniFilePath = args.config
    config = IniParser.Parse(iniFilePath)
    Run(config)
Helpers.py (not really relevant):
import os
import time
from glob import glob

def FileCount(pattern, count):
    count = int(count)
    lastCount = 0
    while True:
        currentCount = glob(pattern)
        if lastCount != currentCount:
            lastCount = currentCount
            if len(currentCount) >= count and all([CheckIfClosed(f) for f in currentCount]):
                break
        time.sleep(0.05)

def CheckIfClosed(filePath):
    try:
        os.rename(filePath, filePath)
        return True
    except:
        return False
I used the following config file:
Workdir = "C:\Developer\MyProjects\Streaming\OutputStream\PPM"
; Workdir is the directory of reference from which all paths are relative to.
; You may still use full paths if you wish.
FramesPathPattern = "F*.ppm"
; The path pattern (wildcards allowed) where the rendered images are stored to.
; We use this pattern to detect how many rendered images are available for streaming.
; When a chunk of frames is ready - we stream it (or store to disk).
ChunkSize = 30 ; Number of frames for bulk.
; ChunkSize sets the number of frames we need to wait for, in order to execute the ffmpeg command.
; If the folder already contains several chunks, it will first process the first chunk, then second, and so on...
AllowedParallelRuns = 1 ; Number of parallel allowed processes of ffmpeg.
; This sets how many parallel ffmpeg processes are allowed.
; If more than one chunk is available in the folder for processing, we will execute several ffmpeg processes in parallel.
; Only when one of the processes finishes will we allow another process to execute.
FfmpegCommand = "ffmpeg -re -r 30 -start_number {startNumber} -i F%08d.ppm -vframes {vFrames} -vf vflip -f rtp rtp://127.0.0.1:1234" ; Command to execute when a bulk is ready for streaming.
; Once a chunk is ready for processing, this is the command that will be executed (same as running it from the terminal).
; There is however a minor difference. Since every chunk starts with a different frame number, you can use the
; expression of "{startNumber}" which will automatically takes the value of the matching start frame number.
; You can also use "{vFrames}" as an expression for the ChunkSize which was set above in the "ChunkSize" entry.
Please note that if I set "AllowedParallelRuns = 2" then it allows multiple ffmpeg processes to run simultaneously.
I then tried to play it with ffplay and see if I'm doing it right.
The first chunk was streamed fine. The following chunks weren't so great. I got a lot of [sdp # 0000006de33c9180] RTP: dropping old packet received too late messages.
What should I do so I get the ffplay, to play it in the order of the incoming images? Is it right to run parallel ffmpeg processes? Is there a better solution to my problem?
Thank you!
As I stated in the comment, since you rerun ffmpeg each time, the PTS values are reset, but the client perceives this as a single continuous ffmpeg stream and thus expects increasing PTS values.
As I said, you could use an ffmpeg Python wrapper to control the streaming yourself, but that is quite a lot of code. There is, however, a dirty workaround.
Apparently there is an -itsoffset parameter with which you can offset the input timestamps (see the FFmpeg documentation). Since you know and control the rate, you could pass an increasing value with this parameter so that each subsequent stream is offset by the proper duration. E.g. if you stream 30 frames each time and you know the fps is 30, those 30 frames span one second, so on each call to ffmpeg you would increase the -itsoffset value by one second, which should be added to the output PTS values. But I can't guarantee this works.
Since the idea about -itsoffset did not work, you could also try feeding the jpeg images via stdin to ffmpeg - see this link.
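The key point of the stdin approach is that you keep a single long-lived ffmpeg process, so the PTS values keep increasing across chunks. A rough sketch (the paths, the JPEG glob pattern and some flags are illustrative and not tested against your setup; depending on the image format you may need to add a decoder option such as -c:v mjpeg before -i):

import glob
import subprocess

cmd = [
    "ffmpeg",
    "-re",                  # read input at its native frame rate
    "-f", "image2pipe",     # demux a stream of images arriving on stdin
    "-framerate", "30",
    "-i", "-",              # "-" means read from stdin
    "-vf", "vflip",
    "-f", "rtp", "rtp://127.0.0.1:1234",
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

# One long-lived ffmpeg process, so PTS values keep increasing across chunks.
for frame_path in sorted(glob.glob("*.jpg")):
    with open(frame_path, "rb") as f:
        proc.stdin.write(f.read())

proc.stdin.close()
proc.wait()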
I have a python script that needs to load a large file from disk into a variable. This takes a while. The script will be called many times by another application (still unknown), with different options, and its stdout will be used. Is there any possibility to avoid reading the large file on every single call of the script?
I guess I could have one large script running in the background that holds the variable. But then, how can I call the script with different options and read its stdout from another application?
Make it a (web) microservice: formalize all the different CLI arguments as HTTP endpoints and send requests to it from the main application.
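A minimal sketch of that idea using only the standard library (the pickle path, port, and query parameter names are placeholders for illustration; the expensive file is loaded exactly once at startup):

import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Load the expensive resource once, when the service starts.
with open("/tmp/big_data.pkl", "rb") as f:
    BIG_DATA = pickle.load(f)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each former CLI option becomes a query parameter, e.g. /run?option=foo
        params = parse_qs(urlparse(self.path).query)
        option = params.get("option", [""])[0]
        result = {"option": option, "records": len(BIG_DATA)}  # placeholder work
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

The calling application then issues HTTP requests instead of launching the script, and the large file stays in memory between calls.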
(I misunderstood the original question, but the first answer I wrote offers a different solution which might be useful to someone facing that scenario, so I am keeping it as is and proposing a second solution.)
For a single machine, OS-provided pipes are the best solution for what you are looking for.
Essentially you create a forever-running Python process which reads from a pipe, processes the commands entering the pipe, and then prints to stdout.
Reference: http://kblin.blogspot.com/2012/05/playing-with-posix-pipes-in-python.html
From the above-mentioned source:
Workload
In order to simulate my workload, I came up with the following simple script called pipetest.py that takes an output file name and then writes some text into that file.
#!/usr/bin/env python
import sys

def main():
    pipename = sys.argv[1]
    with open(pipename, 'w') as p:
        p.write("Ceci n'est pas une pipe!\n")

if __name__ == "__main__":
    main()
The Code
In my test, this "file" will be a FIFO created by my wrapper code. The implementation of the wrapper code is as follows, I will go over the code in detail further down this post:
#!/usr/bin/env python
import tempfile
import os
from os import path
import shutil
import subprocess

class TemporaryPipe(object):
    def __init__(self, pipename="pipe"):
        self.pipename = pipename
        self.tempdir = None

    def __enter__(self):
        self.tempdir = tempfile.mkdtemp()
        pipe_path = path.join(self.tempdir, self.pipename)
        os.mkfifo(pipe_path)
        return pipe_path

    def __exit__(self, type, value, traceback):
        if self.tempdir is not None:
            shutil.rmtree(self.tempdir)

def call_helper():
    with TemporaryPipe() as p:
        script = "./pipetest.py"
        subprocess.Popen(script + " " + p, shell=True)
        with open(p, 'r') as r:
            text = r.read()
        return text.strip()

def main():
    call_helper()

if __name__ == "__main__":
    main()
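The quoted example tears the pipe down after a single call; for the long-running scenario in the question, the reader side would instead loop forever on a fixed FIFO, roughly like this (the FIFO path, data file name, and the "handle the command" step are placeholders):

import os

PIPE_PATH = "/tmp/cmd_pipe"
if not os.path.exists(PIPE_PATH):
    os.mkfifo(PIPE_PATH)

# Expensive load, done once for the lifetime of the server process.
big_data = open("large_file.dat", "rb").read()

while True:
    # Opening a FIFO for reading blocks until a writer connects;
    # when the writer closes its end we reopen and wait for the next one.
    with open(PIPE_PATH, "r") as pipe:
        for command in pipe:
            command = command.strip()
            if not command:
                continue
            # Handle the command using the preloaded data; here we just report sizes.
            print("got %r, data size %d" % (command, len(big_data)))

The calling application writes its "options" as lines into /tmp/cmd_pipe and reads the results from wherever the server chooses to write them.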
Since you can already read the data into a variable, you might consider memory-mapping the file using mmap. This is safe if multiple processes are only reading it; supporting a writer would require a locking protocol.
Assuming you are not familiar with memory-mapped objects, I'll wager you use them every day: this is how the operating system loads and maintains executable files. Essentially your file becomes part of the paging system, although it does not have to be in any special format.
When you read a file into memory it is unlikely that it is all loaded into RAM; it will be paged out when "real" RAM becomes over-subscribed. Often this paging is a considerable overhead. A memory-mapped file is just your data "ready paged". There is no overhead in reading it into memory (virtual memory, that is); it is there as soon as you map it.
When you try to access the data a page fault occurs and a subset (page) is loaded into RAM - all done by the operating system, the programmer is unaware of this.
While a file remains mapped it is connected to the paging system. Another process mapping the same file will access the same object, provided changes have not been made (See MAP_SHARED).
It needs a daemon to keep the memory-mapped object current in the kernel, but other than creating the object linked to the physical file, it does not need to do anything else; it can sleep or wait on a shutdown signal.
Other processes open the file (use os.open()) and map the object.
See the examples in the documentation, here and also Giving access to shared memory after child processes have already started
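A bare-bones sketch of the read-only case (the file name is a placeholder):

import mmap
import os

fd = os.open("data.bin", os.O_RDONLY)
try:
    # Map the whole file; pages are loaded lazily by the OS on first access.
    mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    header = mm[:16]   # touching a slice faults in only those pages
    print(len(mm), header)
    mm.close()
finally:
    os.close(fd)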
You can store the processed values in a file, and then read the values from that file in another script.
>>> import pickle as p
>>> mystr="foobar"
>>> p.dump(mystr,open('/tmp/t.txt','wb'))
>>> mystr2=p.load(open('/tmp/t.txt','rb'))
>>> mystr2
'foobar'
I am looking for a nice pattern for Python Hadoop streaming that involves loading an expensive resource, for example a pickled Python object, on the server. Here is what I came up with; I've tested it by piping input files and slow-running programs directly into the script in bash, but haven't yet run it on a Hadoop cluster. For you Hadoop wizards: am I handling IO such that this will work as a Python streaming job? I guess I'll go spin up something on Amazon to test, but it would be nice if someone knew off the top of their head.
You can test it out via cat file.txt | the_script or ./a_streaming_program | the_script
#!/usr/bin/env python
import sys
import time

def resources_for_many_lines():
    # load slow, shared resources here
    # for example, a shared pickle file
    # in this example we use a 1 second sleep to simulate
    # a long data load
    time.sleep(1)
    # we will pretend the value zero is the product
    # of our long slow running import
    resource = 0
    return resource

def score_a_line(line, resources):
    # put fast code to score a single example line here
    # in this example we will return the value of resource + 1
    return resources + 1

def run():
    # here is the code that reads stdin and scores the model over a streaming data set
    resources = resources_for_many_lines()
    while 1:
        # reads a line of input
        line = sys.stdin.readline()
        # ends if pipe closes
        if line == '':
            break
        # scores a line
        print score_a_line(line, resources)
        # prints right away instead of waiting
        sys.stdout.flush()

if __name__ == "__main__":
    run()
This looks fine to me. I often load up yaml or sqlite resources in my mappers.
You typically won't be running that many mappers in your job, so even if you spend a couple of seconds loading something from disk, it's usually not a huge problem.
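For instance, the resources_for_many_lines() placeholder in the question could load a real pickled object once per mapper, along these lines (model.pkl is just an illustrative name for whatever ships with your job):

import pickle

def resources_for_many_lines():
    # Load the shared resource once per mapper process; every line this
    # mapper scores afterwards reuses the same in-memory object.
    with open("model.pkl", "rb") as f:
        return pickle.load(f)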