I have been trying to debug a program that uses vast amounts of memory, and I have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message at the top of this program. If you set maxlen high enough that it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (the VIRT column), and the usage seems wildly excessive (see the commented lines above for details). From previous experience with my original program, I know that if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the expected increase in memory from maxlen=1500 to maxlen=3000 (1500 extra uint8 arrays of shape [4, 84, 84]) to be:
4 * 84 * 84 * 1500 / (1024**2) ≈ 40MB.
But we see an increase of about 6GB.
There seems to be some sort of interaction between the deque and the tensor allocation: commenting out either one brings memory use back to expected levels. For example, commenting out the tensor line leads to total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try freeing the memory with del tens (deleting the reference to the variable after use) and let me know how it works? (See the memory notes in the PyTorch docs here: https://pytorch.org/docs/stable/notes/faq.html)
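A minimal sketch of that suggestion, applied to the f() from the question (imports and q as in the question; whether it actually helps is exactly what I'd like you to test):

def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
    # ... use tens ...
    del tens  # drop the last reference so the tensor's storage can be reclaimed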
import win32gui

def get_pixel_colour(self, i_x, i_y):
    i_desktop_window_id = win32gui.GetDesktopWindow()
    i_desktop_window_dc = win32gui.GetWindowDC(i_desktop_window_id)
    long_colour = win32gui.GetPixel(i_desktop_window_dc, i_x, i_y)
    i_colour = int(long_colour)
    # COLORREF packs the channels as 0x00BBGGRR
    return (i_colour & 0xff), ((i_colour >> 8) & 0xff), ((i_colour >> 16) & 0xff)
This is basically what I have, but someone else wrote it.
I am aware of this approach, and I have written a version of my code using PIL; it works fine.
However, I am trying to improve the performance of my code.
Is there a way to do this differently? I am trying to avoid PIL and the
color = GetPixel(hDC, p.x, p.y)
function.
Is there an approach called "BitBlt"?
Three things you can do to improve performance:
If you are invoking your get_pixel_colour function in a tight loop (e.g. once for every pixel), then you are incurring redundant overhead by calling GetDesktopWindow and GetWindowDC for each pixel. Fetch the DC once per frame instead of once per pixel.
Further, the function call overhead itself can skew results. I'm not sure how Python optimizes code, but one function call per pixel of an image is a killer as well, so it's often better to inline the code that calls GetPixel directly instead of wrapping it in a function.
Finally, GetPixel is likely a heavyweight call in its own right. A better approach is to blit the entire desktop to a memory buffer once, then iterate over the RGB bytes in that buffer, as sketched below. Look at this answer here: How to get a screen image into a memory buffer?
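Here is a rough sketch of that third approach using BitBlt via the pywin32 GDI wrappers (untested; treat the exact cleanup sequence and the BGRX byte order as assumptions to verify on your machine):

import win32gui, win32ui, win32con, win32api

def grab_desktop_bgrx():
    # Blit the whole desktop into an in-memory bitmap once per frame.
    width = win32api.GetSystemMetrics(win32con.SM_CXSCREEN)
    height = win32api.GetSystemMetrics(win32con.SM_CYSCREEN)
    hdesktop = win32gui.GetDesktopWindow()
    desktop_dc = win32gui.GetWindowDC(hdesktop)
    src_dc = win32ui.CreateDCFromHandle(desktop_dc)
    mem_dc = src_dc.CreateCompatibleDC()
    bmp = win32ui.CreateBitmap()
    bmp.CreateCompatibleBitmap(src_dc, width, height)
    mem_dc.SelectObject(bmp)
    mem_dc.BitBlt((0, 0), (width, height), src_dc, (0, 0), win32con.SRCCOPY)
    pixels = bmp.GetBitmapBits(True)  # raw bytes, typically BGRX, row by row
    # Release the GDI resources.
    mem_dc.DeleteDC()
    src_dc.DeleteDC()
    win32gui.ReleaseDC(hdesktop, desktop_dc)
    win32gui.DeleteObject(bmp.GetHandle())
    return pixels, width, height

pixels, width, height = grab_desktop_bgrx()
# Pixel (x, y) is then four bytes starting at this offset:
x, y = 100, 200
offset = (y * width + x) * 4
b, g, r = pixels[offset], pixels[offset + 1], pixels[offset + 2]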
I'm writing a Python (3.4.3) program that uses VIPS (8.1.1) on Ubuntu 14.04 LTS to read many small tiles using multiple threads and assemble them into one large image.
In a very simple test:

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Lock
from gi.repository import Vips

lock = Lock()
canvas = Vips.Image.black(8000, 1000, bands=3)

def do_work(x):
    global canvas
    img = Vips.Image.new_from_file('part.tif')  # RGB tiff image
    with lock:
        canvas = canvas.insert(img, x * 1000, 0)

with ThreadPoolExecutor(max_workers=8) as executor:
    for x in range(8):
        executor.submit(do_work, x)

canvas.write_to_file('complete.tif')
I get the correct result. In my full program, the work for each thread involves reading binary data from a source file, converting it to TIFF format, reading the image data, and inserting it into the canvas. It seems to work, but when I tried to examine the result I ran into trouble. Because the image is extremely large (~50000*100000 pixels), I couldn't save the entire image in one file, so I tried
canvas = canvas.resize(.5)
canvas.write_to_file('test.jpg')
This takes an extremely long time, and the resulting JPEG has only black pixels. If I resize three times, the program gets killed. I also tried
canvas.extract_area(20000,40000,2000,2000).write_to_file('test.tif')
This results in the error message segmentation fault (core dumped), but it does save an image. There is image content in it, but it seems to be in the wrong place.
I'm wondering what the problem could be?
Below is the code for the complete program. The same logic was also implemented using OpenCV + sharedmem (sharedmem handled the multiprocessing part), and it worked without a problem.
import os
import subprocess
import pickle
from multiprocessing import Lock
from concurrent.futures import ThreadPoolExecutor
import threading
import numpy as np
from gi.repository import Vips

lock = Lock()

def read_image(x):
    with open(file_name, 'rb') as fin:
        fin.seek(sublist[x]['dataStartPos'])
        temp_array = np.fromfile(fin, dtype='int8', count=sublist[x]['dataSize'])
    name_base = os.path.join(rd_path, threading.current_thread().name + 'tempimg')
    with open(name_base + '.jxr', 'wb') as fout:
        temp_array.tofile(fout)
    subprocess.call(['./JxrDecApp', '-i', name_base + '.jxr', '-o', name_base + '.tif'])
    temp_img = Vips.Image.new_from_file(name_base + '.tif')
    with lock:
        global canvas
        canvas = canvas.insert(temp_img, sublist[x]['XStart'], sublist[x]['YStart'])

def assemble_all(filename, ramdisk_path, scene):
    global canvas, sublist, file_name, rd_path, tilesize_x, tilesize_y
    file_name = filename
    rd_path = ramdisk_path
    file_info = fetch_pickle(filename)  # A custom function
    # this info includes where to begin reading image data, image size and coordinates
    tilesize_x = file_info['sBlockList_P0'][0]['XSize']
    tilesize_y = file_info['sBlockList_P0'][0]['YSize']
    sublist = [item for item in file_info['sBlockList_P0'] if item['SStart'] == scene]
    max_x = max([item['XStart'] for item in file_info['sBlockList_P0']])
    max_y = max([item['YStart'] for item in file_info['sBlockList_P0']])
    canvas = Vips.Image.black((max_x + tilesize_x), (max_y + tilesize_y), bands=3)
    with ThreadPoolExecutor(max_workers=4) as executor:
        for x in range(len(sublist)):
            executor.submit(read_image, x)
    return canvas
The above module (imported as mcv) is called in the driver script:
canvas = mcv.assemble_all(filename, ramdisk_path, 0)
To examine the content, I used
canvas.extract_area(25000, 40000, 2000, 2000).write_to_file('test_vips1.jpg')
I think your problem has to do with the way libvips calculates pixels.
In systems like OpenCV, images are huge areas of memory. You perform a series of operations, and each operation modifies a memory image in some way.
libvips is not like this, though the interface looks similar. In libvips, when you perform an operation on an image, you are actually just adding a new section to a pipeline. It's only when you finally connect the output to some sink (a file on disk, or a region of memory you want filled with image data, or an area of the display) that libvips will actually do any calculations. libvips will then use a recursive algorithm to run a large set of worker threads up and down the whole length of the pipeline, evaluating all of the operations you created at the same time.
To make an analogy with programming languages, systems like OpenCV are imperative, libvips is functional.
The good thing about the way libvips does things is that it can see the whole pipeline at once and it can optimise away most of the memory use and make good use of your CPU. The bad thing is that long sequences of operations can need large amounts of stack to evaluate (whereas with systems like OpenCV you are more likely to be bounded by image size). In particular, the recursive system used by libvips to evaluate means that pipeline length is limited by the C stack, about 2MB on many operating systems.
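For example, here is a tiny sketch of that lazy model (file names are placeholders):

import pyvips

img = pyvips.Image.new_from_file('in.tif')  # builds a pipeline; no pixels decoded yet
img = img.invert()                          # appends an operation, still no computation
img = img.gaussblur(5)                      # the pipeline is now three operations deep
img.write_to_file('out.tif')                # connecting a sink triggers evaluation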
Here's a simple test program that does more or less what you are doing:
#!/usr/bin/python3
import sys
import pyvips

if len(sys.argv) < 4:
    print("usage: %s image-in image-out n" % sys.argv[0])
    print("    make an n x n grid of image-in")
    sys.exit(1)

tile = pyvips.Image.new_from_file(sys.argv[1])
outfile = sys.argv[2]
size = int(sys.argv[3])

img = pyvips.Image.black(size * tile.width, size * tile.height, bands=3)
for y in range(size):
    for x in range(size):
        img = img.insert(tile, x * tile.width, y * tile.height)

# we're not interested in huge files for this test, just write a small patch
img.crop(10, 10, 100, 100).write_to_file(outfile)
You run it like this:
time ./bigjoin.py ~/pics/k2.jpg out.tif 2
real 0m0.176s
user 0m0.144s
sys 0m0.031s
It loads k2.jpg (a 2k x 2k JPEG image), repeats that image into a 2 x 2 grid, and saves a small part of it. This program will work well with very large images; try removing the crop and running it as:
./bigjoin.py huge.tif out.tif[bigtiff] 10
and it'll copy the huge tiff image 100 times into a REALLY huge tiff file. It'll be quick and use little memory.
However, this program will become very unhappy with small images being copied many times. For example, on this machine (a Mac), I can run:
./bigjoin.py ~/pics/k2.jpg out.tif 26
But this fails:
./bigjoin.py ~/pics/k2.jpg out.tif 28
Bus error: 10
With a 28 x 28 output, that's 784 tiles. The way we've built the image, repeatedly inserting a single tile, that's a pipeline 784 operations long -- long enough to cause a stack overflow. On my Ubuntu laptop I can get pipelines up to about 2,900 operations long before it starts failing.
There's a simple way to fix this program: build a wide rather than a deep pipeline. Instead of inserting a single image each time, make a set of strips, then join the strips. Now the pipeline depth will be proportional to the square root of the number of tiles. For example:
img = pyvips.Image.black(size * tile.width, size * tile.height, bands=3)
for y in range(size):
    strip = pyvips.Image.black(size * tile.width, tile.height, bands=3)
    for x in range(size):
        strip = strip.insert(tile, x * tile.width, 0)
    img = img.insert(strip, 0, y * tile.height)
Now I can run:
./bigjoin2.py ~/pics/k2.jpg out.tif 200
That's 40,000 images joined together.
I'm finding that in PIL I can load an image from disk substantially more quickly than I can copy it. Is there a faster way to copy an image than by calling image.copy()? (and how is this even possible?)
Sample code:
import os, PIL.Image, timeit
test_filepath = os.path.expanduser("~/Test images/C.jpg")
load_image_cmd = "PIL.Image.open('{}')".format(test_filepath)
print((PIL.Image.open(test_filepath)).__class__)
print(min(timeit.repeat(load_image_cmd, setup='import PIL.Image', number=10000)))
print(min(timeit.repeat("img.copy()", setup='import PIL.Image; img = {}'.format(load_image_cmd), number=10000)))
Produces:
PIL.JpegImagePlugin.JpegImageFile
0.916192054749
1.85366988182
Adding gc.enable() to the setup for timeit doesn't change things much.
According to the PIL documentation, open() is a lazy operation, which means that it's not really doing all the work to use the image yet.
To do a copy(), however, it almost certainly has to read the whole thing in and process it.
EDIT:
To test whether this is true, you should access a pixel in each image as part of your timeit.
EDIT 2:
Another glance at the docs shows that calling load() after the open() ought to do the trick of making it do all its work.
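For example, something like this should time open() plus a full load() (an untested sketch reusing the file path from your question):

import os, PIL.Image, timeit

test_filepath = os.path.expanduser("~/Test images/C.jpg")
# load() forces the full decode, so open()+load() is a fair comparison
# against copy() on an already-loaded image.
stmt = "img = PIL.Image.open('{}'); img.load()".format(test_filepath)
print(min(timeit.repeat(stmt, setup='import PIL.Image', number=10000)))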
I have a simple Python (3.3) application in which the processing time between iterations gets longer with each iteration. I believe I've isolated the problem to the bytes() function, as you will see below. The code looks like it should do a constant amount of work per iteration, i.e. run in linear time overall. However, when run, it actually behaves closer to quadratic time. I am providing two blocks of code. The first block is the offending one, which behaves something like n^2 time. The second block is a small refactor that removes bytes() from handling the input. Here's the first block:
import hashlib
def gethash(data):
    return hashlib.sha1(data).hexdigest()

body = b"blob 5\01234"
for i in range(1, 100000):
    hashout = gethash(body+bytes(i))
    if(i%1000==0):
        print(".")
(Linux time utility output for this block): real 0m12.742s - user 0m12.188s - sys 0m0.536s
And here's the second block, refactored to exclude the use of bytes(), which has a constant processing time per iteration (linear overall, as expected):
import hashlib
def gethash(data):
    return hashlib.sha1(data.encode('utf-8')).hexdigest()

body = "blob 5\01234"
for i in range(1, 100000):
    hashout = gethash(body+str(i))
    if(i%1000==0):
        print(".")
(Linux time utility output for this block): real 0m0.305s - user 0m0.296s - sys 0m0.008s
I am using Linux Mint 16 (kernel 3.11.0-12-generic on x86_64) with Python 3.3.2. I have simplified this code greatly to highlight the central issue. I primarily use Python 3 and can write somewhat non-trivial applications in it, but I cannot claim to have a Pythonic mindset. With that in mind, I tried running both of these code blocks in Python 2.7.5, and both iterate "normally", taking a constant amount of time per iteration (linear overall). I don't know enough about that version and its functions to know whether this is meaningful. Thanks!
bytes(i) creates an i-length bytes object full of zeros, so each call builds and hashes an ever-longer buffer; that is where the quadratic behaviour comes from. See the bytearray documentation; constructor arguments for bytes are interpreted the same way. (In Python 2, bytes is simply an alias for str, so bytes(i) is just the decimal string, which is why both blocks run in linear time there.)
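A quick demonstration in a Python 3 interpreter:

>>> bytes(5)                # five zero bytes, not b"5"
b'\x00\x00\x00\x00\x00'
>>> str(5).encode('utf-8')  # the digit as a byte string
b'5'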
To fix this, encode a string:
hashout = gethash(body+str(i).encode('utf-8'))
I've picked UTF-8 because you did, but the proper encoding may depend on the context you want to use it in.