Is there any way with Python to directly get (only get, no modify) a single pixel (to get its RGB color) from an image (compressed format if possible) without having to load it in RAM nor processing it (to spare the CPU)?
More details:
My application is meant to have a huge database of images, and only of images.
So what I chose is to directly store images on harddrive, this will avoid the additional workload of a DBMS.
However I would like to optimize some more, and I'm wondering if there's a way to directly access a single pixel from an image (the only action on images that my application does), without having to load it in memory.
Does PIL pixel access allow that? Or is there another way?
The encoding of images is my own choice, so I can change whenever I want. Currently I'm using PNG or JPG. I can also store in raw, but I would prefer to keep images a bit compressed if possible. But I think harddrives are cheaper than CPU and RAM, so even if images must stay RAW in order to do that, I think it's still a better bet.
Thank you.
UPDATE
So, as I feared, it seems that it's impossible to do with variable compression formats such as PNG.
I'd like to refine my question:
Is there a constant compression format (not necessarily specific to an image format, I'll access it programmatically), which would allow to access any part by just reading the headers?
Technically, how to efficiently (read: fast and non blocking) access a byte from a file with Python?
SOLUTION
Thank's to all, I have successfully implemented the functionality I described by using run-length encoding on every row, and padding every row to the same length of the maximum row.
This way, by prepeding a header that describes the fixed number of columns for each row, I could easily access the row using first a file.readline() to get the headers data, then file.seek(headersize + fixedsize*y, 0) where y is the row currently selected.
Files are compressed, and in memory I only fetch a single row, and my application doesn't even need to uncompress it because I can compute where the pixel is exactly by just iterating over every RLE values. So it is also very easy on CPU cycles.
If you want to keep a compressed file format, you can break each image up into smaller rectangles and store them separately. Using a fixed size for the rectangles will make it easier to calculate which one you need. When you need the pixel value, calculate which rectangle it's in, open that image file, and offset the coordinates to get the proper pixel.
This doesn't completely optimize access to a single pixel, but it can be much more efficient than opening an entire large image.
In order to evalutate a file you have to load into memory. However, you might be able to figure out how to read only parts of a file, depending on the file format. For example the PNG file specifies a header of size of 8 bytes. However, because of compression the chunks are variable. But if you would store all the pixels in a raw format, you can directly access each pixel, because you can calculate the adress of the file and the appropriate offset. What PNG, JPEG is going to do with the raw data is impossible to predict.
Depending on the structure of the files you might be able to compute efficient hashes. I suppose there is loads of research, if you want to really get into this, for example: Link
"This paper introduces a novel image indexing technique that may be called an image hash function. The algorithm uses randomized signal processing strategies for a non-reversible compression of images into random binary strings, and is shown to be robust against image changes due to compression, geometric distortions, and other attacks"
Related
I am looking for some high level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16 bit grayscale) in the directory. After being read in, the application will then process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processed, the user should be able be able to then go back and explore any of the matrices they choose (or multiple at once), and possibly apply further processing.
My question is regarding a sensible way to store the matrices in memory, and access them when needed. After reading in the raw .png files and obatining the corresponding matrix of floats, I can then see the following options for handling the result:
Simply store each matrix as a numpy array and have every one of them stored in a class attribute. That way they will all be easily accessible to the code when requested by the user, but will this be poor in terms of RAM required?
After processing each, write out the matrix to a text file, and read it back in from the text file when requested by the user.
I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using MVC pattern), and then query the database when you need access to data. This seems like it would have the advantage that data is not stored in RAM by the "model" part of the application (like in option 1), and is possibly more storage-efficient than option 2, but is this suitable given that my data are matrices?
I have seen examples (see here) of people using something called HDF5 for storing application data, and that this might be similar to using a SQLite database? Again, suitable for matrices?
Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
I am a little lost with all the options, and don't want to spend too much time investigating all of them in too much detail so would appreciate some general advice. If someone could offer comments on each of the options I have described (as well as letting me know if any can be ruled out in this situation) that would be great!
Thank you
Update:
After further reading I found out that the compression is not only not stored in the image header but also that the quality number is basically meaningless because different programs have a different approach to it. I'm still leaving this here in case anybody has a very clever solution to the problem.
I did write a log and have an approach writing the time I last compressed an image comparing it to the last modified time on the file, though I still would like to solve the problem of getting the quality for other images that I can't just get by time modified.
Original question:
I have a small Python script that walks trough certain folders and compresses certain .jpg files to 60 quality utilizing the pillow library.
It works however in a second iteration it would compress all the images that already were compressed.
So is there a way toe get the compression or quality the .jpg currently has so I can skip the file if it was already compressed?
import os
from os.path import join, getsize
from PIL import Image
start_directory = ".\\test"
for root, dirs, files in os.walk(start_directory):
for f in files:
try:
with Image.open(join(root, f)) as im:
if im.quality > 60: # <= something like this
im.save(join(root, f),optimize=True,quality=60)
im.close() # <= not sure if necessary
except IOError:
pass
There is a way this might be solved. You could use bits per pixel to check your jpeg compression rate.
As I understand it, the JPEG "quality" metric is a number from 0-100 where 100 represents lossless and 0 is maximum compression. As you say, the actual compression details will vary from encoder to encoder and (probably) from pixel block to pixel block (macroblocks?). Applying 60 to an image will reduce the image quality but it will also reduce the file size, presumably to 60% of its original size.
All your images will probably have different dimensions. What you want to look at is bits per pixel.
The question is: why do you compress at quality factor 60 in the first place? Is it correct for all your images? Are you trying to achieve a specific file size? Or are you happy just to make them all smaller?
If you, instead, aim for a specific number of bits per pixel, then your check just becomes a calculation of file size divided by number of pixels in the image and check this against your desired bits per pixel. If it’s bigger, apply compression.
Of course, you’ll then have to be slightly more clever than selecting quality factor 60. Presumably 60 either means 60% of original file size or it’s some internal setting. If it’s the former, you can calculate a new value simply enough. If it’s the latter, you may need to use trial and improvement to get the desired file size.
You are asking the impossible here. JPEG "quality" depends upon two factors. First, the sampling of the components. Second, the selection of quantization tables.
Some JPEG encoders have "quality" settings but these could do anything. Some use ranges of 0 .. 100 Others use 0..4. I've seen 0..8. I've seen :"high", "Medium", "Low."
You could conceivably look at the sampling rates in the frame and you could compare the quantization tables and compare them to some baseline tables of your own to make a "quality" evaluation.
I have a huge sequence (1000000) of small matrices (32x32) stored in a hdf5 file, each one with a label.
Each of this matrices represent a sensor data for a specific time.
I want to obtain the evolution for each pixel in for a small time slice, different for each x,y position in the matrix.
This is taking more time than I expect.
def getPixelSlice (self,xpixel,ypixel,initphoto,endphoto):
#obtain h5 keys inside time range between initphoto and endphoto
valid=np.where(np.logical_and(self.photoList>=initphoto,self.photoList<endphoto))
#look at pixel data in valid frames
evolution = []
#for each valid frame, obtain the data, and append the target pixel to the list.
for frame in valid[0]:
data = self.h5f[str(self.photoList[frame])]
evolution.append(data[ypixel][xpixel])
return evolution,valid
So, there is a problem here that took me a while to sort out for a similar application. Due to the physical limitations of hard drives, the data are stored in such a way that with a three dimensional array it will always be easier to read in one orientation than another. It all depends on what order you stored the data in.
How you handle this problem depends on your application. My specific application can be characterized as "write few, read many". In this case, it makes the most sense to store the data in the order that I expect to read it. To do this, I use PyTables and specify a "chunkshape" that is the same as one of my timeseries. So, in your case it would be (1,1,1000000). I'm not sure if that size is too large or not, though, so you may need to break it down a bit farther, say (1,1,10000) or something like that.
For more info see PyTables Optimization Tips.
For applications where you intend to read in a specific orientation many times, it is crucial that you choose an appropriate chuck shape for your HDF5 arrays.
Say I have some huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and stored in the memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into the memory and then processed into an image (like how the libraries: Image, matplotlib, numpy, etc. handle it)?
Thanks.
This question comes from a similar question I asked: Generating pcolormesh images from very large data sets saved in H5 files with Python But I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: In the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. This method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high resolution image from this plot. But there are obvious memory limitations to this approach. I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking if there is a better method to patch together the data sets (that are saved in HDF5 files) into a single image without importing all of the data into the memory of the computer. (I will likely require 100s of these data sets to be patched together into a single image.) Also, I need to do everything in Python to make it automated (as this script will need to be run very often for different data sets).
The real question I discovered while trying to get this to work using various libraries is: How can I work with high resolution images in Python? For example, if I have a very high resolution PNG image, how can I manipulate it with Python (crop, split, run through an fft, etc.)? In my experience, I have always run into memory issues when trying to import high resolution images (think ridiculously high resolution pictures from a microscope or telescope (my application is a microscope)). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high resolution image from a massive amount of data saved in a file with Python? Again the data file could be arbitrarily large (5-6 Gigabytes if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.
You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels = 1.6GB
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding format like BMP or TIFF. Simply read each file and append to your result file.
You may need to get a bit clever if the files are different sizes or patched together in some type of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file writing pointer.
I have the following requirements (from the client) for zipping a number of files.
If the zip file created is less than 2**31-1 ~2GB use compression to create it (use zipfile.ZIP_DEFLATED), otherwise do not compress it (use zipfile.ZIP_STORED).
The current solution is to compress the file without zip64 and catching the zipfile.LargeZipFile exception to then create the non-compressed version.
My question is whether or not it would be worthwhile to attempt to calculate (approximately) whether or not the zip file will exceed the zip64 size without actually processing all the files, and how best to go about it? The process for zipping such large amounts of data is slow, and minimizing the duplicate compression processing might speed it up a bit.
Edit: I would upvote both solutions, as I think I can generate a useful heuristic from a combination of max and min file sizes and compression ratios. Unfortunately at this time, StackOverflow prevents me from upvoting anything (until I have a reputation higher than noob). Thanks for the good suggestions.
The only way I know of to estimate the zip file size is to look at the compression ratios for previously compressed files of a similar nature.
I can only think of two ways, one simple but requires manual tuning, and the other may not provide enough benefit to justify the complexity.
Define a file size at which you just skip the zip attempt, and tune it to your satisfacton by hand.
Keep a record of the last N filesizes between the smallest failure to zip ever observed and the largest successful zip ever observed. Decide what the acceptable probability of an incorrect choice resulting in an file that should be zipped not being zipped (say 5%). set your "don't bother trying to zip" threshold such that it would have resulted in that percentage of files that would have been erroneously left unzipped.
If you absolutely can never miss an opportunity to zip file that should have been zipped then you've already got the solution.
A heuristic approach will always involve some false positives and some false negatives.
The eventual size of the zipped file will depend on a number of factors, some of which are not knowable without running the compression process itself.
Zip64 allows you to use many different compression formats, such as bzip2, LZMA, etc.
Even the compression format may do the compression differently depending on the data to be compressed. For example, bzip2 can use Burrows-Wheeler, run length encoding and Huffman among others. The eventual size of the file will then depend on the statistical properties of the data being compressed.
Take Huffman, for instance; the size of the symbol table depends on how randomly-distributed the content of the file is.
One can go on and try to profile different types of data, serialized binary, text, images etc. and each will have a different normal distribution of final zipped size.
If you really need to save time by doing the process only once, apart from building a very large database and using a rule-based expert system or one based on Bayes' Theorem, there is no real 100% approach to this problem.
You could also try sampling blocks of the file at random intervals and compressing this sample, then linearly interpolating based on the size of the file.