Read jpg compression quality - python

Update:
After further reading I found out that the quality setting is not stored in the image header, and that the number is basically meaningless anyway, because different programs interpret it differently. I'm still leaving this here in case anybody has a very clever solution to the problem.
As a workaround I now write a log of when I last compressed each image and compare that to the file's last-modified time, though I would still like a way to get the quality of images that I can't match by modification time.
Original question:
I have a small Python script that walks through certain folders and compresses certain .jpg files to quality 60 using the Pillow library.
It works; however, on a second run it compresses all the images that were already compressed.
So is there a way to get the compression or quality a .jpg currently has, so I can skip files that were already compressed?
import os
from os.path import join, getsize
from PIL import Image

start_directory = ".\\test"
for root, dirs, files in os.walk(start_directory):
    for f in files:
        try:
            with Image.open(join(root, f)) as im:
                if im.quality > 60:  # <= something like this
                    im.save(join(root, f), optimize=True, quality=60)
                im.close()  # <= not sure if necessary
        except IOError:
            pass

There is a way this might be solved. You could use bits per pixel to check your jpeg compression rate.
As I understand it, the JPEG "quality" metric is a number from 0-100, where 100 means minimal compression and 0 means maximum compression. As you say, the actual compression details vary from encoder to encoder and (probably) from pixel block to pixel block. Applying quality 60 to an image will reduce the image quality, but it will also reduce the file size.
All your images will probably have different dimensions. What you want to look at is bits per pixel.
The question is: why do you compress at quality factor 60 in the first place? Is it correct for all your images? Are you trying to achieve a specific file size? Or are you happy just to make them all smaller?
If you instead aim for a specific number of bits per pixel, then your check simply becomes: divide the file size by the number of pixels in the image and compare that against your desired bits per pixel. If it is bigger, apply compression.
Of course, you'll then have to be slightly more clever than just picking quality factor 60: in Pillow that value is an internal encoder setting, not a percentage of the original file size, so you may need some trial and error to reach the desired file size.
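For illustration, a minimal sketch of that bits-per-pixel check applied to the original directory walk; the threshold value here is an assumption you would tune against your own images.

import os
from os.path import join, getsize
from PIL import Image

TARGET_BITS_PER_PIXEL = 2.0  # assumed threshold, tune against your own images

start_directory = ".\\test"
for root, dirs, files in os.walk(start_directory):
    for f in files:
        path = join(root, f)
        try:
            size_bits = getsize(path) * 8
            with Image.open(path) as im:
                bits_per_pixel = size_bits / (im.width * im.height)
                if bits_per_pixel > TARGET_BITS_PER_PIXEL:
                    im.save(path, optimize=True, quality=60)
        except OSError:
            pass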

You are asking the impossible here. JPEG "quality" depends upon two factors. First, the sampling of the components. Second, the selection of quantization tables.
Some JPEG encoders have "quality" settings, but these can mean anything. Some use a range of 0..100, others 0..4; I've seen 0..8, and I've seen "High", "Medium", "Low".
You could conceivably look at the sampling rates in the frame and compare the quantization tables against some baseline tables of your own to make a "quality" evaluation.
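For what it's worth, Pillow exposes a JPEG's quantization tables once the file is opened, so a sketch of that comparison idea could look like the following; the baseline would have to come from a file you previously saved at quality=60 yourself, and treating matching tables as "already compressed" is an assumption.

from PIL import Image

def quant_tables(path):
    # JPEG images opened by Pillow carry a 'quantization' dict:
    # table id -> sequence of 64 coefficients
    with Image.open(path) as im:
        tables = getattr(im, "quantization", None)
        if tables is None:
            return None
        return {tid: list(values) for tid, values in tables.items()}

# Capture a baseline once from an image you know was saved at quality=60,
# then skip any file whose tables match it.
baseline = quant_tables("known_quality_60.jpg")
if quant_tables("candidate.jpg") == baseline:
    print("candidate.jpg looks like it was already saved at quality 60")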

Related

Get character out of image with python

I want to detect the characters in an image like this with Python:
In this case the code should return the result '6010001'.
How can I get the result out of this image? What do I need?
For your information, if the solution is an AI solution: there are about 20,000 labeled images available.
Thanks in advance :)
Question: Are all the pictures of similar nature?
Meaning, are the numbers stamped into a similar material, or are they random pictures of numbers made with different techniques (e.g. pen drawn, stamped, etc.)?
If they are all quite similar (nice contrast as in the sample pic), I would recommend writing your "own" AI; otherwise use an existing neural network / library (as I assume you want to avoid the pain of creating your own neural network and tagging a lot of pictures).
If the pics are quite "similar", I suggest the following approach (a rough code sketch follows the list):
Greyscale the image and increase the contrast
Define a box (a bit larger than a digit), scan it over the image and count the dark (0) pixels; determine a valid count range for detecting a digit by trial, and avoid overlapping hits
For each hit, take the area, split it into sectors (e.g. 6x4) and count the dark pixels per sector
Build a little knowledge base (CSV file) of the per-sector counts for each number from 0-9 (e.g. as a string); you will end up with multiple valid strings per number in the database, just make sure they stay unique across numbers (otherwise refine steps 1-3)
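A rough sketch of steps 1-4, assuming Pillow and NumPy; the file name, box size, dark-pixel threshold and the assumption that the digits sit on a single line are illustrative choices you would tune yourself.

import numpy as np
from PIL import Image, ImageOps

BOX_W, BOX_H = 40, 60   # assumed size of one digit box, in pixels
DARK_LEVEL = 100        # greyscale value below which a pixel counts as "ink"
MIN_DARK = 200          # assumed minimum dark-pixel count to accept a digit hit

img = ImageOps.autocontrast(Image.open("digits.jpg").convert("L"))  # step 1
binary = np.array(img) < DARK_LEVEL   # True where the pixel is dark

signatures = []
x = 0
while x + BOX_W <= binary.shape[1]:   # step 2: scan a box across the line of digits
    box = binary[:BOX_H, x:x + BOX_W]
    if box.sum() >= MIN_DARK:         # enough dark pixels -> treat it as a digit
        # step 3: split the hit into a 6x4 sector grid and count dark pixels per sector
        counts = [int(cell.sum())
                  for band in np.array_split(box, 6, axis=0)
                  for cell in np.array_split(band, 4, axis=1)]
        signatures.append("-".join(str(c) for c in counts))
        x += BOX_W                    # jump past the hit to avoid overlapping detections
    else:
        x += 5
# step 4: look each signature string up in your little CSV knowledge base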
In addition, I recommend making the knowledge base self-improving: if a digit could not be identified, save the digit picture and the result string. Then write yourself a little review program that shows you the unidentified digits together with the result string, so you can manually add them to your knowledge base under the respective number.
Hope it helps. I used the same approach to read a lot of different data from screen pictures and store it in a database. Works like a charm.
# Better do it yourself than use a standard neural network :)
You can use opencv-python and pytesseract (note that pytesseract also needs the Tesseract OCR engine installed on the system):
import cv2
import pytesseract

img = cv2.imread('img3.jpeg')             # load the image as a BGR array
text = pytesseract.image_to_string(img)   # run OCR on the array
print(text)
It doesn't work for all images with text, but works for most.

Fast slicing .h5 files using h5py

I am working with .h5 files with little experience.
In a script I wrote I load in data from an .h5 file. The shape of the resulting array is [3584, 3584, 75]. Here 3584 is the number of pixels along each axis, and 75 is the number of time frames. Loading the data and printing the shape takes 180 ms. I obtain this time using os.times().
If I now want to look at the data at a specific time frame I use the following piece of code:
data_1 = data[:, :, 1]
The slicing takes a lot of time (1.76 s). I understand that my 2D array is huge, but at some point I would like to loop over time, which will take very long since I'm performing this slice inside the for loop.
Is there a more effective/less time consuming way of slicing the time frames or handling this type of data?
Thank you!
Note: I'm making assumptions here, since I'm unfamiliar with .h5 files and the Python code that accesses them.
I think that what is happening is that when you "load" the array, you're not actually loading an array. Instead, I think an object is constructed on top of the file. It probably reads in the dimensions and information related to how the file is organized, but it doesn't read the whole file.
That object mimics an array so well that when you later perform the slice operation, the normal Python slice syntax can be used, but at that point the actual data is being read. That's why the slice takes so long compared to "loading" all the data.
I arrive at this conclusion because of the following.
If you're reading 75 frames of 3584x3584 pixels and assuming they're stored uncompressed (HDF5 can hold plain, raw dumps of data), then 75 * 3584 * 3584 = 963,379,200 bytes, which is around 918 MB of data. Couple that with you "reading" this in 180 ms and we get this calculation:
918 MB / 180 ms = 5.1 GB/second reading speed
Note that this number assumes 1-byte pixels, which is also unlikely.
That speed seems highly unlikely, as it is far beyond what most storage can actually deliver.
It seems much more plausible that an object is just constructed on top of the file and the slice operation incurs the cost of reading at least 1 frame worth of data.
If we divide the speed by 75 to get per-frame speed, we get 68MB/sec speed for 1-byte pixels, and with 24 or 32-bit pixels we get up to 270MB/sec reading speeds. Much more plausible.
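For what it's worth, h5py does work this way: the dataset object is lazy, and each slice reads from disk. A minimal sketch (the file and dataset names are assumptions, and the per-frame slice speed also depends on the chunk layout chosen when the file was written):

import h5py

with h5py.File("measurement.h5", "r") as f:
    dset = f["data"]                   # no pixel data has been read yet
    print(dset.shape, dset.dtype, dset.chunks)

    # Option 1: read one frame at a time straight from disk
    frame = dset[:, :, 1]

    # Option 2: if the ~1 GB fits in RAM, pay the read cost once;
    # slicing the resulting in-memory NumPy array is then essentially free
    data = dset[...]
    for t in range(data.shape[2]):
        frame = data[:, :, t]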

Constructing high resolution images in Python

Say I have some huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and stored in the memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into the memory and then processed into an image (like how the libraries: Image, matplotlib, numpy, etc. handle it)?
Thanks.
This question comes from a similar question I asked: Generating pcolormesh images from very large data sets saved in H5 files with Python But I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: In the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. This method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high resolution image from this plot. But there are obvious memory limitations to this approach. I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking if there is a better method to patch together the data sets (that are saved in HDF5 files) into a single image without importing all of the data into the memory of the computer. (I will likely require 100s of these data sets to be patched together into a single image.) Also, I need to do everything in Python to make it automated (as this script will need to be run very often for different data sets).
The real question I discovered while trying to get this to work using various libraries is: how can I work with high resolution images in Python? For example, if I have a very high resolution PNG image, how can I manipulate it with Python (crop, split, run through an FFT, etc.)? In my experience, I have always run into memory issues when trying to import high resolution images (think ridiculously high resolution pictures from a microscope or telescope; my application is a microscope). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high resolution image from a massive amount of data saved in a file with Python? Again the data file could be arbitrarily large (5-6 Gigabytes if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.
You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels (at 1 byte each) = 1.6GB
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding format like BMP or TIFF. Simply read each file and append to your result file.
You may need to get a bit clever if the files are different sizes or patched together in some type of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file writing pointer.
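To make the patchwork idea concrete, here is a sketch under heavy assumptions: the tiles are HDF5 files named on a regular grid, each holds a dataset called "data" of a fixed size, and a single 8-bit channel is enough for the output.

import h5py
import numpy as np
from PIL import Image

TILE = 2000   # assumed tile edge length in pixels
GRID = 10     # assumed 10x10 grid of tiles -> a 20,000 x 20,000 result

# ~400 MB for a single 8-bit channel, well within the estimate above
canvas = np.zeros((TILE * GRID, TILE * GRID), dtype=np.uint8)

for row in range(GRID):
    for col in range(GRID):
        with h5py.File(f"tile_{row}_{col}.h5", "r") as f:
            tile = f["data"][...]              # only one tile in memory at a time
        # scale the raw values into 0-255; adjust this to your data's real range
        lo, hi = float(tile.min()), float(tile.max())
        scaled = ((tile - lo) / ((hi - lo) or 1.0) * 255).astype(np.uint8)
        canvas[row * TILE:(row + 1) * TILE, col * TILE:(col + 1) * TILE] = scaled

Image.fromarray(canvas).save("stitched.png")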

Direct access to a single pixel using Python

Is there any way with Python to directly get (only get, not modify) a single pixel (to read its RGB color) from an image (in a compressed format if possible) without having to load the whole image into RAM or process it (to spare the CPU)?
More details:
My application is meant to have a huge database of images, and only of images.
So what I chose is to store the images directly on the hard drive; this avoids the additional workload of a DBMS.
However I would like to optimize some more, and I'm wondering if there's a way to directly access a single pixel from an image (the only action on images that my application does), without having to load it in memory.
Does PIL pixel access allow that? Or is there another way?
The encoding of the images is my own choice, so I can change it whenever I want. Currently I'm using PNG or JPG. I could also store them raw, but I would prefer to keep the images a bit compressed if possible. Then again, I think hard drives are cheaper than CPU and RAM, so even if the images must stay raw in order to do that, I think it's still a better bet.
Thank you.
UPDATE
So, as I feared, it seems that it's impossible to do with variable compression formats such as PNG.
I'd like to refine my question:
Is there a compression format with a constant, predictable layout (not necessarily an image format, I'll access it programmatically) which would allow access to any part of the data by just reading the headers?
Technically, how do I efficiently (read: fast and non-blocking) access a byte at a given offset in a file with Python?
SOLUTION
Thanks to all, I have successfully implemented the functionality I described by run-length encoding every row and padding every row to the length of the longest row.
This way, by prepending a header that describes the fixed row size, I can easily access a row by first using file.readline() to get the header data, then file.seek(headersize + fixedsize*y, 0), where y is the currently selected row.
Files stay compressed, and in memory I only fetch a single row; my application doesn't even need to decompress it, because I can compute exactly where the pixel is by just iterating over the RLE values. So it is also very easy on CPU cycles.
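For reference, a minimal sketch of that lookup; the exact header format and the RLE layout (single-byte value, count pairs) are assumptions standing in for the scheme described above.

def get_pixel_rle(path, x, y):
    with open(path, "rb") as f:
        header = f.readline()                 # e.g. b"rowsize=4096\n" (assumed format)
        row_size = int(header.split(b"=")[1])
        f.seek(len(header) + row_size * y)    # jump straight to the selected row
        row = f.read(row_size)
    # walk the run-length (value, count) pairs until the run containing column x
    col = 0
    for i in range(0, len(row), 2):
        value, count = row[i], row[i + 1]
        col += count
        if x < col:
            return value
    raise IndexError("x lies beyond the encoded row")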
If you want to keep a compressed file format, you can break each image up into smaller rectangles and store them separately. Using a fixed size for the rectangles will make it easier to calculate which one you need. When you need the pixel value, calculate which rectangle it's in, open that image file, and offset the coordinates to get the proper pixel.
This doesn't completely optimize access to a single pixel, but it can be much more efficient than opening an entire large image.
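A tiny sketch of that fixed-size tile lookup; the tile size and file naming scheme here are made up for illustration.

from PIL import Image

TILE_W, TILE_H = 256, 256   # assumed fixed tile size

def get_pixel_tiled(image_id, x, y):
    tile_col, tile_row = x // TILE_W, y // TILE_H
    with Image.open(f"{image_id}_{tile_row}_{tile_col}.png") as tile:
        # only this one small tile is decoded, not the whole large image
        return tile.getpixel((x % TILE_W, y % TILE_H))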
In order to evaluate a file you have to load it into memory. However, you might be able to read only parts of a file, depending on the format. For example, a PNG file starts with an 8-byte signature, but because of compression the chunks that follow have variable sizes. If you store all the pixels in a raw format instead, you can access each pixel directly, because you can calculate its address in the file as a fixed offset. What PNG or JPEG will do with the raw data is impossible to predict from the outside.
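The raw-format arithmetic mentioned here is simple enough to sketch; the header size, image width and 3-bytes-per-pixel packing are assumptions.

HEADER_SIZE = 0       # assumption: no header in the raw dump
WIDTH = 1920          # assumption: known, fixed image width
BYTES_PER_PIXEL = 3   # assumption: packed 8-bit RGB

def get_rgb_raw(path, x, y):
    with open(path, "rb") as f:
        f.seek(HEADER_SIZE + (y * WIDTH + x) * BYTES_PER_PIXEL)
        r, g, b = f.read(BYTES_PER_PIXEL)   # indexing/unpacking bytes yields ints
        return r, g, b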
Depending on the structure of the files you might be able to compute efficient hashes. I suppose there is loads of research, if you want to really get into this, for example: Link
"This paper introduces a novel image indexing technique that may be called an image hash function. The algorithm uses randomized signal processing strategies for a non-reversible compression of images into random binary strings, and is shown to be robust against image changes due to compression, geometric distortions, and other attacks"

Calculate (approximately) if zip64 extensions are required without relying on exceptions?

I have the following requirements (from the client) for zipping a number of files.
If the zip file created would be less than 2**31 - 1 bytes (~2 GB), use compression to create it (zipfile.ZIP_DEFLATED); otherwise do not compress it (zipfile.ZIP_STORED).
The current solution is to compress the files without zip64 and catch the zipfile.LargeZipFile exception, then create the non-compressed version.
My question is whether it would be worthwhile to calculate (approximately) whether the zip file will exceed the zip64 size limit without actually processing all the files, and how best to go about it. Zipping such large amounts of data is slow, and minimizing the duplicate compression work might speed it up a bit.
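For reference, the current try/except approach looks roughly like this (the file list and output path are placeholders):

import zipfile

def make_archive(out_path, file_paths):
    try:
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED, allowZip64=False) as zf:
            for p in file_paths:
                zf.write(p)
    except zipfile.LargeZipFile:
        # the archive would need zip64 extensions -> recreate it uncompressed
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_STORED, allowZip64=True) as zf:
            for p in file_paths:
                zf.write(p)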
Edit: I would upvote both solutions, as I think I can generate a useful heuristic from a combination of max and min file sizes and compression ratios. Unfortunately at this time, StackOverflow prevents me from upvoting anything (until I have a reputation higher than noob). Thanks for the good suggestions.
The only way I know of to estimate the zip file size is to look at the compression ratios for previously compressed files of a similar nature.
I can only think of two ways, one simple but requires manual tuning, and the other may not provide enough benefit to justify the complexity.
Define a file size at which you just skip the zip attempt, and tune it to your satisfaction by hand.
Keep a record of the last N file sizes between the smallest failed zip ever observed and the largest successful zip ever observed. Decide what probability is acceptable for an incorrect choice that leaves a file unzipped which should have been zipped (say 5%). Set your "don't bother trying to zip" threshold such that, historically, it would have left only that percentage of files erroneously unzipped.
If you absolutely can never miss an opportunity to zip a file that should have been zipped, then you've already got the solution.
A heuristic approach will always involve some false positives and some false negatives.
The eventual size of the zipped file will depend on a number of factors, some of which are not knowable without running the compression process itself.
The zip format allows many different compression methods, such as bzip2, LZMA, etc., and even a single method may compress differently depending on the data. For example, bzip2 combines the Burrows-Wheeler transform, run-length encoding and Huffman coding, among others. The eventual size of the file will then depend on the statistical properties of the data being compressed.
Take Huffman, for instance; the size of the symbol table depends on how randomly-distributed the content of the file is.
One can go on and try to profile different types of data (serialized binary, text, images, etc.), and each will have a different distribution of final zipped sizes.
If you really need to save time by doing the process only once, apart from building a very large database and using a rule-based expert system or one based on Bayes' Theorem, there is no real 100% approach to this problem.
You could also try sampling blocks of the file at random intervals and compressing this sample, then linearly interpolating based on the size of the file.
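A rough sketch of that sampling idea, using zlib (the same deflate algorithm behind zipfile.ZIP_DEFLATED); the block size and sample count are arbitrary choices, and the resulting estimate can be summed over the files and compared against 2**31 - 1 before deciding how to build the archive.

import os
import zlib

def estimate_deflated_size(path, block_size=1 << 20, samples=8):
    total = os.path.getsize(path)
    if total <= block_size * samples:          # small file: just compress it whole
        with open(path, "rb") as f:
            return len(zlib.compress(f.read()))
    sampled_in = sampled_out = 0
    with open(path, "rb") as f:
        for i in range(samples):               # spread the sample blocks across the file
            f.seek(i * (total // samples))
            block = f.read(block_size)
            sampled_in += len(block)
            sampled_out += len(zlib.compress(block))
    return int(total * sampled_out / sampled_in)  # linear extrapolation of the ratio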
