I'm using Tesseract to do OCR on millions of PDFs, and I'm trying to squeeze out as much performance as I can.
My current pipeline uses convert to convert a PDF to PNG files (one per page), and then uses Tesseract on each of those.
During profiling, I've discovered that a lot of time is spent writing files to disk, then reading them again, so I'd like to move all of this into memory.
I've got the PDF to PNG conversion working in memory, so now I need a way to pass the in-memory blob to Tesseract instead of giving it a path to a file. I haven't been able to find any documentation or examples of this.
Related
I have read a lot of essays and articles about image compression algorithms. There are many algorithms, and I can only understand some of them because I'm a student and haven't started high school yet. I read this article, which helped me a lot, especially the part on page 3 about run-length coding. It's a very easy and helpful algorithm, but I don't know how to make a new image format. I am a Python developer, but I don't know how to create a new format that has its own algorithm and program, like .jpeg, .jpg, .png, .bmp.
(Sorry, I have only studied English for 1 year, so please excuse any grammar or vocabulary mistakes.)
Sure, you can make your own image file format. Choose a filename extension, define how it will be stored and write Python code to:
read the format from disk into a NumPy array, and
write an image contained in a NumPy array to disk
That way you will be interoperable with all the major image processing libraries such as OpenCV, scikit-image, PIL, wand.
Have a look at how NetPBM works to get started with a simple format. Maybe look at the PCX format if you like the thought of RLE.
Read up on how to write binary to a file with Python.
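As a rough illustration of the steps above, here is a minimal sketch of a made-up run-length-encoded grayscale format. Everything about the format is an assumption for the example: the "RLE1" signature, the header layout (two little-endian 32-bit integers for width and height) and the (count, value) byte pairs are all hypothetical, not an existing standard. It uses only the standard library and works on raw bytes.

```python
import struct

MAGIC = b"RLE1"  # hypothetical 4-byte signature for this made-up format


def rle_encode(pixels: bytes) -> bytes:
    """Encode bytes as (count, value) pairs; each run is capped at 255."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        out += bytes((run, value))
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(data), 2):
        out += bytes([data[k + 1]]) * data[k]
    return bytes(out)


def write_image(path, width, height, pixels):
    """Write an 8-bit grayscale image: signature, width, height, RLE data."""
    if len(pixels) != width * height:
        raise ValueError("pixel data does not match width * height")
    with open(path, "wb") as f:
        f.write(MAGIC + struct.pack("<II", width, height))
        f.write(rle_encode(pixels))


def read_image(path):
    """Read a file written by write_image; returns (width, height, pixels)."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not an RLE1 file")
        width, height = struct.unpack("<II", f.read(8))
        pixels = rle_decode(f.read())
    if len(pixels) != width * height:
        raise ValueError("corrupt pixel data")
    return width, height, pixels
```

To get a NumPy array out of this, something like np.frombuffer(pixels, dtype=np.uint8).reshape(height, width) should work, and array.tobytes() goes the other way.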
I'm trying to write a program to read in a .psd file, split the layers into individual images (maintaining the original image's dimensions) and export them as EXR files.
I'm currently trying to use the OpenImageIO library to accomplish this, but the documentation isn't particularly clear on how this can be achieved in Python.
I've successfully managed to read the full .psd and export it to .exr, but nothing I've been trying seems to indicate that there is more than one layer (subimage) to interact with.
Is there:
something obvious that I'm missing, or
a better way to accomplish this?
Side note:
I have had some success using psd_tools2 but the images can't be exported as .exr nor are they the correct dimensions.
This is actually relatively straightforward, with one caveat: it only seems to be supported for 8-bit PSD files at the moment.
import OpenImageIO as oiio

sourcefile = '/path/to/sourcefile.psd'
buf = oiio.ImageBuf(sourcefile)

# Each PSD layer is exposed as a subimage; write each one to its own EXR file.
for layer in range(buf.nsubimages):
    buf.reset(sourcefile, subimage=layer)
    buf.write('/tmp/mylayer_{l}.exr'.format(l=layer))
I have to read thousands of images into memory. This has to be done. When I extract frames from a video using ffmpeg, the 14400 resulting JPG files take 92 MB of disk space. When I read those images in Python and append them to a list using libraries like OpenCV, SciPy, etc., the same 14400 files take 2.5 to 3 GB. I guess the decoding is the reason? Any thoughts on this would be helpful.
You are exactly right: JPEG images are compressed (and with lossy compression; PNG would be a format with lossless compression), so JPEG files are much smaller than the same data in uncompressed form.
When you load the images to memory, they are in uncompressed form, and having several GB of data with 14400 images is not surprising.
Basically, my advice is don't do that. Load them one at a time (or in batches), process them, then load the next images. If you load everything to memory beforehand, there will be a point when you run out of memory.
I'm doing a lot of image processing, and I have trouble imagining a case where it is necessary to have that many images loaded at once.
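To make the "one at a time (or in batches)" advice concrete, here is a minimal sketch. The names iter_batches and process_in_batches are made up for this example, and load and process are placeholders you supply yourself; nothing here is specific to any imaging library.

```python
from typing import Callable, Iterable, Iterator, List


def iter_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size paths, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover paths that did not fill a whole batch
        yield batch


def process_in_batches(paths: Iterable[str],
                       load: Callable,
                       process: Callable,
                       batch_size: int = 64) -> list:
    """Decode and process one batch at a time; keep only the results."""
    results = []
    for batch in iter_batches(paths, batch_size):
        # Only the images of the current batch are held decoded in memory.
        images = [load(path) for path in batch]
        results.extend(process(image) for image in images)
    return results
```

Here load could be, for example, cv2.imread, and process whatever you compute per frame. As long as process returns something small (a histogram, a feature vector, a score), peak memory is bounded by the batch size rather than by the 14400 frames.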
I'm currently working on a little Python script to equalize MP3 files.
I've read some docs about the MP3 file format (at https://en.wikipedia.org/wiki/ID3)
and I've noticed that in the ID3v2 format there is a field for equalization (EQUA, EQU2).
Using the Python library mutagen, I've tried to extract this information from the MP3, but the field isn't present.
What's the right way to equalize an MP3 file regardless of the ID3 version?
Thanks in advance. Creekorful
There are two high-level approaches you can take: modify the encoded audio stream, or put metadata on it describing the desired change. Modifying the audio stream is the most compatible, but generally less desirable. However, ID3v1 has no place for this metadata, only ID3v2.2 and up do.
Depending on what you mean by equalize, you might want equalization information stored in the EQA/EQUA/EQU2 frames, or a replay gain volume adjustment stored in the RVA/RVAD/RVA2 frames. Mutagen supports the linked frames, so all but EQA/EQUA. If you need them, it should be straightforward to add them from the information in the actual specification (see 4.12 on http://id3.org/id3v2.4.0-frames). With tests they could likely be contributed back to the project.
Note that Quod Libet, the player paired with Mutagen, has taken a preference for reading and storing replay gain information in a TXXX frame.
Say I have some huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and stored in the memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into the memory and then processed into an image (like how the libraries: Image, matplotlib, numpy, etc. handle it)?
Thanks.
This question comes from a similar question I asked: "Generating pcolormesh images from very large data sets saved in H5 files with Python". But I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: In the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. This method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high resolution image from this plot. But there are obvious memory limitations to this approach. I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking if there is a better method to patch together the data sets (that are saved in HDF5 files) into a single image without importing all of the data into the memory of the computer. (I will likely require 100s of these data sets to be patched together into a single image.) Also, I need to do everything in Python to make it automated (as this script will need to be run very often for different data sets).
The real question I discovered while trying to get this to work using various libraries is: How can I work with high resolution images in Python? For example, if I have a very high resolution PNG image, how can I manipulate it with Python (crop, split, run through an fft, etc.)? In my experience, I have always run into memory issues when trying to import high resolution images (think ridiculously high resolution pictures from a microscope or telescope (my application is a microscope)). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high resolution image from a massive amount of data saved in a file with Python? Again the data file could be arbitrarily large (5-6 Gigabytes if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.
You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels * 1 byte per channel = 1.6 GB
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
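The arithmetic is easy to redo for other layouts. The sketch below assumes uncompressed storage; the helper name and the second case (one 64-bit float per data point, which is only a guess at how the HDF5 data might be stored) are illustrative and not taken from the question.

```python
def uncompressed_size_bytes(width, height, channels, bytes_per_channel=1):
    """Size in bytes of a raw, uncompressed raster held in memory."""
    return width * height * channels * bytes_per_channel


# The case worked out above: 4 channels at 1 byte each.
rgba_8bit = uncompressed_size_bytes(20_000, 20_000, 4)

# Assumed alternative: one 64-bit float per data point.
float64_single = uncompressed_size_bytes(20_000, 20_000, 1, bytes_per_channel=8)

print(rgba_8bit / 1e9, "GB")
print(float64_single / 1e9, "GB")
```

Plotting libraries often make several intermediate copies, so the peak usage while rendering can be a multiple of these numbers.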
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding format like BMP or TIFF. Simply read each file and append to your result file.
You may need to get a bit clever if the files are different sizes or patched together in some type of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file writing pointer.
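A minimal sketch of the "offset the file writing pointer" idea, assuming a raw, headerless, row-major output file. The function names are made up, and a real BMP or TIFF would also need its header written and its own row ordering respected. Each tile is copied row by row to its computed position, so only one tile needs to be in memory at a time.

```python
def create_canvas(path, width, height, bytes_per_pixel=1):
    """Create a zero-filled raw file large enough for the final image."""
    with open(path, "wb") as f:
        f.truncate(width * height * bytes_per_pixel)


def paste_tile(path, canvas_width, tile, tile_width, x0, y0, bytes_per_pixel=1):
    """Write a row-major tile into the canvas with its top-left at (x0, y0)."""
    row_bytes = tile_width * bytes_per_pixel
    tile_height = len(tile) // row_bytes
    with open(path, "r+b") as f:
        for row in range(tile_height):
            # Byte offset of the first pixel of this tile row in the canvas.
            offset = ((y0 + row) * canvas_width + x0) * bytes_per_pixel
            f.seek(offset)
            f.write(tile[row * row_bytes:(row + 1) * row_bytes])
```

For example, on a 4 x 2 canvas, paste_tile(path, 4, tile, 2, x0=2, y0=0) places a 2 x 2 tile in the right half. For a grid of data sets, x0 and y0 would be the column and row index multiplied by the tile width and height.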