Difference between raw image and image in npy - Python

I am working with some EDF (European Data Format) images, and I have the following problem: if I load the files into a .npy array and compare a given array element with the corresponding raw file, I find that
the files look the same, BUT
the difference is not 0. Plotting Image_from_stack - Raw_image, I get a striped value distribution (see image). Does anyone have a suggestion about what could cause this, and how to fix it?
To make things more interesting, the difference changes from image to image, but it always shows a striped pattern.
I am working in Python.

A note for future readers: the problem explained above involved a scientific-computing script running on a high-performance computing machine. The script was using a substantial amount of memory (up to 100 GB).
My guess is that the striped-pattern effect shown above is related to those anomalous memory requirements; after rebooting the machine I could not replicate the problem.
So if you see something similar, check the memory usage. If it is very high, give a reboot a chance!
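For anyone who wants to reproduce the comparison before blaming memory, a minimal sketch is below. It assumes the .edf files are the kind that the fabio library can read and that the stack was saved with numpy; the file names are placeholders. Diffing as floats also rules out unsigned-integer wrap-around, which can distort a difference image.

```python
import numpy as np
import fabio  # assumption: the EDF files are readable by the fabio library

# Minimal sketch of the comparison described above; file names are placeholders.
stack = np.load("stack.npy")                  # stacked images
raw = fabio.open("frame_0001.edf").data       # the corresponding raw frame

frame = stack[0]
print("dtypes:", frame.dtype, raw.dtype)      # a dtype mismatch is worth ruling out

# Cast to float so unsigned-integer wrap-around cannot distort the difference.
diff = frame.astype(np.float64) - raw.astype(np.float64)
print("identical:", np.array_equal(frame, raw))
print("max |difference|:", np.abs(diff).max())
```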


OCR PDF image to Excel by template

I need to convert a lot of poor-quality scans of PDF tables to Excel tables. The only solution I can see is to train Tesseract or some other framework on pre-generated images (in most cases the tables in the PDFs are identical). Is it realistic to get a decent solution, around 70-80% accuracy, with home-grown tooling, and what would you advise? I will appreciate any advice other than ABBYY FineReader or similar solutions (tested on my dataset: the result was very bad, with few opportunities for automation).
All table structures need to be correct in the result, for further manual work.
You should use a PDF parser for that.
Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.
When the input image is of very poor quality, the dirt tends to get in the way of text recognition. This is exacerbated when you are looking for regions without dictionary entries; bare numbers are the worst kind of text to train for, given every twist and turn that bad scanning produces.
If the electronic source (before the manual stamping and scanning) is available, it might be possible to merge the text with the distorted image, but that is a highly manual task that defeats the purpose.
The documents need to be rescanned by a trained operator with a good eye for detail. That, together with an OCR-capable scanning device, will be faster than tuning images that are never likely to produce reasonably trustworthy output. There are too many cases of numeric failures that would make any single page worthless for reading or computation.
I recently scanned some accounts and spent more time checking and correcting than if they had been typed, but they needed to be a "legal" copy; clearly this one was not, as I did it after the event.
The best result I could squeeze out of Adobe PDF-to-Excel was "Pants".
I tried some improvements to the image contrast and some noise reduction (by hand).
There was some effect, but nothing obvious.
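For what it's worth, a hedged sketch of doing that contrast/noise preprocessing programmatically, before a plain Tesseract pass via pytesseract, is below. The file name and parameter values are assumptions, not tuned settings.

```python
import cv2
import pytesseract  # assumes the Tesseract binary is installed and on PATH

# Illustrative preprocessing before OCR; "table_scan.png" and the parameter
# values below are placeholders, not tuned for any particular dataset.
img = cv2.imread("table_scan.png", cv2.IMREAD_GRAYSCALE)
img = cv2.fastNlMeansDenoising(img, None, 30)                    # knock down scanner noise
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                            cv2.THRESH_BINARY, 31, 15)           # local contrast / binarisation
print(pytesseract.image_to_string(img, config="--psm 6"))        # treat the page as one uniform block
```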

What is a sensible way to store matrices (which represent images) either in memory or on disk, to make them available to a GUI application?

I am looking for some high-level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16-bit grayscale) in the directory. After being read in, the application will process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processing is done, the user should be able to go back and explore any of the matrices they choose (or several at once), and possibly apply further processing.
My question is about a sensible way to store the matrices in memory and access them when needed. After reading in the raw .png files and obtaining the corresponding matrices of floats, I can see the following options for handling the results:
1. Simply store each matrix as a NumPy array and keep all of them in a class attribute. That way they will all be easily accessible to the code when the user requests them, but will this be poor in terms of the RAM required?
2. After processing each image, write the matrix out to a text file and read it back in when the user requests it.
3. I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using the MVC pattern) and then querying the database when access to the data is needed. This seems to have the advantage that the data is not held in RAM by the "model" part of the application (unlike option 1), and it is possibly more storage-efficient than option 2, but is it suitable given that my data are matrices?
4. I have seen examples (see here) of people using something called HDF5 for storing application data, and this might be similar to using an SQLite database? Again, is it suitable for matrices? (A minimal sketch of this option follows after the question.)
5. Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
I am a little lost with all the options and don't want to spend too much time investigating each of them in detail, so I would appreciate some general advice. If someone could comment on each of the options I have described (and let me know if any can be ruled out in this situation), that would be great!
Thank you
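Regarding option 4, here is a minimal sketch of what the HDF5 route could look like with h5py, assuming one float32 matrix per processed image; the file and dataset names and the stand-in array are placeholders. Each result is written to disk as it is produced, so RAM only needs to hold whichever matrices the user is currently exploring.

```python
import h5py
import numpy as np

# Sketch of option 4: one dataset per processed image in a single HDF5 file.
# "results.h5", "frame_0001" and the array below are placeholders.
result = np.random.rand(1000, 1000).astype(np.float32)   # stand-in for a processed ~1 MP image

with h5py.File("results.h5", "a") as f:
    f.create_dataset("frame_0001", data=result, compression="gzip")

# Later, when the user selects that image in the GUI, load just this one matrix:
with h5py.File("results.h5", "r") as f:
    matrix = f["frame_0001"][...]
```

Compared with option 2 (text files), this keeps the binary precision of the floats and is much faster to read back; compared with option 1, nothing has to stay resident in RAM between views.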

Errors processing large images with SIFT OpenCV

I want to use OpenCV Python to do SIFT feature detection on remote sensing images. These images are high resolution and can be thousands of pixels wide (7000 x 6000 or bigger). I am having trouble with insufficient memory, however. As a reference point, I ran the same 7000 x 6000 image in Matlab (using VLFEAT) without memory error, although larger images could be problematic. Does anyone have suggestions for processing this kind of data set using OpenCV SIFT?
OpenCV Error: Insufficient memory (Failed to allocate 672000000 bytes) in cv::OutOfMemoryError, file C:\projects\opencv-python\opencv\modules\core\src\alloc.cpp, line 55
OpenCV Error: Assertion failed (u != 0) in cv::Mat::create, file
(I'm using Python 2.7 and OpenCV 3.4 in the Spyder IDE on 64-bit Windows with 32 GB of RAM.)
I would split the image into smaller windows. As long as the windows overlap (I assume you have an idea of the lateral shift), a match found in any window will be valid.
You can even use this as a check: the translation between feature points in any part of the image must be the same for the transform to be valid.
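To illustrate the windowing idea, a hedged sketch is below: detect keypoints per overlapping tile and shift them back into full-image coordinates. The tile and overlap sizes are arbitrary placeholders, and the SIFT constructor depends on your OpenCV build (cv2.SIFT_create on 4.4+, cv2.xfeatures2d.SIFT_create on contrib builds of 3.x).

```python
import cv2
import numpy as np

def create_sift():
    # SIFT lives in different places depending on the OpenCV build.
    try:
        return cv2.SIFT_create()
    except AttributeError:
        return cv2.xfeatures2d.SIFT_create()

def sift_tiled(gray, tile=2000, overlap=200):
    """Run SIFT per overlapping window and map keypoints to global coordinates."""
    sift = create_sift()
    h, w = gray.shape
    keypoints, descriptors = [], []
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            window = gray[y:y + tile, x:x + tile]
            kps, desc = sift.detectAndCompute(window, None)
            for kp in kps:
                kp.pt = (kp.pt[0] + x, kp.pt[1] + y)   # back to full-image coordinates
            keypoints.extend(kps)
            if desc is not None:
                descriptors.append(desc)
    return keypoints, (np.vstack(descriptors) if descriptors else None)

# Usage (file name is a placeholder):
# gray = cv2.imread("scene.tif", cv2.IMREAD_GRAYSCALE)
# kps, desc = sift_tiled(gray)
```

Keypoints near the window edges can be cut off or duplicated in the overlap region, which is the drawback discussed in the next answer.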
There are a few flavors of how to process SIFT corner detection in this case:
process a single image per unit of time on one core;
multiprocess two or more images per unit of time on a single core;
multiprocess two or more images per unit of time on multiple cores (a sketch of this follows below).
Read "cores" as either CPU or GPU. Note that threading results in serial processing, not parallel.
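As an illustration of the third flavor, here is a hedged sketch of one-image-per-worker processing; the file names are placeholders, and keypoints are returned as plain coordinates because cv2.KeyPoint objects do not pickle across processes.

```python
import cv2
from multiprocessing import Pool

def detect(path):
    # One whole image per worker process; no tiling needed when RAM suffices.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    try:
        sift = cv2.SIFT_create()
    except AttributeError:
        sift = cv2.xfeatures2d.SIFT_create()
    kps, desc = sift.detectAndCompute(gray, None)
    return path, [kp.pt for kp in kps], desc   # KeyPoint objects don't pickle

if __name__ == "__main__":
    paths = ["scene_a.tif", "scene_b.tif"]     # placeholder file names
    pool = Pool(processes=2)                   # roughly one image per core
    try:
        for path, points, desc in pool.map(detect, paths):
            print("%s: %d keypoints" % (path, len(points)))
    finally:
        pool.close()
        pool.join()
```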
As stated, Rebecca has at least 32 GB of internal memory at her disposal, which is more than sufficient for option 1, processing a single image at a time.
In that light, splitting a single image, as suggested by Martin, should be a last resort in my opinion.
Why should you avoid splitting a single image into multiple windows during feature detection (assuming you are not running out of memory)?
Answer:
If a corner is located on the split edge of a window, it is unintentionally cut into two more or less straight, polygon-like shapes, and you won't find the corner you're looking for unless you have a specialized algorithm to search for those anomalies.
In this case:
In Rebecca's case it is crucial to know which approach she took to processing the image(s): was it one, two, or many more images loaded into memory simultaneously?
If hundreds or thousands of images are loaded into memory at the same time, you are basically choking the system by taking away its breathing space (in the form of free memory). On top of that, other programs loaded into memory also claim (reserve) or consume memory for various background tasks, which adds to the issue at hand.
On reflection:
If, as Martin suggests, there is an issue with the OpenCV library in handling the amount of image data Rebecca describes, do some debugging and then report your findings to OpenCV, or post a question here on SO as she did; but also post the code that shows how you handle the image processing from the start, because, as explained above, that matters. And, as Martin stated, don't post wrappers: it's pointless to do so. A link to the wrapper (with a version number, if possible) is more than enough, or a tag ;-)

Analysing Twitter for research: moving from small data to big

We are doing research work as part of our college project in which we need to analyse Twitter data.
We have already built a prototype for classification and analysis using pandas and NLTK, reading the comments from a CSV file and then processing them. The problem now is that we want to scale it up so it can read and analyse a much bigger comments file as well. But we don't have anybody who can guide us (most of us come from a biology background) on which technologies to use for this massive amount of data.
Our issues are:
1. How do we store a massive comments file (5 GB, offline data)? Until now we only had 5,000-10,000 lines of comments, which we processed using pandas. How do we store and process such a huge file, and which database should we use for it?
2. Also, since we plan to use NLTK and machine learning on this data, what should our approach be along the pipeline CSV -> pandas -> NLTK/machine learning -> model -> prediction? That is, where in this path do we need changes, and which technologies should we replace them with to handle the huge data?
Generally speaking, there are two types of scaling:
Scale up
Scale out
Scale up, most of the time, means taking what you already have and running it on a bigger machine (more CPU, RAM, disk throughput).
Scale out generally means partitioning your problem, and handling parts on separate threads/processes/machines.
Scaling up is much easier: keep the code you already have and run it on a big machine (possibly on Amazon EC2 or Rackspace, if you don't have one available).
If scaling up is not enough, you will need to scale out. Start by identifying which parts of your problem can be partitioned. Since you're processing Twitter comments, there's a good chance you can simply partition your file into multiple ones and train N independent models.
Since you're just processing text data, there isn't a big advantage to using a database over plain text files (for storing the input data, at least). Simply split your file into multiple files and distribute each one to a different processing unit (a splitting sketch follows at the end of this answer).
Depending on the specific machine-learning techniques you're using, it may be easy to merge the independent models into a single one, but doing so will likely require expert knowledge.
If you're using k-nearest neighbours, for example, it's trivial to join the independent models.
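For the file-partitioning step, a minimal sketch with pandas is below. It assumes the comments sit in a CSV; the file name and chunk size are placeholders. read_csv with chunksize streams the 5 GB file instead of loading it whole, and each chunk can then be handed to an independent worker or model.

```python
import pandas as pd

# Stream the big CSV in chunks and write each chunk to its own file,
# ready to be processed by an independent worker/model.
# "comments.csv" and the chunk size are placeholders.
reader = pd.read_csv("comments.csv", chunksize=100000)
for i, chunk in enumerate(reader):
    chunk.to_csv("comments_part_%03d.csv" % i, index=False)
```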

Constructing high resolution images in Python

Say I have a huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and held in memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into memory and then processed into an image (the way libraries such as PIL's Image, matplotlib, numpy, etc. handle it)?
Thanks.
This question comes from a similar question I asked: Generating pcolormesh images from very large data sets saved in H5 files with Python But I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: in the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. That method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high-resolution image of that plot. But there are obvious memory limitations to this approach: I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking whether there is a better method to patch the data sets (saved in HDF5 files) together into a single image without importing all of the data into the computer's memory. (I will likely need hundreds of these data sets to be patched together into a single image.) I also need to do everything in Python so it can be automated (this script will need to be run very often, for different data sets).
The real question I discovered while trying to get this to work with various libraries is: how can I work with high-resolution images in Python? For example, if I have a very high-resolution PNG image, how can I manipulate it with Python (crop, split, run it through an FFT, etc.)? In my experience, I have always run into memory issues when trying to import high-resolution images (think ridiculously high-resolution pictures from a microscope or telescope; my application is a microscope). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high-resolution image from a massive amount of data saved in a file with Python? Again, the data file could be arbitrarily large (5-6 gigabytes, if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.
You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels = 1.6GB
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding format like BMP or TIFF. Simply read each file and append to your result file.
You may need to get a bit clever if the files are different sizes or patched together in some type of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file writing pointer.
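Since the tiles already live in HDF5, one way to do the patchwork without ever holding the full mosaic in RAM is to allocate a single large on-disk dataset with h5py and copy each tile into its slot. The file and dataset names, the 2x2 grid and the tile size below are assumptions for illustration.

```python
import h5py

# Assemble a mosaic tile by tile; only one tile is in memory at a time.
# File/dataset names, grid layout and tile size are placeholders.
tile_files = [["a.h5", "b.h5"],
              ["c.h5", "d.h5"]]
tile_h, tile_w = 20000, 20000

with h5py.File("mosaic.h5", "w") as out:
    mosaic = out.create_dataset("image",
                                shape=(2 * tile_h, 2 * tile_w),
                                dtype="float32", chunks=True)
    for r, row in enumerate(tile_files):
        for c, name in enumerate(row):
            with h5py.File(name, "r") as src:
                tile = src["data"][...]          # assumed dataset name
            mosaic[r * tile_h:(r + 1) * tile_h,
                   c * tile_w:(c + 1) * tile_w] = tile
```

From there, a downsampled preview for an interactive plot can be read with strided slicing (e.g. mosaic[::20, ::20]) without pulling the full-resolution mosaic into memory.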
