This question already has answers here: Efficient serialization of numpy boolean arrays (3 answers). Closed 8 years ago.
I have many 100x100px black/white GIF images.
I want to use them in Numpy to train a machine learning algorithm, but I would like to save them in a single file that is easily readable in Python/Numpy.
By many I mean several hundred thousand, so I would like to take advantage of the fact that the images carry only 1 bit per pixel.
Any idea on how I can do this?
EDIT:
I used a BitArray object from the bitstring module, then saved it using numpy.savez. The problem is that it takes ages to save; I never managed to see the end of the process on the entire dataset. I tried saving a small subset, and it took 10 minutes and produced a file about 20 times the size of the subset itself.
I will try with the BoolArray, thanks for the reference.
EDIT (solved):
I solved the problem by using a different approach from those that I found in the questions you linked. I found the numpy.packbits function here: numpy boolean array with 1 bit entries
I'm reporting my code here so it can be useful to others:
import numpy as np
from imageio import imread  # or whichever imread you already use

accepted_shape = (100, 100)
images = []
for file_path in gifs:
    img_data = imread(file_path)
    if img_data.shape != accepted_shape:
        continue
    # threshold at the midpoint between the minimum and maximum pixel values
    max_value = img_data.max()
    min_value = img_data.min()
    middle_value = (min_value + max_value) // 2
    # pack the boolean mask into bits: 8 pixels per byte
    image = np.packbits((img_data.ravel() > middle_value).astype(np.uint8))
    images.append(image)
images = np.vstack(images)
np.savez_compressed('dataset.npz', shape=accepted_shape, images=images)
This just requires some attention when decompressing, because if the number of bits is not a multiple of 8, some zeros are added as padding. This is how I decompress the files:
data = np.load('dataset.npz')
shape = tuple(data['shape'])
images = data['images']
nf = np.prod(shape)                     # bits (pixels) per image
images = np.unpackbits(images, axis=1)  # one value per pixel again
images = images[:, :nf]                 # drop the zero padding added by packbits
ne = images.size // nf                  # number of examples
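If you then need each row back as a 100x100 image, a final reshape using the stored shape recovers the image stack (this assumes the variables from the snippet above):

images = images.reshape((-1,) + shape)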
PyTables seems like a good option here. Something like this might work:
import numpy as np
import tables as tb

nfiles = 100000  # or however many files you have
h5file = tb.open_file('data.h5', mode='w', title='Test Array')
root = h5file.root
# a chunked array on disk; one 100x100 slice per image
x = h5file.create_carray(root, 'x', tb.Float64Atom(), shape=(100, 100, nfiles))
x[:100, :100, 0] = np.random.random(size=(100, 100))  # now put in some data
h5file.close()
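Since the images carry only 1 bit per pixel, a Float64Atom is fairly wasteful. As a sketch (assuming the same layout as above), a BoolAtom plus PyTables' transparent compression should shrink the file considerably, and single images can be read back without loading the whole array:

import numpy as np
import tables as tb

filters = tb.Filters(complevel=5, complib='zlib')  # transparent compression
h5file = tb.open_file('data.h5', mode='w')
x = h5file.create_carray(h5file.root, 'x', tb.BoolAtom(),
                         shape=(100, 100, 100000), filters=filters)
x[:, :, 0] = np.random.random(size=(100, 100)) > 0.5  # one boolean image
first = x[:, :, 0]  # reads just this slice from disk
h5file.close()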
I am new to Python and FITS image files, so I am running into issues. I have two FITS files; the first FITS file is pixels/counts and the second FITS file (calibration file) is pixels/wavelength. I need to convert pixels/counts into wavelength/counts. Once this is done, I need to output wavelength/counts as a new FITS file for further analysis. So far I have managed to array the required data as shown in the code below.
import numpy as np
from astropy.io import fits
# read the images
image_file = ("run_1.fits")
image_calibration = ("cali_1.fits")
hdr = fits.getheader(image_file)
hdr_c = fits.getheader(image_calibration)
# print headers
sp = fits.open(image_file)
print('\n\nHeader of the spectrum :\n\n', sp[0].header, '\n\n')
sp_c = fits.open(image_calibration)
print('\n\nHeader of the calibration :\n\n', sp_c[0].header, '\n\n')
# generation of arrays with the wavelengths and counts
count = np.array(sp[0].data)
wave = np.array(sp_c[0].data)
I do not understand how to save two separate arrays into one FITS file. I tried an alternative approach by creating a list, as shown in this code:
file_list = fits.open(image_file)
calibration_list = fits.open(image_calibration)
image_data = file_list[0].data
calibration_data = calibration_list[0].data
# make a list to hold images
img_list = []
img_list.append(image_data)
img_list.append(calibration_data)
# list to numpy array
img_array = np.array(img_list)
# save the array as fits - image cube
fits.writeto('mycube.fits', img_array)
However, I could only save it as a cube, which is not correct because I just need the wavelength and counts data. Also, I lost all the headers in the newly created FITS file. To say I am lost is an understatement! Could someone point me in the right direction please? Thank you.
I am still working on this problem. I have now managed (I think) to produce a FITS file containing the wavelength and counts using this website:
https://www.mubdirahman.com/assets/lecture-3---numerical-manipulation-ii.pdf
This is my code:
# Making a Primary HDU (required):
primaryhdu = fits.PrimaryHDU(flux)  # makes a header; or, if you have a header you've created: primaryhdu = fits.PrimaryHDU(arr1, header=head1)
# If you have additional extensions:
secondhdu = fits.ImageHDU(wave)
# Making a new HDU List:
hdulist1 = fits.HDUList([primaryhdu, secondhdu])
# Writing the file:
hdulist1.writeto("filename.fits", overwrite=True)
image = ("filename.fits")
hdr = fits.open(image)
image_data = hdr[0].data
wave_data = hdr[1].data
I am sure this is not the correct format for wavelength/counts. I need both wavelength and counts to be contained in hdr[0].data
If you are working with spectral data, it might be useful to look into specutils which is designed for common tasks associated with reading/writing/manipulating spectra.
It's common to store spectral data in FITS files using tables, rather than images. For example you can create a table containing wavelength, flux, and counts columns, and include the associated units in the column metadata.
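A minimal sketch of that idea with astropy.table (the column names, values, and spectrum.fits filename are just illustrative):

import numpy as np
import astropy.units as u
from astropy.table import Table

# hypothetical spectrum: wavelengths in Angstrom, integer counts
wavelength = np.linspace(4000.0, 7000.0, 1000)
counts = np.random.poisson(100, 1000)

t = Table()
t['wavelength'] = wavelength
t['wavelength'].unit = u.AA   # units travel with the column metadata
t['counts'] = counts
t['counts'].unit = u.ct
t.write('spectrum.fits', format='fits', overwrite=True)

t2 = Table.read('spectrum.fits')  # columns and units come back intact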
The docs include an example on how to create a generic "FITS table" writer with wavelength and flux columns. You could start from this example and modify it to suit your exact needs (which can vary quite a bit from case to case, which is probably why a "generic" FITS writer is not built-in).
You might also be able to use the fits-wcs1d format.
If you prefer not to use specutils, that example still might be useful as it demonstrates how to create an Astropy Table from your data and output it to a well-formatted FITS file.
I have 2 image folders containing 10k and 35k images. Each image is approximately (2k, 2k) in size.
I want to remove the images which are exact duplicates.
The variation between different images is just a change in some pixels.
I have tried DHashing, PHashing and AHashing, but as they are lossy image hashing techniques, they give the same hash for non-duplicate images too.
I also tried writing Python code that simply subtracts images; the pairs for which the resulting array is zero everywhere are duplicates of each other.
But the time for a single comparison is 0.29 seconds, and for a total of 350 million combinations that is really huge.
Is there a way to do it faster, without also flagging non-duplicate images?
I am open to doing it in any language (C, C++) or any approach (distributed computing, multithreading) that can solve my problem accurately.
Apologies if I added some of the irrelevant approaches as I am not from computer science background.
Below is the code I used for the Python approach:
import os
import timeit
import numpy as np
from skimage import io

start = timeit.default_timer()
duplicates = {}  # maps a file name in path1 to its duplicate in path2
for i in path1:
    img1 = io.imread(i)
    base1 = os.path.basename(i)
    for j in path2:
        img2 = io.imread(j)
        base2 = os.path.basename(j)
        # exact duplicates: identical shape and zero difference everywhere
        if np.array_equal(img1, img2):
            duplicates[base1] = base2
stop = timeit.default_timer()
print('Time: ', stop - start)
Use lossy hashing as a prefiltering step before a complete comparison. You can also generate thumbnail images (say 12 x 8 pixels) and compare them for similarity.
The idea is to perform quick rejection of very different images.
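As a minimal sketch of that prefiltering idea (all_paths is a hypothetical list of all image paths, and thumb_key is an illustrative helper): bucket images by a cheap thumbnail hash, then run the exact pixel comparison only within each bucket, so most pairs are never compared at all.

import hashlib
from collections import defaultdict

import numpy as np
from skimage import io
from skimage.transform import resize

def thumb_key(path, shape=(12, 8)):
    # lossy fingerprint: md5 of a tiny grayscale thumbnail
    img = io.imread(path).astype(np.float64)
    if img.ndim == 3:
        img = img.mean(axis=2)  # collapse colour channels
    small = resize(img, shape)  # tiny thumbnail (float)
    return hashlib.md5(np.round(small).astype(np.uint8).tobytes()).hexdigest()

buckets = defaultdict(list)
for path in all_paths:  # all_paths: hypothetical list of image paths
    buckets[thumb_key(path)].append(path)

# exact duplicates always land in the same bucket, so compare within buckets only
for paths in buckets.values():
    for a in range(len(paths)):
        img_a = io.imread(paths[a])
        for b in range(a + 1, len(paths)):
            if np.array_equal(img_a, io.imread(paths[b])):
                print(paths[a], 'duplicates', paths[b])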
You could also look for answers on how to delete duplicate files in general (not only images). Then you can use, for example, fdupes, or find an alternative tool: https://alternativeto.net/software/fdupes/
This code checks if there are any duplicates in a folder (it's a bit slow though):
import os
import time

import cv2
from image_similarity_measures.quality_metrics import rmse

def check(path_original, path_new):  # give raw strings
    original = cv2.imread(path_original)
    new = cv2.imread(path_new)
    return rmse(original, new)

def folder_check(folder_path):
    i = 0
    file_list = os.listdir(folder_path)
    print(file_list)
    duplicate_dict = {}
    for idx, file in enumerate(file_list):
        file_path = os.path.join(folder_path, file)
        # compare each pair only once; never mutate file_list while iterating
        for file_compare in file_list[idx + 1:]:
            print(i)
            i += 1
            file_compare_path = os.path.join(folder_path, file_compare)
            similarity_score = check(file_path, file_compare_path)
            if similarity_score == 0.0:  # RMSE of 0 means pixel-identical
                print(file, file_compare)
                duplicate_dict[file] = file_compare
    return duplicate_dict

start_time = time.time()
print(folder_check(r"C:\Users\Admin\Linear-Regression-1\image-similarity-measures\input1"))
end_time = time.time()
stamp = end_time - start_time
print(stamp)
I am working on a problem which involves a batch of 19 tokens, each with 400 features. I get the shape (19, 1, 400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,), but I am trying to get (19, 400). I have tried converting to a list, squeezing and raveling, but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        tokens = [np.zeros(400)] * largest
        print(np.array(tokens).shape)
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a NumPy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code, revised:
import numpy as np
attns_token = np.zeros(shape=(1,200))
inner_token = np.zeros(shape=(1,200))
largest = 19
tokens = np.zeros(shape=(largest,400))
for j in range(largest):
tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... It makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
This question may be a little specialist, but hopefully someone might be able to help. I normally use IDL, but for developing a pipeline I'm looking to use Python to improve running times.
My FITS file handling setup is as follows:
import numpy
from astropy.io import fits
#Directory: /Users/UCL_Astronomy/Documents/UCL/PHASG199/M33_UVOT_sum/UVOTIMSUM/M33_sum_epoch1_um2_norm.img
with fits.open('...') as ima_norm_um2:
#Open UVOTIMSUM file once and close it after extracting the relevant values:
ima_norm_um2_hdr = ima_norm_um2[0].header
ima_norm_um2_data = ima_norm_um2[0].data
#Individual dimensions for number of x pixels and number of y pixels:
nxpix_um2_ext1 = ima_norm_um2_hdr['NAXIS1']
nypix_um2_ext1 = ima_norm_um2_hdr['NAXIS2']
#Compute the size of the images (you can also do this manually rather than calling these keywords from the header):
#Call the header and data from the UVOTIMSUM file with the relevant keyword extensions:
corrfact_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))
coincorr_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))
#Check that the dimensions are all the same:
print(corrfact_um2_ext1.shape)
print(coincorr_um2_ext1.shape)
print(ima_norm_um2_data.shape)
# Make a new image file to save the correction factors:
hdu_corrfact = fits.PrimaryHDU(corrfact_um2_ext1, header=ima_norm_um2_hdr)
fits.HDUList([hdu_corrfact]).writeto('.../M33_sum_epoch1_um2_corrfact.img')
# Make a new image file to save the corrected image to:
hdu_coincorr = fits.PrimaryHDU(coincorr_um2_ext1, header=ima_norm_um2_hdr)
fits.HDUList([hdu_coincorr]).writeto('.../M33_sum_epoch1_um2_coincorr.img')
I'm looking to then apply the following corrections:
# Define the variables from Poole et al. (2008) "Photometric calibration of the Swift ultraviolet/optical telescope":
alpha = 0.9842000
ft = 0.0110329
a1 = 0.0658568
a2 = -0.0907142
a3 = 0.0285951
a4 = 0.0308063
for i in range(nxpix_um2_ext1 - 1): #do begin
for j in range(nypix_um2_ext1 - 1): #do begin
if (numpy.less_equal(i, 4) | numpy.greater_equal(i, nxpix_um2_ext1-4) | numpy.less_equal(j, 4) | numpy.greater_equal(j, nxpix_um2_ext1-4)): #then begin
#UVM2
corrfact_um2_ext1[i,j] == 0
coincorr_um2_ext1[i,j] == 0
else:
xpixmin = i-4
xpixmax = i+4
ypixmin = j-4
ypixmax = j+4
#UVM2
ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax])
xvec_UVM2 = ft*ima_UVM2sum
fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2*xvec_UVM2) + (a3*xvec_UVM2*xvec_UVM2*xvec_UVM2) + (a4*xvec_UVM2*xvec_UVM2*xvec_UVM2*xvec_UVM2)
Ctheory_UVM2 = - alog(1-(alpha*ima_UVM2sum*ft))/(alpha*ft)
corrfact_um2_ext1[i,j] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum)
coincorr_um2_ext1[i,j] = corrfact_um2_ext1[i,j]*ima_sk_um2[i,j]
The above snippet is where it is messing up, as I have a mixture of IDL syntax and Python syntax. I'm just not sure how to convert certain aspects of IDL to Python. For example, I'm not quite sure how to handle ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax]).
I'm also missing the part where it updates the correction factor and coincidence correction image files. If anyone has the patience to go over it with a fine-tooth comb and suggest the necessary changes, that would be excellent.
The original normalised image can be downloaded here: Replace ... in above code with this file
One very important thing about numpy is that it applies every mathematical and comparison function element-wise. So you probably don't need to loop through the arrays at all.
So maybe start where you convolve your image with a sum filter. For 2D images this can be done with astropy.convolution.convolve or scipy.ndimage.uniform_filter.
I'm not sure exactly what you want, but I think you want a 9x9 sum filter. Note that uniform_filter computes the window mean, so multiply by the window area to get the sum:
from scipy.ndimage import uniform_filter
ima_UVM2sum = uniform_filter(ima_norm_um2_data, size=9) * 81  # 9x9 mean * 81 = 9x9 sum
Since you want to discard any pixels at the borders (4 pixels), you can simply slice them away:
ima_UVM2sum_valid = ima_UVM2sum[4:-4,4:-4]
This ignores the first and last 4 rows and the first and last 4 columns (the latter is achieved by making the stop value negative).
Now you want to calculate the corrections:
xvec_UVM2 = ft*ima_UVM2sum_valid
fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2**2) + (a3*xvec_UVM2**3) + (a4*xvec_UVM2**4)
Ctheory_UVM2 = - np.log(1-(alpha*ima_UVM2sum_valid*ft))/(alpha*ft)
These are all arrays, so you still do not need to loop.
But then you want to fill your two images. Be careful, because the correction array is smaller (we ignored the first and last 4 rows/columns), so you have to write into the same region of the correction images, and slice ima_sk_um2 the same way so the shapes match:
corrfact_um2_ext1[4:-4,4:-4] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum_valid)
coincorr_um2_ext1[4:-4,4:-4] = corrfact_um2_ext1[4:-4,4:-4] * ima_sk_um2[4:-4,4:-4]
Still no loop, just numpy's mathematical functions. This means it is much faster (MUCH faster!) and does the same thing.
Maybe I have forgotten some slicing, which would yield a "not broadcastable" error; if so, please report back.
Just a note about your loop: Python's first axis is the second axis in FITS, and the second axis is the first FITS axis. So if you need to loop over the axes, bear that in mind so you don't end up with IndexErrors or unexpected results.
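Putting the pieces together, a sketch of the fully vectorized correction might look like the following (the input filename is a placeholder, and ima_sk_um2 is assumed to be a sky image of the same shape loaded elsewhere):

import numpy
from astropy.io import fits
from scipy.ndimage import uniform_filter

# constants from Poole et al. (2008)
alpha = 0.9842000
ft = 0.0110329
a1, a2, a3, a4 = 0.0658568, -0.0907142, 0.0285951, 0.0308063

with fits.open('M33_sum_epoch1_um2_norm.img') as hdul:  # placeholder path
    hdr = hdul[0].header
    data = hdul[0].data

ima_UVM2sum = uniform_filter(data, size=9) * 81  # 9x9 box sum (mean * 81)
valid = ima_UVM2sum[4:-4, 4:-4]                  # discard 4-pixel borders

xvec = ft * valid
fxvec = 1 + a1*xvec + a2*xvec**2 + a3*xvec**3 + a4*xvec**4
Ctheory = -numpy.log(1 - alpha*valid*ft) / (alpha*ft)

corrfact = numpy.zeros_like(data)
coincorr = numpy.zeros_like(data)
corrfact[4:-4, 4:-4] = Ctheory * (fxvec / valid)
coincorr[4:-4, 4:-4] = corrfact[4:-4, 4:-4] * ima_sk_um2[4:-4, 4:-4]

fits.HDUList([fits.PrimaryHDU(corrfact, header=hdr)]).writeto(
    'M33_sum_epoch1_um2_corrfact.img', overwrite=True)
fits.HDUList([fits.PrimaryHDU(coincorr, header=hdr)]).writeto(
    'M33_sum_epoch1_um2_coincorr.img', overwrite=True)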
I made a pickle file storing the grayscale value of each pixel in 100,000 80x80 images
(plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is
4 bytes x 80 x 80 x 100,000 = 2.56 GB,
plus the array of integers, which shouldn't be that large.
The generated pickle file, however, is over 16GB, so it takes hours just to unpickle and load it, and it eventually freezes after taking up all available memory.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
print('hello')
for f in filenames:
try:
im = Image.open(os.path.join(root,f))
Imv=im.load()
x,y=im.size
pixelv = numpy.empty(6400)
ind=0
for ii in range(x):
for j in range(y):
temp=float(Imv[j,ii])
temp=float(temp/255.0)
pixelv[ind]=temp
ind+=1
if i<40000:
trainpixels[tr]=pixelv
tr+=1
elif i<45000:
validpixels[va]=pixelv
va+=1
else:
testpixels[te]=pixelv
te+=1
print(str(i)+'\t'+str(f))
i+=1
except IOError:
continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)
output = open('data.pkl','wb')
pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
output.close()
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."
The following test case takes 24kb on my system and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle
testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0
test_labels_size = sys.getsizeof(testlabels)  # 80
output = open('/tmp/pickle', 'wb')
pickle.dump(testlabels, output)
output.close()  # close before measuring, or buffered bytes are not counted
print(os.path.getsize('/tmp/pickle'))
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python -- non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data, or pickle fewer objects.
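One more thing worth checking, as an assumption on my part since the protocol isn't shown: on Python 2, pickle.dump defaults to the ASCII protocol 0, which inflates numpy float arrays enormously (each value is written out as text). Passing a binary protocol usually brings the file back to roughly the raw data size:

import pickle
import numpy

arr = numpy.zeros((1000, 6400))  # hypothetical stand-in for one of your arrays
with open('data.pkl', 'wb') as output:
    # protocol 2 (or pickle.HIGHEST_PROTOCOL) stores the array buffer in binary
    pickle.dump(arr, output, protocol=pickle.HIGHEST_PROTOCOL)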