I am working on a problem which involves a batch of 19 tokens, each with 400 features. I get shape (19, 1, 400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,), but I am trying to get (19, 400). I have tried converting to a list, squeezing, and raveling, but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        # largest is the batch size (19); each entry should become a 400-feature vector
        tokens = [np.zeros((400))] * largest
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            # each concatenation of two (1, 200) vectors yields shape (1, 400)
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a NumPy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code revised...
import numpy as np

attns_token = np.zeros(shape=(1, 200))
inner_token = np.zeros(shape=(1, 200))
largest = 19
tokens = np.zeros(shape=(largest, 400))
for j in range(largest):
    tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
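If you'd rather keep the list-of-arrays approach from your original function, you can also fix the shape after the fact. A minimal sketch, assuming each concatenated token has shape (1, 400) as in your question:

import numpy as np

tokens = [np.concatenate([np.zeros((1, 200)), np.zeros((1, 200))], axis=1) for _ in range(19)]
stacked = np.array(tokens)         # shape (19, 1, 400)
fixed = stacked.squeeze(axis=1)    # shape (19, 400); squeezing only axis 1 avoids collapsing further
# equivalently: fixed = np.concatenate(tokens, axis=0)
print(fixed.shape)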
BTW... It makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
I am using numpy arrays instead of pandas for speed purposes. However, I have not managed to rewrite my code with broadcasting, indexing, etc. Instead, I am using nested loops as below. It works, but it seems ugly and inefficient to me.
Basically, I am trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i] (you can think of column 1 as a firm id). Then, for each entry in the lookup data, I check whether it is entirely contained in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I noted at the beginning, I am uncomfortable with this.
out = []
for i in np.unique(mydata[:,1]):
    d = mydata[mydata[:,1]==i]
    for u in range(0, len(lookup)):
        control = all(np.isin(lookup[u], d[:,3]))
        if control:
            out.append(d[np.isin(d[:,3], lookup[u])])
It takes about 0.27 seconds. However, there must be a cleverer alternative.
I also tried Numba's jit(), but it did not work.
Could anyone help me with this?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0, 60):
    lookup.append(np.random.randint(10, 70, np.random.randint(3, 6, 1)))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to 3 or 4 lines.
f = (
    lambda lu: out.append(d[np.isin(d[:, 3], lu)])
    if all(np.isin(lu, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list you got previously, and the code runs almost twice as fast (at least on my machine).
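For reference, here is a sketch of the same refactor without the lambda/map indirection, assuming mydata and lookup are defined as in your fake data; it should build the same out list:

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    # keep only lookup groups whose values all occur in this firm's column 3
    out.extend(d[np.isin(d[:, 3], lu)] for lu in lookup if np.isin(lu, d[:, 3]).all())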
Given an initial 2-D array:
initial = [
[0.6711999773979187, 0.1949000060558319],
[-0.09300000220537186, 0.310699999332428],
[-0.03889999911189079, 0.2736999988555908],
[-0.6984000205993652, 0.6407999992370605],
[-0.43619999289512634, 0.5810999870300293],
[0.2825999855995178, 0.21310000121593475],
[0.5551999807357788, -0.18289999663829803],
[0.3447999954223633, 0.2071000039577484],
[-0.1995999962091446, -0.5139999985694885],
[-0.24400000274181366, 0.3154999911785126]]
The goal is to multiply some randomly chosen values inside the array by a random percentage. Let's say only 3 random numbers get replaced by a random multiplier; we should get something like this:
output = [
[0.6711999773979187, 0.52],
[-0.09300000220537186, 0.310699999332428],
[-0.03889999911189079, 0.2736999988555908],
[-0.6984000205993652, 0.6407999992370605],
[-0.43619999289512634, 0.5810999870300293],
[0.84, 0.21310000121593475],
[0.5551999807357788, -0.18289999663829803],
[0.3447999954223633, 0.2071000039577484],
[-0.1995999962091446, 0.21],
[-0.24400000274181366, 0.3154999911785126]]
I've tried doing this:
def mutate(array2d, num_changes):
    for _ in range(num_changes):
        row, col = array2d.shape
        rand_row = np.random.randint(row)
        rand_col = np.random.randint(col)
        cell_value = array2d[rand_row][rand_col]
        # scale the chosen cell by a random factor in [0, 1)
        array2d[rand_row][rand_col] = random.uniform(0, 1) * cell_value
    return array2d
And that works for 2D arrays, but there's a chance that the same value is mutated more than once =(
I also don't think it's efficient, and it only works on 2D arrays.
Is there a way to do such "mutation" for array of any shape and more efficiently?
There's no restriction on which values the "mutation" can pick, but the number of mutations should be kept strictly to the user-specified number.
One fairly simple way would be to work with a raveled view of the array. You can generate all your numbers at once that way, and make it easier to guarantee that you won't process the same index twice in one call:
def mutate(array_anyd, num_changes):
    raveled = array_anyd.reshape(-1)
    indices = np.random.choice(raveled.size, size=num_changes, replace=False)
    values = np.random.uniform(0, 1, size=num_changes)
    raveled[indices] *= values
I use array_anyd.reshape(-1) in favor of array_anyd.ravel() because according to the docs, the former is less likely to make an inadvertent copy.
There is of course still such a possibility. You can add an extra check to write back if you need to. A more efficient way would be to use np.unravel_index to avoid creating a view to begin with:
def mutate(array_anyd, num_changes):
    indices = np.random.choice(array_anyd.size, size=num_changes, replace=False)
    indices = np.unravel_index(indices, array_anyd.shape)
    values = np.random.uniform(0, 1, size=num_changes)
    array_anyd[indices] *= values
There is no need to return anything because the modification is done in-place. Conventionally, such functions do not return anything. See for example list.sort vs sorted.
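A quick usage sketch of the first version above (assuming a float array, since the update happens in place):

import numpy as np

arr = np.random.rand(10, 2)    # any shape works, e.g. a (10, 2) array like the one in the question
mutate(arr, num_changes=3)     # modifies arr in place: exactly 3 entries are scaled by a factor in [0, 1)
print(arr)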
Using np.random.shuffle instead of np.random.choice, here is a different solution. It works on an array of any shape.
def mutate(arrayIn, num_changes):
    mult = np.zeros(arrayIn.ravel().shape[0])
    mult[:num_changes] = np.random.uniform(0, 1, num_changes)
    np.random.shuffle(mult)
    mult = mult.reshape(arrayIn.shape)
    arrayIn = arrayIn + mult * arrayIn
    return arrayIn
I am trying to get doc2vec to work in Python 3.
I have the following code:
tekstdata = [[index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]

def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)

ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)
Everything works, as far as I can tell, until the rank line.
The error I get says the value is not in my list, i.e. the documents I am putting in do not have '10' in the list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the most_similar call that does not work.
The model trains on my data (1000 documents) and builds a vocabulary which is tagged.
The documentation I have mainly used is this:
Gensim documentation
Tutorial
I hope that someone can help. If any additional info is needed, please let me know.
best
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's a bit hard to tell from your code excerpts what form tekstdata is in by the time it is provided to Doc2Vec. The flow is a bit convoluted, and there's nothing showing what the data looks like in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tag of '10', it will be seen as ['1', '0'] – and len(vectorModel.docvecs.doctags) will be just 10 (for the 10 single-digit strings).
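For example, a sketch of the likely fix: wrap the tag in a list so each document gets one whole-string tag (the sample tokens and index here are made up for illustration):

import gensim

y = ['the', 'board', 'has', 'two', 'female', 'members']   # hypothetical token list
x = 10                                                    # hypothetical document index

bad = gensim.models.doc2vec.TaggedDocument(y, str(x))                # tags='10' is seen as ['1', '0']
good = gensim.models.doc2vec.TaggedDocument(words=y, tags=[str(x)])  # tags=['10'] is one whole tag
print(bad.tags, good.tags)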
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025)
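For instance, a minimal sketch of that last point, reusing vectorModel and the token list y from your loop (parameter names as in gensim 3.x; check your installed version's infer_vector signature):

# more inference passes and a bulk-training-like starting alpha often give more stable vectors
inferred_vector = vectorModel.infer_vector(y, steps=100, alpha=0.025)
sims = vectorModel.docvecs.most_similar([inferred_vector], topn=10)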
This question may be a little specialised, but hopefully someone will be able to help. I normally use IDL, but for developing a pipeline I'm looking to use Python to improve running times.
My FITS file handling setup is as follows:
import numpy
from astropy.io import fits

#Directory: /Users/UCL_Astronomy/Documents/UCL/PHASG199/M33_UVOT_sum/UVOTIMSUM/M33_sum_epoch1_um2_norm.img
with fits.open('...') as ima_norm_um2:
    #Open UVOTIMSUM file once and close it after extracting the relevant values:
    ima_norm_um2_hdr = ima_norm_um2[0].header
    ima_norm_um2_data = ima_norm_um2[0].data

#Individual dimensions for number of x pixels and number of y pixels:
nxpix_um2_ext1 = ima_norm_um2_hdr['NAXIS1']
nypix_um2_ext1 = ima_norm_um2_hdr['NAXIS2']

#Compute the size of the images (you can also do this manually rather than calling these keywords from the header):
#Call the header and data from the UVOTIMSUM file with the relevant keyword extensions:
corrfact_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))
coincorr_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))

#Check that the dimensions are all the same:
print(corrfact_um2_ext1.shape)
print(coincorr_um2_ext1.shape)
print(ima_norm_um2_data.shape)

# Make a new image file to save the correction factors:
hdu_corrfact = fits.PrimaryHDU(corrfact_um2_ext1, header=ima_norm_um2_hdr)
fits.HDUList([hdu_corrfact]).writeto('.../M33_sum_epoch1_um2_corrfact.img')

# Make a new image file to save the corrected image to:
hdu_coincorr = fits.PrimaryHDU(coincorr_um2_ext1, header=ima_norm_um2_hdr)
fits.HDUList([hdu_coincorr]).writeto('.../M33_sum_epoch1_um2_coincorr.img')
I'm looking to then apply the following corrections:
# Define the variables from Poole et al. (2008) "Photometric calibration of the Swift ultraviolet/optical telescope":
alpha = 0.9842000
ft = 0.0110329
a1 = 0.0658568
a2 = -0.0907142
a3 = 0.0285951
a4 = 0.0308063
for i in range(nxpix_um2_ext1 - 1): #do begin
    for j in range(nypix_um2_ext1 - 1): #do begin
        if (numpy.less_equal(i, 4) | numpy.greater_equal(i, nxpix_um2_ext1-4) | numpy.less_equal(j, 4) | numpy.greater_equal(j, nxpix_um2_ext1-4)): #then begin
            #UVM2
            corrfact_um2_ext1[i,j] == 0
            coincorr_um2_ext1[i,j] == 0
        else:
            xpixmin = i-4
            xpixmax = i+4
            ypixmin = j-4
            ypixmax = j+4
            #UVM2
            ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax])
            xvec_UVM2 = ft*ima_UVM2sum
            fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2*xvec_UVM2) + (a3*xvec_UVM2*xvec_UVM2*xvec_UVM2) + (a4*xvec_UVM2*xvec_UVM2*xvec_UVM2*xvec_UVM2)
            Ctheory_UVM2 = - alog(1-(alpha*ima_UVM2sum*ft))/(alpha*ft)
            corrfact_um2_ext1[i,j] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum)
            coincorr_um2_ext1[i,j] = corrfact_um2_ext1[i,j]*ima_sk_um2[i,j]
The above snippet is where it goes wrong, as I have a mixture of IDL syntax and Python syntax. I'm just not sure how to convert certain aspects of IDL to Python. For example, I'm not quite sure how to handle the line ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax]).
I'm also missing the part where it updates the correction-factor and coincidence-correction image files. If anyone has the patience to go over it with a fine-tooth comb and suggest the necessary changes, that would be excellent.
The original normalised image can be downloaded here: Replace ... in above code with this file
One very important thing about numpy is that it applies every mathematical or comparison function on an element-by-element basis, so you probably don't need to loop through the arrays at all.
So maybe start where you convolve your image with a sum-filter. This can be done for 2D images with astropy.convolution.convolve or scipy.ndimage.filters.uniform_filter.
I'm not sure exactly what you want, but I think you want a 9x9 sum-filter, which would be realized by
from scipy.ndimage.filters import uniform_filter

# uniform_filter returns the local mean, so scale by the window area (9*9) to get the local sum
ima_UVM2sum = uniform_filter(ima_norm_um2_data, size=9) * 81
since you want to discard any pixels that are at the borders (4 pixels), you can simply slice them away:
ima_UVM2sum_valid = ima_UVM2sum[4:-4,4:-4]
This ignores the first and last 4 rows and the first and last 4 columns (last is realized by making the stop value negative)
now you want to calculate the corrections:
xvec_UVM2 = ft*ima_UVM2sum_valid
fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2**2) + (a3*xvec_UVM2**3) + (a4*xvec_UVM2**4)
Ctheory_UVM2 = -np.log(1 - (alpha*ima_UVM2sum_valid*ft)) / (alpha*ft)   # IDL's alog is numpy's np.log (natural logarithm)
these are all arrays so you still do not need to loop.
But then you want to fill your two images. Be careful, because the correction is smaller (we ignored the first and last rows/columns), so you have to take the same region in the correction images:
corrfact_um2_ext1[4:-4,4:-4] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum_valid)
coincorr_um2_ext1[4:-4,4:-4] = corrfact_um2_ext1[4:-4,4:-4] * ima_sk_um2[4:-4,4:-4]
still no loop, just numpy's mathematical functions. This means it is much faster (MUCH faster!) and does the same thing.
Maybe I have forgotten some slicing, which would yield a broadcasting error; if so, please report back.
Just a note about your loop: Python's first axis is the second FITS axis, and Python's second axis is the first FITS axis. So if you need to loop over the axes, bear that in mind so you don't end up with IndexErrors or unexpected results.
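A small sketch of that axis convention, assuming the header and data variables from your question:

# NAXIS1 (the FITS x axis) is the last numpy axis, NAXIS2 (the FITS y axis) is the first
assert ima_norm_um2_data.shape == (ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1'])
# so an explicit loop, if you ever need one, should index data[j, i] with j over NAXIS2 and i over NAXIS1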
I have a series of maps with two different indices, i and j. Let this be indexed like map_series[i][j].
EDIT 1/21: A minimal working example would be something like
map_series=np.array([np.array([np.arange(12) + 0.1*(i+1) + 0.01*(j+1) for j in range(3)]) for i in range(5)])
I'd like to apply the same mask to each map; if map_series were one-dimensional, each of the approaches below would work.
I can imagine a few different ways of applying these maps:
(A) Applying the mask to the whole array:
map_series_ma = hp.ma(map_series)
map_series_ma.mask = predefined_mask
(B1) Applying the mask to each element of the array:
map_series_ma = np.zeros_like(map_series)
for i in range(len(map_series)):
    for j in range(len(map_series[0])):
        temp = hp.ma(map_series[i][j])
        temp.mask = predefined_mask
        map_series_ma[i][j] = temp
(B2) Applying the mask to each element of the array:
map_series_ma = np.zeros_like(map_series)
for i in range(len(map_series)):
    for j in range(len(map_series[0])):
        map_series_ma[i][j] = hp.ma(map_series[i][j])
        map_series_ma[i][j].mask = predefined_mask
(C) Pythonically enumerating the list:
map_series_ma = np.array([hp.ma(map_series[i][j]) for j in range(j_max) for i in range(i_max)])
map_series_ma.mask = predefined_mask
All of these fail to give my desired output, however.
Upon trying (A) or (C) I get an error after the first step, telling me TypeError: bad number of pixels.
Upon trying (B1) I don't get an error, but none of the elements of map_series_ma have masks; in fact, they do not even appear to be masked arrays. Oddly enough, though, when I return temp it does have the appropriate mask.
Upon trying (B2) I get the error
AttributeError: 'numpy.ndarray' object has no attribute 'mask' (which, after looking at my syntax, I totally understand!)
I'm a little confused how to go about this. Both (A) and (B1) seem acceptable to me...
Any help is much appreciated,
Thanks,
Sam
this works for me:
import numpy as np
import healpy as hp

map_series = np.array([np.array([np.arange(12) + 0.1*(i+1) + 0.01*(j+1) for j in range(3)]) for i in range(5)])

# build a list of masked maps (a list, rather than a bare map object, also works under Python 3)
map_series_ma = [hp.ma(x) for x in map_series]

pm = [True, True, True, True, True, True, False, False, False, False, False, False]

for m in map_series_ma:
    for mm in m:
        mm.mask = pm
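A quick check of the result (a sketch; hp.ma returns numpy masked arrays, so the mask and the filled values can be inspected directly):

print(type(map_series_ma[0]))                      # a numpy masked array wrapping one (3, 12) block
print(map_series_ma[0][0].mask)                    # the first six pixels of each map are now masked
print(map_series_ma[0][0].filled(hp.UNSEEN)[:6])   # masked pixels read back as UNSEEN when filled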