I am currently refactoring some code in which I see both of these lines being used:
foo = df['bar'].values[0]
foo = df['bar'].iloc[0]
From my current understanding, both lines do the same thing: retrieve the first value of the pandas Series.
Are they really the same?
If yes, is one way more recommended than the other (due to subtleties in the internals, speed, behavior when setting a value rather than getting one, etc.)?
df.values is an attribute that returns a numpy.ndarray (i.e. it can be used on its own, without square brackets):
df[col].values        # the whole column as a numpy array
df[col].values[0]     # 1st element of the numpy array
df[col].values[1:3]   # 2nd and 3rd elements of the numpy array
Meanwhile, df.iloc is a position-based indexer for getting elements out of a DataFrame. iloc must be used with square brackets; on its own it merely returns the indexer object rather than any data.
df.iloc              # the indexer object itself, not data
df.iloc[row, col]    # returns a cell (scalar), a row/column (`Series`), or a sub-table (`DataFrame`) depending on the input
The subtle difference lies in the object being returned, and also in the implementation behind the scenes.
iloc reads the value through pandas' positional-indexing machinery and returns it directly.
values first materializes the underlying data as a numpy.ndarray (for a single column this is usually just the backing array; for a mixed-dtype DataFrame it forces a conversion and a copy) and then indexes that array with plain numpy indexing, so the relative speed depends on whether that conversion is needed.
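For a quick sanity check, here is a minimal toy example (data invented for illustration) showing that both spellings retrieve the same scalar from a single column:

import pandas as pd

df = pd.DataFrame({'bar': [3.14, 2.72, 1.62]})

# Both return the first element of the column.
print(df['bar'].values[0])   # 3.14 -- plain numpy indexing on the backing array
print(df['bar'].iloc[0])     # 3.14 -- goes through pandas' positional indexer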
I am working with a data array A that contains some zero-valued "isles" in the middle, clearly visible when the array is plotted. In those areas, the A array is zero. It is assured that the remaining values are nonzero, even if only on the order of 1e-9.
What I would like to do is to make the function "continuous", meaning I would like to substitute the zero values with the nonzero value that the array had before becoming zero.
Is there a fast general way this could be implemented? This is the first example I got, but future results may involve even more "isles".
I tried using np.where, but it does not seem to support a command such as "if zero, substitute with previous nonzero value in array". Or at least, I don't know how to do that.
If so, how could I write it in code?
Given a data array a, you could do this:
while np.any(a == 0):
    indices = np.where(a == 0)[0]
    # Copy the value just to the left into each zero; note that a leading
    # zero would pick up a[-1], i.e. wrap around to the end of the array.
    a[indices] = a[indices - 1]
If you would like to avoid loops, here is another, faster solution:
zeroindices = np.where(a == 0)[0]
nonzeroindices = np.where(a != 0)[0]
# For each zero position, searchsorted finds where it would slot into the
# sorted nonzero positions; subtracting 1 gives the nearest nonzero to its left.
previndices = np.searchsorted(nonzeroindices, zeroindices) - 1
a[zeroindices] = a[nonzeroindices[previndices]]
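A quick check of the vectorized version on a toy array with two zero "isles" (values invented for illustration):

import numpy as np

a = np.array([5., 7., 0., 0., 3., 0., 9.])
zeroindices = np.where(a == 0)[0]
nonzeroindices = np.where(a != 0)[0]
previndices = np.searchsorted(nonzeroindices, zeroindices) - 1
a[zeroindices] = a[nonzeroindices[previndices]]
print(a)   # [5. 7. 7. 7. 3. 3. 9.]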
I'm attempting to resize a column in a FITS BinTableHDU after it has been created, but I can't seem to find a way to do this in the astropy documentation.
For example, suppose a simple FITS file was created with a BinTableHDU containing a single column of length 3:
from astropy.io import fits
from astropy.table import Table
import numpy as np
hdul = fits.HDUList()
hdul.append(fits.PrimaryHDU())
data = Table({'test':[42,42,42]})
hdul.append(fits.BinTableHDU(data))
hdul.writeto('test.fits', overwrite=True)
And then later, I want to reopen the file, change the column and save it out to a new file:
hdul = fits.open('test.fits')
hdul[1].data['test'] = np.array([27, 27])
hdul.writeto('new_test.fits', overwrite=True)
My hope was that, by replacing the column with a new numpy array instance, it would overwrite the old one. But I get the following error:
ValueError: could not broadcast input array from shape (2,) into shape (3,)
That error is not too surprising, given the difference in dimensions, but I'm looking for a way to completely replace the column, or otherwise change its shape.
One thing to note: the column is of type numpy.ndarray:
col_data = hdul[1].data['test']
print(type(col_data))
print(col_data)
which shows:
<class 'numpy.ndarray'>
[42 42 42]
However, the usual method for resizing the array doesn't seem to work:
hdul[1].data['test'].resize((2,))
throws:
ValueError: cannot resize this array: it does not own its data
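A quick check (reusing the hdul from above) confirms what the message says: the column is a view into the record array backing the HDU rather than a standalone array, and numpy refuses to resize views:

col_data = hdul[1].data['test']
print(col_data.flags.owndata)   # False: the array is a view and does not own its memory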
There is other strange behavior as well. If I try to replace it with a single-element array, rather than throwing an error, it replaces every element with the scalar:
hdul[1].data['test'] = np.array([27])
print(col_data)
shows:
[27 27 27]
I realize that one may point out that I should just change the original dimensions of the column as it is created. But in my particular use case, I need to modify it after the creation of the BinTableHDU. That's what I'm trying to accomplish here.
An easy way to update binary table data with a different size in a FITS file
In the example from the question, only one table column needs to be updated and with a different size. The Table class has a replace_column function to help with this.
hdul = fits.open('test.fits')
table = Table(hdul[1].data)
table.replace_column('test', [27, 27])
hdul[1] = fits.BinTableHDU(table)
hdul.writeto('new_test.fits', overwrite=True)
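As a quick check (reusing the file name from above), reopening the new file should show the two-element column:

check = fits.open('new_test.fits')
print(check[1].data['test'])   # [27 27]
check.close()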
However, this is a little complicated if multiple columns are involved. In the standard BinTableHDU implemented by astropy, all columns must be the same length. In order to use variable-length columns, special keywords must be used; see the astropy tutorial here. That being said, as long as every column ends up the same length, the code above should be OK.
When replacing the column with a single-element array, rather than throwing an error, astropy replaces every element with the scalar.
As mentioned above, the standard type of FITS binary table supported by astropy has fixed-length columns. Therefore, one behavior astropy has implemented for convenience is to broadcast a scalar across the whole column. While that broadcast doesn't seem like the right behavior for a single-column table, my guess is that it is overriding the behavior you expected. See the table examples in the quick overview: "A single column can be added to a table using syntax like adding a key-value pair to a dict. The value on the right hand side can be a list or numpy.ndarray of the correct size, or a scalar value that will be broadcast".
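A minimal illustration of that broadcasting rule, using a standalone Table (toy values):

table = Table({'test': [42, 42, 42]})
table['test'] = 99            # a scalar is broadcast to every row
print(list(table['test']))    # [99, 99, 99]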
With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs without errors but never sets the value to 1 for the respective predictions.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know mask.nonzero() adds a second pair of square brackets; nonzero() actually returns a tuple, and the first element of that tuple is an ndarray. But the DataFrame itself is indexed with only one set, and that's the important part.)
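To make the difference concrete, here is a tiny toy frame (names invented for illustration):

import pandas as pd

df = pd.DataFrame({'placed': [0, 0, 0]})

# Chained indexing: whether this ever reaches df depends on pandas
# internals (under copy-on-write it only changes a throwaway copy),
# and older pandas versions emit a SettingWithCopyWarning here.
df['placed'][1] = 1

# One set of brackets on the DataFrame itself: always modifies df.
df.loc[2, 'placed'] = 1
print(df)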
Some other notes
There are a couple notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking, when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do, and because of the way those features are designed it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say for sure, but you might be able to avoid the whole for-loop entirely for this code.
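For what it's worth, here is a hedged sketch of what that loop-free version might look like, assuming lm.predict can score all rows in one call and that X_test is row-aligned with df_test (both assumptions on my part, not something from the original code):

# Score everything at once, then pick the best prediction per id.
df_test['pred'] = lm.predict(X_test)
df_test['placed'] = 0
best_rows = df_test.groupby('id')['pred'].idxmax()   # row label of the best prediction per id
positive = df_test.loc[best_rows, 'pred'] > 0        # keep only strictly positive maxima
df_test.loc[best_rows[positive.values], 'placed'] = 1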
I'm quite new to Python; however, I have to complete an assignment and I am now stuck on one problem. I am trying to get the index of an element in a table A when some other parameter from this table corresponds to a value in a list B. Table A also contains a column "index" where all elements are numbered from 0 to the end. Moreover, the values in tableA.parameter1 and listB can coincide only once; multiple matches are not possible. So, to derive the necessary index, I use the line
t = tableA.index[tableA.parameter1 == listB[numberObservation]]
However, what I get as a result is something like:
Int64Index([2], dtype='int64')
If I use the variable t in this Int64Index format, it doesn't suit the further code I have to work with. Actually, I need only the 2 as an integer, without all this redundant rest.
Can somebody please help me to circumvent my problem? I am in total despair and would be grateful for any help.
Try .tolist()
t = tableA.index[tableA.parameter1 == listB[numberObservation]].tolist()
This should return
t = [2]
a list "without all the redundant rest" :)
What package is giving you Int64Index? This looks vaguely numpy-ish, but numpy arrays define __index__, so a single-element array of integer values will seamlessly operate as an index for sequence lookup.
Regardless, assuming t is supposed to be exactly one value, and it's a sequence type itself, you can just do:
t, = tableA.index[tableA.parameter1==listB[numberObservation]]
That trailing comma changes the line from straight assignment to iterable unpacking; it expects the right hand side to produce an iterable with exactly one value, and that one value is unpacked into t. If the iterable has 0 or 2+ values, you'll get a ValueError.
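A two-line illustration of the unpacking behavior:

t, = [2]       # t is now the plain integer 2
t, = [2, 3]    # ValueError: too many values to unpack (expected 1)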
I have a pandas Series and a function that I want to apply to each element of the Series. The function has an additional argument too. So far so good: see, for example, the existing question "python pandas: apply a function with arguments to a series".

Update

What if the argument itself varies, running over a given list?
I had to face this problem in my code and I found a straightforward solution, but it is quite specific and (even worse) does not use the apply method.
Here is a toy model code:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
t = [10, 20]
I want to multiply the elements in a['x'] by the elements in t. Here the function is quite simple, and len(t) matches len(a['x'].index), so I could just do:
a['t'] = t
a['x*t'] = a['x'] * a['t']
But what about if the function is more elaborate or the two lengths do not match?
What I would like is a command line like:
a['x'].apply(lambda x, y: x * y, args=t)
The point is that this specific line exits with an error: the args parameter can only supply a fixed set of extra arguments (e.g. args=(10,)) that is passed unchanged to every call. I do not see any 'place' to put the various elements of t.
What you're looking for is similar to what R calls "recycling", where operations on arrays of unequal length loops through the smaller array over and over as many times as needed to match the length of the longer array.
I'm not aware of any simple, built-in way to do this with numpy or pandas. What you can do is use np.tile to repeat your smaller array. Something like:
a.x * np.tile(t, len(a) // len(t))
This will only work if the longer array's length is an exact multiple of the shorter one's (hence the integer division //, since np.tile needs an integer repetition count).
The behavior you want is somewhat unusual. Depending on what you're doing, there may be a better way to handle it. Relying on the values to match up in the desired way just by repetition is a little fragile. If you have some way to match up the values in each array that you want to multiply, you could use the .map method of Series to select the right "other value" to multiply each element of your Series with.
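As a hedged sketch of that last idea (the key column and lookup values below are invented for illustration): if each row carries a key saying which multiplier it should get, .map pulls in the right one regardless of the two lengths:

import pandas as pd

a = pd.DataFrame({'x': [1, 2, 1], 'key': ['p', 'q', 'p']})
multipliers = pd.Series({'p': 10, 'q': 20})   # hypothetical lookup table

a['x*t'] = a['x'] * a['key'].map(multipliers)
print(a)   # rows keyed 'p' are multiplied by 10, rows keyed 'q' by 20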