I just began to learn Python and Pandas and I saw in many tutorials the use of the iloc function. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without the iloc function. So here is an example that yield the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end] [[1]]
y_train_noIloc = features [start:end] [[1]]
What is the difference between the two statements and what advantage do I have when using iloc? I'd appreicate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simplistic examples below, [row, col] indexing is not possible without using loc or iloc, as a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing rows and columns. If indexing on row or column, you can get away without it, and is referred to as 'slicing'.
However, I see in your example [start:end][[1]] has been used. It is generaly considered bad practice to have back-to-back square brackets in pandas, (e.g.: [][]), and generally an indication that a different (more efficient) approach should be taken - in this case, using iloc.
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when indexing (slicing) on row only. The following example does not use iloc and will return rows 0 through 3.
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference in [0:3] and [0, 3]. The former (slicing) uses a colon and will return rows or indexes 0 through 3. Whereas the latter uses a comma, and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as show here, and will return rows 0 through 3, for column index 0. Whereas this is not possible without the use of iloc.
df.iloc[0:3, 0]
I have the following data set:
Beginning Data Set:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I first create two new columns of boolean values called VolPrice and Check based on the values in my starting data set. I think want to create a third additional column called DoubleCheck where the value of this column should be True if either VolPrice OR Check are equal to True, otherwise the value of DoubleCheck should be false. Initially I got the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
but then I added .any() after each column within my statement to construct the DoubleCheck column. However this isn't working either because it is providing 'True' values throughout the DoubleCheck column even when there should be false values as shown below.
Code:
import pandas as pd
import numpy as np
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['VolPrice'] = np.where((Observations['Price']<Observations['Vol']) & (Observations['Vol']<Observations['Mx']), True, False)
Observations['Check'] = np.where(Observations['Vol']<Observations['Price'], True, False)
Observations['DoubleCheck'] = np.where((Observations['Check'].any()==True) or (Observations['VolPrice'].any()==True), True, False)
print(Observations)
Current Result:
ObjectID,Date,Price,Vol,Mx,VolPrice,Check,DoubleCheck
101,2017-01-01,,145,203,False,False,True
101,2017-01-02,,155,163,False,False,True
101,2017-01-03,67.0,140,234,True,False,True
101,2017-01-04,78.0,130,182,True,False,True
101,2017-01-05,58.0,178,202,True,False,True
101,2017-01-06,53.0,134,204,True,False,True
101,2017-01-07,52.0,134,183,True,False,True
101,2017-01-08,62.0,148,176,True,False,True
101,2017-01-09,42.0,152,193,True,False,True
101,2017-01-10,80.0,137,150,True,False,True
Desired Result:
ObjectID,Date,Price,Vol,Mx,VolPrice,Check,DoubleCheck
101,2017-01-01,,145,203,False,False,False
101,2017-01-02,,155,163,False,False,False
101,2017-01-03,67.0,140,234,True,False,True
101,2017-01-04,78.0,130,182,True,False,True
101,2017-01-05,58.0,178,202,True,False,True
101,2017-01-06,53.0,134,204,True,False,True
101,2017-01-07,52.0,134,183,True,False,True
101,2017-01-08,62.0,148,176,True,False,True
101,2017-01-09,42.0,152,193,True,False,True
101,2017-01-10,80.0,137,150,True,False,True
Use | for bitwise OR, working same like & for bitwise AND:
Observations['DoubleCheck'] = Observations['Check'] | Observations['VolPrice']
Or DataFrame.any with both columns:
Observations['DoubleCheck'] = Observations[['Check','VolPrice']].any(axis=1)
All together is possible without np.where:
Observations['VolPrice'] = (Observations['Price']<Observations['Vol']) & (Observations['Vol']<Observations['Mx'])
Observations['Check'] = Observations['Vol']<Observations['Price']
Observations['DoubleCheck'] = Observations['Check'] | Observations['VolPrice']
I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions because now the Series name has to be hashable, so your code gave me an error)
I've been reading over this and still find the subject a little confusing :
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Say I have a Pandas DataFrame and I wish to simultaneously set the first and last row elements of a single column to whatever value. I can do this :
df.iloc[[0, -1]].mycol = [1, 2]
which tells me A value is trying to be set on a copy of a slice from a DataFrame. and that this is potentially dangerous.
I could use .loc instead, but then I need to know the index of the first and last rows ( in constrast, .iloc allows me to access by location ).
What's the safest Pandasy way to do this ?
To get to this point :
# Django queryset
query = market.stats_set.annotate(distance=F("end_date") - query_date)
# Generate a dataframe from this queryset, and order by distance
df = pd.DataFrame.from_records(query.values("distance", *fields), coerce_float=True)
df = df.sort_values("distance").reset_index(drop=True)
Then, I try calling df.distance.iloc[[0, -1]] = [1, 2]. This raises the warning.
The issue isn't with iloc, it's when you access .mycol that a copy is created. You can do this all within iloc:
df.iloc[[0, -1], df.columns.get_loc('mycol')] = [1, 2]
Usually ix is used if you want mixed integer and label based access, but doesn't work in this case since -1 isn't actually in the index, and apparently ix isn't smart enough to know it should be the last index.
What you're doing is called chained indexing, you can use iloc just on that column to avoid the warning:
In [24]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
Out[24]:
a b c
0 1.589940 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.123221 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 0.344794 1.095861 -1.300339
In [25]:
df['a'].iloc[[0,-1]] ='foo'
df
Out[25]:
a b c
0 foo 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.12322 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 foo 1.095861 -1.300339
If you do it the other way then it raises the warning:
In [27]:
df.iloc[[0,-1]]['a'] ='foo'
C:\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\IPython\kernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':