I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.
In [26]: df.ix[2]
Out[26]:
A 1.027680
B 1.514210
C -1.466963
D -0.162339
Name: 2000-01-03 00:00:00
In [27]: df[2:3]
Out[27]:
A B C D
2000-01-03 1.02768 1.51421 -1.466963 -0.162339
I would expect df[2] to work the same way as df[2:3] to be consistent with Python indexing convention. Is there a design reason for not supporting indexing row by single integer?
Echoing @HYRY's answer, see the new docs in 0.11:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
Here we have new indexers: .iloc to explicitly support only integer-based indexing, and .loc to explicitly support only label-based indexing.
E.g., imagine this scenario:
In [1]: df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
In [2]: df
Out[2]:
A B
0 1.068932 -0.794307
2 -0.470056 1.192211
4 -0.284561 0.756029
6 1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
A B
4 -0.284561 0.756029
In [6]: df.loc[[2]]
Out[6]:
A B
2 -0.470056 1.192211
[] slices the rows (by integer location or index label) only
The primary purpose of the DataFrame indexing operator, [], is to select columns.
When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.
So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.
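As a minimal sketch of that column-lookup behavior (the df here is hypothetical, with a default integer index and columns 'A' through 'D'):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))

df['A']   # finds the column named 'A' and returns it as a Series
df[2]     # no column is named 2, so this raises KeyError: 2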
The DataFrame indexing operator completely changes behavior to select rows when slice notation is used
Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.
df[2:3]
This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.
df[6:20:3]
You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.
I almost never use this slice notation with the indexing operator, as it's not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
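For comparison, a small sketch (assuming a hypothetical df indexed by the strings 'a' through 'e') of both slice forms and their explicit equivalents:
import pandas as pd

df = pd.DataFrame({'A': range(5)}, index=list('abcde'))

df[1:3]          # rows by integer location, stop exclusive: 'b', 'c'
df.iloc[1:3]     # the explicit equivalent

df['b':'d']      # rows by label, stop inclusive: 'b', 'c', 'd'
df.loc['b':'d']  # the explicit equivalent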
You can think of a DataFrame as a dict of Series. df[key] tries to select the column by key and returns a Series object.
However, slicing inside [] slices the rows, because it is such a common operation.
You can read the documentation for details:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
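A tiny illustration of the dict-of-Series view (the df here is hypothetical):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df['A']    # like dict access: returns the column 'A' as a Series
list(df)   # like a dict, iterating yields the keys (column names): ['A', 'B']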
For index-based access to the pandas table, one can also convert the DataFrame to a NumPy array:
np_df = df.to_numpy()
(older pandas versions spelled this df.as_matrix(), which has since been removed) and then
np_df[i]
would work.
You can take a look at the source code.
DataFrame has a private method _slice() to slice the DataFrame, and it accepts an axis parameter to determine which axis to slice. DataFrame.__getitem__() doesn't set the axis when invoking _slice(), so _slice() slices along the default axis 0.
You can run a simple experiment that might help you:
print(df._slice(slice(0, 2)))
print(df._slice(slice(0, 2), 0))
print(df._slice(slice(0, 2), 1))
The first two calls slice the first two rows (axis 0 is the default); the third slices the first two columns (axis 1).
You can loop through the rows of the DataFrame like this:
for ad in range(len(dataframe_c)):    # len() gives the number of rows; .size would count every cell
    print(dataframe_c.values[ad])
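If you only need to visit every row, a sketch using pandas' own row iterator is usually cleaner than positional lookups:
for row in dataframe_c.itertuples(index=False):
    print(row)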
I would normally go for .loc/.iloc as suggested by Ted, but one may also select a row by transposing the DataFrame. To stay with the example above, df.T[2] gives you row 2 of df.
If you want to index multiple rows by their integer indexes, use a list of indexes:
idx = [2,3,1]
df.iloc[idx]
N.B. If idx is created using some rule, then you can also sort the dataframe by using .iloc (or .loc) because the output will be ordered by idx. So in a sense, iloc can act like a sorting function where idx is the sorting key.
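A quick sketch of that reordering effect (the df and values here are hypothetical):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})

idx = [2, 3, 1]
df.iloc[idx]   # rows come back in the order given by idx: 30, 40, 20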
Related
I have the following list
x = [1,2,3]
And the following df
Sample df
df = pd.DataFrame({'UserId':[1,1,1,2,2,2,3,3,3,4,4,4],'Origins':[1,2,3,2,2,3,7,8,9,10,11,12]})
Let's say I want to return the UserIds whose grouped Origins contain any of the values in the list.
Wanted result
pd.Series({'UserId':[1,2]})
What would be the best approach to do this? Maybe a groupby with a lambda, but I am having a little trouble formulating the condition.
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
I had considered using unique(), but that returns a numpy array. Since you wanted a series, I went with drop_duplicates().
IIUC, OP wants the UserIds whose Origins contain any of the values in the list x. If that is the case, the following, using pandas.Series.isin and pandas.unique, will do the work:
df_new = df[df['Origins'].isin(x)]['UserId'].unique()
[Out]:
[1 2]
Assuming one wants a series, one can convert the dataframe to a series as follows
df_new = pd.Series(df_new)
[Out]:
0 1
1 2
dtype: int64
If one wants to return a Series, and do it all in one step, instead of pandas.unique one can use pandas.DataFrame.drop_duplicates (see Steven Rumbaliski's answer).
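Putting that together, the one-step Series version reads:
df[df['Origins'].isin(x)]['UserId'].drop_duplicates()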
I have pandas Series like:
s = pd.Series([1,9,3,4,5], index = [1,2,5,3,9])
How can I obtain, say, the element 3, given that I do not know the exact elements in advance? I need to write a function that gets, say, the first element of the Series.
s[2] is interpreted as 'the element with index label 2' instead of 'the second element', because we do have an explicit index.
When I do not specify an index, slicing works fine, straight through the element positions.
But how can I prioritize selecting by position when positions overlap with index labels?
Like this, using boolean indexing:
s[s==3]
Given:
s = pd.Series([1,9,3,4,5], index = [1,2,5,3,9])
To find elements 3 and 9, use:
s[s.isin([9,3])]
Output:
2 9
5 3
dtype: int64
Update per comment below
Use iloc for integer location:
s.iloc[2]
Output:
3
I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem seems to be with your logical operator.
You should be using & (and) here instead of | (or), since you have to select the rows which are not 2020-05-01 and not 2020-05-04. With |, every row satisfies at least one of the two inequalities, so nothing is filtered out.
Note that pandas requires the bitwise operators & and | for element-wise conditions; Python's short-circuiting and/or do not work on Series.
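Applied to the snippet from the question, the corrected condition would be:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]
my_df2.head()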
You can use isin with negation ~ sign:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
The short explanation of your mistake with AND and OR was addressed by kanmaytacker.
Following are a few additional recommendations:
Indexing in pandas:
By label: .loc
By integer position: .iloc
Indexing by label also works without .loc, but it is slower, as it is composed of chained operations instead of a single internal operation consisting of nested loops (see here). Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
Your indexer is a boolean mask. This is true for numpy and, as a consequence, for pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
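For reference, a self-contained version of the example (the values of df are assumed here, chosen to reproduce the output shown above):
import pandas as pd

df = pd.DataFrame({'a': [4, 1, 0], 'b': [2, 3, 4], 'c': [5, 1, 6]})

mask = (df['a'] != 4) & (df['a'] != 1)   # boolean Series: False, False, True
df.loc[mask, :]                          # keeps only row 2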
For some reason, the following 2 calls to iloc / loc produce different behavior:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=range(3), B=range(3)))
>>> df.iloc[:1]
A B
0 0 0
>>> df.loc[:1]
A B
0 0 0
1 1 1
I understand that loc considers the row labels, while iloc considers the integer-based indices of the rows. But why is the upper bound for the loc call considered inclusive, while the iloc bound is considered exclusive?
Quick answer:
It often makes more sense to do end-inclusive slicing when using labels, because it requires less knowledge about other rows in the DataFrame.
Whenever you care about labels instead of positions, end-exclusive label slicing introduces position-dependence in a way that can be inconvenient.
Longer answer:
Any function's behavior is a trade-off: you favor some use cases over others. Ultimately the operation of .iloc is a subjective design decision by the Pandas developers (as the comment by @ALlollz indicates, this behavior is intentional). But to understand why they might have designed it that way, think about what makes label slicing different from positional slicing.
Imagine we have two DataFrames df1 and df2:
df1 = pd.DataFrame(dict(X=range(4)), index=['a','b','c','d'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b','c','z'])
df1 contains:
X
a 0
b 1
c 2
d 3
df2 contains:
X
b 0
c 1
z 2
Let's say we have a label-based task to perform: we want to get rows between b and c from both df1 and df2, and we want to do it using the same code for both DataFrames. Because b and c don't have the same positions in both DataFrames, simple positional slicing won't do the trick. So we turn to label-based slicing.
If .loc were end-exclusive, to get rows between b and c we would need to know not only the label of our desired end row, but also the label of the next row after that. As constructed, this next label would be different in each DataFrame.
In this case, we would have two options:
Use separate code for each DataFrame: df1.loc['b':'d'] and df2.loc['b':'z']. This is inconvenient because it means we need to know extra information beyond just the rows that we want.
For either dataframe, get the positional index first, add 1, and then use positional slicing: df.iloc[df.index.get_loc('b'):df.index.get_loc('c')+1]. This is just wordy.
But since .loc is end-inclusive, we can just say .loc['b':'c']. Much simpler!
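Using df1 and df2 from above, the same line works unchanged on both frames:
df1.loc['b':'c']   # rows 'b' and 'c' of df1 (positions 1 and 2)
df2.loc['b':'c']   # rows 'b' and 'c' of df2 (positions 0 and 1)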
Whenever you care about labels instead of positions, and you're trying to write position-independent code, end-exclusive label slicing re-introduces position-dependence in a way that can be inconvenient.
That said, maybe there are use cases where you really do want end-exclusive label-based slicing. If so, you can use @Willz's answer in this question:
df.loc[start:end].iloc[:-1]
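For example, with df1 from above, this keeps rows 'a' and 'b': 'c' is included by .loc and then dropped by .iloc[:-1]:
df1.loc['a':'c'].iloc[:-1]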
It might be a stupid question but it is driving me crazy. I have a corpus composed of 8807 articles:
print(type(doc_set))
<class 'pandas.core.series.Series'>
print(len(doc_set))
8807
From this list, I just want to select the first one. I have tried doc_set[1], but it retrieves 46 articles. Any idea how to select a specific article? Thanks
Try using the iloc indexer:
doc_set.iloc[0]
Docs [iloc]:
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above).
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
or the iat indexer:
doc_set.iat[0]
Docs [iat]:
Fast integer location scalar accessor.
Similarly to iloc, iat provides integer based lookups. You can also
set using these indexers.
P.S. iat should be faster than iloc, because the latter incurs some overhead.
I think you have duplicate values in your index.
Use iat if you need to select the first value of the Series:
doc_set = pd.Series([8,9,10], index=[1,1,1])
print (doc_set)
1 8
1 9
1 10
dtype: int64
print (doc_set[1])
1 8
1 9
1 10
dtype: int64
print (doc_set.iat[0])
8