I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem seems to be with your logical operator.
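You should be using & (and) here instead of | (or), since you want the rows whose date is neither 2020-05-01 nor 2020-05-04.
With |, every row passes: any date differs from at least one of the two values, so the combined mask is True everywhere and nothing is dropped. Short-circuiting has nothing to do with it; & and | work element-wise on the whole Series.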
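To see why | keeps every row, it can help to build both masks on a small frame. This is a minimal sketch, assuming a hypothetical three-row frame (the real data is not shown in the question) with Date stored as strings:

```python
import pandas as pd

# Hypothetical sample data; the asker's actual frame is not shown.
df = pd.DataFrame({"Date": ["2020-05-01", "2020-05-02", "2020-05-04"], "x": [1, 2, 3]})

not_first = df["Date"] != "2020-05-01"
not_second = df["Date"] != "2020-05-04"

# OR: every date differs from at least one of the two values, so all rows pass.
or_mask = not_first | not_second
# AND: a row passes only if it differs from both values.
and_mask = not_first & not_second

print(or_mask.tolist())
print(and_mask.tolist())

kept = df[and_mask]
print(kept)
```

Only the & version removes the two unwanted dates.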
You can use isin with negation ~ sign:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
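The question does not say whether Date holds strings or datetime values. If it is a datetime64 column, one cautious option is to convert the dates to drop to the same type before calling isin. A sketch under that assumption, with made-up data:

```python
import pandas as pd

# Hypothetical data; here Date is parsed to datetime64 rather than kept as strings.
mydf_1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-04"]),
    "x": [1, 2, 3],
})

# Convert the dates to drop to the same dtype as the column before comparing.
dates = pd.to_datetime(["2020-05-01", "2020-05-04"])
my_df2 = mydf_1[~mydf_1["Date"].isin(dates)]
print(my_df2)
```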
kanmaytacker has already explained the mix-up between AND and OR.
Here are a few additional recommendations:
Indexing in pandas:
By label: .loc
By integer position: .iloc
Selecting by label also works without .loc, but chaining two [] selections runs two separate indexing operations instead of one, which can be slower (see here). Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
The indexer here is a boolean mask, i.e. a boolean Series with one value per row. This works the same way in numpy and, as a consequence, in pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
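The df used in this answer is not defined anywhere, so here is a self-contained sketch with a hypothetical frame chosen so that only row 2 survives. It also shows .loc filtering rows and picking columns in the same call:

```python
import pandas as pd

# Hypothetical frame; values in rows 0 and 1 are invented for illustration.
df = pd.DataFrame({"a": [4, 1, 0], "b": [2, 3, 4], "c": [5, 7, 6]})

mask = (df["a"] != 4) & (df["a"] != 1)   # boolean Series, one value per row

rows = df.loc[mask, :]                   # all columns of the matching rows
rows_and_cols = df.loc[mask, ["a", "c"]] # rows and columns in a single call

print(rows)
print(rows_and_cols)
```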
I have a Pandas dataframe (tempDF) of 5 columns by N rows. Each element of the dataframe is an object (string in this case). For example, the dataframe looks like (this is fake data - not real world):
I have two tuples, each contains a collection of numbers as a string type. For example:
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
I want to retrieve a subset of the rows from the tempDF in which the columns "medcode" OR "enttype" have at least one entry in the tuples above. Thus, from the example above, I would retrieve a subset containing rows with the index 8 and 9 and 11.
Until updating some packages earlier today (too many now to work out which has started throwing the warning), this did work:
tempDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
But now it is throwing the warning:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Looking at the API, isin states the condition "if ALL" is in the iterable collection. I want an "if ANY" condition.
UPDATE #1
The problem seems to lie with the | operator, and it also occurs with np.logical_or. If I remove the second isin condition, i.e. keep just tempDF[tempDF["medcode"].isin(codeset)], then no warning is thrown, but I'm only subsetting on one of the two conditions.
import numpy as np
tempDF = tempDF[np.logical_or(tempDF["medcode"].isin(codeset), tempDF["enttype"].isin(additionalClinicalCodes))]
I'm unable to reproduce your warning (I assume you are using an outdated numpy version), however I believe it is related to the fact that your enttype column is a numerical type, but you're using strings in additionalClinicalCodes.
Try this:
tempDF = tempDF[tempDF["medcode"].isin(list(codeset)) | tempDF["enttype"].isin(list(additionalClinicalCodes))]
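If the dtype mismatch suspected above is the cause, it should show up when checking the column types. The sketch below assumes, purely for illustration, that enttype is stored as integers; the real dtypes of tempDF are not shown in the question:

```python
import pandas as pd

# Hypothetical data: enttype is stored as integers here, not strings.
tempDF = pd.DataFrame({"medcode": ["6108", "6154", "95744"], "enttype": [99, 131, 372]})
additionalClinicalCodes = ("131", "1", "120", "130")

print(tempDF.dtypes)  # reveals whether a column is object (strings) or numeric

# Integers compared against strings: nothing matches.
int_vs_str = tempDF["enttype"].isin(additionalClinicalCodes)

# Casting the column to str first makes the comparison like-for-like.
str_vs_str = tempDF["enttype"].astype(str).isin(additionalClinicalCodes)

print(int_vs_str.tolist())
print(str_vs_str.tolist())
```

Casting the tuple to integers instead would work just as well; which side to convert depends on how the column should be stored.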
Boiling your question down to an executable example:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'], 'enttype': ['99', '131', '372', '372']})
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
newDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
print(newDF)
print("Pandas Version")
print(pd.__version__)
This returns for me
medcode enttype
0 6108 99
1 6154 131
3 98120 372
Pandas Version
1.4.2
Thus I am not able to reproduce your warning.
This looks like odd numpy behaviour. I think your approach is the right one, but if the warning bothers you, try this:
tempDF = tempDF[
    (
        tempDF.medcode.isin(codeset).astype(int) +
        tempDF.enttype.isin(additionalClinicalCodes).astype(int)
    ) >= 1
]
I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.
In [26]: df.ix[2]
Out[26]:
A 1.027680
B 1.514210
C -1.466963
D -0.162339
Name: 2000-01-03 00:00:00
In [27]: df[2:3]
Out[27]:
A B C D
2000-01-03 1.02768 1.51421 -1.466963 -0.162339
I would expect df[2] to work the same way as df[2:3] to be consistent with Python indexing convention. Is there a design reason for not supporting indexing row by single integer?
Echoing @HYRY, see the new docs in 0.11
http://pandas.pydata.org/pandas-docs/stable/indexing.html
Here we have new operators: .iloc to explicitly support only integer indexing, and .loc to explicitly support only label indexing.
e.g. imagine this scenario
In [1]: df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
In [2]: df
Out[2]:
A B
0 1.068932 -0.794307
2 -0.470056 1.192211
4 -0.284561 0.756029
6 1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
A B
4 -0.284561 0.756029
In [6]: df.loc[[2]]
Out[6]:
A B
2 -0.470056 1.192211
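The same contrast can be reproduced without random numbers. This sketch reuses the index from the session above (0, 2, 4, 6, 8) but fills the frame with deterministic values, which is an assumption made only so the result is easy to check:

```python
import numpy as np
import pandas as pd

# Deterministic values instead of random ones; same index as in the session above.
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=range(0, 10, 2), columns=list("AB"))

by_position = df.iloc[2]  # third row, whose label is 4
by_label = df.loc[2]      # row whose label is 2, i.e. the second row

print(by_position)
print(by_label)
```

Because the labels are integers that do not match the positions, the two accessors return different rows for the same key.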
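With a slice, [] selects rows only: integer slices such as df[2:3] are positional, while slices of labels (for example strings or dates) are label-based.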
The primary purpose of the DataFrame indexing operator, [] is to select columns.
When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.
So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.
The DataFrame indexing operator completely changes behavior to select rows when slice notation is used
Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.
df[2:3]
This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.
df[6:20:3]
You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.
I almost never use this slice notation with the indexing operator, as it's not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
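The behaviours described above can be checked side by side. This is a small sketch on a hypothetical frame shaped like the one in the question (date index, columns A to D), with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: date index, columns A-D.
df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=pd.date_range("2000-01-01", periods=5),
    columns=list("ABCD"),
)

try:
    df[2]                  # looks for a column named 2
    raised = False
except KeyError:
    raised = True

one_row = df[2:3]          # slice: selects rows by position
same_row = df.iloc[2:3]    # explicit equivalent
as_series = df.iloc[2]     # single row returned as a Series

print(raised)
print(one_row)
print(as_series)
```

The explicit .iloc forms say what is meant, which is the reason for preferring them.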
You can think of a DataFrame as a dict of Series. df[key] tries to select the column whose label is key and returns a Series object.
However, slicing inside [] slices the rows, because that is a very common operation.
You can read the documentation for details:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
For position-based access to the table, one can also convert the DataFrame to a NumPy array with df.to_numpy() (the older df.as_matrix() was removed in pandas 1.0):
np_df = df.to_numpy()
and then
np_df[i]
would work.
You can take a look at the source code.
DataFrame has a private method _slice() that slices the DataFrame, and its axis parameter determines which axis to slice. DataFrame.__getitem__() doesn't set the axis when invoking _slice(), so _slice() slices along the default axis, 0.
You can try a simple experiment that might help (note that _slice() is private and may change between versions):
print(df._slice(slice(0, 2)))
print(df._slice(slice(0, 2), 0))
print(df._slice(slice(0, 2), 1))
You can loop through the rows of the data frame like this:
for ad in range(len(dataframe_c)):
    print(dataframe_c.values[ad])
I would normally go for .loc/.iloc as suggested by Ted, but one may also select a row by transposing the DataFrame. If the index of df contains the label 2, df.T[2] gives you the row labeled 2 of df.
If you want to index multiple rows by their integer indexes, use a list of indexes:
idx = [2,3,1]
df.iloc[idx]
N.B. If idx is created using some rule, then you can also sort the dataframe by using .iloc (or .loc) because the output will be ordered by idx. So in a sense, iloc can act like a sorting function where idx is the sorting key.
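As an illustration of the note above, the sketch below first picks rows in a given order and then uses positions produced by a rule (here argsort on a column) to reorder the frame. The frame and the column name score are hypothetical:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"score": [30, 10, 20, 40]}, index=list("wxyz"))

idx = [2, 3, 1]
picked = df.iloc[idx]  # rows come back in the order given by idx

# Positions produced by a rule (argsort of a column) sort the frame.
order = df["score"].to_numpy().argsort()
sorted_df = df.iloc[order]

print(picked)
print(sorted_df)
```

For plain sorting, df.sort_values("score") is the more direct tool; the point here is only that .iloc preserves the order of the positions it is given.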
For some reason, the following 2 calls to iloc / loc produce different behavior:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=range(3), B=range(3)))
>>> df.iloc[:1]
A B
0 0 0
>>> df.loc[:1]
A B
0 0 0
1 1 1
I understand that loc considers the row labels, while iloc considers the integer-based indices of the rows. But why is the upper bound for the loc call considered inclusive, while the iloc bound is considered exclusive?
Quick answer:
It often makes more sense to do end-inclusive slicing when using labels, because it requires less knowledge about other rows in the DataFrame.
Whenever you care about labels instead of positions, end-exclusive label slicing introduces position-dependence in a way that can be inconvenient.
Longer answer:
Any function's behavior is a trade-off: you favor some use cases over others. Ultimately the behavior of .loc is a subjective design decision by the Pandas developers (as the comment by @ALlollz indicates, this behavior is intentional). But to understand why they might have designed it that way, think about what makes label slicing different from positional slicing.
Imagine we have two DataFrames df1 and df2:
df1 = pd.DataFrame(dict(X=range(4)), index=['a','b','c','d'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b','c','z'])
df1 contains:
X
a 0
b 1
c 2
d 3
df2 contains:
X
b 0
c 1
z 2
Let's say we have a label-based task to perform: we want to get rows between b and c from both df1 and df2, and we want to do it using the same code for both DataFrames. Because b and c don't have the same positions in both DataFrames, simple positional slicing won't do the trick. So we turn to label-based slicing.
If .loc were end-exclusive, to get rows between b and c we would need to know not only the label of our desired end row, but also the label of the next row after that. As constructed, this next label would be different in each DataFrame.
In this case, we would have two options:
Use separate code for each DataFrame: df1.loc['b':'d'] and df2.loc['b':'z']. This is inconvenient because it means we need to know extra information beyond just the rows that we want.
For either dataframe, get the positional index first, add 1, and then use positional slicing: df.iloc[df.index.get_loc('b'):df.index.get_loc('c')+1]. This is just wordy.
But since .loc is end-inclusive, we can just say .loc['b':'c']. Much simpler!
Whenever you care about labels instead of positions, and you're trying to write position-independent code, end-exclusive label slicing re-introduces position-dependence in a way that can be inconvenient.
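The argument can be checked directly on df1 and df2. In this sketch the two helper functions are hypothetical names introduced only to compare the label-based slice with its positional equivalent:

```python
import pandas as pd

df1 = pd.DataFrame(dict(X=range(4)), index=["a", "b", "c", "d"])
df2 = pd.DataFrame(dict(X=range(3)), index=["b", "c", "z"])

def rows_b_to_c(df):
    # End-inclusive label slice: no need to know which label follows "c".
    return df.loc["b":"c"]

def rows_b_to_c_positional(df):
    # Positional equivalent: look up both positions and add 1 to include "c".
    start = df.index.get_loc("b")
    stop = df.index.get_loc("c") + 1
    return df.iloc[start:stop]

for frame in (df1, df2):
    print(rows_b_to_c(frame))
    print(rows_b_to_c_positional(frame))
```

The same .loc expression works for both frames even though "b" and "c" sit at different positions in each.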
That said, maybe there are use cases where you really do want end-exclusive label-based slicing. If so, you can use @Willz's answer in this question:
df.loc[start:end].iloc[:-1]
I wanted to use a boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually put in the syntax "not" where I can. For instance, if x is not None:, but I couldn't really do it here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, used to work on a boolean Series as well, but ~ is the canonical choice; - was deprecated for numpy boolean arrays and now raises a TypeError there.)
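Applied to the original task of filtering rows, the options look like this. The sketch extends the small frame above with a hypothetical second column b to show a case where no ready-made opposite method exists:

```python
import numpy as np
import pandas as pd

# Column "b" is a hypothetical addition for the second example.
df = pd.DataFrame({"a": [1, 2, np.nan, 3], "b": ["w", "x", "y", "z"]})

not_null_rows = df.loc[~df["a"].isnull()]  # negate the mask with ~
same_rows = df.loc[df["a"].notnull()]      # dedicated inverse method

# ~ also covers cases with no built-in opposite, such as "not in a list".
not_x_or_y = df.loc[~df["b"].isin(["x", "y"])]

print(not_null_rows)
print(not_x_or_y)
```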
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. For example:
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also
pandas.notnull
boolean inverse of pandas.isnull
For a df table like below,
A B C D
0 0 1 1 1
1 2 3 5 7
3 3 1 2 8
why are the double brackets needed for selecting specific columns after boolean indexing?
the [['A','C']] part of
df[df['A'] < 3][['A','C']]
For pandas objects (Series, DataFrame), the indexing operator [] only accepts
colname or list of colnames to select column(s)
slicing or Boolean array to select row(s), i.e. it only refers to one dimension of the dataframe.
For df[[colname(s)]], the interior brackets are for list, and the outside brackets are indexing operator, i.e. you must use double brackets if you select two or more columns. With one column name, single pair of brackets returns a Series, while double brackets return a dataframe.
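Also, df.loc[df['A'] < 3,['A','C']] is better than the chained selection: it selects rows and columns in a single operation, which avoids the ambiguity over whether you get a copy or a view of the dataframe. (The older df.ix accessor did the same, but it was deprecated and then removed in pandas 1.0.)
Please refer to the pandas documentation for details.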
Because df[...]['A','C'] passes the tuple ('A','C') as a single key, and you have no column with that name, so a KeyError is raised. You have to pass a list of column names to sub-select from the df.
So
df[df['A'] < 3]['A','C']
raises
KeyError: ('A', 'C')
Which is different to
In [261]:
df[df['A'] < 3][['A','C']]
Out[261]:
A C
0 0 1
1 2 5
This is no different to trying:
df['A','C']
hence why you need double square brackets:
df[['A','C']]
Note that the modern way is to use .loc (.ix, which used to be recommended here, was removed in pandas 1.0):
In [264]:
df.loc[df['A'] < 3,['A','C']]
Out[264]:
A C
0 0 1
1 2 5
So that you're operating on a view rather than potentially a copy
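The points made in these answers can be verified on the table from the question. This sketch rebuilds that table (index 0, 1, 3) and compares the chained selection, the single .loc call, and the single versus double bracket forms:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 2, 3], "B": [1, 3, 1], "C": [1, 5, 2], "D": [1, 7, 8]},
    index=[0, 1, 3],
)

chained = df[df["A"] < 3][["A", "C"]]     # two separate indexing steps
single = df.loc[df["A"] < 3, ["A", "C"]]  # rows and columns in one step

one_col_series = df["A"]                  # single brackets: Series
one_col_frame = df[["A"]]                 # double brackets: DataFrame

try:
    df["A", "C"]                          # parsed as the tuple key ('A', 'C')
    raised = False
except KeyError:
    raised = True

print(single)
print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(raised)
```

Both selections return the same rows and columns here; the .loc form is preferable mainly when the result is going to be assigned to.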
Because inner brackets are just python syntax (literal) for list.
The outer brackets are the indexer operation of pandas dataframe object.
In this use case the inner ['A', 'C'] defines the list of columns to pass as a single argument to the indexer operation, which is denoted by the outer brackets.
Adding to the previous responses, you could also use the df.iloc accessor if you need to select by integer position. This can also make the code more reproducible, which is nice.