I've been struggling with an error for days, and after many conversations with ChatGPT, finally got it boiled down to this one minimal example:
import pandas as pd
# Create two data frames with duplicate values
goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})
# The first assertion passes
assert (goal_df[['user_id', 'sentence_id']].iloc[0] == source_df[['user_id', 'sentence_id']].iloc[0]).all()
# The second assertion fails
assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(source_df[['user_id', 'sentence_id']]).all()
Why does the second assertion fail?
When I print out intermediate values, it looks like even if I replaced the all with any, it would still fail. That is, the isin is saying that the user_id and sentence_id aren't in the source_df at all, despite the line just beforehand proving that they are.
I also thought that maybe it was an indexing issue, where the example didn't match the index as isin requires. However, even if you make source_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]}), the same behavior occurs (the first assert passes, the second fails).
What's going on here?
As Laurent pointed out, isin() is not the right tool here. Instead, you can extend the approach from your first assertion to the full source_df dataframe by using NumPy broadcasting.
Assuming goal_df has just one row and source_df has any number of rows, while both dataframes have the same number of columns, the following assertion checks that the row from goal_df is present as a row somewhere in source_df.
assert (goal_df.values == source_df.values).all(axis=1).any()
This works because the values attribute gives the values from within the dataframes as NumPy arrays. When these arrays are compared with ==, the first array is broadcast to match the dimensions of the second one, meaning its one row is repeated as many times as there are rows in the second dataframe. Then the values from both arrays are compared position by position, resulting in a Boolean array with the same shape as the second dataframe.
With all(axis=1) we collapse each row of this array to a single Boolean value, indicating if all the values in that row are True, i.e. if the corresponding row of the second dataframe completely matches the first dataframe. Finally, any() lets the assertion pass if such a match was found for any of the rows.
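To see the broadcasting in action on the frames from the question, you can print the intermediate results (outputs shown as comments):
comparison = goal_df.values == source_df.values
print(comparison)
# [[ True  True]
#  [ True  True]]
print(comparison.all(axis=1))
# [ True  True]
print(comparison.all(axis=1).any())
# True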
Let's consider the second assertion and check its different parts:
type(goal_df[['user_id', 'sentence_id']].iloc[0])
<class 'pandas.core.series.Series'>
type(source_df[['user_id', 'sentence_id']])
<class 'pandas.core.frame.DataFrame'>
Then in the second expression you are trying to check whether the Series' elements are in a dataframe.
The problem is that the Series isin method is not designed for that.
As the documentation states, the isin method for Series only takes a set or list-like as its values argument:
values : set or list-like
The sequence of values to test.
Passing in a single string will raise a TypeError. Instead, turn a single string
into a list of one element.
So when you pass a DataFrame as values, it is treated as list-like, and iterating over a DataFrame yields its column labels, not its rows or cells. The row's values 1 and 2 are therefore tested against ['user_id', 'sentence_id'], nothing matches, the whole expression evaluates to False, and an AssertionError is raised.
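You can check this directly (a quick sketch using the question's frames):
row = goal_df[['user_id', 'sentence_id']].iloc[0]
print(list(source_df))
# ['user_id', 'sentence_id'] -- iterating a DataFrame gives column labels
print(row.isin(source_df))
# user_id        False
# sentence_id    False
# dtype: bool
print(row.isin(source_df.columns).all())
# False -- same outcome: the values 1 and 2 are not column labels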
Proposed script
# Proposed second assertion
src = source_df[['user_id', 'sentence_id']].to_numpy().flatten()  # flatten the dataframe's values into a 1-D NumPy array
assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(src).all()
Checking row by row
I just read your comment; try this (Arne's new one is excellent):
assert (goal_df.isin(source_df).sum(axis=1) == len(source_df.columns)).all()
As Laurent explained above, the isin() method is applied to a Series (iloc[int] returns a Series) and checks, for each value in that Series, whether it is in the iterable you provide.
If I understand your question right, you are trying to check if a row from one DF exists in another DF, correct? One way of solving that is to do an inner merge and check the length of the result:
assert len(pd.merge(goal_df, source_df, how='inner')) > 0
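For example, with the frames from the question (the inner merge joins on all shared column names; duplicates in source_df can make the result longer than one row, which is why the check is > 0 rather than == 1):
merged = pd.merge(goal_df, source_df, how='inner')
print(merged)
#    user_id  sentence_id
# 0        1            2
# 1        1            2
assert len(merged) > 0  # passes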
Apologies if this sounds quite basic but I'm trying to understand the deeper mechanics of subsetting syntax:
I understand that with non-.loc subsetting, you can select columns, rows by index number, and cross-select columns and rows-by-index-number.
But by what mechanism do you subset a series of booleans from a dataframe, using non-.loc syntax?
e.g.,
Working with this practice df:
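(a hypothetical stand-in, consistent with the examples below, where 42 sits in the age column at rows 0 and 2):
import pandas as pd
test = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                     'age': [42, 17, 42, 30, 25]})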
you could write
test['age']==42
and get a series of booleans indicating where 42 appeared in the age column.
But when you write that same boolean filter as a subset of the same df
test[test['age']==42]
you get all the columns of the df, and full rows for any row that had 42 in the age column.
I'm wondering, more granularly, by what mechanism you subset a series of booleans from a df in this non-.loc context. Put differently, is this considered a row or column selection, or is it an entirely different mechanism that simply allows inputting a list/series/df of booleans?
It seems like you're selecting whether to show each row depending on its True/False value. And indeed, you could write
test[[True, False, True, False, False]]
to get the same result. But you'd get an error making the same direct row selection with a list of integers, as via
test[[0,1,2,3,4]]
At bottom I'm trying to get a better understanding of the mechanism for such boolean-filtering, and how it might relate to non-.loc row/column selection.
Your question asks:
by what mechanism do you subset a series of booleans from a dataframe, using non-.loc syntax?
The pandas docs on Boolean indexing state:
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame)
You also write:
But you'd get an error making the same direct row selection with a list of integers, as via test[[0,1,2,3,4]]
That error arises because [] access with a list of anything other than booleans expects column labels, not row index labels. This is made explicit in the Basics subsection of the Indexing and selecting data section of the pandas docs:
You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised.
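For example, with the hypothetical test df above (a sketch):
test[[True, False, True, False, False]]   # boolean list: selects rows 0 and 2
test[['name', 'age']]                     # list of labels: selects columns
test[[0, 1, 2, 3, 4]]                     # KeyError: 0, 1, ... are not column labels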
I search a pandas DataFrame by loc, for example like this:
x = df.loc[df.index.isin(['one','two'])]
But I need only the first row of the result. If I use
x = df.loc[df.index.isin(['one','two'])].iloc[0]
I get an error in the case that no row is found. Of course, I can select all the rows (the first example) and then check whether the result is empty. But I'm looking for a more efficient way (the dataframe can be long). Is there one?
pandas.Index.duplicated
The pandas.Index object has a duplicated method that identifies all repeated values after the first occurrence.
x[~x.index.duplicated()]
If you wanted to combine that with your isin filter:
df[df.index.isin(['one', 'two']) & ~df.index.duplicated()]
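A small sketch of how this behaves, assuming a string-indexed df with a repeated label:
import pandas as pd
df = pd.DataFrame({'val': [10, 20, 30, 40]},
                  index=['one', 'one', 'two', 'three'])
print(df.index.duplicated())
# [False  True False False]
print(df[df.index.isin(['one', 'two']) & ~df.index.duplicated()])
#      val
# one   10
# two   30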
What is the difference between these two methods to delete a row if the string 'something' is found in the column 'search'?
First method:
mydata = mydata.set_index("search")
mydata = mydata.drop("something", axis=0)
This method seems pretty straightforward and is understandable.
Second method:
mydata = mydata[~mydata.select_dtypes(['object']).eq('something').any(1)]
I don't really know how this method works. Where in this line is it specified to drop/delete the row? And why does it work with 'object' instead of 'search'? What does the "~" stand for? I just can't find it in the documentation.
Your two methods are not identical. Let's look at the second method in parts.
Step 1: Subset dataframe via select_dtypes
mydata.select_dtypes(['object']) filters your dataframe for only series with object dtype. You can extract the dtype of each series via mydata.dtypes. Typically, non-numeric series will have object dtype, which indicates a sequence of pointers, similar to list.
In this case, your two methods only align when series search is the only object dtype series.
Step 2: Test for equality via eq
Since Step 1 returns a dataframe, even if it only contains one series, pd.DataFrame.eq will return a dataframe of Boolean values.
Step 3: Test for any True value row-wise via any
Next, your second method checks whether any value is True row-wise (axis=1), collapsing the Boolean dataframe into a single Boolean per row. Again, if your only object series is search, this equates to the same as your first method.
Step 4: Invert and filter via ~
Finally, ~ is the element-wise NOT operator: it flips each Boolean, so rows where 'something' was found become False. Passing this inverted mask to mydata[...] keeps only the rows marked True, which is where the row "dropping" actually happens.
If you have multiple object series, then your two methods may not align, as a row may be excluded due to another series being equal to 'something'.
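A minimal sketch of that divergence, with a hypothetical second object column:
import pandas as pd
mydata = pd.DataFrame({'search': ['something', 'other'],
                       'note':   ['x', 'something']})
# Method 1 drops only the row whose *search* value is 'something'
print(mydata.set_index('search').drop('something', axis=0))
#              note
# search
# other   something
# Method 2 drops any row where *any* object column equals 'something'
print(mydata[~mydata.select_dtypes(['object']).eq('something').any(axis=1)])
# Empty DataFrame
# Columns: [search, note]
# Index: []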
I am trying to filter specific rows with python-pandas:
df = pd.read_csv('file.csv', delimiter=',', header=None,engine='python', usecols=range(0, 7), error_bad_lines=False)
df = df.drop(df.index[9:86579])
df = df[df[[0,1]].apply(lambda r: r.str.contains('TestString1', case=False).any(), axis=1)]
df.to_csv("yourcsv.csv", index=False, header=None)
Now how can I set a starting row? My rows 0-10 contain information, and I want to start searching by keyword from row 11. How can I do that?
Try this:
df.iloc[11:].to_csv("yourcsv.csv", index=False, header=None)
If you don't want to drop rows and only want to "see" your dataframe from a certain row onward, you can use the .iloc indexer:
df["column name"].iloc[11:].apply(function)
In this example you take rows from the 11th to the last and apply your function to them.
DataFrame.iloc
Purely integer-location based indexing for selection by position.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
I am not sure what you mean by "my rows 0-10 contain information and I want to start searching by keyword from row 11".
If you mean that you need the first 10 rows to be used as a condition for making your filter work afterwards, then you can iterate by row and use np.where.
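A minimal sketch of that idea (column names and keyword are placeholders mirroring the question; the question's case would use start = 11):
import numpy as np
import pandas as pd
df = pd.DataFrame({0: ['info', 'TestString1 here', 'nothing'],
                   1: ['info', 'nothing', 'teststring1 too']})
# keyword mask over columns 0 and 1, as in the question
keyword_mask = df[[0, 1]].apply(
    lambda r: r.str.contains('TestString1', case=False).any(), axis=1)
# only let the mask take effect from a given row position onward
start = 1
mask = np.where(np.arange(len(df)) >= start, keyword_mask, False)
print(df[mask])  # rows 1 and 2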
If this is not the case, then I believe the other two answers (John, Rafael) already solved your problem, so you can vote them up.
I have a small problem: I have a column in my DataFrame, which has multiple rows, and in each row it holds either 1 or more values starting with 'M' letter followed by 3 digits. If there is more than 1 value, they are separated by a comma.
I would like to print out a view of the DataFrame, only featuring rows where that one column holds values I specify (e.g. I want them to hold any item from the list ['M111', 'M222']).
I have started to build my boolean mask in the following way:
df[df['Column'].apply(lambda x: x.split(', ').isin(['M111', 'M222']))]
In my mind, .apply() with .split() first converts the 'Column' values in each row to lists of one or more items, and then the .isin() method confirms whether any of the items in each row's list are in the list of specified values ['M111', 'M222'].
In practice, however, instead of getting the desired view of the DataFrame, I get an error:
TypeError: unhashable type: 'list'
What am I doing wrong?
I think you need:
df2 = df[df['Column'].str.contains('|'.join(['M111', 'M222']))]
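One caveat: str.contains matches substrings, so a value like 'M2223' would also match the pattern 'M222'. If exact tokens are needed, word boundaries can be added, e.g.:
df2 = df[df['Column'].str.contains(r'\b(?:M111|M222)\b')]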
You can only access the isin() method on a pandas object, but split() returns a plain Python list. Wrapping the result of split() in a Series will work:
# sample data
data = {'Column':['M111, M000','M333, M444']}
df = pd.DataFrame(data)
print(df)
Column
0 M111, M000
1 M333, M444
Now wrap split() in a Series.
Note that isin() will return a Series of boolean values, one for each element coming out of split(). You want to know whether any of the items in each row's list are in the list of specified values, so add any() to your apply function.
df[df['Column'].apply(lambda x: pd.Series(x.split(', ')).isin(['M111', 'M222']).any())]
Output:
Column
0 M111, M000
As others have pointed out, there are simpler ways to go about achieving your end goal. But this is how to resolve the specific issue you're encountering with isin().
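For instance, one such alternative (a sketch) is to use plain Python sets, which avoids building a throwaway Series per row:
wanted = {'M111', 'M222'}
df[df['Column'].apply(lambda x: not wanted.isdisjoint(x.split(', ')))]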