Python Pandas Filter

I am trying to filter specific rows with python-pandas:
df = pd.read_csv('file.csv', delimiter=',', header=None,engine='python', usecols=range(0, 7), error_bad_lines=False)
df = df.drop(df.index[9:86579])
df = df[df[[0,1]].apply(lambda r: r.str.contains('TestString1', case=False).any(), axis=1)]
df.to_csv("yourcsv.csv", index=False, header=None)#
Now how can I set a starting row? Rows 0-10 contain other information, and I want to start searching by keyword from row 11. But how?

Try this:
df.iloc[11:].to_csv("yourcsv.csv", index=False, header=None)

If you don't want to drop rows and only want to "see" your dataframe from a certain row onward, you can use the .iloc indexer:
df["column name"].iloc[11:].apply(function)
This way you get everything from row 11 to the last one and apply your function to it.
DataFrame.iloc
Purely integer-location based indexing for selection by position.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
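Alternatively, you can skip the informational rows at read time instead of filtering afterwards. A minimal sketch reusing the read_csv call from the question (skiprows=11 drops the first 11 lines, i.e. rows 0-10; note that the deprecated error_bad_lines=False would be on_bad_lines='skip' in pandas >= 1.3):
import pandas as pd

df = pd.read_csv('file.csv', delimiter=',', header=None, engine='python',
                 usecols=range(0, 7), skiprows=11)  # row 11 becomes the first row read
df = df[df[[0, 1]].apply(lambda r: r.str.contains('TestString1', case=False).any(), axis=1)]
df.to_csv("yourcsv.csv", index=False, header=None)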

I am not sure what you mean by "rows 0-10 contain other information and I want to start searching by keyword from row 11".
If you mean that you need the first 10 rows as a condition for making your filter work afterwards, then you can iterate by row and use np.where; see the sketch below.
If this is not the case, then I believe the other two answers (John, Rafael) already solve your problem, so you can vote them up.
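For reference, a rough sketch of that np.where idea, under the assumption that the keyword should only be searched from row 11 onward (column 0 and the keyword are taken from the question):
import numpy as np

contains_kw = df[0].str.contains('TestString1', case=False, na=False)
mask = np.where(np.arange(len(df)) >= 11, contains_kw, False)  # rows before 11 are never matched
df_filtered = df[mask]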

Related

`isin` fails to detect a row that is in a dataframe

I've been struggling with an error for days, and after many conversations with ChatGPT, finally got it boiled down to this one minimal example:
import pandas as pd
# Create two data frames with duplicate values
goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})
# The first assertion passes
assert (goal_df[['user_id', 'sentence_id']].iloc[0] == source_df[['user_id', 'sentence_id']].iloc[0]).all()
# The second assertion fails
assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(source_df[['user_id', 'sentence_id']]).all()
Why does the second assertion fail?
When I print out intermediate values, it looks like even if I replaced the all with any, it would still fail. That is, the isin is saying that the user_id and sentence_id aren't in the source_df at all, despite the line just beforehand proving that they are.
I also thought it might be an indexing issue, where the example didn't match the index as isin requires; however, even if you make source_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]}), the same behavior occurs (the first assert passes, the second fails).
What's going on here?
As Laurent pointed out, isin() is not the right tool here. Instead, you can extend the approach from your first assertion to the full source_df dataframe by using NumPy broadcasting.
Assuming goal_df has just one row and source_df has any number of rows, while both dataframes have the same number of columns, the following assertion checks that the row from goal_df is present as a row somewhere in source_df.
assert (goal_df.values == source_df.values).all(axis=1).any()
This works because the values attribute gives the values from within the dataframes as NumPy arrays. When these arrays are compared with ==, the first array is broadcast to match the dimensions of the second one, meaning its one row is repeated as many times as is the number of rows in the second dataframe. Then for each position the values from both arrays are compared, resulting in a Boolean array with the same shape as the second dataframe.
With all(axis=1) we collapse each row of this array to a single Boolean value, indicating if all the values in that row are True, i.e. if the corresponding row of the second dataframe completely matches the first dataframe. Finally, any() lets the assertion pass if such a match was found for any of the rows.
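A quick way to see the intermediate shapes, using the two frames from the question:
matches = goal_df.values == source_df.values  # (1, 2) broadcast against (2, 2) -> (2, 2) booleans
print(matches)                  # [[ True  True] [ True  True]]
row_hits = matches.all(axis=1)  # one boolean per source_df row
assert row_hits.any()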
Let's consider the second assertion and check its different parts:
type(goal_df[['user_id', 'sentence_id']].iloc[0])
<class 'pandas.core.series.Series'>
type(source_df[['user_id', 'sentence_id']])
<class 'pandas.core.frame.DataFrame'>
So in the second expression you are trying to check whether the Series elements are in a dataframe.
The problem is that the Series isin method is not designed for that.
As the documentation states, the isin method for a Series only takes a set or list-like as its values argument:
values : set or list-like
The sequence of values to test.
Passing in a single string will raise a TypeError. Instead, turn a single string
into a list of one element.
So the assertion evaluates to False, and an AssertionError is raised.
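You can see what isin actually receives here: iterating a DataFrame yields its column labels, so the values 1 and 2 are tested against the column names and never match:
print(list(source_df[['user_id', 'sentence_id']]))  # ['user_id', 'sentence_id']
print(goal_df.iloc[0].isin(source_df))              # both entries come back False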
Proposed script
# Proposed second Assertion
src = source_df[['user_id', 'sentence_id']].to_numpy().flatten()  # flatten the dataframe values into a 1-D NumPy array
assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(src).all()
Checking row by row
I just read your comment; try this (Arne's new one is excellent):
assert (goal_df.isin(source_df).sum(axis=1) == len(source_df.columns)).all()
As Laurent explained above, the isin() method is applied to a Series (iloc[int] returns a Series) and checks, for each value in that Series, whether it is in the iterable you provide.
If I understand your question correctly, you are trying to check whether a row from one DF exists in another DF, correct? One way of solving that is to do a merge and count the length of the result:
assert len(pd.merge(goal_df, source_df, how='inner')) > 0
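With the frames from the question, the inner merge joins on the shared columns and keeps one row per match, so the assertion passes:
merged = pd.merge(goal_df, source_df, how='inner')
print(len(merged))  # 2 -- the goal row matches both (duplicate) source rows
assert len(merged) > 0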

In what cases are loc and iloc a better approach than just using __getitem__ with pandas dataframes? [duplicate]

I've noticed three methods of selecting a column in a Pandas DataFrame:
First method of selecting a column using loc:
df_new = df.loc[:, 'col1']
Second method - seems simpler and faster:
df_new = df['col1']
Third method - most convenient:
df_new = df.col1
Is there a difference between these three methods? I don't think so, in which case I'd rather use the third method.
I'm mostly curious as to why there appear to be three methods for doing the same thing.
In the following situations, they behave the same:
Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, if you slice rows with loc, instead of iloc, you'll get rows 1, 2 and 3 assuming you have a RangeIndex. See details here.)
However, [] does not work in the following situations:
You can select a single row with df.loc[row_label]
You can select a list of rows with df.loc[[row_label1, row_label2]]
You can slice columns with df.loc[:, 'A':'C']
These three cannot be done with [].
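A small demonstration of these three loc-only operations, on a hypothetical frame with string row labels:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]}, index=['x', 'y'])

df.loc['x']           # single row by label
df.loc[['x', 'y']]    # list of rows
df.loc[:, 'A':'C']    # slice of columns by label
# df['x'] would raise a KeyError: plain [] looks for a COLUMN named 'x'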
More importantly, if your selection involves both rows and columns, then assignment becomes problematic.
df[1:3]['A'] = 5
This selects rows 1 and 2 then selects column 'A' of the returning object and assigns value 5 to it. The problem is, the returning object might be a copy so this may not change the actual DataFrame. This raises SettingWithCopyWarning. The correct way of making this assignment is:
df.loc[1:3, 'A'] = 5
With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.
Note: Getting columns with [] vs . is a completely different topic. . is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot be composed of numbers...). It cannot be used when the names conflict with Series/DataFrame methods. It also cannot be used for non-existing columns (i.e. the assignment df.a = 1 won't work if there is no column a). Other than that, . and [] are the same.
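For example, attribute assignment to a non-existing column silently creates a plain attribute rather than a column (a small sketch):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.b = 3            # sets a Python attribute, NOT a column (recent pandas warns about this)
df['b'] = 3         # actually creates the column
print(df.columns)   # Index(['a', 'b'], dtype='object')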
loc is especially useful when the index is not numeric (e.g. a DatetimeIndex), because you can get rows with particular labels from the index:
df.loc['2010-05-04 07:00:00']
df.loc['2010-1-1 0:00:00':'2010-12-31 23:59:59', 'Price']
However [] is intended to get columns with particular names:
df['Price']
With [] you can also filter rows, but it is more elaborate:
df[df['Date'] < datetime.datetime(2010,1,1,7,0,0)]['Price']
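To make the DatetimeIndex snippets above concrete, here is a minimal, self-contained sketch with an invented two-row price table:
import pandas as pd

idx = pd.to_datetime(['2010-05-04 07:00:00', '2010-06-01 12:00:00'])
df = pd.DataFrame({'Price': [10.0, 12.5]}, index=idx)

df.loc['2010-05-04 07:00:00']               # one row, selected by timestamp label
df.loc['2010-01-01':'2010-12-31', 'Price']  # label-based slice plus a column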
If you're confused about which of these approaches is (at least) the recommended one for your use case, take a look at these brief instructions from the pandas tutorial:
When selecting subsets of data, square brackets [] are used. Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names.
Select specific rows and/or columns using iloc when using the positions in the table.
You can assign new values to a selection based on loc/iloc.
I highlighted some of the points to make their use-case differences even more clear.
There seems to be a difference between df.loc[] and df[] when you create a dataframe with multiple columns.
You can refer to this question:
Is there a nice way to generate multiple columns using .loc?
Here, you can't generate multiple new columns using df.loc[:, ['name1','name2']], but you can by just using double brackets, df[['name1','name2']]. (I wonder why they behave differently.)
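A quick illustration of that asymmetry, assuming the columns don't exist yet:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df[['name1', 'name2']] = pd.DataFrame([[0, 0], [0, 0]])  # creates both new columns
# df.loc[:, ['name3', 'name4']] = 0  # raises KeyError: the labels must already exist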

How to limit Pandas loc selection

I search a Pandas DataFrame with loc, for example like this:
x = df.loc[df.index.isin(['one','two'])]
But I need only the first row of the result. If I use
x = df.loc[df.index.isin(['one','two'])].iloc[0]
I get an error when no row is found. Of course, I can select all the rows (the first example) and then check whether the result is empty or not. But I'm looking for some more efficient way (the dataframe can be long). Is there any?
pandas.Index.duplicated
The pandas.Index object has a duplicated method that identifies all repeated values after the first occurrence.
x[~x.index.duplicated()]
If you wanted to combine this with your isin filter:
df[df.index.isin(['one', 'two']) & ~df.index.duplicated()]
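If you only ever need the first matching row, a variant that degrades gracefully is head(1); it returns an empty frame instead of raising when nothing matches (not necessarily faster, but safe):
x = df.loc[df.index.isin(['one', 'two'])].head(1)
if not x.empty:
    row = x.iloc[0]  # safe: we know at least one row exists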

possible index by column number (not label) without iloc?

Can we index both the row and column in pandas without using .iloc? The documentation says:
With DataFrame, slicing inside of [] slices the rows.
But when I want to include both row and column in the same fashion, it is not working.
import numpy as np
import pandas

data = pandas.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
data[0:2]            # only rows
data.iloc[0:2, 0:3]  # works
data[0:2, 0:3]       # raises an error in pandas, although the same syntax works in R
I agree that using iloc is probably the clearest solution, but indexing by row and column number simultaneously can be done with two separate indexing operations. Unless you use iloc, pandas doesn't know whether you are looking for columns at positions 0-3 or columns named 0, 1, 2, and 3:
data[0:2][data.columns[0:3]]
This is fairly clear, though, in showing exactly what you are selecting. Otherwise, you'll have to drop down to array indexing to get your subset:
data.values[0:2,0:3]
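A quick sanity check that the two-step route selects the same subset as iloc (reusing data from the question):
a = data.iloc[0:2, 0:3]
b = data[0:2][data.columns[0:3]]
assert a.equals(b)  # same rows and columns, reached two different ways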
