I am working on a large dataset and there are a few duplicates in my index. I'd like to (perhaps visually) check what these duplicated rows look like and then decide which one to drop. Is there a way to select the slice of the dataframe that has duplicated indices (or duplicates in any column)?
Any help is appreciated.
You can use DataFrame.duplicated and then slice the frame with the resulting boolean mask. For more information on any method or advanced feature, I would advise you to always check its docstring.
Well, this would solve the case for you:
df[df.duplicated('Column Name', keep=False)]
Here,
keep=False will return all those rows having duplicate values in that column.
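For example, a minimal self-contained sketch of this (the column name 'A' and the values are made up for illustration):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 3], 'B': list('abcdef')})
# keep=False marks every occurrence of a duplicated value, not just the later repeats
print(df[df.duplicated('A', keep=False)])  # the two rows with A == 1 and the three rows with A == 3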
Use the duplicated method of DataFrame:
df.duplicated(subset=[...])
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
EDIT
You can use:
df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]
or, you can use groupby and filter:
df.groupby([...]).filter(lambda g: g.shape[0] > 1)
or apply:
df.groupby([...], group_keys=False).apply(lambda g: g if g.shape[0] > 1 else None)
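As a quick runnable illustration of the groupby/filter variant (the column name 'key' and the data below are assumptions):
import pandas as pd
df = pd.DataFrame({'key': ['x', 'x', 'y', 'z'], 'val': [1, 2, 3, 4]})
# keep only the groups that occur more than once, i.e. all duplicated rows
print(df.groupby('key').filter(lambda g: g.shape[0] > 1))  # the two 'x' rows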
I have a problem: I want to drop from my DataFrame all rows where a given column's value ends with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the values concerned, but how do I apply it to my DataFrame and drop those rows?
I tried a few things but nothing works.
Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
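A short self-contained sketch of this approach (the values are made up; only the column name 'XX' comes from the question). If the column holds numbers rather than strings, convert first with .astype(str):
import pandas as pd
df = pd.DataFrame({'XX': ['AB99', 'AB12', 'CD99', 'CD34']})
# ~ negates the boolean mask, so rows ending in '99' are dropped
df = df[~df['XX'].str.endswith('99')]
print(df)  # keeps 'AB12' and 'CD34'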
I have a specific problem with pandas: I need to select rows in a dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the values START with the letters 'pl'.
Is there any solution to select rows based only on their first two characters?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series, it returns a True/False result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter; you then apply it to the dataframe. As the filter you can use the method Series.str.startswith and do:
df_pl = df[df['Code'].str.startswith('pl')]
I have a large pandas dataframe (around a million rows) and a list of ids (the list holds 100,000 entries). For each id in the dataframe I have to check whether that id is in my list (called special) and flag it accordingly:
df['Segment'] = df['ID'].apply(lambda x: 1 if x in special else np.nan)
The problem is that this is extremely slow: for a million ids, the lambda expression checks membership in a list of 100,000 entries for each row. Is there a faster way to accomplish this?
I recommend you see When should I ever want to use apply
Use Series.isin with Series.astype:
df['Segment'] = df['ID'].isin(special).astype(int)
We can also use Series.view:
df['Segment'] = df['ID'].isin(special).view('uint8')
or numpy.where
df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
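A small self-contained sketch of the isin approach (the data are made up; only the column name 'ID' and the list name special come from the question). Note that, unlike the apply version, this puts 0 rather than NaN for ids that are not in the list:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5]})
special = [2, 5]
# vectorized membership test instead of a per-row Python-level check
df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
print(df)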
This is really trivial, but I can't believe I have wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" column's category is "a". I know that df.loc[df.cats == "a"] will work, but that is based on equality of elements. Is there a way to select based on the levels of the category?
This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats == df.cats.cat.categories[0]]
For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using the pd.qcut() method.
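A hedged sketch of that situation, using pd.cut with explicit bin edges so the resulting intervals are known in advance (the data and edges are made up):
import pandas as pd
s = pd.Series([45, 47, 50, 53, 60])
df = pd.DataFrame({'col': pd.cut(s, bins=[40, 46, 53, 60]), 'val': s})
# select rows whose category is the interval (46, 53]; the Interval must match one of the categories exactly
print(df[df['col'] == pd.Interval(46, 53, closed='right')])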
You can query the categorical list using df.cats.cat.categories which prints output as
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]
Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
df[df.cats == df.cats.cat.categories[0]]
I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups a Series by its index. How can I make it group by value instead?
grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])
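Applied to the original use case, a minimal sketch (assuming s is a datetime64 Series, so s.dt.date gives the date part):
import pandas as pd
s = pd.Series(pd.to_datetime(['2021-01-01 08:00', '2021-01-01 17:30', '2021-01-02 09:15']))
# group the values by a derived key (the date) rather than by the index
print(s.groupby(s.dt.date).size())  # 2 rows for 2021-01-01, 1 for 2021-01-02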
Three methods:
DataFrame: df.groupby(['column']).size()
Series: sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame({'column': sel}).groupby(['column']).size()
For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]
You should convert it to a DataFrame and then add a column holding the date(). You can then do a groupby on the DataFrame using that date column.
df = pd.DataFrame({"datetime": s})
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.
To add another suggestion, I often use the following as it uses simple logic:
pd.Series(index=s.values).groupby(level=0)