I am working on a large dataset and there are a few duplicates in my index. I'd like to (perhaps visually) check what these duplicated rows look like and then decide which one to drop. Is there a way to select the slice of the dataframe that has duplicated indices (or duplicates in any column)?
Any help is appreciated.
You can use DataFrame.duplicated and then slice the frame with the resulting boolean mask. For more information on any method or advanced feature, I would advise you to always check its docstring.
Well, this would solve the case for you:
df[df.duplicated('Column Name', keep=False)]
Here,
keep=False will return all those rows having duplicate values in that column.
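For example, a minimal self-contained sketch of this (the column name 'A' and the values are made up for illustration):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 3], 'B': list('abcdef')})
# keep=False marks every occurrence of a duplicated value, not just the later repeats
print(df[df.duplicated('A', keep=False)])  # the two rows with A == 1 and the three rows with A == 3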
Use the duplicated method of DataFrame:
df.duplicated(subset=[...])
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
EDIT
You can use:
df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]
or, you can use groupby and filter:
df.groupby([...]).filter(lambda g: g.shape[0] > 1)
or apply:
df.groupby([...], group_keys=False).apply(lambda g: g if g.shape[0] > 1 else None)
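As a quick runnable illustration of the groupby/filter variant (the column name 'key' and the data below are assumptions):
import pandas as pd
df = pd.DataFrame({'key': ['x', 'x', 'y', 'z'], 'val': [1, 2, 3, 4]})
# keep only the groups that occur more than once, i.e. all duplicated rows
print(df.groupby('key').filter(lambda g: g.shape[0] > 1))  # the two 'x' rows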
I have a problem: I want to drop from my DataFrame all rows where a given column's value ends with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the values concerned, but how do I apply it to my DataFrame and drop those rows?
I tried a few things but nothing works.
Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
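A short self-contained sketch of this approach (the values are made up; only the column name 'XX' comes from the question). If the column holds numbers rather than strings, convert first with .astype(str):
import pandas as pd
df = pd.DataFrame({'XX': ['AB99', 'AB12', 'CD99', 'CD34']})
# ~ negates the boolean mask, so rows ending in '99' are dropped
df = df[~df['XX'].str.endswith('99')]
print(df)  # keeps 'AB12' and 'CD34'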
I have a specific problem with pandas: I need to select rows in a dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the values START with the letters 'pl'.
Is there any solution to select rows based only on their first two characters?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series, it returns a True/False result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter; you then apply it to the dataframe. As the filter you can use the method Series.str.startswith and do:
df_pl = df[df['Code'].str.startswith('pl')]
I have a large pandas dataframe (around a million rows) and a list of ids (the list holds 100,000 entries). For each id in the dataframe I have to check whether that id is in my list (called special) and flag it accordingly:
df['Segment'] = df['ID'].apply(lambda x: 1 if x in special else np.nan)
The problem is that this is extremely slow: for a million ids, the lambda expression checks membership in a list of 100,000 entries for each row. Is there a faster way to accomplish this?
I recommend you see When should I ever want to use apply
Use Series.isin with Series.astype:
df['Segment'] = df['ID'].isin(special).astype(int)
We can also use Series.view:
df['Segment'] = df['ID'].isin(special).view('uint8')
or numpy.where
df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
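A small self-contained sketch of the isin approach (the data are made up; only the column name 'ID' and the list name special come from the question). Note that, unlike the apply version, this puts 0 rather than NaN for ids that are not in the list:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5]})
special = [2, 5]
# vectorized membership test instead of a per-row Python-level check
df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
print(df)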
This is really trivial, but I can't believe I have wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" column's category is "a". I know that df.loc[df.cats == "a"] will work, but that is based on equality of elements. Is there a way to select based on the levels of the category?
This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats == df.cats.cat.categories[0]]
For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using the pd.qcut() method.
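A hedged sketch of that situation, using pd.cut with explicit bin edges so the resulting intervals are known in advance (the data and edges are made up):
import pandas as pd
s = pd.Series([45, 47, 50, 53, 60])
df = pd.DataFrame({'col': pd.cut(s, bins=[40, 46, 53, 60]), 'val': s})
# select rows whose category is the interval (46, 53]; the Interval must match one of the categories exactly
print(df[df['col'] == pd.Interval(46, 53, closed='right')])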
You can query the categorical list using df.cats.cat.categories which prints output as
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]
Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
df[df.cats == df.cats.cat.categories[0]]
I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups a Series by its index. How can I make it group by value instead?
grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])
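Applied to the original use case, a minimal sketch (assuming s is a datetime64 Series, so s.dt.date gives the date part):
import pandas as pd
s = pd.Series(pd.to_datetime(['2021-01-01 08:00', '2021-01-01 17:30', '2021-01-02 09:15']))
# group the values by a derived key (the date) rather than by the index
print(s.groupby(s.dt.date).size())  # 2 rows for 2021-01-01, 1 for 2021-01-02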
Three methods:
DataFrame: df.groupby(['column']).size()
Series: sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame({'column': sel}).groupby(['column']).size()
For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]
You should convert it to a DataFrame and then add a column holding the date(). You can then do a groupby on the DataFrame using that date column.
df = pd.DataFrame({"datetime": s})
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.
To add another suggestion, I often use the following as it uses simple logic:
pd.Series(index=s.values).groupby(level=0)