Create a new dataframe with only duplicated rows - python

I would like to have a new dataframe with only rows that are duplicated in the previous df.
I tried adding a new column that is True where a row is a duplicate and then selecting only the rows where it is True. However, I got 0 rows back, even though I am sure the df contains duplicates.
In the old dataframe I want to keep the first occurrence of each row and remove all the other duplicates.
The column with duplicate values is called 'merged'.
df = df.assign(
    is_duplicate=lambda d: d.duplicated()
).sort_values('merged').reset_index(drop=True)
df2 = df.loc[df['is_duplicate'] == 'True']

They are not strings, they are booleans, so use:
df2 = df.loc[df['is_duplicate']]

I think you need boolean indexing; loc is not needed here:
df[df.duplicated()]
Alternatively, keep your approach but drop the chained .reset_index(drop=True), because the index then no longer matches the original rows. Sorting is also better done separately, before or after filtering:
df = df.assign(is_duplicate=lambda d: d.duplicated())
df2 = df[df['is_duplicate']]
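Since the question says the duplicated values live in the 'merged' column, a sketch along the following lines may be closer to what is wanted. It assumes duplicates should be judged on 'merged' alone (duplicated() with no arguments compares entire rows), and the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the 'merged' column name comes from the question.
df = pd.DataFrame({
    "merged": ["a", "b", "a", "c", "b", "a"],
    "value": [1, 2, 3, 4, 5, 6],
})

# True for every repeat of a 'merged' value after its first occurrence.
dup_mask = df.duplicated(subset="merged", keep="first")

df2 = df[dup_mask]        # only the repeated rows
df_first = df[~dup_mask]  # the old dataframe with just the first occurrences
```

Passing keep=False instead would flag every member of a duplicated group, including the first one.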

Related

How should I filter one dataframe by entries from another one in pandas with isin?

I have two dataframes (df1, df2). The column names and indices are the same; only the column entries differ. Also, df2 has only 20 entries, all of which also exist in df1.
I want to filter df1 by the entries in df2, but when I try to do it with isin, nothing happens.
df1.isin(df2) or df1.index.isin(df2.index)
Please tell me what I'm doing wrong and how I should do it.
First of all, the isin function in pandas returns a boolean mask (a DataFrame of booleans for df1.isin), not the filtered rows you want. So it makes sense that the commands you used did not work.
I am positive that the following post will help:
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]
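A small sketch of both index-based options, using made-up frames in place of the asker's df1 and df2:

```python
import pandas as pd

# Hypothetical data: df2's index labels are a subset of df1's.
df1 = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
df2 = pd.DataFrame({"value": [99, 98]}, index=["b", "d"])

# Label-based selection: raises KeyError if a label of df2 is missing from df1.
by_label = df1.loc[df2.index]

# Mask-based selection: labels missing from df1 are simply ignored.
by_mask = df1[df1.index.isin(df2.index)]

# Negating the mask gives the rows of df1 that are NOT in df2.
not_in_df2 = df1[~df1.index.isin(df2.index)]
```

The mask has to be used to index df1; calling isin on its own only produces the booleans, which is why "nothing happens".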

How to check if a string value does not exist in another dataframe?

I have two dataframes to focus on: df_hours and new_df.
I want to check whether a string value from one dataframe (df_hours) does not exist anywhere in the other dataframe (new_df).
For example:
df_hours has a 'Category' column with string values 'A','B','C' etc.
I want to check if 'A' does not exist in new_df.
I have 2 for loops and inside it I have the following if condition:
for i in range(len(df_hours)):
    for j in range(len(df_hours_copy)):
        if df_hours.iloc[i,1] == df_hours_copy.iloc[j,1] and (~df_hours.iloc[i,1].isin(new_df)):
How can I code the second part of the if (the one after the 'and')?
The idea:
With the code after the 'and', I just want to check that the value does not exist in new_df, and if so insert some values from df_hours into new_df.
I am not sure exactly what you are trying to do with the two loops, but you could use a mask to filter your df, for example:
mask = ~new_df[col_new].isin(df_hours[col].values)
new_df[mask]
where col_new is a column of new_df and col is a column of df_hours; you could loop over the columns if required.
You can use any and a list comprehension to gather all the values of your series that are missing from new_df:
[value for value in df_hours["Category"].unique() if not (new_df==value).any().any()]
Calling .any() once will look for the value column-wise. The second call will check if any True is in the resulting series.
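To make the double .any() concrete, here is a sketch on made-up data. Only the names df_hours, new_df and 'Category' come from the question; the set-based variant is an alternative of my own that avoids re-scanning new_df for every value.

```python
import pandas as pd

# Hypothetical data for illustration.
df_hours = pd.DataFrame({"Category": ["A", "B", "C", "A"]})
new_df = pd.DataFrame({"x": ["A", "D"], "y": ["E", "C"]})

# (new_df == value) is a boolean DataFrame; the first .any() reduces each
# column to one boolean, the second reduces those to a single boolean.
missing = [
    value
    for value in df_hours["Category"].unique()
    if not (new_df == value).any().any()
]

# Alternative: collect every cell of new_df once, then test membership.
present = set(new_df.to_numpy().ravel())
missing_alt = [v for v in df_hours["Category"].unique() if v not in present]
```

Both lists hold the categories from df_hours that appear nowhere in new_df, which are the candidates for insertion.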

Iterate over all rows in dataframe and check all column values are in list

I have a dataframe with 7 columns and ~5.000 rows. I want to check that all the column values in a row are in my list and, if so, either add that row to a new dataframe or remove the rows where not all values match, i.e. remove the False rows (whichever is easiest):
for row in df:
    for columns in row:
        if df.iloc[row, column].isin(MyList):
            ...*something*
I could imagine that .apply and .all could be used, but I'm afraid my Python skills are limited. Any help?
If I understood correctly, you can solve this by using apply with a lambda expression like:
df.loc[df.apply(lambda row: all(value in MyList for value in row), axis=1)]
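A vectorized variant of the same idea, sketched on made-up data: DataFrame.isin tests every cell at once and .all(axis=1) collapses each row to a single boolean, so no Python-level lambda is needed.

```python
import pandas as pd

# Hypothetical list and frame; the question's real data has 7 columns.
MyList = [1, 2, 3, 4]
df = pd.DataFrame({"a": [1, 2, 9], "b": [3, 4, 1], "c": [2, 8, 3]})

# True for rows in which every value is contained in MyList.
row_ok = df.isin(MyList).all(axis=1)

kept = df[row_ok]      # rows where all values match
dropped = df[~row_ok]  # rows with at least one value outside MyList
```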

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive. Is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical operators & (AND) and | (OR).
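Combining filters might look like the sketch below. The column names follow the answer's placeholders and the data is invented; note that each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical table using the placeholder column names from the answer.
df = pd.DataFrame({"Column1": ["x", "y", "x"], "Column2": [1, 2, 3]})

# AND: both conditions must hold for a row to be kept.
both = df[(df["Column1"] == "x") & (df["Column2"] > 1)]

# OR: a row is kept if either condition holds.
either = df[(df["Column1"] == "y") | (df["Column2"] == 1)]
```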
You can try
for i in uniqueArray:
    # string methods live under the .str accessor; regex=False matches i literally
    if newDF['MKT'].str.contains(i, regex=False).any():
        # do your task
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df and want to check whether its column 'MKT' contains any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your processing on new_df, and join/merge/concat it back to the former df as you wish.
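None of the snippets above answers the "in all rows" part directly. One possible sketch, assuming 'MKT' holds strings and that substring matching is what is meant (as in the asker's `i in row['MKT']`), replaces .any() with .all(); the data is made up.

```python
import pandas as pd

# Hypothetical data; only the names newDF, 'MKT' and uniqueArray come from the question.
newDF = pd.DataFrame({"MKT": ["US,EU", "US,ASIA", "US,EU,ASIA"]})
uniqueArray = ["US", "EU", "ASIA"]

# For each element: does it occur (as a literal substring) in every row of 'MKT'?
in_all_rows = {
    item: bool(newDF["MKT"].str.contains(item, regex=False).all())
    for item in uniqueArray
}

# Elements that are present in every single row.
present_everywhere = [item for item, ok in in_all_rows.items() if ok]
```

This still loops over uniqueArray, but each check is a vectorized pass over the column rather than an iterrows() loop.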

Get the column index as a value when value exists

I need to create some additional columns in my table, or a separate table, based on the following:
I have a table
and I need to create additional columns where the column indexes (names of columns) will be inserted as values. Like this:
How to do it in pandas? Any ideas?
Thank you
If you need the matching column names only where the value is 1:
df = (df.set_index('name')
        .eq(1)
        .dot(df.columns[1:].astype(str) + ',')
        .str.rstrip(',')
        .str.split(',', expand=True)
        .add_prefix('c')
        .reset_index())
print(df)
Explanation:
The idea is to create a boolean mask that is True for the values to be replaced by column names: compare with 1 using DataFrame.eq, then use matrix multiplication via DataFrame.dot with all column names except the first, each with a separator appended. Then remove the last trailing separator with Series.str.rstrip, split into new columns with Series.str.split, and rename the columns with DataFrame.add_prefix.
Another solution:
df1 = df.set_index('name').eq(1).apply(lambda x: x.index[x].tolist(), 1)
df = pd.DataFrame(df1.values.tolist(), index=df1.index).add_prefix('c').reset_index()
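The tables from the question did not survive, so the following sketch runs the second approach on invented data. The 'name' column comes from the answer; the other column names and the 0/1 values are assumptions.

```python
import pandas as pd

# Hypothetical input: a 'name' column followed by 0/1 indicator columns.
df = pd.DataFrame({
    "name": ["x", "y", "z"],
    "a": [1, 0, 1],
    "b": [0, 1, 1],
    "c": [1, 0, 0],
})

# For each row, list the names of the columns whose value equals 1.
matches = (
    df.set_index("name")
      .eq(1)
      .apply(lambda row: row.index[row.to_numpy()].tolist(), axis=1)
)

# Spread the lists into columns c0, c1, ...; shorter lists are padded with missing values.
out = pd.DataFrame(matches.tolist(), index=matches.index).add_prefix("c").reset_index()
```

Rows with fewer matches than the widest row end up with missing values in the trailing columns.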
