I have a dataframe with 2415 columns and I want to drop consecutive duplicate columns. That is, if column 1 and column 2 have the same values, I want to drop column 2.
I wrote the below code but it doesn't seem to work:
for i in (0, len(df.columns) - 1):
    if df[i].tolist() == df[i + 1].tolist():
        df = df.drop([i + 1], axis=1)
    else:
        df = df
You need to select the column name from the index, and use range so the loop actually visits every column (a bare (0, n) tuple only iterates two values). Also append the later column of each pair, since that is the one you want dropped. Try this:
columns = df.columns
drops = []
for i in range(len(columns) - 1):
    if df[columns[i]].tolist() == df[columns[i + 1]].tolist():
        drops.append(columns[i + 1])
df = df.drop(drops, axis=1)
Let us try shift, comparing each column with the one to its left:
df.loc[:,~df.eq(df.astype(object).shift(axis=1)).all()]
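For reference, a minimal runnable sketch of the shift approach (the column names and data below are made up for illustration):

```python
import pandas as pd

# Small frame where columns 'b' and 'c' duplicate their left neighbour
df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [1, 2], 'd': [3, 4]})

# Shift the frame one column to the right and compare elementwise;
# a column whose values all equal its predecessor's is dropped.
# The first column is always kept: shifting fills it with NaN,
# so its comparison is never all-True.
mask = ~df.eq(df.astype(object).shift(axis=1)).all()
result = df.loc[:, mask]
print(list(result.columns))  # ['a', 'd']
```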
My application saves an indeterminate number of values in different columns. As a result, I have a data frame with a fixed set of columns at the beginning, but from a particular column (that I know) onward there is an uncertain number of columns holding the same kind of data.
Example:
known1 known2 know3 unknow1 unknow2 unknow3 ...
1 3 3 data data2 data3
The result I would like to get should be something like this
known1 known2 know3 all_unknow
1 3 3 data,data2,data3
How can I do this when I don't know the number of unknown columns? What I do know is that they start (in this example) from the 4th column.
IIUC, use filter to select the columns by keyword:
cols = list(df.filter(like='unknow'))
# ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
or take all columns from the 4th one:
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
output:
known1 known2 know3 all_unknow
0 1 3 3 data,data2,data3
df['all_unknow'] = df.iloc[:, 3:].apply(','.join, axis=1)
if you also want to drop the original columns from the 4th one on:
cols = df.columns[3:-1]
df = df.drop(cols, axis=1)
the -1 is to avoid dropping the newly created column
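Both variants assume the unknown columns already hold strings; casting with astype(str) first makes the join robust to mixed types. A minimal sketch using the question's (hypothetical) column names:

```python
import pandas as pd

# Hypothetical frame matching the question's layout
df = pd.DataFrame({'known1': [1], 'known2': [3], 'know3': [3],
                   'unknow1': ['data'], 'unknow2': ['data2'],
                   'unknow3': ['data3']})

# Join everything from the 4th column on, casting to str to be safe
df['all_unknow'] = df.iloc[:, 3:].astype(str).apply(','.join, axis=1)
# Drop the original unknown columns; the -1 spares the new column
df = df.drop(columns=df.columns[3:-1])
print(df.loc[0, 'all_unknow'])  # data,data2,data3
```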
I have an example dataframe as given below, and am trying to drop the rows where the column cluster_num has only 1 distinct value.
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 5], [1, 3, 7, 9, 10],
                   [2, 6, 2, 7, 9], [2, 2, 4, 7, 0], [3, 1, 9, 2, 7],
                   [4, 9, 5, 1, 2], [5, 8, 4, 2, 1], [5, 0, 7, 1, 2],
                   [6, 9, 2, 5, 7]])
df.rename(columns={0: "cluster_num", 1: "value_1", 2: "value_2",
                   3: "value_3", 4: "value_4"}, inplace=True)

# Dropping rows for which cluster_num has only one distinct value
count_dict = df['cluster_num'].value_counts().to_dict()
df['count'] = df['cluster_num'].apply(lambda x: count_dict[x])
df[df['count'] > 1]
In the above example, the rows where cluster_num equals 3, 4 and 6 would be dropped.
Is there a way of doing this without having to create a separate column? I need all 5 initial columns (cluster_num, value_1, value_2, value_3, value_4) in the output, but the code above carries the extra count column along.
I have tried to filter using groupby() with count(), but it was not working out.
groupby/filter
df.groupby('cluster_num').filter(lambda d: len(d) > 1)
duplicated
df[df.duplicated('cluster_num', keep=False)]
groupby/transform
Per @QuangHoang:
df[df.groupby('cluster_num')['cluster_num'].transform('size') >= 2]
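All three approaches give the same result on the question's dataframe; a quick sketch to verify:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 5], [1, 3, 7, 9, 10],
                   [2, 6, 2, 7, 9], [2, 2, 4, 7, 0], [3, 1, 9, 2, 7],
                   [4, 9, 5, 1, 2], [5, 8, 4, 2, 1], [5, 0, 7, 1, 2],
                   [6, 9, 2, 5, 7]],
                  columns=['cluster_num', 'value_1', 'value_2',
                           'value_3', 'value_4'])

# Keep only rows whose cluster_num occurs more than once
a = df.groupby('cluster_num').filter(lambda d: len(d) > 1)
b = df[df.duplicated('cluster_num', keep=False)]
c = df[df.groupby('cluster_num')['cluster_num'].transform('size') >= 2]

print(sorted(a['cluster_num'].unique().tolist()))  # [1, 2, 5]
```

All variants preserve the original row order and the five original columns; no helper column is created.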
I have a dataframe and a list of some column names that correspond to it. How do I filter the dataframe so that it excludes that list of column names, i.e. I want to keep the dataframe columns that are outside the specified list?
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns = ["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
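A minimal sketch of the list-comprehension approach (the column names and the contents of true_binary_cols below are made-up placeholders):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
true_binary_cols = ['b']  # hypothetical list of columns to exclude

# Iterating a DataFrame yields its column labels, so this keeps
# every column that is not in the exclusion list, in original order
df_filtered = df[[col for col in df if col not in true_binary_cols]]
print(list(df_filtered.columns))  # ['a', 'c']
```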
I am trying to check if the last cell in a pandas data-frame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row, if it is a 2 however I would like to keep it.
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2, 1, 2, 1],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom']})
df2 = pd.DataFrame({'number': [1, 2, 1, 2, 1, 2],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom', 'sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), however in df2 I would want to keep it exactly as it is.
So far, I have thought to try the following but I am getting an error
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the index). While it is possible with df1.drop(df1.index[-1]), this is problematic with a duplicated index. The last row can be excluded with iloc, or a single value read with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
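Put together with the question's df1, a runnable sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2, 1, 2, 1],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom']})

# .iat[-1] reads the last value by position, regardless of index labels
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]  # all rows except the last
print(df1['name'].iat[-1])  # sarah
```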
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
# comparing the one-row Series directly raises ValueError; use .item()
if df1.tail(1)['number'].item() == 1:
    df1.drop(len(df1) - 1, inplace=True)  # assumes the default RangeIndex
You can use the same tail function:
df.drop(df.tail(n).index, inplace=True)  # drop last n rows
Pandas: add column with index of matching row from other dataframe
I am matching multiple columns against the corresponding columns of a 2nd dataframe, and returning the index of the matching row from that 2nd dataframe.
df1['new_column'] = df1.apply(lambda x: df2[(df2.col1 == x.col1)
& (df2.col2 == x.col2)
& (df2.col3 == x.col3)
& (df2.col4 == x.col4)
& (df2.col5 == x.col5)].index[0], axis=1)
The code above works like a charm... unless one of the columns contains nan values, since nan != nan.
In other words, even if col1:col4 in df1 match df2 and col5 is nan in both df1 and df2, it fails to match, returning an empty index object.
I need it to return a match if col1:col5 agree, no matter whether they contain values or nan.
Does anyone know a solution for that?
One workaround here is to simply use fillna to replace all na values with something like a 'NaN' string.
Simply use:
df1 = df1.fillna('NaN')
df2 = df2.fillna('NaN')
Then use your existing code.
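A small sketch of why the workaround is needed and how it behaves (the data is made up, and the 'NaN' sentinel just needs to be a value that cannot occur in the real columns):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [np.nan, 'x']})
df2 = pd.DataFrame({'col1': [1, 2], 'col2': [np.nan, 'x']})

# NaN never equals NaN, so the raw comparison finds no match for row 0
raw_matches = df1.apply(
    lambda x: len(df2[(df2.col1 == x.col1) & (df2.col2 == x.col2)]), axis=1)
print(raw_matches.tolist())  # [0, 1]

# After replacing NaN with a sentinel string, both rows match
f1, f2 = df1.fillna('NaN'), df2.fillna('NaN')
idx = f1.apply(
    lambda x: f2[(f2.col1 == x.col1) & (f2.col2 == x.col2)].index[0], axis=1)
print(idx.tolist())  # [0, 1]
```

Note that fillna on a numeric column changes its dtype to object; for large frames, a merge on the filled columns would likely be faster than a row-wise apply.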