I have this dataframe:
I want to add each pair of columns together, e.g. duration + credit_amount, so I have written the following function:
def automate_add(add):
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                columnName = column + '_sum_' + operando
                add[columnName] = df[column] + df[operando]
with the output:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_duration
credit_amount_sum_installment_commitment
installment_commitment_sum_duration
installment_commitment_sum_credit_amount
However, since duration + credit_amount = credit_amount + duration, I would like to avoid repeated columns.
Expecting this result from the function:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_installment_commitment
How can I do it?
I have tried using hash sets, but they seem to work only on pandas Series [1].
EDIT:
Dataframe: https://www.openml.org/d/31
Use the below; it should be faster:
import itertools
import pandas as pd

my_list = [pd.Series(df.loc[:, list(i)].sum(axis=1),
                     name='_sum_'.join(df.loc[:, list(i)].columns))
           for i in itertools.combinations(df.columns, 2)]
final_df = pd.concat(my_list, axis=1)
print(final_df)
duration_sum_credit_amount duration_sum_installment_commitment \
0 1175 10
1 5999 50
2 2108 14
3 7924 44
4 4894 27
credit_amount_sum_installment_commitment
0 1173
1 5953
2 2098
3 7884
4 4873
Explanation:
print(list(itertools.combinations(df.columns,2))) gives:
[('duration', 'credit_amount'),
('duration', 'installment_commitment'),
('credit_amount', 'installment_commitment')]
After that, do:
for i in itertools.combinations(df.columns, 2):
    print(df.loc[:, list(i)])
    print("---------------------------")
This prints each combination of columns together. So I just summed them on axis=1, wrapped the result in a pd.Series, and named it by joining the two column names with '_sum_'.
After that, just append each Series to a list and concat them on axis=1 to get the final result. :)
You have already been pointed to itertools.combinations, which is the right tool here; it will save you some for loops and avoid the issue of repeated columns. See the documentation for more details about permutations, combinations, etc.
First, let's create the DataFrame so we can reproduce the example:
import pandas as pd
from itertools import combinations
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
Now let's get to work. The idea is to get all the combinations of the columns, then do a dictionary comprehension to return something like {column_name: sum}. Here it is:
>>> pd.DataFrame({c1 + '_sum_' + c2: df[c1] + df[c2]
...               for c1, c2 in combinations(df.columns, 2)})
a_sum_b a_sum_c b_sum_c
0 5 8 11
1 7 10 13
2 9 12 15
Notice you can replace sum with any other function that operates on two pd.Series.
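For instance, a sketch of the same pattern with products instead of sums (the _prod_ naming is just an assumption for illustration):
>>> pd.DataFrame({c1 + '_prod_' + c2: df[c1] * df[c2]
...               for c1, c2 in combinations(df.columns, 2)})
   a_prod_b  a_prod_c  b_prod_c
0         4         7        28
1        10        16        40
2        18        27        54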
The function can have one more if condition to check whether the commuted sum has already been added as a column to the dataframe, like below:
def automate_add(add):
    columnLst = []
    # list of column names already added, used to skip the commuted duplicates
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                if operando + '_sum_' + column not in columnLst:
                    columnName = column + '_sum_' + operando
                    add[columnName] = df[column] + df[operando]
                    columnLst.append(columnName)
I haven't tested this on your data. Try it and let me know if it doesn't work.
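A minimal usage sketch, assuming df holds just the three numeric columns from the question and passing an empty frame to collect the sums:
add = pd.DataFrame()
automate_add(add)
print(add.columns.tolist())
# expected: ['duration_sum_credit_amount',
#            'duration_sum_installment_commitment',
#            'credit_amount_sum_installment_commitment']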
I just wanted to ask the community and see if there is a more efficient way to do this.
I have several rows in a data frame, and I am using .loc to filter values in column A so I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation I need on column B.
But there are many conditions I must go through, so I was wondering if I could possibly make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I manipulate the iteration so it shows the results only once for each unique value in column 'a'?
import pandas as pd

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a, b):
    list_1.append([i, j])
df1 = pd.DataFrame(list_1, columns=col)

for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using set
set_a = set(a)
for i in set_a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]
df = pd.DataFrame({'a': a, 'b': b})
print(df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean of df['b'] grouped by df['a'] is:
b
a
1 6.4
2 7.0
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
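If there really are many conditions, one way is to keep them in a list and loop, as the asker suggested. A sketch using the example data from the question (the condition values are assumptions):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                    'b': [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]})

conditions = [1, 2]  # the values of 'a' to filter on
for cond in conditions:
    subset = df1.loc[df1['a'] == cond]
    print(cond, subset['b'].mean())  # prints 1 6.4 and 2 7.0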
I have a data-frame with column1 containing string values and column2 containing lists of string values.
I want to iterate through column1 and concatenate column1 values with their corresponding row values into a new data-frame.
Say, my input is
`dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}`
after the operation my data will look like this
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax. Also, I would like to remove the NaN values from the lists.
It is possible to use a list comprehension to flatten the values of the lists, join the values with +, and pass the result to the DataFrame constructor:
# if necessary
# df = df.reset_index()

# flatten values and filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET', 'unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Or use DataFrame.explode (pandas 0.25+): create a default index, remove missing values with DataFrame.dropna, join the columns with +, and convert to a one-column DataFrame with Series.to_frame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Coming from your original data, you can do the below using explode (new in pandas 0.25+) and agg:
Input:
dfd = {'TRAINSET':['101','102','103', '104'],
'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd
_dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()
def refactor(row):
    key, l = str(row["TRAINSET"]), row["unique"]  # 'unique' already holds a list
    res = [key + i for i in l]
    return res
dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
dfd2.set_index("TRAINSET", inplace=True)
dfd2.drop("unique", inplace=True, axis=1)
print(dfd2)
I have a DataFrame which I want to group by a few columns. I know how to aggregate the data after that, or view each index tuple. However, I am unsure of the best way to simply append the "group number" of each group as a column on the original dataframe:
For example, I have a dataframe, df, with two indices (a_id and b_id) which I want to use for grouping the df using groupby.
import pandas as pd
a = pd.DataFrame({'a_id':['q','q','q','q','q','r','r','r','r','r'],
'b_id':['m','m','j','j','j','g','g','f','f','f'],
'val': [1,2,3,4,5,6,7,8,9,8]})
# Output:
a_id b_id val
0 q m 1
1 q m 2
2 q j 3
3 q j 4
4 q j 5
5 r g 6
6 r g 7
7 r f 8
8 r f 9
9 r f 8
When I do the groupby, rather than aggregate everything, I just want to add a column group_id that has an integer representing the group. However, I am not sure if there is a simple way to do this. My current solution involves inverting the GroupBy.indices dictionary, turning that into a series, and appending it to the dataframe as follows:
gb = a.groupby(['a_id','b_id'])
dict_g = dict(enumerate(gb.indices.values()))
dict_g_reversed = {x:k for k,v in dict_g.items() for x in v}
group_ids = pd.Series(dict_g_reversed)
a['group_id'] = group_ids
This gives me sort of what I want, although the group_id indices are not in the right order. This seems like it should be a simple function, but I'm not sure why it seems not to be. I know in MATLAB, for example, they have a findgroups that does exactly what I would like. So far I haven't been able to find an equivalent in pandas. How can this be done with a pd DataFrame?
You can use ngroup; pass sort=False so that the groups are numbered in order of occurrence:
a.groupby(['a_id','b_id'], sort=False).ngroup()
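Assigning it back to the frame, a quick sketch using the a DataFrame from the question:
a['group_id'] = a.groupby(['a_id', 'b_id'], sort=False).ngroup()
print(a['group_id'].tolist())
# [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]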
Or use factorize:
pd.factorize(list(map(tuple, a[['a_id','b_id']].values.tolist())))[0] + 1
a['group_id'] = pd.factorize(list(map(tuple, a[['a_id','b_id']].values.tolist())))[0] + 1
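factorize numbers the key tuples in order of first appearance, so on the example frame the codes come out as (a sketch):
codes = pd.factorize(list(map(tuple, a[['a_id', 'b_id']].values.tolist())))[0] + 1
print(codes.tolist())
# [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]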
I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE':['studyid',1,'age_interview', 65,'Gender','1.Male',
'2.Female',
'Ethnicity','1.Chinese','2.Indian','3.Malay']})
The data is a single VARIABLE column containing both the field names and their values, as generated above.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn the values which start with a non-digit (a letter) into new columns of a dataframe. It can be a new dataframe.
I attempted to write a for loop, but it failed with an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if df1['VARIABLE'][i].astype(str).contains('^\d') == True:
Through the above loop, I was trying to check whether the first character is a digit; if yes, retain it as a value (e.g. 1, 2, 3, etc.), and if it's a letter (e.g. gender, ethnicity, etc.), create a new column. But I guess this is an incorrect and lengthy approach.
For example, in the above example, the new columns would be studyid, age_interview, Gender and Ethnicity, with the corresponding values listed underneath each.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group the VARIABLE column by the cumulative sum of this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
     '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay
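For reference, the intermediate grouped list alternates between name groups and value groups, which is why the dict comprehension zips grouped[::2] with grouped[1::2]:
print(grouped)
# [['studyid'], ['1'], ['age_interview'], ['65'], ['Gender'],
#  ['1.Male', '2.Female'], ['Ethnicity'],
#  ['1.Chinese', '2.Indian', '3.Malay']]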
What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a pandas noob. Any suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756e+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Here's a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True].
Pandas allows one to index using boolean values whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True])
Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks column names, not column values.
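A minimal sketch reproducing the ['alpha','beta','alpha'] example from the explanation:
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())               # [False False  True]
print(df.loc[:, ~df.columns.duplicated()].copy())
#    alpha  beta
# 0      1     2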
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
import numpy as np

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that's messing it up. We could give more help if there's more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file 'dummy.csv' I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn't let you, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
     ...: for col in df2.ix[col_ix]:
     ...:     cnt = col_counts[col]
     ...:     col_counts[col] += 1
     ...:     suf = '_' + str(cnt) if cnt else ''
     ...:     cols.append(col + suf)
     ...: cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")
        vs = list(dcols.values())  # wrap in list() so indexing also works on Python 3
        ks = list(dcols.keys())
        lvs = len(vs)
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break
    return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory efficient version that treats nans like any other value:
from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution, and with fewer lines than #kalu's function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().T.set_index(original_index)
Note that Gene Burinsky's answer (the selected answer at the time of writing) keeps the first of each duplicated column. To keep the last:
df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
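A quick sketch of the keep-last variant on the ['alpha','beta','alpha'] columns used earlier (the values are for illustration):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])
print(df.loc[:, ~df.columns[::-1].duplicated()[::-1]])
#    beta  alpha
# 0     2      3   <- the last 'alpha' (value 3) survives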
An update on #kalu's answer, which uses the latest pandas:
def find_duplicated_columns(df):
    dupes = []
    columns = df.columns
    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break
    return dupes
Although #Gene Burinsky's answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1 may generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
Note that with duplicated labels, drop removes every column carrying that name, including the first occurrence, so this only suits the case where losing those columns entirely is acceptable.
I ran into this problem where the one-liner provided by the first answer worked well. However, I had the extra complication that the second copy of each duplicated column had all of the data, while the first copy did not.
The solution was to create two data frames by splitting the one data frame, toggling the negation operator on the duplicated-columns mask. Once I had the two data frames, I ran a join statement using lsuffix. This way, I could then reference and delete the columns without the data.
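A rough sketch of the described approach (the mask, variable names, and lsuffix value are assumptions reconstructed from the description):
import pandas as pd

# split the frame in two by toggling the negation of the duplicated-name mask
mask = df.columns.duplicated()
first_copies = df.loc[:, ~mask]   # first occurrence of each name (here: without the data)
second_copies = df.loc[:, mask]   # second occurrence (here: holding all of the data)

# join them back; lsuffix disambiguates the clashing column names
joined = second_copies.join(first_copies, lsuffix='_empty')
# the '_empty' columns can then be inspected and dropped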
- E
March 2021 update
The subsequent post by #CircArgs may have provided a succinct one-liner to accomplish what I described here.
First step: read the first row only, i.e. all the column names, and remove the duplicate columns.
Second step: read the file again using only those columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
The approach below will identify dupe columns, so you can review what went wrong when the dataframe was originally built.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Just in case somebody is still looking for an answer on how to find duplicated column values in a pandas DataFrame in Python, I came up with this solution:
def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify if you have duplicated columns.
    Can help whenever you are cleaning big data sets of 50+ columns
    and clean them up a little bit for you.
    The result will be a list of tuples showing which columns are duplicates,
    for example:
    (column A, column C)
    That means that column A is duplicated with column C.
    More info: https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    y = 0
    while y <= len(headers_list) - 1:
        # compare the current head of the list against every later column
        for x in range(1, len(headers_list)):
            if m[headers_list[y]].equals(m[headers_list[x]]) == False:
                continue
            else:
                duplicate_col2.append((headers_list[y], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2
And you can call the function like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
I am not sure why Gene Burinsky's answer did not work for me; I was getting back the same original dataframe with duplicated columns. My workaround was to force the selection over the underlying ndarray and rebuild the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple column-wise comparison is the most efficient way (in terms of memory and time) to check duplicated columns by values. Here is an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop = []
for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'Are equal')
df = df.drop(columns_to_drop, axis=1)
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: see the pandas DataFrame drop_duplicates manual.