Data in Excel:
a b a d
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Code:
df = pd.read_excel(r"sample.xlsx", sheet_name="Sheet1")
df
a b a.1 d
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
How do I delete the column a.1?
When pandas reads the data from Excel, it automatically renames the second a column to a.1.
I tried df.drop("a.1", index=1), but this does not work.
I have a huge Excel file with duplicate column names, and I am only interested in a few of the columns.
You need to pass axis=1 for drop to work:
In [100]:
df.drop('a.1', axis=1)
Out[100]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
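In pandas 0.21 and later, drop also accepts a columns= keyword, which avoids having to remember the axis number. A minimal sketch on a frame mirroring the question's data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "a.1": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# Equivalent to df.drop('a.1', axis=1); drop returns a new frame
# unless inplace=True is passed, so assign the result.
result = df.drop(columns="a.1")
print(result.columns.tolist())  # ['a', 'b', 'd']
```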
Or just pass a list of the cols of interest for column selection:
In [102]:
cols = ['a','b','d']
df[cols]
Out[102]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Also works with label-based selection (the original answer used 'fancy indexing' via .ix, which has since been removed from pandas; .loc is the replacement):
In [103]:
df.loc[:,cols]
Out[103]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
If you know the name of the column you want to drop:
df = df[[col for col in df.columns if col != 'a.1']]
and if you have several columns you want to drop:
columns_to_drop = ['a.1', 'b.1', ... ]
df = df[[col for col in df.columns if col not in columns_to_drop]]
More generally, to drop every column that pandas de-duplicated with a numeric suffix:
df = df.drop(df.filter(regex=r'\.\d').columns, axis=1)
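As a quick, self-contained sketch of the regex-based drop (the sample frame below just mirrors the columns from the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "a.1", "d"])

# filter(regex=...) selects columns whose names match the pattern;
# the raw string r'\.\d' matches a literal dot followed by a digit,
# the suffix pandas appends to de-duplicate column names.
df = df.drop(df.filter(regex=r"\.\d").columns, axis=1)
print(df.columns.tolist())  # ['a', 'b', 'd']
```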
Related
I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge them so that column 'A' contains the union of the values from both DataFrames, with duplicate values merged into a single row.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
You can use combine_first, but note that it aligns on the index, not on the values of 'A', so set 'A' as the index first:
df2 = df2.set_index('A').combine_first(df.set_index('A')).reset_index()
Output:
A B C
0 1 3.0 7
1 2 4.0 6
2 3 NaN 5
(These are the desired rows, just sorted by 'A' in ascending order.)
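If you want to keep the row order of df2, a merge-based sketch gives the same pairing, assuming the values in 'A' are unique within each frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [3, 2, 1], "C": [5, 6, 7]})

# Left-merge onto df2 so its row order is preserved; rows of df2
# with no match in df get NaN for B.
out = df2.merge(df, on="A", how="left")[["A", "B", "C"]]
print(out)
#    A    B  C
# 0  3  NaN  5
# 1  2  4.0  6
# 2  1  3.0  7
```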
Q1) Is there any way to set row to column-row in DataFrame?
(DF) (DF)
A B C D a b c d
0 a b c d pandas function 0 4 5 3 6
1 4 5 3 6 ==========================> 1 3 2 5 3
2 3 2 5 3 0-idx row to columns-row 2 4 7 9 0
3 4 7 9 0
Q2) How to get DataFrame from Excel file with setting any index row to column_row?
(EXCEL or CSV) (DF)
A B C D a b c d
0 a b c d pd.read_excel() 0 4 5 3 6
1 4 5 3 6 ==========================> 1 3 2 5 3
2 3 2 5 3 0-idx row to columns-row 2 4 7 9 0
3 4 7 9 0
You can try this:
import pandas as pd
data = {"A":['a',4,3,4], "B":['b',5,2,7], "C":['c',3,5,9], "D":['d',6,3,0]}
df = pd.DataFrame(data)
# promote row 0 to the column names
df.columns = df.iloc[0].tolist()
# drop row 0, which now duplicates the header
df = df.drop(0)
result:
a b c d
1 4 5 3 6
2 3 2 5 3
3 4 7 9 0
This would do the job:
new_header = df.iloc[0] #first_row
df = df[1:] #remaining_dataframe
df.columns = new_header
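For Q2 specifically, read_excel (like read_csv) can promote a given row to the header at read time via the header= parameter, so no post-hoc fix is needed. A sketch using read_csv on an in-memory string for illustration; read_excel accepts the same header= argument, counting rows from the top of the file:

```python
import io
import pandas as pd

csv = "A,B,C,D\na,b,c,d\n4,5,3,6\n3,2,5,3\n4,7,9,0\n"

# header=1 tells pandas the second line holds the column names;
# everything above it is discarded.
df = pd.read_csv(io.StringIO(csv), header=1)
print(df.columns.tolist())  # ['a', 'b', 'c', 'd']
```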
I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a little bit more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as #Cpt_Jauchefuerst mentioned in the comments, DataFrame.assign(z=1, a=1) historically added columns in alphabetical order (a before z) because keyword arguments were unordered; on Python 3.6+ with a modern pandas, the keyword order is preserved.
A pd.concat approach:
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
I wonder if there is a handy method to order the columns of a dataframe based on another one that has the same columns but with different order. Or, do I have to make a loop to achieve this?
Try this:
df2 = df2[df1.columns]
Demo:
In [1]: df1 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('abcd'))
In [2]: df2 = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('badc'))
In [3]: df1
Out[3]:
a b c d
0 8 3 9 6
1 0 6 4 7
2 7 2 0 7
3 0 5 1 8
4 6 2 5 4
In [4]: df2
Out[4]:
b a d c
0 3 8 0 4
1 7 7 4 2
2 2 7 3 8
3 2 4 9 6
4 3 4 7 1
In [5]: df2 = df2[df1.columns]
In [6]: df2
Out[6]:
a b c d
0 8 3 4 0
1 7 7 2 4
2 7 2 8 3
3 4 2 6 9
4 4 3 1 7
Alternative solution:
df2 = df2.reindex(df1.columns, axis=1)
Note: the older df2.reindex_axis(df1.columns, axis=1) is deprecated since pandas 0.21.0; use reindex instead.
If I have these columns in a dataframe:
a b
1 5
1 7
2 3
1,2 3
2 5
How do I create column c, where column b is summed over the groupings in column a (a string), while keeping the existing DataFrame? Some rows can belong to more than one group.
a b c
1 5 15
1 7 15
2 3 11
1,2 3 26
2 5 11
Is there an easy and efficient solution as the dataframe I have is very large.
You first need to split column a and join the result to the original DataFrame:
print (df.a.str.split(',', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('a'))
0 1
1 1
2 2
3 1
3 2
4 2
Name: a, dtype: object
df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                   .stack()
                   .reset_index(level=1, drop=True)
                   .rename('a')))
print (df1)
b a
0 5 1
1 7 1
2 3 2
3 3 1
3 3 2
4 5 2
Then use transform to compute the per-group sum without aggregating:
df1['c'] = df1.groupby(['a'])['b'].transform('sum')
# keep 'a' as string so the final ','.join aggregation works
df1['a'] = df1.a.astype(str)
print (df1)
b a c
0 5 1 15
1 7 1 15
2 3 2 11
3 3 1 15
3 3 2 11
4 5 2 11
Finally, group by the index and aggregate the columns with agg:
print (df1.groupby(level=0)
.agg({'a':','.join,'b':'first' ,'c':sum})
[['a','b','c']] )
a b c
0 1 5 15
1 1 7 15
2 2 3 11
3 1,2 3 26
4 2 5 11
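The whole split / transform / re-aggregate pipeline from this answer, as one runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "1", "2", "1,2", "2"],
                   "b": [5, 7, 3, 3, 5]})

# Explode each comma-separated group into its own row, keeping the
# original row index so we can re-aggregate at the end.
exploded = (df.drop("a", axis=1)
              .join(df["a"].str.split(",", expand=True)
                           .stack()
                           .reset_index(level=1, drop=True)
                           .rename("a")))

# Per-group sums, broadcast back to every member row.
exploded["c"] = exploded.groupby("a")["b"].transform("sum")

# Collapse back to one row per original index; a row in both groups
# ('1,2') gets c = 15 + 11 = 26.
result = (exploded.groupby(level=0)
                  .agg({"a": ",".join, "b": "first", "c": "sum"})
                  [["a", "b", "c"]])
print(result)
```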