Pandas DataFrame drop tuple or list of columns - python

The drop method of pandas.DataFrame accepts a list of column names but not a tuple, even though the documentation says that "list-like" arguments are acceptable. Am I misreading the documentation? I would expect both calls in my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.

The problem is that pandas uses tuples to select labels in a MultiIndex:
import numpy as np
import pandas as pd
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
Selection interprets a tuple the same way (run this on the original df, before the drop above):
s = df[('a', 'c')]
print (s)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32

Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in a multi-index DataFrame:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you pass a tuple, pandas treats it as a single multi-index label rather than as a sequence of labels.
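To make the difference concrete, here is a minimal sketch that contrasts the two calls on the question's flat-column DataFrame and then on a small MultiIndex frame. The variable names (cols, dropped, tuple_error, wide) are made up for illustration, and the KeyError is what I would expect from the pandas versions I am aware of, not something the documentation promises.

```python
import pandas as pd

df = pd.DataFrame({k: range(5) for k in "abcd"})
cols = ("a", "c")

# Converted to a list, each element is looked up as a separate column label.
dropped = df.drop(list(cols), axis=1)

# Passed as a tuple, the whole tuple is looked up as ONE label ('a', 'c'),
# which does not exist in flat columns.
try:
    df.drop(cols, axis=1)
    tuple_error = None
except KeyError as exc:
    tuple_error = exc

# With MultiIndex columns the same tuple is a valid single label.
mux = pd.MultiIndex.from_arrays([list("abcde"), list("cdefg")])
wide = pd.DataFrame([[0] * 5], columns=mux)
wide_dropped = wide.drop(("a", "c"), axis=1)

print(list(dropped.columns))
print(repr(tuple_error))
print(list(wide_dropped.columns))
```

So if the column names arrive as a tuple, wrapping them in list(...) before calling drop seems to be the simplest workaround.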

I used del to remove a column whose label is a tuple:
del df3[('val1', 'val2')]
and the column was deleted.

Related

Pandas custom groupby

Is there any way to use a custom groupby function in pandas? For example, suppose I have the data below.
a|b|c
-----
1 2 3
1 2 4
1 3 7
1 4 3
1 4 5
2 1 0
2 3 5
2 4 6
2 3 6
3 1 0
4 1 0
4 2 3
Is it possible to group my data by a and b if a is not in [2,4] and by a otherwise?
In the example above I'd like to get the following groups (one group per line):
123, 124
137
143, 145
210, 235, 246, 236
310
410, 423
The column b is an open set, so I would ideally like a function that is independent of the values in b.
You can mask column b where a meets your condition (using isin), replace it with any constant value (like 1), and then use the result in the groupby.
for _, dfg in df.groupby(['a',
                          df['b'].mask(df['a'].isin([2,4]), # condition
                                       1)]): # replacement value
    print('new group')
    print(dfg)
new group
a b c
0 1 2 3
1 1 2 4
new group
a b c
2 1 3 7
new group
a b c
3 1 4 3
4 1 4 5
new group
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
new group
a b c
9 3 1 0
new group
a b c
10 4 1 0
11 4 2 3
IIUC, you can also try the following. If the value of a is in [2, 4], it ignores the value in column b and groups those rows together.
for _, k in df.groupby([df.a.values, np.where(df.a.isin([2, 4]), 0, df.b)]):
    print(k)
OUTPUT:
a b c
0 1 2 3
1 1 2 4
a b c
2 1 3 7
a b c
3 1 4 3
4 1 4 5
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
a b c
9 3 1 0
a b c
10 4 1 0
11 4 2 3
You can create a temporary Series of tuples, containing either (a) or (a, b) and then group by that:
a = df[['a']].apply(tuple, axis=1)
ab = df[['a', 'b']].apply(tuple, axis=1)
df['group'] = np.where(df['a'].isin([2,4]), a, ab)
Output
> df.sort_values('group')
a b c group
1 2 3 (1, 2)
1 2 4 (1, 2)
1 3 7 (1, 3)
1 4 3 (1, 4)
1 4 5 (1, 4)
2 1 0 (2,)
2 3 5 (2,)
2 4 6 (2,)
2 3 6 (2,)
3 1 0 (3, 1)
4 1 0 (4,)
4 2 3 (4,)
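You can do this indirectly. First define a function that returns a group label for each row:
def grouping(row):
    if row.a in [2, 4]:
        return str(row.a)  # group by a alone
    else:
        return f"{row.a}_{row.b}"  # group by a and b
Then use apply with axis=1 (so the function receives rows, not columns) to get the grouping column:
df['grouping'] = df.apply(grouping, axis=1)
Then group by the grouping column:
groups = df.groupby('grouping')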
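As a quick sanity check on the masking idea from the answers above, the sketch below rebuilds the question's data and counts the rows in each group. The sentinel -1 and the names key_b and group_sizes are my own choices for illustration. Any constant should work as the replacement, since a is still part of the key.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    "b": [2, 2, 3, 4, 4, 1, 3, 4, 3, 1, 1, 2],
    "c": [3, 4, 7, 3, 5, 0, 5, 6, 6, 0, 0, 3],
})

# Hide b wherever a is 2 or 4, so those rows are keyed by a alone.
# -1 is an arbitrary sentinel (an assumption, not from the answers).
key_b = df["b"].mask(df["a"].isin([2, 4]), -1)

# Group by a and the masked b, then count rows per group.
group_sizes = df.groupby(["a", key_b]).size()
print(group_sizes)
```

The six resulting groups match the six groups asked for in the question, independent of which values appear in b.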

How to create new columns based on whether another group of Columns Exists

My problem is as follows:
I have a DataFrame df with 5 columns, say ('A', 'B', 'C', 'D', 'E').
I want to combine these columns in sets, say GP1 = ['A', 'B', 'D'] and GP2 = ['C', 'E'], and create two new columns from them:
df['Group1'] = df[GP1].min(axis=1)
df['Group2'] = df[GP2].max(axis=1)
However, depending on the data, column 'A' (or 'D', or 'B', or all of them) may be missing from the first set, and column 'C' or 'E' (or both) may be missing from the second set.
So I want the code to check whether any column of the first or second set is missing, create 'Group1' or 'Group2' only if all columns of that group exist, and skip creating the new column otherwise.
How can I achieve that? I tried for loops, but the logic became complicated.
An example where all the columns of both sets are present:
df_in
A B C D E
1 2 3 4 5
2 4 6 2 3
1 0 2 4 2
df_out
A B C D E Group1 Group2
1 2 3 4 5 1 5
2 4 6 2 3 2 6
1 0 2 4 2 0 2
An example where column E from the second group is missing:
df_in
A B C D
1 2 3 4
2 4 6 2
1 0 2 4
df_out
A B C D Group1
1 2 3 4 1
2 4 6 2 2
1 0 2 4 0
When both A and D are missing from set 1 (only B is present from group 1):
df_in
B C E
2 3 5
4 6 3
0 2 2
df_out
B C E Group2
2 3 5 5
4 6 3 6
0 2 2 2
When A is missing from set 1 and C is missing from set 2:
df_in
B D E
2 4 5
4 2 3
0 4 2
df_out
B D E
2 4 5
4 2 3
0 4 2
Any help in this direction will be immensely appreciated. Thanks
Here you go, I think you can use this:
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
MCVE:
df_in = pd.read_clipboard() #Read from copy of df_in in the question above
print(df_in)
# A B C D E
# 0 1 2 3 4 5
# 1 2 4 6 2 3
# 2 1 0 2 4 2
gp1 = ['A','B','D']
gp2 = ['C','E']
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# A B C D E Group1 Group2
# 0 1 2 3 4 5 1 5
# 1 2 4 6 2 3 2 6
# 2 1 0 2 4 2 0 2
df_in_copy=df_in.copy() #make a copy to reuse later
df_in = df_in.drop('E', axis=1) #Drop Col E
print(df_in)
# A B C D
# 0 1 2 3 4
# 1 2 4 6 2
# 2 1 0 2 4
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# A B C D Group1
# 0 1 2 3 4 1
# 1 2 4 6 2 2
# 2 1 0 2 4 0
df_in = df_in_copy.copy() #Restore the original from the copy
df_in = df_in.drop(['A','D'], axis=1) #Drop Columns A and D
print(df_in)
# B C E
# 0 2 3 5
# 1 4 6 3
# 2 0 2 2
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# B C E Group2
# 0 2 3 5 5
# 1 4 6 3 6
# 2 0 2 2 2
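If you prefer an explicit check over the reindex/dropna trick, a small helper can test whether every column of a group is present before creating the new column. This is only a sketch: add_group_columns and the groups mapping are hypothetical names I made up, not part of pandas or of the answer above.

```python
import pandas as pd

def add_group_columns(df, groups):
    """Return a copy of df with one aggregate column per complete group.

    groups maps a new column name to (source columns, aggregation name).
    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for name, (cols, how) in groups.items():
        # Create the column only if every source column exists.
        if set(cols).issubset(df.columns):
            out[name] = df[cols].agg(how, axis=1)
    return out

groups = {"Group1": (["A", "B", "D"], "min"),
          "Group2": (["C", "E"], "max")}

full = pd.DataFrame({"A": [1, 2, 1], "B": [2, 4, 0], "C": [3, 6, 2],
                     "D": [4, 2, 4], "E": [5, 3, 2]})

all_present = add_group_columns(full, groups)
without_e = add_group_columns(full.drop(columns="E"), groups)
without_a_c = add_group_columns(full.drop(columns=["A", "C"]), groups)

print(all_present)
print(without_e)
print(without_a_c)
```

Unlike the dropna-based version, this does not depend on NaN values, so it should behave the same when the input data itself contains missing values.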

Is there any way to set a row as the column header row in a DataFrame? Or how to read an Excel file into a DataFrame using a given row as the column header row?

Q1) Is there any way to set a row as the column header row in a DataFrame?
      (DF)                                         (DF)
   A  B  C  D                                   a  b  c  d
0  a  b  c  d        pandas function         0  4  5  3  6
1  4  5  3  6  ==========================>   1  3  2  5  3
2  3  2  5  3   0-idx row to columns-row     2  4  7  9  0
3  4  7  9  0
Q2) How to read an Excel file into a DataFrame using a given row as the column header row?
  (EXCEL or CSV)                                   (DF)
   A  B  C  D                                   a  b  c  d
0  a  b  c  d        pd.read_excel()         0  4  5  3  6
1  4  5  3  6  ==========================>   1  3  2  5  3
2  3  2  5  3   0-idx row to columns-row     2  4  7  9  0
3  4  7  9  0
You can try it:
import pandas as pd
data={"A":['a',4,3,4],"B":['b',5,2,7],"C":['c',3,5,9],"D":['d',6,3,0]}
df=pd.DataFrame(data)
# take the column names from the row at index 0
df.columns=df.iloc[0].tolist()
# remove the row at index 0
df = df.drop(0)
Result:
a b c d
1 4 5 3 6
2 3 2 5 3
3 4 7 9 0
This would do the job:
new_header = df.iloc[0] #first_row
df = df[1:] #remaining_dataframe
df.columns = new_header
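For Q2, the header parameter of the reader can pick the header row while loading, which avoids fixing the columns afterwards. The sketch below uses pd.read_csv on an in-memory string so it runs without a file; pd.read_excel accepts a header parameter as well. It assumes the file's first line really is A,B,C,D and the wanted names are on the second line. The csv_text content is made up to mirror the question.

```python
import io
import pandas as pd

# Stand-in for the question's file: line 0 is "A,B,C,D", line 1 is "a,b,c,d".
csv_text = "A,B,C,D\na,b,c,d\n4,5,3,6\n3,2,5,3\n4,7,9,0\n"

# header=1 uses the line at 0-based position 1 as column names
# and skips the lines before it.
df = pd.read_csv(io.StringIO(csv_text), header=1)
print(df)
```

A side benefit is that the data columns are parsed as numbers here, whereas after promoting a row in an existing DataFrame the columns may keep the object dtype.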

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
Or a slightly more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
Another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
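NOTE: as @Cpt_Jauchefuerst mentioned in the comment, DataFrame.assign(z=1, a=1) will add columns in alphabetical order, i.e. first a will be added to the existing columns and then z. This applied to Python versions before 3.6; since pandas 0.23 on Python 3.6 and later, the keyword argument order is preserved.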
A pd.concat approach
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
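One detail worth keeping in mind with the assign approach is that it returns a new DataFrame and leaves the original untouched, so the result has to be bound to a name. A minimal sketch follows. The names new_cols and result are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1, 6)})
new_cols = ["b", "c"]

# assign returns a new DataFrame; df itself is not modified.
result = df.assign(**{col: df["a"] for col in new_cols})

print(result)
print(list(df.columns))
```

The loop version (df[i] = df.a) modifies df in place instead, so which one fits better depends on whether you want to keep the original frame around.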

Add new column to pandas DataFrame with value dependent on previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
Now I want to create column e, where for each row i the value would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
Desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
How can I achieve this?
You can use sub with shift (df here is the question's df1):
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need integers, convert with astype:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
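Series.diff() is another way to express the same row-to-previous-row difference, and it may read a bit more directly. The sketch below hard-codes the question's data, since the original was random. The small Series s is made up to show one difference between the two approaches: with sub(..., fill_value=0) the first row becomes d[0] itself, while diff().fillna(0) always gives 0 there. Both agree on the question's data only because d starts at 0.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [9, 3, 1, 8], "b": [3, 9, 7, 0],
                    "c": [3, 5, 5, 1], "d": [0, 1, 6, 7]})

# diff() computes d[i] - d[i-1]; the first row has no predecessor and
# becomes NaN, which is replaced by 0 before converting back to int.
df1["e"] = df1["d"].diff().fillna(0).astype(int)
print(df1)

# The two approaches differ when the first value is not 0.
s = pd.Series([3, 5])
via_diff = s.diff().fillna(0).astype(int).tolist()               # [0, 2]
via_shift = s.sub(s.shift(), fill_value=0).astype(int).tolist()  # [3, 2]
print(via_diff, via_shift)
```

Which first-row value is right depends on what the column is meant to represent, so it is worth deciding that explicitly.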
