pandas dataframe select columns in multiindex [duplicate] - python

This question already has answers here:
Selecting columns from pandas MultiIndex
(13 answers)
Closed 4 years ago.
I have the following pd.DataFrame:
Name         0                   1           ...
Col          A         B         A         B ...
0     0.409511 -0.537108 -0.355529  0.212134 ...
1    -0.332276 -1.087013  0.083684  0.529002 ...
2     1.138159 -0.327212  0.570834  2.337718 ...
It has MultiIndex columns with names=['Name', 'Col'] and hierarchical levels. The Name label goes from 0 to n, and for each label there are two columns, A and B.
I would like to subselect all the A (or B) columns of this DataFrame.

There is a get_level_values method that you can use in conjunction with boolean indexing to get the intended result.
In [13]:
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
print(df)
          1                   2
          A         B         A         B
0  0.543980  0.628078  0.756941  0.698824
1  0.633005  0.089604  0.198510  0.783556
2  0.662391  0.541182  0.544060  0.059381
3  0.841242  0.634603  0.815334  0.848120
In [14]:
print(df.iloc[:, df.columns.get_level_values(1) == 'A'])
          1         2
          A         A
0  0.543980  0.756941
1  0.633005  0.198510
2  0.662391  0.544060
3  0.841242  0.815334

Method 1:
df.xs('A', level='Col', axis=1)
For more, refer to http://pandas.pydata.org/pandas-docs/stable/advanced.html#cross-section
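Note that level='Col' relies on the level name from the question's frame; on the unnamed toy frame from In [13] you would refer to the level by position instead. A minimal sketch:
# cross-section on the inner column level, selected by position
df.xs('A', level=1, axis=1)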
Method 2:
df.loc[:, (slice(None), 'A')]
Caveat: this method requires the column labels to be sorted (lexsorted). For more, refer to http://pandas.pydata.org/pandas-docs/stable/advanced.html#the-need-for-sortedness-with-multiindex
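If the labels are not sorted, this slicing can raise an UnsortedIndexError. A minimal sketch of the workaround, assuming the frame from In [13]:
# lexsort the columns once, then slice freely
df = df.sort_index(axis=1)
df.loc[:, (slice(None), 'A')]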

EDIT
The best way now is to use pd.IndexSlice for multi-index selections:
idx = pd.IndexSlice
A = df.loc[:, idx[:, 'A']]
B = df.loc[:, idx[:, 'B']]
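Put together, a self-contained sketch of the IndexSlice approach on the toy frame from In [13] (for illustration, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])

idx = pd.IndexSlice
A = df.loc[:, idx[:, 'A']]   # every 'A' column, MultiIndex preserved
print(A.columns.tolist())    # [(1, 'A'), (2, 'A')]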

Related

Select pandas dataframe columns based on which names contain strings in list

I have a dataframe, df, and a list of strings, cols_needed, which indicate the columns I want to retain in df. The column names in df do not exactly match the strings in cols_needed, so I cannot directly use something like intersection. But the column names do contain the strings in cols_needed. I tried playing around with str.contains but couldn't get it to work. How can I subset df based on cols_needed?
import pandas as pd
df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8]
})
cols_needed = ['prod1', 'prod2']
# What I want to obtain:
   sim-prod1  sim-prod2
0          1          3
1          2          4
With the regex option of filter:
df.filter(regex='|'.join(cols_needed))
   sim-prod1  sim-prod2
0          1          3
1          2          4
You can explore str.contains with a joined pattern, for example:
df.loc[:, df.columns.str.contains('|'.join(cols_needed))]
Output:
   sim-prod1  sim-prod2
0          1          3
1          2          4
A list comprehension could work as well:
columns = [cols for cols in df
           for col in cols_needed
           if col in cols]
# ['sim-prod1', 'sim-prod2']
In [110]: df.loc[:, columns]
Out[110]:
   sim-prod1  sim-prod2
0          1          3
1          2          4
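One caveat that applies to the filter and str.contains answers above: '|'.join(cols_needed) is interpreted as a regular expression, so entries containing regex metacharacters could match unintended columns. A hedged sketch of a safer variant:
import re

# escape each entry so it is matched literally
pattern = '|'.join(re.escape(c) for c in cols_needed)
df.filter(regex=pattern)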

How to select columns which contain non-duplicate from a pandas data frame

I want to select the columns that contain non-duplicated values from a pandas DataFrame and use these columns to make a subset DataFrame. For example, I have a DataFrame like this:
   x  y  z
a  1  2  3
b  1  2  2
c  1  2  3
d  4  2  3
The columns "x" and "z" have non-duplicate values, so I want to pick them out and create a new data frame like:
   x  z
a  1  3
b  1  2
c  1  3
d  4  3
This can be realized by the following code:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [1, 2, 2], [1, 2, 3], [4, 2, 3]],
                  index=['a', 'b', 'c', 'd'], columns=['x', 'y', 'z'])
df0 = pd.DataFrame()
for i in range(df.shape[1]):
    if df.iloc[:, i].nunique() > 1:
        df1 = df.iloc[:, i]
        df0 = pd.concat([df0, df1], axis=1, sort=False)
However, there must be simpler and more direct methods. What are they?
Best regards
df[df.columns[(df.nunique()!=1).values]]
Maybe you can try this one-liner.
Apply nunique, then remove columns where nunique is 1:
nunique = df.apply(pd.Series.nunique)
cols_to_drop = nunique[nunique == 1].index
df = df.drop(cols_to_drop, axis=1)
df = df[df.columns[df.nunique() > 1]]
Columns in which every value repeats give nunique() == 1; all other columns give more than 1.
df.columns[df.nunique() > 1] will give all the column names which fulfill the purpose.
A simple one-liner:
df0 = df.loc[:, (df.max() - df.min()) != 0]
or, even better:
df0 = df.loc[:, df.max() != df.min()]
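A hedged caveat: the max()/min() comparison assumes columns whose values can be ordered (e.g. numeric) and no NaNs, while the nunique() variants above also handle object columns. A quick runnable check on the question's frame:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 2], [1, 2, 3], [4, 2, 3]],
                  index=list('abcd'), columns=['x', 'y', 'z'])
print(df.loc[:, df.nunique() > 1])  # keeps 'x' and 'z'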

Apply function using multiple Pandas columns? [duplicate]

This question already has answers here:
How to apply a function to two columns of Pandas dataframe
(15 answers)
Closed 4 years ago.
I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.
df = pd.DataFrame(
    [['a',], ['a', 1], ['a', 1], ['a', 2], ['b', 2], ['b', 2], ['c', 3]],
    columns=['letter', 'number']
)
  letter  number
0      a     NaN
1      a     1.0
2      a     1.0
3      a     2.0
4      b     2.0
5      b     2.0
6      c     3.0
Note that the first row, ['a',], has no number, hence the NaN.
I need a third column that is 1 if 'a' and 2 are present in the row, and 0 otherwise. So it would be `[0, 0, 0, 1, 0, 0, 0]`.
How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
You can use apply with axis=1. Suppose you wanted to call your new column c:
df['c'] = df.apply(
    lambda row: (row['letter'] == 'a') and (row['number'] == 2),
    axis=1
).astype(int)
print(df)
#   letter  number  c
# 0      a     NaN  0
# 1      a     1.0  0
# 2      a     1.0  0
# 3      a     2.0  1
# 4      b     2.0  0
# 5      b     2.0  0
# 6      c     3.0  0
But apply is slow and should be avoided if possible. In this case, it would be much better to use boolean logic operations, which are vectorized.
df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)
This has the same result as using apply above.
You can try pd.Series.where() / np.where(). If you are only interested in the int representation of the boolean values, you can pick the previous solution. If you want more freedom over the if/else values, you can use np.where():
import pandas as pd
import numpy as np
# create example
values = ['a', 'b', 'c']
df = pd.DataFrame()
df['letter'] = np.random.choice(values, size=10)
df['number'] = np.random.randint(1,3, size=10)
# condition
df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
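The answer mentions pd.Series.where() but only demonstrates np.where(); a hedged sketch of a pandas-only equivalent, using the same df and condition as above:
mask = (df['letter'] == 'a') & (df['number'] == 2)
# start from the "if" value and fall back to the "else" value where the mask is False
df['result'] = pd.Series(1, index=df.index).where(mask, 0)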

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)

In [49]: df
Out[49]:
   a  b
0  1  2
1  4  3
2  8  9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
   a  b
0  1  2
1  4  3
2  8  9
PS: please be aware that the most idiomatic pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Here the list is passed as the first positional argument, items.
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
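The regex variant mentioned above can also anchor the match (a small sketch, assuming the ['a_1','a_2','b_1','b_2'] columns):
# keep only the columns whose names start with 'b_'
df = df.filter(regex='^b_')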
There are multiple solutions:
df = df[['a','b']]  # 1
df = df[list('ab')]  # 2
df = df.loc[:, df.columns.isin(['a','b'])]  # 3
df = pd.DataFrame(data=df.eval('a,b').T, columns=['a','b'])  # 4 (I do not recommend this method, but it is still a way to achieve this)
What you are looking for is:
df = df[["a","b"]]
You will receive a DataFrame which only contains the columns a and b.
If you want to keep more columns than you are dropping, put a "~" before the .isin statement to select every column except the ones you want to drop:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to keep, let's say 20 or 30, you can put them in a list and drop everything else. Make sure that you also specify the axis value.
keep_list = ["a", "b"]
df = df.drop(df.columns.difference(keep_list), axis=1)

How to remove duplicate columns from a dataframe using python pandas

By grouping two columns I made some changes.
I generated a file using Python, and it resulted in 2 duplicate columns. How do I remove duplicate columns from a DataFrame?
It's probably easiest to use a groupby (assuming they have duplicate names too):
In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4
If they have different names you can drop_duplicates on the transpose:
In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
Usually read_csv will ensure they have different names...
Transposing is a bad idea when working with large DataFrames. See this answer for a memory-efficient alternative: https://stackoverflow.com/a/32961145/759442
This is the best I found so far.
import numpy as np

remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)
https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
It's already answered here: python pandas remove duplicate columns.
The idea is that df.columns.duplicated() generates a boolean vector where each value says whether the column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Let's take the inversion of it and call it column_selector.
Using the above vector and the loc method of df, which helps in selecting rows and columns, we can remove the duplicate columns. With df.loc[:, column_selector] we select the columns to keep.
column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
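A minimal end-to-end sketch of this approach (the toy frame here is made up for illustration):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["Col1", "Col2", "Col1"])
print(df.columns.duplicated())            # [False False  True]
df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())                # ['Col1', 'Col2']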
I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped it. So provided you know the index of the column (which you could probably find via debugging or print statements), this will work:
df = df.drop(df.columns[i], axis=1)
A fast solution for datasets without NaNs:
share = 0.05  # fraction of rows to sample
dfx = df.sample(int(df.shape[0] * share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]
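Since a small sample can miss the rows where two columns actually differ, a hedged refinement is to confirm each discarded column against the full frame before dropping it (a sketch that replaces the last line above):
# confirm on the full data: restore any column that is not a true duplicate
keep = list(dfx.columns)
for c in df.columns.difference(keep):
    if not any(df[c].equals(df[k]) for k in keep):
        keep.append(c)
df = df[keep]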
