Subsetting DataFrame based on column names of another DataFrame

Subsetting DataFrame based on column names of another DataFrame - python

I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?

If you need a true intersection, since .columns yields an Index object which supports basic set operations, you can use &, e.g.
df2[df1.columns & df2.columns]
or equivalently with Index.intersection
df2[df1.columns.intersection(df2.columns)]
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns & df2.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343

The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless index's, but carefully curated order and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row-indices "a/b", and "c/d" is irrelevent, but what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution

This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding a the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
First
Second
a
1
2
b
3
4

The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4

ignore_index means whether to keep the output dataframe index from original along axis. If it is True, it means don't use original index but start from 0 to n just like what the column header 0, 1 shown in your result.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4

How to remove duplication of columns names using Pandas Merge function

When we merge two dataframes using pandas merge function, is it possible to ensure the key(s) based on which the two dataframes are merged is not repeated twice in the result? For e.g., I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column/header names are different, the values of both the columns are same. In, the eventual result though, I get to see both 'isin_code' column and 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1,df2[['isin','issue_date']],how='left',left_on='isin_code',right_on = 'isin')

Either rename the columns to match before merge to uniform the column names and specify only on:
result = pd.merge(
df1,
df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
on='isin_code',
how='left'
)
OR drop the duplicate column after merge:
result = pd.merge(
df1,
df2[['isin', 'issue_date']],
how='left',
left_on='isin_code',
right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
isin_code a
0 1 4
1 2 5
2 3 6
df2:
isin issue_date
0 1 2021-01-02
1 3 2021-03-04
result:
isin_code a issue_date
0 1 4 2021-01-02
1 2 5 NaN
2 3 6 2021-03-04

Python: sample from dataframe, storing the non-sampled

I have a pandas DataFrame.
Say I want to sample two persons of each group, I use the following code to get a new dataframe:
sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234)
I would like to create a dataframe where the non-sampled persons are stored.
The sample_df stil has the indices of the original df so I probably have to do something with that, but I'm not sure what...
Thanks in advance!

First add group_keys=False to groupby for avoid category to MultiIndex:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'category':list('aaabbb')
})
sample_df = (df.groupby("category", group_keys=False)
.apply(lambda group_df: group_df.sample(2, random_state=1234)))
print(sample_df)
A B category
0 a 4 a
1 b 5 a
3 d 5 b
4 e 5 b
So possible filter original index values with boolean indexing by Index.isin and inverted mask by ~:
non_sample_df = df[~df.index.isin(sample_df.index)]
print(non_sample_df)
A B category
2 c 4 a
5 f 4 b

Accessing an Non Numerical Index in a DataFrame [duplicate]

I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how to select a column by integer?
My dataframe:
df=pandas.DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})

Two approaches that come to mind:
>>> df
A B C D
0 0.424634 1.716633 0.282734 2.086944
1 -1.325816 2.056277 2.583704 -0.776403
2 1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025 1.325853 -2.513373
4 1.366180 -1.265185 -2.184617 0.881514
>>> df.iloc[:, 2]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
>>> df[df.columns[2]]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].

You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position

You could use label based using .loc or index based using .iloc method to do column-slicing including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
a b c d
0 0.806811 0.187630 0.978159 0.317261
1 0.738792 0.862661 0.580592 0.010177
2 0.224633 0.342579 0.214512 0.375147
3 0.875262 0.151867 0.071244 0.893735
In [54]: df.loc[:, ["a", "b", "d"]] ### Selective columns based slicing
Out[54]:
a b d
0 0.806811 0.187630 0.317261
1 0.738792 0.862661 0.010177
2 0.224633 0.342579 0.375147
3 0.875262 0.151867 0.893735
In [55]: df.loc[:, "a":"c"] ### Selective label based column ranges slicing
Out[55]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
In [56]: df.iloc[:, 0:3] ### Selective index based column ranges slicing
Out[56]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244

You can access multiple columns by passing a list of column indices to dataFrame.ix.
For example:
>>> df = pandas.DataFrame({
'a': np.random.rand(5),
'b': np.random.rand(5),
'c': np.random.rand(5),
'd': np.random.rand(5)
})
>>> df
a b c d
0 0.705718 0.414073 0.007040 0.889579
1 0.198005 0.520747 0.827818 0.366271
2 0.974552 0.667484 0.056246 0.524306
3 0.512126 0.775926 0.837896 0.955200
4 0.793203 0.686405 0.401596 0.544421
>>> df.ix[:,[1,3]]
b d
0 0.414073 0.889579
1 0.520747 0.366271
2 0.667484 0.524306
3 0.775926 0.955200
4 0.686405 0.544421

The method .transpose() converts columns to rows and rows to column, hence you could even write
df.transpose().ix[3]

Most of the people have answered how to take columns starting from an index. But there might be some scenarios where you need to pick columns from in-between or specific index, where you can use the below solution.
Say that you have columns A,B and C. If you need to select only column A and C you can use the below code.
df = df.iloc[:, [0,2]]
where 0,2 specifies that you need to select only 1st and 3rd column.

You can use the method take. For example, to select first and last columns:
df.take([0, -1], axis=1)

Mean of repeated columns in pandas dataframe

I have a dataframe with repeated column names which account for repeated measurements.
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df3 = pd.concat([df,df2], axis=1)
df3
A B A B
0 -0.875884 -0.298203 0.877414 1.282025
1 1.605602 -0.127038 -0.286237 0.572269
2 1.349540 -0.067487 0.126440 1.063988
3 -0.142809 1.282968 0.941925 -1.593592
4 -0.630353 1.888605 -1.176436 -1.623352
I'd like to take the mean of cols 'A's and 'B's such that the dataframe shrinks to
A B
0 0.000765 0.491911
1 0.659682 0.222616
2 0.737990 0.498251
3 0.399558 -0.155312
4 -0.903395 0.132627
If I do the typical
df3['A'].mean(axis=1)
I get a Series (with no column name) and I should then build a new dataframe with the means of each col group. Also the .groupby() method apparently doesn't allow you to group by column name, but rather you give the columns and it sorts the indexes. Is there a fancy way to do this?
Side question: why does
df = pd.DataFrame({'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)})
not generate a 4-column dataframe but merges same-name cols?

You can use the level keyword (regarding your columns as the first level (level 0) of the index with only one level in this case):
In [11]: df3
Out[11]:
A B A B
0 -0.367326 -0.422332 2.379907 1.502237
1 -1.060848 0.083976 0.619213 -0.303383
2 0.805418 -0.109793 0.257343 0.186462
3 2.419282 -0.452402 0.702167 0.216165
4 -0.464248 -0.980507 0.823302 0.900429
In [12]: df3.mean(axis=1, level=0)
Out[12]:
A B
0 1.006291 0.539952
1 -0.220818 -0.109704
2 0.531380 0.038334
3 1.560725 -0.118118
4 0.179527 -0.040039

You've created df3 in a strange way for this simple case the following would work:
In [86]:
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
print(df)
print(df2)
A B
0 -0.732807 -0.571942
1 -1.546377 -1.586371
2 0.638258 0.569980
3 -1.017427 1.395300
4 0.666853 -0.258473
[5 rows x 2 columns]
A B
0 0.589185 1.029062
1 -1.447809 -0.616584
2 -0.506545 0.432412
3 -1.168424 0.312796
4 1.390517 1.074129
[5 rows x 2 columns]
In [87]:
(df+df2)/2
Out[87]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
to answer your side question, this is nothing to do with Pandas and more to do with the dict constructor:
In [88]:
{'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)}
Out[88]:
{'B': array([-0.03087831, -0.24416885, -2.29924624, 0.68849978, 0.41938536]),
'A': array([ 2.18471335, 0.68051101, -0.35759988, 0.54023489, 0.49029071])}
dict keys must be unique so my guess is that in the constructor it just reassigns the values to the pre-existing keys
EDIT
If you insist on having duplicate columns then you have to create a new dataframe from this because if you were to update the columns 'A' and 'B', the mean will be duplicated still as the columns are repeated:
In [92]:
df3 = pd.concat([df,df2], axis=1)
new_df = pd.DataFrame()
new_df['A'], new_df['B'] = df3['A'].sum(axis=1)/df3['A'].shape[1], df3['B'].sum(axis=1)/df3['B'].shape[1]
new_df
Out[92]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
So the above would work with df3 and in fact for an arbritary numer of repeated columns which is why I am using shape, you could hard code this to 2 if you new the columns were only ever duplicated once

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Subsetting DataFrame based on column names of another DataFrame - python

The equivalent would be: df2[df1.columns.intersection(df2.columns)] Out: a b 0 -0.019703 0.379820 1 0.040658 0.243309 2 1.103032 0.066454 3 -0.921378 1.016017 4 0.188666 -0.626612 With this, you will not get a KeyError if a column in df1 does not exist in df2.

Related

pandas: how to merge columns irrespective of index

How to remove duplication of columns names using Pandas Merge function

Python: sample from dataframe, storing the non-sampled

Accessing an Non Numerical Index in a DataFrame [duplicate]

Mean of repeated columns in pandas dataframe

Categories

Resources