pandas: Merge two columns with different names? - python

I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.
The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do
df_total = pandas.concat ([df0, df1])
the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.
Can I somehow merge the two columns into one? I would like to have a function that does something like:
df_total = pandas.merge_many_columns(input=["ObjectType", "ObjectClass"], output=["MyObjectClasses"])
which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?
(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)

I think you can rename the columns first so the data aligns in both DataFrames:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df1)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}
df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
B C MyObjectClasses
0 4 7 1
1 5 8 2
2 6 9 3
3 4 7 1
4 5 8 2
5 6 9 3
EDIT:
Simpler is update (it works in place):
df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
B C ObjectClass ObjectType
0 4 7 NaN 1.0
1 5 8 NaN 2.0
2 6 9 NaN 3.0
0 4 7 1.0 1.0
1 5 8 2.0 2.0
2 6 9 3.0 3.0
Or use fillna, but then you need to drop the original column:
df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
B C ObjectType
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
B C MyObjectClasses
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
EDIT1:
Timings:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop
In [240]: %%timeit
...: df = pd.concat([df0, df1])
...: df['ObjectType'].update(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.18 ms per loop
In [242]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.21 ms per loop
In [243]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.28 ms per loop

You can merge two columns separated by NaNs into one using combine_first:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
>>> df1 = pd.DataFrame({'ObjectClass':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']
0 1.0
1 2.0
2 3.0
0 1.0
1 2.0
2 3.0
Name: ObjectType, dtype: float64
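The question also asks for a single call that merges several columns and lets the caller decide what happens on a collision. pandas has no merge_many_columns function, so the following is only a sketch of how one might be written: the helper name merge_columns and its resolve parameter are assumptions made for illustration, not pandas API. It uses bfill along the row axis for the default "first non-null value" behaviour and a row-wise apply when a collision policy is supplied.

```python
import numpy as np
import pandas as pd


def merge_columns(df, inputs, output, resolve=None):
    """Hypothetical helper: collapse the `inputs` columns into one `output` column."""
    block = df[inputs]
    if resolve is None:
        # Default: take the first non-null value in each row, left to right.
        merged = block.bfill(axis=1).iloc[:, 0]
    else:
        # Collision policy: hand the non-null values of each row to `resolve`.
        merged = block.apply(
            lambda row: resolve(row.dropna()) if row.notna().any() else np.nan,
            axis=1,
        )
    result = df.drop(columns=inputs)
    result[output] = merged
    return result


df0 = pd.DataFrame({'ObjectType': [1, 2, 3], 'B': [4, 5, 6]})
df1 = pd.DataFrame({'ObjectClass': [1, 2, 3], 'B': [4, 5, 6]})
df_total = merge_columns(
    pd.concat([df0, df1], ignore_index=True),
    inputs=['ObjectType', 'ObjectClass'],
    output='MyObjectClasses',
)

# A frame where both columns hold a value in the first row (a collision).
clash = pd.DataFrame({'ObjectType': [1.0, np.nan], 'ObjectClass': [4.0, 2.0]})
largest = merge_columns(clash, ['ObjectType', 'ObjectClass'], 'Merged', resolve=max)
```

Passing resolve=max keeps the largest value on a collision; a function such as lambda values: values.mean() would average instead. The row-wise apply is slow on large frames, so the default path is preferable when collisions cannot occur.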

Related

Pandas: correct way to use apply() here? [duplicate]

I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
and I'll define a function that accesses elements of a given row
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
I can apply it like so:
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can't just access row.index.
I know I could create a temporary column in the table where I store the index, but I'm wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really all you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn't as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Either:
1. with row.name inside the apply(..., axis=1) call:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])
a b c
x 1 2 3
y 4 5 6
df.apply(lambda row: row.name, axis=1)
x x
y y
2. with iterrows() (slower)
DataFrame.iterrows() allows you to iterate over rows, and access their index:
for idx, row in df.iterrows():
    ...
To answer the original question: yes, you can access the index value of a row in apply(). It is available as the row's name attribute and requires that you specify axis=1 (because the lambda then processes the columns of a row and not the rows of a column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
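As a minimal sketch of the two routes discussed in this thread, the example below computes the same column once with row.name inside apply(..., axis=1) and once with the vectorised df.index. The index labels 10 and 20 and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index=[10, 20])

# Row-wise: row.name is the index label of the row being processed.
via_apply = df.apply(lambda row: row.name + row['a'] + row['b'] * row['c'], axis=1)

# Vectorised: the same labels are available for all rows at once as df.index.
vectorised = df.index.to_numpy() + df['a'] + df['b'] * df['c']
```

Both give 17 for the first row (10 + 1 + 2 * 3) and 54 for the second (20 + 4 + 5 * 6); the vectorised form avoids calling a Python function once per row.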

Fill NA values by a two levels indexed Series

I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?
I think you need fillna with set_index and reset_index:
df = pd.DataFrame({'A': [1,1,3],
'B': [2,3,4],
'value':[2,np.nan,np.nan] })
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1 2 5
3 6
4 0
3 2 8
3 9
4 7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
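Note that the set_index approach returns only the A, B and value columns and replaces the original row index. If the frame has other columns or a meaningful index, one possible alternative (a sketch, not the answerer's method) is to look up each row's (A, B) pair in the series with reindex and fill from that. The row labels r0 to r2 and the names keys and lookup are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [2, 3, 4], 'value': [2, np.nan, np.nan]},
                  index=['r0', 'r1', 'r2'])
idx = pd.MultiIndex.from_product([[1, 3], [2, 3, 4]])
s = pd.Series([5, 6, 0, 8, 9, 7], index=idx)

# Look up each row's (A, B) pair in s; pairs missing from s would give NaN.
keys = pd.MultiIndex.from_frame(df[['A', 'B']])
lookup = pd.Series(s.reindex(keys).to_numpy(), index=df.index)

# Fill only the missing entries; the original index and columns are untouched.
df['value'] = df['value'].fillna(lookup)
```

This relies on s having a unique index, which reindex requires.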
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
A=list('aaabbbccc'),
B=list('xyzxyzxyz'),
value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
# s.rename('value'): give the series the same name as the column we are filling
# lsuffix='_': get the old column out of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but @jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop

How to extract rows in a pandas dataframe NOT in a subset dataframe

I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct,
i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF,
but I get rows with NaN values that do not exist in the original DF
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN
You need merge with an outer join and boolean indexing, because DataFrame.isin requires both the values and the index to match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
# no row is removed: isin compares by index label, and the SubDF row has label 0, not 2
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
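The merge above builds a fresh 0..n-1 index for the result. If the original row labels of DF matter, one way to keep them is sketched below: a left merge is used only to build a boolean mask, which is then applied to DF itself. The helper name rows_not_in and the labels x, y, z are hypothetical, and the sketch assumes both frames have the same columns.

```python
import pandas as pd


def rows_not_in(df, sub):
    """Hypothetical helper: rows of df with no identical row in sub."""
    # Deduplicate sub so the left merge yields exactly one row per row of df.
    marked = df.merge(sub.drop_duplicates(), how='left', indicator=True)
    # A left merge keeps df's row order, so the mask lines up positionally.
    mask = (marked['_merge'] == 'left_only').to_numpy()
    return df[mask]


DF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['x', 'y', 'z'])
SubDF = pd.DataFrame({'A': [3], 'B': [6], 'C': [9]})
DF2 = rows_not_in(DF, SubDF)
```

Here DF2 keeps the rows labelled x and y, with their original labels.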
Another way, borrowing the setup from @jezrael:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
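Note that this compares index labels only, so it works when sub kept the labels it had in df (for example, because it was sliced from df). With the setup above, sub was built separately and has label 0, so row 0 is removed rather than the matching row 2. The rows may also not come back in the original df order. If matching order is required: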
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
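When sub really does share index labels with df, a shorter route that also preserves row order is a boolean mask built with Index.isin. This is a sketch under that assumption; the frame below is a reduced version of the setup, with labels x, y, z chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
sub = df.loc[['z']]  # assumption: sub was sliced from df, so it shares df's labels

# Boolean mask over df's own index: row order is preserved automatically.
df_extract = df[~df.index.isin(sub.index)]
```

No sorting step is needed because the mask is evaluated in df's own order.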

Split a pandas dataframe into two by columns

I have a dataframe and I want to split it into two dataframes, one that has all the columns beginning with foo and one with the rest of the columns.
Is there a quick way of doing this?
You can use list comprehensions to select the column names:
df = pd.DataFrame({'fooA':[1,2,3],
'fooB':[4,5,6],
'fooC':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
D E F fooA fooB fooC
0 1 5 7 1 4 7
1 3 3 4 2 5 8
2 5 6 3 3 6 9
foo = [col for col in df.columns if col.startswith('foo')]
print (foo)
['fooA', 'fooB', 'fooC']
other = [col for col in df.columns if not col.startswith('foo')]
print (other)
['D', 'E', 'F']
print (df[foo])
fooA fooB fooC
0 1 4 7
1 2 5 8
2 3 6 9
print (df[other])
D E F
0 1 5 7
1 3 3 4
2 5 6 3
Another solution with filter and difference:
df1 = df.filter(regex='^foo')
print (df1)
fooA fooB fooC
0 1 4 7
1 2 5 8
2 3 6 9
print (df.columns.difference(df1.columns))
Index(['D', 'E', 'F'], dtype='object')
print (df[df.columns.difference(df1.columns)])
D E F
0 1 5 7
1 3 3 4
2 5 6 3
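One detail worth knowing about the difference approach: Index.difference returns the remaining names sorted, so the second frame may not keep the original column order. The sketch below, with column names made up for illustration, contrasts it with a boolean mask, which keeps the order.

```python
import pandas as pd

df = pd.DataFrame({'zeta': [1], 'fooA': [2], 'alpha': [3], 'fooB': [4]})

# Boolean mask over the column names; ~ inverts it for the remaining columns.
is_foo = df.columns.str.startswith('foo')
foo_part = df.loc[:, is_foo]
rest_by_mask = df.loc[:, ~is_foo]

# difference() sorts its result, so 'alpha' comes before 'zeta' here.
rest_by_difference = df[df.columns.difference(foo_part.columns)]
```

rest_by_mask has columns zeta, alpha (original order), while rest_by_difference has alpha, zeta.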
Timings:
In [123]: %timeit a(df)
1000 loops, best of 3: 1.06 ms per loop
In [124]: %timeit b(df3)
1000 loops, best of 3: 1.04 ms per loop
In [125]: %timeit c(df4)
1000 loops, best of 3: 1.41 ms per loop
df3 = df.copy()
df4 = df.copy()
def a(df):
    df1 = df.filter(regex='^foo')
    df2 = df[df.columns.difference(df1.columns)]
    return df1, df2
def b(df):
    df1 = df[[col for col in df.columns if col.startswith('foo')]]
    df2 = df[[col for col in df.columns if not col.startswith('foo')]]
    return df1, df2
def c(df):
    df1 = df[df.columns[df.columns.str.startswith('foo')]]
    df2 = df[df.columns[~df.columns.str.startswith('foo')]]
    return df1, df2
df1, df2 = a(df)
print (df1)
print (df2)
df1, df2 = b(df3)
print (df1)
print (df2)
df1, df2 = c(df4)
print (df1)
print (df2)

Replace certain column with `filter(like = "")` in Pandas

Sometimes I want to modify some columns of a dataframe and write the result back.
For example, a dataframe df has 6 columns like this:
A, B1, B2, B3, C, D
I want to transform the values in columns (B1, B2, B3) into (B1*A, B2*A, B3*A).
Unlike a loop, which is slow, df.filter(like = 'B') speeds this up a lot.
df.filter(like = "B").mul(df.A, axis = 0) produces the right answer, but I can't assign it back to the B-like columns in df using:
df.filter(like = "B") = df.filter(like = "B").mul(df.A, axis = 0)
How can I achieve this? I know that using pd.concat to create a new dataframe would work, but with a huge number of columns that may be inefficient. What I want is to assign new values to the columns that already exist.
Any advice would be appreciated!
Use str.contains with boolean indexing:
cols = df.columns[df.columns.str.contains('B')]
df[cols] = df[cols].mul(df.A, axis = 0)
Sample:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'C':[5,3,6],
'D':[7,4,3]})
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
cols = df.columns[df.columns.str.contains('B')]
print (cols)
Index(['B1', 'B2', 'B3'], dtype='object')
df[cols] = df[cols].mul(df.A, axis = 0)
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 10 16 6 3 4
2 3 18 27 15 6 3
Timings:
len(df)=3:
In [17]: %timeit (a(df))
1000 loops, best of 3: 1.36 ms per loop
In [18]: %timeit (b(df1))
100 loops, best of 3: 2.39 ms per loop
len(df)=30k:
In [14]: %timeit (a(df))
100 loops, best of 3: 2.89 ms per loop
In [15]: %timeit (b(df1))
100 loops, best of 3: 4.71 ms per loop
Code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'C':[5,3,6],
'D':[7,4,3]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
def a(df):
    cols = df.columns[df.columns.str.contains('B')]
    df[cols] = df[cols].mul(df.A, axis = 0)
    return (df)
def b(df):
    df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
    return (df)
print (a(df))
print (b(df1))
You have almost done it:
In [136]: df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
In [137]: df
Out[137]:
A B1 B2 B3 B4 F
0 1 4 7 1 5 7
1 2 10 16 6 6 4
2 3 18 27 15 18 3
