Midway between combine_first and merge? - python

I have a master dataframe A
id A B C
0 a b c
1 d e f
2 g h i
3 j k l
and a newer dataframe B
id A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 NaN
4 l2 m2 NaN
5 n2 o2 NaN
If I merge them, I get suffixed duplicate columns like A_x, A_y, B_x and B_y. If I use combine_first, I end up with rows 4 and 5 and column D, which I'm not interested in. Besides doing something like
on_a = B["id"].isin(A["id"])  # rows of B whose id also appears in A
B = B.loc[on_a, ["A", "B"]]
A = B.combine_first(A)
is there a way to overwrite B onto A, ignoring everything that isn't in A? The desired output is:
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l

If the indices are the same, this is a simple update:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": ["a", "d", "g", "j"], "B": ["b", "e", "h", "k"], "C": ["c", "f", "i", "l"]})
>>> df1
A B C
0 a b c
1 d e f
2 g h i
3 j k l
>>> df2 = pd.DataFrame({"A": ["a2", "d2", "g2", "j2", "l2", "n2"], "B": ["b2", "e2", "h2", "k2", "m2", "o2"], "D": ["x", "y", "z", None, None, None]})
>>> df2
A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 None
4 l2 m2 None
5 n2 o2 None
>>> df1.update(df2)
>>> df1
A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
If you don't want to mutate the first dataframe, you can make a copy first.
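For example, a minimal sketch using the df1 and df2 built above:
>>> df3 = df1.copy()  # keep df1 intact; update the copy instead
>>> df3.update(df2)
>>> df3
    A   B  C
0  a2  b2  c
1  d2  e2  f
2  g2  h2  i
3  j2  k2  l
Note that update aligns on the index and only touches columns that already exist in the caller, so the extra rows 4-5 and column D are ignored automatically. If the key lives in an id column rather than the index, call set_index('id') on both frames first.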

You could use join and then clean up the dataframe as desired. One way to do this dynamically is to tag the duplicated columns with a 'drop' suffix, then define the subset passed to dropna as the columns whose names contain 'drop'.
df = B.join(A, lsuffix='', rsuffix='drop')
df = df.dropna(subset=[col for col in df.columns if 'drop' in col])
df = df[A.columns]
df
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l

Related

How to flat in one row Dataframe and concatenate them

I have several Dataframes with the same structure :
       0       1
0  TITLE  TITLE1
1      A      A1
2      B      B1
3      C      C1
4      D      D1
5      E      E1

       0       1
0  TITLE  TITLE2
1      A      A2
2      B      B2
3      C      C2
4      D      D2
5      E      E2
My goal is to have:
TITLE    A   B   C   D   E
TITLE1  A1  B1  C1  D1  E1
TITLE2  A2  B2  C2  D2  E2
How can I transform my Dataframes to flatten them and concatenate them like this?
Use DataFrame.set_index with transpose and then concat:
df11 = df1.set_index(0).T
df22 = df2.set_index(0).T
df = pd.concat([df11,df22]).set_index('TITLE')
print (df)
0 A B C D E
TITLE
TITLE1 A1 B1 C1 D1 E1
TITLE2 A2 B2 C2 D2 E2
Or transpose after concat with axis=1:
df11 = df1.set_index(0)
df22 = df2.set_index(0)
df = pd.concat([df11,df22], axis=1).T.set_index('TITLE')
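For reference, a minimal self-contained version of the first approach, assuming the input frames are built from the tables above:
import pandas as pd

# rebuild the two inputs shown above (column labels 0 and 1)
df1 = pd.DataFrame({0: ['TITLE', 'A', 'B', 'C', 'D', 'E'],
                    1: ['TITLE1', 'A1', 'B1', 'C1', 'D1', 'E1']})
df2 = pd.DataFrame({0: ['TITLE', 'A', 'B', 'C', 'D', 'E'],
                    1: ['TITLE2', 'A2', 'B2', 'C2', 'D2', 'E2']})

# make the label column the index, transpose each frame into a single row, then stack them
df = pd.concat([df1.set_index(0).T, df2.set_index(0).T]).set_index('TITLE')
print (df)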

How to shift a dataframe element-wise to fill NaNs?

I have a DataFrame like this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E', np.nan, np.nan, 'F']})
>>> df
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill each NaN with the value of the previous column in the next row, then drop that next row. In other words, I want to combine the two rows containing NaNs into a single row without NaNs, like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>") but this didn't give me the expected output.
I haven't found any other question about this problem. The DataFrame is actually built from a list of DataFrames with concat(), as you may notice from the indexes; I mention this because it may be easier to do this on the individual frames than on the combined one.
I have found some suggestions to use shift and combine_first, but none of them worked for me. You may try these too.
I also found a whole article about filling NaN values, but it doesn't cover a problem like mine.
OK, I misunderstood what you wanted to do the first time. The dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E', np.nan, np.nan, 'F']})
>>> df
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, this operation does not exist in pandas, so we will use numpy to do the work.
First transform the dataframe to a numpy array and flatten it to one dimension. Then drop the NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and reshape the array back to the original number of columns before transforming it back to a dataframe:
# flatten to 1-D, drop the NaNs, then restore the original column count
array = df.to_numpy().flatten()
pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
It also works for more complex examples, as long as the NaN pattern is the same across the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
In case the NaN columns do not share the same pattern, such as:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)

(df.groupby(np.repeat(np.arange(df.shape[1] / 2), 2), axis=1)
   .apply(elementwise_shift)
)
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
You can do this in two steps with a placeholder column. First, fill all the NaNs in column b with the a value from the next row. Then apply the filtering: in this example I use ffill with a limit of 1 to drop every NaN row after the first; there's probably a better method.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1, 2, 3, 3, 4], "b": [1, 2, np.nan, np.nan, 4]})
# fill the NaNs in b with the a value from the next row
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
# keep only the first row of each NaN run, then swap in the helper column
df = df[df['b'].ffill(limit=1).notna()].copy()  # .copy() to avoid working on a view
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
#    a  b
# 0  1  1
# 1  2  2
# 2  3  3
# 4  4  4

How to convert rows to columns in pandas?

I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Use set_index with floor division and modulo by 3, then unstack and flatten the MultiIndex:
import numpy as np

a = np.arange(len(df))
#if default index
#a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
#python 3.6+ solution
df1.columns = [f'{i}{j + 1}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}{}'.format(i, j + 1) for i, j in df1.columns]
print (df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
I'm adding a different approach with group -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
def munge(group):
    g = group.T.stack()
    g.index = ['{}{}'.format(c, i + 1) for i, (c, _) in enumerate(g.index)]
    return g
result = df.groupby(df.index//3).apply(munge)
Output:
>>> df.groupby(df.index//3).apply(munge)
a1 a2 a3 b4 b5 b6
0 1 2 3 a b c
1 11 12 13 aa bb cc
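Note that the columns come out as b4 b5 b6 rather than b1 b2 b3, because enumerate counts across the whole stacked group. If you want the numbering to restart for each source column, as in the expected output, here is a sketch with a hypothetical variant of munge:
def munge(group):
    g = group.T.stack()
    # restart the counter per original column: a1-a3, then b1-b3
    g.index = ['{}{}'.format(c, i % len(group) + 1) for i, (c, _) in enumerate(g.index)]
    return g

result = df.groupby(df.index // 3).apply(munge)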

transforming multiple columns in data frame at once

I have some data that I'm trying to clean up. That involves modifying some columns, combining other cols into new ones, etc. I am wondering whether this can be done succinctly in pandas or whether each operation needs to be a separate line of code. Here is an example:
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
Say I want to create a new column called c which will be the first letter in each row of b, I want to transform b by removing the "-", and I want to create another col called d which will be the first letter of b concatenated with the entry in a in that same row. Right now I would have to do something like this:
ex_df["b"] = ex_df["b"].map(lambda x: "".join(x.split(sep="-")))
ex_df["c"] = ex_df["b"].map(lambda x: x[0])
ex_df["d"] = ex_df.apply(func=lambda s: s["c"] + str(s["a"]), axis=1)
ex_df
# a b c d
#0 1 ab a a1
#1 2 cd c c2
#2 3 ef e e3
#3 4 gh g g4
Coming from an R data.table background (which would combine all these operations into a single statement), I'm wondering how things are done in pandas.
You can use:
In [12]: ex_df.assign(
...: b=ex_df.b.str.replace('-', ''),
...: c=ex_df.b.str[0],
...: d=ex_df.b.str[0] + ex_df.a.astype(str)
...: )
Out[12]:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
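Note that within this single assign call, c and d are computed from the original b (which still contains the '-'); that happens to give the same first letter here. On Python 3.6+ with pandas 0.23+, later keyword arguments may refer to columns created earlier in the same assign if you pass callables; a sketch:
out = ex_df.assign(
    b=lambda d: d.b.str.replace('-', ''),
    c=lambda d: d.b.str[0],             # sees the cleaned b
    d=lambda d: d.c + d.a.astype(str),  # sees the new c
)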
This is one approach.
Demo:
import pandas as pd
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
ex_df["c"] = ex_df["b"].str[0]
ex_df["b"] = ex_df["b"].str.replace("-", "")
ex_df["d"] = ex_df.apply(lambda s: s["c"] + str(s["a"]), axis=1)
print(ex_df)
Output:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
You can use the built-in str accessor to produce the required output.

replicate rows in pandas by specific column with the values from that column

What would be the most efficient way to solve this problem?
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
i_need = pd.DataFrame(data={
'id': ['A','A','A','B','B','B','C', 'C'],
'v' : ['s','m','l','1','2','3','k','g']
})
I thought about creating a new df and appending the records to it while iterating over i_have, but as the number of rows grows, that can take a while.
Use numpy.repeat with numpy.concatenate for flattening:
import numpy as np

#create lists by split
splitted = i_have['v'].str.split(',')
#get lengths of each list
lens = splitted.str.len()
df = pd.DataFrame({'id': np.repeat(i_have['id'], lens),
                   'v': np.concatenate(splitted)})
print (df)
id v
0 A s
0 A m
0 A l
1 B 1
1 B 2
1 B 3
2 C k
2 C g
Thank you piRSquared for the solution for repeating multiple columns:
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'id1': ['A1', 'B1', 'C1'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
print (i_have)
id id1 v
0 A A1 s,m,l
1 B B1 1,2,3
2 C C1 k,g
splitted = i_have['v'].str.split(',')
lens = splitted.str.len()
df = i_have.loc[i_have.index.repeat(lens)].assign(v=np.concatenate(splitted))
print (df)
id id1 v
0 A A1 s
0 A A1 m
0 A A1 l
1 B B1 1
1 B B1 2
1 B B1 3
2 C C1 k
2 C C1 g
If you have multiple columns, first split the data on ',' with expand=True (thank you piRSquared), then stack and ffill, i.e.
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g'],
'w' : [ 's,8,l', '1,2,3', 'k,g'],
'x' : [ 's,0,l', '1,21,3', 'ks,g'],
'y' : [ 's,m,l', '11,2,3', 'ks,g'],
'z' : [ 's,4,l', '1,2,32', 'k,gs'],
})
i_want = i_have.apply(lambda x: x.str.split(',', expand=True).stack()).reset_index(level=1, drop=True).ffill()
If the split values are not equal-sized, then only forward-fill the id column:
i_want = i_have.apply(lambda x: x.str.split(',', expand=True).stack()).reset_index(level=1, drop=True)
i_want['id'] = i_want['id'].ffill()
Output i_want
id v w x y z
0 A s s s s s
1 A m 8 0 m 4
2 A l l l l l
3 B 1 1 1 11 1
4 B 2 2 21 2 2
5 B 3 3 3 3 32
6 C k k ks ks k
7 C g g g g gs
Here's another way
In [1667]: (i_have.set_index('id').v.str.split(',').apply(pd.Series)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1667]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
As pointed out in a comment, str.split with expand=True avoids the apply(pd.Series) step:
In [1672]: (i_have.set_index('id').v.str.split(',', expand=True)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1672]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
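On pandas 0.25+, Series.explode gives a shorter route for the single-column case; a sketch equivalent to the numpy.repeat answer above:
import pandas as pd

i_have = pd.DataFrame(data={'id': ['A', 'B', 'C'],
                            'v': ['s,m,l', '1,2,3', 'k,g']})

# split each string into a list, then emit one row per list element
i_want = i_have.assign(v=i_have['v'].str.split(',')).explode('v')
print (i_want)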
