I have a master dataframe A
id A B C
0 a b c
1 d e f
2 g h i
3 j k l
and a newer dataframe B
id A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 NaN
4 l2 m2 NaN
5 n2 o2 NaN
If I merge them, I get duplicated columns like A_x, A_y, B_x and B_y. If I use combine_first, I end up with rows 4 and 5 and column D, which I'm not interested in. Besides doing something like
on_a = B["id"].isin(A["id"])
B = B.loc[on_a, ["A", "B"]]
A = B.combine_first(A)
Is there a way to overwrite B onto A, ignoring everything that isn't in A? The desired output is
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
If the indices are the same, this is a simple update:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": ["a", "d", "g", "j"], "B": ["b", "e", "h", "k"], "C": ["c", "f", "i", "l"]})
>>> df1
A B C
0 a b c
1 d e f
2 g h i
3 j k l
>>> df2 = pd.DataFrame({"A": ["a2", "d2", "g2", "j2", "l2", "n2"], "B": ["b2", "e2", "h2", "k2", "m2", "o2"], "D": ["x", "y", "z", None, None, None]})
>>> df2
A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 None
4 l2 m2 None
5 n2 o2 None
>>> df1.update(df2)
>>> df1
A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
If you don't want to mutate the first dataframe, you can make a copy first.
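For example, to update a copy and leave df1 untouched:
>>> df3 = df1.copy()
>>> df3.update(df2)
>>> df3
A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l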
You could use join and then clean up the dataframe as desired. To make the cleanup dynamic, put "drop" in the suffix of the overlapping columns, then dropna with the subset defined as the columns whose names contain "drop", and finally keep only A's columns.
df = B.join(A, lsuffix='', rsuffix='drop')
df = df.dropna(subset=[col for col in df.columns if 'drop' in col])
df = df[A.columns]
df
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
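Note that join aligns on the index, so this assumes id is the index of both frames. If id is a plain column, a quick prior step makes the above work:
A = A.set_index('id')
B = B.set_index('id')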
I have several DataFrames with the same structure:
       0       1
0  TITLE  TITLE1
1      A      A1
2      B      B1
3      C      C1
4      D      D1
5      E      E1

       0       1
0  TITLE  TITLE2
1      A      A2
2      B      B2
3      C      C2
4      D      D2
5      E      E2
My goal is to have:
TITLE    A   B   C   D   E
TITLE1  A1  B1  C1  D1  E1
TITLE2  A2  B2  C2  D2  E2
How can I transform my DataFrames to flatten them and concatenate them like this?
Use DataFrame.set_index with transpose and then concat:
df11 = df1.set_index(0).T
df22 = df2.set_index(0).T
df = pd.concat([df11,df22]).set_index('TITLE')
print (df)
0 A B C D E
TITLE
TITLE1 A1 B1 C1 D1 E1
TITLE2 A2 B2 C2 D2 E2
Or transpose after concat with axis=1:
df11 = df1.set_index(0)
df22 = df2.set_index(0)
df = pd.concat([df11,df22], axis=1).T.set_index('TITLE')
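For completeness, a minimal sketch of the input frames as I read them from the question (the names df1 and df2 are assumed):
import pandas as pd

df1 = pd.DataFrame({0: ['TITLE', 'A', 'B', 'C', 'D', 'E'],
                    1: ['TITLE1', 'A1', 'B1', 'C1', 'D1', 'E1']})
df2 = pd.DataFrame({0: ['TITLE', 'A', 'B', 'C', 'D', 'E'],
                    1: ['TITLE2', 'A2', 'B2', 'C2', 'D2', 'E2']})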
I have a DataFrame like this:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E', np.nan, np.nan, 'F']})
>>> df
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill each NaN with the value from the previous column in the next row, and then drop that second row. In other words, I want to combine the two rows containing NaNs into a single row without NaNs, like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>"), but none of them gave me the expected output.
I haven't found any other question about this exact problem (here's a related one). Also, that DataFrame is actually built from a list of DataFrames with .concat(), as you may notice from the indexes; I mention this because it may be easier to do on the single-row pieces than on the combined frame.
I have found some suggestions to use shift and combine_first, but none of them worked for me. You may try those too.
I also found a whole article about filling NaN values, but it doesn't cover a problem like mine.
OK, I misunderstood what you wanted to do the first time; the dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, this operation does not exist in pandas, so we will use numpy to do the work.
First transform the dataframe to a numpy array and flatten it to one dimension. Then drop the NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and reshape the array back to the original number of columns before transforming it back to a dataframe:
array = df.to_numpy().flatten()
pd.DataFrame(array[~pd.isna(array)].reshape(-1,df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
It also works for more complex examples, as long as the NaN pattern is the same across the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
In case the NaN columns do not share the same pattern, such as:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    # flatten the group, drop NaNs, reshape back to the group's column count
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)

# pair up the columns (0,1), (2,3), ... and apply the shift per pair
(df.groupby(np.repeat(np.arange(df.shape[1] // 2), 2), axis=1)
   .apply(elementwise_shift)
)
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
You can do this in two steps with a placeholder column. First you fill all the NaNs in column b with the a value from the next row. Then you apply the filtering: in this example I use ffill with a limit of 1, so that of each run of NaN rows only the first survives. There's probably a better method.
import pandas as pd
import numpy as np
df=pd.DataFrame({"a":[1,2,3,3,4],"b":[1,2,np.nan,np.nan,4]})
# Fill the NaNs in b with the a value from the next row:
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
# Keep only the first NaN row of each run:
df = df[df['b'].ffill(limit=1).notna()].copy()  # .copy() to avoid SettingWithCopyWarning
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
#    a    b
# 0  1  1.0
# 1  2  2.0
# 2  3  3.0
# 4  4  4.0
I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Use set_index with floor division and modulo by 3, then unstack and flatten the MultiIndex:
import numpy as np

a = np.arange(len(df))
#if default index
#a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
#python 3.6+ solution
df1.columns = [f'{i}{j + 1}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}{}'.format(i, j + 1) for i, j in df1.columns]
print (df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
I'm adding a different approach with group -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
def munge(group):
    g = group.T.stack()
    # label as a1, a2, a3, b1, b2, b3 within each group of three rows
    g.index = ['{}{}'.format(c, i % len(group) + 1) for i, (c, _) in enumerate(g.index)]
    return g

result = df.groupby(df.index // 3).apply(munge)
Output:
>>> df.groupby(df.index // 3).apply(munge)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
I have some data that I'm trying to clean up. That involves modifying some columns, combining other cols into new ones, etc. I am wondering if there is a way to do this in a succinct way in pandas or if each operation needs to be a separate line of code. Here is an example:
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
Say I want to create a new column c holding the first letter in each row of b, transform b by removing the "-", and create another column d which is the first letter of b concatenated with the entry in a on that same row. Right now I would have to do something like this:
ex_df["b"] = ex_df["b"].map(lambda x: "".join(x.split(sep="-")))
ex_df["c"] = ex_df["b"].map(lambda x: x[0])
ex_df["d"] = ex_df.apply(func=lambda s: s["c"] + str(s["a"]), axis=1)
ex_df
# a b c d
#0 1 ab a a1
#1 2 cd c c2
#2 3 ef e e3
#3 4 gh g g4
Coming from an R data.table background (which would combine all these operations into a single statement), I'm wondering how things are done in pandas.
You can use:
In [12]: ex_df.assign(
...: b=ex_df.b.str.replace('-', ''),
...: c=ex_df.b.str[0],
...: d=ex_df.b.str[0] + ex_df.a.astype(str)
...: )
Out[12]:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
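One subtlety: every right-hand side above is evaluated against the original ex_df (which happens to be fine here, since removing the "-" does not change b's first character). If a later column needs to see an earlier result, pass callables instead; since pandas 0.23, assign evaluates the keyword arguments in order on the intermediate frame:
ex_df.assign(
    b=lambda d: d.b.str.replace('-', ''),
    c=lambda d: d.b.str[0],             # sees the already-cleaned b
    d=lambda d: d.c + d.a.astype(str),  # sees the new c
)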
This is one approach.
Demo:
import pandas as pd
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
ex_df["c"] = ex_df["b"].str[0]
ex_df["b"] = ex_df["b"].str.replace("-", "")
ex_df["d"] = ex_df.apply(lambda s: s["c"] + str(s["a"])), axis=1)
print(ex_df)
Output:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
You can use the built-in str methods to produce the required output.
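For instance, a sketch that keeps everything vectorised, replacing the apply in the demo above with plain string concatenation:
import pandas as pd

ex_df = pd.DataFrame(data={"a": [1, 2, 3, 4], "b": ["a-b", "c-d", "e-f", "g-h"]})
ex_df["c"] = ex_df["b"].str[0]                    # first letter of b
ex_df["b"] = ex_df["b"].str.replace("-", "")      # strip the dash
ex_df["d"] = ex_df["c"] + ex_df["a"].astype(str)  # vectorised concat, no apply
print(ex_df)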
What would be the most efficient way to solve this problem?
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
i_need = pd.DataFrame(data={
'id': ['A','A','A','B','B','B','C', 'C'],
'v' : ['s','m','l','1','2','3','k','g']
})
I thought about creating a new df and appending the records to it while iterating over i_have, but as the number of rows grows, that can take a while.
Use numpy.repeat with numpy.concatenate for flattening:
import numpy as np

#create lists by splitting on ','
splitted = i_have['v'].str.split(',')
#get the length of each list
lens = splitted.str.len()
df = pd.DataFrame({'id': np.repeat(i_have['id'], lens),
                   'v': np.concatenate(splitted)})
print (df)
id v
0 A s
0 A m
0 A l
1 B 1
1 B 2
1 B 3
2 C k
2 C g
Thank you piRSquared for the solution for repeating multiple columns:
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'id1': ['A1', 'B1', 'C1'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
print (i_have)
id id1 v
0 A A1 s,m,l
1 B B1 1,2,3
2 C C1 k,g
splitted = i_have['v'].str.split(',')
lens = splitted.str.len()
df = i_have.loc[i_have.index.repeat(lens)].assign(v=np.concatenate(splitted))
print (df)
id id1 v
0 A A1 s
0 A A1 m
0 A A1 l
1 B B1 1
1 B B1 2
1 B B1 3
2 C C1 k
2 C C1 g
If you have multiple columns, then first split the data on "," with expand=True (thank you piRSquared), then stack and ffill, i.e.
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g'],
'w' : [ 's,8,l', '1,2,3', 'k,g'],
'x' : [ 's,0,l', '1,21,3', 'ks,g'],
'y' : [ 's,m,l', '11,2,3', 'ks,g'],
'z' : [ 's,4,l', '1,2,32', 'k,gs'],
})
i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True).ffill()
If the values are not all of equal size, then ffill only the id column:
i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True)
i_want['id'] = i_want['id'].ffill()
Output i_want:
id v w x y z
0 A s s s s s
1 A m 8 0 m 4
2 A l l l l l
3 B 1 1 1 11 1
4 B 2 2 21 2 2
5 B 3 3 3 3 32
6 C k k ks ks k
7 C g g g g gs
Here's another way
In [1667]: (i_have.set_index('id').v.str.split(',').apply(pd.Series)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1667]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
As pointed out in a comment, str.split with expand=True avoids the apply(pd.Series) step:
In [1672]: (i_have.set_index('id').v.str.split(',', expand=True)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1672]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g