I have two dataframes with similar columns:
df1 = (a, b, c, d)
df2 = (a, b, c, d)
I want to concatenate or merge some of their columns into a new dataframe df3 like below:
df3 = (a_1, a_2, b_1, b_2)
How can I put them side by side as they are (without any change), and how can I merge them on a shared key like d? I tried adding them to a list and concatenating them, but I don't know how to give the columns new names. I don't want multi-level column names.
dfs = []
for ii, tdf in enumerate(mydfs):
    tdf = tdf.sort_values(by="fid", ascending=False)
    for _col in ["fid", "pred_text1"]:
        new_col = _col + str(ii)  # the name I want, but I don't know where to apply it
        dfs.append(tdf[_col])
df = pd.concat(dfs, axis=1)
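For the first part (putting them side by side as they are), a minimal sketch using DataFrame.add_suffix plus concat — the frames and values here are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8]})

# Suffix each frame's column names, then place the frames side by side.
# concat(axis=1) aligns on the index, so rows pair up by index label.
df3 = pd.concat([df1.add_suffix("_1"), df2.add_suffix("_2")], axis=1)
df3 = df3[["a_1", "a_2", "b_1", "b_2"]]  # reorder to the desired layout
```

This keeps flat (single-level) column names, as requested.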
Without a look at your actual dataframes it is hard to be precise, so I am generating sample dataframes to show how the code works:
import pandas as pd
import re
df1 = pd.DataFrame({"a":[1,2,4], "b":[2,4,5], "c":[5,6,7], "d":[1,2,3]})
df2 = pd.DataFrame({"a":[6,7,5], "b":[3,4,8], "c":[6,3,9], "d":[1,2,3]})
mergedDf = (
    df1.merge(df2, how="left", on="d")
       .rename(columns=lambda x: re.sub(r"(.+)_x", r"\1_1", x))
       .rename(columns=lambda x: re.sub(r"(.+)_y", r"\1_2", x))
)
mergedDf
which results in:
   a_1  b_1  c_1  d  a_2  b_2  c_2
0    1    2    5  1    6    3    6
1    2    4    6  2    7    4    3
2    4    5    7  3    5    8    9
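As a side note, merge can attach these suffixes itself via its suffixes parameter, which avoids the regex renaming entirely (same sample frames as above):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5], "c": [5, 6, 7], "d": [1, 2, 3]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8], "c": [6, 3, 9], "d": [1, 2, 3]})

# suffixes apply only to overlapping column names; the key "d" stays unsuffixed.
mergedDf = df1.merge(df2, how="left", on="d", suffixes=("_1", "_2"))
```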
If you want to drop some of the columns (here, every column starting with "c"), you can use the code below:
mergedDf.iloc[:, ~mergedDf.columns.str.startswith("c")]
which results in:
   a_1  b_1  d  a_2  b_2
0    1    2  1    6    3
1    2    4  2    7    4
2    4    5  3    5    8
Related
I have two dataframes with hundreds of columns.
Some have the same name, some do not.
I want the two dataframes to have the columns with same name listed in the same order.
Typically, if those were the only columns, I would do:
df2 = df2.filter(df1.columns)
However, because there are columns with different names, this would drop every column in df2 that does not exist in df1.
How do I put all common columns in the same order without losing the columns that are not in common? Those not in common must keep their original order. Because I have hundreds of columns I cannot do it manually and need a quick solution like "filter". Note that although there are similar questions, they do not cover the case where some columns are in common and some are different.
Example:
df1.columns = A,B,C,...,Z,1,2,...,1000
df2.columns = Z,K,P,T,...,01,02,...,01000
I want to reorder the columns for df2 to be:
df2.columns = A,B,C,...,Z,01,02,...,01000
Try set operations on the column names, like intersection and difference:
Set up an MRE:
>>> df1
A B C D
0 2 7 7 5
1 6 8 4 2
>>> df2
C B E F
0 8 7 3 2
1 8 6 5 8
c0 = df1.columns.intersection(df2.columns)
c1 = df1.columns.difference(df2.columns)
c2 = df2.columns.difference(df1.columns)
df1 = df1[c0.tolist() + c1.tolist()]
df2 = df2[c0.tolist() + c2.tolist()]
Output:
>>> df1
B C A D
0 7 7 2 5
1 8 4 6 2
>>> df2
B C E F
0 7 8 3 2
1 6 8 5 8
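One caveat: Index.intersection and Index.difference sort their result alphabetically by default, which is why df1 comes out as B C A D above. If you need the common block to follow df1's original order instead, a small sketch (same hypothetical data):

```python
import pandas as pd

df1 = pd.DataFrame([[2, 7, 7, 5], [6, 8, 4, 2]], columns=list("ABCD"))
df2 = pd.DataFrame([[8, 7, 3, 2], [8, 6, 5, 8]], columns=list("CBEF"))

# Common columns in df1's original order, then df2's own extras in their order.
common = [c for c in df1.columns if c in set(df2.columns)]      # ['B', 'C']
extras = [c for c in df2.columns if c not in set(df1.columns)]  # ['E', 'F']
df2 = df2[common + extras]
```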
Assuming you also want to keep the columns that are not in common in their original positions:
# make a copy of df2 column names
new_cols = df2.columns.values.copy()
# reorder common column names in df2 to be same order as df1
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
# reorder columns using new_cols
df2[new_cols]
Example:
df1 = pd.DataFrame([[1,2,3,4,5]], columns=list('badfe'))
df2 = pd.DataFrame([[1,2,3,4,5]], columns=list('fsxad'))
df1
b a d f e
0 1 2 3 4 5
df2
f s x a d
0 1 2 3 4 5
new_cols = df2.columns.values.copy()
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
df2[new_cols]
a s x d f
0 4 2 3 5 1
You can do this using pd.Index.intersection, pd.Index.difference, and pd.Index.union:
i = df1.columns.intersection(df2.columns,sort=False).union(
df2.columns.difference(df1.columns),sort=False
)
out = df2.loc[:,i]
df1 = pd.DataFrame(columns=list("ABCEFG"))
df2 = pd.DataFrame(columns=list("ECDAFGHI"))
print(df1)
print(df2)
i = df2.columns.intersection(df1.columns,sort=False).union(
df2.columns.difference(df1.columns),sort=False
)
print(df2.loc[:,i])
Empty DataFrame
Columns: [A, B, C, E, F, G]
Index: []
Empty DataFrame
Columns: [E, C, D, A, F, G, H, I]
Index: []
Empty DataFrame
Columns: [A, C, E, F, G, D, H, I]
Index: []
I have a pandas DataFrame that contains lists in some entries:
pandas = pd.DataFrame([[1,[2,3],[4,5]],[9,[2,3],[4,5]]],columns = ['A','B','C'])
I would like to know how to flatten this dataframe to
pandas_flat = pd.DataFrame([[1,2,3,4,5],[9,2,3,4,5]],columns = ['A','B_0','B_1','C_0','C_1'])
where the column names are adapted accordingly.
The next level is to flatten a dataframe whose lists vary in size within one column. How do I flatten them, filling gaps with a fill_value, as follows:
pandas_1 = pd.DataFrame([[1,[2,3,3],[4,5]],[9,[2,3],[4,5]]],columns = ['A','B','C'])
-->
fill_value = 0
pandas_flat_1 = pd.DataFrame([[1,2,3,3,4,5],[9,2,3,0,4,5]],columns = ['A','B_0','B_1','B_2','C_1','C_2'])
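The varying-length case is not covered by the answer below, so here is a hedged sketch of one way to do it. Note the asker's expected names mix 0-based (B_0…) and 1-based (C_1…) suffixes; this sketch uses 0-based suffixes throughout, and the helper name flatten_lists is made up:

```python
import pandas as pd

def flatten_lists(df, fill_value=0):
    """Expand every list-valued column into numbered columns, padding short lists."""
    pieces = []
    for col in df.columns:
        if df[col].apply(lambda v: isinstance(v, list)).any():
            width = int(df[col].str.len().max())
            # pad each list to the column's maximum length
            padded = df[col].apply(lambda v: v + [fill_value] * (width - len(v)))
            expanded = pd.DataFrame(padded.tolist(), index=df.index)
            expanded.columns = [f"{col}_{i}" for i in range(width)]
            pieces.append(expanded)
        else:
            pieces.append(df[col])
    return pd.concat(pieces, axis=1)

pandas_1 = pd.DataFrame([[1, [2, 3, 3], [4, 5]], [9, [2, 3], [4, 5]]],
                        columns=['A', 'B', 'C'])
pandas_flat_1 = flatten_lists(pandas_1, fill_value=0)
```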
Solution:
For the first dataframe
pandas = pd.DataFrame([[1,[2,3],[4,5]],[9,[2,3],[4,5]]],columns = ['Samples','B','C'])
we have
df=pandas.T.apply(lambda x: x.explode())
groups=df.groupby(level=0)
ids=groups.cumcount()
#df.index=(df.index+'_'+ids.astype(str)).mask(groups.size()<2,df.index.to_series())
new_df=df.T
# Rename duplicated column names with numeric suffixes
from collections import Counter
m = list(new_df.columns)
d = {a: list(range(1, b + 1)) if b > 1 else '' for a, b in Counter(m).items()}
new_df.columns = [i + str(d[i].pop(0)) if len(d[i]) else i for i in m]
In the first case you could use Series.explode with DataFrame.apply and DataFrame.transpose. To rename the columns we can use GroupBy.cumcount:
df=pandas.T.apply(lambda x: x.explode())
ids=df.groupby(level=0).cumcount()
df.index=df.index+'_'+ids.astype(str)
new_df=df.T
print(new_df)
A_0 B_0 B_1 C_0 C_1
0 1 2 3 4 5
1 9 2 3 4 5
EDIT:
df=pandas.T.apply(lambda x: x.explode())
groups=df.groupby(level=0)
ids=groups.cumcount()
df.index=(df.index+'_'+ids.astype(str)).mask(groups.size()<2,df.index.to_series())
new_df=df.T
print(new_df)
A B_0 B_1 C_0 C_1
0 1 2 3 4 5
1 9 2 3 4 5
Initial df is:
df =
a a a a
1 2 3 4
5 6 7 8
9 1 2 3
Desired output:
df =
b_1 c_1 b_2 c_2
1 2 3 4
5 6 7 8
9 1 2 3
I can do it the long way: pick the odd then the even columns, rename them, and concat. But I'm looking for a quick solution.
Try this:
df =pd.DataFrame({'a':[],'b':[],'c':[],'d':[],'e':[],'f':[],'g':[],'h':[]})
df.columns = ['b_'+str(i//2 +1) if i%2==0 else 'c_'+str((i//2 +1)) for i in range(df.shape[1]) ]
print(df.columns)
output:
Index(['b_1', 'c_1', 'b_2', 'c_2', 'b_3', 'c_3', 'b_4', 'c_4'], dtype='object')
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
if col begins with a number:
df.drop(col)
I'm not sure what are the best practices when it comes to handling pandas DataFrames, how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest is to select all columns which do not start with a number, using filter with a regex — ^ anchors the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and exclude columns that start with a digit:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
if col[0].isnumeric():
df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like python/pandas: how to combine two dataframes into one with hierarchical column index?, but those concat different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
Option 1: set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
Option 2: pd.concat with keys, then swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way, for an existing two-level MultiIndex (inserting a level 'E' in the middle):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
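Since this answer assumes df already carries a two-level column MultiIndex (unlike the single-level df in the question), a hypothetical setup that reproduces the output above:

```python
import pandas as pd

# Hypothetical two-level frame: columns (A, C) and (B, D), made up to match the output.
df = pd.DataFrame([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]],
                  index=list('abcde'),
                  columns=pd.MultiIndex.from_tuples([('A', 'C'), ('B', 'D')]))

# Insert a new level 'E' between the two existing levels.
df.columns = pd.MultiIndex.from_tuples([(a, 'E', b) for a, b in df.columns])
```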
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, using a list comprehension of tuples as the argument to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4