I have two dataframes with hundreds of columns.
Some have the same name, some do not.
I want the two dataframes to have the columns with the same name listed in the same order.
Typically, if those were the only columns, I would do:
df2 = df2.filter(df1.columns)
However, because there are columns with different names, this would eliminate all columns in df2 that do not exist in df1.
How do I put all common columns in the same order without losing the columns that are not in common? Those not in common must keep their original order. Because I have hundreds of columns I cannot do it manually and need a quick solution like "filter". Please note that although there are similar questions, they do not deal with the case where some columns are in common and some are different.
Example:
df1.columns = A,B,C,...,Z,1,2,...,1000
df2.columns = Z,K,P,T,...,01,02,...,01000
I want to reorder the columns for df2 to be:
df2.columns = A,B,C,...,Z,01,02,...,01000
Try set operations on the column names, like intersection and difference:
Set up an MRE (minimal reproducible example):
>>> df1
A B C D
0 2 7 7 5
1 6 8 4 2
>>> df2
C B E F
0 8 7 3 2
1 8 6 5 8
c0 = df1.columns.intersection(df2.columns)
c1 = df1.columns.difference(df2.columns)
c2 = df2.columns.difference(df1.columns)
df1 = df1[c0.tolist() + c1.tolist()]
df2 = df2[c0.tolist() + c2.tolist()]
Output:
>>> df1
B C A D
0 7 7 2 5
1 8 4 6 2
>>> df2
B C E F
0 7 8 3 2
1 6 8 5 8
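If this has to be applied to many pairs of frames, the same set operations can be wrapped in a small helper; a minimal sketch (the function name align_columns is just an assumption, not from the answer above):

import pandas as pd

def align_columns(df1, df2):
    # shared columns (intersection) plus each frame's own extra columns
    common = df1.columns.intersection(df2.columns)
    only_1 = df1.columns.difference(df2.columns)
    only_2 = df2.columns.difference(df1.columns)
    # put the shared columns first, in the same order, in both frames
    return (df1[common.tolist() + only_1.tolist()],
            df2[common.tolist() + only_2.tolist()])

df1 = pd.DataFrame([[2, 7, 7, 5]], columns=list("ABCD"))
df2 = pd.DataFrame([[8, 7, 3, 2]], columns=list("CBEF"))
a, b = align_columns(df1, df2)
print(a.columns.tolist(), b.columns.tolist())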
Assuming you also want to keep the columns that are not in common in their original positions:
# make a copy of df2 column names
new_cols = df2.columns.values.copy()
# reorder common column names in df2 to be same order as df1
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
# reorder columns using new_cols
df2[new_cols]
Example:
df1 = pd.DataFrame([[1,2,3,4,5]], columns=list('badfe'))
df2 = pd.DataFrame([[1,2,3,4,5]], columns=list('fsxad'))
df1
b a d f e
0 1 2 3 4 5
df2
f s x a d
0 1 2 3 4 5
new_cols = df2.columns.values.copy()
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
df2[new_cols]
a s x d f
0 4 2 3 5 1
You can do this using pd.Index.intersection, pd.Index.difference and pd.Index.union:
i = df1.columns.intersection(df2.columns,sort=False).union(
df2.columns.difference(df1.columns),sort=False
)
out = df2.loc[:,i]
df1 = pd.DataFrame(columns=list("ABCEFG"))
df2 = pd.DataFrame(columns=list("ECDAFGHI"))
print(df1)
print(df2)
i = df2.columns.intersection(df1.columns,sort=False).union(
df2.columns.difference(df1.columns),sort=False
)
print(df2.loc[:,i])
Empty DataFrame
Columns: [A, B, C, E, F, G]
Index: []
Empty DataFrame
Columns: [E, C, D, A, F, G, H, I]
Index: []
Empty DataFrame
Columns: [A, C, E, F, G, D, H, I]
Index: []
Related
I have two dataframes with similar columns:
df1 = (a, b, c, d)
df2 = (a, b, c, d)
I want to concat or merge some of their columns, like below, into df3:
df3 = (a_1, a_2, b_1, b_2)
How can I put them beside each other as they are (without any change), and how can I merge them on a shared key like d? I tried adding them to a list and concatenating them, but I don't know how to give them new names. I don't want multi-level column names.
for ii, tdf in enumerate(mydfs):
    tdf = tdf.sort_values(by="fid", ascending=False)
    for _col in ["fid", "pred_text1"]:
        new_col = _col + str(ii)
        dfs.append(tdf[_col])
    ii += 1
df = pd.concat(dfs, axis=1)
Without a look at your actual dataframes it is hard to be precise, so I am generating sample dataframes to show how the code works:
import pandas as pd
import re
df1 = pd.DataFrame({"a":[1,2,4], "b":[2,4,5], "c":[5,6,7], "d":[1,2,3]})
df2 = pd.DataFrame({"a":[6,7,5], "b":[3,4,8], "c":[6,3,9], "d":[1,2,3]})
mergedDf = (
    df1.merge(df2, how="left", on="d")
       .rename(columns=lambda x: re.sub(r"(.+)_x", r"\1_1", x))
       .rename(columns=lambda x: re.sub(r"(.+)_y", r"\1_2", x))
)
mergedDf
which results in:
   a_1  b_1  c_1  d  a_2  b_2  c_2
0    1    2    5  1    6    3    6
1    2    4    6  2    7    4    3
2    4    5    7  3    5    8    9
If you want to drop some of the other columns, you can use the code below:
mergedDf.iloc[:, ~mergedDf.columns.str.startswith("c")]
which results in:
   a_1  b_1  d  a_2  b_2
0    1    2  1    6    3
1    2    4  2    7    4
2    4    5  3    5    8
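The question also asks how to put the two frames beside each other as they are, without a key. A minimal sketch using pd.concat and add_suffix (the _1/_2 suffix scheme is just chosen here to mirror the merge output above):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8]})

# place the frames next to each other positionally, renaming to avoid clashes
df3 = pd.concat([df1.add_suffix("_1"), df2.add_suffix("_2")], axis=1)
df3 = df3[sorted(df3.columns)]          # a_1, a_2, b_1, b_2
print(df3)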
I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to subtract columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, what that answer does not explain is what to do if there are other columns in the primary df (such as column B) that should not be used as an index or take part in the operation.
Can somebody please advise?
I was originally using a loop that finds the value in the other df and subtracts it, but this takes too long with the size of data I am working with.
The idea is to specify the column(s) used for matching and the column(s) to subtract, move all other column names into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or fill the non-matched values back in from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
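Since the original loop was too slow, another vectorized option (a sketch, not taken from the answer above) is to align Df2 to Df1's key order with set_index/loc and subtract the raw values:

import pandas as pd

Df1 = pd.DataFrame({"A": list("xyz"), "B": list("jkl"),
                    "C": [5, 7, 9], "D": [2, 3, 4]})
Df2 = pd.DataFrame({"A": list("zxy"), "B": list("opq"),
                    "C": [1, 2, 3], "D": [1, 1, 1]})

cols = ["C", "D"]
Df3 = Df1.copy()
# look up Df2's rows in the order of Df1's keys, then subtract the raw values
Df3[cols] = Df1[cols].values - Df2.set_index("A").loc[Df1["A"], cols].values
print(Df3)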
Let's say I have a DataFrame and don't know the names of all columns. However, I know there's a column called "N_DOC" and I want it to be the first column of the DataFrame (while keeping all other columns, regardless of their order).
How can I do this?
You can reorder the columns of a dataframe with reindex:
cols = df.columns.tolist()
cols.remove('N_DOC')
df.reindex(['N_DOC'] + cols, axis=1)
Use DataFrame.insert with DataFrame.pop to extract the column:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'N_DOC': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
c = 'N_DOC'
df.insert(0, c, df.pop(c))
Or:
df.insert(0, 'N_DOC', df.pop('N_DOC'))
print (df)
N_DOC A B C E F
0 1 a 4 7 5 a
1 3 b 5 8 3 a
2 5 c 4 9 6 a
3 7 d 5 4 9 b
4 1 e 5 2 2 b
5 0 f 4 3 4 b
Here's a simple, one-line solution using column selection:
import pandas as pd
# Building sample dataset.
cols = ['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC']
df = pd.DataFrame(columns=cols)
# Re-order columns.
df = df[['N_DOC'] + df.columns.drop('N_DOC').tolist()]
Before:
Index(['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC'], dtype='object')
After:
Index(['N_DOC', 'N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe'], dtype='object')
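If this comes up repeatedly, the same idea fits in a tiny helper (the name move_to_front is hypothetical); a minimal sketch:

import pandas as pd

def move_to_front(df, col):
    # select the chosen column first, then everything else in its original order
    return df[[col] + [c for c in df.columns if c != col]]

df = pd.DataFrame(columns=["N_DOCa", "N_DOCb", "N_DOC"])
print(move_to_front(df, "N_DOC").columns.tolist())  # ['N_DOC', 'N_DOCa', 'N_DOCb']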
I am trying to create new dataframes df_A, df_B and df_C from an existing dataframe df, based on the categorical values (A, B and C) in the column Category.
This doesn't work
df_A = {n: df.ix[rows]
for n, rows in enumerate(df.groupby('Category').groups)}
Here I get the error "Key Error: A"
(Note: A is one of the categories)
This doesn't work either
df_A = np.where(df['Category']=='A')).copy()
Here I get the error: "syntax error"
Finally, this doesn't work
df_A = np.where(raw[raw['Category']=='A']).copy()
"AttributeError: 'tuple' object has no attribute 'copy'"
Thank You
It seems you first need boolean indexing, because Category is a column, not the index. If you need a dictionary:
df2 = {n: data[ data['Category'] == rows]
for n, rows in enumerate(data.groupby('Category').groups)}
Or drop .groups and iterate over the groupby directly:
df2 = {n: rows[1] for n, rows in enumerate(data.groupby('Category'))}
Sample:
data = pd.DataFrame({'Category':['A','A','D'],
'B':[4,5,6],
'C':[7,8,9]})
print (data)
B C Category
0 4 7 A
1 5 8 A
2 6 9 D
df2 = {n: rows[1] for n, rows in enumerate(data.groupby('Category'))}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
df2 = {n: data[ data['Category'] == rows]
for n, rows in enumerate(data.groupby('Category').groups)}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
Solution without groupby
df2 = {n: data[data['Category'] == rows] for n, rows in enumerate(data['Category'].unique())}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
print (df2[0])
B C Category
0 4 7 A
1 5 8 A
But if you need a dict of DataFrames keyed by the Category value:
dfs = {n: rows for n, rows in data.groupby('Category')}
print (dfs)
{'A': B C Category
0 4 7 A
1 5 8 A, 'D': B C Category
2 6 9 D}
print (dfs['A'])
B C Category
0 4 7 A
1 5 8 A
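If separate variables df_A, df_B, ... are really wanted rather than a dict, they can be pulled out of the keyed dict above; a short sketch using the same sample data:

import pandas as pd

data = pd.DataFrame({'Category': ['A', 'A', 'D'],
                     'B': [4, 5, 6],
                     'C': [7, 8, 9]})

dfs = {cat: rows for cat, rows in data.groupby('Category')}
df_A = dfs['A']        # rows where Category == 'A'
df_D = dfs.get('D')    # .get returns None instead of raising if a category is missing
print(df_A)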
I have this dataframe:
dfx = pd.DataFrame([[1,2],['A','B'],[['C','D'],'E']],columns=list('AB'))
A B
0 1 2
1 A B
2 [C, D] E
... that I want to transform into ...
A B
0 1 2
1 A B
2 C E
3 D E
... adding a row for each value contained in column A if it's a list.
What is the most pythonic way?
And vice versa: what if I want to group by a column (let's say B) and have column A contain a list of the grouped values (the opposite of the example above)?
Thanks in advance,
Gianluca
You have a mixed dataframe - int, str and list values (very problematic, because many functions raise errors), so first convert all numeric values to str using where; the mask comes from to_numeric with errors='coerce', which converts non-numeric values to NaN:
dfx.A = dfx.A.where(pd.to_numeric(dfx.A, errors='coerce').isnull(), dfx.A.astype(str))
print (dfx)
A B
0 1 2
1 A B
2 [C, D] E
and then create a new DataFrame with np.repeat, flattening the lists with chain.from_iterable:
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "B": np.repeat(dfx.B.values, dfx.A.str.len()),
    "A": list(chain.from_iterable(dfx.A))})
print (df)
A B
0 1 2
1 A B
2 C E
3 D E
A pure pandas solution: convert column A to a list of lists and create a new DataFrame with DataFrame.from_records, then drop the original column A and join the stacked frame:
df = pd.DataFrame.from_records(dfx.A.values.tolist(), index = dfx.index)
df = dfx.drop('A', axis=1).join(df.stack().rename('A')
.reset_index(level=1, drop=True))[['A','B']]
print (df)
A B
0 1 2
1 A B
2 C E
2 D E
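In newer pandas versions (0.25 and later), DataFrame.explode does this directly; a short sketch on the same dfx (scalar entries pass through unchanged):

import pandas as pd

dfx = pd.DataFrame([[1, 2], ['A', 'B'], [['C', 'D'], 'E']], columns=list('AB'))
# each element of a list in column A gets its own row; scalars are kept as they are
print(dfx.explode('A'))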
If you need lists, use groupby and apply tolist:
print (df.groupby('B')['A'].apply(lambda x: x.tolist()).reset_index())
B A
0 2 [1]
1 B [A]
2 E [C, D]
but if you need a list only when there is more than one value, an if..else is necessary:
print (df.groupby('B')['A'].apply(lambda x: x.tolist() if len(x) > 1 else x.values[0])
.reset_index())
B A
0 2 1
1 B A
2 E [C, D]
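The first groupby form above can also be written with agg(list), which always produces lists (including one-element ones); a small self-contained sketch of the exploded frame, with everything kept as strings for simplicity:

import pandas as pd

df = pd.DataFrame({'A': ['1', 'A', 'C', 'D'], 'B': ['2', 'B', 'E', 'E']})
# collect column A back into one list per value of B
print(df.groupby('B')['A'].agg(list).reset_index())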