I have two pandas DataFrames, as below:
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df1:
Q1 Q2 ID
SubQ1 SubQ2 SubQ1
0 1 1 1 a
1 2 2 2 b
2 3 3 3 c
df2:
item_id url
0 a a.com
1 b blah.com
2 c company.com
Note that df1 has some columns with hierarchical indexing (e.g. ('Q1', 'SubQ1')) and some with just normal indexing (e.g. ID).
I want to merge these two data frames on the ID and item_id fields. Using:
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
gives:
(Q1, SubQ1) (Q1, SubQ2) (Q2, SubQ1) (ID, ) item_id url
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
As you can see, the merge itself works fine, but the MultiIndex has been lost and the column labels have reverted to tuples. I've tried to recreate the MultiIndex using pd.MultiIndex.from_tuples, as in:
result.columns = pd.MultiIndex.from_tuples(result)
but this mangles the item_id and url columns, keeping just the first two characters of their names (presumably because from_tuples treats each plain-string label as a sequence of characters):
Q1 Q2 ID i u
SubQ1 SubQ2 SubQ1 t r
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
Converting the columns in df2 to one-element tuples (i.e. ('item_id',) rather than just 'item_id') makes no difference.
How can I merge these two DataFrames and keep the MultiIndex properly? Or alternatively, how can I take the result of the merge and get back to columns with a proper MultiIndex without mucking up the names of the item_id and url columns?
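For the record, the from_tuples approach can be repaired by padding every plain-string label to a tuple first; a minimal sketch, assuming every tuple label in the merged result already has exactly two levels:

# Normalize the merged column labels to uniform 2-tuples so that
# from_tuples doesn't split plain strings into characters.
fixed = [c if isinstance(c, tuple) else (c, '') for c in result.columns]
result.columns = pd.MultiIndex.from_tuples(fixed)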
If you can't beat 'em, join 'em. (Make both DataFrames have the same number of index levels before merging):
import pandas as pd
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
print(result)
yields
Q1 Q2 ID item_id url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
This also avoids the UserWarning:
pandas/core/reshape/merge.py:551: UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
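For reference, the from_product line is what gives df2 two-level column labels with an empty second level:

print(df2.columns.tolist())
# [('item_id', ''), ('url', '')]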
The ID column is not actually "non-hierarchical": it is signified by ('ID', ''). However, pandas allows you to reference just the first level of the columns as if you were referencing a single-leveled column structure, meaning df1['ID'] should work, as should df1[('ID', '')] and df1.loc[:, ('ID', '')]. But if the top level 'ID' happened to have more columns associated with it in the second level, df1['ID'] would return a dataframe. I feel more comfortable with this solution, which looks a lot like #JohnGalt's answer in the comments.
df1.assign(u=df1[('ID', '')].map(df2.set_index('item_id').url))
Q1 Q2 ID u
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
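To illustrate the point about partial top-level indexing, here is a minimal sketch using the df1 from the question:

# All three select the same ID column; the first relies on partial
# (top-level) indexing, the others spell out the full two-level key.
print(df1['ID'])
print(df1[('ID', '')])
print(df1.loc[:, ('ID', '')])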
Joining a dataframe with single-level columns to one with multi-level columns is difficult. I have to artificially add another level:
def rnm(d):
    d = d.copy()
    d.columns = [d.columns, [''] * len(d.columns)]
    return d
df1.join(rnm(df2.set_index('item_id')), on=('ID',))
Q1 Q2 ID url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
This solution is more flexible in the sense that you won't have to insert column levels before concat; you can use it to concat frames with any number of levels:
import pandas as pd
df1 = pd.DataFrame({('A', 'b'): [1, 2], ('A', 'c'): [3, 4]})
df2 = pd.DataFrame({'Zaa': [1, 2]})
df3 = pd.DataFrame({('Maaa', 'k', 'l'): [1, 2]})
df = pd.concat([df1, df2, df3], axis=1)
cols = [col if isinstance(col, tuple) else (col, ) for col in df.columns]
df.columns = pd.MultiIndex.from_tuples(cols)
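If you would rather pad every label to the same depth explicitly (the empty-string padding here is my own variation, not part of the snippet above):

# Pad shorter tuples with '' so all columns share the maximum depth.
max_depth = max(len(col) for col in cols)
cols = [col + ('',) * (max_depth - len(col)) for col in cols]
df.columns = pd.MultiIndex.from_tuples(cols)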
Suppose there is the following dataframe:
import pandas as pd
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})
I would like to subtract the values from group B and C with those of group A and make a new column with the difference. That is, I would like to do something like this:
df[df['Group'] == 'B']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
df[df['Group'] == 'C']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
and place the result in a new column. Is there a way of doing it without a for loop?
Assuming you want to subtract the first A from the first B/C, the second A from the second B/C, etc., the easiest might be to reshape:
df2 = (df
       .assign(cnt=df.groupby('Group').cumcount())
       .pivot(index='cnt', columns='Group', values='Value')
       )
# Group A B C
# cnt
# 0 1 3 5
# 1 2 4 6
df['new_col'] = df2.sub(df2['A'], axis=0).melt()['value']
variant:
df['new_col'] = (df
                 .assign(cnt=df.groupby('Group').cumcount())
                 .groupby('cnt', group_keys=False)
                 .apply(lambda d: d['Value'].sub(d.loc[d['Group'].eq('A'), 'Value'].iloc[0]))
                 )
output:
Group Value new_col
0 A 1 0
1 A 2 0
2 B 3 2
3 B 4 2
4 C 5 4
5 C 6 4
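A further variant, sketched with where plus a grouped transform; it assumes each group has the same number of rows in the same within-group order, and the result comes back as float because of the intermediate NaNs:

# Keep only the A values, then broadcast the i-th A value onto the
# i-th row of every group and subtract.
cnt = df.groupby('Group').cumcount()
base = df['Value'].where(df['Group'].eq('A')).groupby(cnt).transform('first')
df['new_col'] = df['Value'] - base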
Two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
   A  B  C
2  1     3
4     2
df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this, omitting the non-matching indexes from the other df:
   A  B  C
2  1     3
4     2
The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows with the same index values from each df so that missing values in one df are replaced by the corresponding value in the other.
I have found that concat and merge are not up to the job.
I assume I have to have identical indexes in each df which correspond to the values I want to merge into one row. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask:
out = df1.mask(df1 == '', df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
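Note that mask aligns on index labels, so with the question's actual indexes ([2, 4] versus [7, 9]) the frames need to be aligned positionally first; a sketch:

a = df1.reset_index(drop=True)
b = df2.reset_index(drop=True)
out = a.mask(a == '', b)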
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i, j] == "":
            df1.iloc[i, j] = df2.iloc[i, j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to give them the same index first:
import numpy as np

index = list(range(len(df1)))
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
This still works even if df1 and df2 contain different amounts of data:
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''], [7, 8, 9], [10, 11, 12]], columns=['A', 'B', 'C'], index=[7, 8, 9, 10])
index1 = list(range(len(df1)))
index2 = list(range(len(df2)))
df1.index = index1
df2.index = index2
df1.replace('', np.nan).fillna(df2)
You get:
Out[17]:
    A   B   C
0   1   4   3
1   5   2   6
2   7   8   9
3  10  11  12
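The same alignment can also be written without building the index lists by hand; a sketch of the equivalent:

ddf = (df1.reset_index(drop=True)
          .replace('', np.nan)
          .fillna(df2.reset_index(drop=True)))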
I want to match two pandas DataFrames by the names of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
A B C
0 0 2 1
1 1 3 0
2 0 4 0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
A B D
0 0 0 1
1 1 5 0
2 0 7 0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
It appears that df2's columns should be the same as df1's after this operation: columns that are in df1 but not df2 are added, while columns only in df2 are removed. We can simply reindex df2 to match df1's columns with fill_value=0 (the safe equivalent of df2 = df2[df1.columns], which also adds the missing columns filled with 0 and preserves df1's column order):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I have a dataframe, which we can proxy by
df = pd.DataFrame({'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]})
and a category series
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])
I'd like to get a sum of df's columns grouped into the categories 'A', 'B'. Maybe something like:
result = df.groupby(??, axis=1).sum()
returning
result = pd.DataFrame({'A':[3,3,4], 'B':[1,1,0]})
Use groupby + sum on the columns (the axis=1 is important here):
df.groupby(df.columns.map(category.get), axis=1).sum()
A B
0 3 1
1 3 1
2 4 0
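Note that groupby(..., axis=1) is deprecated in recent pandas versions; an equivalent sketch transposes, groups the rows by the category series (which aligns on the index), and transposes back:

result = df.T.groupby(category).sum().T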
After a reindex you can assign the category values as df's columns:
df = df.reindex(columns=category.index)
df.columns = category
df.groupby(df.columns.values, axis=1).sum()
Out[1255]:
A B
0 3 1
1 3 1
2 4 0
Or use pd.Series.get:
df.groupby(category.get(df.columns), axis=1).sum()
Out[1262]:
A B
0 3 1
1 3 1
2 4 0
Here is what I did to group a dataframe with similar column names:
data_df:
   1  1  2  1
0  q  r  f  t
Code:
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
df_grouped:
       1  2
0  q r t  f
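If the values are not already strings, the join will fail, so cast them first; a sketch:

df_grouped = data_df.groupby(data_df.columns, axis=1).agg(
    lambda x: ' '.join(x.astype(str)))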
I have a dataframe that looks like this:
ID Description
1 A
1 B
1 C
2 A
2 C
3 A
I would like to group by the ID column and get the descriptions as a list of lists, like this:
ID Description
1 [["A"],["B"],["C"]]
2 [["A"],["C"]]
3 [["A"]]
I tried df.groupby('ID')['Description'].apply(list), but this creates only the "first level" of lists.
This is slightly different to #jezrael's answer in that the listifying of strings is done via map. In addition, calling reset_index() adds "Description" explicitly to the output.
import pandas as pd
df = pd.DataFrame([[1, 'A'], [1, 'B'], [1, 'C'], [2, 'A'], [2, 'C'], [3, 'A']], columns=['ID', 'Description'])
df.groupby('ID')['Description'].apply(list).apply(lambda x: list(map(list, x))).reset_index()
# ID Description
# 1 [[A], [B], [C]]
# 2 [[A], [C]]
# 3 [[A]]
You need to create the inner lists:
print (df)
ID Description
0 1 Aas
1 1 B
2 1 C
3 2 A
4 2 C
5 3 A
df = df['Description'].apply(lambda x: [x]).groupby(df['ID']).apply(list).reset_index()
Another solution, similar to #jp_data_analysis's but with a single apply:
df = df.groupby('ID')['Description'].apply(lambda x: [[y] for y in x]).reset_index()
And pure python solution:
a = list(zip(df['ID'], df['Description']))
d = {}
for k, v in a:
    d.setdefault(k, []).append([v])
df = pd.DataFrame({'ID': list(d.keys()), 'Description': list(d.values())},
                  columns=['ID', 'Description'])
print (df)
ID Description
0 1 [[Aas], [B], [C]]
1 2 [[A], [C]]
2 3 [[A]]