Let's say I have a DataFrame df with a MultiIndex ['siec', 'geo']:

siec  geo  value
a     DE   1
a     FR   2
and a mapping DataFrame mapping_df from geo to id_region with a single index ['geo']:
geo  id_region
DE   10
FR   20
=> How can I join/merge/replace the index column 'geo' of df with the values of the column 'id_region' from mapping_df?
Expected result with new multi index ['siec', 'id_region']:
siec  id_region  value
a     10         1
a     20         2
I tried the following code:
import pandas as pd
df = pd.DataFrame([{'siec': 'a', 'geo': 'DE', 'value': 1}, {'siec': 'a', 'geo': 'FR', 'value': 2}])
df.set_index(['siec', 'geo'], inplace=True)
mapping_df = pd.DataFrame([{'geo': 'DE', 'id_region': 10}, {'geo': 'FR', 'id_region': 20}])
mapping_df.set_index(['geo'], inplace=True)
joined_data = df.join(mapping_df)
merged_data = df.merge(mapping_df, left_index=True, right_index=True)
but it does not do what I want. It adds an additional column and keeps the old index.
siec  geo  value  id_region
a     DE   1      10
a     FR   2      20
=> Is there a convenient method for my use case or would I need to manually correct the index after a joining step?
As a workaround, I could reset_index() the DataFrames, do some joining manipulations and then reintroduce the MultiIndex.
However, I would like to avoid switching back and forth between the indexed and non-indexed versions of the DataFrames if possible.
Try as follows.
Use MultiIndex.get_level_values to select only level 1 (or: geo) and apply Index.map with mapping_df['id_region'] as mapper.
Wrap the result inside MultiIndex.set_levels to overwrite level 1.
Finally, chain Index.set_names to rename the level (or use MultiIndex.rename).
df.index = df.index.set_levels(
    df.index.get_level_values(1).map(mapping_df['id_region']), level=1
).set_names('id_region', level=1)

print(df)

                value
siec id_region
a    10             1
     20             2
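For comparison, the reset_index workaround mentioned in the question is also fairly short; this is a minimal sketch assuming df still has its original ['siec', 'geo'] index, at the cost of dropping and rebuilding the index around the merge:

# Sketch: temporarily drop the indexes, merge on the plain 'geo' column,
# then rebuild the MultiIndex with the new level.
joined = (df.reset_index()
            .merge(mapping_df.reset_index(), on='geo')
            .drop(columns='geo')
            .set_index(['siec', 'id_region']))
print(joined)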
Related
I'm looking to merge two dataframes across multiple columns but with some additional conditions.
import pandas as pd
df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl']
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno']
})
I would like to always join on col1 but then try to also join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns but it is always populated in df2. I would like the join to be valid when the col1 + one of optional_col2 or optional_col3 match.
This would result in ['a', 'b', 'c'] joining: 'a' due to an optional_col2 match, 'b' due to an optional_col3 match, and 'c' due to an exact match on both.
In SQL I suppose you could write the join as this, if it helps explain further:
select
    *
from
    df1
    inner join df2
        on df1.col1 = df2.col1
        AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)
I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
Expected DataFrame would be something like:
merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})
This solution works by creating an extra column called "temp" in both dataframes. In df1 it will be a column of True values. In df2 the values will be True if there is a match in either of the optional columns. It isn't clear whether you consider a NaN value to be matchable; if so, you need to fill the NaNs in df1's columns with the values from df2 before comparing, which is what the code below does. If that isn't required, drop the fillna calls.
df1["temp"] = True
optional_col2_match = df1["optional_col2"].fillna(df2["optional_col2"]).eq(df2["optional_col2"])
optional_col3_match = df1["optional_col3"].fillna(df2["optional_col3"]).eq(df2["optional_col3"])
df2["temp"] = optional_col2_match | optional_col3_match
Then use the "temp" column in the merge, and then drop it - it has served its purpose
pd.merge(df1, df2, on=["col1", "temp"]).drop(columns="temp")
This gives the following result
  col1 optional_col2_x optional_col3_x optional_col2_y optional_col3_y
0    a               X             abc               X             abc
1    b               Y             def               Y             def
2    c               Z             ghi               Z             ghi
You will need to decide what to do with the duplicated _x/_y columns here. In the example you gave there are no rows which match on just one of optional_col2 and optional_col3, which is why collapsing to a three-column result looks reasonable. This won't generally be the case.
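For reference, the two-merge-then-union idea from the question could be sketched roughly like this, assuming fresh copies of df1 and df2 (before the 'temp' columns are added and before df1's NaNs are filled); the suffixed duplicate columns would still need tidying up afterwards:

m2 = df1.merge(df2, on=['col1', 'optional_col2'])  # rows that match on optional_col2
m3 = df1.merge(df2, on=['col1', 'optional_col3'])  # rows that match on optional_col3
merged = (pd.concat([m2, m3])
            .drop_duplicates(subset='col1')
            .sort_values('col1')
            .reset_index(drop=True))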
I have a pandas dataframe that contains duplicates according to one column (ID), but has differing values in several other columns. My goal is to remove the duplicates based on ID, but to concatenate the information from the other columns.
Here is an example of what I'm working with:
ID   Age  Gender  Form        Signature  Level
000  30   M       Paper       Yes        A
000  30   M       Electronic  No         B
001  42           Paper       No         B
After processing, I would like the data to look like this:
ID   Age  Gender  Form               Signature  Level
000  30   M       Paper, Electronic  Yes, No    A, B
001  42           Paper              No         B
First, I filled the NaN cells with "Not Noted" so that I can use the groupby function. I tried the following code:
df = df.groupby(['ID', 'Age', 'Gender'])['Form'].apply(set).reset_index()
This takes care of concatenating the Form column, but I cannot figure out how to incorporate the Signature and Level columns as well. Does anyone have any suggestions?
You can do this by modifying each column separately and then concatenating the results with a basic list comprehension and the pd.concat function.
g = df.groupby(['ID', 'Age', 'Gender'])
concatCols = ['Form', 'Signature', 'Level']  # columns whose values should be concatenated
df = pd.concat([g[c].apply(set) for c in concatCols], axis=1).reset_index()
print(df)
You can do it like this:
import pandas as pd
df = pd.DataFrame({'ID': ['000', '000', '001'],
                   'Age': [30, 30, 42],
                   'Gender': ['M', 'M', ''],
                   'Form': ['Paper', 'Electronic', 'Paper'],
                   'Signature': ['Yes', 'No', 'No'],
                   'Level': ['A', 'B', 'B']})
df = df.groupby(['ID', 'Age', 'Gender']).agg({'Form': set, 'Signature': set, 'Level': set}).reset_index()
print(df)
Output:
    ID  Age Gender                 Form Signature  Level
0  000   30      M  {Electronic, Paper} {No, Yes} {B, A}
1  001   42                     {Paper}      {No}    {B}
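If the goal is the comma-separated strings shown in the question's expected output rather than sets, the same agg pattern can join the unique values per group instead; a sketch, starting again from the sample data above:

joiner = lambda s: ', '.join(s.unique())  # e.g. 'Paper, Electronic'
df = pd.DataFrame({'ID': ['000', '000', '001'],
                   'Age': [30, 30, 42],
                   'Gender': ['M', 'M', ''],
                   'Form': ['Paper', 'Electronic', 'Paper'],
                   'Signature': ['Yes', 'No', 'No'],
                   'Level': ['A', 'B', 'B']})
df = df.groupby(['ID', 'Age', 'Gender']).agg({'Form': joiner, 'Signature': joiner, 'Level': joiner}).reset_index()
print(df)

This reproduces the 'Paper, Electronic' / 'Yes, No' / 'A, B' strings from the expected result.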
I am merging two data frames with pandas. When joining, I would like the output not to include the join column of the right table.
Example:
import pandas as pd
age = [['tom', 10], ['nick', 15], ['juli', 14]]
df1 = pd.DataFrame(age, columns = ['Name', 'Age'])
toy = [['tom', 'GIJoe'], ['nick', 'car']]
df2 = pd.DataFrame(toy, columns = ['Name_child', 'Toy'])
df = pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
df.columns will give the output Index(['Name', 'Age', 'Name_child', 'Toy'], dtype='object'). Is there an easy way to obtain Index(['Name', 'Age', 'Toy'], dtype='object') instead? I can drop the column afterwards of course like this del df['Name_child'], but I'd like my code to be as short as possible.
Based on @mgc's comments, you don't have to rename the columns of df2 permanently. Just pass df2 to the merge function with its column renamed on the fly; df2's own column names will remain as they are.
df = pd.merge(df1,df2.rename(columns={'Name_child': 'Name'}),on='Name', how='left')
df
   Name  Age    Toy
0   tom   10  GIJoe
1  nick   15    car
2  juli   14    NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
df2.columns
Index(['Name_child', 'Toy'], dtype='object')
Set the index of the second dataframe to "Name_child". If you do this in the merge statement the columns in df2 remain unchanged.
df = pd.merge(df1,df2.set_index('Name_child'),left_on='Name',right_index=True,how='left')
This outputs the correct columns:
df
   Name  Age    Toy
0   tom   10  GIJoe
1  nick   15    car
2  juli   14    NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
It seems even simpler to just drop the column right after the merge.
df = (pd.merge(df1, df2, left_on='Name', right_on='Name_child', how='left')
        .drop('Name_child', axis=1))
#----------------
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!
Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe, with dataframes in the columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby, aggregating a list for each group, and combining the dataframes in that list with pd.concat.
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
       a
0      1
1      2
2      3
3      a
4      b
5      c
6    pie
7  steak
8   milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error when there are dataframes to aggregate.
Here, let's create a dataframe with dataframes as column values:
First, I start with three dataframes:
import pandas as pd
# creating dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})
# let's make a list of dictionaries:
list_of_dfs = [
{'name':'Bob' ,'table':df1},
{'name':'Joe' ,'table':df2},
{'name':'Bob' ,'table':df3}
]
# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape #shows (3, 2)
Now that we have the original dataframe as input, we will produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(). We also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
#check that Bob's table is now a concatenated table of df1 and df3:
new_df[new_df['name']=='Bob']['table'][0]
The output to the last line of code is:
   var1 letter
0  12.0     b1
1  34.0     b2
2  -4.0     b3
3   NaN     b4
0  22.0     b5
1  -3.0     b6
2   7.0     b7
3  78.0     b8
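If the repeated 0-3 row labels in Bob's combined table are unwanted, pd.concat can renumber the rows while concatenating; this is a small variation of the agg lambda above:

new_df = (original_df.groupby('name')['table']
          .agg(lambda series: pd.concat(series.tolist(), ignore_index=True))
          .reset_index())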
I have the pandas.DataFrame below:
One of the columns from the Dataframe, pontos, holds a dict for each of the rows.
What I want to do is add one column to the DataFrame for each key from this dict. So the new columns would be, in this example: rodada, mes, etc, and for each row, these columns would be populated with the respective value from the dict.
So far I've tried the following for one of the keys:
df_times["rodada"] = [df_times["pontos"].get('rodada') for d in df_times["pontos"]]
However, as a result I'm getting a new column rodada filled with None values:
Any hints on what I'm doing wrong?
You can create a new dataframe and concat it to the current one like this:
Code:
df2 = pd.concat([df, pd.DataFrame(list(df.pontos))], axis=1)
Test Code:
import pandas as pd
df = pd.DataFrame([
    ['A', dict(col1='1', col2='2')],
    ['B', dict(col1='3', col2='4')],
], columns=['X', 'D'])
print(df)
df2 = pd.concat([df, pd.DataFrame(list(df.D))], axis=1)
print(df2)
Results:
   X                           D
0  A  {'col2': '2', 'col1': '1'}
1  B  {'col2': '4', 'col1': '3'}

   X                           D col1 col2
0  A  {'col2': '2', 'col1': '1'}    1    2
1  B  {'col2': '4', 'col1': '3'}    3    4
You just need a slight change in your comprehension to extract that data.
It should be:
df_times["rodada"] = [d.get('rodada') for d in
df_times["pontos"]]
You want the values of the dictionary key 'rodada' to be the basis of your new column. So you iterate over those dictionary entries in the loop- in other words, d, and then extract the value by key to make the new column.
You can also use join and the pandas apply function:
df = df.join(df['pontos'].apply(pd.Series))
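On reasonably recent pandas (1.0 or later), pd.json_normalize is another option for expanding a column of dicts into columns; a sketch, assuming the dict column is named pontos as in the question:

expanded = pd.json_normalize(df_times['pontos'].tolist())
expanded.index = df_times.index  # keep the rows aligned with the original frame
df_times = df_times.join(expanded)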