Python: Pandas dataframe, merge/join tables on different keys

I have 3 tables of following form:
import pandas as pd

df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Value1': [2012, 2014, 2013, 2014],
                    'Value2': [55, 40, 84, 31]})
df1 = df1.set_index("ISIN")

df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Symbol': ['a', 'b', 'c', 'd']})
df2 = df2.set_index("ISIN")

df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
                    '01.01.2020': [1, 2, 3, 4],
                    '01.01.2021': [3, 2, 3, 2]})
df3 = df3.set_index("Symbol")
My aim now is to merge all 3 tables together. I would proceed as follows:
Step 1 (merge df1 and df2):
result1 = pd.merge(df1, df2, on=["ISIN"])
print(result1)
The result is ok and gives me the table:
      Value1  Value2 Symbol
ISIN
1       2012      55      a
4       2014      40      b
7       2013      84      c
10      2014      31      d
In the next step I want to merge this result with df3, so as an intermediate step I merged df2 and df3:
result2 = pd.merge(df2, df3, on=["Symbol"])
print(result2)
My problem now is that the output is:
  Symbol  01.01.2020  01.01.2021
0      a           1           3
1      b           2           2
2      c           3           3
3      d           4           2
The ISIN column is lost here. And the step
result = pd.merge(result, result2, on=["ISIN"])
result.set_index("ISIN")
produces an error.
Is there an elegant way to merge these 3 tables together (with ISIN as the key column), and why is the key column lost in the second merge?

Just chain the merge operations:
result = df1.merge(df2.reset_index(), on='ISIN').merge(df3, on='Symbol')
Or, using your syntax, use result1 as the source for the second merge:
result1 = pd.merge(df1, df2.reset_index(), on=["ISIN"])
result2 = pd.merge(result1, df3, on=["Symbol"])
output:
   ISIN  Value1  Value2 Symbol  01.01.2020  01.01.2021
0     1    2012      55      a           1           3
1     4    2014      40      b           2           2
2     7    2013      84      c           3           3
3    10    2014      31      d           4           2
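As to why the key column disappears: merging on a column returns a result with a fresh default index, so df2's ISIN index is silently dropped in the df2/df3 merge. A quick check (a sketch, assuming the frames defined in the question):
# ISIN is df2's index, not a column, so a plain column merge discards it
print(pd.merge(df2, df3, on='Symbol').columns.tolist())
# ['Symbol', '01.01.2020', '01.01.2021']
# reset_index() turns ISIN back into a regular column that survives the merge
print(pd.merge(df2.reset_index(), df3, on='Symbol').columns.tolist())
# ['ISIN', 'Symbol', '01.01.2020', '01.01.2021']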

You should not set the index prior to joining if you wish to keep it as part of the data in your dataframe. I suggest first merging, then setting the index to your desired value. In a single line:
output = df1.merge(df2,on='ISIN').merge(df3,on='Symbol')
Outputs:
   ISIN  Value1  Value2 Symbol  01.01.2020  01.01.2021
0     1    2012      55      a           1           3
1     4    2014      40      b           2           2
2     7    2013      84      c           3           3
3    10    2014    31      d           4           2
You can now set the index to ISIN by adding .set_index('ISIN') to output:
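In full (just the answer's own pieces combined):
output = df1.merge(df2, on='ISIN').merge(df3, on='Symbol').set_index('ISIN')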
      Value1  Value2 Symbol  01.01.2020  01.01.2021
ISIN
1       2012      55      a           1           3
4       2014      40      b           2           2
7       2013      84      c           3           3
10      2014      31      d           4           2

Related

Subtract rows from two dataframes based on index value

I have two dataframes:
df1 = pd.DataFrame({
    'Name':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'Value': [10, 9, 8, 10, 99, 88],
    'Day':   [1, 2, 3, 4, 1, 2]
})
df2 = pd.DataFrame({
    'Name':  ['C', 'C', 'C', 'C'],
    'Value': [1, 2, 3, 4],
    'Day':   [1, 2, 3, 4]
})
I would like to subtract the values in df2 from the values in df1, matched on day, and create a new dataframe called delta_values. If there are no entries for a day then no action should occur.
To explain further: B in the Name column only has values for days 1 and 2. df2's values for days 1 and 2 should be subtracted from B's values for days 1 and 2, but since B has no values for days 3 and 4, no arithmetic should occur there. I am having trouble with this part.
The output I am looking for is:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
If nothing better comes to somebody's mind, here's a correct but not very elegant solution:
result = df1.set_index(['Day', 'Name']).unstack()['Value']\
    .subtract(df2.set_index('Day')['Value'], axis=0)\
    .stack().reset_index()
Make the result look like the expected output:
result.columns = 'Day', 'Name', 'Value'
result.Value = result.Value.astype(int)
result.sort_values(['Name', 'Day'], inplace=True)
result = result[['Name', 'Value', 'Day']]
We can merge the two DataFrames on the Day column and then subtract from there.
merged = df1.merge(df2, how='inner', on='Day', suffixes=('', '_y'))
print(merged)
Name Value Day Name_y Value_y
0 A 10 1 C 1
1 A 9 2 C 2
2 A 8 3 C 3
3 A 10 4 C 4
4 B 99 1 C 1
5 B 88 2 C 2
delta_values = df1.copy()
delta_values['Value'] = merged['Value'] - merged['Value_y']
print(delta_values)
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
You can make do with either map or merge. Here's a map solution:
delta_values = df1.copy()
delta_values['Value'] -= delta_values['Day'].map(df2.set_index('Day')['Value']).fillna(0)
Output:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
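For completeness, a merge-based sketch of the same idea (my variant, not the original answer's code): a left join keeps every df1 row, and days missing from df2 simply subtract nothing. The '_sub' suffix is just a name chosen here.
tmp = df1.merge(df2[['Day', 'Value']], on='Day', how='left', suffixes=('', '_sub'))
delta_values = df1.copy()
# days absent from df2 produce NaN in Value_sub; fillna(0) makes them subtract nothing
delta_values['Value'] = tmp['Value'] - tmp['Value_sub'].fillna(0)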

Selecting rows from pandas dataframe limited by count per column value

I have a dataframe defined as follows:
df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})
For each class, I'd like to select the top n rows, where n is specified by the count column. For the dataframe above, the expected output keeps the first two A rows (ids 11, 12), the first B row (id 21), and the first two C rows (ids 31, 32).
How can I achieve this?
You could use
In [771]: df.groupby('class').apply(
              lambda x: x.head(x['count'].iloc[0])
          ).reset_index(drop=True)
Out[771]:
id class count
0 11 A 2
1 12 A 2
2 21 B 1
3 31 C 2
4 32 C 2
Use:
(df.groupby('class', as_index=False, group_keys=False)
   .apply(lambda x: x.head(x['count'].iloc[0])))
Output:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
Using cumcount
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
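To unpack how the cumcount filter works (a sketch against the df defined above):
# cumcount numbers the rows within each class starting at 0
print(df.groupby('class').cumcount().tolist())
# [0, 1, 2, 3, 0, 1, 0, 1, 2]
# shifting to 1-based and comparing with 'count' keeps only the first n rows per class
mask = (df.groupby('class').cumcount() + 1).le(df['count'])
print(mask.tolist())
# [True, True, False, False, True, False, True, True, False]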
Here is a solution which groups by class, then looks at the first count value in each smaller dataframe and returns the corresponding number of rows.
def func(df_):
    count_val = df_['count'].values[0]
    return df_.iloc[0:count_val]

df.groupby('class', group_keys=False).apply(func)
returns
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32

Simply put data on top of another - pandas / python

I have 2 sample datasets, dfa and dfb:
import pandas as pd
a = {
    'unit': ['A', 'B', 'C', 'D'],
    'count': [1, 12, 34, 52]
}
b = {
    'department': ['E', 'F'],
    'count': [6, 12]
}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
they look like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
What I want is simply to stack dfa on top of dfb, not based on any column or any index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right approach for my purpose.
My desired output is a dfc that looks like the dataset below; I want to keep both headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
.reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: merging 3 DFs:
pd.concat([dfa,
           pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
           pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
          ignore_index=True)
Option 1
You can construct it from scratch using np.vstack
import numpy as np

pd.DataFrame(
    np.vstack([dfa.values, dfb.columns, dfb.values]),
    columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
    '\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
dfa.loc[len(dfa),:] = dfb.columns
dfb.columns = dfa.columns
dfa.append(dfb)
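Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a concat-based version of the same idea (a sketch, mutating dfa in place as above):
dfa.loc[len(dfa), :] = dfb.columns  # write dfb's header in as a data row
dfb.columns = dfa.columns           # align the column labels
dfc = pd.concat([dfa, dfb], ignore_index=True)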

Union in more than 2 pandas dataframe

I am trying to convert a sql query to python. The sql statement is as follows:
select * from table 1
union
select * from table 2
union
select * from table 3
union
select * from table 4
Now I have those tables in 4 dataframes df1, df2, df3, df4 and I would like to union the 4 pandas dataframes so that the result matches the sql query.
I am not sure which operation is the equivalent of sql union.
Thanks in advance!!
Note:
The column names for all the dataframes are the same.
If I understand the issue correctly, you are looking for the concat function.
pandas.concat([df1, df2, df3, df4]) should work correctly if the column names are the same for all dataframes.
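One detail worth adding (my note, not part of the original answer): SQL's UNION also removes duplicate rows, while concat keeps them, so a closer equivalent chains drop_duplicates:
result = pd.concat([df1, df2, df3, df4], ignore_index=True).drop_duplicates()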
IIUC you can use merge and join on the matching_col column of all dataframes:
import pandas as pd

# Merge multiple dataframes
df1 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 5, 3: 7}),
                    "a": pd.Series({1: 52, 2: 42, 3: 7})}, columns=['matching_col', 'a'])
print(df1)
matching_col a
1 4 52
2 5 42
3 7 7
df2 = pd.DataFrame({"matching_col": pd.Series({1: 2, 2: 7, 3: 8}),
                    "a": pd.Series({1: 62, 2: 28, 3: 9})}, columns=['matching_col', 'a'])
print(df2)
matching_col a
1 2 62
2 7 28
3 8 9
df3 = pd.DataFrame({"matching_col": pd.Series({1: 1, 2: 0, 3: 7}),
                    "a": pd.Series({1: 28, 2: 52, 3: 3})}, columns=['matching_col', 'a'])
print(df3)
matching_col a
1 1 28
2 0 52
3 7 3
df4 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 9, 3: 7}),
                    "a": pd.Series({1: 27, 2: 24, 3: 7})}, columns=['matching_col', 'a'])
print(df4)
matching_col a
1 4 27
2 9 24
3 7 7
Solution1:
df = pd.merge(pd.merge(pd.merge(df1, df2, on='matching_col'), df3, on='matching_col'), df4, on='matching_col')
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
Solution2:
from functools import reduce  # reduce is no longer a builtin on Python 3

dfs = [df1, df2, df3, df4]
df = reduce(lambda left, right: pd.merge(left, right, on='matching_col'), dfs)
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
But if you only need to concat the dataframes, use concat and reset the index with the parameter ignore_index=True:
print(pd.concat([df1, df2, df3, df4], ignore_index=True))
matching_col a
0 4 52
1 5 42
2 7 7
3 2 62
4 7 28
5 8 9
6 1 28
7 0 52
8 7 3
9 4 27
10 9 24
11 7 7
This should be a comment on Jezrael's answer (+1'd for merge over concat) but I don't have sufficient reputation.
The OP asked how to union the dfs, but merge returns the intersection by default:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge
To get unions, add how='outer' to the merge calls.
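A sketch of Solution2 with that change (how='outer'), renaming the value columns up front so the automatic _x/_y suffixes cannot collide (newer pandas versions reject such collisions):
from functools import reduce

dfs = [df1, df2, df3, df4]
# give each 'a' column a distinct name before merging
dfs = [d.rename(columns={'a': 'a%d' % (i + 1)}) for i, d in enumerate(dfs)]
# an outer join keeps every matching_col value from every frame
df = reduce(lambda left, right: pd.merge(left, right, on='matching_col', how='outer'), dfs)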

Pandas: sum DataFrame rows for given columns

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will skip non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
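A version note (my addition, not part of the original answer): since pandas 2.0 non-numeric columns are no longer skipped silently, so pass numeric_only=True explicitly:
df['e'] = df.sum(axis=1, numeric_only=True)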
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, then:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum over certain rows, replace the ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
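One working way to mix a range with specific positions (my addition, using numpy's np.r_ index helper):
import numpy as np

# np.r_[0:2, 3] expands to [0, 1, 3], i.e. columns 'a', 'b' and 'd' here
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)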
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result: the award_sum column is added to awards_frame.
The following syntax helped me when the columns I want to sum are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
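One caveat worth noting (my addition): eval returns a new DataFrame by default rather than modifying df in place, so assign the result back (or pass inplace=True):
df = df.eval('e = a + b + d')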
