Python: Pandas dataframe, merge/join tables on different keys

I have 3 tables of following form:
import pandas as pd

df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Value1': [2012, 2014, 2013, 2014],
                    'Value2': [55, 40, 84, 31]})
df1 = df1.set_index("ISIN")

df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Symbol': ['a', 'b', 'c', 'd']})
df2 = df2.set_index("ISIN")

df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
                    '01.01.2020': [1, 2, 3, 4],
                    '01.01.2021': [3, 2, 3, 2]})
df3 = df3.set_index("Symbol")
My aim now is to merge all 3 tables together. I would proceed as follows:
Step 1 (merge df1 and df2):
result1 = pd.merge(df1, df2, on=["ISIN"])
print(result1)
The result is ok and gives me the table:
      Value1  Value2 Symbol
ISIN
1       2012      55      a
4       2014      40      b
7       2013      84      c
10      2014      31      d
In the next step I want to merge this result with df3, so as an intermediate step I merged df2 and df3:
result2 = pd.merge(df2, df3, on=["Symbol"])
print(result2)
My problem now is that the output is:
  Symbol  01.01.2020  01.01.2021
0      a           1           3
1      b           2           2
2      c           3           3
3      d           4           2
The ISIN column is lost here. And the step
result = pd.merge(result, result2, on=["ISIN"])
result.set_index("ISIN")
produces an error.
Is there an elegant way to merge these 3 tables together (with ISIN as the key column), and why is the key column lost in the second merge?

Just chain the merge operations:
result = df1.merge(df2.reset_index(), on='ISIN').merge(df3, on='Symbol')
Or, using your syntax, use result1 as the source for the second merge:
result1 = pd.merge(df1, df2.reset_index(), on=["ISIN"])
result2 = pd.merge(result1, df3, on=["Symbol"])
output:
   ISIN  Value1  Value2 Symbol  01.01.2020  01.01.2021
0     1    2012      55      a           1           3
1     4    2014      40      b           2           2
2     7    2013      84      c           3           3
3    10    2014      31      d           4           2
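As to why the key column disappears: merging on a column returns a result with a fresh default index, so df2's ISIN index is silently dropped in the df2/df3 merge. A quick check (a sketch, assuming the frames defined in the question):
# ISIN is df2's index, not a column, so a plain column merge discards it
print(pd.merge(df2, df3, on='Symbol').columns.tolist())
# ['Symbol', '01.01.2020', '01.01.2021']
# reset_index() turns ISIN back into a regular column that survives the merge
print(pd.merge(df2.reset_index(), df3, on='Symbol').columns.tolist())
# ['ISIN', 'Symbol', '01.01.2020', '01.01.2021']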

You should not set the index prior to joining if you wish to keep it as part of the data in your dataframe. I suggest first merging, then setting the index to your desired value. In a single line:
output = df1.merge(df2,on='ISIN').merge(df3,on='Symbol')
Outputs:
   ISIN  Value1  Value2 Symbol  01.01.2020  01.01.2021
0     1    2012      55      a           1           3
1     4    2014      40      b           2           2
2     7    2013      84      c           3           3
3    10    2014    31      d           4           2
You can now set the index to ISIN by adding .set_index('ISIN') to output:
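In full (just the answer's own pieces combined):
output = df1.merge(df2, on='ISIN').merge(df3, on='Symbol').set_index('ISIN')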
      Value1  Value2 Symbol  01.01.2020  01.01.2021
ISIN
1       2012      55      a           1           3
4       2014      40      b           2           2
7       2013      84      c           3           3
10      2014      31      d           4           2

Related

Subtract rows from two dataframes based on index value

I have two dataframes:
df1 = pd.DataFrame({
    'Name':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'Value': [10, 9, 8, 10, 99, 88],
    'Day':   [1, 2, 3, 4, 1, 2]
})
df2 = pd.DataFrame({
    'Name':  ['C', 'C', 'C', 'C'],
    'Value': [1, 2, 3, 4],
    'Day':   [1, 2, 3, 4]
})
I would like to subtract the values in df2 from the values in df1, matched on day, and create a new dataframe called delta_values. If there are no entries for a day then no action should occur.
To explain further: B in the Name column only has values for days 1 and 2. df2's values for days 1 and 2 should be subtracted from B's values for days 1 and 2, but since B has no values for days 3 and 4, no arithmetic should occur there. I am having trouble with this part.
The output I am looking for is:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
If nothing better comes to somebody's mind, here's a correct but not very elegant solution:
result = df1.set_index(['Day', 'Name']).unstack()['Value']\
    .subtract(df2.set_index('Day')['Value'], axis=0)\
    .stack().reset_index()
Make the result look like the expected output:
result.columns = 'Day', 'Name', 'Value'
result.Value = result.Value.astype(int)
result.sort_values(['Name', 'Day'], inplace=True)
result = result[['Name', 'Value', 'Day']]
We can merge the two DataFrames on the Day column and then subtract from there.
merged = df1.merge(df2, how='inner', on='Day', suffixes=('', '_y'))
print(merged)
Name Value Day Name_y Value_y
0 A 10 1 C 1
1 A 9 2 C 2
2 A 8 3 C 3
3 A 10 4 C 4
4 B 99 1 C 1
5 B 88 2 C 2
delta_values = df1.copy()
delta_values['Value'] = merged['Value'] - merged['Value_y']
print(delta_values)
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
You can make do with either map or merge. Here's a map solution:
delta_values = df1.copy()
delta_values['Value'] -= delta_values['Day'].map(df2.set_index('Day')['Value']).fillna(0)
Output:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
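For completeness, a merge-based sketch of the same idea (my variant, not the original answer's code): a left join keeps every df1 row, and days missing from df2 simply subtract nothing. The '_sub' suffix is just a name chosen here.
tmp = df1.merge(df2[['Day', 'Value']], on='Day', how='left', suffixes=('', '_sub'))
delta_values = df1.copy()
# days absent from df2 produce NaN in Value_sub; fillna(0) makes them subtract nothing
delta_values['Value'] = tmp['Value'] - tmp['Value_sub'].fillna(0)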

Selecting rows from pandas dataframe limited by count per column value

I have a dataframe defined as follows:
df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})
For each class, I'd like to select the top n rows, where n is specified by the count column. For the dataframe above, the expected output keeps the first two A rows (ids 11, 12), the first B row (id 21), and the first two C rows (ids 31, 32).
How can I achieve this?
You could use
In [771]: df.groupby('class').apply(
              lambda x: x.head(x['count'].iloc[0])
          ).reset_index(drop=True)
Out[771]:
id class count
0 11 A 2
1 12 A 2
2 21 B 1
3 31 C 2
4 32 C 2
Use:
(df.groupby('class', as_index=False, group_keys=False)
   .apply(lambda x: x.head(x['count'].iloc[0])))
Output:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
Using cumcount
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
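To unpack how the cumcount filter works (a sketch against the df defined above):
# cumcount numbers the rows within each class starting at 0
print(df.groupby('class').cumcount().tolist())
# [0, 1, 2, 3, 0, 1, 0, 1, 2]
# shifting to 1-based and comparing with 'count' keeps only the first n rows per class
mask = (df.groupby('class').cumcount() + 1).le(df['count'])
print(mask.tolist())
# [True, True, False, False, True, False, True, True, False]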
Here is a solution which groups by class, then looks at the first count value in each smaller dataframe and returns the corresponding number of rows.
def func(df_):
    count_val = df_['count'].values[0]
    return df_.iloc[0:count_val]

df.groupby('class', group_keys=False).apply(func)
returns
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32

Simply put data on top of another - pandas / python

I have 2 sample datasets, dfa and dfb:
import pandas as pd
a = {
    'unit': ['A', 'B', 'C', 'D'],
    'count': [1, 12, 34, 52]
}
b = {
    'department': ['E', 'F'],
    'count': [6, 12]
}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
they look like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
What I want is simply to stack dfa on top of dfb, not based on any column or any index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right approach for my purpose.
My desired output is a dfc that looks like the dataset below; I want to keep both headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
.reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: merging 3 DFs:
pd.concat([dfa,
           pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
           pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
          ignore_index=True)
Option 1
You can construct it from scratch using np.vstack
import numpy as np

pd.DataFrame(
    np.vstack([dfa.values, dfb.columns, dfb.values]),
    columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
    '\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
dfa.loc[len(dfa),:] = dfb.columns
dfb.columns = dfa.columns
dfa.append(dfb)
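Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a concat-based version of the same idea (a sketch, mutating dfa in place as above):
dfa.loc[len(dfa), :] = dfb.columns  # write dfb's header in as a data row
dfb.columns = dfa.columns           # align the column labels
dfc = pd.concat([dfa, dfb], ignore_index=True)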

Union in more than 2 pandas dataframe

I am trying to convert a sql query to python. The sql statement is as follows:
select * from table 1
union
select * from table 2
union
select * from table 3
union
select * from table 4
Now I have those tables in 4 dataframes df1, df2, df3, df4 and I would like to union the 4 pandas dataframes so that the result matches the sql query.
I am not sure which operation is the equivalent of sql union.
Thanks in advance!!
Note:
The column names for all the dataframes are the same.
If I understand the issue correctly, you are looking for the concat function.
pandas.concat([df1, df2, df3, df4]) should work correctly if the column names are the same for all dataframes.
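One detail worth adding (my note, not part of the original answer): SQL's UNION also removes duplicate rows, while concat keeps them, so a closer equivalent chains drop_duplicates:
result = pd.concat([df1, df2, df3, df4], ignore_index=True).drop_duplicates()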
IIUC you can use merge and join on the matching_col column of all dataframes:
import pandas as pd

# Merge multiple dataframes
df1 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 5, 3: 7}),
                    "a": pd.Series({1: 52, 2: 42, 3: 7})}, columns=['matching_col', 'a'])
print(df1)
matching_col a
1 4 52
2 5 42
3 7 7
df2 = pd.DataFrame({"matching_col": pd.Series({1: 2, 2: 7, 3: 8}),
                    "a": pd.Series({1: 62, 2: 28, 3: 9})}, columns=['matching_col', 'a'])
print(df2)
matching_col a
1 2 62
2 7 28
3 8 9
df3 = pd.DataFrame({"matching_col": pd.Series({1: 1, 2: 0, 3: 7}),
                    "a": pd.Series({1: 28, 2: 52, 3: 3})}, columns=['matching_col', 'a'])
print(df3)
matching_col a
1 1 28
2 0 52
3 7 3
df4 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 9, 3: 7}),
                    "a": pd.Series({1: 27, 2: 24, 3: 7})}, columns=['matching_col', 'a'])
print(df4)
matching_col a
1 4 27
2 9 24
3 7 7
Solution1:
df = pd.merge(pd.merge(pd.merge(df1, df2, on='matching_col'), df3, on='matching_col'), df4, on='matching_col')
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
Solution2:
from functools import reduce  # reduce is no longer a builtin on Python 3

dfs = [df1, df2, df3, df4]
df = reduce(lambda left, right: pd.merge(left, right, on='matching_col'), dfs)
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
But if you only need to concat the dataframes, use concat and reset the index with the parameter ignore_index=True:
print(pd.concat([df1, df2, df3, df4], ignore_index=True))
matching_col a
0 4 52
1 5 42
2 7 7
3 2 62
4 7 28
5 8 9
6 1 28
7 0 52
8 7 3
9 4 27
10 9 24
11 7 7
This should be a comment on Jezrael's answer (+1'd for merge over concat) but I don't have sufficient reputation.
The OP asked how to union the dfs, but merge returns the intersection by default:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge
To get unions, add how='outer' to the merge calls.
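A sketch of Solution2 with that change (how='outer'), renaming the value columns up front so the automatic _x/_y suffixes cannot collide (newer pandas versions reject such collisions):
from functools import reduce

dfs = [df1, df2, df3, df4]
# give each 'a' column a distinct name before merging
dfs = [d.rename(columns={'a': 'a%d' % (i + 1)}) for i, d in enumerate(dfs)]
# an outer join keeps every matching_col value from every frame
df = reduce(lambda left, right: pd.merge(left, right, on='matching_col', how='outer'), dfs)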

Pandas: sum DataFrame rows for given columns

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will skip non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
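A version note (my addition, not part of the original answer): since pandas 2.0 non-numeric columns are no longer skipped silently, so pass numeric_only=True explicitly:
df['e'] = df.sum(axis=1, numeric_only=True)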
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, then:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum over certain rows, replace the ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
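One working way to mix a range with specific positions (my addition, using numpy's np.r_ index helper):
import numpy as np

# np.r_[0:2, 3] expands to [0, 1, 3], i.e. columns 'a', 'b' and 'd' here
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)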
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result: the award_sum column is added to awards_frame.
The following syntax helped me when the columns I want to sum are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
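One caveat worth noting (my addition): eval returns a new DataFrame by default rather than modifying df in place, so assign the result back (or pass inplace=True):
df = df.eval('e = a + b + d')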
