Specify columns to output with Pandas Merge function - python

import pandas as pd
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8],
'valuea': [9, 10, 11, 12],
'valueb': [13, 14, 15, 16]})
I would like to merge these two dataframes on 'value'. However, I don't want the result to include all of the columns in df2. I would like to keep the 'valuea' column but not the 'valueb' column.
The code I have tried is:
df1.merge(df2, on='value')
Is there a way to exclude the column with header 'valueb' using parameters in the merge function?

You cannot exclude columns with a parameter in the merge function.
Try these approaches instead:
pd.merge(df1, df2).drop(columns=['valueb'])
pd.merge(df1, df2.drop(columns=['valueb']))
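A third variant of the same idea, sketched on the assumption that you would rather list the columns to keep than the ones to drop: select the join key plus the wanted columns from df2 before merging. The name `result` is only for illustration, and 'rkey' is left out here; add it to the list if you need it.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})

# Keep only the join key and the columns wanted from df2, then merge.
result = df1.merge(df2[['value', 'valuea']], on='value')
print(result)
```

Only value 5 appears in both frames, so the default inner merge returns a single row.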

Related

Why does merging dataframes write off some of my data?

I tried to merge two dataframes:
df1 contains columns a, b, c and rows 1, 2, 3
df2 contains columns a, b, c and rows 4, 5, 6
When I use pd.merge(df1, df2), some of the row data disappears from the merged df. Why?
You can try this:
pd.concat([df1,df2])
It works.
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
columns=['a', 'b', 'c'])
pd.concat([df1, df2], ignore_index=True)
You will get a table with all rows from both dataframes. ignore_index=True renumbers the rows from 0 instead of repeating each dataframe's original index labels.
Also you can use:
df1.merge(df2, how='outer')
You should check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
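As for why the rows vanish: when no `on` argument is given, merge joins on every column the two frames share, and the default is how='inner', so only rows whose (a, b, c) values appear in both frames survive. A minimal sketch with the example frames above (the names `inner`, `outer` and `stacked` are just for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])

# Default merge: inner join on the shared columns a, b and c.
inner = pd.merge(df1, df2)
# Outer join keeps rows that appear in either frame.
outer = df1.merge(df2, how='outer')
# concat simply stacks the rows without matching anything.
stacked = pd.concat([df1, df2], ignore_index=True)

print(len(inner), len(outer), len(stacked))
```

Because no row of df1 matches a row of df2 here, the inner merge is empty, while the outer merge and the concat both keep all six rows.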

Add total row to dataframe with multi level index

Consider the following dataframe with a multi-level index:
arrays = [np.array(['bar', 'bar', 'baz']),
np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'],
index=arrays)
All I'm trying to do is add a 'Totals' row to the bottom (12, 15, 18 would be the expected values here). It seems like I need to calculate the totals and then append them to the dataframe, but I just can't get it to work while preserving the multi-level index (which I want to do). Thanks in advance!
This does not preserve your multi-level index, but it does append a new row called "total" that contains column sums:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz']),
np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'],
index=arrays)
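# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
pd.concat([df, df.sum().rename('total').to_frame().T]).assign(total=lambda d: d.sum(axis=1))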
I figured it out. Thanks for the responses. Those plus a little more education about indices in Python got me to something that worked.
# Create df of totals
df2 = pd.DataFrame(df.sum())
# Transpose df
df2 = df2.T
# Reset index
df2 = df2.reset_index()
# Add additional column so the columns of df2 match the columns of df
df2['Index'] = "zTotal"
# Set indices to match df indices
df2 = df2.set_index(['index', 'Index'])
# Concat df and df2
df3 = pd.concat([df, df2])
# Sort in desired order
df3 = df3.sort_index(ascending=[False,True])
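A shorter route to the same goal, offered as a sketch rather than the definitive way: build a one-row frame of column sums, give it a two-level index so it lines up with df, and concatenate. The label ('Total', '') and the names `totals` and `with_totals` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)

# One-row frame holding the column sums.
totals = df.sum().to_frame().T
# Give it a two-level index so it matches the index of df.
totals.index = pd.MultiIndex.from_tuples([('Total', '')])

# Stack the totals row under the original rows.
with_totals = pd.concat([df, totals])
print(with_totals)
```

Since both frames carry a two-level index, the result keeps the multi-level index and the totals row ends up last without any sorting trick.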

Combining two dataframes based on specific column

I'm attempting to combine different dataframes of NBA data. The first dataframe comes from a basketball-reference page and the second from a 538 stats page. I have already scraped both.
I want to combine them by player name. One of the dataframes is bigger than the other. How can I combine them? Both have a column named "Player".
I think you probably want to use pandas .merge().
import pandas as pd
df1 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo', 'bar', 'foo'],
'value': [1, 2, 3, 5, 7, 9]})
df2 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
merged_df = df1.merge(df2, how='outer', on='player')
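One thing worth knowing about the merge above, shown as a small sketch: a player who appears several times in both frames is paired up in every combination, and the two 'value' columns are renamed with suffixes. The suffixes '_bref' and '_538' are hypothetical labels chosen for this example; without them pandas falls back to '_x' and '_y'.

```python
import pandas as pd

df1 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo', 'bar', 'foo'],
                    'value': [1, 2, 3, 5, 7, 9]})
df2 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

# Outer join on the shared key; suffixes tell the two 'value' columns apart.
merged_df = df1.merge(df2, how='outer', on='player',
                      suffixes=('_bref', '_538'))

print(merged_df.columns.tolist())
print(len(merged_df))
```

Here 'foo' occurs three times in df1 and twice in df2, which yields six 'foo' rows. If each player should appear once, deduplicate or aggregate before merging.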

Merging 2 csv files

I want to merge 2 csv files. The resulting dataframe should have only the columns from csv 1.
For example:
df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'],'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'],'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2,how='outer')
Result df3:
name value
foo 1
bar 2
baz 3
foo 5
NaN 6
NaN 7
NaN 8
How can I get the above result using joins?
This should get you sorted:
import pandas as pd
df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'],'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'],'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2,how='outer')
# drop returns a new dataframe, so assign the result back
df3 = df3.drop(columns=[col for col in df2.columns if col not in df1.columns])
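Which gives the result shown in the question. A different approach places the two frames side by side; note that this keeps every column from both, including 'class':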
import pandas as pd
df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'],'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'],'value': [5, 6, 7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)
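Another option, sketched on the assumption that 'value' is the only column you need from df2: pass just that column into the merge, so there is nothing left to drop afterwards.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})

# Only the join key from df2 takes part, so 'class' never reaches the result.
df3 = df1.merge(df2[['value']], how='outer', on='value')
print(df3)
```

The result has the columns of df1 only, with NaN in 'name' for the values that exist only in df2.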

plotting a given column name across different data frames in python

All, I have multiple dataframes like
df1 = pd.DataFrame(np.array([
['a', 1, 2],
['b', 3, 4],
['c', 5, 6]]),
columns=['name', 'attr1', 'attr2'])
df2 = pd.DataFrame(np.array([
['a', 2, 3],
['b', 4, 5],
['c', 6, 7]]),
columns=['name', 'attr1', 'attr2'])
df3 = pd.DataFrame(np.array([
['a', 3, 4],
['b', 5, 6],
['c', 7, 8]]),
columns=['name', 'attr1', 'attr2'])
Each of these dataframes is generated at a specific time step, say T=[t1, t2, t3].
I would like to plot attr1 or attr2 from the different dataframes as a function of time T, for 'a', 'b' and 'c' all on the same graph.
Plot Attr1 VS time for 'a', 'b' and 'c'
If I understand correctly, first assign a column T to each of your dataframes, then concatenate the three. Then, you can groupby the name column, iterate through each, and plot T against attr1 or attr2:
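import matplotlib.pyplot as plt
# tag each dataframe with its time step, then stack them
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])
for name, data in dfs.groupby('name'):
    # one line per name: time on x, attribute on y
    plt.plot(data['T'], data['attr2'], label=name)
plt.xlabel('Time')
plt.ylabel('attr2')
plt.legend()
plt.show()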
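One caveat with the frames as built in the question: np.array turns a mix of strings and numbers into strings, so attr1 and attr2 hold text rather than numbers. A possible sketch that converts them back and reshapes to one column per name (the name `wide` and the integer time steps 1, 2, 3 are assumptions for this example):

```python
import numpy as np
import pandas as pd

cols = ['name', 'attr1', 'attr2']
df1 = pd.DataFrame(np.array([['a', 1, 2], ['b', 3, 4], ['c', 5, 6]]), columns=cols)
df2 = pd.DataFrame(np.array([['a', 2, 3], ['b', 4, 5], ['c', 6, 7]]), columns=cols)
df3 = pd.DataFrame(np.array([['a', 3, 4], ['b', 5, 6], ['c', 7, 8]]), columns=cols)

# Tag each frame with its time step and stack them.
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)],
                ignore_index=True)

# The mixed np.array stored the numbers as strings; convert them back.
dfs[['attr1', 'attr2']] = dfs[['attr1', 'attr2']].apply(pd.to_numeric)

# One row per time step, one column per name.
wide = dfs.pivot(index='T', columns='name', values='attr1')
print(wide)
```

With matplotlib installed, `wide.plot()` would then draw one line per name against T.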
