Combining two dataframes based on specific column [duplicate]

I'm trying to combine two dataframes of NBA data. The first comes from a basketball-reference page and the second from a 538 stats page. I've already scraped both.
I want to combine them by player name. One dataframe has more rows than the other. How can I combine them? Both have a column named "Player".

I think you probably want to use pandas .merge(). With how='outer', players that appear in only one dataframe are kept, and their missing columns are filled with NaN.
import pandas as pd
df1 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo', 'bar', 'foo'],
                    'value': [1, 2, 3, 5, 7, 9]})
df2 = pd.DataFrame({'player': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
merged_df = df1.merge(df2, how='outer', on='player')
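For the case described in the question (two tables of different sizes sharing a "Player" column), a minimal sketch might look like the following. The player names and the stat columns 'PTS' and 'rating' are made up for illustration. The indicator=True argument adds a _merge column showing which source each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped tables
bref = pd.DataFrame({'Player': ['A', 'B', 'C'], 'PTS': [10, 20, 30]})
fte = pd.DataFrame({'Player': ['B', 'C', 'D'], 'rating': [1.5, -0.3, 2.1]})

# Outer join keeps every player from both tables
merged = bref.merge(fte, on='Player', how='outer', indicator=True)

# Inner join keeps only players present in both tables
both = bref.merge(fte, on='Player', how='inner')

print(merged)
print(both)
```

Use how='left' or how='right' instead if one table should decide which players are kept.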

Related

Removing duplicated rows in a pandas dataframe without considering order [duplicate]

I have a dataframe of the form:
import pandas as pd
df_1 = pd.DataFrame({
    'A': [0, 0, 1, 1, 1, 2],
    'B': [0, 1, 0, 1, 2, 1],
    'C': ['a', 'a', 'b', 'b', 'c', 'c']
})
I want to drop the rows where the pair of numbers from columns 'A' and 'B' is a duplicate regardless of order, so that (0, 1) and (1, 0) count as the same pair.
So what I want is:
df_1 = pd.DataFrame({
    'A': [0, 0, 1, 1],
    'B': [0, 1, 1, 2],
    'C': ['a', 'a', 'b', 'c']
})
My idea was to add a column containing the sorted pair as a string and then use the dataframe's drop_duplicates function, but since I'm working with a very large dataframe this solution is very expensive.
Do you have any suggestions? Thanks for the answers.
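No answer is reproduced here, so the following is only a sketch of one common approach, not necessarily what the linked answers recommend. It sorts each (A, B) pair with NumPy so that order no longer matters, then uses duplicated() on the sorted pairs, which avoids building strings. Note that the kept rows retain their original A and B values. For this input that happens to match the expected output.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    'A': [0, 0, 1, 1, 1, 2],
    'B': [0, 1, 0, 1, 2, 1],
    'C': ['a', 'a', 'b', 'b', 'c', 'c']
})

# Sort within each row so (1, 0) and (0, 1) become the same pair
pairs = np.sort(df_1[['A', 'B']].to_numpy(), axis=1)

# Flag every repeat of a pair after its first occurrence
is_repeat = pd.DataFrame(pairs, index=df_1.index).duplicated()

# Keep only the first occurrence of each unordered pair
result = df_1[~is_repeat].reset_index(drop=True)
print(result)
```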

Specify columns to output with Pandas Merge function

import pandas as pd
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})
I would like to merge these two dataframes on 'value'. However, I don't want the result to include all of the columns in df2. I would like to keep the 'valuea' column but not the 'valueb' column.
The code I have tried is
df1.merge(df2, on='value')
Is there a way to exclude the 'valueb' column using parameters of the merge function?

The merge function has no parameter for excluding columns.
Drop the unwanted column either after or before merging instead:
pd.merge(df1, df2, on='value').drop(columns=['valueb'])
pd.merge(df1, df2.drop(columns=['valueb']), on='value')
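As a quick check, the sketch below runs both approaches on the question's data and adds a third variant that selects the wanted df2 columns by name. All three should give the same result. The variable names after_merge, before_merge, and by_selection are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})

# Drop the column after merging
after_merge = pd.merge(df1, df2, on='value').drop(columns=['valueb'])

# Drop the column before merging (avoids carrying it through the join)
before_merge = pd.merge(df1, df2.drop(columns=['valueb']), on='value')

# Alternatively, list the df2 columns to keep
by_selection = pd.merge(df1, df2[['rkey', 'value', 'valuea']], on='value')

print(after_merge)
```

Dropping or selecting before the merge is usually preferable on large frames, since the unwanted column never takes part in the join.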

Add total row to dataframe with multi level index

Consider the following dataframe with a multi-level index:
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)
All I'm trying to do is add a 'Totals' row to the bottom (12, 15, 18 would be the expected values here). It seems like I need to calculate the totals and then append them to the dataframe, but I just can't get it to work while preserving the multi-level index (which I want to do). Thanks in advance!

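This does not preserve your multi-level index, but it does append a new row called "total" that contains the column sums, and it also adds a "total" column of row sums. DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)
# Turn the column sums into a one-row dataframe labelled 'total'
totals = df.sum().rename('total').to_frame().T
pd.concat([df, totals]).assign(total=lambda d: d.sum(axis=1))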

I figured it out. Thanks for the responses. Those plus a little more education about indices in Python got me to something that worked.
# Create df of totals
df2 = pd.DataFrame(df.sum())
# Transpose df
df2 = df2.T
# Reset index
df2 = df2.reset_index()
# Add additional column so the columns of df2 match the columns of df
df2['Index'] = "zTotal"
# Set indices to match df indices
df2 = df2.set_index(['index', 'Index'])
# Concat df and df2
df3 = pd.concat([df, df2])
# Sort in desired order
df3 = df3.sort_index(ascending=[False,True])
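A shorter variant of the same idea, offered as a sketch rather than the poster's method: build the totals row with its own two-level index and concatenate. The label ('Total', '') is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)

# One-row dataframe of column sums
totals = df.sum().to_frame().T

# Give the totals row a two-level label so it matches df's index depth
totals.index = pd.MultiIndex.from_tuples([('Total', '')])

# Both frames now have two index levels, so the result keeps a MultiIndex
result = pd.concat([df, totals])
print(result)
```

Because the totals row is concatenated last, it stays at the bottom without any sorting step.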

How to sort rows based on column combinations - python

Is there a way to sort a dataframe by a combination of columns, so that rows whose values match in specific columns are clustered together? Any help is greatly appreciated!
[Images: original DataFrame; transformed DataFrame]

One way to sort a pandas dataframe is to use .sort_values().
The code below replicates your sample dataframe:
import pandas as pd
df = pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
                   'v2': [2, 2, 4, 2, 3, 2],
                   'v3': [3, 3, 2, 3, 2, 3],
                   'v4': [4, 5, 1, 4, 2, 5]})
With the code below, you can sort the dataframe by both column v1 and column v2. In this case, v2 is only used to break ties in v1.
df.sort_values(by=['v1', 'v2'], ascending=True)
The "by" parameter is not limited to any number of columns, so you could extend the list to include more columns in the desired order.
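As a small illustration of tie-breaking, the sketch below sorts the same sample data by v3 ascending and then by v1 descending. Passing a list to ascending sets the direction for each column. The choice of columns here is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
                   'v2': [2, 2, 4, 2, 3, 2],
                   'v3': [3, 3, 2, 3, 2, 3],
                   'v4': [4, 5, 1, 4, 2, 5]})

# Primary key v3 (ascending); ties in v3 are broken by v1 (descending)
sorted_df = df.sort_values(by=['v3', 'v1'], ascending=[True, False])
print(sorted_df)
```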

This best matches the sort pattern shown in your image.
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1,3,2,1,4,3],
    v2=[2,2,4,2,3,2],
    v3=[3,3,2,3,2,3],
    v4=[4,5,1,4,2,5],
))
# Make a temp column to sort by: each row's values concatenated into one string
df['sort'] = df.astype(str).values.sum(axis=1)
# Sort the df by that column, drop it and reset the index
df = df.sort_values(by='sort').drop(columns='sort').reset_index(drop=True)
print(df)
Edit: Zolzaya Luvsandorj's recommendation is better:
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1,3,2,1,4,3],
    v2=[2,2,4,2,3,2],
    v3=[3,3,2,3,2,3],
    v4=[4,5,1,4,2,5],
))
# Sort by every column, left to right, then reset the index
df = df.sort_values(by=list(df.columns)).reset_index(drop=True)
print(df)

what is meaning of axis=1 in pandas sort_values function? [duplicate]

I have the following code snippet.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
                   'col2': [2, 1, 9, 8, 7, 4],
                   'col3': [0, 1, 9, 4, 2, 3]})
print(df)
sorted_df = df.sort_values(by=1, axis=1)
print(sorted_df)
The first print shows the original dataframe and the second shows the output of df.sort_values().
Can anyone explain what is happening here?

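The axis parameter controls what gets reordered: axis=0 (the default) sorts the rows, while axis=1 sorts the columns. With axis=1, "by" refers to an index (row) label rather than a column, so by=1 means "reorder the columns according to the values in the row labelled 1". It does not select col2. Note that row 1 of your dataframe mixes a string ('A') with integers, and on Python 3 comparing those types typically raises a TypeError, so this kind of sort only makes sense when the values in that row are mutually comparable.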
Some good examples here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
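A minimal sketch of the column-reordering behaviour of axis=1, using an all-numeric dataframe invented for illustration so that the values in each row can be compared:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1],
                   'b': [1, 2],
                   'c': [2, 0]})

# Reorder the columns by the values in the row labelled 0 (a=3, b=1, c=2)
by_row0 = df.sort_values(by=0, axis=1)

# Reorder the columns by the values in the row labelled 1 (a=1, b=2, c=0)
by_row1 = df.sort_values(by=1, axis=1)

print(by_row0)
print(by_row1)
```

In both results the rows stay in place and only the column order changes.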
