Combine two dataframes and pick first entry based on common column [duplicate] - python

I have two dataframes like
df1
sub_id Weight
1 56
2 67
3 81
5 73
9 59
df2
sub_id Text
1 He is normal.
1 person is healthy.
1 has strong immune power.
3 She is over weight.
3 person is small.
9 Looks good.
5 Not well.
5 Need to be tested.
By combining these two dataframes I need to get the result below
(when there are multiple rows for a sub_id in the second dataframe, pick the first Text and combine it with the first dataframe):
merge_df
sub_id Weight Text
1 56 He is normal.
2 67 NaN
3 81 She is over weight.
5 73 Not well.
9 59 Looks good.
Can anyone help me out?
Thanks in advance.

Here you go:
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id'),
               on='sub_id',
               how='outer'))
Output
sub_id Weight Text
0 1 56 He is normal.
1 2 67 NaN
2 3 81 She is over weight.
3 5 73 Not well.
4 9 59 Looks good.
To keep the last duplicate instead, pass keep='last' to drop_duplicates:
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id', keep='last'),
               on='sub_id',
               how='outer'))
Output
sub_id Weight Text
0 1 56 has strong immune power.
1 2 67 NaN
2 3 81 person is small.
3 5 73 Need to be tested.
4 9 59 Looks good.
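For reference, a minimal self-contained sketch that rebuilds the example frames from the question and reproduces the keep-first result:

import pandas as pd

df1 = pd.DataFrame({'sub_id': [1, 2, 3, 5, 9],
                    'Weight': [56, 67, 81, 73, 59]})
df2 = pd.DataFrame({'sub_id': [1, 1, 1, 3, 3, 9, 5, 5],
                    'Text': ['He is normal.', 'person is healthy.',
                             'has strong immune power.', 'She is over weight.',
                             'person is small.', 'Looks good.',
                             'Not well.', 'Need to be tested.']})

# drop_duplicates keeps the first Text per sub_id by default;
# the outer merge preserves sub_ids that have no Text at all (here: 2)
merged = pd.merge(df1, df2.drop_duplicates(subset='sub_id'),
                  on='sub_id', how='outer')
print(merged)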

Related

Comparing Columns in a Pandas Dataframe

I have a pandas data frame with racing results.
Place BibNum Time
0 1 2 5:50
1 2 4 8:09
2 3 7 10:27
3 4 3 11:12
4 5 1 12:13
...
34 1 5 2:03
35 2 9 4:35
36 3 7 5:36
What I would like to know is how can I get a count of how many times the BibNum showed up where the Place was 1, 2, 3 etc?
I know that I can do a "value_counts" but that is for how many times it shows up in a single column. I also looked into using numpy "where" but that is using a conditional like greater than or less than.
IIUC, this is what you need:
out = df.groupby(['Place','BibNum']).size()
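A hedged, self-contained sketch against a small slice of the example data; the result is a Series indexed by (Place, BibNum) pairs:

import pandas as pd

# a slice of the racing results from the question
df = pd.DataFrame({'Place':  [1, 2, 3, 4, 5, 1, 2, 3],
                   'BibNum': [2, 4, 7, 3, 1, 5, 9, 7],
                   'Time':   ['5:50', '8:09', '10:27', '11:12', '12:13',
                              '2:03', '4:35', '5:36']})

# size() counts the rows in each (Place, BibNum) combination
out = df.groupby(['Place', 'BibNum']).size()
print(out)
# out.loc[3], for example, shows how often each BibNum placed 3rd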

Pandas groupby results assign to a new column

Hi, I'm trying to create a new column in my dataframe whose values are based on a calculation: each Student's share of the total Score within their Class. There are two different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help. Thanks
The problem is that a groupby aggregation returns a result indexed by the unique values of the grouping columns, so its index is incompatible with your original dataframe's index and pandas cannot align the assignment. If I understood your problem correctly, you can compute the shares as a separate dataframe and reset the index:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
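To attach those shares back onto the original rows, one hedged follow-up (the column name 'share' is my choice) is to merge on the group keys:

# after reset_index(), ndf has columns Class, Student and the ratio
# (still named Score); rename it and merge back on the group keys
ndf = ndf.rename(columns={'Score': 'share'})
df = df.merge(ndf, on=['Class', 'Student'])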
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
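A hedged side note, not from the original answers: on recent pandas versions, groupby(...).apply can return a result whose index no longer lines up with the original frame, so transform is the more robust way to broadcast the per-class sum back to the rows:

# transform('sum') returns a Series aligned to df's original index,
# so the division and the assignment line up row by row
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')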

Calculating the median for values in a column that match a condition

I am new to Pandas.
My dataset:
df
A B
10 1
15 2
65 3
54 2
51 2
96 1
I am trying to add a new column C containing the median of the values in A that fall in the same group, as defined by column B.
Expected result:
df
A B C
10 1 53
15 2 51
65 3 65
54 2 51
51 2 51
96 1 53
What I've tried:
df['C'] = df.groupby('B')['A'].transform('median')
I do get an answer, but because the DataFrame is large I am unsure whether my code is doing the right thing. Could someone tell me if this is the right way to achieve this?
You can use:
df['C'] = df.groupby('B')['A'].transform('median')
As provided in the comments, this is exactly the right approach: transform('median') computes the median once per B group and broadcasts it back to every row of that group.
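A minimal runnable sketch (reconstructing the example frame from the question) to confirm:

import pandas as pd

df = pd.DataFrame({'A': [10, 15, 65, 54, 51, 96],
                   'B': [1, 2, 3, 2, 2, 1]})

# transform returns a result aligned with df's index, one value per row
df['C'] = df.groupby('B')['A'].transform('median')
print(df)
# B=1 -> median(10, 96) = 53; B=2 -> median(15, 54, 51) = 51; B=3 -> 65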

How to cbind a vector to a dataframe in python and keep the vector name as the name of the new column?

In R, cbind(dataframe, new_column) will return the original dataframe with an extra column called "new_column"
What is the best practice for achieving this in Python (preferably using base Python or pandas)?
To make the question more concrete, suppose
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
and that the final output should be:
Day Visitors Bounce Rate new_column
0 1 43 65 2
1 2 34 67 4
2 3 65 78 6
3 4 56 65 8
4 5 29 45 10
5 6 76 52 12
You can do this:
df['new_column'] = new_column
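If you want the column label to travel with the data the way an R vector's name does with cbind, a hedged alternative is a named pandas Series combined via pd.concat (a sketch, assuming the df built above):

import pandas as pd

# a named Series carries its label with it, unlike a plain list
new_column = pd.Series([2, 4, 6, 8, 10, 12], name='new_column')

# concat along axis=1 uses the Series name as the new column label,
# which is the closest pandas analogue to R's cbind
df = pd.concat([df, new_column], axis=1)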

Plotting 3 cols of pandas dataframe as heatmap

I'm almost certain that this is a duplicate, but I'm having trouble finding the answer in a reasonable amount of time.
I have a dataframe with the three columns below:
COLS FUNCS FLUFF
32 1 1 3.24707
33 1 2 14.89260
34 1 3 48.60840
35 1 4 73.68160
36 2 1 4.19922
37 2 2 64.89260
38 2 3 87.91500
39 2 4 91.01560
40 4 1 23.58400
41 4 2 87.89060
42 4 3 95.38570
43 4 4 98.33980
44 8 1 34.47270
45 8 2 95.43460
46 8 3 99.04790
47 8 4 99.80470
I want to plot a heat map of these data with COLS on the horizontal axis and FUNCS on the vertical axis with cells that are scaled according to FLUFF. I don't want to use seaborn. I want to use matplotlib and/or pandas exclusively.
If you also have some insight on how to achieve a logarithmic color scheme, that would also be great.
df.set_index(['COLS', 'FUNCS']).FLUFF.unstack(0).pipe(plt.imshow)
should do it for you.
As cel mentioned in the comments, if your data is actually sparse, you might want to do a .reindex to insert all the rows and columns, filling the NaNs appropriately.
For the log scale have a look at http://matplotlib.org/api/ticker_api.html#matplotlib.ticker.LogFormatter
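Putting it together, a hedged sketch of the full pipeline, using matplotlib.colors.LogNorm for the logarithmic color scale (LogNorm drives the color mapping; the LogFormatter linked above formats tick labels):

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# pivot FLUFF into a 2-D grid: FUNCS down the rows, COLS across the columns
grid = df.set_index(['COLS', 'FUNCS']).FLUFF.unstack(0)

# LogNorm maps the cell values to colors on a logarithmic scale
plt.imshow(grid, norm=LogNorm(), aspect='auto')
plt.xticks(range(len(grid.columns)), grid.columns)
plt.yticks(range(len(grid.index)), grid.index)
plt.xlabel('COLS')
plt.ylabel('FUNCS')
plt.colorbar(label='FLUFF')
plt.show()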
