Pandas: merge two dataframes and drop extra rows - Python

How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like -
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
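For reference, a minimal sketch building the two example frames with the values from the tables above:
import pandas as pd

fdf = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2],
                    'name': ['Mark', 'Dart', 'Julia', 'Oolia', 'Talia']})
sdf = pd.DataFrame({'sample_id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'salary': [20, 30, 40, 50, 33, 23, 24, 28, 29],
                    'time': [0, 5, 10, 15, 0, 5, 10, 15, 20]})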

Because there are duplicate sample_id values, you need a helper column for a correct DataFrame.merge: use GroupBy.cumcount to number the duplicates within each group:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()), on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
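To see what the helper does, a quick sketch (tmp is just an illustrative name) shows the per-group counter that makes each duplicate sample_id row unique:
tmp = fdf.assign(g=fdf.groupby('sample_id').cumcount())
print(tmp)  # g runs 0, 1 within sample_id 1 and 0, 1, 2 within sample_id 2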

An alternative is to left-merge everything on sample_id, sort, and keep only the first match (lowest time) per (sample_id, name) pair:
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=[True, True, True], inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)

Related

How to sum up the value from previous row to subsequent rows pandas

I have a dataframe with the below specs
| ID | Name| count |
| -- |---- | ---- |
| 1 | A | 75 |
| 2 | B | 10 |
| 3 | A | 15 |
| 4 | A | 10 |
| 5 | A | 5 |
| 6 | A | 3 |
If I set the threshold for the count to 15, I want the amount above the threshold to be distributed uniformly across the other rows of the same Name group. So the output should be:
| ID | Name | count |
| -- |---- | ---- |
| 1 | A | 15 |
| 2 | B | 10 |
| 3 | A | 30 |
| 4 | A | 25 |
| 5 | A | 20 |
| 6 | A | 18 |
The 75 from ID 1 is capped and the excess is spread over the other rows of its "Name" group; the cap is always the threshold value. Please advise.
IIUC you can do something like:
import numpy as np

def distribute(sr, thres=15):
    idxmax = sr.idxmax()
    remain = max((sr[idxmax] - thres) / max(len(sr) - 1, 1), 0)
    return np.where(sr.index == idxmax, min(sr[idxmax], thres), sr + remain)

df['newcount'] = df.groupby('Name')['count'].transform(distribute, thres=15)
Output:
| ID | Name | count | newcount |
| -- | ---- | ----- | -------- |
| 1  | A    | 75    | 15       |
| 2  | B    | 10    | 10       |
| 3  | A    | 15    | 30       |
| 4  | A    | 10    | 25       |
| 5  | A    | 5     | 20       |
| 6  | A    | 3     | 18       |
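For reference, a minimal runnable sketch reproducing the example (the data is taken from the question's table; it assumes the distribute helper above is already defined):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Name': ['A', 'B', 'A', 'A', 'A', 'A'],
                   'count': [75, 10, 15, 10, 5, 3]})
df['newcount'] = df.groupby('Name')['count'].transform(distribute, thres=15)
print(df)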

Pandas sum column for each column pair

I have a dataframe as follows. I'm attempting to sum the values in the Total column, for each date, for each unique pair from columns P_buy and P_sell.
+--------+----------+-------+---------+--------+----------+-----------------+
| Index | Date | Type | Quantity| P_buy | P_sell | Total |
+--------+----------+-------+---------+--------+----------+-----------------+
| 0 | 1/1/2020 | 1 | 10 | 1 | 1 | 10 |
| 1 | 1/1/2020 | 1 | 10 | 2 | 1 | 20 |
| 2 | 1/1/2020 | 2 | 20 | 3 | 1 | 25 |
| 3 | 1/1/2020 | 2 | 20 | 4 | 1 | 20 |
| 4 | 2/1/2020 | 3 | 30 | 1 | 1 | 35 |
| 5 | 2/1/2020 | 3 | 30 | 2 | 1 | 30 |
| 6 | 2/1/2020 | 1 | 40 | 3 | 1 | 45 |
| 7 | 2/1/2020 | 1 | 40 | 4 | 1 | 40 |
| 8 | 3/1/2020 | 2 | 50 | 1 | 1 | 55 |
| 9 | 3/1/2020 | 2 | 50 | 2 | 1 | 53 |
+--------+----------+-------+---------+--------+----------+-----------------+
My desired output would be as follows, where for each unique P_buy/P_sell pair I receive the sum of Total:
+-------+--------+-------+
| P_buy | P_sell | Total |
+-------+--------+-------+
| 1     | 1      | 100   |
| 2     | 1      | 103   |
| 3     | 1      | 70    |
+-------+--------+-------+
My attempts have used the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
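If per-date totals are actually needed (the question also mentions "for each date"), a sketch would simply add Date to the grouping keys:
df.groupby(['Date', 'P_buy', 'P_sell'], as_index=False)['Total'].sum()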

Update table information based on columns of another table

I am new to Python and have two dataframes: df1 contains information about all students with their group and score, and df2 contains updated information (new group and score) for a few students. How can I update the information in df1 based on the values of df2?
df1
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 0 | 0.845435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 3 | 0.843209 |
| 9 | 9 | 4 | 0.84902 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 2 | 0.843043 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 1 | 0.85426 |
+----+----------+-----------+----------------+
df2
+----+-----------+----------+----------------+
| | group |student No| score |
|----+-----------+----------+----------------|
| 0 | 2 | 1 | 0.887435 |
| 1 | 0 | 19 | 0.81214 |
| 2 | 3 | 17 | 0.899041 |
| 3 | 0 | 8 | 0.853333 |
| 4 | 4 | 9 | 0.88512 |
+----+-----------+----------+----------------+
The result (df3):
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 2 | 0.887435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 0 | 0.853333 |
| 9 | 9 | 4 | 0.88512 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 3 | 0.899041 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 0 | 0.81214 |
+----+----------+-----------+----------------+
My code to update df1 from df2:
dfupdated = df1.merge(df2, how='left', on=['student No'], suffixes=('', '_new'))
dfupdated['group'] = np.where(pd.notnull(dfupdated['group_new']), dfupdated['group_new'], dfupdated['group'])
dfupdated['score'] = np.where(pd.notnull(dfupdated['score_new']), dfupdated['score_new'], dfupdated['score'])
dfupdated.drop(['group_new', 'score_new'], axis=1, inplace=True)
dfupdated.reset_index(drop=True, inplace=True)
but I face the following error
KeyError: "['group'] not in index"
I don't know what's wrong.
I ran the same code and got the expected result, but here is a different way to solve it. Try:
dfupdated = df1.merge(df2, on='student No', how='left')
dfupdated['group'] = dfupdated['group_y'].fillna(dfupdated['group_x'])
dfupdated['score'] = dfupdated['score_y'].fillna(dfupdated['score_x'])
dfupdated.drop(['group_x', 'group_y', 'score_x', 'score_y'], axis=1, inplace=True)
This will give you the result you want.
To get the maximum score in each group:
dfupdated.groupby(['group'], sort=False)['score'].max()
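As a side note, a more compact sketch (assuming the 'student No' values are unique in both frames) uses DataFrame.update, which aligns rows on the index and overwrites only where df2 provides values:
df1 = df1.set_index('student No')
df1.update(df2.set_index('student No'))  # in-place overwrite of group and score for matching students
df1 = df1.reset_index()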

Can we alter pandas cross tabulation?

I have loaded raw_data from MySQL using SQLAlchemy and PyMySQL:
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)
df is something like this
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the analysis as follows:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame that looks like this:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both labels are the columns name and the index name; the solution to change them is DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then reset the index name to the default, None:
age.columns.name = age.index.name
age.index.name = None
print(age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these names are essentially metadata, so some functions may drop them.
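If the goal is simply to turn the age bands into a regular column instead of the index, a sketch combining reset_index with clearing the columns name (same idea as above) would be:
age = pd.crosstab(df['Age Category'], df['Category']).reset_index()
age.columns.name = None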

How to add a new column to a PySpark dataframe that counts the column values greater than 0?

I want to add a new column to a PySpark dataframe that contains, for each row, the count of column values greater than 0.
Here is my demo dataframe.
+-----------+----+----+----+----+----+----+
|customer_id|2010|2011|2012|2013|2014|2015|
+-----------+----+----+----+----+----+----+
| 1 | 0 | 4 | 0 | 32 | 0 | 87 |
| 2 | 5 | 5 | 56 | 23 | 0 | 09 |
| 3 | 6 | 6 | 87 | 0 | 45 | 23 |
| 4 | 7 | 0 | 12 | 89 | 78 | 0 |
| 6 | 0 | 0 | 0 | 23 | 45 | 64 |
+-----------+----+----+----+----+----+----+
The dataframe above holds each customer's visits per year. I want to count how many years a customer visited, so I need a visit_count column with the number of year columns (2010-2015) whose value is greater than 0.
+-----------+----+----+----+----+----+----+-----------+
|customer_id|2010|2011|2012|2013|2014|2015|visit_count|
+-----------+----+----+----+----+----+----+-----------+
| 1 | 0 | 4 | 0 | 32 | 0 | 87 | 3 |
| 2 | 5 | 5 | 56 | 23 | 0 | 09 | 5 |
| 3 | 6 | 6 | 87 | 0 | 45 | 23 | 5 |
| 4 | 7 | 0 | 12 | 89 | 78 | 0 | 4 |
| 6 | 0 | 0 | 0 | 23 | 45 | 64 | 3 |
+-----------+----+----+----+----+----+----+-----------+
How can I achieve this?
Try this:
df = df.withColumn('visit_count', sum((df[col] > 0).cast('integer') for col in df.columns if col != 'customer_id'))
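The same idea spelled out with pyspark.sql.functions, in case the generator expression is unclear (year_cols is an illustrative name for the list of year columns, i.e. everything except customer_id):
from functools import reduce
from pyspark.sql import functions as F

year_cols = [c for c in df.columns if c != 'customer_id']
flags = [F.when(F.col(c) > 0, 1).otherwise(0) for c in year_cols]  # 1 if the customer visited that year
df = df.withColumn('visit_count', reduce(lambda a, b: a + b, flags))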
