I want to intersect two Pandas dataframes (1 and 2) based on two columns (A and B) present in both dataframes. However, I would like to return a dataframe that only has data with respect to the data in the first dataframe, omitting anything that is not found in the second dataframe.
So for example:
Dataframe 1:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 2 | Extra | Columns | In | 1 |
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Dataframe 2:
A | B | Extra | Columns | In | 2 |
----------------------------------
1 | 3 | Extra | Columns | In | 2 |
1 | 4 | Extra | Columns | In | 2 |
1 | 5 | Extra | Columns | In | 2 |
should return:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Is there a way I can do this simply?
You can use df.merge:

df = df1.merge(df2, on=['A', 'B'], how='inner', suffixes=('', '_2'))[df1.columns]

how='inner' is the default; it is spelled out here only to make the join type explicit. The final [df1.columns] throws away everything that came from df2 (the suffixes keep df2's clashing column names out of the way).

As @piRSquared suggested, it is simpler to merge against just the key columns of df2 in the first place, so nothing has to be dropped afterwards:

df1.merge(df2[['A', 'B']], how='inner')
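Putting the suggestion together as a runnable sketch (the extra column names here are illustrative placeholders, not from the question):

```python
import pandas as pd

# Toy frames mirroring the question: shared keys A and B, plus payload columns
df1 = pd.DataFrame({'A': [1, 1, 1], 'B': [2, 3, 5], 'extra1': ['x', 'y', 'z']})
df2 = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 4, 5], 'extra2': ['p', 'q', 'r']})

# Semi-join: keep only df1 rows whose (A, B) pair also occurs in df2
result = df1.merge(df2[['A', 'B']], on=['A', 'B'], how='inner')
```

Only df1's columns survive, because nothing besides the key columns was taken from df2.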
Related
DF 1
| ColA | Colb | Stock | Date |
| -------- | -------------- | -------- | ---------- |
| A | 1 | 3 | 2022-26-12 |
| B | 2 | 3 | 2022-26-12 |
| C | 3 | 3 | 2022-26-12 |
DF 2
| ColA | Colb | Sales | Date |
| -------- | -------------- | -------- | ---------- |
| A | 1 | 1 | 2022-26-12 |
| B | 2 | 1 | 2022-26-12 |
| C | 3 | 1 | 2022-26-12 |
Given any number of columns to join on, how do you do DataFrame arithmetic in pandas? For instance, suppose I want to subtract the two DataFrames above to get something like this:
STOCK AT END OF THE DAY
| ColA | Colb | Stock | Date |
| -------- | -------------- | -------- | ---------- |
| A | 1 | 2 | 2022-26-12 |
| B | 2 | 2 | 2022-26-12 |
| C | 3 | 2 | 2022-26-12 |
So Stock - Sales, joining on all the common columns in this case.
Edit:
The equivalent SQL code to my problem is:
SELECT
DF1.ColA,
DF1.Colb,
DF1.Date,
DF1.Stock - coalesce(DF2.Sales, 0)
FROM
DF1
LEFT JOIN DF2
on
DF1.ColA = DF2.ColA and
DF1.Colb = DF2.Colb and
DF1.Date = DF2.Date
If the two frames are row-aligned (same rows, in the same order), you can subtract directly:

df3 = df1[['ColA', 'Colb', 'Date']].copy()
df3['Stock'] = df1.Stock - df2.Sales

However, if they are not aligned, merge them first and then do the arithmetic:

df3 = pd.merge(df1, df2, on=['ColA', 'Colb', 'Date'], how='inner')
df3['Stock'] = df3.Stock - df3.Sales

In your case, based on your edited question, use a left join and fill missing sales with 0 to mirror coalesce(DF2.Sales, 0):

df3 = pd.merge(df1, df2, how='left', on=['ColA', 'Colb', 'Date'])
df3['Stock'] = df3.Stock - df3.Sales.fillna(0)
#drop the helper column once the subtraction is done
df3 = df3.drop(columns='Sales')
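With the sample tables from the question, the left-join version can be exercised end to end (the Date strings are copied verbatim from the question's tables):

```python
import pandas as pd

df1 = pd.DataFrame({'ColA': ['A', 'B', 'C'], 'Colb': [1, 2, 3],
                    'Stock': [3, 3, 3], 'Date': ['2022-26-12'] * 3})
df2 = pd.DataFrame({'ColA': ['A', 'B', 'C'], 'Colb': [1, 2, 3],
                    'Sales': [1, 1, 1], 'Date': ['2022-26-12'] * 3})

# LEFT JOIN on the shared keys, then Stock - coalesce(Sales, 0)
df3 = df1.merge(df2, how='left', on=['ColA', 'Colb', 'Date'])
df3['Stock'] = df3['Stock'] - df3['Sales'].fillna(0)
df3 = df3.drop(columns='Sales')
```

fillna(0) matters when a (ColA, Colb, Date) key in df1 has no match in df2; without it the unmatched rows would produce NaN instead of leaving the stock unchanged.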
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed last year.
I have the 2 dataframes below:
df_1:
| | assign_to_id |
| | ------------ |
| 0 | 1, 2 |
| 1 | 2 |
| 2 | 3,4,5 |
df_2:
| | id | name |
| | ------------| -----------|
| 0 | 1 | John |
| 1 | 2 | Adam |
| 2 | 3 | Max |
| 3 | 4 | Martha |
| 4 | 5 | Robert |
I want to map the IDs in df_1 to the names in df_2 by matching their IDs.
final_df:
| | assign_to_name |
| | ----------------- |
| 0 | John, Adam |
| 1 | Adam |
| 2 | Max,Martha,Robert |
I don't know how to achieve this. Looking forward to some help.
The idea is to split the column on ,, map each id through a dictionary built from df_2, and then join the names back with ,:

d = df_2.assign(id=df_2['id'].astype(str)).set_index('id')['name'].to_dict()
f = lambda x: ','.join(d[y] for y in x.split(',') if y in d)
df_1['assign_to_name'] = df_1['assign_to_id'].replace(r'\s+', '', regex=True).apply(f)
print(df_1)
assign_to_id assign_to_name
0 1, 2 John,Adam
1 2 Adam
2 3,4,5 Max,Martha,Robert
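As a variant not shown in the answer above, the same mapping can be done with str.split, explode, and map; this is just an alternative sketch:

```python
import pandas as pd

df_1 = pd.DataFrame({'assign_to_id': ['1, 2', '2', '3,4,5']})
df_2 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'name': ['John', 'Adam', 'Max', 'Martha', 'Robert']})

# One id per row, look each up in df_2, then glue names back per original row
s = (df_1['assign_to_id']
     .str.split(',')
     .explode()
     .str.strip()
     .astype(int)
     .map(df_2.set_index('id')['name']))
df_1['assign_to_name'] = s.groupby(level=0).agg(','.join)
```

explode keeps the original row index, which is what lets groupby(level=0) reassemble the names in the right rows.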
I have two pandas DataFrames
# python 3
one is | A | B | C | and another is | D | E | F |
|---|---|---| |---|---|---|
| 1 | 2 | 3 | | 3 | 4 | 6 |
| 4 | 5 | 6 | | 8 | 7 | 9 |
| ......... | | ......... |
I want to get this expected result:
| A | D | E | F | C |
|---|---|---|---|---|
| 1 | 3 | 4 | 6 | 3 |
| 4 | 8 | 7 | 9 | 6 |
| ................. |
That is, df1['B'] should be replaced by the columns of df2, in its position.
I have tried

pd.concat([df1, df2], axis=1, sort=False)

and then dropping column df1['B'], but it doesn't seem very efficient.
Could it be solved by using insert() or another method?
I think your method is good; you can also remove the column before concatenating:
pd.concat([df1.drop('B', axis=1),df2], axis=1, sort=False)
Another method with DataFrame.join:
df1.drop('B', axis=1).join(df2)
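Note that both approaches append df2's columns at the end (A, C, D, E, F). If you need the expected order, with df2's columns sitting where B used to be, one sketch is to build the column list explicitly and reindex:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})
df2 = pd.DataFrame({'D': [3, 8], 'E': [4, 7], 'F': [6, 9]})

# Splice df2's columns into the slot where B used to be
pos = df1.columns.get_loc('B')
cols = [c for c in df1.columns if c != 'B']
cols[pos:pos] = list(df2.columns)          # ['A', 'D', 'E', 'F', 'C']
out = df1.drop('B', axis=1).join(df2)[cols]
```

Selecting with the rebuilt list is equivalent to calling insert() once per df2 column, but does the reordering in a single step.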
I am new to Pandas and I am trying to restructure a dataframe: remove the duplicates in the first column, while keeping a count of each duplicate and the sum of the corresponding values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
| | Name | Value |
+---+------+-------+
| 0 | A | 5 |
| 1 | B | 5 |
| 2 | C | 10 |
| 3 | A | 15 |
| 4 | A | 5 |
| 5 | C | 10 |
+---+------+-------+
[Out]:
+---+------+--------+-------+
| | Name | Number | Total |
+---+------+--------+-------+
| 0 | A | 3 | 25 |
| 1 | B | 1 | 5 |
| 2 | C | 2 | 20 |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.
The pandas agg function on a groupby is what you want.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd
df=pd.DataFrame({'Name':['A','B','C','A','A','C'],
'Value':[5,5,10,15,5,10]})
df.groupby('Name').agg(['count','sum'])
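That call returns MultiIndex columns like ('Value', 'count'). If the exact Number/Total column names from the expected output matter, named aggregation (available since pandas 0.25) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})

# Count the duplicates and total the values per name in one pass
out = (df.groupby('Name', as_index=False)
         .agg(Number=('Value', 'count'), Total=('Value', 'sum')))
```

groupby is a single hash-based pass over the data, so this stays fast at several hundred thousand rows.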
Hope that helps.
I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I wish to merge all the rows by name, adding up their number values so each sum stays in line with its name.
Output desired..
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby and aggregate sum:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
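A runnable sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'name': list('aaabbbcccddd'),
                   'number': [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3})

# Sum per name, keeping name as a regular column rather than the index
out = df.groupby('name', as_index=False)['number'].sum()
```

as_index=False and the reset_index() variant in the answer produce the same flat two-column frame; without either, name would become the index of the result.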
Assuming DataFrame is your table name:
Select name, SUM(number) [number] FROM DataFrame GROUP BY name
Insert the result after deleting the original rows