Update/merge 2 dataframes with different column names [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I'm using dataframes from pandas and I have 2 tables:
The first one:
+----------------------------+
| ID | Code | Name | Desc |
------------------------------
| 00002 | AAAA | Aaaa | A111 |
------------------------------
| 12345 | BBBB | Bbbb | B222 |
------------------------------
| 01024 | EEEE | Eeee | E333 |
------------------------------
| 00010 | CCCC | Cccc | C444 |
------------------------------
| 00123 | ZZZZ | Zzzz | Z555 |
------------------------------
| ..... | .... | .... | .... |
+----------------------------+
The second table:
+--------------------------------+
| EID | Cat | emaN | No | cseD |
----------------------------------
| 00010 | 1 | | | |
----------------------------------
| 12345 | 1 | | | |
----------------------------------
| | 1 | | | |
+--------------------------------+
I want to update the second table with the values from the first one, so that the result is:
+--------------------------------+
| EID | Cat | emaN | No | cseD |
----------------------------------
| 00010 | 1 | Cccc |    | C444 |
----------------------------------
| 12345 | 1 | Bbbb | | B222 |
----------------------------------
| | 1 | | | |
+--------------------------------+
But the difficulty is that the column names are different: the key is ID -> EID and the value columns are Name -> emaN and Desc -> cseD, while the columns Cat (filled initially) and No (empty) must remain unchanged in the output table. Also, the 2nd table can have rows with an empty EID, and those rows should remain as they were.
How is it possible to make such an update or merge?
Thanks.

Try pd.merge with the left_on and right_on parameters when the columns you have to merge on have different names.
I check whether final_df['emaN'] is null and, if so, copy the value from Name (and likewise cseD from Desc).
Then drop the columns of df1 that are not required.
I have saved the result in a new dataframe final_df; if you want, you can save it back into df2.
import numpy as np
import pandas as pd

final_df = pd.merge(df2, df1, left_on='EID', right_on='ID', how='left')
final_df['emaN'] = np.where(final_df['emaN'].isnull(), final_df['Name'], final_df['emaN'])
final_df['cseD'] = np.where(final_df['cseD'].isnull(), final_df['Desc'], final_df['cseD'])
final_df.drop(['ID', 'Code', 'Name', 'Desc'], axis=1, inplace=True)
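For reference, here is a minimal self-contained sketch of the same approach, using toy frames rebuilt from the question's tables (the names df1 and df2 are assumptions):

import numpy as np
import pandas as pd

# Toy frames built from the question's tables (df1 = first table, df2 = second table)
df1 = pd.DataFrame({
    'ID':   ['00002', '12345', '01024', '00010', '00123'],
    'Code': ['AAAA', 'BBBB', 'EEEE', 'CCCC', 'ZZZZ'],
    'Name': ['Aaaa', 'Bbbb', 'Eeee', 'Cccc', 'Zzzz'],
    'Desc': ['A111', 'B222', 'E333', 'C444', 'Z555'],
})
df2 = pd.DataFrame({
    'EID':  ['00010', '12345', None],
    'Cat':  [1, 1, 1],
    'emaN': [None, None, None],
    'No':   [None, None, None],
    'cseD': [None, None, None],
})

final_df = pd.merge(df2, df1, left_on='EID', right_on='ID', how='left')
final_df['emaN'] = np.where(final_df['emaN'].isnull(), final_df['Name'], final_df['emaN'])
final_df['cseD'] = np.where(final_df['cseD'].isnull(), final_df['Desc'], final_df['cseD'])
final_df = final_df.drop(['ID', 'Code', 'Name', 'Desc'], axis=1)
print(final_df)  # EID 00010 gets Cccc/C444, 12345 gets Bbbb/B222; the row with an empty EID stays as it was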

As far as I understand the question...
pd.merge(FirstDataFrame, SecondDataFrame, left_on='ID', right_on='EID', how='left')[['EID', 'Cat', 'emaN', 'No', 'cseD']]
or if you want to join on multiple fields
pd.merge(FirstDataFrame, SecondDataFrame, left_on=['ID', 'Name', 'Desc'],
         right_on=['EID', 'emaN', 'cseD'], how='left')[['EID', 'Cat', 'emaN', 'No', 'cseD']]
Edit: the filter for the desired columns was added above (see comments).

Related

Pandas: How do I read specific columns in a file and make them into a new csv

Here is sample 1 :
| district_id | date |
| -------- | ----------- |
| 18 | 1995-03-24 |
| 1 | 1993-02-26 |
Sample 2:
| link_id | type |
| -------- | ----------- |
| 9 | gold |
| 19 | classic |
I want to gather sample 1's date column and sample 2's type column and output them as data.csv
You can concatenate the two columns side by side (axis=1) and then write the result to CSV:
import pandas

df3 = pandas.concat([df1['date'], df2['type']], axis=1)
df3.to_csv('data.csv')
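If the two samples live in CSV files, a rough end-to-end sketch could look like this (the file names sample1.csv and sample2.csv are assumptions):

import pandas as pd

# usecols keeps only the column of interest while reading each file
df1 = pd.read_csv('sample1.csv', usecols=['date'])
df2 = pd.read_csv('sample2.csv', usecols=['type'])

# Place the two columns side by side and write them out
pd.concat([df1['date'], df2['type']], axis=1).to_csv('data.csv', index=False)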

Unstack (pivot?) dataframe in Pandas

I have a dataframe somewhat like this:
ID | Relationship | First Name | Last Name | DOB | Address | Phone
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891
1 | 2 | Spouse | Bulma | Saiyan | 04/20/1969 | Saiyan Planet | 123-456-7891
2 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321
3 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870
4 | 4 | Child | Gohan | Kakarot | 04/02/2001 | Planet Earth | 321-654-9870
5 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568
I'm looking to have the rows with same ID appended to the right of the first row with that ID.
Example:
ID | Relationship | First Name | Last Name | DOB | Address | Phone | Spouse_First Name | Spouse_Last Name | Spouse_DOB | Child_First Name | Child_Last Name | Child_DOB |
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891 | Bulma | Saiyan | 04/20/1969 | | |
1 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321 | | | | | |
2 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870 | | | | Gohan | Kakarot | 04/02/2001 |
3 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568 | | | | | |
My real scenario dataframe has more columns, but they all hold the same information when two rows share the same ID, so there is no need to duplicate those in the other rows. I only need to add to the right the columns that I choose, which in this case would be First Name, Last Name and DOB, with the prefix of the new column label depending on what's in the 'Relationship' column (I can rename them later if this can't be done in a straightforward way; I just wanted to illustrate my point).
Now that I've said this, I want to add that I have tried different ways, and it seems that approaching this with unstack or pivot is the way to go, but I have not been successful in making it work.
Any help would be greatly appreciated.
This solution assumes that the DataFrame is indexed by the ID column.
import pandas as pd

not_self = (
    df.query("Relationship != 'Self'")
      .pivot(columns='Relationship')
      .swaplevel(axis=1)
      .reindex(
          pd.MultiIndex.from_product(
              (
                  set(df['Relationship'].unique()) - {'Self'},
                  df.columns.to_series().drop('Relationship')
              )
          ),
          axis=1
      )
)
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
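To see the shape of the output, here is a rough self-contained run on a cut-down toy frame (the data and column set are trimmed; ID is used as the index, as the solution assumes):

import pandas as pd

# Cut-down toy frame indexed by ID
df = pd.DataFrame({
    'Relationship': ['Self', 'Spouse', 'Self'],
    'First Name':   ['Vegeta', 'Bulma', 'Krilin'],
    'Last Name':    ['Saiyan', 'Saiyan', 'Human'],
    'DOB':          ['01/01/1949', '04/20/1969', '08/21/1992'],
}, index=pd.Index([2, 2, 3], name='ID'))

not_self = (
    df.query("Relationship != 'Self'")
      .pivot(columns='Relationship')
      .swaplevel(axis=1)
      .reindex(
          pd.MultiIndex.from_product((
              set(df['Relationship'].unique()) - {'Self'},
              df.columns.to_series().drop('Relationship'),
          )),
          axis=1,
      )
)
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
print(result)  # one row per ID; 'Spouse First Name', 'Spouse Last Name', 'Spouse DOB' filled for ID 2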
Please let me know if this is not what was wanted.

How to join two tables in PySpark with two conditions in an optimal way

I have the following two tables in PySpark:
Table A - dfA
| ip_4 | ip |
|---------------|--------------|
| 10.10.10.25 | 168430105 |
| 10.11.25.60 | 168499516 |
And table B - dfB
| net_cidr | net_ip_first_4 | net_ip_last_4 | net_ip_first | net_ip_last |
|---------------|----------------|----------------|--------------|-------------|
| 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | 168430080 | 168430335 |
| 10.10.11.0/24 | 10.10.11.0 | 10.10.11.255 | 168430336 | 168430591 |
| 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | 168493056 | 168558591 |
I have joined both tables in PySpark using the following command:
from pyspark.sql import functions as F

dfJoined = dfB.alias('b').join(F.broadcast(dfA).alias('a'),
                               (F.col('a.ip') >= F.col('b.net_ip_first')) &
                               (F.col('a.ip') <= F.col('b.net_ip_last')),
                               how='right').select('a.*', 'b.*')
So I obtain:
| ip | net_cidr | net_ip_first_4 | net_ip_last_4| ...
|---------------|---------------|----------------|--------------| ...
| 10.10.10.25 | 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | ...
| 10.11.25.60   | 10.11.0.0/16  | 10.11.0.0      | 10.11.255.255| ...
The size of the tables makes this option suboptimal because of the 2 conditions; I had thought of sorting table B so that the join only needs one condition.
Is there any way to limit the join and take only the first record that matches the join condition? Or some way to make the join in an optimal way?
Table A (number of records) << Table B (number of records)
Thank you!
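One possible sketch of keeping only the first matching record per IP, assuming the dfJoined frame from above and picking the most specific network by ordering on net_ip_first (this ordering choice is an assumption, not something stated in the question):

from pyspark.sql import functions as F, Window

# Sketch only: keep a single matching network per IP after the range join,
# here the one with the largest net_ip_first (i.e. the most specific range).
w = Window.partitionBy('ip').orderBy(F.col('net_ip_first').desc())
dfFirstMatch = (
    dfJoined
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .drop('rn')
)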

vlookup on text field using pandas

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains many more rows. It is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations with merge and join, but I keep getting KeyError: 'SYSTEM_NAME'
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}), so the _df2 column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
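Equivalently, you can assign the result of rename instead of mutating in place; a small sketch of that style alternative (same frames assumed):

import pandas as pd

_df1 = df1[['SYSTEM_NAME']]
# rename returns a new frame, so no inplace flag is needed
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = pd.merge(_df1, _df2, how='left', on='SYSTEM_NAME')
_df3.head()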

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There is 1 row with emp_code A10001 in dataframe1 and 3 rows in dataframe2; all of that data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on that unioned dataframe.
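A rough sketch of doing that for every non-key column at once, via a list comprehension over the columns of the unioned frame (this assumes blank cells are nulls and the value columns are strings):

from pyspark.sql import functions as F

# Collapse each EMP_CODE into one row by concatenating whatever non-null
# values appear in each non-key column (collect_list drops nulls).
value_cols = [c for c in df1.columns if c != 'EMP_CODE']
combined = df1.unionByName(df2)
merged = combined.groupBy('EMP_CODE').agg(
    *[F.concat_ws(' ', F.collect_list(F.col(c))).alias(c) for c in value_cols]
)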
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
