vlookup on text field using pandas - python

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains many more rows; it is actually a single column of a much larger table. I need to merge df1 with df2 so that the result looks like this:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have used VLOOKUP(,,1,TRUE) and copied the formula across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()

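You left out the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}). By default rename returns a new DataFrame and leaves the original untouched, so the column names of _df2 never changed and the merge could not find SYSTEM_NAME. Try this instead: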
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
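An alternative, in case inplace=True on a column slice triggers a SettingWithCopyWarning: keep the DataFrame that rename returns, or skip the rename and name the key column on each side of the merge. A minimal sketch, assuming sample frames shaped like the ones in the question (the extra "UNKNOWN" row and the variable names lookup, merged and merged2 are made up for illustration):

```python
import pandas as pd

# Sample frames mirroring the question (values are illustrative).
df1 = pd.DataFrame({"SYSTEM_NAME": ["DMSN", "PCSUS", "DAMJ", "UNKNOWN"]})
df2 = pd.DataFrame({
    "FEED_NAME": ["DMSN", "PCSUS", "DAMJ"],
    "Name": ["DMSN_YYYYMMDD.txt", "PCSUS_YYYYMMDD.txt", "DAMJ_YYYYMMDD.txt"],
    "Source": ["Main hub", "Basement", "Eiffel Tower router"],
})

# Option 1: rename returns a new DataFrame, so keep the result.
lookup = df2[["FEED_NAME", "Name", "Source"]].rename(columns={"FEED_NAME": "SYSTEM_NAME"})
merged = df1.merge(lookup, how="left", on="SYSTEM_NAME")

# Option 2: no rename; name the key on each side, then drop the duplicate key.
merged2 = (
    df1.merge(df2, how="left", left_on="SYSTEM_NAME", right_on="FEED_NAME")
    .drop(columns="FEED_NAME")
)
```

With how="left", every row of df1 is kept, and rows with no match in df2 get NaN in Name and Source, which is roughly what an exact-match VLOOKUP does with #N/A.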

Related

Pandas: How do I read specific columns in a file and make them into a new csv

Here is sample 1 :
| district_id | date |
| -------- | ----------- |
| 18 | 1995-03-24 |
| 1 | 1993-02-26 |
Sample 2:
| link_id | type |
| -------- | ----------- |
| 9 | gold |
| 19 | classic |
I want to gather sample 1's date column and sample 2's type column and output them as data.csv
You can concatenate the two columns side by side (axis = 1 is column-wise, not vertical, concatenation) and then write the result out:
df3 = pandas.concat([df1['date'], df2['type']], axis = 1)
df3.to_csv('data.csv')
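If the two samples live in files, the needed columns can be picked at read time. A small sketch, assuming comma-separated input; the io.StringIO objects stand in for the real files, whose names are not given in the question:

```python
import io
import pandas as pd

# In-memory stand-ins for the two input files (hypothetical contents).
sample1 = io.StringIO("district_id,date\n18,1995-03-24\n1,1993-02-26\n")
sample2 = io.StringIO("link_id,type\n9,gold\n19,classic\n")

# usecols loads only the requested column from each file.
dates = pd.read_csv(sample1, usecols=["date"])
types = pd.read_csv(sample2, usecols=["type"])

# axis=1 places the columns side by side, aligned on the row index.
combined = pd.concat([dates, types], axis=1)

# index=False keeps the row numbers out of the CSV; with no path given,
# to_csv returns the CSV text instead of writing a file.
csv_text = combined.to_csv(index=False)
```

Passing a path such as 'data.csv' to to_csv writes the same content to disk. Note that concat aligns on the index, so if the two files have different lengths the shorter column is padded with NaN.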

Unstack (pivot?) dataframe in Pandas

I have a dataframe somewhat like this:
ID | Relationship | First Name | Last Name | DOB | Address | Phone
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891
1 | 2 | Spouse | Bulma | Saiyan | 04/20/1969 | Saiyan Planet | 123-456-7891
2 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321
3 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870
4 | 4 | Child | Gohan | Kakarot | 04/02/2001 | Planet Earth | 321-654-9870
5 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568
I'm looking to have the rows with same ID appended to the right of the first row with that ID.
Example:
ID | Relationship | First Name | Last Name | DOB | Address | Phone | Spouse_First Name | Spouse_Last Name | Spouse_DOB | Child_First Name | Child_Last Name | Child_DOB |
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891 | Bulma | Saiyan | 04/20/1969 | | |
1 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321 | | | | | |
2 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870 | | | | Gohan | Kakarot | 04/02/2001 |
3 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568 | | | | | |
My real dataframe has more columns, but they all hold the same information when two rows share the same ID, so there is no need to duplicate them. I only need to append the columns that I choose, which in this case would be First Name, Last Name and DOB, with the new column labels prefixed by whatever is in the 'Relationship' column (I can rename them later if that is not possible in a straightforward way; I just wanted to illustrate my point).
I have tried different approaches, and it seems that unstack or pivot is the way to go, but I have not been able to make either work.
Any help would be greatly appreciated.
This solution assumes that the DataFrame is indexed by the ID column.
not_self = (
    df.query("Relationship != 'Self'")
    .pivot(columns='Relationship')
    .swaplevel(axis=1)
    .reindex(
        pd.MultiIndex.from_product(
            (
                set(df['Relationship'].unique()) - {'Self'},
                df.columns.to_series().drop('Relationship')
            )
        ),
        axis=1
    )
)
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
Please let me know if this is not what was wanted.
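For comparison, here is a variant that keeps ID as an ordinary column and spreads only the chosen fields. It is a sketch under assumptions, not a drop-in replacement: the sample frame reproduces part of the question's data, and the names fields, wide and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2, 2, 3, 4, 4, 5],
    "Relationship": ["Self", "Spouse", "Self", "Self", "Child", "Self"],
    "First Name": ["Vegeta", "Bulma", "Krilin", "Goku", "Gohan", "Freezer"],
    "Last Name": ["Saiyan", "Saiyan", "Human", "Kakarot", "Kakarot", "Fridge"],
    "DOB": ["01/01/1949", "04/20/1969", "08/21/1992",
            "05/04/1975", "04/02/2001", "09/15/1955"],
})

fields = ["First Name", "Last Name", "DOB"]  # columns to spread to the right
is_self = df["Relationship"] == "Self"

# One row per ID; columns become (field, relationship) pairs.
wide = df[~is_self].pivot(index="ID", columns="Relationship", values=fields)

# Flatten the two-level labels into names like "Spouse_First Name".
wide.columns = [f"{rel}_{field}" for field, rel in wide.columns]

# Left-join onto the Self rows; IDs without relatives get NaN.
result = df[is_self].merge(wide, left_on="ID", right_index=True, how="left")
```

One caveat: pivot raises a ValueError if an ID has two rows with the same Relationship (for example two children), so such data would need an extra counter column first.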

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id | nodes | tags |
|-----------|-----------------------------------------------------|---------------------|
| 62960871 | [783186590,783198852] | "{""foo"":""bar""}" |
| 211528816 | [2215187080,2215187140,2215187205,2215187256] | "{""foo"":""boo""}" |
| 62960872 | [783198772,783183397,783167527,783169067,783198772] | "{""foo"":""buh""}" |
I need to create a dataframe with one row for each pair of consecutive nodes in the list of nodes, then save it as parquet.
Each original row should produce n-1 rows, where n is len(nodes) for that row. The result would look like this (with other columns that I'll add):
| id | from | to | tags |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852 | 783186590 | 783198852 | "{""foo"":""bar""}" |
| 2215187080_2215187140 | 2215187080 | 2215187140 | "{""foo"":""boo""}" |
| 2215187140_2215187205 | 2215187140 | 2215187205 | "{""foo"":""boo""}" |
| 2215187205_2215187256 | 2215187205 | 2215187256 | "{""foo"":""boo""}" |
| 783198772_783183397 | 783198772 | 783183397 | "{""foo"":""buh""}" |
| 783183397_783167527 | 783183397 | 783167527 | "{""foo"":""buh""}" |
| 783167527_783169067 | 783167527 | 783169067 | "{""foo"":""buh""}" |
| 783169067_783198772 | 783169067 | 783198772 | "{""foo"":""buh""}" |
I started with the following:
from pyspark.sql.functions import udf
def split_ways_into_arcs(row):
    arcs = []
    for node in range(len(row['nodes']) - 1):
        arc = dict()
        arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
        arc['from'] = row['nodes'][node]
        arc['to'] = row['nodes'][node + 1]
        arc['tags'] = row['tags']
        arcs.append(arc)
    return arcs
# Declare function as udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Use transform to build a (from, to) struct for each position in the nodes array, explode the array into rows, and drop the last struct of each array, whose to is null:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
    .withColumn("nodes", F.explode("nodes")) \
    .filter("not nodes.to is null") \
    .selectExpr("concat_ws('_', nodes.from, nodes.to) as id", "nodes.*", "tags") \
    .show(truncate=False)
Output:
+---------------------+----------+----------+-----------------+
|id |from |to |tags |
+---------------------+----------+----------+-----------------+
|783186590_783198852 |783186590 |783198852 |{""foo"":""bar""}|
|2215187080_2215187140|2215187080|2215187140|{""foo"":""boo""}|
|2215187140_2215187205|2215187140|2215187205|{""foo"":""boo""}|
|2215187205_2215187256|2215187205|2215187256|{""foo"":""boo""}|
|783198772_783183397 |783198772 |783183397 |{""foo"":""buh""}|
|783183397_783167527 |783183397 |783167527 |{""foo"":""buh""}|
|783167527_783169067 |783167527 |783169067 |{""foo"":""buh""}|
|783169067_783198772 |783169067 |783198772 |{""foo"":""buh""}|
+---------------------+----------+----------+-----------------+
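The pairing rule itself is easy to check outside Spark. The plain-Python sketch below is only an illustration of the "n nodes give n-1 arcs" logic and is not part of the Spark pipeline; the function name split_way_into_arcs is made up.

```python
def split_way_into_arcs(nodes, tags):
    """Return one dict per pair of consecutive nodes (n - 1 arcs for n nodes)."""
    return [
        {"id": f"{start}_{end}", "from": start, "to": end, "tags": tags}
        # zip stops at the shorter sequence, so the last node has no partner.
        for start, end in zip(nodes, nodes[1:])
    ]

arcs = split_way_into_arcs(
    [2215187080, 2215187140, 2215187205, 2215187256], {"foo": "boo"}
)
```

Running it on a handful of rows collected from the DataFrame can serve as a spot check that the ids and the row count of the Spark result are what you expect.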

How to join two tables in PySpark with two conditions in an optimal way

I have the following two tables in PySpark:
Table A - dfA
| ip_4 | ip |
|---------------|--------------|
| 10.10.10.25 | 168430105 |
| 10.11.25.60 | 168499516 |
And table B - dfB
| net_cidr | net_ip_first_4 | net_ip_last_4 | net_ip_first | net_ip_last |
|---------------|----------------|----------------|--------------|-------------|
| 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | 168430080 | 168430335 |
| 10.10.11.0/24 | 10.10.11.0 | 10.10.11.255 | 168430336 | 168430591 |
| 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | 168493056 | 168558591 |
I have joined both tables in PySpark using the following command:
from pyspark.sql import functions as F
dfJoined = dfB.alias('b').join(F.broadcast(dfA).alias('a'),
                               (F.col('a.ip') >= F.col('b.net_ip_first')) &
                               (F.col('a.ip') <= F.col('b.net_ip_last')),
                               how='right').select('a.*', 'b.*')
So I obtain:
| ip_4 | net_cidr | net_ip_first_4 | net_ip_last_4 | ...
|---------------|---------------|----------------|---------------| ...
| 10.10.10.25 | 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | ...
| 10.11.25.60 | 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | ...
Given the size of the tables, this join is slow because of the two range conditions. I had thought of sorting table B so that the join needs only one condition.
Is there any way to limit the join so that it takes only the first record that matches the condition? Or is there a more efficient way to do this join?
Table A (number of records) << Table B (number of records)
Thank you!
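This is not a performance fix, but a way to sanity-check the range condition on a few rows before tuning the join. The sketch uses Python's standard ipaddress module to reproduce the integer columns and the net_ip_first <= ip <= net_ip_last predicate; the helper names ip_to_int and cidr_to_range are made up for illustration.

```python
import ipaddress

def ip_to_int(ip):
    """Dotted-quad IPv4 string -> integer, like the ip column of table A."""
    return int(ipaddress.IPv4Address(ip))

def cidr_to_range(cidr):
    """CIDR block -> (first, last) integers, like net_ip_first / net_ip_last."""
    net = ipaddress.IPv4Network(cidr)
    return int(net.network_address), int(net.broadcast_address)

ips = ["10.10.10.25", "10.11.25.60"]
cidrs = ["10.10.10.0/24", "10.10.11.0/24", "10.11.0.0/16"]

# Same predicate as the join: net_ip_first <= ip <= net_ip_last.
matches = {}
for ip in ips:
    value = ip_to_int(ip)
    matches[ip] = [
        cidr for cidr in cidrs
        if cidr_to_range(cidr)[0] <= value <= cidr_to_range(cidr)[1]
    ]
```

If the networks in table B can overlap (a /16 that contains some /24s, for instance), an address may match several rows, which matters for any "take only the first match" strategy.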

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, that is, join dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2, and 1 row in dataframe1. All data should be merged as one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicate rows:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
If both dataframes have the same columns, you can also do this in Scala with a union:
output = df1.union(df2)
First you need to aggregate each dataframe on its own, so that it has one row per EMP_CODE:
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)).alias('COLUMN1'))
You have to write this aggregation for every column and for every dataframe.
Then use the union function to combine the dataframes:
df1.union(df2)
and repeat the same aggregation on the unioned dataframe.
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
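Note that dropping duplicates alone leaves several rows per EMP_CODE; the union has to be followed by a per-key aggregation to collapse them. The idea can be illustrated on a small sample with pandas (this is pandas, not Spark, and only a subset of the question's columns is used; the names stacked and merged are made up):

```python
import pandas as pd

cols = ["EMP_CODE", "COLUMN1", "COLUMN3", "COLUMN5", "COLUMN6", "COLUMN8", "COLUMN9"]
df1 = pd.DataFrame([["A10001", "B", None, None, None, None, None]], columns=cols)
df2 = pd.DataFrame([
    ["A10001", None, None, "C", None, None, None],
    ["B10001", None, None, None, None, None, "T2"],
    ["A10001", None, None, None, None, "B", None],
    ["A10001", None, "C", None, None, None, None],
    ["C10001", None, None, None, "C", None, None],
], columns=cols)

# Union: stack the rows of both frames, matching columns by name.
stacked = pd.concat([df1, df2], ignore_index=True)

# Collapse to one row per EMP_CODE, taking the first non-null value per column.
merged = stacked.groupby("EMP_CODE", as_index=False).first()
```

In Spark, a comparable approach would be unionByName followed by groupBy('EMP_CODE') with F.first(column, ignorenulls=True) for each column, assuming the blank cells are real nulls rather than empty strings.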
