I have two Pandas DataFrames, i.e. Table 1:
+-------+-------------------+
| Name  | Class             |
+-------+-------------------+
| Alice | Physics           |
| Bob   | "" (empty string) |
+-------+-------------------+
Table 2:
+-------+-----------+
| Name | Class |
+-------+-----------+
| Alice | Chemistry |
| Bob | Math |
+-------+-----------+
Is there a way to combine them easily on the Class column so the resulting table looks like:
+-------+--------------------+
| Name | Class |
+-------+--------------------+
| Alice | Physics, Chemistry |
| Bob | Math |
+-------+--------------------+
I also want to make sure there are no extra commas left over from the empty values when the classes are joined. Thanks!
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob'],
                   'Class': ['Physics', np.nan]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'],
                    'Class': ['Chemistry', 'Math']})
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df3 = pd.concat([df, df2]).dropna(subset=['Class']).groupby('Name')['Class'].apply(list).reset_index()
# to turn the list of classes into a comma-separated string
df3['Class'] = df3['Class'].apply(lambda x: ', '.join(x))
Try with concat and groupby:
>>> pd.concat([df1, df2]).groupby("Name").agg(lambda x: ", ".join(i for i in x.tolist() if len(i.strip())>0)).reset_index()
Name Class
Alice Physics, Chemistry
Bob Math
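If the missing entries can be either NaN or empty strings (as in Table 1), here is a minimal sketch that normalizes both before joining, assuming the df/df2 setup from above:
import pandas as pd
import numpy as np
# treat empty strings like NaN, drop them, then join the remaining classes per Name
df3 = (pd.concat([df, df2])
         .replace({'Class': {'': np.nan}})
         .dropna(subset=['Class'])
         .groupby('Name', as_index=False)['Class']
         .agg(', '.join))
print(df3)
#     Name               Class
# 0  Alice  Physics, Chemistry
# 1    Bob                Math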
I'm trying to figure out how to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare it again. I was thinking of doing multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether df1.id exists in df2.id and, if true, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
["id"],
"left"
)
Then using functools.reduce you can construct a boolean expression to check if columns match in the 2 dataframes like this:
from functools import reduce
check_expr = reduce(
lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
[c for c in df1.columns if c != 'id'],
F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
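If only the single value column needs comparing (as in this example), a simpler sketch without reduce, assuming the joined df from above; rows with no match in df2 have a null df2_value, so they default to 0:
df3 = df.withColumn(
    "match",
    F.when(F.col("value") == F.col("df2_value"), 1).otherwise(0)
)
df3.show()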
For an assignment I have been asked to shorten the names of clients to only the first letter of each name where they are separated by a space character.
I found a lot of solutions for this in Python, but I am not able to translate this to a dataframe.
The DF looks like this:
| ID | Name |
| -------- | -------------- |
| 1 | John Doe |
| 2 | Roy Lee Winters|
| 3 | Mary-Kate Baron|
My desired output would be:
| ID | Name | Shortened_name|
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |
I've had some results with the code below, but it does not work when there are more than 2 names. I would also like the code to be more 'flexible', as some people have 4 or 5 names while others only have 1.
df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
.withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
.withColumn('Shortened_name', F.concat('col1', 'col2'))
You can split the Name column, then use the transform function on the resulting array to get the first letter of each element:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])
df1 = df.withColumn(
"Shortened_name",
F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)
df1.show()
# +---+---------------+--------------+
# | ID| Name|Shortened_name|
# +---+---------------+--------------+
# | 1| John Doe| JD|
# | 2|Roy Lee Winters| RLW|
# | 3|Mary-Kate Baron| MB|
# +---+---------------+--------------+
Or by using aggregate function:
df1 = df.withColumn(
"Shortened_name",
F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)
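If you are on Spark 3.1+, the same idea can be written with the DataFrame API's transform instead of a SQL expression; a sketch assuming the same df as above:
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.transform(F.split("Name", " "), lambda x: F.substring(x, 1, 1)), "")
)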
I would like to group by customer_id, so that I can collect the key/value pair of field_name and field_value as a JSON struct. So, for example, I have a dataframe like this:
customerID | field_name | field_value
-------------------------------------
A | age | 20
A | sex | M
A | country | US
B | country | US
B | age | 34
c | sex | F
All columns in the DataFrame have a String data type. What I want is this:
customerID | custom_attributes
-------------------------------------
A | {'age':'20', 'sex': 'M', 'country':'US'}
B | {'age':'34', 'country':'US'}
c | {'sex':'F'}
This is what I tried:
test = (data
.groupBy('customer_id')
.agg(
collect_list(struct(col('field_name'), col('field_value'))).alias('custom_attributes'))
)
But this only gets me as far as an array that I don't know how to flatten:
customer_id | custom_attributes
--------------------------------
A | [{'field_name':'sex', 'field_value':'M'},
| {'field_name':'age', 'field_value':'34'},
| {'field_name':'country', 'field_value':'US'}]
You need to do a pivot here:
import pyspark.sql.functions as F
df2 = (df.groupBy('customerID')
.pivot('field_name')
.agg(F.first('field_value'))
.select('customerID', F.to_json(F.struct('age', 'country', 'sex')).alias('custom_attributes'))
.orderBy('customerID')
)
df2.show(truncate=False)
+----------+-------------------------------------+
|customerID|custom_attributes |
+----------+-------------------------------------+
|A |{"age":"20","country":"US","sex":"M"}|
|B |{"age":"34","country":"US"} |
|c |{"sex":"F"} |
+----------+-------------------------------------+
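If the set of field names is not fixed, or you would rather avoid the pivot, a sketch using map_from_entries on the collected structs (assuming the same column names as above); it only emits the keys each customer actually has:
import pyspark.sql.functions as F
df2 = (df.groupBy('customerID')
       .agg(F.to_json(
           F.map_from_entries(F.collect_list(F.struct('field_name', 'field_value')))
       ).alias('custom_attributes'))
       .orderBy('customerID'))
df2.show(truncate=False)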
I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains a lot more rows. It is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations with merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}), so the _df2 column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
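Since _df2 is a selection from df2, rename(..., inplace=True) can also trigger a SettingWithCopyWarning; a sketch of an equivalent version that simply reassigns the renamed frame:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = pd.merge(_df1, _df2, how='left', on='SYSTEM_NAME')
_df3.head()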
I'm using dataframes from pandas and I have 2 tables:
The first one:
+----------------------------+
| ID | Code | Name | Desc |
------------------------------
| 00002 | AAAA | Aaaa | A111 |
------------------------------
| 12345 | BBBB | Bbbb | B222 |
------------------------------
| 01024 | EEEE | Eeee | E333 |
------------------------------
| 00010 | CCCC | Cccc | C444 |
------------------------------
| 00123 | ZZZZ | Zzzz | Z555 |
------------------------------
| ..... | .... | .... | .... |
+----------------------------+
The second table:
+--------------------------------+
| EID | Cat | emaN | No | cseD |
----------------------------------
| 00010 | 1 | | | |
----------------------------------
| 12345 | 1 | | | |
----------------------------------
| | 1 | | | |
+--------------------------------+
I want to update the second table with the values from the first one, so that it turns out:
+--------------------------------+
| EID | Cat | emaN | No | cseD |
----------------------------------
| 00010 | 1   | Cccc |    | C444 |
----------------------------------
| 12345 | 1 | Bbbb | | B222 |
----------------------------------
| | 1 | | | |
+--------------------------------+
But the difficulty is that the column names are different: the key ID -> EID and the values Name -> emaN, Desc -> cseD. The columns Cat (whose values are filled initially) and No (empty values) must remain unchanged in the output table. Also, in the 2nd table there can be empty EIDs, and those rows should remain as they were.
How is it possible to make such an update or merge?
Thanks.
Try pd.merge with the left_on and right_on parameters, since the columns you have to merge on have different names.
I am applying a check: if final_df['emaN'] is null, copy the value from Name (and likewise cseD from Desc).
Then drop the columns from df1 that are not required.
I have saved the result in a new dataframe final_df; if you want, you can save the data back into df2.
import numpy as np
import pandas as pd
final_df = pd.merge(df2,df1,left_on='EID' ,right_on='ID',how='left')
final_df['emaN'] = np.where(final_df['emaN'].isnull(),final_df['Name'],final_df['emaN'])
final_df['cseD'] = np.where(final_df['cseD'].isnull(),final_df['Desc'],final_df['cseD'])
final_df.drop(['ID','Code','Name','Desc'],axis=1,inplace=True)
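An alternative sketch using DataFrame.update, which keeps df2's shape and leaves rows with an empty EID untouched (column names assumed as shown in the question):
# build a lookup indexed by EID with the second table's column names
lookup = (df1.rename(columns={'ID': 'EID', 'Name': 'emaN', 'Desc': 'cseD'})
             .set_index('EID')[['emaN', 'cseD']])
df2 = df2.set_index('EID')
df2.update(lookup)      # overwrites emaN/cseD only where an EID matches
df2 = df2.reset_index()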
As far as I understand the question...
pd.merge(FirstDataFrame, SecondDataFrame, left_on='ID', right_on='EID', how='left')[['EID', 'Cat', 'emaN', 'No', 'cseD']]
or if you want to join on multiple fields
pd.merge(FirstDataFrame, SecondDataFrame, left_on=['ID', 'Name', 'Desc'],
         right_on=['EID', 'emaN', 'cseD'], how='left')[['EID', 'Cat', 'emaN', 'No', 'cseD']]