Is there a simple function (in either pandas or NumPy) to create a new column of true/false values based on matching criteria across different dataframes?
I'm trying to compare two dataframes that both have an email column and see which emails in the first dataframe also appear in the second. The goal is to print a table that looks like this (where hola#lorem.com is present in both dataframes):
| id | email | match |
|:---|:----------------|:------|
| 1 | hola#lorem.com | true |
| 2 | adios#lorem.com | false |
| 3 | bye#lorem.com | false |
Thanks in advance for your help
You can use pd.DataFrame.assign:
df1 = df1.assign(match=df1["email"].isin(df2["email"]))
You can, for example, use the isin function:
df1['match'] = df1['email'].isin(df2['email'])
df2['match'] = df2['email'].isin(df1['email'])
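For reference, here is a minimal runnable sketch of the isin approach, using hypothetical frames built from the sample emails in the question (df2's second address is made up):

import pandas as pd

# Hypothetical data based on the sample emails above
df1 = pd.DataFrame({"id": [1, 2, 3],
                    "email": ["hola#lorem.com", "adios#lorem.com", "bye#lorem.com"]})
df2 = pd.DataFrame({"email": ["hola#lorem.com", "otra#lorem.com"]})

# isin checks each email in df1 against the emails present in df2
df1["match"] = df1["email"].isin(df2["email"])
print(df1)
#    id            email  match
# 0   1   hola#lorem.com   True
# 1   2  adios#lorem.com  False
# 2   3    bye#lorem.com  False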
This question already has answers here:
Concat multiple columns of a dataframe using pyspark (1 answer)
Concatenate columns in Apache Spark DataFrame (18 answers)
How to concatenate multiple columns in PySpark with a separator? (2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has the fields "id", "fields_0_type", "fields_0_price", "fields_1_type", and "fields_1_price":
+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+
How can I combine these values into two columns called "type" and "price", with the values in each separated by ","? The ideal dataframe looks like this:
+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks, it works. But is there any way to get rid of the extra ","? They are caused by blank values in the columns. Is there a way to skip the columns with blank values?
What it shows now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
Collect all the columns into an array, then use the concat_ws function.
Example:
df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')

df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]
df.withColumn("type",concat_ws(",",array(*type_cols))).withColumn("price",concat_ws(",",array(*price_cols))).\
drop(*type_cols,*price_cols).\
show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
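As for the extra "," in the update: concat_ws already skips nulls, so one hedged option, assuming the blanks are empty strings rather than nulls, is to drop the "" entries with array_remove (Spark 2.4+) before joining:

from pyspark.sql.functions import array, array_remove, concat_ws

# array_remove drops the "" entries; concat_ws then joins only the real values
df.withColumn("type", concat_ws(",", array_remove(array(*type_cols), ""))).\
   withColumn("price", concat_ws(",", array_remove(array(*price_cols), ""))).\
   drop(*type_cols, *price_cols).\
   show()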
I have a Spark DataFrame, like the one shown below.
I need an algorithm that, whenever it finds 'M' in a row, selects the next two columns after the 'M' and creates a new row. If there are two 'M's in a single row, then I need to create two rows: one with the two columns from the first 'M', and a second with the two columns from the second 'M'.
Input Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|_c9|_c10|_c11|_c12|_c13  |_c14|
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |M  |Food|20  |    |      |    |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |M  |Bar |50  |M   |Drinks|100 |
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
New Output Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|
+------+---+------------+---+--------+--------+-----+---+------+---+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Food  |20 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Bar   |50 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Drinks|100|
+------+---+------------+---+--------+--------+-----+---+------+---+
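A hedged sketch of one possible approach (not from the original post), assuming the columns after the first seven fixed ones always come in (marker, label, value) triples as in the sample: pack each triple into a struct, explode one row per triple, and keep only the triples whose marker is 'M'.

from pyspark.sql import functions as F

# Fixed columns from the sample; everything after them repeats in threes:
# an 'M' marker, a label (e.g. Drinks) and a value (e.g. 30).
fixed_cols = ["rownum", "_c0", "_c1", "_c2", "_c3", "_c4", "_c5"]
group_cols = [c for c in df.columns if c not in fixed_cols]
triples = [group_cols[i:i + 3] for i in range(0, len(group_cols), 3)]

# Pack each (marker, label, value) triple into a struct
groups = F.array(*[
    F.struct(F.col(m).alias("marker"), F.col(lbl).alias("label"), F.col(val).alias("value"))
    for m, lbl, val in triples
])

# One output row per triple, keeping only the triples that actually start with 'M'
result = (df.withColumn("grp", F.explode(groups))
            .where(F.col("grp.marker") == "M")
            .select(*fixed_cols, "grp.marker", "grp.label", "grp.value"))
result.show()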
This question already has answers here:
get first N elements from dataframe ArrayType column in pyspark (2 answers)
Closed 4 years ago.
I wish to remove the last element of the array in each row of this DataFrame. There is a link demonstrating the same thing, but with UDFs, which I wish to avoid. Is there a simple way to do this, something like list[:2]?
data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
| data|
+-------------------+
| [cat, dog, sheep]|
| [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+
Expected DataFrame:
+--------------+
| data|
+--------------+
| [cat, dog]|
| [bus, truck]|
| [ice, pizza]|
+--------------+
UDF is the best thing you can find for PySpark :)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements, keeping the column as an array of strings
split_row = udf(lambda row: row[:2], ArrayType(StringType()))

# Apply the UDF to each row
new_df = df.withColumn("data", split_row(df["data"]))
new_df.show()
# Output
+------------+
| data|
+------------+
| [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
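Since the question explicitly asked to avoid UDFs, a hedged alternative (not from the original answer) is Spark's built-in slice and size functions, available since Spark 2.4: slicing the array to its length minus one drops the last element without any Python UDF.

from pyspark.sql import functions as F

# Keep everything except the last element of the array
new_df = df.withColumn("data", F.expr("slice(data, 1, size(data) - 1)"))
new_df.show()
# Output
+------------+
|        data|
+------------+
|  [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+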
Suppose I've got a data frame df (created from a hard-coded array for tests)
+----+----+---+
|name| c1|qty|
+----+----+---+
| a|abc1| 1|
| a|abc2| 0|
| b|abc3| 3|
| b|abc4| 2|
+----+----+---+
I am grouping and aggregating it to get df1
import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
| b| 2|
| a| 0|
+----+--------+
What is the expected order of the rows in df1?
Suppose now that I am writing a unit test. I need to compare df1 with an expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?
The ordering of the rows in the dataframe is not fixed. There is an easy way to use the expected DataFrame in test cases:
do a dataframe diff. For Scala:
assert(df1.except(expectedDf).count == 0)
And
assert(expectedDf.except(df1).count == 0)
For Python you need to replace except with subtract.
From documentation:
subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
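For reference, a minimal sketch of the Python version of the same check (the helper name and expected_df are made up; note that subtract, like EXCEPT, is duplicate-insensitive, so rows that differ only in how many times they occur will not be caught):

def assert_same_rows(actual_df, expected_df):
    """Compare two DataFrames while ignoring row order."""
    assert actual_df.subtract(expected_df).count() == 0
    assert expected_df.subtract(actual_df).count() == 0

# usage in a test
assert_same_rows(df1, expected_df)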