Iterate pyspark dataframe rows and apply UDF - python

I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then, within each partition, iterate over the rows ordered by orderCol and apply some function to calculate a new column based on valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp
I understand I need to group by partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate over the rows and apply the logic I described, to get the desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+

I think the best way for you to do that is to apply a UDF on the whole set of data:
from pyspark.sql import functions as F

# first, create a struct holding the order col and the value col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol')))
# then collect those structs into one array per partition
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data"))
# finally, apply your function to that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
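Another option, if you are on Spark 3.0+ with pyarrow available, is a grouped-map pandas UDF via applyInPandas, which lets you keep a cached value in an ordinary Python loop per partition group. This is only a sketch: the condition, the cache update and the derived value are placeholders for your own foo logic, and the column types in the output schema are assumed.
import pandas as pd

# output schema: the input columns plus the new string column (types assumed)
out_schema = "partitionCol string, orderCol long, valueCol long, calculatedCol string"

def calc(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("orderCol")                  # respect orderCol inside the group
    cached_value = None                                # state carried across rows
    out = []
    for v in pdf["valueCol"]:
        if cached_value is None or v > cached_value:   # placeholder condition
            cached_value = v                           # placeholder cache update
        out.append("C" + str(cached_value))            # placeholder derived value
    return pdf.assign(calculatedCol=out)

result = df.groupBy("partitionCol").applyInPandas(calc, schema=out_schema)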

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a given id. What I want to achieve is to group those two rows and calculate a new type using the formula A/B, applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark, and so far I haven't been able to achieve the described result; I would appreciate any tips/solutions.
You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
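If you prefer to stay in the DataFrame API, the same result can be computed without the self-join by using conditional aggregation. A sketch, assuming each id has exactly one A row and one B row:
from pyspark.sql import functions as F

result_df = (
    df.groupBy("id")
      .agg(
          # ratio of the A-row value to the B-row value, per id
          (F.max(F.when(F.col("type") == "A", F.col("value_one")))
           / F.max(F.when(F.col("type") == "B", F.col("value_one")))).alias("value_one"),
          (F.max(F.when(F.col("type") == "A", F.col("value_two")))
           / F.max(F.when(F.col("type") == "B", F.col("value_two")))).alias("value_two"),
      )
      .withColumn("type", F.lit("C"))
      .select("id", "value_one", "type", "value_two")
)
result_df.show()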

how to find the sum of a dataframe?

While finding the sum as follows:
g.loc[g.index[0], 'sum'] = g[RDM].sum()
where RDM is
RDM = [f"R_Dist_meas_{i}" for i in range(48)]
the error was as follows:
KeyError: "None of [Index(['R_Dist_meas_0', 'R_Dist_meas_1', 'R_Dist_meas_2',\n .........................'R_Dist_meas_45', 'R_Dist_meas_46', 'R_Dist_meas_47'],\n dtype='object')] are in the [columns]"
The sample dataframe is as follows; it has many other columns besides distance (angle, velocity, etc.).
The columns come in repeating groups of the form A0 B0 C0, A1 B1 C1, A2 B2 C2, ..., A47 B47 C47:
| R_Dist_meas_0 | R_vel_meas_0 | R_Dist_meas_1 | R_vel_meas_1 | R_Dist_meas_2 | R_vel_meas_2 | ... | R_Dist_meas_47 | R_vel_meas_47 |
|---------------|--------------|---------------|--------------|---------------|--------------|-----|----------------|---------------|
| 5             |              |               |              |               |              |     |                |               |
|               |              |               |              | 10            |              |     |                |               |
|               |              |               |              | 8             |              |     |                |               |
| 2             |              | 8             |              |               |              |     |                |               |
The expected sum is 33.
How can I solve this?
Your list comprehension generates column names beyond what the dataframe shown actually has, since it only has columns up to R_Dist_meas_2. Also, if you use RDM as header keys you will be selecting columns, not rows.
sum(g.iloc[:,:2].sum())
The inner .sum() totals each selected column separately, and the outer sum() then adds those column totals together for the final summation. This should give you the sum you are looking for.
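If the KeyError simply means that some of the 48 generated names are not actually columns of g, a more direct fix is to keep only the names that exist before summing; a short sketch:
# keep only the R_Dist_meas_* columns that actually exist in the frame
existing = [c for c in RDM if c in g.columns]
g.loc[g.index[0], 'sum'] = g[existing].sum().sum()   # NaN cells are skipped by default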

PySpark search inside very large dataframe

I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the most efficient method to search the entire dataframe for a given list of values and remove the rows that contain any of those values?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe(df) is :
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use the collect() or toLocalIterator() functions, iterate over the rows, and remove them based on the list values.
Since it is a very large dataframe what is the best way to solve this?
I have come up with this solution now but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You got the correct approach of filtering the dataframe using the filter and isin functions. isin is fine if the list is small (a few thousand values, not millions). Also make sure that your dataframe is partitioned into at least 3 * the number of CPU cores on the executors; without enough partitions, parallelism will suffer.
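For example, that partitioning advice could look like the sketch below; the executor count and cores per executor are made up for illustration:
# hypothetical cluster: 8 executors with 4 cores each
df = df.repartition(3 * 8 * 4)   # at least 3 x total executor cores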
I am more comfortable with Scala, so take the concept from the code below: build a single Column object by OR-ing together the per-column isin conditions, then pass the negation of that combined column to dataframe.filter.
from pyspark.sql.functions import col

column_names = df.schema.names
# initialize with the first column's condition, then OR in the rest
col_final = col(column_names[0]).isin(list)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(list)
df = df.filter(~col_final)  # keep rows where no column matches the list

Drop duplicates based on first level column in MultiIndex DataFrame

I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
| | VECTOR | SEGMENTS | OVERALL | INDIVIDUAL |
| | | | TIP X | TIP Y | CURVATURE | TIP X | TIP Y | CURVATURE |
| 0 | (TOP, TOP) | 2 | 3.24 | 1.309 | 44 | 1.62 | 0.6545 | 22 |
| 1 | (TOP, BOTTOM) | 2 | 3.495 | 0.679 | 22 | 1.7475 | 0.3395 | 11 |
| 2 | (BOTTOM, TOP) | 2 | 3.495 | -0.679 | -22 | 1.7475 | -0.3395 | -11 |
| 3 | (BOTTOM, BOTTOM) | 2 | 3.24 | -1.309 | -44 | 1.62 | -0.6545 | -22 |
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
How can I drop duplicates based on all columns contained under 'OVERALL' or 'INDIVIDUAL'? So if I choose 'INDIVIDUAL', the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate.
Further, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass pandas .drop_duplicates a subset of tuples for multi-indexed columns:
df.drop_duplicates(subset=[
('INDIVIDUAL', 'TIP X'),
('INDIVIDUAL', 'TIP Y'),
('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
import numpy as np
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the two dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do the join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There is 1 row with emp_code A10001 in dataframe1, and 3 rows in dataframe2. All the data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicate rows:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
If both dataframes have the same columns, you can do that (in Scala or PySpark) with a union:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)).alias('COLUMN1'))
You have to write this for all columns and for all dataframes. Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on the unioned dataframe.
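A sketch of that idea applied to all columns at once, assuming the non-key columns are strings and that concatenating the collected values with a space is the merge rule you want:
from pyspark.sql import functions as F

value_cols = [c for c in df1.columns if c != 'EMP_CODE']
merged = df1.unionByName(df2)
# collect and concatenate each column's values per EMP_CODE
result = merged.groupBy('EMP_CODE').agg(
    *[F.concat_ws(" ", F.collect_list(c)).alias(c) for c in value_cols]
)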
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
