select random columns from a very large dataframe in pyspark

select random columns from a very large dataframe in pyspark - python

I have a dataframe in pyspark which has around 150 columns. These columns are obtained from joining different tables. Now my requirement is to write the dataframe to a file but in a specific order like first write 1 to 50 columns then column 90 to 110 and then column 70 and 72. That is I want to select only specific columns along with rearranging them.
I know one one of the way is to use df.select("give your column order") but in my case, the columns are very large and it is not possible to write each and every column name in 'select'.
Please tell me how can I achieve this in pyspark.
Note- I cannot provide any sample data as the number of columns is very large and the column number is the main road blocker in my case.

It sounds like all that you want to do is to programmatically return the list of column names, pick out some slice or slices from that list, and then select that subset of columns in some order from your dataframe. You can do this by manipulating the list df.columns. As an example:
a=[list(range(10)),list(range(1,11)),list(range(2,12))]
df=sqlContext.createDataFrame(a,schema=['col_'+i for i in 'abcdefghij'])
df is a dataframe is with columns ['col_a', 'col_b', 'col_c', 'col_d', 'col_e', 'col_f', 'col_g', 'col_h', 'col_i', 'col_j']. You can return that list by calling df.columns which you can slice and reorder like you would any other python list. How you do that is up to you and which columns you want to select from the df and in which order. For example:
mycolumnlist=df.columns[8:9]+df.columns[0:5]
df[mycolumnlist].show()
Returns
+-----+-----+-----+-----+-----+-----+
|col_i|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+-----+
| 8| 0| 1| 2| 3| 4|
| 9| 1| 2| 3| 4| 5|
| 10| 2| 3| 4| 5| 6|
+-----+-----+-----+-----+-----+-----+

You can create list of columns programmatically
first_df.join(second_df, on-'your_condition').select([column_name for column_name in first_df.columns] + [column_name for column_name in second_df.columns])
You can select random subset of columns by using random.sample(first_df.columns, number_of_columns) function.
Hope this helps :)

Related

Combine values from multiple columns into one Pyspark Dataframe [duplicate]

This question already has answers here:
Concat multiple columns of a dataframe using pyspark
(1 answer)
Concatenate columns in Apache Spark DataFrame
(18 answers)
How to concatenate multiple columns in PySpark with a separator?
(2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has fields:
"id",
"fields_0_type" ,
"fields_0_price",
"fields_1_type",
"fields_1_price"
+------------------+--------------+-------------+-------------+---
|id |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+------------------+-----+--------+-------------+----------+
|1234| Return |45 |New |50 |
+--------------+----------+--------------------+------------+
How can I save the values of these values into two columns called "type" and"price" as a list and separate the values by ",". The ideal dataframe looks like this:
+--------------------------- +------------------------------+
|id |type | price
+---------------------------+------------------------------+
|1234 |Return,Upgrade |45,50
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks it works. But is there any way that I can get rid of the extra ","? These are caused by the fact that there are blank values in the columns. Is there a way just to not to take in those columns with blank values in it?
What it is showing now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New |
|New,New,Sale,Sale,New|
+------------------------------------------------------------------+

Cast all columns in array then use concat_ws function.
Example:
df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
columns=df.columns
columns.remove('id')
df.withColumn("type",concat_ws(",",array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]
df.withColumn("type",concat_ws(",",array(*type_cols))).withColumn("price",concat_ws(",",array(*price_cols))).\
drop(*type_cols,*price_cols).\
show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+

Calculating percentage of multiple column values of a Spark DataFrame in PySpark

I have multiple binary columns (0 and 1) in my Spark DataFrame. I want to calculate the percentage of 1 in each column and project the result in another DataFrame.
The input DataFrame dF looks like:
+------------+-----------+
| a| b|
+------------+-----------+
| 0| 1|
| 1| 1|
| 0| 0|
| 1| 1|
| 0| 1|
+------------+-----------+
Expected output would look like:
+------------+-----------+
| a| b|
+------------+-----------+
| 40| 80|
+------------+-----------+
40 (2/5) and 80 (4/5) is the percentage of 1 in columns a and b respectively.
What I tried so far is creating a custom aggregation function, passing over the two columns a and b to it, doing a group by to get the count of 0s and 1s, calculating percentages of 0s and 1s, and finally filtering the DataFrame to only keep the 1.
selection = ['a', 'b']
#F.udf
def cal_perc(c, dF):
grouped = dF.groupBy(c).count()
grouped = grouped.withColumn('perc_' + str(c), ((grouped['count']/5) * 100))
return grouped[grouped[c] == 1].select(['perc_' + str(c)])
dF.select(*(dF[c].alias(c) for c in selection)).agg(*(cal_perc(c, dF).alias(c) for c in selection)).show()
This does not seem to be working. I'm not able to figure out where I'm going wrong. Any help appreciated. Thanks.

If your columns in fact always are 0/1 and no other digits a mean should be equivalent.
It is implemented natively in spark.

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns from dataframe A, so a union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+

This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
'col_1',
f.when(
~f.isnull(f.col('b.col_2')),
f.col('b.col_2')
).otherwise(f.col('a.col_2')).alias('col_2'),
'b.col_3'
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
*[
['col_1'] +
[
f.when(
~f.isnull(f.col('b.{}'.format(c))),
f.col('b.{}'.format(c))
).otherwise(f.col('a.{}'.format(c))).alias(c)
for c in cols_to_update
] +
['b.col_3']
]
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()

I would opt for different solution, which I believe is less verbose, more generic and does not involve column listing. I would first identify subset of dfA that will be updated (replaceDf) by performing inner join based on keyCols (list). Then I would subtract this replaceDF from dfA and union it with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB).show()
Even though there will be different columns in dfA and dfB, you can still overcome this with obtaining list of columns from both DataFrames and finding their union. Then I would
prepare select query (instead of "select.('a.')*") so that I would just list columns from dfA that exist in dfB + "null as colname" for those that do not exist in dfB.

If you want to keep only unique values, and require strictly correct results, then union followed by dropDupilcates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)

Order of rows in DataFrame after aggregation

Suppose I've got a data frame df (created from a hard-coded array for tests)
+----+----+---+
|name| c1|qty|
+----+----+---+
| a|abc1| 1|
| a|abc2| 0|
| b|abc3| 3|
| b|abc4| 2|
+----+----+---+
I am grouping and aggregating it to get df1
import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
| b| 2|
| a| 0|
+----+--------+
What is the expected order of the rows in df1 ?
Suppose now that I am writing a unit test. I need to compare df1 with the expected data frame. Should I compare them ignoring the order of rows. What is the best way to do it ?

The ordering of the rows in the dataframe is not fixed. There is an easy way to use the expected Dataframe in test cases
Do a dataframe diff . For scala:
assert(df1.except(expectedDf).count == 0)
And
assert(expectedDf.except(df1).count == 0)
For python you need to replace except by subtract
From documentation:
subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.

How to iterate over a pyspark.sql.Column?

I have a pyspark DataFrame and I want to get a specific column and iterate over its values. For example:
userId itemId
1 2
2 2
3 7
4 10
I get the userId column by df.userId and for each userId in this column I want to apply a method. How can I achieve this?

Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,2),(2,2),(3,7),(4,10)], ['userId', 'itemId'])
Secondly, create the function and convert it into an UDF function that can be used by PySpark:
def item_description(itemId):
items = {2 : "iPhone 8",
7 : "Apple iMac",
10 : "iPad"}
return items[itemId]
item_description_udf = udf(item_description,StringType())
Finally, add a new column for ItemDescription and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription",item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
| 1| 2| iPhone 8|
| 2| 2| iPhone 8|
| 3| 7| Apple iMac|
| 4| 10| iPad|
+------+------+---------------+

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

select random columns from a very large dataframe in pyspark - python

Related

Combine values from multiple columns into one Pyspark Dataframe [duplicate]

Calculating percentage of multiple column values of a Spark DataFrame in PySpark

How to update a pyspark dataframe with new values from another dataframe?

Order of rows in DataFrame after aggregation

How to iterate over a pyspark.sql.Column?

Categories

Resources