Pyspark groupBy DataFrame without aggregation or count

Is it possible to iterate over a PySpark groupBy DataFrame without aggregation or count?
For example, in pandas:
for i, d in df2:
    mycode ....
^^ if using pandas ^^
Is there a different way to iterate over a groupBy in PySpark, or do I have to use aggregation and count?

At best you can use .first and .last to get the respective values from the groupBy, but you cannot get all rows of a group the way you can in pandas.
For example:
from pyspark.sql import functions as f
df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()
Since there is a basic difference between the way data is handled in pandas and in Spark, not all functionality can be used in the same way.
There are a few workarounds to get what you want. For example, for the diamonds DataFrame:
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29| Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
You can use:
import pyspark.sql.functions as f
# Distinct values of the grouping column, collected to the driver.
l = [x.cut for x in diamonds.select("cut").distinct().collect()]
def groups(df, i):
    # Rows belonging to a single group.
    return df.filter(f.col("cut") == i)
# for multi grouping
def groups_multi(df, i):
    # Combine conditions with & (and); use | for or.
    return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))
for i in l:
    groups(diamonds, i).show(2)
output :
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 2 rows
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23|Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 12| 0.23|Ideal| J| VS1| 62.8| 56.0| 340|3.93| 3.9|2.46|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
...
In the function groups you can decide what kind of grouping you want for the data. It is a simple filter condition, but it gives you each group as a separate DataFrame.

When we do a groupBy we end up with a RelationalGroupedDataset (GroupedData in PySpark), which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further.
Calling a DataFrame method on the grouped object raises an error, for example:
AttributeError: 'GroupedData' object has no attribute 'show'

Yes. Don't use groupBy; use select with distinct instead:
df.select("col1", "col2", ...).distinct()
Then you can iterate over your DataFrame in any of several ways, for example:
1- Convert PySpark DF to Pandas.
DataFrame.toPandas()
2- If your DF is small, you could convert it to list.
DataFrame.collect()
3- Apply a method with foreach(your_method).
DataFrame.foreach(your_method)
4- Convert to RDD and use map with a lambda.
DataFrame.rdd.map(lambda x: your_method(x))

Related

Reversing Group By in PySpark

I am not sure whether the question itself is well posed. The solutions I've found for SQL either do not work in Hive SQL or require recursion, which is prohibited.
Thus, I'd like to solve the problem in PySpark and need a solution, or at least ideas on how to tackle it.
I have an original table which looks like this:
+--------+----------+
|customer|nr_tickets|
+--------+----------+
| A| 3|
| B| 1|
| C| 2|
+--------+----------+
This is how I want the table:
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+
Do you have any suggestions?
Thank you very much in advance!

For Spark 2.4+, use array_repeat with explode:
from pyspark.sql import functions as F
df.selectExpr("""explode(array_repeat(customer,cast(nr_tickets as int))) as customer""").show()
#+--------+
#|customer|
#+--------+
#| A|
#| A|
#| A|
#| B|
#| C|
#| C|
#+--------+

You can build a new DataFrame by iterating over the rows (groups).
First, make a list of Rows containing the customer (Row(customer=a["customer"])), repeated nr_tickets times for that customer using range(int(a["nr_tickets"])):
df_list + [Row(customer=a["customer"]) for T in range(int(a["nr_tickets"]))]
You can accumulate these in a list and later create a DataFrame from it:
df = spark.createDataFrame(df_list)
Overall:
from pyspark.sql import Row
df_list = []
# Collect the rows to the driver and repeat each customer nr_tickets times.
for a in df.select(["customer","nr_tickets"]).collect():
    df_list = df_list + [Row(customer=a["customer"]) for T in range(int(a["nr_tickets"]))]
df = spark.createDataFrame(df_list)
df.show()
You can also do it with a list comprehension:
from pyspark.sql import Row
from functools import reduce  # needed in Python 3
df_list = [
    [Row(customer=a["customer"])] * int(a["nr_tickets"])
    for a in df.select(["customer","nr_tickets"]).collect()
]
# Flatten the list of lists before building the DataFrame.
df = spark.createDataFrame(reduce(lambda x, y: x + y, df_list))
df.show()
Produces:
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+

In the meantime I have also found a solution myself:
for i in range(1, max_nr_of_tickets):
    table = table.filter(F.col('nr_tickets') >= 1).union(test)
    table = table.withColumn('nr_tickets', F.col('nr_tickets') - 1)
Explanation: The DataFrames "table" and "test" are the same at the beginning, and "max_nr_of_tickets" is just the highest "nr_tickets". It works.
I am only struggling with the format of the max number:
max_nr_of_tickets = df.select(F.max('nr_tickets')).collect()
I cannot use the result in the for loop's range because it is a list, so I enter the highest number manually.
Any ideas how I could get max_nr_of_tickets into the right format so that range will accept it?
Thanks

How to map each i-th element of a dataframe to a key from another dataframe defined by ranges in PySpark

What I want to do
Transform the input DataFrame df0 into the desired output df2, based on the clustering defined in df1.
What I have
df0 = spark.createDataFrame(
[('A',0.05),('B',0.01),('C',0.75),('D',1.05),('E',0.00),('F',0.95),('G',0.34), ('H',0.13)],
("items","quotient")
)
df1 = spark.createDataFrame(
[('C0',0.00,0.00),('C1',0.01,0.05),('C2',0.06,0.10), ('C3',0.11,0.30), ('C4',0.31,0.50), ('C5',0.51,99.99)],
("cluster","from","to")
)
What I want
df2 = spark.createDataFrame(
[('A',0.05,'C1'),('B',0.01,'C1'),('C',0.75,'C5'),('D',1.05,'C5'),('E',0.00,'C0'),('F',0.95,'C3'),('G',0.34,'C2'), ('H',0.13,'C4')],
("items","quotient","cluster")
)
Notes
The coding environment is PySpark within Palantir.
The structure and content of DataFrame df1 can be adjusted to simplify the code: df1 is what determines which cluster the items from df0 should be linked to.
Thank you very much in advance for your time and feedback!

This is a simple left join problem.
df0.join(df1, df0['quotient'].between(df1['from'], df1['to']), "left") \
    .select(*df0.columns, df1['cluster']).show()
+-----+--------+-------+
|items|quotient|cluster|
+-----+--------+-------+
| A| 0.05| C1|
| B| 0.01| C1|
| C| 0.75| C5|
| D| 1.05| C5|
| E| 0.0| C0|
| F| 0.95| C5|
| G| 0.34| C4|
| H| 0.13| C3|
+-----+--------+-------+

Applying map function on dataframe's columns

I need to merge all the values of each of the DataFrame's columns into a single value per column. So the columns stay intact, and I am just summing the values in each of them.
For this purpose I intend to use this function:
def sum_col(data, col):
    return data.select(f.sum(col)).collect()[0][0]
I was thinking of doing something like this:
data = data.map(lambda current_col: sum_col(data, current_col))
Is this doable, or do I need another way to merge all the values of the columns?

You can achieve this with the sum function:
import pyspark.sql.functions as f
df.select(*[f.sum(cols).alias(cols) for cols in df.columns]).show()
+----+---+---+
|val1| x| y|
+----+---+---+
| 36| 29|159|
+----+---+---+

To sum all your columns into a new column, you can use a list comprehension with Python's built-in sum function:
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
tst= sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)],schema=['val1','x','y'])
tst_sum= tst.withColumn("sum_col",sum([tst[coln] for coln in tst.columns]))
results:
tst_sum.show()
+----+---+---+-------+
|val1| x| y|sum_col|
+----+---+---+-------+
| 10| 7| 14| 31|
| 5| 1| 4| 10|
| 9| 8| 10| 27|
| 2| 6| 90| 98|
| 7| 2| 30| 39|
| 3| 5| 11| 19|
+----+---+---+-------+
Note: If you have imported the sum function from pyspark.sql.functions (from pyspark.sql.functions import sum), it shadows Python's built-in sum, so import it under a different name instead, e.g. from pyspark.sql.functions import sum as sum_pyspark.

How to rename multiple column names as single column?

I have a table which has columns [col1, col2, col3 .... col9].
How can I merge the data of all the columns into one column named col in Python?

from pyspark.sql.functions import concat
values = [('A','B','C','D'),('E','F','G','H'),('I','J','K','L')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| B| C| D|
| E| F| G| H|
| I| J| K| L|
+----+----+----+----+
req_column = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols',concat(*req_column))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
| A| B| C| D| ABCD|
| E| F| G| H| EFGH|
| I| J| K| L| IJKL|
+----+----+----+----+-----------------+

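Using Spark SQL (after registering the DataFrame as a temporary view named df):
df.createOrReplaceTempView("df")
new_df = sqlContext.sql("SELECT CONCAT(col1,col2,col3,col4) FROM df")
Without Spark SQL, you can use the concat function:
from pyspark.sql.functions import col, concat
new_df = df.withColumn('joined_column', concat(col('col1'),col('col2'),col('col3'),col('col4')))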

In Spark (PySpark), DataFrames are immutable, so existing data cannot be edited in place. What you can do is create a new column. Please see the following question:
How do I add a new column to a Spark DataFrame (using PySpark)?
Using a UDF, you can combine all the values in a row and return them as a single value.
A few cautions: look out for the following data issues when combining values:
Null values
Type mismatches
String encoding issues

Pyspark: filter function error with .isNotNull() and 2 other conditions

I'm trying to filter my dataframe in Pyspark and I want to write my results in a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if one of them is true the resulting row should be written in the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!

First off, for the col1 filter you could use isin, like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['attribute1', 'attribute2'])) | (df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have sample data to test it, sorry.

See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 5| null|
| 9| a|
| 1| b|
| 7| null|
| 3| null|
+---+-----+
Now we do the filter:
df = df.filter((df['id'] == 3) | (df['id'] == 9) | (~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 9| a|
| 1| b|
| 3| null|
+---+-----+
So you can see:
row(3, 'a') and row(3, null) are selected because of df['id']==3
row(9, 'a') is selected because of df['id']==9
row(1, 'b') is selected because of ~F.isnull('value'), while row(5, null) and row(7, null) are not selected.
