Reversing Group By in PySpark - python

I am not sure whether the question itself is well posed. The solutions I've found for SQL either do not work in Hive SQL or rely on recursion, which is prohibited.
I'd therefore like to solve the problem in PySpark and need a solution, or at least ideas on how to tackle it.
I have an original table which looks like this:
+--------+----------+
|customer|nr_tickets|
+--------+----------+
| A| 3|
| B| 1|
| C| 2|
+--------+----------+
This is how I want the table:
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+
Do you have any suggestions?
Thank you very much in advance!

For Spark 2.4+, use array_repeat with explode:
from pyspark.sql import functions as F
df.selectExpr("""explode(array_repeat(customer,cast(nr_tickets as int))) as customer""").show()
#+--------+
#|customer|
#+--------+
#| A|
#| A|
#| A|
#| B|
#| C|
#| C|
#+--------+

You can make a new DataFrame by iterating over the rows (groups).
First, make a list of Rows containing customer (Row(customer=a["customer"])), repeated nr_tickets times for that customer, using range(int(a["nr_tickets"])):
df_list + [Row(customer=a["customer"]) for _ in range(int(a["nr_tickets"]))]
You can store and append these in a list and later make a DataFrame from it:
df = spark.createDataFrame(df_list)
Overall:
from pyspark.sql import Row

df_list = []
for a in df.select(["customer", "nr_tickets"]).collect():
    df_list = df_list + [Row(customer=a["customer"]) for _ in range(int(a["nr_tickets"]))]

df = spark.createDataFrame(df_list)
df.show()
You can also do it with a list comprehension:
from pyspark.sql import Row
from functools import reduce  # Python 3

df_list = [
    [Row(customer=a["customer"])] * int(a["nr_tickets"])
    for a in df.select(["customer", "nr_tickets"]).collect()
]

df = spark.createDataFrame(reduce(lambda x, y: x + y, df_list))
df.show()
Produces
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+

In the meantime I have also found a solution myself:
for i in range(1, max_nr_of_tickets):
    table = table.filter(F.col('nr_tickets') >= 1).union(test)
    table = table.withColumn('nr_tickets', F.col('nr_tickets') - 1)
Explanation: the DataFrames "table" and "test" are identical at the start, and "max_nr_of_tickets" is simply the highest "nr_tickets". It works.
I am only struggling with the format of the max number:
max_nr_of_tickets = df.select(F.max('nr_tickets')).collect()
I cannot use the result in the for loop's range because it is a list, so for now I enter the highest number manually.
Any ideas how I could get max_nr_of_tickets into the right format so that range() accepts it?
Thanks
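One way to get a plain integer out of that collect() result is shown in the sketch below; the Row indexing is standard PySpark, and the variable names simply mirror the post.
from pyspark.sql import functions as F

# collect() returns a list of Row objects; take the first row's first field
max_nr_of_tickets = df.select(F.max('nr_tickets')).collect()[0][0]
# cast defensively in case nr_tickets is not already stored as an int
max_nr_of_tickets = int(max_nr_of_tickets)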

Related

How to rename multiple column names as single column?

I have a table with columns [col1, col2, col3 .... col9].
I want to merge the data from all of these columns into a single column, col, in Python.
from pyspark.sql.functions import concat
values = [('A','B','C','D'),('E','F','G','H'),('I','J','K','L')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| B| C| D|
| E| F| G| H|
| I| J| K| L|
+----+----+----+----+
req_column = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols',concat(*req_column))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
| A| B| C| D| ABCD|
| E| F| G| H| EFGH|
| I| J| K| L| IJKL|
+----+----+----+----+-----------------+
Using Spark SQL (the DataFrame first needs to be registered as a temp table):
df.registerTempTable("df")
new_df = sqlContext.sql("SELECT CONCAT(col1, col2, col3, col4) FROM df")
Without Spark SQL, you can use the concat function directly:
from pyspark.sql.functions import concat, col
new_df = df.withColumn('joined_column', concat(col('col1'), col('col2'), col('col3'), col('col4')))
In Spark (PySpark), DataFrames are immutable, so you cannot edit existing data in place; what you can do is create a new column. Please check the following link:
How do I add a new column to a Spark DataFrame (using PySpark)?
Using a UDF, you can combine all those values in a row and return them as a single value; a rough sketch is shown after the list below.
A few cautions: look out for the following data issues while aggregating:
Null values
Type mismatches
String Encoding issues
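Here is that sketch. The helper name and the null/type handling are my own illustrative assumptions, not part of the original answer; for the null caution specifically, note that the built-in concat returns null as soon as any input column is null, while concat_ws skips nulls.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Hypothetical helper: cast every value to a string and skip nulls so that a
# single null column does not null out the whole combined value.
@udf(returnType=StringType())
def combine_cols(*values):
    return ''.join(str(v) for v in values if v is not None)

req_column = ['col1', 'col2', 'col3', 'col4']
df = df.withColumn('concatenated_cols', combine_cols(*[col(c) for c in req_column]))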

Pyspark: filter function error with .isNotNull() and 2 other conditions

I'm trying to filter my dataframe in Pyspark and I want to write my results in a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if one of them is true the resulting row should be written in the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!
First off, for the col1 filter you could use isin like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['attribute1', 'attribute2'])) | (df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have sample data to test it, sorry.
See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 5| null|
| 9| a|
| 1| b|
| 7| null|
| 3| null|
+---+-----+
Now we do the filter:
df = df.filter((df['id'] == 3) | (df['id'] == 9) | (~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 9| a|
| 1| b|
| 3| null|
+---+-----+
So you see:
row(3, 'a') and row(3, null) are selected because of df['id']==3
row(9, 'a') is selected because of df['id']==9
row(1, 'b') is selected because of ~F.isnull('value'), but row(5, null) and row(7, null) are not selected.
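Putting the corrected filter together with the parquet write from the question gives something like the sketch below; the column and attribute names are taken from the question.
df.filter(
    (df['col1'].isin(['attribute1', 'attribute2'])) | (df['col2'].isNotNull())
).write.parquet("new_parquet.parquet")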

How to add rows from a JSON (array of dicts) to an existing dataframe?

Hi, I already have a DataFrame, df_init, with columns:
A|B|C|D
I receive a JSON like:
json=[{"A":"1","B":"2","C":"3"},
{"A":"1","B":"2","C":"3","D":"4"},
{"A":"1","B":"2"}]
I want df_final to look like:
A|B| C |D
1|2| 3 |None
1|2| 3 |4
1|2|None|None
If I do:
msgJSON=self.spark.sparkContext.parallelize([json_string],1)
df = self.sqlContext.read.option("multiLine", "true").options(samplingRatio=1.0).json(msgJSON)
I run into errors.
Thanks
json = [{"A":"1","B":"2","C":"3"},
{"A":"1","B":"2","C":"3","D":"4"},
{"A":"1","B":"2"}]
msgJSON = spark.sparkContext.parallelize([json],1)
df_final = sqlContext.read.option("multiLine","true").options(samplingRatio=1.0).json(msgJSON)
df_final.show()
+---+---+----+----+
| A| B| C| D|
+---+---+----+----+
| 1| 2| 3|null|
| 1| 2| 3| 4|
| 1| 2|null|null|
+---+---+----+----+
I replicated your code without the keyword self. You cannot use self everywhere; it only makes sense inside a class.
For more information, refer to: The self variable in python explained
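If the multiLine reader keeps misbehaving, an alternative sketch (my own, not from the answer) is to serialize each dict to a JSON string first, since read.json also accepts an RDD of JSON strings:
import json as jsonlib  # avoid clashing with the variable named json above

records = [{"A": "1", "B": "2", "C": "3"},
           {"A": "1", "B": "2", "C": "3", "D": "4"},
           {"A": "1", "B": "2"}]

# one JSON string per record; Spark infers the schema and fills missing keys with null
rdd = spark.sparkContext.parallelize([jsonlib.dumps(rec) for rec in records])
df_final = spark.read.json(rdd)
df_final.show()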

Using isin to simulate sql's IN clause

I have 2 dataframes
df1 = sqlContext.createDataFrame(sc.parallelize([(1,'a'),(2,'b'),(3,'c'),(10,'z')]),['id','value'])
df2 = sqlContext.createDataFrame(sc.parallelize([(1,'x'),(2,'y')]),['id','value'])
>>> df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
| 10| z|
+---+-----+
I want to simulate select df1.* from df1 where df1.id in (select df2.id from df2). How do I do it using isin?
I tried a few things but they didn't work, which means I am missing something important.
df1.where(col('id').isin(df2['id']))
df1.where(col('id').isin(*df2.id)).show()  # isin() argument after * must be a sequence, not Column
df1.where(col('id').isin(tuple(df2.id)))   # Column is not iterable
You need a local collection to work with isin, while a DataFrame column is distributed. Alternatively, you can use an inner join to filter the DataFrame:
df1.join(df2.select('id').dropDuplicates(), ['id']).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| b|
+---+-----+
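If you specifically want isin, here is a sketch that first collects df2's ids into a local Python list (my own variable names):
from pyspark.sql.functions import col

# bring the distinct ids to the driver, then filter with isin
id_list = [row['id'] for row in df2.select('id').distinct().collect()]
df1.where(col('id').isin(id_list)).show()
This only makes sense when df2 is small enough to collect; for anything large, prefer the join above or the SQL query below.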
You could also just use the exact query you provided with pyspark-sql:
df1.registerTempTable('df1')
df2.registerTempTable('df2')
query = "select df1.* from df1 where df1.id in (select df2.id from df2)"
sqlContext.sql(query).show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| a|
#| 2| b|
#+---+-----+

How to retrieve all columns using pyspark collect_list functions

I am on PySpark 2.0.1. I'm trying to group my data frame and retrieve the values of all the fields. I found that
z = data1.groupby('country').agg(F.collect_list('names'))
will give me values for the country and names attributes; for the names attribute the column header will be collect_list(names). But for my job I have a dataframe with around 15 columns, I will run a loop and change the groupby field each time inside the loop, and I need the output for all of the remaining fields. Can you please suggest how to do it using collect_list() or any other PySpark function?
I tried this code too:
from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)

z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
Suppose you have a DataFrame:
import pyspark.sql.functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operations can only be done on single columns.
After the aggregation, you can collect the result and iterate over it to separate the combined columns, or you can write a UDF to separate the combined columns:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *

def foo(x):
    # split a list of (b, c) structs back into two separate lists
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)

df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
Actually, we can do it in PySpark 2.2.
First we create a constant column ("Temp"), group by that column ("Temp"), and apply agg with an iterable of expressions (*exprs), each of which is a collect_list.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools

def groupColumnData(df, columns):
    # add a constant column so every row falls into a single group
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
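Applying the function to this input would look something like the call below (my own usage sketch; the column list is assumed to match the DataFrame):
df = groupColumnData(df, ["a", "b", "c"])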
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
In Spark 2.4.4 and Python 3.7 (I guess it's also relevant for previous Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use the concat_ws function; it works fine:
from pyspark.sql.functions import collect_list, concat_ws

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r')).show()
