Add new rows to pyspark Dataframe - python

Am very new pyspark but familiar with pandas.
I have a pyspark Dataframe
# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
wanted to add new Row (4,5,7) so it will output:
df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
| 1| 2| 0|
| 2| 0| 1|
| 4| 5| 7|
+---+----+----+

As thebluephantom has already said union is the way to go. I'm just answering your question to give you a pyspark example:
# if not already created automatically, instantiate Sparkcontext
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]
df = spark.createDataFrame(vals, columns)
newRow = spark.createDataFrame([(4,5,7)], columns)
appended = df.union(newRow)
appended.show()
Please have also a lookat the databricks FAQ: https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html

From something I did, using union, showing a block partial coding - you need to adapt of course to your own situation:
val dummySchema = StructType(
StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
for (i <- i_grams_Cols) {
val nameCol = col({i})
dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode({nameCol}).as("phrase")).toDF )
}
union of DF with itself is the way to go.

To append row to dataframe one can use collect method also. collect() function converts dataframe to list and you can directly append data to list and again convert list to dataframe.
my spark dataframe called df is like
+---+----+------+
| id|name|gender|
+---+----+------+
| 1| A| M|
| 2| B| F|
| 3| C| M|
+---+----+------+
convert this dataframe to list using collect
collect_df = df.collect()
print(collect_df)
[Row(id=1, name='A', gender='M'),
Row(id=2, name='B', gender='F'),
Row(id=3, name='C', gender='M')]
append new row to this list
collect_df.append({"id" : 5, "name" : "E", "gender" : "F"})
print(collect_df)
[Row(id=1, name='A', gender='M'),
Row(id=2, name='B', gender='F'),
Row(id=3, name='C', gender='M'),
{'id': 5, 'name': 'E', 'gender': 'F'}]
convert this list to dataframe
added_row_df = spark.createDataFrame(collect_df)
added_row_df.show()
+---+----+------+
| id|name|gender|
+---+----+------+
| 1| A| M|
| 2| B| F|
| 3| C| M|
| 5| E| F|
+---+----+------+

Another alternative would be to utilize the partitioned parquet format, and add an extra parquet file for each dataframe you want to append. This way you can create (hundreds, thousands, millions) of parquet files, and spark will just read them all as a union when you read the directory later.
This example uses pyarrow
Note I also showed how to write a single parquet (example.parquet) that isn't partitioned, if you already know where you want to put the single parquet file.
import pyarrow.parquet as pq
import pandas as pd
headers=['A', 'B', 'C']
row1 = ['a1', 'b1', 'c1']
row2 = ['a2', 'b2', 'c2']
df1 = pd.DataFrame([row1], columns=headers)
df2 = pd.DataFrame([row2], columns=headers)
df3 = df1.append(df2, ignore_index=True)
table = pa.Table.from_pandas(df3)
pq.write_table(table, 'example.parquet', flavor='spark')
pq.write_to_dataset(table, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Adding a new partition (B=b2/C=c3
row3 = ['a3', 'b3', 'c3']
df4 = pd.DataFrame([row3], columns=headers)
table2 = pa.Table.from_pandas(df4)
pq.write_to_dataset(table2, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Add another parquet file to the B=b2/C=c2 partition
# Note this does not overwrite existing partitions, it just appends a new .parquet file.
# If files already exist, then you will get a union result of the two (or multiple) files when you read the partition
row5 = ['a5', 'b2', 'c2']
df5 = pd.DataFrame([row5], columns=headers)
table3 = pa.Table.from_pandas(df5)
pq.write_to_dataset(table3, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
Reading the output afterwards
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName("testing parquet read")
.getOrCreate())
df_spark = spark.read.parquet('test_part_file')
df_spark.show(25, False)
You should see something like this
+---+---+---+
|A |B |C |
+---+---+---+
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
+---+---+---+
If you run the same thing end to end again, you should see duplicates like this (since all of the previous parquet files are still there, spark unions them).
+---+---+---+
|A |B |C |
+---+---+---+
|a2 |b2 |c2 |
|a5 |b2 |c2 |
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
|a3 |b3 |c3 |
+---+---+---+

Related

Pivot array of structs into columns using pyspark - not explode the array

I currently have a dataframe with an id and a column which is an array of structs:
root
|-- id: string (nullable = true)
|-- lists: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Here is an example table with data:
id | list1 | list2
------------------------------------------
1 | [[a, av], [b, bv]]| [[e, ev], [f,fv]]
2 | [[c, cv]] | [[g,gv]]
How do I transform the above dataframe to the one below? I need to "explode" the array and add columns based on first value in the struct.
id | a | b | c | d | e | f | g
----------------------------------------
1 | av | bv | null| null| ev | fv | null
2 | null| null| cv | null|null|null|gv
A pyspark code to create the dataframe is as below:
d1 = spark.createDataFrame([("1", [("a","av"),("b","bv")], [("e", "ev"), ("f", "fv")]), \
("2", [("c", "cv")], [("g", "gv")])], ["id","list1","list2"])
Note: I have a spark version of 2.2.0 so some sql functions don't work such as concat_map, etc.
You can do this using hogher order functions without exploding the arrays like:
d1.select('id',
f.when(f.size(f.expr('''filter(list1,x->x._1='a')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list1,x->x._1='a'),value->value._2)'''))).alias('a'),\
f.when(f.size(f.expr('''filter(list1,x->x._1='b')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list1,x->x._1='b'),value->value._2)'''))).alias('b'),\
f.when(f.size(f.expr('''filter(list1,x->x._1='c')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list1,x->x._1='c'),value->value._2)'''))).alias('c'),\
f.when(f.size(f.expr('''filter(list1,x->x._1='d')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list1,x->x._1='d'),value->value._2)'''))).alias('d'),\
f.when(f.size(f.expr('''filter(list2,x->x._1='e')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list2,x->x._1='e'),value->value._2)'''))).alias('e'),\
f.when(f.size(f.expr('''filter(list2,x->x._1='f')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list2,x->x._1='f'),value->value._2)'''))).alias('f'),\
f.when(f.size(f.expr('''filter(list2,x->x._1='g')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list2,x->x._1='g'),value->value._2)'''))).alias('g'),\
f.when(f.size(f.expr('''filter(list2,x->x._1='h')'''))>0,f.concat_ws(',',f.expr('''transform(filter(list2,x->x._1='h'),value->value._2)'''))).alias('h')\
).show()
+---+----+----+----+----+----+----+----+----+
| id| a| b| c| d| e| f| g| h|
+---+----+----+----+----+----+----+----+----+
| 1| av| bv|null|null| ev| fv|null|null|
| 2|null|null| cv|null|null|null| gv|null|
+---+----+----+----+----+----+----+----+----+
Hope it helps
UPD - For Spark 2.2.0
You can define similar functions in 2.2.0 using udfs. They will be much less efficient in terms of performance and you'll need a special function for each output value type (i.e. you won't be able to have one element_at function which could output value of any type from any map type), but they will work. The code below works for Spark 2.2.0:
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, ArrayType, StringType
#udf(MapType(StringType(), StringType()))
def map_from_entries(l):
return {x:y for x,y in l}
#udf(MapType(StringType(), StringType()))
def map_concat(m1, m2):
m1.update(m2)
return m1
#udf(ArrayType(StringType()))
def map_keys(m):
return list(m.keys())
def element_getter(k):
#udf(StringType())
def element_at(m):
return m.get(k)
return element_at
d2 = d1.select('id',
map_concat(map_from_entries('list1'),
map_from_entries('list2')).alias('merged_map'))
map_keys = d2.select(f.explode(map_keys('merged_map')).alias('mk')) \
.agg(f.collect_set('mk').alias('keys')) \
.collect()[0].keys
map_keys = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
selects = [element_getter(k)('merged_map').alias(k) for k in sorted(map_keys)]
d = d2.select('id', *selects)
ORIGINAL ANSWER (working for Spark 2.4.0+)
Not clear where d column came from in your example (d never appeared in the initial dataframe). If columns should be created based on the first elements in the array, then this should work (assuming total number of unique first values in the lists is small enough):
import pyspark.sql.functions as f
d2 = d1.select('id',
f.map_concat(f.map_from_entries('list1'),
f.map_from_entries('list2')).alias('merged_map'))
map_keys = d2.select(f.explode(f.map_keys('merged_map')).alias('mk')) \
.agg(f.collect_set('mk').alias('keys')) \
.collect()[0].keys
selects = [f.element_at('merged_map', k).alias(k) for k in sorted(map_keys)]
d = d2.select('id', *selects)
Output (no column for d because it never mentioned in the initial DataFrame):
+---+----+----+----+----+----+----+
| id| a| b| c| e| f| g|
+---+----+----+----+----+----+----+
| 1| av| bv|null| ev| fv|null|
| 2|null|null| cv|null|null| gv|
+---+----+----+----+----+----+----+
If you actually had in mind that list of the columns is fixed from the beginning (and they are not taken from the array), then you can just replace the definition of varaible map_keys with the fixed list of columns, e.g. map_keys=['a', 'b', 'c', 'd', 'e', 'f', 'g']. In that case you get the output you mention in the answer:
+---+----+----+----+----+----+----+----+
| id| a| b| c| d| e| f| g|
+---+----+----+----+----+----+----+----+
| 1| av| bv|null|null| ev| fv|null|
| 2|null|null| cv|null|null|null| gv|
+---+----+----+----+----+----+----+----+
By the way - what you want to do is not what is called explode in Spark. explode in Spark is for the situation when you create multiple rows from one. E.g. if you wanted to get from dataframe like this:
+---+---------+
| id| arr|
+---+---------+
| 1| [a, b]|
| 2|[c, d, e]|
+---+---------+
to this:
+---+-------+
| id|element|
+---+-------+
| 1| a|
| 1| b|
| 2| c|
| 2| d|
| 2| e|
+---+-------+
You can solve this in an alternative way:
from pyspark.sql.functions import concat, explode, first
d1 = spark.createDataFrame([("1", [("a", "av"), ("b", "bv")], [("e", "ev"), ("f", "fv")]), \
("2", [("c", "cv")], [("g", "gv")])], ["id", "list1", "list2"])
d2 = d1.withColumn('concat', concat('list1', 'list2'))
d3 = d2.withColumn('explode', explode('concat'))
d4 = d3.groupby('id').pivot('explode._1').agg(first('explode._2'))
d4.show()
+---+----+----+----+----+----+----+
|id |a |b |c |e |f |g |
+---+----+----+----+----+----+----+
|1 |av |bv |null|ev |fv |null|
|2 |null|null|cv |null|null|gv |
+---+----+----+----+----+----+----+

Append list of lists as column to PySpark's dataframe (Concatenating two dataframes without common column)

I have some dataframe in Pyspark:
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df.show()
+---+
| id|
+---+
| a|
| b|
| c|
| d|
| e|
+---+
And I have a list of lists:
l = [[1,1], [2,2], [3,3], [4,4], [5,5]]
Is it possible to append this list as a column to df? Namely, the first element of l should appear next to the first row of df, the second element of l next to the second row of df, etc. It should look like this:
+----+---+--+
| id| l|
+----+---+--+
| a| [1,1]|
| b| [2,2]|
| c| [3,3]|
| d| [4,4]|
| e| [5,5]|
+----+---+--+
UDF's are generally slow but a more efficient way without using any UDF's would be:
import pyspark.sql.functions as F
ldf = spark.createDataFrame(l, schema = "array<int>")
df1 = df.withColumn("m_id", F.monotonically_increasing_id())
df2 = ldf.withColumn("m_id", F.monotonically_increasing_id())
df3 = df2.join(df1, "m_id", "outer").drop("m_id")
df3.select("id", "value").show()
+---+------+
| id| value|
+---+------+
| a|[1, 1]|
| b|[2, 2]|
| d|[4, 4]|
| c|[3, 3]|
| e|[5, 5]|
+---+------+
Assuming that you are going to have same amount of rows in your df and items in your list (df.count==len(l)).
You can add a row_id (to specify the order) to your df, and based on that, access to the item on your list (l).
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import *
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
df.show()
Above code will look like:
+---+-------+
| id|row_num|
+---+-------+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+---+-------+
Then, you can just iterate your df and access the specified index in your list:
def map_df(row):
return (row.id, l[row.row_num-1])
new_df = df.rdd.map(map_df).toDF(["id", "l"])
new_df.show()
Output:
+---+------+
| id| l|
+---+------+
| 1|[1, 1]|
| 2|[2, 2]|
| 3|[3, 3]|
| 4|[4, 4]|
| 5|[5, 5]|
+---+------+
Thanks to Cesar's answer, I figured out how to do it without making the dataframe an RDD and coming back. It would be something like this:
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import row_number, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, IntegerType
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
new_col = [[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]]
map_list_to_column = udf(lambda row_num: new_col[row_num -1], ArrayType(FloatType()))
df.withColumn('new_col', map_list_to_column(df.row_num)).drop('row_num').show()

How to create a sample single-column Spark DataFrame in Python?

I want to create a sample single-column DataFrame, but the following code is not working:
df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age
The expected result:
age
10
11
13
the following code is not working
With single element you need a schema as type
spark.createDataFrame(["10","11","13"], "string").toDF("age")
or DataType:
from pyspark.sql.types import StringType
spark.createDataFrame(["10","11","13"], StringType()).toDF("age")
With name elements should be tuples and schema as sequence:
spark.createDataFrame([("10", ), ("11", ), ("13", )], ["age"])
Well .. There is some pretty easy method for creating sample dataframe in PySpark
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
to create with some column names
>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
In this way, no need to define schema too.Hope this is the simplest way
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
Output: (no need to define schema)
+---+---+---+
| a | b | c |
+---+---+---+
| x| y| 3|
+---+---+---+
For pandas + pyspark users, if you've already installed pandas in the cluster, you can do this simply:
# create pandas dataframe
df = pd.DataFrame({'col1':[1,2,3], 'col2':['a','b','c']})
# convert to spark dataframe
df = spark.createDataFrame(df)
Local Spark Setup
import findspark
findspark.init()
import pyspark
spark = (pyspark
.sql
.SparkSession
.builder
.master("local")
.getOrCreate())
See my farsante lib for creating a DataFrame with fake data:
import farsante
df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
| Tommy| Hess|
| Arthur| Melendez|
| Clemente| Blair|
| Wesley| Conrad|
| Willis| Dunlap|
| Bruna| Sellers|
| Tonda| Schwartz|
+----------+---------+
Here's how to explicitly specify the schema when creating the PySpark DataFrame:
df = spark.createDataFrame(
[(10,), (11,), (13,)],
StructType([StructField("some_int", IntegerType(), True)]))
df.show()
+--------+
|some_int|
+--------+
| 10|
| 11|
| 13|
+--------+
You can also try something like this -
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc) # sc is the spark context
sample = sqlContext.createDataFrame(
[
('qwe', 23), # enter your data here
('rty',34),
('yui',56),
],
['abc', 'def'] # the row header/column labels should be entered here
There are several ways to create a DataFrame, PySpark Create DataFrame is one of the first steps you learn while working on PySpark
I assume you already have data, columns, and an RDD.
1) df = rdd.toDF()
2) df = rdd.toDF(columns) //Assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData,columns)
Besides these, you can find several examples on pyspark create dataframe

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names:
df = df1.join(df2, df1['id'] == df2['id'])
Join works fine but you can't call the id column because it is ambiguous and you would get the following exception:
pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous,
could be: id#5691, id#5918.;"
This makes id not usable anymore...
The following function solves the problem:
def join(df1, df2, cond, how='left'):
df = df1.join(df2, cond, how=how)
repeated_columns = [c for c in df1.columns if c in df2.columns]
for col in repeated_columns:
df = df.drop(df2[col])
return df
What I don't like about it is that I have to iterate over the column names and delete them why by one. This looks really clunky...
Do you know of any other solution that will either join and remove duplicates more elegantly or delete multiple columns without iterating over each of them?
If the join columns at both data frames have the same names and you only need equi join, you can specify the join columns as a list, in which case the result will only keep one of the join columns:
df1.show()
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 4| 4|
| 5| 5|
+---+----+
df2.show()
+---+----+
| id|val2|
+---+----+
| 1| 2|
| 1| 3|
| 2| 4|
| 3| 5|
+---+----+
df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
Otherwise you need to give the join data frames alias and refer to the duplicated columns by the alias later:
df1.alias("a").join(
df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
df.join(other, on, how) when on is a column name string, or a list of column names strings, the returned dataframe will prevent duplicate columns.
when on is a join expression, it will result in duplicate columns. We can use .drop(df.a) to drop duplicate columns. Example:
cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# result will have duplicate column a
result = df.join(other, cond, 'inner').drop(df.a)
Assuming 'a' is a dataframe with column 'id' and 'b' is another dataframe with column 'id'
I use the following two methods to remove duplicates:
Method 1: Using String Join Expression as opposed to boolean expression. This automatically remove a duplicate column for you
a.join(b, 'id')
Method 2: Renaming the column before the join and dropping it after
b.withColumnRenamed('id', 'b_id')
joinexpr = a['id'] == b['b_id']
a.join(b, joinexpr).drop('b_id)
The code below works with Spark 1.6.0 and above.
salespeople_df.show()
+---+------+-----+
|Num| Name|Store|
+---+------+-----+
| 1| Henry| 100|
| 2| Karen| 100|
| 3| Paul| 101|
| 4| Jimmy| 102|
| 5|Janice| 103|
+---+------+-----+
storeaddress_df.show()
+-----+--------------------+
|Store| Address|
+-----+--------------------+
| 100| 64 E Illinos Ave|
| 101| 74 Grand Pl|
| 102| 2298 Hwy 7|
| 103|No address available|
+-----+--------------------+
Assuming -in this example- that the name of the shared column is the same:
joined=salespeople_df.join(storeaddress_df, ['Store'])
joined.orderBy('Num', ascending=True).show()
+-----+---+------+--------------------+
|Store|Num| Name| Address|
+-----+---+------+--------------------+
| 100| 1| Henry| 64 E Illinos Ave|
| 100| 2| Karen| 64 E Illinos Ave|
| 101| 3| Paul| 74 Grand Pl|
| 102| 4| Jimmy| 2298 Hwy 7|
| 103| 5|Janice|No address available|
+-----+---+------+--------------------+
.join will prevent the duplication of the shared column.
Let's assume that you want to remove the column Num in this example, you can just use .drop('colname')
joined=joined.drop('Num')
joined.show()
+-----+------+--------------------+
|Store| Name| Address|
+-----+------+--------------------+
| 103|Janice|No address available|
| 100| Henry| 64 E Illinos Ave|
| 100| Karen| 64 E Illinos Ave|
| 101| Paul| 74 Grand Pl|
| 102| Jimmy| 2298 Hwy 7|
+-----+------+--------------------+
After I've joined multiple tables together, I run them through a simple function to drop columns in the DF if it encounters duplicates while walking from left to right. Alternatively, you could rename these columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = dropDupeDfCols(NamesAndDates)
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where dropDupeDfCols is defined as:
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Date'].
In my case I had a dataframe with multiple duplicate columns after joins and I was trying to same that dataframe in csv format, but due to duplicate column I was getting error. I followed below steps to drop duplicate columns. Code is in scala
1) Rename all the duplicate columns and make new dataframe
2) make separate list for all the renamed columns
3) Make new dataframe with all columns (including renamed - step 1)
4) drop all the renamed column
private def removeDuplicateColumns(dataFrame:DataFrame): DataFrame = {
var allColumns: mutable.MutableList[String] = mutable.MutableList()
val dup_Columns: mutable.MutableList[String] = mutable.MutableList()
dataFrame.columns.foreach((i: String) =>{
if(allColumns.contains(i))
if(allColumns.contains(i))
{allColumns += "dup_" + i
dup_Columns += "dup_" +i
}else{
allColumns += i
}println(i)
})
val columnSeq = allColumns.toSeq
val df = dataFrame.toDF(columnSeq:_*)
val unDF = df.drop(dup_Columns:_*)
unDF
}
to call the above function use below code and pass your dataframe which contains duplicate columns
val uniColDF = removeDuplicateColumns(df)
Here is simple solution for remove duplicate column
final_result=df1.join(df2,(df1['subjectid']==df2['subjectid']),"left").drop(df1['subjectid'])
If you join on a list or string, dup cols are automatically]1 removed
This is a scala solution, you could translate the same idea into any language
// get a list of duplicate columns or use a list/seq
// of columns you would like to join on (note that this list
// should include columns for which you do not want duplicates)
val duplicateCols = df1.columns.intersect(df2.columns)
// no duplicate columns in resulting DF
df1.join(df2, duplicateCols.distinct.toSet)
Spark SQL version of this answer:
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("select * from t1 inner join t2 using (id)").show()
# +---+----+----+
# | id|val1|val2|
# +---+----+----+
# | 1| 2| 2|
# | 1| 2| 3|
# | 2| 3| 4|
# +---+----+----+
This works for me when multiple columns used to join and need to drop more than one column which are not string type.
final_data = mdf1.alias("a").join(df3.alias("b")
(mdf1.unique_product_id==df3.unique_product_id) &
(mdf1.year_week==df3.year_week) ,"left" ).select("a.*","b.promotion_id")
Give a.* to select all columns from one table and from the other table choose specific columns.

How to retrieve all columns using pyspark collect_list functions

I have a pyspark 2.0.1. I'm trying to groupby my data frame & retrieve the value for all the fields from my data frame. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for country & names attribute & for names attribute it will give column header as collect_list(names). But for my job I have dataframe with around 15 columns & I will run a loop & will change the groupby field each time inside loop & need the output for all of the remaining fields.Can you please suggest me how to do it using collect_list() or any other pyspark functions?
I tried this code too
from pyspark.sql import functions as F
fieldnames=data1.schema.names
names1= list()
for item in names:
if item != 'names':
names1.append(item)
z=data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
suppose you have a dataframe
df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operation can be done only on single columns.
After aggregation, You can collect the result and iterate over it to separate the combined columns generate the index dict. or you can write a
udf to separate the combined columns.
from pyspark.sql.types import *
def foo(x):
x1 = [y[0] for y in x]
x2 = [y[1] for y in x]
return(x1,x2)
st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol",
udf_foo("collected_col")).select("a",
col("ncol").getItem("b").alias("b"),
col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
Actually we can do it in pyspark 2.2 .
First we need create a constant column ("Temp"), groupBy with that column ("Temp") and apply agg by pass iterable *exprs in which expression of collect_list exits.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools
def groupColumnData(df, columns):
df = df.withColumn("Temp", ftions.lit(1))
exprs = [ftions.collect_list(colName) for colName in columns]
df = df.groupby('Temp').agg(*exprs)
df = df.drop("Temp")
df = df.toDF(*columns)
return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
in spark 2.4.4 and python 3.7 (I guess its also relevant for previous spark and python version) --
My suggestion is a based on pauli's answer,
instead of creating the struct and then using the agg function, create the struct inside collect_list:
df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
result :
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use Concat_ws function it's perfectly fine.
> from pyspark.sql.functions import * df =
> spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
> df.groupBy('a').agg(collect_list(concat_ws(',','b','c'))).alias('r').show()

Categories

Resources