I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
|  a| foo|    1|
|  b| bar|    1|
|  a| biz|    6|
|  c| bar|    3|
|  c| biz|    2|
+---+----+-----+
What I'm trying to do is group the code and level values into a list of dicts and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid|           json           |
+---+--------------------------+
|  a| '[{"foo":1}, {"biz":6}]' |
|  b| '[{"bar":1}]'            |
|  c| '[{"bar":3}, {"biz":2}]' |
+---+--------------------------+
I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy. I've tried implementing this by creating a new StringType column called "json" and then using the pandas_udf decorator, but I'm getting errors about unhashable types because, as I've found out, the way I'm accessing the data gives me the whole column, not just the row.
>>> df = df.withColumn('json', F.lit(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
..: def to_json(pdf):
..:     return pdf.assign(serial=json.dumps({pdf.code:pdf.level}))
I've considered using string concatenation between the two columns and using collect_set but that feels wrong as well since it has the potential to write to disk that which can't be JSON loaded just because it has a string representation. Any help is appreciated.
There's no need for a pandas_udf in this case. to_json, collect_list and create_map should be all you need:
import pyspark.sql.functions as f
df.groupby('uid').agg(
    f.to_json(
        f.collect_list(
            f.create_map('code', 'level')
        )
    ).alias('json')
).show(3, False)
+---+---------------------+
|uid|json                 |
+---+---------------------+
|c  |[{"bar":3},{"biz":2}]|
|b  |[{"bar":1}]          |
|a  |[{"foo":1},{"biz":6}]|
+---+---------------------+
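The question's worry was that a hand-built string might not load back as JSON. As a quick sanity check, here is a minimal plain-Python sketch (no Spark involved) that parses the strings shown in the output above; the `rows` and `parsed` names are made up for this illustration.

```python
import json

# Strings copied from the `json` column in the output above.
rows = {
    "a": '[{"foo":1},{"biz":6}]',
    "b": '[{"bar":1}]',
    "c": '[{"bar":3},{"biz":2}]',
}

# json.loads raises an error on malformed input, so a clean pass
# means every stored string is valid JSON.
parsed = {uid: json.loads(text) for uid, text in rows.items()}

print(parsed["a"])  # [{'foo': 1}, {'biz': 6}]
```

Each value comes back as a list of single-key dicts, which is the shape the question asked for.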
I have a question about PySpark.
I have a dataframe with 2 columns, "country" and "web". I need to save this dataframe as a dictionary so that I can later use it to look up values for a column of another dataframe.
I am saving the dictionary like this:
sorted_dict = result.rdd.sortByKey()
But when I try to iterate through it, I get an exception:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example SPARK-5063
I understand that I can't use two RDDs together, but unfortunately I don't know how to use SparkContext.broadcast in this case, because I get this error:
TypeError: broadcast() missing 2 required positional arguments: 'self' and 'value'
Can anyone help me make this clear? I need to make a dictionary from this dataframe:
+--------------------+-------+
|                 web|country|
+--------------------+-------+
|   alsudanalyoum.com|     SD|
|periodicoequilibr...|     SV|
|  telesurenglish.net|     UK|
|         nytimes.com|     US|
|portaldenoticias....|     AR|
+--------------------+-------+
Then take another dataframe:
+--------------------+-------+
|           split_url|country|
+--------------------+-------+
|   alsudanalyoum.com|   Null|
|periodicoequilibr...|   Null|
|  telesurenglish.net|   Null|
|         nytimes.com|   Null|
|portaldenoticias....|   Null|
+--------------------+-------+
... and fill the country column with the values from the dictionary.
P.S. A join does not work for me, for other reasons.
If you can, you should use join(), but since you cannot, you can combine the use of df.rdd.collectAsMap() and pyspark.sql.functions.create_map() and itertools.chain to achieve the same thing.
NB: sortByKey() does not return a dictionary (or a map), but instead returns a sorted RDD.
from itertools import chain
import pyspark.sql.functions as f
df = spark.createDataFrame([
    ("a", 5),
    ("b", 20),
    ("c", 10),
    ("d", 1),
], ["key", "value"])
# create map from the origin df
rdd_map = df.rdd.collectAsMap()
# yes, these are not real null values, but here it doesn't matter
df_target = spark.createDataFrame([
    ("a", "NULL"),
    ("b", "NULL"),
    ("c", "NULL"),
    ("d", "NULL"),
], ["key", "value"])
df_target.show()
+---+-----+
|key|value|
+---+-----+
|  a| NULL|
|  b| NULL|
|  c| NULL|
|  d| NULL|
+---+-----+
value_map = f.create_map(
    [f.lit(x) for x in chain(*rdd_map.items())]
)
# map over the "key" column into the "value" column
df_target.withColumn(
    "value",
    value_map[f.col("key")]
).show()
+---+-----+
|key|value|
+---+-----+
|  a|    5|
|  b|   20|
|  c|   10|
|  d|    1|
+---+-----+
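The step that tends to puzzle readers is chain(*rdd_map.items()). create_map takes its arguments as alternating keys and values (key1, value1, key2, value2, ...), so the dictionary has to be flattened into that order first. A minimal plain-Python sketch of just that flattening, using the same contents as rdd_map above (the `flat` name is only for illustration):

```python
from itertools import chain

# Same contents as rdd_map in the answer above.
rdd_map = {"a": 5, "b": 20, "c": 10, "d": 1}

# items() yields (key, value) pairs; chain(*pairs) lays them end to end,
# giving the alternating key, value, key, value order.
flat = list(chain(*rdd_map.items()))

print(flat)  # ['a', 5, 'b', 20, 'c', 10, 'd', 1]
```

In the answer, each element of this flat sequence is wrapped in f.lit() before being passed to create_map.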
I want to check each column of a PySpark dataframe and, if the column has a specific dtype, apply a certain function to it. Below are my code and dataset.
Dataset:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Word Count').config('spark.some.config.option', 'some-value').getOrCreate()
df = spark.createDataFrame(
    [
        ('A', 1),
        ('A', 2),
        ('A', 3),
        ('A', 4),
        ('B', 5),
        ('B', 6),
        ('B', 7),
        ('B', 8),
    ],
    ['id', 'v']
)  # I saved this to CSV, so you can ignore my read-CSV part below.
Code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load('test.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true')
from functools import reduce
from pyspark.sql.functions import col
import numpy as np
i = (reduce(lambda x, y: x.withColumn(y, np.where(col(y).dtypes != 'str', col(y)+2, col(y))), df.columns, df)) # this is the part that I wanted to change.
Side learning request: if possible, can anyone show me how to edit only a specific column? I understand that .select can be used, but an example with a sample dataset would help. Thank you.
My expected output:
+---+---+
| id| v|
+---+---+
|  A|  3|
|  A|  4|
|  A|  5|
|  A|  6|
|  B|  7|
|  B|  8|
|  B|  9|
|  B| 10|
+---+---+
Side note: I am new to PySpark, so I don't understand why 'col' is needed. What is it, actually?
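One possible starting point, not taken from the original thread: df.dtypes is a list of (column name, type string) pairs, so the per-column decision can be made in plain Python before any Spark expression is built. The sketch below hard-codes that list for the example frame (an assumption: the integer column is reported as bigint), and NUMERIC_TYPES and numeric_columns are hypothetical names introduced only for this illustration.

```python
# Stand-in for df.dtypes on the example frame
# (assumption: 'v' is reported as bigint).
dtypes = [("id", "string"), ("v", "bigint")]

# Hypothetical set of type strings treated as numeric in this sketch.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}

def numeric_columns(dtypes):
    """Return the names of columns whose type string is in NUMERIC_TYPES."""
    return [name for name, type_name in dtypes if type_name in NUMERIC_TYPES]

to_update = numeric_columns(dtypes)
print(to_update)  # ['v']
```

The resulting names could then drive a withColumn loop or a select that adds 2 only to those columns, leaving string columns such as id untouched.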
I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.
I tried something like this:
some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)
But I got the following error:
ValueError: value should be a float, int, long, string, bool or dict
So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+
df = df.replace(float('nan'), None)
df.show()
+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+
You can use the .replace function to replace NaN with null values in one line of code.
I finally found the answer after Googling around a bit.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+
import pyspark.sql.functions as F
columns = df.columns
for column in columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()
+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+
It doesn't use na.fill(), but it accomplished the same result, so I'm happy.
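As a plain-Python analogue of the when(isnan(...), None).otherwise(...) expression above (a sketch only, with no Spark involved): NaN is a float value that never compares equal to anything, so it has to be detected with an explicit isnan test before it can be swapped for a null. The nan_to_none helper is a hypothetical name for this illustration.

```python
import math

# NaN never compares equal to anything, including itself,
# so an equality test cannot find it.
nan = float("nan")
print(nan == nan)  # False

def nan_to_none(value):
    """Return None for a float NaN, and the value unchanged otherwise."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

# Values of column "b" from the example frame above.
column_b = [float("nan"), 1.0]
cleaned = [nan_to_none(v) for v in column_b]
print(cleaned)  # [None, 1.0]
```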
I'm using PySpark 2.0.1. I'm trying to group my data frame and retrieve the values of all the other fields. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
gives me the values for the country and names attributes, with the names column headed collect_list(names). But my job has a dataframe with around 15 columns, and I will run a loop that changes the groupby field on each iteration and needs the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other PySpark functions?
I tried this code too:
from pyspark.sql import functions as F
fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got this error message:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy.
Suppose you have a dataframe:
import pyspark.sql.functions as f
df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
|  a|newcol|
+---+------+
|  0| [1,2]|
|  0| [4,5]|
|  1| [7,8]|
|  1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
|  a| collected_col|
+---+--------------+
|  0|[[1,2], [4,5]]|
|  1|[[7,8], [8,7]]|
+---+--------------+
An aggregation can only be applied to a single column, which is why the columns are combined into a struct first.
After the aggregation, you can collect the result and iterate over it to separate the combined columns and generate the index dict, or you can write a udf to separate the combined columns.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *
def foo(x):
    # x is the collected list of (b, c) structs; split it into two lists
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)
st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
|  a|     b|     c|
+---+------+------+
|  0|[1, 4]|[2, 5]|
|  1|[7, 8]|[8, 7]|
+---+------+------+
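To see what foo does to one group without starting Spark, here is a minimal plain-Python sketch that calls it on the pairs collected for a = 0. Plain tuples stand in for the struct rows (an assumption made only for this illustration), and the zip line shows an equivalent way to do the same split.

```python
def foo(x):
    # Split a list of (b, c) pairs into a list of b values and a list of c values.
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

# Pairs collected for a = 0 in the example above; tuples stand in for structs.
collected = [(1, 2), (4, 5)]

result = foo(collected)
print(result)  # ([1, 4], [2, 5])

# Equivalent split using zip, which transposes the list of pairs.
via_zip = tuple(list(values) for values in zip(*collected))
print(via_zip == result)  # True
```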
Actually, we can do this in PySpark 2.2.
First, create a constant column ("Temp"), group by that column, and call agg with an unpacked list of expressions (*exprs), each of which is a collect_list on one column.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools
def groupColumnData(df, columns):
    # constant column so that every row lands in the same group
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    # restore the original column names
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  1|  2|
|  0|  4|  5|
|  1|  7|  8|
|  1|  8|  7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
|           a|           b|           c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
In Spark 2.4.4 and Python 3.7 (I guess this also applies to earlier Spark and Python versions):
My suggestion is based on pauli's answer:
instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct
df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
|  a|res              |
+---+-----------------+
|  0|[[1, 2], [4, 5]] |
|  1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use the concat_ws function, and it works perfectly fine:
from pyspark.sql.functions import *
df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r')).show()
I'm attempting to cast multiple String columns to integers in a dataframe using PySpark 2.1.0. The data set starts out as an RDD; creating a dataframe from it generates the following error:
TypeError: StructType can not accept object 3 in type <class 'int'>
A sample of what I'm trying to do:
import pyspark.sql.types as typ
from pyspark.sql.functions import *
labels = [
    ('A', typ.StringType()),
    ('B', typ.IntegerType()),
    ('C', typ.IntegerType()),
    ('D', typ.IntegerType()),
    ('E', typ.StringType()),
    ('F', typ.IntegerType())
]
rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()
cols_to_cast = [dt[0] for dt in df.dtypes if dt[1]=='string']
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))
df2 = df.select(*(df[dt[0]].cast("integer").alias(dt[0])
                  for dt in df.dtypes if dt[1]=='string'))
df2.show()
The first problem is that the dataframe is not being created from the RDD.
After that, I have tried two ways to cast (df2); the first is commented out.
Any suggestions?
Alternatively, is there any way I could use .withColumn to cast all the columns in one go, instead of specifying each column?
The actual dataset, although not large, has many columns.
The problem isn't your code, it's your data. You are passing a single flat list, so each element is treated as its own row rather than as one of the six columns you want.
Try the rdd line below and it should work fine (notice the extra brackets around the list):
rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])
Your code with the corrected line above gives me the following output:
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+
+---+---+
|  A|  E|
+---+---+
|  1|  5|
+---+---+
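Why the extra brackets matter can be seen in plain Python, without Spark: sc.parallelize distributes the elements of the list it is given, so each top-level element becomes one record. A minimal sketch comparing the two shapes (the `flat` and `nested` names are only for illustration):

```python
# Original input: six top-level elements, each a bare value.
flat = ["1", 2, 3, 4, "5", 6]

# Corrected input: one top-level element holding six fields.
nested = [["1", 2, 3, 4, "5", 6]]

print(len(flat))       # 6 records, none of which has six fields
print(len(nested))     # 1 record
print(len(nested[0]))  # 6 fields, matching the six-field schema
```

With the flat list, a record such as the bare integer 3 cannot be matched against a six-field StructType, which is what the TypeError in the question reports.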