Pyspark - casting multiple columns from Str to Int - python

I'm attempting to cast multiple String columns to integers in a dataframe using PySpark 2.1.0. The data set starts out as an RDD; when I create a DataFrame from it, it generates the following error:
TypeError: StructType can not accept object 3 in type <class 'int'>
A sample of what I'm trying to do:
import pyspark.sql.types as typ
from pyspark.sql.functions import *

labels = [
    ('A', typ.StringType()),
    ('B', typ.IntegerType()),
    ('C', typ.IntegerType()),
    ('D', typ.IntegerType()),
    ('E', typ.StringType()),
    ('F', typ.IntegerType())
]

rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()

cols_to_cast = [dt[0] for dt in df.dtypes if dt[1] == 'string']
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))
df2 = df.select(*(df[dt[0]].cast("integer").alias(dt[0])
                  for dt in df.dtypes if dt[1] == 'string'))
df2.show()
The problem, to begin with, is that the DataFrame is not being created from the RDD.
After that, I have tried two ways to cast (df2); the first is commented out.
Any suggestions?
Alternatively, is there any way I could use .withColumn to cast all the columns in one go, instead of specifying each column?
The actual dataset, although not large, has many columns.

The problem isn't your code, it's your data. You are passing a flat list, so each element is treated as a separate single-column row instead of the one six-column row that you want.
Try the rdd line as below and it should work fine (notice the extra brackets around the list):
rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])
Your code with the corrected line above gives the following output:
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+
+---+---+
| A| E|
+---+---+
| 1| 5|
+---+---+
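For the second part of the question (casting all the string columns in one go): a minimal sketch, assuming the df built above, that casts every string-typed column to integer in a single select, plus an equivalent withColumn loop if you prefer to cast the columns in place:
from pyspark.sql.functions import col

# Sketch: cast every string-typed column to integer, keep the rest unchanged.
df_casted = df.select(*[
    col(name).cast("integer").alias(name) if dtype == 'string' else col(name)
    for name, dtype in df.dtypes
])
df_casted.show()

# Equivalent withColumn loop over the same dtypes.
df_casted2 = df
for name, dtype in df.dtypes:
    if dtype == 'string':
        df_casted2 = df_casted2.withColumn(name, col(name).cast("integer"))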

Related

Fast way to use dictionary in pyspark

I have a question about PySpark.
I have a dataframe with 2 columns, "country" and "web". I need to save this dataframe as a dictionary so I can later iterate through another dataframe's column with it.
I am saving the dictionary like this:
sorted_dict = result.rdd.sortByKey()
But when I try to iterate through it I get an exception:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example SPARK-5063
I understood that I can't use two RDDs together, but unfortunately I don't know how to use SparkContext.broadcast in this way, because I get an error:
TypeError: broadcast() missing 2 required positional arguments: 'self' and 'value'
Can anyone help me clear this up? I need to make a dictionary from this dataframe:
+--------------------+-------+
| web|country|
+--------------------+-------+
| alsudanalyoum.com| SD|
|periodicoequilibr...| SV|
| telesurenglish.net| UK|
| nytimes.com| US|
|portaldenoticias....| AR|
+--------------------+-------+
Then take another dataframe:
+--------------------+-------+
| split_url|country|
+--------------------+-------+
| alsudanalyoum.com| Null|
|periodicoequilibr...| Null|
| telesurenglish.net| Null|
| nytimes.com| Null|
|portaldenoticias....| Null|
+--------------------+-------+
... and put values of dictionary to country column.
P.S. join does not fit for me because of other reasons.
If you can, you should use join(), but since you cannot, you can combine the use of df.rdd.collectAsMap() and pyspark.sql.functions.create_map() and itertools.chain to achieve the same thing.
NB: sortByKey() does not return a dictionary (or a map), but instead returns a sorted RDD.
from itertools import chain
import pyspark.sql.functions as f
df = spark.createDataFrame([
    ("a", 5),
    ("b", 20),
    ("c", 10),
    ("d", 1),
], ["key", "value"])

# create map from the origin df
rdd_map = df.rdd.collectAsMap()

# yes, these are not real null values, but here it doesn't matter
df_target = spark.createDataFrame([
    ("a", "NULL"),
    ("b", "NULL"),
    ("c", "NULL"),
    ("d", "NULL"),
], ["key", "value"])
df_target.show()
+---+-----+
|key|value|
+---+-----+
| a| NULL|
| b| NULL|
| c| NULL|
| d| NULL|
+---+-----+
value_map = f.create_map(
    [f.lit(x) for x in chain(*rdd_map.items())]
)

# map over the "key" column into the "value" column
df_target.withColumn(
    "value",
    value_map[f.col("key")]
).show()
+---+-----+
|key|value|
+---+-----+
| a| 5|
| b| 20|
| c| 10|
| d| 1|
+---+-----+
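As an aside on the TypeError from the question: broadcast() is an instance method, so it has to be called on a SparkContext instance (sc or spark.sparkContext), not on the SparkContext class. A hedged sketch, reusing rdd_map and df_target from above:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Broadcast the plain Python dict; bc_map.value is an ordinary dict on each executor.
bc_map = spark.sparkContext.broadcast(rdd_map)

lookup = udf(lambda k: bc_map.value.get(k), IntegerType())
df_target.withColumn("value", lookup("key")).show()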

checking column of the pyspark dataframe columns

I want to check each column of the PySpark dataframe, and if a column has a specific dtype, perform certain functions on it. Below are my code and dataset.
Dataset:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('Word Count').config('spark.some.config.option', 'some-value').getOrCreate()

df = spark.createDataFrame(
    [
        ('A', 1),
        ('A', 2),
        ('A', 3),
        ('A', 4),
        ('B', 5),
        ('B', 6),
        ('B', 7),
        ('B', 8),
    ],
    ['id', 'v']
)  # I save this to csv, so you can just ignore my read-csv part below.
Code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext

sqlContext = SQLContext(sc)
df = sqlContext.read.load('test.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true')

from functools import reduce
from pyspark.sql.functions import col
import numpy as np

i = (reduce(lambda x, y: x.withColumn(y, np.where(col(y).dtypes != 'str', col(y) + 2, col(y))), df.columns, df))  # this is the part that I wanted to change.
Side learning request: if possible, can anyone tell me how to edit only a specific column? I understand using .select, but could someone show an example with a small dataset? Thank you.
My expected output:
+---+---+
| id| v|
+---+---+
| A| 3|
| A| 4|
| A| 5|
| A| 6|
| B| 7|
| B| 8|
| B| 9|
| B| 10|
+---+---+
Side note: I am new to PySpark, so I don't get why I need to use 'col'. What is it, actually?
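A minimal sketch of one way to get the expected output (add 2 to every non-string column), using Spark's own df.dtypes instead of NumPy, and assuming the df created above:
from pyspark.sql.functions import col

df2 = df.select([
    (col(name) + 2).alias(name) if dtype != 'string' else col(name)
    for name, dtype in df.dtypes
])
df2.show()

# For the side request: editing only one specific column can be done with withColumn.
df3 = df.withColumn('v', col('v') + 2)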

PySpark replace value in several column at once

I want to replace a value in a dataframe column with another value, and I have to do it for many columns (let's say 30 to 100 columns).
I've gone through this and this already.
from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo", "val"), (2, "bar", "baz"), (3, "baz", "buz")]).toDF(["x", "y", "z"])
df.show()

# I can replace "baz" with Null separately in columns y and z
def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("y", replace(col("y"), "baz"))\
       .withColumn("z", replace(col("z"), "baz"))
df.show()
I can replace "baz" with Null separately in columns y and z. But I want to do it for all columns, with something like the list comprehension below:
[replace(df[col], "baz") for col in df.columns]
Since there are on the order of 30 to 100 columns, let's add a few more columns to the DataFrame to generalize it well.
# Loading the requisite packages
from pyspark.sql.functions import col, when

df = sc.parallelize([(1, "foo", "val", "baz", "gun", "can", "baz", "buz", "oof"),
                     (2, "bar", "baz", "baz", "baz", "got", "pet", "stu", "got"),
                     (3, "baz", "buz", "pun", "iam", "you", "omg", "sic", "baz")]).toDF(["x", "y", "z", "a", "b", "c", "d", "e", "f"])
df.show()
+---+---+---+---+---+---+---+---+---+
| x| y| z| a| b| c| d| e| f|
+---+---+---+---+---+---+---+---+---+
| 1|foo|val|baz|gun|can|baz|buz|oof|
| 2|bar|baz|baz|baz|got|pet|stu|got|
| 3|baz|buz|pun|iam|you|omg|sic|baz|
+---+---+---+---+---+---+---+---+---+
Let's say we want to replace baz with Null in all the columns except columns x and a. Use a list comprehension to choose the columns where the replacement has to be done.
# This contains the list of columns where we apply replace() function
all_column_names = df.columns
print(all_column_names)
['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f']
columns_to_remove = ['x','a']
columns_for_replacement = [i for i in all_column_names if i not in columns_to_remove]
print(columns_for_replacement)
['y', 'z', 'b', 'c', 'd', 'e', 'f']
Finally, do the replacement using when(), which is effectively an if clause.
# Doing the replacement on all the requisite columns
for i in columns_for_replacement:
    df = df.withColumn(i, when((col(i) == 'baz'), None).otherwise(col(i)))
df.show()
+---+----+----+---+----+---+----+---+----+
| x| y| z| a| b| c| d| e| f|
+---+----+----+---+----+---+----+---+----+
| 1| foo| val|baz| gun|can|null|buz| oof|
| 2| bar|null|baz|null|got| pet|stu| got|
| 3|null| buz|pun| iam|you| omg|sic|null|
+---+----+----+---+----+---+----+---+----+
There is no need to create a UDF and define a function to do the replacement when it can be done with a normal if-else clause. UDFs are in general a costly operation and should be avoided whenever possible.
Use a reduce() function:
from functools import reduce
reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")), [df, 'y', 'z']).show()
#+---+----+----+
#| x| y| z|
#+---+----+----+
#| 1| foo| val|
#| 2| bar|null|
#| 3|null| buz|
#+---+----+----+
You can use select and a list comprehension:
import pyspark.sql.functions as f

df = df.select([replace(f.col(column), 'baz').alias(column) if column != 'x' else f.col(column)
                for column in df.columns])
df.show()

Creating JSON String from Two Columns in PySpark GroupBy

I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
| a| foo| 1|
| b| bar| 1|
| a| biz| 6|
| c| bar| 3|
| c| biz| 2|
+---+----+-----+
What I'm trying to do is group the code and level values into a list of dicts and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid| json |
+---+--------------------------+
| a| '[{"foo":1}, {"biz":6}]' |
| b| '[{"bar":1}]' |
| c| '[{"bar":3}, {"biz":2}]' |
+---+--------------------------+
I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing this by creating a new StringType column called "json" and then using the pandas_udf decorator, but I'm getting errors about unhashable types because, as I've found out, the way I'm accessing the data accesses the whole column, not just the row.
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
..: def to_json(pdf):
..:     return pdf.assign(serial=json.dumps({pdf.code: pdf.level}))
I've considered using string concatenation between the two columns and using collect_set, but that feels wrong as well, since it has the potential to write something to disk that can't be loaded as JSON just because it has a string representation. Any help is appreciated.
There's no need for a pandas_udf in this case. to_json, collect_list and create_map should be all you need:
import pyspark.sql.functions as f

df.groupby('uid').agg(
    f.to_json(
        f.collect_list(
            f.create_map('code', 'level')
        )
    ).alias('json')
).show(3, False)
+---+---------------------+
|uid|json |
+---+---------------------+
|c |[{"bar":3},{"biz":2}]|
|b |[{"bar":1}] |
|a |[{"foo":1},{"biz":6}]|
+---+---------------------+
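If the end goal is just saving the result to disk, a short follow-up sketch (the output path is a placeholder; any DataFrame writer such as csv, json or parquet would do):
result = df.groupby('uid').agg(
    f.to_json(f.collect_list(f.create_map('code', 'level'))).alias('json')
)
result.write.mode('overwrite').csv('/tmp/uid_json')  # hypothetical path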

How to retrieve all columns using pyspark collect_list functions

I'm using PySpark 2.0.1. I'm trying to group my data frame and retrieve the values for all the fields from it. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for the country and names attributes, and for the names attribute it will give the column header as collect_list(names). But for my job I have a dataframe with around 15 columns, and I will run a loop and change the groupby field each time inside the loop, and I need the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other PySpark functions?
I tried this code too
from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in names:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy.
Suppose you have a dataframe:
import pyspark.sql.functions as f

df = spark.createDataFrame(sc.parallelize([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)])).toDF("a", "b", "c")
df = df.select("a", f.struct(["b", "c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
An aggregation operation can be done only on single columns.
After the aggregation, you can collect the result and iterate over it to separate the combined columns and generate the index dict, or you can write a UDF to separate the combined columns.
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def foo(x):
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)

df = df.withColumn("ncol",
                   udf_foo("collected_col")).select("a",
                                                    col("ncol").getItem("b").alias("b"),
                                                    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
Actually, we can do it in PySpark 2.2.
First we need to create a constant column ("Temp"), group by that column ("Temp"), and apply agg, passing the iterable *exprs that contains the collect_list expressions.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools

def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
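Presumably the output below is produced by a call along these lines (a usage sketch, collapsing every column of df):
result = groupColumnData(df, df.columns)
result.show()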
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
In Spark 2.4.4 and Python 3.7 (I guess it's also relevant for previous Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)]).toDF("a", "b", "c")
df.groupBy("a").agg(collect_list(struct(["b", "c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use the concat_ws function and it works perfectly fine.
from pyspark.sql.functions import *

df = spark.createDataFrame([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)]).toDF("a", "b", "c")
df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r')).show()
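For the original 15-column use case, a hedged sketch (assuming the questioner's data1 dataframe): the Py4JError comes from passing a Python list to collect_list, so pass one collect_list expression per remaining column instead:
import pyspark.sql.functions as F

group_field = 'country'
exprs = [F.collect_list(c).alias(c) for c in data1.columns if c != group_field]
z = data1.groupby(group_field).agg(*exprs)
z.show()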
