Fast way to use dictionary in pyspark - python

I have a question about pyspark.
I have a dataframe with 2 columns, "country" and "web". I need to save this dataframe as a dictionary so that I can later use it to fill in a column of another dataframe.
I am saving the dictionary like this:
sorted_dict = result.rdd.sortByKey()
But when I try to iterate through it I get an exception:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, see SPARK-5063
I understand that I can't reference one RDD from inside another, but unfortunately I don't know how to use SparkContext.broadcast here either, because I get an error:
TypeError: broadcast() missing 2 required positional arguments: 'self' and 'value'
Can anyone help me sort this out? I need to make a dictionary from this dataframe:
+--------------------+-------+
| web|country|
+--------------------+-------+
| alsudanalyoum.com| SD|
|periodicoequilibr...| SV|
| telesurenglish.net| UK|
| nytimes.com| US|
|portaldenoticias....| AR|
+--------------------+-------+
Then take another dataframe:
+--------------------+-------+
| split_url|country|
+--------------------+-------+
| alsudanalyoum.com| Null|
|periodicoequilibr...| Null|
| telesurenglish.net| Null|
| nytimes.com| Null|
|portaldenoticias....| Null|
+--------------------+-------+
... and put the values from the dictionary into its country column.
P.S. A join does not work for me, for unrelated reasons.

If you can, you should use join(); but since you cannot, you can combine df.rdd.collectAsMap(), pyspark.sql.functions.create_map() and itertools.chain to achieve the same thing.
NB: sortByKey() does not return a dictionary (or a map), but instead returns a sorted RDD.
from itertools import chain
import pyspark.sql.functions as f
df = spark.createDataFrame([
    ("a", 5),
    ("b", 20),
    ("c", 10),
    ("d", 1),
], ["key", "value"])
# create map from the origin df
rdd_map = df.rdd.collectAsMap()
# yes, these are not real null values, but here it doesn't matter
df_target = spark.createDataFrame([
    ("a", "NULL"),
    ("b", "NULL"),
    ("c", "NULL"),
    ("d", "NULL"),
], ["key", "value"])
df_target.show()
+---+-----+
|key|value|
+---+-----+
| a| NULL|
| b| NULL|
| c| NULL|
| d| NULL|
+---+-----+
value_map = f.create_map(
    [f.lit(x) for x in chain(*rdd_map.items())]
)
# map over the "key" column into the "value" column
df_target.withColumn(
    "value",
    value_map[f.col("key")]
).show()
+---+-----+
|key|value|
+---+-----+
| a| 5|
| b| 20|
| c| 10|
| d| 1|
+---+-----+
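Applied back to the original dataframes in the question, the same pattern would look roughly like the sketch below. The names df_web_country (the web/country lookup) and df_urls (the frame with split_url) are placeholders for whatever your dataframes are actually called:
from itertools import chain
import pyspark.sql.functions as f

# collect the small lookup dataframe to the driver as a plain Python dict
# (df_web_country is assumed to have exactly the columns "web" and "country")
web_to_country = df_web_country.select("web", "country").rdd.collectAsMap()

# build a literal map expression from the dict
country_map = f.create_map([f.lit(x) for x in chain(*web_to_country.items())])

# look up each split_url in the map to fill in the country column;
# URLs missing from the map simply stay null
df_urls = df_urls.withColumn("country", country_map[f.col("split_url")])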

Related

How to map each i-th element of a dataframe to a key from another dataframe defined by ranges in PySpark

what I want to do
Transform the input DataFrame df0 into the desired output df2, based on the clustering defined in df1.
What I have
df0 = spark.createDataFrame(
    [('A',0.05),('B',0.01),('C',0.75),('D',1.05),('E',0.00),('F',0.95),('G',0.34),('H',0.13)],
    ("items","quotient")
)
df1 = spark.createDataFrame(
    [('C0',0.00,0.00),('C1',0.01,0.05),('C2',0.06,0.10),('C3',0.11,0.30),('C4',0.31,0.50),('C5',0.51,99.99)],
    ("cluster","from","to")
)
What I want
df2 = spark.createDataFrame(
    [('A',0.05,'C1'),('B',0.01,'C1'),('C',0.75,'C5'),('D',1.05,'C5'),('E',0.00,'C0'),('F',0.95,'C3'),('G',0.34,'C2'),('H',0.13,'C4')],
    ("items","quotient","cluster")
)
Notes:
The coding environment is PySpark within Palantir.
The structure and content of DataFrame df1 can be adjusted to simplify the code: df1 is what tells which cluster the items from df0 should be linked to.
Thank you very much in advance for your time and feedback!
This is a simple left join problem.
df0.join(df1, df0['quotient'].between(df1['from'], df1['to']), "left") \
    .select(*df0.columns, df1['cluster']).show()
+-----+--------+-------+
|items|quotient|cluster|
+-----+--------+-------+
| A| 0.05| C1|
| B| 0.01| C1|
| C| 0.75| C5|
| D| 1.05| C5|
| E| 0.0| C0|
| F| 0.95| C5|
| G| 0.34| C4|
| H| 0.13| C3|
+-----+--------+-------+
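Since df1 (the cluster ranges) is typically small, you may also want to broadcast it so the non-equi join does not fall back to a fully shuffled plan. A minimal variation of the same join, assuming the df0/df1 from above:
import pyspark.sql.functions as F

# hint Spark to broadcast the small ranges table; the range condition then
# runs as a broadcast nested-loop join instead of a cartesian product
df2 = df0.join(
    F.broadcast(df1),
    df0['quotient'].between(df1['from'], df1['to']),
    "left"
).select(*df0.columns, df1['cluster'])
df2.show()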

Pyspark: filter function error with .isNotNull() and 2 other conditions

I'm trying to filter my dataframe in Pyspark and I want to write my results to a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if any one of them is true the row should be written to the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!
First off, for the col1 filter, you could do it using isin like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['attribute1', 'attribute2'])) | (df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have your sample data to test it, sorry.
See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 5| null|
| 9| a|
| 1| b|
| 7| null|
| 3| null|
+---+-----+
Now we do the filter:
df = df.filter((df['id'] == 3) | (df['id'] == 9) | (~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 9| a|
| 1| b|
| 3| null|
+---+-----+
So you see:
row(3, 'a') and row(3, null) are selected because of df['id'] == 3
row(9, 'a') is selected because of df['id'] == 9
row(1, 'b') is selected because of ~F.isnull('value'), but row(5, null) and row(7, null) are not selected.
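To finish what the question actually asked for, the corrected filter (the second attempt, without the nested df.where) can be chained straight into the parquet write. A sketch, assuming the column names col1/col2 from the question; the output path is just a placeholder:
from pyspark.sql.functions import col

df.filter(
    (col('col1') == 'attribute1') |
    (col('col1') == 'attribute2') |
    (col('col2').isNotNull())
).write.mode('overwrite').parquet('new_parquet.parquet')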

Creating JSON String from Two Columns in PySpark GroupBy

I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
| a| foo| 1|
| b| bar| 1|
| a| biz| 6|
| c| bar| 3|
| c| biz| 2|
+---+----+-----+
What I'm trying to do is group the code and level values into a list of dicts and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid| json |
+---+--------------------------+
| a| '[{"foo":1}, {"biz":6}]' |
| b| '[{"bar":1}]' |
| c| '[{"bar":3}, {"biz":2}]' |
+---+--------------------------+
I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing this by creating a new StringType column called "json" and then using the pandas_udf decorator, but I'm getting errors about unhashable types because, as I've found out, I'm accessing the whole column rather than a single row.
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
..: def to_json(pdf):
..:     return pdf.assign(serial=json.dumps({pdf.code:pdf.level}))
I've considered using string concatenation between the two columns with collect_set, but that feels wrong as well, since it could write something to disk that merely has a string representation of JSON but can't actually be JSON loaded. Any help is appreciated.
There's no need for a pandas_udf in this case. to_json, collect_list and create_map should be all you need:
import pyspark.sql.functions as f
df.groupby('uid').agg(
    f.to_json(
        f.collect_list(
            f.create_map('code', 'level')
        )
    ).alias('json')
).show(3, False)
+---+---------------------+
|uid|json |
+---+---------------------+
|c |[{"bar":3},{"biz":2}]|
|b |[{"bar":1}] |
|a |[{"foo":1},{"biz":6}]|
+---+---------------------+
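Since the concern in the question was ending up with strings that only look like JSON, a quick sanity check is to collect the result and run it through json.loads on the driver; a small sketch:
import json
import pyspark.sql.functions as f

result = df.groupby('uid').agg(
    f.to_json(f.collect_list(f.create_map('code', 'level'))).alias('json')
)

# every value in the json column should round-trip through json.loads
for row in result.collect():
    parsed = json.loads(row['json'])   # e.g. [{"foo": 1}, {"biz": 6}] for uid "a"
    assert isinstance(parsed, list)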

Pyspark replace NaN with NULL

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.
I tried something like this:
some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)
But I got the following error:
ValueError: value should be a float, int, long, string, bool or dict
So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
df = df.replace(float('nan'), None)
df.show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
You can use the .replace function to change the NaN values to null in one line of code.
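If only some columns can legitimately contain NaN, replace also accepts a subset argument, so the substitution can be limited to those columns; a small sketch (the column name 'b' is just taken from the example above):
# only touch column "b"; other columns are left as they are
df = df.replace(float('nan'), None, subset=['b'])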
I finally found the answer after Googling around a bit.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
import pyspark.sql.functions as F
columns = df.columns
for column in columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
It doesn't use na.fill(), but it accomplishes the same result, so I'm happy.
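The same when/isnan logic can also be written as a single select over all columns, which avoids stacking one withColumn per column onto the query plan; a minimal sketch of that variant:
import pyspark.sql.functions as F

# rebuild every column in one pass: NaN becomes null, everything else is kept
df = df.select([
    F.when(F.isnan(F.col(c)), None).otherwise(F.col(c)).alias(c)
    for c in df.columns
])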

Encode a column with integer in pyspark

I have to encode a column in a big DataFrame in pyspark (Spark 2.0). Almost all the values are unique (about 1000 million values).
The obvious choice would be StringIndexer, but for some reason it always fails and kills my Spark session.
Can I somehow write a function like that:
id_dict = dict()

def indexer(x):
    id_dict.setdefault(x, len(id_dict))
    return id_dict[x]
And map it over the DataFrame while id_dict keeps accumulating the items()? Will this dict be synced on each executor?
I need all this to preprocess tuples like ('x', 3, 5) for a spark.mllib ALS model.
Thank you.
StringIndexer keeps all labels in memory, so if values are almost unique, it just won't scale.
You can take the unique values, sort them and add an id, which is expensive but more robust in this case:
from pyspark.sql.functions import monotonically_increasing_id
df = spark.createDataFrame(["a", "b", "c", "a", "d"], "string").toDF("value")
indexer = (df.select("value").distinct()
           .orderBy("value")
           .withColumn("label", monotonically_increasing_id()))
df.join(indexer, ["value"]).show()
# +-----+-----------+
# |value| label|
# +-----+-----------+
# | d|25769803776|
# | c|17179869184|
# | b| 8589934592|
# | a| 0|
# | a| 0|
# +-----+-----------+
Note that labels are not consecutive and can differ from run to run or can change if spark.sql.shuffle.partitions changes. If it is not acceptable you'll have to use RDDs:
from operator import itemgetter
indexer = (df.select("value").distinct()
           .rdd.map(itemgetter(0)).zipWithIndex()
           .toDF(["value", "label"]))
df.join(indexer, ["value"]).show()
# +-----+-----+
# |value|label|
# +-----+-----+
# | d| 0|
# | c| 1|
# | b| 2|
# | a| 3|
# | a| 3|
# +-----+-----+
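Since the goal in the question is to feed (user, item, rating) tuples to ALS, note that the zipWithIndex labels are consecutive but come back as bigint; ALS expects ids that fit in an integer, so a cast is usually the only remaining step. A sketch, reusing the RDD-based indexer above:
from pyspark.sql.functions import col

# join the label back onto the data and cast it down to int for ALS
indexed = (df.join(indexer, ["value"])
             .withColumn("label", col("label").cast("int")))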
