I have a PySpark dataframe, and I want to use two of its columns to build a dictionary.
Input PySpark dataframe:
col1|col2|col3
v | 3 | a
d | 2 | b
q | 9 | g
output:
dict = {'v': 3, 'd': 2, 'q': 9}
how should I do this efficiently?
I believe you can achieve this by converting the DF (with only the two columns you want) to an RDD:
data_rdd = data.select(['col1', 'col2']).rdd
then create an RDD of key-value pairs from the two columns using the rdd.map function:
kp_rdd = data_rdd.map(lambda row : (row[0],row[1]))
and then collect as map:
dict = kp_rdd.collectAsMap()
that's the main idea, sorry I don't have an instance of pyspark running right now to test it.
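Putting those steps together, here is a minimal end-to-end sketch of the idea above (assuming an active SparkSession named spark and the example data from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframe from the question
data = spark.createDataFrame(
    [('v', 3, 'a'), ('d', 2, 'b'), ('q', 9, 'g')],
    ['col1', 'col2', 'col3'])

# Keep only the two columns of interest, map each Row to a (key, value) pair,
# then collect the result on the driver as a Python dict
data_rdd = data.select('col1', 'col2').rdd
kp_rdd = data_rdd.map(lambda row: (row[0], row[1]))
result = kp_rdd.collectAsMap()
print(result)  # {'v': 3, 'd': 2, 'q': 9}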
Given your example, after selecting the applicable columns and converting to an RDD, collectAsMap will produce the desired dictionary without any additional steps:
df.select('col1', 'col2').rdd.collectAsMap()
A few different options here depending on the format you need. Check this out; I am using the structured API. If you need to persist the result, either save it as a JSON dict or preserve the schema with Parquet.
from pyspark.sql.functions import to_json, create_map, col

df = spark.createDataFrame(
    [('v', 3, 'a'),
     ('d', 2, 'b'),
     ('q', 9, 'g')],
    ["c1", "c2", "c3"])
mapDF = df.select(create_map(col("c1"), col("c2")).alias("mapper"))
mapDF.show(3)
+--------+
| mapper|
+--------+
|[v -> 3]|
|[d -> 2]|
|[q -> 9]|
+--------+
dictDF = df.select(to_json(create_map(col("c1"), col("c2")).alias("mapper")).alias("dict"))
dictDF.show()
+-------+
| dict|
+-------+
|{"v":3}|
|{"d":2}|
|{"q":9}|
+-------+
keyValueDF = df.selectExpr("(c1, c2) as keyValueDict").select(to_json(col("keyValueDict")).alias("keyValueDict"))
keyValueDF.show()
+-----------------+
| keyValueDict|
+-----------------+
|{"c1":"v","c2":3}|
|{"c1":"d","c2":2}|
|{"c1":"q","c2":9}|
+-----------------+
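If you do need to persist, a minimal sketch of the two options mentioned at the top (the output paths here are just placeholders):
# write the map column out as JSON (each row becomes a JSON object)
mapDF.write.mode("overwrite").json("/tmp/mapper_json")

# or preserve the MapType schema exactly by writing Parquet
mapDF.write.mode("overwrite").parquet("/tmp/mapper_parquet")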
I have a dictionary:
dict = {10: 1, 50: 2, 200: 3, 500: 4}
And a Dask DataFrame:
+---+---+
| a| b|
+---+---+
| 1| 24|
| 1| 49|
| 2|125|
| 3|400|
+---+---+
I want to groupBy a and get the minimum b value. After that, I want to check which dict key is closest to b and create a new column with the dict value.
As an example, when b=24, the closest key is 10, so I want to assign the value 1.
This is the result I am expecting:
+---+---+-------+
| a| b|closest|
+---+---+-------+
| 1| 24| 1|
| 1| 49| 2|
| 2|125| 3|
| 3|400| 4|
+---+---+-------+
I have found something similar with PySpark. I have not been able to make it run, but it apparently ran for other people. Sharing it anyway for reference.
df = spark.createDataFrame(
    [
        (1, 24),
        (1, 49),
        (2, 125),
        (3, 400)
    ],
    ["a", "b"]
)
dict = {10:1, 50:2, 200: 3, 500: 4}
def func(value, dict):
    closest_key = (
        value if value in dict else builtins.min(
            dict.keys(), key=lambda k: builtins.abs(k - value)
        )
    )
    score = dict.get(closest_key)
    return score
df = (
    df.groupby('a')
    .agg(
        min('b')
    )
).withColumn('closest', func('b', dict))
From what I understand, in the Spark version the calculation was done per row, and I have not been able to replicate that.
Instead of thinking of it as a row-wise operation, you can think of it as a partition-wise operation. If my interpretation is off, you can still use this sample I wrote for the most part with a few tweaks.
I will show a solution with Fugue that lets you just define your logic in Pandas, and then bring it to Dask. This will return a Dask DataFrame.
First some setup, note that df is a Pandas DataFrame. This is meant to represent a smaller sample you can test on:
import pandas as pd
import dask.dataframe as dd
import numpy as np
_dict = {10: 1, 50: 2, 200: 3, 500: 4}
df = pd.DataFrame({"a": [1,1,2,3], "b":[24,49,125,400]})
ddf = dd.from_pandas(df, npartitions=2)
Then we define the logic. This is written to handle one partition, so every value in column a will already be the same.
def logic(df: pd.DataFrame) -> pd.DataFrame:
    # handles the logic for 1 group. all values in a are the same
    min_b = df['b'].min()
    keys = np.array(list(_dict.keys()))
    # closest taken from https://stackoverflow.com/a/10465997/11163214
    closest = keys[np.abs(keys - min_b).argmin()]
    closest_val = _dict[closest]
    df = df.assign(closest=closest_val)
    return df
We can test this on Pandas:
logic(df.loc[df['a'] == 1])
and we'll get:
   a   b  closest
0  1  24        1
1  1  49        1
So then we can just bring it to Dask with Fugue. We just need to call the transform function:
from fugue import transform
ddf = transform(ddf,
                logic,
                schema="*,closest:int",
                partition={"by": "a"},
                engine="dask")
ddf.compute()
This can take in either Pandas or Dask DataFrames and will output the Dask DataFrame because we specified the "dask" engine. There is also a "spark" engine if you want a Spark DataFrame.
Schema is a requirement for distributed computing so we specify the output schema here. We also partition by column a.
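As a rough sketch of the Spark variant mentioned above (assuming an active SparkSession and the same logic function), only the engine argument changes and the output becomes a Spark DataFrame:
from fugue import transform

# Same logic and schema as above; only the engine differs
sdf = transform(df,
                logic,
                schema="*,closest:int",
                partition={"by": "a"},
                engine="spark")
sdf.show()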
Here is another approach for you, friend: this returns a NumPy array, but it will be faster than Spark, and you can easily reindex it.
import numpy as np
a = pydf.toPandas().to_numpy()  # pull the Spark dataframe to the driver as a NumPy array
a = a[:, 1]  # grabs your b column
# bucket each value by the thresholds and fill with the score you want
np.select([a <= 10, a <= 50, a <= 200, a <= 500], [1, 2, 3, 4], a)
I am using the below code snippets to extract a portion of a dataframe column.
df.withColumn("chargemonth",getBookedMonth1(df['chargedate']))
def getBookedMonth1(chargedate):
    booked_year = chargedate[0:3]
    booked_month = chargedate[5:7]
    return booked_year + "-" + booked_month
I have also used getBookedMonth for the same, but I am getting a null value for the new column chargemonth in both cases.
from pyspark.sql.functions import substring
def getBookedMonth(chargedate):
    booked_year = substring(chargedate, 1, 4)
    booked_month = substring(chargedate, 5, 6)
    return booked_year + "-" + booked_month
Is this the correct way to extract a substring from columns in PySpark?
Please DON'T use udf for this! UDFs are known for bad performance.
I would suggest you use Spark built-in functions to manipulate dates. Here is an example:
# DF sample
from pyspark.sql.functions import col, to_date, date_format

data = [(1, "2019-12-05"), (2, "2019-12-06"), (3, "2019-12-07")]
df = spark.createDataFrame(data, ["id", "chargedate"])
# format dates as 'yyyy-MM'
df.withColumn("chargemonth", date_format(to_date(col("chargedate")), "yyyy-MM")).show()
+---+----------+-----------+
| id|chargedate|chargemonth|
+---+----------+-----------+
| 1|2019-12-05| 2019-12|
| 2|2019-12-06| 2019-12|
| 3|2019-12-07| 2019-12|
+---+----------+-----------+
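If you prefer to stay closer to your original substring idea, another built-in option (no UDF, no date parsing) is the substring function, since the first seven characters of 'yyyy-MM-dd' are already the 'yyyy-MM' prefix:
from pyspark.sql.functions import col, substring

# substring(str, pos, len) is 1-based: characters 1..7 of "2019-12-05" are "2019-12"
df.withColumn("chargemonth", substring(col("chargedate"), 1, 7)).show()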
You need to make a new function as a PySpark UDF.
>>> from pyspark.sql.functions import udf
>>> data = [
... {"chargedate":"2019-01-01"},
... {"chargedate":"2019-02-01"},
... {"chargedate":"2019-03-01"},
... {"chargedate":"2019-04-01"}
... ]
>>>
>>> booked_month = udf(lambda a:"{0}-{1}".format(a[0:4], a[5:7]))
>>>
>>> df = spark.createDataFrame(data)
>>> df = df.withColumn("chargemonth",booked_month(df['chargedate'])).drop('chargedate')
>>> df.show()
+-----------+
|chargemonth|
+-----------+
| 2019-01|
| 2019-02|
| 2019-03|
| 2019-04|
+-----------+
>>>
withColumn is the right way to add a column, and drop is used to drop a column.
So I have a dataframe df like so,
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
+---+-----+
I also have a dict like so:
{"COL_B":"abc","COL_C":""}
Now, what I have to do is update df so that each key in the dict becomes a new column name and the key's value becomes the constant value of that column.
Expected df should be like:
+---+-----+-----+-----+
| ID|COL_A|COL_B|COL_C|
+---+-----+-----+-----+
| 1| 123| abc| |
+---+-----+-----+-----+
Now here's my Python (pandas) code to do it, which is working fine...
input_data = pd.read_csv(inputFilePath,dtype=str)
for key, value in mapRow.iteritems():  # mapRow is the dict
    if value is None:
        input_data[key] = ""
    else:
        input_data[key] = value
Now I'm migrating this code to PySpark and would like to know how to do it there.
Thanks for the help.
To combine RDDs, we use zip or join. Below is the explanation using zip: zip concatenates them, and map flattens the result.
from pyspark.sql import Row
rdd_1 = sc.parallelize([Row(ID=1,COL_A=2)])
rdd_2 = sc.parallelize([Row(COL_B="abc",COL_C=" ")])
result_rdd = rdd_1.zip(rdd_2).map(lambda x: [j for i in x for j in i])
NOTE: I didn't have PySpark with me currently, so this isn't tested.
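As an aside, if the goal is just to add constant-valued columns from the dict (mirroring the pandas loop in the question), a simple DataFrame-level sketch using withColumn and lit (assuming df is the Spark DataFrame and mapRow is the dict) would be:
from pyspark.sql.functions import lit

for key, value in mapRow.items():
    # missing values become empty strings, as in the pandas version
    df = df.withColumn(key, lit(value if value is not None else ""))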
I want to know how to map values in a specific column in a dataframe.
I have a dataframe which looks like:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
+-----+-------+
| col1| col2|
+-----+-------+
|india| japan|
| usa|uruguay|
+-----+-------+
I have a dictionary from where I want to map the values.
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])
The output I want is:
+-----+-------+--------+--------+
| col1| col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india| japan| ind| jpn|
| usa|uruguay| us| urg|
+-----+-------+--------+--------+
I have tried using the lookup function but it doesn't work. It throws error SPARK-5063. Following is my approach which failed:
def map_val(x):
    return dicts.lookup(x)[0]
myfun = udf(lambda x: map_val(x), StringType())
df = df.withColumn('col1_map', myfun('col1')) # doesn't work
df = df.withColumn('col2_map', myfun('col2')) # doesn't work
I think the easiest way is just to use a simple dictionary and df.withColumn.
from itertools import chain
from pyspark.sql.functions import create_map, lit
simple_dict = {'india':'ind', 'usa':'us', 'japan':'jpn', 'uruguay':'urg'}
mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
df = df.withColumn('col1_map', mapping_expr[df['col1']])\
.withColumn('col2_map', mapping_expr[df['col2']])
df.show(truncate=False)
udf way
I would suggest you change the list of tuples to a dict and broadcast it so it can be used in the udf.
dicts = sc.broadcast(dict([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]))
from pyspark.sql import functions as f
from pyspark.sql import types as t
def newCols(x):
    return dicts.value[x]
callnewColsUdf = f.udf(newCols, t.StringType())
df.withColumn('col1_map', callnewColsUdf(f.col('col1')))\
.withColumn('col2_map', callnewColsUdf(f.col('col2')))\
.show(truncate=False)
which should give you
+-----+-------+--------+--------+
|col1 |col2 |col1_map|col2_map|
+-----+-------+--------+--------+
|india|japan |ind |jpn |
|usa |uruguay|us |urg |
+-----+-------+--------+--------+
join way (slower than udf way)
All you have to do is change the dicts RDD to a dataframe too and use two joins with aliasing as follows:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]).toDF(['key', 'value'])
from pyspark.sql import functions as f
df.join(dicts, df['col1'] == dicts['key'], 'inner')\
.select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
.join(dicts, df['col2'] == dicts['key'], 'inner') \
.select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
.show(truncate=False)
which should give you the same result
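Since the lookup dataframe is tiny, one optional tweak (the same joins as above, just with a broadcast hint so Spark ships the small dict table to every executor instead of shuffling) would be:
from pyspark.sql import functions as f

df.join(f.broadcast(dicts), df['col1'] == dicts['key'], 'inner')\
    .select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
    .join(f.broadcast(dicts), df['col2'] == dicts['key'], 'inner')\
    .select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
    .show(truncate=False)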
Similar to Ali AzG, but pulling it all out into a handy little method if anyone finds it useful
from itertools import chain
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from typing import Dict
def map_column_values(df: DataFrame, map_dict: Dict, column: str, new_column: str = "") -> DataFrame:
    """Handy method for mapping column values from one value to another

    Args:
        df (DataFrame): Dataframe to operate on
        map_dict (Dict): Dictionary containing the values to map from and to
        column (str): The column containing the values to be mapped
        new_column (str, optional): The name of the column to store the mapped values in.
            If not specified the values will be stored in the original column

    Returns:
        DataFrame
    """
    spark_map = F.create_map([F.lit(x) for x in chain(*map_dict.items())])
    return df.withColumn(new_column or column, spark_map[df[column]])
This can be used as follows
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local[3]").getOrCreate()
df = spark.createDataFrame([Row(A=0), Row(A=1)])
df = map_column_values(df, map_dict={0:"foo", 1:"bar"}, column="A", new_column="B")
df.show()
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#+---+---+
#| A| B|
#+---+---+
#| 0|foo|
#| 1|bar|
#+---+---+
I am having difficulty in converting an RDD of the following structure to a dataframe in Spark using Python.
df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]
After converting, my dataframe should look like the following:
       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0
I was initially thinking of converting the above RDD structure to the following:
df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}
Then use Python's pandas module, pand=pd.DataFrame(dat2), and then convert the pandas dataframe back to a Spark dataframe using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
With data like this:
rdd = sc.parallelize([
['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])
flatten the records:
def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value
records = rdd.flatMap(to_record)
convert to DataFrame:
df = records.toDF(["user", "item", "value"])
pivot:
result = df.groupBy("item").pivot("user").sum()
result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1| 2|null|
## | itm2|null| 3|
## | itm3| 3| 5|
## |itm22|null| 6|
## +-----+----+----+
Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.
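If you still want the exact pandas-style layout from the question (items as the index, NaN for missing entries), a small follow-up, assuming the pivoted result is small enough to collect to the driver:
# collect the small pivoted result and reshape it like the expected output
pdf = result.toPandas().set_index("item").sort_index()
print(pdf)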