Pardon my ignorance, I am new to PySpark. I'm trying to improve a udf that creates a new column count_adj based on values from another column a_type, using a dictionary. How do I account for None / NULL values in this process when creating the new column? This is super easy in pandas (df['adj_count'] = df.a_type.map(count_map)) but I'm struggling to do it in PySpark.
Sample data / imports:
# all imports used -- not just for this portion of the script
from pyspark.sql import SparkSession, HiveContext, SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark import sql
import pyspark.sql.functions as F
import random
from pyspark.sql.functions import lit
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from datetime import datetime
from datetime import date
from datetime import timedelta
from pyspark.sql import Window
from pyspark.sql.functions import broadcast
from pyspark.sql.functions import rank, row_number, max as max_, col
from itertools import chain
import sys
import os
spark = SparkSession.builder.appName('a_type_tests').getOrCreate()
# note: sample data has results from the original udf for comparison
dataDictionary = [
(26551, 491, '2022-01-22', '740', -1, 16),
(24192, 338, '2022-01-22', '740', -1, 16),
(26555, 3013, '2022-01-22', '740', -1, 16),
(26571, 937, '2022-01-22', '740', -1, 16),
(24376, 371, '2022-01-22', '740', -1, 16),
(17716, 118, '2022-01-22', '740', -1, 16),
(26554, 3013, '2022-01-22', '740', -1, 16),
(26734, 105, '2022-01-22', '740', -1, 16),
(26051, 415, '2022-01-22', '600', -1, 8),
(26602, 501, '2022-01-22', '740', -1, 16),
(26125, 501, '2022-01-22', None, -1, 0)
]
sdf = spark.createDataFrame(data=dataDictionary, schema = ['id', 'loc_id', 'a_date', 'a_type', 'adj_val', 'udf_original'])
sdf.printSchema()
sdf.show(truncate=False)
The original udf is similar to:
def count_adj(a_type):
    if a_type is None:
        return 0
    elif a_type in ('703','704','705','708','900','910'):
        return 4
    elif a_type in ('701','702'):
        return 2
    elif a_type in ('711','712'):
        return 1
    elif a_type in ('600', '704'):
        return 8
    elif a_type in ('740'):
        return 16
    elif a_type in ('305','306'):
        return 32
    elif a_type in ('601','612','615'):
        return 64
    else:
        return 128
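For reference, a udf like this would typically be registered and applied roughly as below (just a sketch; the registration step is not shown in the question, and the udf_original column in the sample data holds its results for comparison):
# hypothetical registration/application of the original udf
count_adj_udf = F.udf(count_adj, IntegerType())
sdf_with_udf = sdf.withColumn('count_adj', count_adj_udf(F.col('a_type')))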
I've created a dictionary to correspond to these values.
# remove 0:None type pairing because None is not iterable to invert dict
count_map = {1: ['711','712'],
             2: ['701','702'],
             4: ['703','704','705','708','900','910'],
             8: ['600', '704'],
             16: ['740'],
             32: ['305','306'],
             64: ['601','612','615'],
             128: ['1600', '1601', '1602']
             }
# invert dict
count_map = {c:key for key, vals in count_map.items() for c in vals}
# create None mapping manually
count_map[None] = 0
Searching SO I came across this approach, which resulted in the following error:
# Code Tried:
# Changes None type to NULL -- fine but how do I account for these None/Null Values in my dict?
map_expr = F.create_map([lit(x) for x in chain(*count_map.items())])
sdf2 = sdf.withColumn('count_adj', map_expr.getItem('a_type'))
# or:
sdf2 = sdf.withColumn('count_adj',map_expr[col('a_type')]).show()
# Error
Py4JJavaError: An error occurred while calling o334.showString.
: java.lang.RuntimeException: Cannot use null as map key.
How do I account for None / NULL types when using a dictionary to create a new column based on values from another column? Is it possible to include a NULL check in my map expression or something else entirely?
For completeness, I removed the None type from the dictionary and utilized a method similar to Karthik's answer and a combo of other SO posts mentioned in the question.
My final solution relied on the code below and using .when() and .isNull() to account for None / NULL conversions.
# Original Mapping
# remove 0:None type pairing because None is not iterable to invert dict
count_map = {1: ['711','712'],
             2: ['701','702'],
             4: ['703','704','705','708','900','910'],
             8: ['600', '704'],
             16: ['740'],
             32: ['305','306'],
             64: ['601','612','615'],
             128: ['1600', '1601', '1602']
             }
# invert dict
count_map = {c:key for key, vals in count_map.items() for c in vals}
# New below:
map_expr = F.create_map([lit(x) for x in chain(*count_map.items())])
sdf2 = sdf.withColumn('count_adj', F.when(col('a_type').isNull(), 0).otherwise(map_expr.getItem(col('a_type'))))
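One further tweak on the same expression (a sketch building on map_expr above, not part of the original solution): the original udf also returned 128 for codes missing from the mapping, and F.coalesce can supply that default for unmapped keys while .when()/.isNull() still handles NULLs:
# sketch: NULL -> 0, unmapped codes -> 128, everything else comes from the map
sdf2 = sdf.withColumn(
    'count_adj',
    F.when(col('a_type').isNull(), 0)
     .otherwise(F.coalesce(map_expr.getItem(col('a_type')), lit(128)))
)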
Per the behaviour of create_map, the key columns must all have the same data type and can't be null, so the map key won't accept None/NULL values.
Instead of the above code, you can use the when function, which gives your desired output as shown below:
newDF = sdf.withColumn("count_adj", F.when(F.col("a_type").isNull(), 0)
                       .when(F.col("a_type").isin('711','712'), 1)
                       .when(F.col("a_type").isin('701','702'), 2)
                       .when(F.col("a_type").isin('703','704','705','708','900','910'), 4)
                       .when(F.col("a_type").isin('600', '704'), 8)
                       .when(F.col("a_type").isin('740'), 16)
                       .when(F.col("a_type").isin('305','306'), 32)
                       .when(F.col("a_type").isin('601','612','615'), 64)
                       .otherwise(128))
Related
How can I extract column names along with the conditions applied to them from a SQL query string using Python, so that I can map the result to another table and get the required data?
query = "SELECT COUNT(*) as cnt FROM dbo.main_table WHERE (VKRKA10001920015 = 1 AND age IN (15, 16, 17, 18, 19) AND income IN (101, 102, 301, 302, 305))"
Required_Output = {
    'VKRKA10001920015': 1,
    'age': (15, 16, 17, 18, 19),
    'income': (101, 102, 301, 302, 305),
}
As of now I am able to extract only the column names using the sql-metadata Python package, but I need the conditions as well to perform other operations:
from sql_metadata.compat import get_query_columns
print(get_query_columns(query))
['VKRKA10001920015', 'age', 'householdincome']
To identify the tokens ("IN", "=", "LIKE") I used get_tokens(), which recursively collects the tokens into the extracted_keys dictionary. Because the function works recursively, extracted_keys stores the token objects rather than the actual key names, so a second function, create_dict(), builds a proper dictionary out of extracted_keys. For now this extracts columns along with their conditions from the SQL WHERE clause for the "IN", "=" and "LIKE" operators.
To install the sqlparse module:
pip install sqlparse
import sqlparse
def get_tokens(where):
    identifier = None
    extracted_keys = {}
    for i in where.tokens:
        try:
            name = i.get_real_name()
            if name and isinstance(i, sqlparse.sql.Identifier):
                name = i.get_real_name()
                identifier = i
                extracted_keys[identifier] = i.value
            elif identifier and isinstance(i, sqlparse.sql.Parenthesis):
                extracted_keys[identifier] = i.value
            elif i and "in" not in i.value.lower() and "or" not in i.value.lower() and isinstance(i, sqlparse.sql.Comparison):
                if "=" in i.value:
                    equal_index = i.value.index("=")
                    if i.value[equal_index - 1] not in ["!", ">", "<"]:
                        key, value = i.value.split("=")[0], i.value.split("=")[1]
                        extracted_keys[key] = value
                if "!=" in i.value:
                    key, value = i.value.split("!=")[0], i.value.split("!=")[1]
                    extracted_keys[key] = value
                elif "like" in i.value.lower():
                    key, value = i.value.lower().split("like")[0].upper(), i.value.lower().split("like")[1]
                    extracted_keys[key] = value
            else:
                extracted_keys.update(get_tokens(i))
        except Exception as error:
            pass
    return extracted_keys
def create_dict(extracted_keys):
    cols = {}
    for key, value in extracted_keys.items():
        try:
            if key.get_real_name():
                cols[key.get_real_name()] = value
        except Exception as e:
            cols[key] = value
    return cols
For example:
query = "SELECT COUNT(*) as cnt FROM dbo.main_table WHERE (VKRKA10001920015 = 1 AND age IN (15, 16, 17, 18, 19) AND income IN (101, 102, 301, 302, 305))"
parsed = sqlparse.parse(query)
where = parsed[0][-1]
extracted_keys = get_tokens(where)
extracted_keys = create_dict(extracted_keys)
Output
{
    'VKRKA10001920015': 1,
    'age': (15, 16, 17, 18, 19),
    'income': (101, 102, 301, 302, 305),
}
I need to extract the date and the hour from utc_timestamp into two different columns, depending on the time zone. The time zone name is determined by the id column via a configuration constant.
Input DF:
+-------------+---+
|utc_timestamp| id|
+-------------+---+
|1608000000782|  1|
|1608000240782|  2|
+-------------+---+

Output DF:
+-------------+---+----------+----+
|utc_timestamp| id|      date|hour|
+-------------+---+----------+----+
|1608000000782|  1|2020-12-14|  20|
|1608000240782|  2|2020-12-15|  11|
+-------------+---+----------+----+
I have a pandas_udf that extracts one column at a time, so I have to create it twice:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DateType, IntegerType
import pandas as pd
import pytz

TIMEZONE_LIST = {1: 'America/Chicago', 2: 'Asia/Tokyo'}

class TimezoneUdfProvider(object):
    def __init__(self):
        self.extract_date_udf = pandas_udf(self._extract_date, DateType(), PandasUDFType.SCALAR)
        self.extract_hour_udf = pandas_udf(self._extract_hour, IntegerType(), PandasUDFType.SCALAR)

    def _extract_date(self, utc_timestamps: pd.Series, ids: pd.Series) -> pd.Series:
        return pd.Series([extract_date(c1, c2) for c1, c2 in zip(utc_timestamps, ids)])

    def _extract_hour(self, utc_timestamps: pd.Series, ids: pd.Series) -> pd.Series:
        return pd.Series([extract_hour(c1, c2) for c1, c2 in zip(utc_timestamps, ids)])

def extract_date(utc_timestamp: int, id: str):
    timezone_name = TIMEZONE_LIST[id]
    timezone_nw = pytz.timezone(timezone_name)
    return pd.datetime.fromtimestamp(utc_timestamp / 1000e00, tz=timezone_nw).date()

def extract_hour(utc_timestamp: int, id: str) -> int:
    timezone_name = TIMEZONE_LIST[id]
    timezone_nw = pytz.timezone(timezone_name)
    return pd.datetime.fromtimestamp(utc_timestamp / 1000e00, tz=timezone_nw).hour

def extract_from_utc(df: DataFrame) -> DataFrame:
    timezone_udf1 = TimezoneUdfProvider()
    df_with_date = df.withColumn('date', timezone_udf1.extract_date_udf(f.col(utc_timestamp), f.col(id)))
    timezone_udf2 = TimezoneUdfProvider()
    df_with_hour = df_with_date.withColumn('hour', timezone_udf2.extract_hour_udf(f.col(utc_timestamp), f.col(id)))
    return df_with_hour
Is there a better way to do it? Without a need to use the same udf provider twice?
You can do this without a udf, using Spark's built-in functions.
We can use create_map to map the dictionary and create a new TimeZone column, then convert with from_unixtime and from_utc_timestamp, using the newly mapped column as the time zone. Once we have the timestamp in the right time zone, we can fetch the date and Hour fields.
TIMEZONE_LIST = {1: 'America/Chicago', 2: 'Asia/Tokyo'}

import pyspark.sql.functions as F
from pyspark.sql.functions import col
from itertools import chain

map_exp = F.create_map([F.lit(i) for i in chain(*TIMEZONE_LIST.items())])

final = (df.withColumn("TimeZone", map_exp.getItem(col("id")))
           .withColumn("Timestamp",
                       F.from_utc_timestamp(F.from_unixtime(F.col("utc_timestamp") / 1000), F.col("TimeZone")))
           .withColumn("date", F.to_date("Timestamp"))
           .withColumn("Hour", F.hour("Timestamp"))
           .drop("Timestamp"))
final.show()
final:pyspark.sql.dataframe.DataFrame = [utc_timestamp: long, id: long ... 3 more fields]
+-------------+---+---------------+----------+----+
|utc_timestamp| id| TimeZone| date|Hour|
+-------------+---+---------------+----------+----+
|1608000000782| 1|America/Chicago|2020-12-14| 20|
|1608000240782| 2| Asia/Tokyo|2020-12-15| 11|
+-------------+---+---------------+----------+----+
EDIT: replacing create_map with a udf:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

TIMEZONE_LIST = {1: 'America/Chicago', 2: 'Asia/Tokyo'}

def fun(x):
    return TIMEZONE_LIST.get(x, None)

map_udf = F.udf(fun, StringType())

final = (df.withColumn("TimeZone", map_udf("id"))
           .withColumn("Timestamp",
                       F.from_utc_timestamp(F.from_unixtime(F.col("utc_timestamp") / 1000), F.col("TimeZone")))
           .withColumn("date", F.to_date("Timestamp"))
           .withColumn("Hour", F.hour("Timestamp"))
           .drop("Timestamp"))
final.show()
I need to delete accents from characters in Spanish and other languages in different datasets.
I already wrote a function, based on the code provided in this post, that removes the accents. The problem is that the function is slow because it uses a UDF.
I'm just wondering if I can improve the performance of my function to get results in less time, because this approach is fine for small dataframes but not for big ones.
Thanks in advance.
Here is the code; you will be able to run it as it is presented:
# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf, col
import unicodedata
# Building a simple dataframe:
schema = StructType([StructField("city", StringType(), True),
                     StructField("country", StringType(), True),
                     StructField("population", IntegerType(), True)])
countries = ['Venezuela', 'US#A', 'Brazil', 'Spain']
cities = ['Maracaibó', 'New York', ' São Paulo ', '~Madrid']
population = [37800000,19795791,12341418,6489162]
# Dataframe:
df = sqlContext.createDataFrame(list(zip(cities, countries, population)), schema=schema)
df.show()
class Test():
    def __init__(self, df):
        self.df = df

    def clearAccents(self, columns):
        """This function deletes accents in strings column dataFrames,
        it does not eliminate main characters, but only deletes special tildes.

        :param columns  String or a list of column names.
        """
        # Filters all string columns in dataFrame
        validCols = [c for (c, t) in filter(lambda t: t[1] == 'string', self.df.dtypes)]

        # If None or [] is provided with column parameter:
        if (columns == "*"): columns = validCols[:]

        # Receives a string as an argument
        def remove_accents(inputStr):
            # first, normalize strings:
            nfkdStr = unicodedata.normalize('NFKD', inputStr)
            # Keep chars that has no other char combined (i.e. accents chars)
            withOutAccents = u"".join([c for c in nfkdStr if not unicodedata.combining(c)])
            return withOutAccents

        function = udf(lambda x: remove_accents(x) if x != None else x, StringType())
        exprs = [function(col(c)).alias(c) if (c in columns) and (c in validCols) else c for c in self.df.columns]
        self.df = self.df.select(*exprs)

foo = Test(df)
foo.clearAccents(columns="*")
foo.df.show()
One possible improvement is to build a custom Transformer, which will handle Unicode normalization, and a corresponding Python wrapper. It should reduce the overall overhead of passing data between the JVM and Python, and it doesn't require any modifications in Spark itself or access to private APIs.
On the JVM side you'll need a transformer similar to this one:
package net.zero323.spark.ml.feature

import java.text.Normalizer
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.util._
import org.apache.spark.sql.types.{DataType, StringType}

class UnicodeNormalizer (override val uid: String)
  extends UnaryTransformer[String, String, UnicodeNormalizer] {

  def this() = this(Identifiable.randomUID("unicode_normalizer"))

  private val forms = Map(
    "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
    "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
  )

  val form: Param[String] = new Param(this, "form", "unicode form (one of NFC, NFD, NFKC, NFKD)",
    ParamValidators.inArray(forms.keys.toArray))

  def setN(value: String): this.type = set(form, value)

  def getForm: String = $(form)

  setDefault(form -> "NFKD")

  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    (s: String) => Normalizer.normalize(s, normalizerForm)
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = StringType
}
Corresponding build definition (adjust Spark and Scala versions to match your Spark deployment):
name := "unicode-normalization"
version := "1.0"
crossScalaVersions := Seq("2.11.12", "2.12.8")
organization := "net.zero323"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
On the Python side you'll need a wrapper similar to this one:
from pyspark.ml.param.shared import *
# from pyspark.ml.util import keyword_only  # in Spark < 2.0
from pyspark import keyword_only
from pyspark.ml.wrapper import JavaTransformer

class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, form="NFKD", inputCol=None, outputCol=None):
        super(UnicodeNormalizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
        self.form = Param(self, "form",
                          "unicode form (one of NFC, NFD, NFKC, NFKD)")
        # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, form="NFKD", inputCol=None, outputCol=None):
        # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setForm(self, value):
        return self._set(form=value)

    def getForm(self):
        return self.getOrDefault(self.form)
Build Scala package:
sbt +package
and include it when you start the shell or submit a job. For example, for a Spark build with Scala 2.11:
bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
--driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
and you should be ready to go. All that is left is a little bit of regexp magic:
from pyspark.sql.functions import regexp_replace

normalizer = UnicodeNormalizer(form="NFKD",
                               inputCol="text", outputCol="text_normalized")

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, " São Paulo "), (4, "~Madrid")
]).toDF(["id", "text"])

(normalizer
    .transform(df)
    .select(regexp_replace("text_normalized", "\p{M}", ""))
    .show())
## +--------------------------------------+
## |regexp_replace(text_normalized,\p{M},)|
## +--------------------------------------+
## | Maracaibo|
## | New York|
## | Sao Paulo |
## | ~Madrid|
## +--------------------------------------+
Please note that this follows the same conventions as the built-in text transformers and is not null safe. You can easily correct for that by checking for null in createTransformFunc.
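If you would rather not touch the Scala code, a minimal workaround on the Python side (a sketch only, assuming the input column is called "text" as in the example above) is to make the column null-safe before the transformer runs:
# sketch: fill NULLs in the input column so the transformer never sees null strings
df_safe = df.na.fill({"text": ""})
normalized = normalizer.transform(df_safe)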
Another way of doing it, using the Python Unicode database:
import unicodedata
import sys

from pyspark.sql.functions import translate, regexp_replace

def make_trans():
    matching_string = ""
    replace_string = ""

    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        if "WITH" in name:
            try:
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                pass

    return matching_string, replace_string

def clean_text(c):
    matching_string, replace_string = make_trans()
    return translate(
        regexp_replace(c, "\p{M}", ""),
        matching_string, replace_string
    ).alias(c)
So now let's test it:
df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, " São Paulo "), (4, "~Madrid"),
    (5, "São Paulo"), (6, "Maracaibó")
]).toDF(["id", "text"])
df.select(clean_text("text")).show()
## +---------------+
## | text|
## +---------------+
## | Maracaibo|
## | New York|
## | Sao Paulo |
## | ~Madrid|
## | Sao Paulo|
## | Maracaibo|
## +---------------+
Credit to @zero323.
This solution is Python only, but is only useful if the number of possible accents is low (e.g. one single language like Spanish) and the character replacements are manually specified.
There seems to be no built-in way to do what you asked for directly without UDFs, however you can chain many regexp_replace calls to replace each possible accented character. I tested the performance of this solution and it turns out that it only runs faster if you have a very limited set of accents to replace. If that's the case it can be faster than UDFs because it is optimized outside of Python.
from pyspark.sql.functions import col, regexp_replace

accent_replacements_spanish = [
    (u'á', 'a'), (u'Á', 'A'),
    (u'é', 'e'), (u'É', 'E'),
    (u'í', 'i'), (u'Í', 'I'),
    (u'ò', 'o'), (u'Ó', 'O'),
    (u'ú|ü', 'u'), (u'Ú|Ű', 'U'),
    (u'ñ', 'n'),
    # see http://stackoverflow.com/a/18123985/3810493 for other characters
    # this will convert other non ASCII characters to a question mark:
    ('[^\x00-\x7F]', '?')
]

def remove_accents(column):
    r = col(column)
    for a, b in accent_replacements_spanish:
        r = regexp_replace(r, a, b)
    return r.alias('remove_accents(' + column + ')')

df = sqlContext.createDataFrame([['Olà'], ['Olé'], ['Núñez']], ['str'])
df.select(remove_accents('str')).show()
I haven't compared the performance with the other responses and this function is not as general, but it is at least worth considering because you don't need to add Scala or Java to your build process.
Here's my implementation.
Apart from accents I also remove special characters, because I needed to pivot and save a table, and you can't save a table with a column name that contains " ,;{}()\n\t=\/" characters.
import re

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
from unidecode import unidecode

spark = SparkSession.builder.getOrCreate()

data = [(1, " \\ / \\ {____} aŠdá_ \t = \n () asd ____aa 2134_ 23_"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"),
        (3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
df.show()

for col_name in ["txt"]:
    tmp_dict = {}
    for col_value in [row[0] for row in df.select(col_name).distinct().toLocalIterator()
                      if row[0] is not None]:
        new_col_value = re.sub("[ ,;{}()\\n\\t=\\\/]", "_", col_value)
        new_col_value = re.sub('_+', '_', new_col_value)
        if new_col_value.startswith("_"):
            new_col_value = new_col_value[1:]
        if new_col_value.endswith("_"):
            new_col_value = new_col_value[:-1]
        new_col_value = unidecode(new_col_value)
        tmp_dict[col_value] = new_col_value.lower()
    df = df.na.replace(to_replace=tmp_dict, subset=[col_name])
df.show()
If you can't access external libraries (like me) you can replace unidecode with:
new_col_value = new_col_value.translate(str.maketrans(
    "ä,ö,ü,ẞ,á,ä,č,ď,é,ě,í,ĺ,ľ,ň,ó,ô,ŕ,š,ť,ú,ů,ý,ž,Ä,Ö,Ü,ẞ,Á,Ä,Č,Ď,É,Ě,Í,Ĺ,Ľ,Ň,Ó,Ô,Ŕ,Š,Ť,Ú,Ů,Ý,Ž",
    "a,o,u,s,a,a,c,d,e,e,i,l,l,n,o,o,r,s,t,u,u,y,z,A,O,U,S,A,A,C,D,E,E,I,L,L,N,O,O,R,S,T,U,U,Y,Z"))
I have two files, A and B, which are exactly the same. I am trying to perform inner and outer joins on these two dataframes. Since all of the columns are duplicates, the existing answers were of no help.
The other questions that I have gone through contain a col or two as duplicate, my issue is that the whole files are duplicates of each other: both in data and in column names.
My code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrameReader, DataFrameWriter
from datetime import datetime
import time
# #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("All imports were successful.")
df = spark.read.orc(
    's3://****'
)
print("First dataframe read with headers set to True")

df2 = spark.read.orc(
    's3://****'
)
print("Second dataframe read with headers set to True")

# df3 = df.join(df2, ['c_0'], "outer")
# df3 = df.join(
#     df2,
#     df["column_test_1"] == df2["column_1"],
#     "outer"
# )
df3 = df.alias('l').join(df2.alias('r'), on='c_0')  # .collect()
print("Dataframes have been joined successfully.")

output_file_path = 's3://****'

df3.write.orc(
    output_file_path
)
print("Dataframe has been written to csv.")
job.commit()
The error that I am facing is:
pyspark.sql.utils.AnalysisException: u'Duplicate column(s): "c_4", "c_38", "c_13", "c_27", "c_50", "c_16", "c_23", "c_24", "c_1", "c_35", "c_30", "c_56", "c_34", "c_7", "c_46", "c_49", "c_57", "c_45", "c_31", "c_53", "c_19", "c_25", "c_10", "c_8", "c_14", "c_42", "c_20", "c_47", "c_36", "c_29", "c_15", "c_43", "c_32", "c_5", "c_37", "c_18", "c_54", "c_3", "__created_at__", "c_51", "c_48", "c_9", "c_21", "c_26", "c_44", "c_55", "c_2", "c_17", "c_40", "c_28", "c_33", "c_41", "c_22", "c_11", "c_12", "c_52", "c_6", "c_39" found, cannot save to file.;'
End of LogType:stdout
There is no shortcut here. Pyspark expects the left and right dataframes to have distinct sets of field names (with the exception of the join key).
One solution would be to prefix each field name with either a "left_" or "right_" as follows:
# Obtain columns lists
left_cols = df.columns
right_cols = df2.columns
# Prefix each dataframe's field with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])
# Perform join on the now-prefixed key columns
df3 = df.join(df2, df['left_c_0'] == df2['right_c_0'])
Here is a helper function to join two dataframes adding aliases:
def join_with_aliases(left, right, on, how, right_prefix):
    renamed_right = right.selectExpr(
        [
            col + f" as {col}{right_prefix}"
            for col in right.columns
            if col not in on
        ]
        + on
    )
    return left.join(renamed_right, on=on, how=how)
and here an example in how to use it:
df1 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))
df2 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))
join_with_aliases(
    left=df1,
    right=df2,
    on=["id"],
    how="inner",
    right_prefix="_right"
).show()
+---+-----+-----------+
| id|value|value_right|
+---+-----+-----------+
|  1|    a|          a|
|  3|    c|          c|
|  2|    b|          b|
+---+-----+-----------+
I did something like this, but in Scala; you can convert the same into PySpark as well (a PySpark sketch follows below the Scala version).
Rename the column names in each dataframe:
dataFrame1.columns.foreach(columnName => {
  dataFrame1 = dataFrame1.select(dataFrame1.columns.head, dataFrame1.columns.tail: _*).withColumnRenamed(columnName, s"left_$columnName")
})

dataFrame2.columns.foreach(columnName => {
  dataFrame2 = dataFrame2.select(dataFrame2.columns.head, dataFrame2.columns.tail: _*).withColumnRenamed(columnName, s"right_$columnName")
})
Now join by mentioning the column names
resultDF = dataframe1.join(dataframe2, dataframe1("left_c_0") === dataframe2("right_c_0"))
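A rough PySpark equivalent of the renaming approach above (a sketch only; df and df2 are the two dataframes from the question and c_0 is assumed to be the join key):
import pyspark.sql.functions as F

# prefix every column, then join on the prefixed key
left = df.select([F.col(c).alias("left_" + c) for c in df.columns])
right = df2.select([F.col(c).alias("right_" + c) for c in df2.columns])
resultDF = left.join(right, left["left_c_0"] == right["right_c_0"], "inner")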
I am trying to modify the fuzzywuzzy library. The process module returns the score and the array element, but I want it to return the index of the element as well, i.e. a group of (item, score, index).
Here is what I tried:
#!/usr/bin/env python
# encoding: utf-8
from fuzzywuzzy import fuzz
from fuzzywuzzy import utils
import heapq
import logging
from functools import partial
default_scorer = fuzz.WRatio
default_processor = utils.full_process
def extractWithoutOrder(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):

    def no_process(x):
        return x

    try:
        if choices is None or len(choices) == 0:
            raise StopIteration
    except TypeError:
        pass

    if processor is None:
        processor = no_process

    processed_query = processor(query)

    if len(processed_query) == 0:
        logging.warning(u"Applied processor reduces input query to empty string, "
                        "all comparisons will have score 0. "
                        "[Query: \'{0}\']".format(query))

    # Don't run full_process twice
    if scorer in [fuzz.WRatio, fuzz.QRatio,
                  fuzz.token_set_ratio, fuzz.token_sort_ratio,
                  fuzz.partial_token_set_ratio, fuzz.partial_token_sort_ratio,
                  fuzz.UWRatio, fuzz.UQRatio] \
            and processor == utils.full_process:
        processor = no_process

    # Only process the query once instead of for every choice
    if scorer in [fuzz.UWRatio, fuzz.UQRatio]:
        pre_processor = partial(utils.full_process, force_ascii=False)
        scorer = partial(scorer, full_process=False)
    elif scorer in [fuzz.WRatio, fuzz.QRatio,
                    fuzz.token_set_ratio, fuzz.token_sort_ratio,
                    fuzz.partial_token_set_ratio, fuzz.partial_token_sort_ratio]:
        pre_processor = partial(utils.full_process, force_ascii=True)
        scorer = partial(scorer, full_process=False)
    else:
        pre_processor = no_process

    processed_query = pre_processor(processed_query)

    count = -1
    try:
        # See if choices is a dictionary-like object.
        for key, choice in choices.items():
            count = count + 1
            processed = pre_processor(processor(choice))
            score = scorer(processed_query, processed)
            if score >= score_cutoff:
                yield (choice, score, key, count)
    except AttributeError:
        # It's a list; just iterate over it.
        for choice in choices:
            count = count + 1
            processed = pre_processor(processor(choice))
            score = scorer(processed_query, processed)
            if score >= score_cutoff:
                yield (choice, score, count)

def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):
    sl = extractWithoutOrder(query, choices, processor, scorer)
    return heapq.nlargest(limit, sl, key=lambda i: i[1]) if limit is not None else \
        sorted(sl, key=lambda i: i[1], reverse=True)
When I tried to implement it, the result was the same as what fuzzywuzzy showed previously.
import process as p
box=['ness', 'apple','banana','carrot','duck','eagle','fish','gate','hitler']
p.extract('b',box)
[('banana', 90), ('apple', 0), ('carrot', 0), ('duck', 0), ('eagle', 0)]
But what I am expecting it to return is:
[('banana', 90, 2), ('apple', 0, 1), ('carrot', 0, 3), ('duck', 0, 4), ('eagle', 0, 5)]
Kindly let me know your suggestions.
As an alternative to FuzzyWuzzy you could use RapidFuzz (I am the author) which will return the index as well:
from rapidfuzz import process

box = ['ness', 'apple', 'banana', 'carrot', 'duck', 'eagle', 'fish', 'gate', 'hitler']
process.extract('b', box)
which returns
[('banana', 90, 2), ('apple', 0, 1), ('carrot', 0, 3), ('duck', 0, 4), ('eagle', 0, 5)]
For those looking for the answer: you can pass a dictionary to process.extract.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
box = ['apple','banana','carrot','duck','eagle']
box_dict = {i: val for i, val in enumerate(box)}
process.extract("b", box_dict, scorer=fuzz.WRatio)
# O/P -> [("banana", 90, 1), ('apple', 0, 0), ('carrot', 0, 2), ('duck', 0, 3), ('eagle', 0, 4)]