I'm getting a syntax error on the query below:
df_result = df_checkout.join(
    df_checkin,
    (
        (df_checkout.product == df_checkin.product)
        (df_checkout.host == df_checkin.host)
    ),
    how='full_outer'
).where(
    df_checkout.rank =
        F.when(
            ((df_checkout.rank = df_checkin.rank)
             and (F.unix_timestamp(df_checkout.checkout_date, 'MM/dd/YYYY HH:MI:SS')
                  <= F.unix_timestamp(df_checkin.checkin_date, 'MM/dd/YYYY HH:MI:SS'))),
            (df_checkin.rank - 1)
        ).when(
            ((df_checkout.rank = df_checkin.rank)
             and (F.unix_timestamp(df_checkout.checkout_date, 'MM/dd/YYYY HH:MI:SS')
                  >= F.unix_timestamp(df_checkin.checkin_date, 'MM/dd/YYYY HH:MI:SS'))),
            df_checkin.rank
        ).otherwise(None)
)
What is the error I'm having?
You have a = instead of ==, both in the where call and inside the when conditions:
(df_checkout.rank = df_checkin.rank)
should be
(df_checkout.rank == df_checkin.rank)
and likewise .where(df_checkout.rank = F.when(...)) needs ==. Once the syntax error is gone, two more problems will surface: the two join conditions are missing a & between them, and "and" does not work on Spark columns, so use & there as well (with each comparison parenthesized).
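Putting it together, something like this should parse. One hedge on the date format: unix_timestamp takes a Java SimpleDateFormat-style pattern, where minutes are mm and seconds are ss, so 'MM/dd/yyyy HH:mm:ss' is probably what was intended rather than 'MM/dd/YYYY HH:MI:SS':

from pyspark.sql import functions as F

fmt = 'MM/dd/yyyy HH:mm:ss'  # mm = minutes, ss = seconds in Spark's patterns

df_result = df_checkout.join(
    df_checkin,
    (df_checkout.product == df_checkin.product)
    & (df_checkout.host == df_checkin.host),
    how='full_outer'
).where(
    df_checkout.rank
    == F.when(
        (df_checkout.rank == df_checkin.rank)
        & (F.unix_timestamp(df_checkout.checkout_date, fmt)
           <= F.unix_timestamp(df_checkin.checkin_date, fmt)),
        df_checkin.rank - 1
    ).when(
        (df_checkout.rank == df_checkin.rank)
        & (F.unix_timestamp(df_checkout.checkout_date, fmt)
           >= F.unix_timestamp(df_checkin.checkin_date, fmt)),
        df_checkin.rank
    ).otherwise(None)
)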
I have a file that contains byte information like the below, with the delimiter '#*':
b'\x00\x00V\x97'#*b'2%'#*b'\x00\x00'#*b'\xc5'#*b'\t'#*b'\xc0'
I want to read this data using PySpark, and each column needs a different conversion method to turn it into ASCII.
I have read this file using StringType(), but when I do the conversion it throws this error:
AttributeError: 'str' object has no attribute 'decode'
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType, StructField, StructType

customSchema = StructType(
[
StructField("Col1", StringType(), True),
StructField("Col2", StringType(), True),
StructField("Col3", StringType(), True),
]
)
df = (
spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.schema(customSchema)
.option("sep", "#*")
.load("/FileStore/tables/EbcdicTextData.txt")
)
df.show()
# col1 col2 col3
# |b'\x00\x01d0'|b'I\x08'|b'\x00\x00'|
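The show() output above is the giveaway: each value is a Python str that merely looks like a bytes literal, and in Python 3 only bytes has a decode method, not str. One possible fix (a sketch; it assumes the fields really are Python bytes literals written out as text, as in the sample) is to parse each value back into real bytes before unpacking:

import ast

def to_bytes(s: str) -> bytes:
    # "b'\\x00\\x00V\\x97'" (a str) -> b'\x00\x00V\x97' (actual bytes)
    return ast.literal_eval(s)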
def unpack_ch_or_zd(bytes: bytearray) -> str:
    # EBCDIC text (code page 037): decode, strip NULs, trim trailing blanks
    ascii_text = bytes.decode("cp037").replace("\x00", "").rstrip()
    return ascii_text if ascii_text.isascii() else "Non-ASCII"

def unpack_pd_or_pd_plus(bytes) -> str:
    # packed decimal: the last hex nibble is the sign ('d' or 'b' means negative),
    # the remaining nibbles are the digits
    ascii_text = (
        "" if bytes.hex()[-1:] != "d" and bytes.hex()[-1:] != "b" else "-"
    ) + bytes.hex()[:-1]
    return ascii_text if ascii_text.isascii() else "Non-ASCII"

def unpack_pd_or_pd_plus_dec(bytes, decimal: int) -> str:
    # packed decimal with an implied decimal point `decimal` digits from the right
    ascii_text = (
        "" if bytes.hex()[-1:] != "d" and bytes.hex()[-1:] != "b" else "-"
    ) + bytes.hex()[:-1]
    ascii_text = (
        ascii_text[:-decimal] + "." + ascii_text[-decimal:]
        if ascii_text.isascii()
        else "Non-ASCII"
    )
    return ascii_text

def unpack_bi_or_biplus_no_dec(bytes) -> str:
    # binary field: read the raw bytes as a big-endian unsigned integer
    a = str(int("0x" + bytes.hex(), 0))
    return a if a.isascii() else "Non-ASCII"
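Applied to actual bytes objects, the helpers behave like this, for example:

unpack_ch_or_zd(b"\xc1\xc2")              # 'AB'   (0xC1, 0xC2 are 'A', 'B' in cp037)
unpack_pd_or_pd_plus(b"\x12\x3d")         # '-123' (sign nibble d -> negative)
unpack_pd_or_pd_plus_dec(b"\x12\x3c", 1)  # '12.3'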
unpack_ch_or_zd_UDF = udf(lambda x: unpack_ch_or_zd(x), StringType())
unpack_pd_or_pd_plus_UDF = udf(lambda x: unpack_pd_or_pd_plus(x), StringType())
# unpack_pd_or_pd_plus_dec needs the decimal count as a second argument
unpack_pd_or_pd_plus_dec_UDF = udf(
    lambda x, d: unpack_pd_or_pd_plus_dec(x, d), StringType()
)
layout_df is a DataFrame that contains each column's name and its conversion type:
for row in layout_df.collect():
    column_name = row["Field_name"]
    conversion_type = row["Python_Data_type"]
    # `decimal` is never set in this snippet; presumably it also comes from the layout row
    if conversion_type.lower() == "ch" or conversion_type.lower() == "zd":
        df = df.withColumn(column_name, unpack_ch_or_zd_UDF(col(column_name)))
    elif (
        conversion_type.lower() == "pd" or conversion_type.lower() == "pd+"
    ) and decimal == 0:
        df = df.withColumn(column_name, unpack_pd_or_pd_plus_UDF(col(column_name)))
    elif (
        conversion_type.lower() == "pd" or conversion_type.lower() == "pd+"
    ) and decimal > 0:
        df = df.withColumn(
            column_name,
            unpack_pd_or_pd_plus_dec_UDF(col(column_name), lit(decimal)),
        )
So I was not getting any syntax errors with the first if statement, but when I add the second elif, I get a syntax error on the next line:
NPI = df["NPI2"]
^
SyntaxError: invalid syntax
Not sure why this is happening, since the first if statement is essentially the same as the elif.
Here's my code:
for i, row in df.iterrows():
NPI2 = row["NPI2"]
if row["CapDesignation"] == "R" and row["BR"] >= row["CapThreshold"]:
NPI2 = row["CapThreshold"]*row['KBETR']-.005
df.at[i,"NPI2"] = NPI2
df.at[i,"BR"] = NPI2/row["KBETR"]
elif row["CapDesignation"] == "N" and row["BN"] >= row["CapThreshold"]:
NPI2 = row["CapThreshold"]*(row["KBETR"]-row["P"])-.005
df.at[i,"NPI2"] = NPI2
df.at[i,"BN"] = NPI2/((row["KBETR"]-row["P"])
NPI = df["NPI2"]
df["KBETR_N"] = round(R + Delta_P + NPI,2)
Hard to say exactly where the error is without the traceback message, but it is almost certainly this line:
df.at[i,"BN"] = NPI2/((row["KBETR"]-row["P"])
You open two parentheses in ((row["KBETR"] but close only one, so Python never sees the end of the expression and reports the syntax error on the following line, NPI = df["NPI2"].
I would like to implement the SQL conditions below in PySpark:
SELECT *
FROM table
WHERE NOT ( ID = 1
AND Event = 1
)
AND NOT ( ID = 2
AND Event = 2
)
AND NOT ( ID = 1
AND Event = 0
)
AND NOT ( ID = 2
AND Event = 0
)
What would be the clean way to do this?
Use the filter (or where) function of the DataFrame API. The equivalent code would be as follows:
df.filter(~((df.ID == 1) & (df.Event == 1)) &
~((df.ID == 2) & (df.Event == 2)) &
~((df.ID == 1) & (df.Event == 0)) &
~((df.ID == 2) & (df.Event == 0)))
If you're lazy, you can just copy and paste the SQL filter expression into the pyspark filter:
df.filter("""
NOT ( ID = 1
AND Event = 1
)
AND NOT ( ID = 2
AND Event = 2
)
AND NOT ( ID = 1
AND Event = 0
)
AND NOT ( ID = 2
AND Event = 0
)
""")
I have a loop which runs through a DataFrame A of 30,000 rows and updates another DataFrame B, then uses DataFrame B within further iterations. It's taking too much time. I want to make it faster! Any ideas?
for x in range(0, dataframeA.shape[0]):
AuthorizationID_Temp = dataframeA["AuthorizationID"].iloc[x]
Auth_BeginDate = dataframeA["BeginDate"].iloc[x]
Auth_EndDate = dataframeA["EndDate"].iloc[x]
BeginDate_Temp = pd.to_datetime(Auth_BeginDate).date()
ScriptsFlag = dataframeA["ScriptsFlag"].iloc[x]
Legacy_PlacementID = dataframeA["Legacy_PlacementID"].iloc[x]
Legacy_AncillaryServicesID = dataframeA["Legacy_AncillaryServicesID"].iloc[x]
ProviderID_Temp = dataframeA["ProviderID"].iloc[x]
SRSProcode_Temp = dataframeA["SRSProcode"].iloc[x]
Rate_Temp = dataframeA["Rate"].iloc[x]
Scripts2["BeginDate1_SC"] = pd.to_datetime(Scripts2["BeginDate_SC"]).dt.date
Scripts2["EndDate1_SC"] = pd.to_datetime(Scripts2["EndDate_SC"]).dt.date
# BeginDate_Temp = BeginDate_Temp.date()
# EndDate_Temp = EndDate_Temp.date()
Scripts_New_Modified1 = Scripts2.loc[
((Scripts2["ScriptsFlag_SC"].isin(["N", "M"])) & (Scripts2["AuthorizationID_SC"] == AuthorizationID_Temp))
& ((Scripts2["ProviderID_SC"] == ProviderID_Temp) & (Scripts2["SRSProcode_SC"] == SRSProcode_Temp)),
:,
]
Scripts_New_Modified = Scripts_New_Modified1.loc[
(Scripts_New_Modified1["BeginDate1_SC"] == BeginDate_Temp)
& ((Scripts_New_Modified1["EndDate1_SC"] == EndDate_Temp) & (Scripts_New_Modified1["Rate_SC"] == Rate_Temp)),
"AuthorizationID_SC",
]
if ScriptsFlag == "M":
if Legacy_PlacementID is not None:
InsertA = insertA(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
dataframeB = dataframeB.append(InsertA)
print("ScriptsTemp6 shape is {}".format(dataframeB.shape))
# else:
# ScriptsTemp6 = ScriptsTemp5.copy()
# print('ScriptsTemp6 shape is {}'.format(ScriptsTemp6.shape))
if Legacy_AncillaryServicesID is not None:
InsertB = insertB(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
dataframeB = dataframeB.append(InsertB)
print("ScriptsTemp7 shape is {}".format(dataframeB.shape))
dataframe_New = dataframeB.loc[
((dataframeB["ScriptsFlag"] == "N") & (dataframeB["AuthorizationID"] == AuthorizationID_Temp))
& ((dataframeB["ProviderID"] == ProviderID_Temp) & (dataframeB["SRSProcode"] == SRSProcode_Temp)),
:,
]
dataframe_New1 = dataframe_New.loc[
(pd.to_datetime(dataframe_New["BeginDate"]).dt.date == BeginDate_Temp)
& ((pd.to_datetime(dataframe_New["EndDate"]).dt.date == EndDate_Temp_DO) & (dataframe_New["Rate"] == Rate_Temp)),
"AuthorizationID",
]
# PLAATN = dataframeA.copy()
Insert1 = insert1(dataframe_New1, BeginDate_Temp, AuthorizationID_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
if Insert1.shape[0] > 0:
dataframeB = dataframeB.append(Insert1.iloc[0])
# else:
# ScriptsTemp8 = ScriptsTemp7
print("ScriptsTemp8 shape is {}".format(dataframeB.shape))
dataframe_modified1 = dataframeB.loc[
((dataframeB["ScriptsFlag"] == "M") & (dataframeB["AuthorizationID"] == AuthorizationID_Temp))
& ((dataframeB["ProviderID"] == ProviderID_Temp) & (dataframeB["SRSProcode"] == SRSProcode_Temp)),
:,
]
dataframe_modified = dataframe_modified1.loc[
(dataframe_modified1["BeginDate"] == BeginDate_Temp)
& ((dataframe_modified1["EndDate"] == EndDate_Temp_DO) & (dataframe_modified1["Rate"] == Rate_Temp)),
"AuthorizationID",
]
Insert2 = insert2(
dataframe_modified,
Scripts_New_Modified,
AuthorizationID_Temp,
BeginDate_Temp,
EndDate_Temp,
Units_Temp,
EndDate_Temp_DO,
)
if Insert2.shape[0] > 0:
dataframeB = dataframeB.append(Insert2.iloc[0])
dataframeA has 30,000 rows.
dataframeB should have new rows inserted on every iteration (30,000 iterations) from dataframeA.
The updated dataframeB should be used in the middle of each iteration for the filtering conditions.
insertA and insertB are two functions which do additional filtering.
It takes too much time to run for 30,000 rows, so please provide suggestions for making the loop faster in terms of execution time.
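One well-known contributor to this kind of slowdown is DataFrame.append inside the loop: every call copies the whole frame, so 30,000 appends cost quadratic time (and append is deprecated in recent pandas). A sketch of the usual fix, with toy stand-in frames since the real insert logic isn't shown, is to collect the new rows in a plain list and concatenate once:

import pandas as pd

# toy stand-ins, just to show the pattern
dataframeA = pd.DataFrame({"AuthorizationID": range(5)})
dataframeB = pd.DataFrame(columns=["AuthorizationID"])

new_rows = []  # collect inserts here instead of growing dataframeB each time
for x in range(dataframeA.shape[0]):
    insert = pd.DataFrame({"AuthorizationID": [dataframeA["AuthorizationID"].iloc[x]]})
    new_rows.append(insert)  # list.append is O(1); DataFrame.append copies the frame

# a single concat at the end is roughly O(n) instead of the O(n^2) of repeated append
dataframeB = pd.concat([dataframeB, *new_rows], ignore_index=True)

The caveat is that the mid-iteration filters would then also have to look at the pending rows in new_rows. Independently, the pd.to_datetime(Scripts2[...]) conversions at the top of the loop do not depend on x and can be hoisted above it.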
Background: I have a DataFrame whose values I need to update using some very specific conditions. The original implementation I inherited used a lot of nested if statements wrapped in a for loop, obfuscating what was going on. With readability primarily in mind, I rewrote it into this:
# Other Widgets
df.loc[(
(df.product == 0) &
(df.prod_type == 'OtherWidget') &
(df.region == 'US')
), 'product'] = 5
# Supplier X - All clients
df.loc[(
(df.product == 0) &
(df.region.isin(['UK','US'])) &
(df.supplier == 'X')
), 'product'] = 6
# Supplier Y - Client A
df.loc[(
(df.product == 0) &
(df.region.isin(['UK','US'])) &
(df.supplier == 'Y') &
(df.client == 'A')
), 'product'] = 1
# Supplier Y - Client B
df.loc[(
(df.product == 0) &
(df.region.isin(['UK','US'])) &
(df.supplier == 'Y') &
(df.client == 'B')
), 'product'] = 3
# Supplier Y - Client C
df.loc[(
(df.product == 0) &
(df.region.isin(['UK','US'])) &
(df.supplier == 'Y') &
(df.client == 'C')
), 'product'] = 4
Problem: This works well and makes the conditions clear (in my opinion), but I'm not entirely happy because it's taking up a lot of space. Is there any way to improve this from a readability/conciseness perspective?
Per EdChum's recommendation, I created a mask for the conditions. The code below goes a bit overboard in terms of masking, but it gives the general sense.
prod_0 = ( df.product == 0 )
ptype_OW = ( df.prod_type == 'OtherWidget' )
rgn_UKUS = ( df.region.isin(['UK', 'US']) )
rgn_US = ( df.region == 'US' )
supp_X = ( df.supplier == 'X' )
supp_Y = ( df.supplier == 'Y' )
clnt_A = ( df.client == 'A' )
clnt_B = ( df.client == 'B' )
clnt_C = ( df.client == 'C' )
df.loc[(prod_0 & ptype_OW & rgn_US), 'product'] = 5
df.loc[(prod_0 & rgn_UKUS & supp_X), 'product'] = 6
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_A), 'product'] = 1
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_B), 'product'] = 3
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_C), 'product'] = 4
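Another compact option (a sketch using numpy.select; it assumes numpy is available and reuses the masks above) keeps the condition/value pairs side by side:

import numpy as np

conditions = [
    prod_0 & ptype_OW & rgn_US,
    prod_0 & rgn_UKUS & supp_X,
    prod_0 & rgn_UKUS & supp_Y & clnt_A,
    prod_0 & rgn_UKUS & supp_Y & clnt_B,
    prod_0 & rgn_UKUS & supp_Y & clnt_C,
]
choices = [5, 6, 1, 3, 4]
# rows matching no condition keep their current value
df['product'] = np.select(conditions, choices, default=df['product'])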