PySpark Count Over Windows Function - python

I have a calculation that is driving me crazy, and I am supposed to use only PySpark.
The table below is a representation of the data:
There are ID, Name, Surname and Validity columns that I can partition by, and I need to add a column with the percentage of emails that are set correctly, per ID.
Like the image below:
How can I solve this problem? This is my attempt so far, which does not work:
window = Window.partitionBy("ID", "email", "name", "surname", "validity").orderBy(col("ID").desc())
df = df.withColumn("row_num", row_number().over(window))
df_new = df.withColumn("total valid emails per ID", df.select("validity").where(df.validity == "valid" & df.row_num == 1)).count()

Something like:
win = Window.partitionBy("ID", "email", "name", "surname")
df = df.withColumn(
    "pct_valid",
    F.sum(F.when(F.col("validity") == "Valid", 1).otherwise(0)).over(win)
    / F.col("total emails"),
)

This would work:
df.withColumn("ValidAsNumber", F.when(F.col("Validity") == "Valid", 1).otherwise(0))\
.withColumn("TotalValid", F.sum("ValidAsNumber").over(Window.partitionBy("ID")))\
.withColumn("PercentValid", F.expr("(TotalValid/TotalEmails)*100")).show()
Input:
Output (I kept the intermediate columns for understanding, you can drop them):
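For completeness, here is a minimal end-to-end sketch of that answer with made-up sample data (the column names and values are assumptions, since the original input and output were posted as images; TotalEmails is derived with a window count here instead of being a pre-existing column):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed sample data standing in for the question's table
df = spark.createDataFrame(
    [(1, "a@x.com", "Valid"), (1, "b@x.com", "Invalid"), (2, "c@x.com", "Valid")],
    ["ID", "email", "Validity"],
)

w = Window.partitionBy("ID")
df = (df.withColumn("ValidAsNumber", F.when(F.col("Validity") == "Valid", 1).otherwise(0))
        .withColumn("TotalValid", F.sum("ValidAsNumber").over(w))
        .withColumn("TotalEmails", F.count("*").over(w))
        .withColumn("PercentValid", F.col("TotalValid") / F.col("TotalEmails") * 100))
df.show()  # ID 1 -> 50.0 on each of its rows, ID 2 -> 100.0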

Related

Pyspark - Converting a stringtype nested json to columns in dataframe

I am working on processing CDC data received via Kafka, loading it into Databricks Delta tables. I am able to get it all working, except for a nested JSON string which does not get loaded when using from_json or spark.read.json.
When I try to fetch the schema of the JSON from level 1, using "spark.read.json(df.rdd.map(lambda row: row.value)).schema", the column INPUT_DATA is treated as a plain string. Below are a sample JSON string, the code that I tried, and the expected results.
I have many topics to process, and each topic will have a different schema, so I would like to process them dynamically. I prefer not to store the schemas, since a schema may change over time, and I would like my code to handle the changes automatically.
Appreciate any help, as I have spent the whole day trying to figure this out. Thanks in advance.
Sample Json with nested tree:
after = {
    "id_transaction": "121",
    "product_id": 25,
    "transaction_dt": 1662076800000000,
    "creation_date": 1662112153959000,
    "product_account": "40012",
    "input_data": "{\"amount\":[{\"type\":\"CASH\",\"amount\":1000.00}],\"currency\":\"USD\",\"coreData\":{\"CustId\":11021,\"Cust_Currency\":\"USD\",\"custCategory\":\"Premium\"},\"context\":{\"authRequired\":false,\"waitForConfirmation\":false,\"productAccount\":\"CA12001\"},\"brandId\":\"TOYO-2201\",\"dealerId\":\"1\",\"operationInfo\":{\"trans_Id\":\"3ED23-89DKS-001AA-2321\",\"transactionDate\":1613420860087},\"ip_address\":null,\"last_executed_step\":\"PURCHASE_ORDER_CREATED\",\"last_result\":\"OK\",\"output_dataholder\":\"{\"DISCOUNT_AMOUNT\":\"0\",\"BONUS_AMOUNT_APPLIED\":\"10000\"}",
    "dealer_id": 1,
    "dealer_currency": "USD",
    "Cust_id": 11021,
    "process_status": "IN_PROGRESS",
    "tot_amount": 10000,
    "validation_result_code": "OK_SAVE_AND_PROCESS",
    "operation": "Create",
    "timestamp_ms": 1675673484042
}
I have created following script to get all the columns of the json structure:
import json
# table_column_schema = {}
json_keys = {}
child_members = []
table_column_schema = {}
column_schema = []
dbname = "mydb"
tbl_name = "tbl_name"
def get_table_keys(dbname):
table_values_extracted = "select value from {mydb}.{tbl_name} limit 1"
cmd_key_pair_data = spark.sql(table_values_extracted)
jsonkeys=cmd_key_pair_data.collect()[0][0]
json_keys = json.loads(jsonkeys)
column_names_as_keys = json_keys["after"].keys()
value_column_data = json_keys["after"].values()
column_schema = list(column_names_as_keys)
for i in value_column_data:
if ("{" in str(i) and "}" in str(i)):
a = json.loads(i)
for i2 in a.values():
if (str(i2).startswith("{") and str(i2).endswith('}')):
column_schema = column_schema + list(i2.keys())
table_column_schema['temp_table1'] = column_schema
return 0
get_table_keys("dbname")
The following code is used to process the json and create a dataframe with all nested jsons as the columns:
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType
import time

dbname = "mydb"
tbl_name = "tbl_name"
start = time.time()
df = spark.sql(f'select value from {dbname}.{tbl_name} limit 2')
tbl_columns = table_column_schema[tbl_name]
data = []
for i in tbl_columns:
    if i == 'input_data':
        # print('FOUND !!!!')
        data.append(StructField(f'{i}', MapType(StringType(), StringType()), True))
    else:
        data.append(StructField(f'{i}', StringType(), True))
schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema
print(type(schema2))
df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))
Note: The VALUE is a column in my delta table (bronze layer)
Current dataframe output:
Expected dataframe output:
You can use rdd to get the schema and from_json to read the value as json.
import pyspark.sql.functions as f

schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn('input_data', f.from_json('input_data', schema))
new_cols = df.columns + df.select('input_data.*').columns
df = df.select('*', 'input_data.*').toDF(*new_cols).drop('input_data')
df.show(truncate=False)
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|Cust_id|creation_date |dealer_currency|dealer_id|id_transaction|operation|process_status|product_account|product_id|timestamp_ms |tot_amount|transaction_dt |validation_result_code|amount |brandId |context |coreData |currency|dealerId|ip_address|last_executed_step |last_result|operationInfo |output_dataholder|
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|11021 |1662112153959000|USD |1 |121 |Create |IN_PROGRESS |40012 |25 |1675673484042|10000 |1662076800000000|OK_SAVE_AND_PROCESS |[{1000.0, CASH}]|TOYO-2201|{false, CA12001, false}|{11021, USD, Premium}|USD |1 |null |PURCHASE_ORDER_CREATED|OK |{3ED23-89DKS-001AA-2321, 1613420860087}|{10000, 0} |
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
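Side note, as an assumption-flagged alternative to the rdd pass above: on Spark 2.4+ you can infer the schema from a single sample string with schema_of_json, provided one row's input_data is representative of the rest. A minimal sketch:
import pyspark.sql.functions as f

sample = df.select('input_data').first()[0]  # one representative JSON string
schema_ddl = df.select(f.schema_of_json(f.lit(sample))).first()[0]
df = df.withColumn('input_data', f.from_json('input_data', schema_ddl))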

Get the value of a specific record in dicts with lists in python

I have a dict like this:
contactos = dict([
"id", id,
"nombres", nombres,
"apellidos", apellidos,
"telefonos", telefonos,
"correos", correos
])
And it works when I put a new register in every key:value. My problem is: how can I get the record for only one contact?
I have a part where I can input a number and search for its position in the dict's lists; then I want to show only that specific record from every key:value.
I made this code, but it doesn't work.
telefo = input(Fore.LIGHTGREEN_EX + "TELEFONO CONTACTO: " + Fore.RESET)
for x in range(len(telefonos)):
    if (telefonos[x] == telefo):
        print(contactos["telefonos"][x])
    else:
        print("No encontrado")
I print only the telefono value, because it's my test code.
This should be your working script:
# I imagine your data to be something like this. If it isn't, sorry:
id = 0
nombres = ['John', 'Anna', 'Robert']
apellidos = ['J.', 'A.', 'Rob.']
telefonos = ['333-444', '222-111', '555-888']
correos = ['john#email.com', 'anna#email.com', 'rob#email.com']
# This is the part where you made it wrong.
# Dictionaries are created with {}
#
# [] creates a list, not a dictionary structure.
#
# Also, key and values must be grouped as:
# "key": value
contactos = dict({
    "id": id,
    "nombres": nombres,
    "apellidos": apellidos,
    "telefonos": telefonos,
    "correos": correos
})
# Now, imagine this is the input from the user:
telefo = "333-444"
for x in range(len(telefonos)):
    if (telefonos[x] == telefo):
        print(contactos["telefonos"][x])
        break
else:
    print("No encontrado")
When testing the script, the output is 333-444.
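As a further simplification (a sketch reusing the sample lists above, not part of the original answer): build one record per contact up front with zip, so a phone lookup returns the whole contact at once instead of indexing each list separately.
contactos_list = [
    {"nombre": n, "apellido": a, "telefono": t, "correo": c}
    for n, a, t, c in zip(nombres, apellidos, telefonos, correos)
]

telefo = "333-444"
registro = next((c for c in contactos_list if c["telefono"] == telefo), None)
print(registro if registro else "No encontrado")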

Python SQLAlchemy query distinct returns list of lists instead of dict

I'm using SQLAlchemy to setup some data models and query it. I have the following table class
class Transactions(Base):
    __tablename__ = 'simulation_data'
    sender_account = db.Column('sender_account', db.BigInteger)
    recipient_account = db.Column('recipient_account', db.String)
    sender_name = db.Column('sender_name', db.String)
    recipient_name = db.Column('recipient_name', db.String)
    date = db.Column('date', db.DateTime)
    text = db.Column('text', db.String)
    amount = db.Column('amount', db.Float)
    currency = db.Column('currency', db.String)
    transaction_type = db.Column('transaction_type', db.String)
    fraud = db.Column('fraud', db.BigInteger)
    swift_bic = db.Column('swift_bic', db.String)
    recipient_country = db.Column('recipient_country', db.String)
    internal_external = db.Column('internal_external', db.String)
    ID = db.Column('ID', db.BigInteger, primary_key=True)
I'm trying to get distinct row values for columns recipient_country and internal_external using the following script
data = db.query(
    Transactions.recipient_country,
    Transactions.internal_external).distinct()
However, this doesn't retrieve all distinct combinations of these two columns (it neglects values for Transactions.internal_external in this case). Example:
{
"China": "External",
"Croatia": "External",
"Denmark": "Internal",
"England": "External",
"Germany": "External",
"Norway": "External",
"Portugal": "External",
"Sweden": "External",
"Turkey": "External"
}
When I try
data = db.query(
    Transactions.recipient_country,
    Transactions.internal_external).distinct().all()
The correct output is returned, however it comes out as a list of lists, and not a dict. Example:
[["China","External"],["Croatia","External"],["Denmark","External"],["Denmark","Internal"],["England","External"],["Germany","External"],["Norway","External"],["Portugal","External"],["Sweden","External"],["Turkey","External"]]
I'm trying to reproduce the following SQL query:
SELECT DISTINCT
    [recipient_country],
    [internal_external]
FROM [somedb].[dbo].[simulation_data];
I want it to return the data as a dict instead. What am I doing wrong?
The key in a dictionary is always unique, so if the country (China) occurs multiple times - once for Internal and once for External - then setting the value the second time will overwrite the first value:
result = {}
result['China'] = 'internal'
result['China'] = 'external'
print(result) # { 'China': 'external' }
You should instead visualise your query as a list of objects (or dictionaries), with each object representing one row. Then you can have something like
[dict(country="China", internal="internal"), dict(country="China", internal="external"), ...]
Here, country and internal are the column names. You can also get these from the Query object, using query.column_descriptions, before you execute .all().
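For example, a small sketch (reusing the query and model above) that turns each result row into a dict keyed by those column names:
query = db.query(
    Transactions.recipient_country,
    Transactions.internal_external).distinct()
keys = [d["name"] for d in query.column_descriptions]
rows = [dict(zip(keys, row)) for row in query.all()]
# rows == [{"recipient_country": "China", "internal_external": "External"}, ...]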
EDIT: You can also store the values in an array (note that func.array_agg requires a backend that supports it, such as PostgreSQL):
from sqlalchemy import func

query = db.query(
    Transactions.recipient_country,
    func.array_agg(Transactions.internal_external.distinct())
).group_by(Transactions.recipient_country)
data = {country: options for country, options in query}
print(data) # { 'China': ['internal', 'external'] }
Or you can use "both" as an identifier to show that internal and external are both possible:
query = db.query(
    Transactions.recipient_country,
    Transactions.internal_external
).distinct()
data = {}
for country, option in query:
    if country in data:
        option = 'both'
    data[country] = option
print(data) # { 'China': 'both' }

pandas add row using lookup value

I have a dataframe with a model id and associated values. The columns are date, client_id, model_id, category1, category2, color, and price. I have a simple flask app where the user can select a model id and add to their "purchase" history. Based on the model id I would like to add a row to the dataframe and bring the associated values of category1, category2, color, and price. What is the best way to do this using Pandas? I know in Excel I'd use a vlookup but I am unsure how to go about it using Python. Assume category1, category2, color, and price are unique to each model id.
client_id = input("ENTER Client ID: ")
model_id = input("ENTER Model ID: ")

def update_history(df, client_id, model_id):
    today = pd.to_datetime('today')
    # putting in 'tmp' placeholders, but I just need to "lookup" these values from the original dataframe somehow
    df.loc[len(df)] = [today, client_id, model_id, 'tmp', 'tmp', 'tmp', 'tmp']
    return df
Code below adds a new row with new values to an existing dataframe. The list of new values could be passed in to the function.
Import libraries
import pandas as pd
import numpy as np
import datetime
Create sample dataframe
model_id = ['M1', 'M2', 'M3']
today = ['2018-01-01', '2018-01-02', '2018-01-01']
client_id = ['C1', 'C2', 'C3']
category1 = ['orange', 'apple', 'beans']
category2 = ['fruit', 'fruit', 'grains']
df = pd.DataFrame({'today': today, 'model_id': model_id, 'client_id': client_id,
                   'category1': category1, 'category2': category2})
df['today'] = pd.to_datetime(df['today'])
df
Function
def update_history(df, client_id, model_id, category1, category2):
    today = pd.to_datetime('today')
    # Create a temp dataframe with the new values.
    # Column names in this dataframe should match the existing dataframe.
    temp = pd.DataFrame({'today': [today], 'model_id': [model_id], 'client_id': [client_id],
                         'category1': [category1], 'category2': [category2]})
    df = pd.concat([df, temp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    return df
Call function to append a row with new values to existing dataframe
update_history(df, client_id='C4', model_id='M4', category1='apple', category2='fruit')
You could try this. In case you are appending more than one row at a time, appending dictionaries to a list and then appending them all at once to the dataframe is faster.
modelid = ['MOD1', 'MOD2', 'MOD3']
today = ['2018-07-15', '2018-07-18', '2018-07-20']
clients = ['CLA', 'CLA', 'CLB']
cat_1 = ['CAT1', 'CAT2', 'CAT3']
cat_2 = ['CAT11', 'CAT12', 'CAT13']
mdf = pd.DataFrame({"model_id": modelid, "today": today, "client_id": clients, "cat_1": cat_1, "cat_2": cat_2})

def update_history(df, client_id, model_id):
    today = pd.to_datetime('today')
    row = df[df.model_id == model_id].iloc[0]  # first row matching the model id
    rows_list = []
    row_dict = {"today": today, "client_id": client_id,
                "model_id": model_id, "cat_1": row["cat_1"],
                "cat_2": row["cat_2"]}
    rows_list.append(row_dict)
    df2 = pd.DataFrame(rows_list)
    df = pd.concat([df, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    return df

mdf = update_history(mdf, "CLC", "MOD1")
This is what I ended up doing. I still think there is a more elegant solution, so please let me know!
#create dataframe
modelid = ['MOD1', 'MOD2', 'MOD3']
today = ['2018-07-15', '2018-07-18', '2018-07-20']
clients = ['CLA', 'CLA', 'CLB']
cat_1 = ['CAT1', 'CAT2', 'CAT3']
cat_2 = ['CAT11', 'CAT12', 'CAT13']
mdf = pd.DataFrame({"model_id": modelid, "today": today, "client_id": clients, "cat_1":cat_1, "cat_2":cat_2})
#reorder columns
mdf = mdf[['cat_1', 'cat_2', 'model_id', 'client_id', 'today']]
#create lookup table
lookup = mdf[['cat_1', 'cat_2', 'model_id']].drop_duplicates()
#get values
client_id = input("ENTER Client ID: ")
model_id = input("ENTER Model ID: ")
#append model id to list
model_id_lst=[]
model_id_lst.append(model_id)
today=pd.to_datetime('today')
#grab associated cat_1, and cat_2 from lookup table
temp=lookup[lookup['model_id'].isin(model_id_lst)]
out=temp.values.tolist()
out[0].extend([client_id, today])
#add this as a row to the df
mdf.loc[len(mdf)]=out[0]
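For reference, a shorter sketch of the same idea (an alternative, not part of the original answers): copy the first row matching the model_id, overwrite the client and date fields, and append it.
def update_history(df, client_id, model_id):
    today = pd.to_datetime('today')
    match = df.loc[df['model_id'] == model_id]
    if match.empty:
        raise ValueError(f"unknown model_id: {model_id}")
    new_row = match.iloc[[0]].copy()  # double brackets keep a one-row DataFrame
    new_row['client_id'] = client_id
    new_row['today'] = today
    return pd.concat([df, new_row], ignore_index=True)

mdf = update_history(mdf, "CLC", "MOD1")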

how to split a dataframe to form a multiple dataframe in a single context in django python

How can I split my big dataframe into smaller dataframes and print each of them separately on the web? Any idea how to edit the code so that the context is built in a loop?
here is my code:
def read_raw_data(request):
    Wb = pd.read_excel(r"LookAhead.xlsm", sheetname="Step")
    Step1 = Wb.replace(np.nan, '', regex=True)
    drop_column = Step1.drop(['facility', 'volume', 'indicator_product'], 1)
    uniquevaluesproduct = np.unique(drop_column[['Product']].values)
    total_count = drop_column['Product'].nunique()
    row_array = []
    for name, group in drop_column.groupby('Product'):
        group = group.values.tolist()
        row_array.append(group)
    i = 1
    temp = row_array[0]
    while i < total_count:
        newb = temp + row_array[i]
        temp = newb
        i = i + 1
    b = ['indicator', 'Product']
    test = pd.DataFrame.from_records(temp, columns=b)
    table = test.style.set_table_attributes('border="" class = "dataframe table table-hover table-bordered"').set_precision(10).render()
    context = {"result": table}
    return render(request, 'result.html', context)
If you want to show a big dataframe across different pages, I recommend using a Paginator. The documentation has a good example of how to implement it:
https://docs.djangoproject.com/en/1.10/topics/pagination/#using-paginator-in-a-view
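A minimal sketch of that idea applied here (the file path, sheet and template names come from the question; the 'page' query parameter and per-page grouping are assumptions), rendering one product table per page:
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render
import numpy as np
import pandas as pd

def read_raw_data(request):
    Wb = pd.read_excel(r"LookAhead.xlsm", sheet_name="Step")
    Step1 = Wb.replace(np.nan, '', regex=True)
    drop_column = Step1.drop(['facility', 'volume', 'indicator_product'], axis=1)
    # One rendered HTML table per product group
    tables = [
        group.to_html(classes="dataframe table table-hover table-bordered")
        for _, group in drop_column.groupby('Product')
    ]
    paginator = Paginator(tables, 1)  # one product table per page
    try:
        page = paginator.page(request.GET.get('page', 1))
    except PageNotAnInteger:
        page = paginator.page(1)
    except EmptyPage:
        page = paginator.page(paginator.num_pages)
    return render(request, 'result.html', {'page': page})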
